Written 2022-11-23 hosting has been hosted by Nalan Nijesh's HappyBee Host since December 2017. Outside of one hiccup when my VM's host ran out of disk space, it's been quite reliable, and low priced.

Part of what made this cost effective is a container technology called OpenVZ containers. HappyBee Host is now transitioning their users off OpenVZ to KVM, which they've also supported for some time.

Expect this site to go down a bit in the coming weeks.

Modifying a 5 foot Lowe's poseable skeleton for cycling

Written 2022-10-23


So, you have a spare tandem bicycle but no one to ride halloween with you? Simply add a skeleton!


Skelly is a five-foot poseable skeleton from Lowe's. However, their joints are designed to be poseable, and make a loud clicking noise when adjusted - not ideal for continuous operation, and would probably wear out quickly. Here's the joint mechanism:


There are a few different variations on this theme throughout their body, but they all share a spring-loaded toothed gear(or gears) sprung into a fixed gear that takes a fair amount of torque to snap into the next position. This creates a number of positions where the joint is stable. Ends of the joint are held into the next bone either by 4 plastic posts(top of this joint), or by a plastic stick-out(bottom of this joint). Here's an assembled view screwdriver replacing bolt):


Here's the hip(a one-sided leg to hip joint):


Here's the ankle(lower leg reaches over both sides of ankle joint):


Disassembly notes: Each joint is held with a phillips machine screw and nut. Since the joints are spring loaded, don't do it over a skeleton-colored carpet. Try to hold or at least cover the joint assembly as you remove it. For example, this spring-loaded ankle:


Modification notes: We only need to flatten the gear-teeth to achieve smooth motion. I used a box cutter. Some of the pieces are easy enough to cut on directly. Others you may find easier to bolt half the joint on to one end to hold it still while cutting. The plastic is fairly strong, and it took me several passes to achieve my desired results. The gears don't need fully flattened, but they should be mostly parallel to each other, and at least 2/3 of thickness of the teeth should be removed.

IMG20220928111114 IMG20220927202709

Optimized division of uint64_t by a constant on 32-bit microcontrollers

Written 2022-07-21

Tags:Multiplication Division Optimization ARM 

Problem Statement

Recently, I ran into a bottleneck and my profiler highlighted that a large fraction of program time was being spent in __aeabi_uldivmod(), an ARM math support function for division with the following prototype:

struct result {
	u64 quot, rem;

struct result __aeabi_uldivmod(u64 numerator, u64 denominator)
	struct result res;
    //compute res here
	return res;

There's several differeny ways to implement __aeabi_uldivmod(), including 1-bit-at-a-time long division, the same with speedup from a count-leading-zeros instrucion like ARM's CLZ, or 32-bit-at-a-time long-division using the udiv instruction present on some ARM cores.

A partial solution

But most of my program's calls to __aeabi_uldivmod() were with a fixed denominator - this common special-case is often optimized by compilers by replacing division with multiplication by the inverse - since we're dealing with integers, they actually use multiplication by a scaled inverse, something like replacing x/N with x * (2^32/N), then shifting the result right to drop off the remainder. Here's what we need to optimize(and some related functions, like ns_to_us, ns_to_ms, which can be done similarly):

uint64_t ns_to_s( uint64_t ns64 )
return ns64 / 1000000000ULL;

And it turns out GCC does know this trick, at least aarch64 GCC10.2.1 does. When we compile this we get:

    mov x1, 23123              #Load part of a contant into x1
    lsr x0, x0, 9              #x0 is ns64, shift it right 9 bits(divide by 512)
    movk    x1, 0xa09b, lsl 16 #load more bits into x1
    movk    x1, 0xb82f, lsl 32 #...
    movk    x1, 0x44, lsl 48   #...
    umulh   x0, x0, x1         #multiply x0 by x1, keeping only upper 64 bits(discarding 64lsbs)
    lsr x0, x0, 11             #unsigned shift right 11 more bits(

But when we compile with 32-bit arm GCC10.2.1, we get a call to uldivmod.

    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    push    {r3, lr}
    adr r3, .L14
    ldrd    r2, [r3]
    bl  __aeabi_uldivmod(PLT)
    pop {r3, pc}
    .word   1000000000
    .word   0

At first I thought this made sense, since 32-bit arm doesn't have a umulh instruction that takes two 64bit registers and multiplies them together to compute a 128bit result then truncates it. But 32bit by 32bit multiplication with 64bit result is relatively quick on 32-bit arm cores, and add and subtract instructions can be chained together to do a 128-bit add in only 4 instructions. If only I had a umulh function, the aarch64 division trick above could be made to work...

The hack

...why not implement one?

At first I started with a slightly naive approach to 128-bit addition by adding after each multiplication using a uint64_t. However, these additions can be overlapped - several uint32_ts can be added together with a uint64_t result without overflowing. By pipelining the 128-bit accumulator to operate LSBs to MSBs, 32 bits at a time, several instructions can be optimized out, and since 64 LSBs are dropped in the result, only their carry-bits are needed. Here's what I came up with:

uint64_t umulh( uint64_t a, uint64_t b )
const uint32_t a_lo = a;
const uint32_t a_hi = a >> 32;
const uint32_t b_lo = b;
const uint32_t b_hi = b >> 32;

/* FOIL method of multiplication
but instead of binomials with constants a,b,c,d variables x,y: (ax+b) * (cy + d),
consider it with variables a,b,c,d, constants x,y = 1<<32 */
const uint64_t acc0 = (uint64_t)a_lo * b_lo;
const uint64_t acc1 = (uint64_t)a_hi * b_lo;
const uint64_t acc2 = (uint64_t)a_lo * b_hi;
const uint64_t acc3 = (uint64_t)a_hi * b_hi;

/* Break up into 32-bit values */
const uint32_t lo0 = acc0;
const uint32_t hi0 = acc0 >> 32;
const uint32_t lo1 = acc1;
const uint32_t hi1 = acc1 >> 32;
const uint32_t lo2 = acc2;
const uint32_t hi2 = acc2 >> 32;
const uint32_t lo3 = acc3;
const uint32_t hi3 = acc3 >> 32;

/* The first 32 bits aren't used in the result,
no need to store them, so no need to compute them.
In fact, don't even worry about res0, start computing res1*/
uint64_t acc = 0;
const uint32_t res0 = lo0;

acc += hi0;
acc += lo1;
acc += lo2;
const uint32_t res1 = acc;

acc >>= 32;
acc += hi1;
acc += hi2;
acc += lo3;

const uint32_t res2 = (uint32_t)acc;

acc >>= 32;
acc += hi3;
const uint32_t res3 = (uint32_t)acc;

return (((uint64_t)res3) << 32 ) + res2;

uint64_t ns_to_s_inv( uint64_t ns64 )
//constants and shifts from aarch-64 GCC
return umulh( 0x44b82fa09b5A53ULL, ns64 >> 9 ) >> 11;


I timed it on an nRF52, and it beats all __aeabi_uldivmod implementations I've seen so far, though udiv based approachs are quite close.

ns_to_s()/__aeabi_uldivmod - traditional50.4-64us30.8-40.8us
ns_to_s()/__aeabi_uldivmod - udiv based2.64-3.36us2.16-4.64us

This comes roughly to a 25x to 50x runtime improvement from where I started. The reason for some of this variability is that __aeabi_uldivmod() often takes a different number of cycles based on the values of the inputs. When I saw this in the timing with my oscilloscope, I set the oscilloscope to trigger on the start of each computation, then average 256x resulting triggers together.

Oscilloscope captures

__aeabi_uldivmod common approach:


Now here's GCC11 for Cortex-M3/M4's udiv-based approach:


And here's my umulh-emulation-based approach - averaging is still on, but since the umulh-based approach contains no loops or other data-dependent control-flow the calculation time is far more repeatable when given different inputs. Also note this is 20x zoomed in time compared to the first plot:



Generally, the umulh-based approach above may apply to any machine with a 32 bit x 32 bit multiply instruction with 64-bit result. This instruction is called umull for ARM cores. When a 32-bit udiv instruction is available, __aeabi_uldivmod is competitive on: Cortex-M3, M4, M33, M35P, M55, and newer Cortex-A cores including Cortex-A7, Cortex-A15, Cortex-A53, Cortex-A57. Cores with umull but without udiv is where the umulh approach really shines: ARM7TDMI, ARM9, ARM10, ARM11, Cortex-A8, and Cortex-A9. Possibly also XScale and StrongARM.

Latency Measurement, Part 2: Resampler Tuning

Written 2022-03-30

Tags:Latency CorrelationCodes Video BarkerCodes Correlation 

One of the challenges with time-aligning correlation codes blinked out by a computer monitor and into a camera is that the sample-rates(FPS) are usually not the same, and even if close, they're slightly different.

On my first pass, my stopwatch started when the frame went out, and stopped when the next frame in decreased in correlation strength(also a frame late). This limits the timing resolution to 1 input frame time. Adding quadratic interpolation to compute the peak correlation strength in between frames helped enormously. Adding an anti-aliasing filter when resampling the correlation code used at the camera helps too.

In the following plot, Lag curve is the old approach, and Lag2 curve include parabolic peak interpolation and resampler filter. Lag1 is roughly 33ms(one input frame time) later/higher than Lag2 because it always picks the first decreasing correlation frame, but Lag2 picks a closer point in time.


While there is still a significant offset error in the below graph, the variation is down to around 4ms now, which we can see in the following histogram(horizontal scale width same as previous post below).


Automatic Video Latency Measurement, Part 1

Written 2022-03-22

Tags:Latency CorrelationCodes Video BarkerCodes Correlation 


I want an automatic method for measuring video latency of a wireless video transmission system. As always, I'm willing to spend far more time automating than it would save one person.

A common current approach is point the wireless camera at a moving target like an analog ticking clock or video of counters or another moving pattern, then use an additional camera to record both the wireless video link display along with the reference target. One of my favorite targets was a chain of indicators on lego gears driven at high speed, each indicator gear reducing the ratio for the next so it had an enormous repeatition interval.

Overall system idea

If we can run some code on the video transmission system, render some images on its display, and analyze frames coming in, we can make the thing measure itself. Once that's measured, we can insert another video link if desired to characterize it.

Video Target Design

To support cameras with auto-exposure and auto-white balance, we need a target with a somewhat stable average brightness and a spread of colors for white-balance.

After a few tries here's what I use:


The inner red/green/blue/grey squares are fixed while the outer corners blink together and edges blink together opposite the corners. In this way, there's always plenty of full brightness and darkness in every frame, and a little color.

Short intro to Barker Coding

Barker codes are short binary(in the sense of two discrete values, -1 and 1, not 0 and 1) patterns of numbers that have a few useful properties:

This means that if we transmit a Barker code by modulating the display by blinking similarly to programming a Timex Datalink, we can measure when an event occurs by sending out a Barker code then listening for the same Barker code. This general approach of marking a transmission with a correlation code of favorable properties is often used in communications to mark the start of a packet or other timing sensitive information.

Go here to read more:


Here's what it looks like - left side is the output target, right side is the camera preview with thin blue alignment rectangles to help you see where to aim it for the corner boxes. Program supports automatic measurements and one-shot(default).


Initial Results

Here's a recording of latency over 71 latency measurements from a 60Hz Thinkpad LCD to a 30Hz PS3 Eye camera. Variation is within 20ms and an input frame is at most 33ms here. Computing at sub-frame offsets in time is the next step for improving this.


Here it is again in a histogram:


Implementation Challenges and Tradeoffs

Operating System Support

Windows has bad OS timing resolution. VSYNC on Linux is hard. I ended up using Linux.

Video Output

Video output seemed straightforward except that at first I used OpenCV which doesn't directly support VSYNC which is needed to measure output FPS, so I ported the video output to OpenGL, then reconfigured my video driver to enable VSYNC, then reconfigured my xserver to enable VSYNC. Year of the Linux desktop and all.

Video Input

Using OpenCV's video stream blocks until a frame is ready, which can often take longer than an output frame depending on input and output frame rates, causes the output stream to fail to draw each frame - this is important for correctly transmitting a code.

The common solution is to use one thread for the camera and one for the display works well, though the startup code is complicated as we use camera resolution to decide display window resolution, and some of the initialization code on each thread seems to cause the other thread to stutter a few frames until we get going - maybe it's Python's GIL?

scan-in/scan-out synchro

When I first got this working with a USB webcam, each run would have different average latency - not wildly different, always within 1 input frame time. I suspect this is due to variation in when the camera starts its scanout vs when the display starts its scan-in. Also, the camera and LCD VSYNCs are not synchronized, so they do tend to drift over time.

Possible Future Improvements