Optimized division of uint64_t by a constant on 32-bit microcontrollers

Written 2022-07-21

Tags:Multiplication Division Optimization ARM 

Problem Statement

Recently, I ran into a bottleneck and my profiler highlighted that a large fraction of program time was being spent in __aeabi_uldivmod(), an ARM math support function for division with the following prototype:

struct result {
	u64 quot, rem;
};

struct result __aeabi_uldivmod(u64 numerator, u64 denominator)
{
	struct result res;
	/* compute res.quot and res.rem here */
	return res;
}
There are several different ways to implement __aeabi_uldivmod(), including 1-bit-at-a-time long division, the same with a speedup from a count-leading-zeros instruction like ARM's CLZ, or 32-bit-at-a-time long division using the udiv instruction present on some ARM cores.

A partial solution

But most of my program's calls to __aeabi_uldivmod() used a fixed denominator. Compilers often optimize this common special case by replacing division with multiplication by the inverse. Since we're dealing with integers, they actually multiply by a scaled inverse, something like replacing x/N with x * (2^32/N), then shifting the result right by 32 bits to drop the fractional part. Here's what we need to optimize (along with some related functions, like ns_to_us and ns_to_ms, which can be done similarly):

uint64_t ns_to_s( uint64_t ns64 )
{
	return ns64 / 1000000000ULL;
}

And it turns out GCC does know this trick, at least aarch64 GCC10.2.1 does. When we compile this we get:

    mov x1, 23123              #Load the low 16 bits of a constant into x1
    lsr x0, x0, 9              #x0 is ns64, shift it right 9 bits (divide by 512)
    movk    x1, 0xa09b, lsl 16 #load the next 16 bits into x1
    movk    x1, 0xb82f, lsl 32 #...
    movk    x1, 0x44, lsl 48   #... x1 is now 0x0044b82fa09b5a53
    umulh   x0, x0, x1         #multiply x0 by x1, keeping only the upper 64 bits (discarding the 64 LSBs)
    lsr x0, x0, 11             #unsigned shift right 11 more bits
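The same scaled-inverse idea is easier to see at a smaller width. Here's a minimal sketch, assuming a 32-bit division by 10 rather than the 64-bit division by 10^9 above. The function name div10_mul is made up for illustration; 0xCCCCCCCD is ceil(2^35 / 10), and the shift by 35 undoes the scaling.

```c
#include <stdint.h>

/* Hypothetical example: x / 10 for any uint32_t x, without a divide.
   0xCCCCCCCD == ceil(2^35 / 10), so (x * 0xCCCCCCCD) >> 35 == x / 10.
   The rounding error in the constant is small enough that the result
   is exact over the full 32-bit input range. */
static uint32_t div10_mul(uint32_t x)
{
	const uint64_t scaled_inverse = 0xCCCCCCCDu;
	return (uint32_t)(((uint64_t)x * scaled_inverse) >> 35);
}
```

This only needs a 32x32->64 multiply, which is a single umull on 32-bit ARM. The 64-bit case needs a 64x64->128 multiply, which is where umulh comes in.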

But when we compile with 32-bit arm GCC10.2.1, we get a call to __aeabi_uldivmod().

    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    push    {r3, lr}
    adr r3, .L14
    ldrd    r2, [r3]
    bl  __aeabi_uldivmod(PLT)
    pop {r3, pc}
.L14:
    .word   1000000000
    .word   0

At first I thought this made sense, since 32-bit arm doesn't have a umulh instruction that takes two 64-bit registers, multiplies them together to compute a 128-bit result, then keeps only the upper half. But 32-bit by 32-bit multiplication with a 64-bit result is relatively quick on 32-bit arm cores, and add-with-carry instructions can be chained together to do a 128-bit add in only 4 instructions. If only I had a umulh function, the aarch64 division trick above could be made to work...

The hack

...why not implement one?

At first I started with a slightly naive approach to 128-bit addition by adding after each multiplication using a uint64_t. However, these additions can be overlapped - several uint32_ts can be added together with a uint64_t result without overflowing. By pipelining the 128-bit accumulator to operate LSBs to MSBs, 32 bits at a time, several instructions can be optimized out, and since 64 LSBs are dropped in the result, only their carry-bits are needed. Here's what I came up with:

uint64_t umulh( uint64_t a, uint64_t b )
{
	const uint32_t a_lo = (uint32_t)a;
	const uint32_t a_hi = (uint32_t)(a >> 32);
	const uint32_t b_lo = (uint32_t)b;
	const uint32_t b_hi = (uint32_t)(b >> 32);

	/* FOIL method of multiplication
	   See https://en.wikipedia.org/wiki/FOIL_method,
	   but instead of binomials with constants a,b,c,d and variables x,y: (ax+b) * (cy+d),
	   consider it with variables a,b,c,d and constants x,y = 1<<32 */
	const uint64_t acc0 = (uint64_t)a_lo * b_lo;
	const uint64_t acc1 = (uint64_t)a_hi * b_lo;
	const uint64_t acc2 = (uint64_t)a_lo * b_hi;
	const uint64_t acc3 = (uint64_t)a_hi * b_hi;

	/* Break up into 32-bit values.
	   The low word of acc0 would be res0, the lowest 32 bits of the
	   128-bit product. It isn't used in the result and can't carry
	   into anything, so it is never computed. */
	const uint32_t hi0 = (uint32_t)(acc0 >> 32);
	const uint32_t lo1 = (uint32_t)acc1;
	const uint32_t hi1 = (uint32_t)(acc1 >> 32);
	const uint32_t lo2 = (uint32_t)acc2;
	const uint32_t hi2 = (uint32_t)(acc2 >> 32);
	const uint32_t lo3 = (uint32_t)acc3;
	const uint32_t hi3 = (uint32_t)(acc3 >> 32);

	/* Bits 32..63 (res1): three 32-bit values fit in a uint64_t
	   without overflow. res1 itself is dropped, only its carry
	   into res2 is kept. */
	uint64_t acc = 0;
	acc += hi0;
	acc += lo1;
	acc += lo2;

	/* Bits 64..95 (res2): carry from res1 plus three more terms */
	acc >>= 32;
	acc += hi1;
	acc += hi2;
	acc += lo3;
	const uint32_t res2 = (uint32_t)acc;

	/* Bits 96..127 (res3): carry from res2 plus the last term */
	acc >>= 32;
	acc += hi3;
	const uint32_t res3 = (uint32_t)acc;

	return (((uint64_t)res3) << 32 ) + res2;
}

uint64_t ns_to_s_inv( uint64_t ns64 )
{
	//constant and shifts taken from the aarch64 GCC output above
	return umulh( 0x44b82fa09b5A53ULL, ns64 >> 9 ) >> 11;
}
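Before trusting a replacement like this, it's worth comparing it against plain division on a host machine. The following is a minimal sketch of such a check, not the code I timed: umulh_compact, ns_to_s_mul and count_mismatches are hypothetical names, and umulh_compact is a shorter restatement of the same four-multiply idea so that the block compiles on its own.

```c
#include <stdint.h>

/* Hypothetical compact helper: upper 64 bits of a 64x64-bit product,
   built from four 32x32->64 multiplies. */
static uint64_t umulh_compact(uint64_t a, uint64_t b)
{
	const uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
	const uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
	const uint64_t p0 = (uint64_t)a_lo * b_lo;
	const uint64_t p1 = (uint64_t)a_hi * b_lo;
	const uint64_t p2 = (uint64_t)a_lo * b_hi;
	const uint64_t p3 = (uint64_t)a_hi * b_hi;
	/* Sum of everything landing in bits 32..63; only its carry matters. */
	const uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
	return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Same constant and shifts as the aarch64 GCC output. */
static uint64_t ns_to_s_mul(uint64_t ns)
{
	return umulh_compact(0x44B82FA09B5A53ULL, ns >> 9) >> 11;
}

/* Count disagreements with plain division over boundary-heavy inputs. */
static unsigned count_mismatches(void)
{
	const uint64_t d = 1000000000ULL;
	unsigned bad = 0;

	/* Values around every power of two. */
	for (int bit = 0; bit < 64; bit++) {
		const uint64_t p = 1ULL << bit;
		const uint64_t probes[3] = { p - 1, p, p + 1 };
		for (int i = 0; i < 3; i++)
			bad += (ns_to_s_mul(probes[i]) != probes[i] / d);
	}

	/* Values just below and exactly at multiples of 10^9,
	   where the quotient steps up by one. */
	for (uint64_t q = 1; q <= 18446744073ULL; q = q * 3 + 1) {
		bad += (ns_to_s_mul(q * d - 1) != q - 1);
		bad += (ns_to_s_mul(q * d) != q);
	}

	bad += (ns_to_s_mul(UINT64_MAX) != UINT64_MAX / d);
	return bad;
}
```

count_mismatches() returning 0 is not a proof, but the probes sit on the quotient boundaries where an off-by-one in the constant or shifts would show up first.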


I timed it on an nRF52, and it beats all __aeabi_uldivmod implementations I've seen so far, though udiv-based approaches are quite close.

ns_to_s()/__aeabi_uldivmod - traditional    50.4-64us      30.8-40.8us
ns_to_s()/__aeabi_uldivmod - udiv based     2.64-3.36us    2.16-4.64us

This comes to roughly a 25x to 50x runtime improvement from where I started. Some of this variability is because __aeabi_uldivmod() often takes a different number of cycles depending on the values of its inputs. When I saw this in the timing with my oscilloscope, I set the oscilloscope to trigger on the start of each computation, then average 256 triggered captures together.

Oscilloscope captures

__aeabi_uldivmod common approach:


Now here's GCC11 for Cortex-M3/M4's udiv-based approach:


And here's my umulh-emulation-based approach - averaging is still on, but since the umulh-based approach contains no loops or other data-dependent control-flow the calculation time is far more repeatable when given different inputs. Also note this is 20x zoomed in time compared to the first plot:



Generally, the umulh-based approach above may apply to any machine with a 32-bit x 32-bit multiply instruction with a 64-bit result. This instruction is called umull on ARM cores. When a 32-bit udiv instruction is available, __aeabi_uldivmod is competitive: this covers Cortex-M3, M4, M33, M35P, M55, and newer Cortex-A cores including Cortex-A7, Cortex-A15, Cortex-A53, and Cortex-A57. Cores with umull but without udiv are where the umulh approach really shines: ARM7TDMI, ARM9, ARM10, ARM11, Cortex-A8, and Cortex-A9. Possibly also XScale and StrongARM.

Latency Measurement, Part 2: Resampler Tuning

Written 2022-03-30

Tags:Latency CorrelationCodes Video BarkerCodes Correlation 

One of the challenges with time-aligning correlation codes blinked out by a computer monitor and into a camera is that the sample rates (FPS) are usually not the same, and even when close, they're slightly different.

On my first pass, my stopwatch started when the frame went out, and stopped at the first incoming frame whose correlation strength decreased (which is also a frame late). This limits the timing resolution to 1 input frame time. Adding quadratic interpolation to compute the peak correlation strength in between frames helped enormously. Adding an anti-aliasing filter when resampling the correlation code used at the camera helps too.
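As a rough illustration of the quadratic interpolation step, here's a minimal sketch that fits a parabola through the strongest correlation sample and its two neighbours and returns the sub-frame position of the vertex. The function names are made up for this example, and this is not the code from my tool.

```c
#include <stddef.h>

/* Hypothetical helper: given the correlation at the frame before the
   peak, at the peak, and after the peak, return the vertex of the
   parabola through them as a fractional offset from the peak frame.
   The result lies in [-0.5, 0.5] when y_peak is a true local maximum. */
static double parabolic_peak_offset(double y_prev, double y_peak, double y_next)
{
	const double denom = y_prev - 2.0 * y_peak + y_next;
	if (denom == 0.0)
		return 0.0;	/* flat: no curvature to interpolate on */
	return 0.5 * (y_prev - y_next) / denom;
}

/* Hypothetical helper: index of the correlation peak in corr[0..n-1],
   refined to a fraction of a frame. Assumes n >= 1. */
static double refined_peak_index(const double *corr, size_t n)
{
	size_t k = 0;
	for (size_t i = 1; i < n; i++)
		if (corr[i] > corr[k])
			k = i;
	if (k == 0 || k + 1 == n)
		return (double)k;	/* no neighbour on one side */
	return (double)k + parabolic_peak_offset(corr[k - 1], corr[k], corr[k + 1]);
}
```

With 33ms input frames, a fractional offset of 0.25 corresponds to roughly 8ms, which is the kind of resolution a whole-frame stopwatch can't reach.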

In the following plot, the Lag1 curve is the old approach, and the Lag2 curve includes parabolic peak interpolation and the resampler filter. Lag1 is roughly 33ms (one input frame time) later/higher than Lag2 because it always picks the first frame with decreasing correlation, while Lag2 picks a closer point in time.


While there is still a significant offset error in the graph below, the variation is now down to around 4ms, as the following histogram shows (its horizontal scale has the same width as the one in the previous post, below).


Automatic Video Latency Measurement, Part 1

Written 2022-03-22

Tags:Latency CorrelationCodes Video BarkerCodes Correlation 


I want an automatic method for measuring video latency of a wireless video transmission system. As always, I'm willing to spend far more time automating than it would save one person.

A common current approach is to point the wireless camera at a moving target like an analog ticking clock, a video of counters, or another moving pattern, then use an additional camera to record both the wireless video link display and the reference target. One of my favorite targets was a chain of indicators on Lego gears driven at high speed, each indicator gear reducing the ratio for the next so it had an enormous repetition interval.

Overall system idea

If we can run some code on the video transmission system, render some images on its display, and analyze frames coming in, we can make the thing measure itself. Once that's measured, we can insert another video link if desired to characterize it.

Video Target Design

To support cameras with auto-exposure and auto-white balance, we need a target with a somewhat stable average brightness and a spread of colors for white-balance.

After a few tries here's what I use:


The inner red/green/blue/grey squares are fixed while the outer corners blink together and edges blink together opposite the corners. In this way, there's always plenty of full brightness and darkness in every frame, and a little color.

Short intro to Barker Coding

Barker codes are short binary(in the sense of two discrete values, -1 and 1, not 0 and 1) patterns of numbers that have a few useful properties:

This means that if we transmit a Barker code by blinking the display, similar to programming a Timex Datalink, we can measure when an event occurs by sending out a Barker code and then listening for the same Barker code. This general approach of marking a transmission with a correlation code of favorable properties is often used in communications to mark the start of a packet or other timing-sensitive information.

Go here to read more: https://en.wikipedia.org/wiki/Barker_code.
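To make the correlation idea concrete, here's a minimal sketch using the length-13 Barker code. It computes the code's autocorrelation (13 when aligned, magnitude at most 1 at every other lag) and slides the code across a received sequence to find where it starts. The function names are made up for this example, and this is not the code from my tool, which works on camera frames rather than an integer array.

```c
#include <stddef.h>

/* The length-13 Barker code, using -1/+1 symbols. */
static const int barker13[13] = { 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1 };

/* Aperiodic autocorrelation of the code at a given lag (0..12). */
static int barker13_autocorr(size_t lag)
{
	int sum = 0;
	for (size_t i = 0; i + lag < 13; i++)
		sum += barker13[i] * barker13[i + lag];
	return sum;
}

/* Hypothetical helper: slide the code across rx[0..n-1] and return the
   offset with the strongest correlation. Assumes n >= 13. */
static size_t find_code_offset(const int *rx, size_t n)
{
	size_t best_offset = 0;
	int best = 0;
	for (size_t off = 0; off + 13 <= n; off++) {
		int sum = 0;
		for (size_t i = 0; i < 13; i++)
			sum += rx[off + i] * barker13[i];
		if (off == 0 || sum > best) {
			best = sum;
			best_offset = off;
		}
	}
	return best_offset;
}
```

The sharp single peak is what makes the code useful as a timestamp: a one-frame misalignment drops the correlation from 13 to at most 1.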


Here's what it looks like: the left side is the output target, and the right side is the camera preview with thin blue alignment rectangles to help you see where to aim it for the corner boxes. The program supports automatic measurements and one-shot (default).


Initial Results

Here's a recording of latency over 71 latency measurements from a 60Hz Thinkpad LCD to a 30Hz PS3 Eye camera. Variation is within 20ms and an input frame is at most 33ms here. Computing at sub-frame offsets in time is the next step for improving this.


Here it is again in a histogram:


Implementation Challenges and Tradeoffs

Operating System Support

Windows has bad OS timing resolution. VSYNC on Linux is hard. I ended up using Linux.

Video Output

Video output seemed straightforward, except that at first I used OpenCV, which doesn't directly support VSYNC, and VSYNC is needed to measure output FPS. So I ported the video output to OpenGL, then reconfigured my video driver to enable VSYNC, then reconfigured my xserver to enable VSYNC. Year of the Linux desktop and all.

Video Input

Reading from OpenCV's video stream blocks until a frame is ready, which can often take longer than an output frame depending on the input and output frame rates. This causes the output stream to fail to draw each frame, and drawing every frame is important for correctly transmitting a code.

The common solution, one thread for the camera and one for the display, works well, though the startup code is complicated as we use the camera resolution to decide the display window resolution, and some of the initialization code on each thread seems to cause the other thread to stutter for a few frames until we get going - maybe it's Python's GIL?

scan-in/scan-out synchro

When I first got this working with a USB webcam, each run would have different average latency - not wildly different, always within 1 input frame time. I suspect this is due to variation in when the camera starts its scanout vs when the display starts its scan-in. Also, the camera and LCD VSYNCs are not synchronized, so they do tend to drift over time.

Possible Future Improvements

TRS to XLR Adapter Noise Floor

Written 2021-12-30

Tags:microphone audio XLR 

Today I take a look at the noise floor of my Tascam DR-44WL with a couple phantom power XLR to plug-in power TRRS microphone adapters.

Test Setup

The Tascam DR-44WL was set away from other electronics and recorded with no adapter, and again with each adapter in XLR port 1 with 48V phantom power. No input termination was used. The recorder was set for +15dB gain, the most sensitive setting.

Analysis and Results

All three recordings were run through a 16k point spectrum analysis in Audacity, then each spectrum exported and combined in OpenOffice Calc. This first plot has a linear frequency horizontal axis.


The first thing I noticed was a slight peak around 34.5kHz, but I would never hear it, and it's present without the adapters so it's either inside the recorder or around my home. Also note that the orange (Rode) and yellow (Movo) curves overlap nearly exactly, and are both higher than the blue curve (no adapter).


If we rescale the horizontal axis with a log scale, we can see the effect of the added noise of the adapters at 20Hz can be 20dB worse than without one, but the difference rolls off as we go up in frequency. Really though, these are likely to be combined with a low-cut filter for voice recording.

Tomi Engdahl has a good write-up on plug-in power and a few different designs for how these adapters work. If you scroll to "Balanced electret microphone circuit", it is quite similar to what Zach Poff found.

Initially, I was concerned these adapters were going to add a lot of wideband noise, as that's what Zener diodes emit. However, there's a little more going on here. The actual impact on the noise floor is lower than I had suspected, but it's still a little disappointing that it reaches up into the voice band.

First, the tantalum capacitor in parallel with the Zener diode does a lot to filter out noise. The choice of tantalum here may be important, as electrolytic capacitors tend to have higher ESR than tantalum, and more ESR would limit ability to filter noise here.

Second, the Zener diode is powered from the XLR audio+ and audio- lines through a pair of resistors. Because XLR is differential, and similar amounts of Zener noise should flow back through both resistors, most of the Zener noise seen through the XLR should be common-mode and thus easy to reject.

TRS to XLR Adapter Frequency Response

Written 2021-12-30

Tags:microphone audio XLR 

I've got a couple of TRS microphone to phantom power XLR adapters, one Rode VXLR+ and one Movo F-XLR Pro. The Rode has a better housing and interoperability with my equipment, so it's already my favorite, but I was curious about their frequency responses and if there were any band limits.

Test Setup

  1. Rigol DG1022 DDS waveform generator emitting 100mVPP sine waves, 50Ohm source
  2. 6uF capacitor, 1kOhm resistor in series
  3. TRRS pigtail, select leads according to TRS/XLR adapter under test
  4. Tascam DR-44WL audio recorder in 96kHz 24bit WAV mode, using first XLR input
  5. Step through each frequency, dwelling several seconds


For each tone, I selected the relevant segment in Audacity's spectrogram view:


Once a tone is selected, I used analysis->plot spectrum to bring up an FFT of only the selected interval. I used a Hann window and 65536 points, though this image shows a 4096 point, as the 65536 point FFT pixels are smaller than display pixels and become hard to see.


Then I selected the peak frequency and amplitude and saved to a spreadsheet.

Because of the DDS generator, I did not try to measure SNR. SNR through these adapters is interesting, as they use a Zener diode and resistors to convert the 48V phantom power to 3-5V microphone power, but in doing so, emit some noise or hiss, and the selection of Zener diode becomes quite important. Kamil's tech tips found the Movo to be around 7dB noisier than the Rode. I might look at noise floor separately later.


The Rode and Movo adapters have eerily similar response curves. It turns out that they are largely the same design with a slightly different wiring, so this actually makes sense. Zach Poff has an explanation and reverse engineered schematic here.


The low end is fine for speech, and the high end actually extends a bit past Nyquist (48kHz here) and aliases back in the spectrogram. I'm not sure if the adapters are limiting it or if it's an anti-aliasing filter inside the Tascam recorder. Make no mistake, this roll-off is necessary and important, but if I were making my own adapter, it might make sense to move the cut-off even lower.
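For reference, here's a minimal sketch of where a tone above Nyquist shows up after sampling, assuming ideal sampling with no anti-aliasing filter in the way. The function name is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: apparent frequency of a tone after sampling.
   Frequencies fold around multiples of the sample rate, so a tone
   just above Nyquist appears just below it.
   Assumes sample_rate_hz is nonzero. */
static uint32_t aliased_hz(uint32_t tone_hz, uint32_t sample_rate_hz)
{
	const uint32_t folded = tone_hz % sample_rate_hz;
	return (folded > sample_rate_hz / 2) ? sample_rate_hz - folded : folded;
}
```

At the recorder's 96kHz sample rate, a 50kHz test tone would land at aliased_hz(50000, 96000), which is 46kHz, so it appears to walk back down the spectrogram as the generator steps up.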

Error Analysis