Neat unit conversion trick

Written 2022-12-11

Tags:Ham Radio 

Amateur radio requires a multiple-choice test for each new grant of operator permissions, each grant allowing a larger amount of spectrum and more capabilities. You are allowed a calculator, but it's possible to get by with pen and paper or memorization. One question that comes up fairly often on the tests is conversion between band names and frequency band edges. If you've memorized the band-edge frequencies, you're all done. But if you haven't, you can rule out most of the possibilities in the answer set with a simple relationship.

Though I use a unit-aware calculator most days, that's not available on the radio tests. One simple math trick I enjoy is expressing relationships in particularly useful units. The basic physics relationship here is wavelength * frequency = speed of light. On the test, wavelength is expressed in meters and frequency in MHz. If we write the speed of light not as 3e8 meters per second but as 300e6 meters per second - 300 megameters per second - we can rearrange the equation without units as wavelength = 300 / frequency, or frequency = 300 / wavelength, which is a little easier to work out in your head.
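
For example, 300 / 146 ≈ 2, so a 146 MHz repeater lives in the 2 meter band, and the 40 meter band has to sit somewhere near 300 / 40 = 7.5 MHz - close enough to rule out most of the wrong answers even without knowing the exact band edges.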

New Neighbors 2

Written 2022-12-11

Tags:rsaxvc.net hosting 

rsaxvc.net is back up at HappyBee Host; the transition was quite uneventful. Here's the new neighborhood:

rsaxvc.net maintenance

Written 2022-11-23

Tags:rsaxvc.net hosting 

rsaxvc.net has been hosted by Nalan Nijesh's HappyBee Host since December 2017. Outside of one hiccup when my VM's host ran out of disk space, it's been quite reliable, and low priced.

Part of what made this cost-effective is a container technology called OpenVZ. HappyBee Host is now transitioning their users from OpenVZ to KVM, which they've also supported for some time.

Expect this site to go down a bit in the coming weeks.

Modifying a 5 foot Lowe's poseable skeleton for cycling

Written 2022-10-23

Tags:Halloween 

So, you have a spare tandem bicycle but no one to ride with you on Halloween? Simply add a skeleton!

IMG20220928192748

Skelly is a five-foot poseable skeleton from Lowe's. However, its joints are designed for posing and make a loud clicking noise when adjusted - not ideal for continuous operation, and they would probably wear out quickly. Here's the joint mechanism:

IMG20220928111012

There are a few different variations on this theme throughout the body, but they all share a spring-loaded toothed gear (or gears) pressed into a fixed gear, which takes a fair amount of torque to snap into the next position. This creates a number of positions where the joint is stable. The ends of the joint are held to the next bone either by four plastic posts (top of this joint) or by a plastic stick-out (bottom of this joint). Here's an assembled view (screwdriver replacing the bolt):

IMG20220927202824

Here's the hip (a one-sided leg-to-hip joint):

IMG20220928111039

Here's the ankle (the lower leg reaches over both sides of the ankle joint):

IMG20220927204802

Disassembly notes: each joint is held together with a Phillips machine screw and nut. Since the joints are spring-loaded, don't disassemble them over a skeleton-colored carpet. Try to hold or at least cover the joint assembly as you remove it. For example, this spring-loaded ankle:

IMG20220927201938

Modification notes: we only need to flatten the gear teeth to achieve smooth motion. I used a box cutter. Some of the pieces are easy enough to cut directly. For others, you may find it easier to bolt half the joint onto one end to hold it still while cutting. The plastic is fairly strong, and it took me several passes to achieve the desired result. The gears don't need to be fully flattened, but they should end up mostly parallel to each other, and at least 2/3 of the thickness of the teeth should be removed.

IMG20220928111114 IMG20220927202709

Optimized division of uint64_t by a constant on 32-bit microcontrollers

Written 2022-07-21

Tags:Multiplication Division Optimization ARM 

Problem Statement

Recently, I ran into a bottleneck and my profiler highlighted that a large fraction of program time was being spent in __aeabi_uldivmod(), an ARM math support function for division with the following prototype:

typedef uint64_t u64; /* shorthand used throughout this post */

struct result {
	u64 quot, rem;
};

struct result __aeabi_uldivmod(u64 numerator, u64 denominator)
{
	struct result res;
	//compute res here
	return res;
}

There are several different ways to implement __aeabi_uldivmod(), including 1-bit-at-a-time long division, the same thing sped up with a count-leading-zeros instruction like ARM's CLZ, or 32-bit-at-a-time long division using the udiv instruction present on some ARM cores.
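
As a point of reference, here's a minimal sketch of the 1-bit-at-a-time variant (my own illustration, not the actual library code), using the struct result and u64 shorthand from above:

/* Schoolbook/restoring long division, one quotient bit per iteration.
   This is roughly what a generic __aeabi_uldivmod fallback does.
   Assumes denominator != 0. */
struct result uldivmod_1bit(u64 numerator, u64 denominator)
{
	struct result res = { 0, 0 };
	for (int i = 63; i >= 0; i--) {
		res.rem = (res.rem << 1) | ((numerator >> i) & 1);
		if (res.rem >= denominator) {
			res.rem -= denominator;
			res.quot |= (u64)1 << i;
		}
	}
	return res;
}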

A partial solution

But most of my program's calls to __aeabi_uldivmod() used a fixed denominator. Compilers often optimize this common special case by replacing the division with multiplication by the inverse - since we're dealing with integers, they actually multiply by a scaled inverse, something like replacing x/N with x * (2^32/N) and then shifting the result right to drop the scaling. Here's the function we need to optimize (some related functions, like ns_to_us and ns_to_ms, can be handled similarly):

uint64_t ns_to_s( uint64_t ns64 )
{
	return ns64 / 1000000000ULL;
}
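
To make the scaled-inverse idea concrete, here's a minimal 32-bit sketch (my own example, not GCC's output): dividing by a constant 10 via a multiply and a shift, where 0xCCCCCCCD is the rounded-up value of 2^35 / 10.

/* x / 10 for any uint32_t x, as a multiply by a scaled inverse.
   The 64-bit intermediate product keeps enough precision that the shift is exact. */
static uint32_t div10(uint32_t x)
{
	return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}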

And it turns out GCC does know this trick - at least aarch64 GCC 10.2.1 does. When we compile this for aarch64 we get:

    mov x1, 23123              #Load part of a constant into x1
    lsr x0, x0, 9              #x0 is ns64, shift it right 9 bits (divide by 512)
    movk    x1, 0xa09b, lsl 16 #load more bits into x1
    movk    x1, 0xb82f, lsl 32 #...
    movk    x1, 0x44, lsl 48   #...
    umulh   x0, x0, x1         #multiply x0 by x1, keeping only the upper 64 bits (discarding the 64 LSBs)
    lsr x0, x0, 11             #unsigned shift right 11 more bits
    ret
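
In other words, GCC has folded the divide by 10^9 = 2^9 * 5^9 into three steps: shift right by 9 to strip the power-of-two factor, multiply by 0x44b82fa09b5a53 (roughly 2^75 / 5^9), and shift right by a further 64 + 11 = 75 bits - the first 64 of which come for free because umulh keeps only the upper half of the 128-bit product.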

But when we compile with 32-bit ARM GCC 10.2.1, we get a call to __aeabi_uldivmod:

ns_to_s:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    push    {r3, lr}
    adr r3, .L14
    ldrd    r2, [r3]
    bl  __aeabi_uldivmod(PLT)
    pop {r3, pc}
.L14:
    .word   1000000000
    .word   0

At first I thought this made sense, since 32-bit ARM doesn't have a umulh instruction that multiplies two 64-bit registers to produce a 128-bit result and keeps only the upper half. But 32-bit by 32-bit multiplication with a 64-bit result is relatively quick on 32-bit ARM cores, and add-with-carry instructions can be chained together to do a 128-bit add in only 4 instructions. If only I had a umulh function, the aarch64 division trick above could be made to work...

The hack

...why not implement one?

I started with a slightly naive approach to 128-bit addition, accumulating into a uint64_t after each multiplication. However, these additions can be overlapped - several uint32_t values can be added together into a uint64_t result without overflowing. By walking the 128-bit accumulation from LSBs to MSBs, 32 bits at a time, several instructions can be optimized out, and since the 64 LSBs are dropped from the result, only their carry bits are needed. Here's what I came up with:

uint64_t umulh( uint64_t a, uint64_t b )
{
	const uint32_t a_lo = a;
	const uint32_t a_hi = a >> 32;
	const uint32_t b_lo = b;
	const uint32_t b_hi = b >> 32;

	/* FOIL method of multiplication.
	   See https://en.wikipedia.org/wiki/FOIL_method,
	   but instead of binomials with constants a,b,c,d and variables x,y: (a*x + b) * (c*y + d),
	   consider it with variables a,b,c,d and constants x = y = 1<<32 */
	const uint64_t acc0 = (uint64_t)a_lo * b_lo;
	const uint64_t acc1 = (uint64_t)a_hi * b_lo;
	const uint64_t acc2 = (uint64_t)a_lo * b_hi;
	const uint64_t acc3 = (uint64_t)a_hi * b_hi;

	/* Break up into 32-bit values */
	const uint32_t lo0 = acc0;
	const uint32_t hi0 = acc0 >> 32;
	const uint32_t lo1 = acc1;
	const uint32_t hi1 = acc1 >> 32;
	const uint32_t lo2 = acc2;
	const uint32_t hi2 = acc2 >> 32;
	const uint32_t lo3 = acc3;
	const uint32_t hi3 = acc3 >> 32;

	/* The lowest 32 bits aren't used in the result,
	   so there's no need to store them - or even compute them.
	   In fact, don't even worry about res0; start computing res1. */
	uint64_t acc = 0;
	const uint32_t res0 = lo0;

	acc += hi0;
	acc += lo1;
	acc += lo2;
	const uint32_t res1 = acc;

	acc >>= 32;
	acc += hi1;
	acc += hi2;
	acc += lo3;

	const uint32_t res2 = (uint32_t)acc;

	acc >>= 32;
	acc += hi3;
	const uint32_t res3 = (uint32_t)acc;

	return (((uint64_t)res3) << 32) + res2;
}

uint64_t ns_to_s_inv( uint64_t ns64 )
{
	//constant and shifts taken from the aarch64 GCC output above
	return umulh( 0x44b82fa09b5a53ULL, ns64 >> 9 ) >> 11;
}
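
As a quick sanity check (my own test harness, not part of the original timing work), the replacement can be compared against the plain division it replaces on a handful of inputs, with ns_to_s() and ns_to_s_inv() from above in the same file:

#include <stdint.h>
#include <assert.h>
#include <stdio.h>

int main(void)
{
	/* Spot-check ns_to_s_inv() against the division it replaces. */
	const uint64_t samples[] = {
		0, 1, 999999999ULL, 1000000000ULL, 1000000001ULL,
		123456789012345ULL, 0xFFFFFFFFFFFFFFFFULL
	};
	for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
		assert(ns_to_s_inv(samples[i]) == ns_to_s(samples[i]));
	puts("ns_to_s_inv() matches ns_to_s() on all samples");
	return 0;
}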

Results

I timed it on an nRF52, and it beats all the __aeabi_uldivmod implementations I've seen so far, though udiv-based approaches come close.

Algorithm \ Instruction cache              Disabled        Enabled
ns_to_s()/__aeabi_uldivmod - traditional   50.4-64 us      30.8-40.8 us
ns_to_s()/__aeabi_uldivmod - udiv based    2.64-3.36 us    2.16-4.64 us
ns_to_s_inv()/umulh                        1.26 us         1.09 us

This works out to roughly a 25x to 50x runtime improvement from where I started. The reason for some of the variability is that __aeabi_uldivmod() takes a different number of cycles depending on the input values. When I saw this in the timing on my oscilloscope, I set the scope to trigger on the start of each computation and averaged 256 resulting triggers together.

Oscilloscope captures

The common __aeabi_uldivmod approach:

aeabi_uldivmod

Now here's GCC 11's udiv-based approach for Cortex-M3/M4:

aeabi_uldivmod_gcc11

And here's my umulh-emulation-based approach. Averaging is still on, but since the umulh-based approach contains no loops or other data-dependent control flow, the calculation time is far more repeatable across different inputs. Also note this is zoomed in 20x in time compared to the first plot:

umulh_div

Applicability

Generally, the umulh-based approach above may apply to any machine with a 32-bit x 32-bit multiply instruction that produces a 64-bit result; on ARM cores this instruction is called umull. When a 32-bit udiv instruction is available, __aeabi_uldivmod is competitive: Cortex-M3, M4, M33, M35P, M55, and newer Cortex-A cores including Cortex-A7, Cortex-A15, Cortex-A53, and Cortex-A57. Cores with umull but without udiv are where the umulh approach really shines: ARM7TDMI, ARM9, ARM10, ARM11, Cortex-A8, and Cortex-A9. Possibly also XScale and StrongARM.
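
If you want to pick a path at build time, one rough sketch (my own, and the macro check is an assumption to verify against your toolchain - GCC and Clang define __ARM_ARCH_EXT_IDIV__ when the target has hardware udiv/sdiv) is to only swap in the umulh version on cores without hardware divide:

/* On cores with hardware divide, let the toolchain's __aeabi_uldivmod handle it;
   elsewhere, use the umulh()-based scaled-inverse path from above.
   (The measurements above suggest umulh() still wins either way.) */
uint64_t ns_to_s_portable(uint64_t ns64)
{
#if defined(__ARM_ARCH_EXT_IDIV__)
	return ns64 / 1000000000ULL;
#else
	return umulh(0x44b82fa09b5a53ULL, ns64 >> 9) >> 11;
#endif
}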
