Binary Multiplication Tricks

Written 2017-10-15

Tags:Multiplication Optimization Math Logic 

This is a rambly math post. Sean Anderson's Bit Twiddling Hacks is full of interesting uses of multiplication to manipulate bits, but some of the examples use multiplication in ways that aren't immediately clear. At some point in my software development career there was a day when I understood how multiplication worked, and the day before I didn't. This post is an attempt to explain how I use multiplication for bit and byte twiddling, as opposed to scalar arithmetic multiplication. In particular, I'm going to focus on multiplication by a constant, and tricks for determining the right constant to use.

Multiplication Algorithms

There are a number of different multiplication algorithms, each more or less useful in different circumstances. When learning multiplication as a child, I remember the Grid Method, Long Multiplication, and the Peasant Method. These all work great, but a derivative of the Peasant Method, Shift-and-Add, is a good model for digital multiplication.

Shift-and-add multiplication model

Shift-and-add treats multiplication as a loop over all the bits in one multiplicand, multiplying by one bit at a time and accumulating the results at each step. Digitally, a 1xN-bit multiplier can be implemented with an N-bit adder and a mask: multiplying A by a single bit b is equivalent to "if( b ) add A". This function implements an 8x8=>16-bit multiplier using a 1xN-bit multiplier and a 16-bit accumulator:


uint16_t mult8( uint8_t a, uint8_t b )
{
    uint16_t accum = 0;
    for( unsigned i = 0; i < 8; ++i )   //For every bit in b
        if( b & ( 1 << i ) )            //If the corresponding bit is set,
            accum += ( a << i );        //shift-accumulate to multiply
    return accum;
}

While not particularly well suited to hardware implementation (long, wide adders are complicated), when unrolled, N-bit multiplication can be implemented with N shifters (each holding 'A'), a set of N masks, each combining a bit of 'B' with the corresponding shifter output, and the output of each mask fed into an N-input adder. The concept scales to arbitrary-length integers with more shifts and a wider adder. In short, we can treat multiplication as a bit-vectorized shift-and-add accumulator.
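Unrolled in C, the structure is easy to see. A sketch equivalent to mult8 above, where each term is one shifter output masked by a bit of 'b', all feeding one adder:

#include <stdint.h>

uint16_t mult8_unrolled( uint8_t a, uint8_t b )
{
    return ( b & 0x01 ? a << 0 : 0 )    //shifter 0, masked by bit 0 of b
         + ( b & 0x02 ? a << 1 : 0 )
         + ( b & 0x04 ? a << 2 : 0 )
         + ( b & 0x08 ? a << 3 : 0 )
         + ( b & 0x10 ? a << 4 : 0 )
         + ( b & 0x20 ? a << 5 : 0 )
         + ( b & 0x40 ? a << 6 : 0 )
         + ( b & 0x80 ? a << 7 : 0 );   //shifter 7, masked by bit 7 of b
}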

By the Grace of Boole, Let us Compute

Multiplying by 16

To multiply by a power of two, we need only set the corresponding bit in the multiplier. So for B=16, shifter number 4 needs to be activated, and no others.

        00000000 | B & ( 0 << 0 )
       000000000 | B & ( 0 << 1 )
      0000000000 | B & ( 0 << 2 )
     00000000000 | B & ( 0 << 3 )
    AAAAAAAA0000 | B & ( 1 << 4 )
   0000000000000 | B & ( 0 << 5 )
  00000000000000 | B & ( 0 << 6 )
 000000000000000 | B & ( 0 << 7 )
               +
---------------------------------
     AAAAAAAA*16
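In C, a power-of-two multiply is just a shift, and compilers emit exactly that. A trivial sketch (the function name is my own):

#include <stdint.h>

uint16_t mult_by_16( uint8_t a )
{
    return (uint16_t)( a << 4 );    //A*16 == A<<4: activate shifter 4, and no others
}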

Multiplying by 10

To multiply by B=10, we represent 10 as 2+8, and correspondingly activate shifters 1 and 3. This is equivalent to adding 2*A + 8*A:

        00000000 | B & ( 0 << 0 )
       AAAAAAAA0 | B & ( 1 << 1 )
      0000000000 | B & ( 0 << 2 )
     AAAAAAAA000 | B & ( 1 << 3 )
    000000000000 | B & ( 0 << 4 )
   0000000000000 | B & ( 0 << 5 )
  00000000000000 | B & ( 0 << 6 )
 000000000000000 | B & ( 0 << 7 )
               +
---------------------------------
     AAAAAAAA*10

Multiplying by 255

To multiply by B=255, every bit is set in B, so we just add all 8 inputs:

        AAAAAAAA | B & ( 1 << 0 )
       AAAAAAAA0 | B & ( 1 << 1 )
      AAAAAAAA00 | B & ( 1 << 2 )
     AAAAAAAA000 | B & ( 1 << 3 )
    AAAAAAAA0000 | B & ( 1 << 4 )
   AAAAAAAA00000 | B & ( 1 << 5 )
  AAAAAAAA000000 | B & ( 1 << 6 )
 AAAAAAAA0000000 | B & ( 1 << 7 )
               +
---------------------------------
    AAAAAAAA*255

Some simple bit twiddling

Now that we understand our Swiss-army chainsaw of shift-accumulation:

Repeat a 2-bit pattern 4 times

Bit-repeating a pattern is as simple as multiplying by 85 (0b01010101):
        000000Aa | B & ( 1 << 0 )
       000000000 | B & ( 0 << 1 )
      000000Aa00 | B & ( 1 << 2 )
     00000000000 | B & ( 0 << 3 )
    000000Aa0000 | B & ( 1 << 4 )
   0000000000000 | B & ( 0 << 5 )
  000000Aa000000 | B & ( 1 << 6 )
 000000000000000 | B & ( 0 << 7 )
               +
---------------------------------
        AaAaAaAa
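In C, the whole diagram collapses into a single multiply. A quick sketch (names are my own):

#include <stdint.h>

uint8_t repeat_2bit_x4( uint8_t pattern )
{
    return ( pattern & 0x3 ) * 0x55;    //e.g. 0b10 * 0b01010101 == 0b10101010
}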

Repeat an n-bit pattern m (power-of-two) times

As seen above, repeating a pattern a power-of-two number of times can be done by scaling with a constant built from n-bit groups, each (n-1) zeros followed by a one. For an 8-bit result we can synthesize it as 255/(2^n - 1); in general, divide the all-ones value of the target width by 2^n - 1. In C we might use the following to generate a C-long-sized repetition pattern (the constants must be unsigned, or the signed division of ~0 yields 0):
#define REPEATER(n) ( ~0UL / ( ( 1UL << (n) ) - 1 ) )
You'll actually see the equivalent of REPEATER(8), ~0UL/255, in 5 of the entries in Sean Anderson's collection of bit-twiddling hacks.
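A common use for the repeater constant is splatting a byte across a whole word, which those hacks then mask or compare in one go. A sketch using the macro above:

unsigned long splat_byte( unsigned char b )
{
    return REPEATER(8) * b;     //0x42 becomes 0x4242424242424242 when long is 64 bits
}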

Compiler optimizations for constant multipliers

On many architectures, a shift and an add are faster, or encode smaller, than a multiply. With this knowledge, compilers often replace constant multiplies with shifts and adds when the constant can be represented with just a few bits set, or just a few bits cleared.

Multiplying by three:

To multiply by three, we can simply add A to A<<1.

        AAAAAAAA | B & ( 1 << 0 )
       AAAAAAAA0 | B & ( 1 << 1 )
      0000000000 | B & ( 0 << 2 )
     00000000000 | B & ( 0 << 3 )
    000000000000 | B & ( 0 << 4 )
   0000000000000 | B & ( 0 << 5 )
  00000000000000 | B & ( 0 << 6 )
 000000000000000 | B & ( 0 << 7 )
               +
---------------------------------
             A*3 = A*2+A

It is trivial to extend this to multiplication by any constant with a small number of set bits.

Multiplying by seven:

Instead of multiplying by seven, we could add A + A<<1 + A<<2. Or... we could take A<<3 - A. Both are equivalent, but the second usually compiles smaller:

        AAAAAAAA | B & ( 1 << 0 )
       AAAAAAAA0 | B & ( 1 << 1 )
      AAAAAAAA00 | B & ( 1 << 2 )
     00000000000 | B & ( 0 << 3 )
    000000000000 | B & ( 0 << 4 )
   0000000000000 | B & ( 0 << 5 )
  00000000000000 | B & ( 0 << 6 )
 000000000000000 | B & ( 0 << 7 )
               +
---------------------------------
             A*7 = A*8-A

     AAAAAAAA000 | B & ( 1 << 3 )
        AAAAAAAA | B & ( 1 << 0 )
               -
---------------------------------
           A*8-A
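Spelled out in C, the two decompositions look like this. A quick sketch (function names are my own) - an optimizer will reduce either version, and the plain multiply, to the same few instructions:

#include <assert.h>
#include <stdint.h>

static uint16_t mul7_adds( uint8_t a ) { return ( a << 2 ) + ( a << 1 ) + a; }  //A*4 + A*2 + A
static uint16_t mul7_sub( uint8_t a )  { return ( a << 3 ) - a; }               //A*8 - A

int main( void )
{
    for( unsigned a = 0; a <= 255; ++a )    //Exhaustively check every 8-bit input
        assert( mul7_adds( a ) == a * 7 && mul7_sub( a ) == a * 7 );
    return 0;
}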

Everything is awesome and terrible

Written 2017-07-30

Tags:MIPS Instruction ARM DSP 

Originally, this post was going to be titled, "your instruction set is an API", and discuss how instruction sets have changed over the ages and different ways things have been done well and done poorly. But pretty quickly it turned into a ramble describing how everything is different and terrible and beautiful, and nothing is the right tool for everything.

Branch Delay Slots

Branch delay slots are a handful of instructions that sit after a branch in the instruction stream and still execute before the branch actually takes effect. DSPs and older RISC cores tended to do this to keep the pipeline and branch logic both simple and performant. SPARC, PA-RISC, and MIPS have one branch delay slot. SHARC has two. Check out this MIPS32 code to return 0:
	return_zero:
		jr $ra              ;return from function. Well, maybe more like start returning from the function.
		move $v0, $zero     ;Copy the zero register into v0 (the first return value). This still happens even though there's a jump before it.
		                    ;At this point, we've finished returning from the function.
		addi $v0, $v0, 1    ;This does not happen, because it is too far after the jump.

Why it's awesome.

Branch delay slots give assembly and compiler writers the opportunity to fold instructions into otherwise unused spots to keep the pipeline filled. And the core stays simple...right?

Why it sorta sucks.

Encoding details of a CPU pipeline into an instruction set works really well until you need to change your CPU pipeline. Then you have two choices. Choice A: maintaining the existing instruction-set API requires emulating the previous CPU pipeline, likely forever. MIPS32 does this, and poorly at that, as we'll see later. TI C66X DSPs even go so far as to add new opcodes for functional improvements, while manually stalling older ones to keep the cycle times the same.

Another approach is to vary the number of delay slots to match the pipeline, invalidating the API. This is only practical for highly configurable cores with a correspondingly configurable compiler, like ARC, which took the concept to an absurd level - ARCompact cores have one delay slot, but you can turn it on or off at runtime. The ARCtangent-A5 has two delay slots. The ARC600 has three, but only the first is always executed; the other two are killed if the branch isn't taken.

What happens when an exception or interrupt occurs in the branch delay slot? One way around this is to require that all branch delay slots be repeatable - destination registers independent from source registers; no stores, other branches, stack manipulation, or other shenanigans either. Suddenly, the branch delay slot is a whole lot less useful. Alternatively, the processor can include additional hardware to record exactly where within the branch the CPU was when the exception occurred. Did I forget to mention that fetching an instruction can trigger an exception on operating systems with demand paging, or on cores with software TLB fetches?

Load delay slots

On our speed-fueled trip through exposing pipeline innards to developers, non-interlocked load delays describe a non-blocking memory-to-register load. Non-blocking? To dig up a little ancient history, MIPS is an acronym for Microprocessor without Interlocked Pipeline Stages. MIPS R2000 and R3000 load instructions aren't guaranteed to finish before any following instruction that uses the freshly-loaded register.

Why it's awesome.

TI VLIW DSPs use this to great effect, dispatching many instructions in parallel and mixing loads in freely with arithmetic so that the loads complete just ahead of the upcoming arithmetic instructions. The original PlayStation did the same to a lesser extent.

Why it sorta sucks.

The fact that this code fails on a PlayStation MIPS R3000 and works on an R5900 sucks:

		why_does_this_only_work_on_newer_mips_cores: ;static uint32_t fetch( uint32_t * ptr ) { return *ptr; }
			lw $v0, 0($a0)  ;Load Word from *a0 into v0.
			move $v0, $v0   ;R3000 executes this before the load lands, so the stale v0 wins. R5900 stalls until lw completes.
			jr $ra          ;start returning from function.
			nop             ;finish returning from function.
	
And it sucks that, for load delay slots to be as short as possible without needing to interlock the pipeline, memory has to have a known latency. That means this approach is either incompatible with dynamic RAM, caches, and other bus masters, or the processor only ends up hiding some of the interlock stall.

Functional unit delay slots

Functional unit latency is the amount of time part of the processor takes to do a certain operation. On a dual-issue, dual-ALU system, two independent shifts might be dispatched in a single cycle, but each still has a latency of one cycle: no following instruction can use the results until then. But what if a processor allowed developers to manually schedule individual functional units to minimize latency? And if developers were already doing that, why bother implementing stalls in the pipeline? Surely the developer would insert a NOP to wait for the ALU to complete before reading from it. Much like load delay slots, functional unit delay slots allow the developer and processor to interleave operations. Unlike load delay slots, most functional unit timings can be known before run-time, so it's often pretty usable.

Why it's awesome.

TI C66X DSPs can schedule 32 16-bit multiplies in parallel, along with 16 floating-point operations. And this isn't even SIMD; it's just keeping the pipeline near-optimally filled.

Why it sorta sucks.

Reasoning about the state of the CPU can be incredibly difficult, since removing or relocating an instruction changes the meaning of the register map for following instructions. Additionally, programs often end up padded with VLIW NOPs.

Interrupt disablement hazards - MIPS

On MIPS, disabling interrupts requires the programmer to set a flag with MTC (Move To Coprocessor), then wait a few cycles. Sometimes you can fold a few register operations in after the MTC, but usually you end up with an MTC instruction and a pile of superscalar NOPs. How many cycles do we need to wait before we can be sure an interrupt won't pop in? If we ask Linux, we get an answer of somewhere between 0 (R10000) and 3 (classic MIPS). That's right, a two-hundred-line file that doesn't actually implement anything other than NOPs. If we ask FreeBSD, we get an answer of between 4 and 9 (SiByte-1), unless it's a Cavium or Raza - those implement it in hardware. A better way to implement this would be a GCC intrinsic that padded loads and stores out past a parameterized number of cycles, but allowed the compiler to replace the padding NOPs with register instructions from later in the function. Eventually MIPS added the Execution Hazard Barrier (EHB) instruction, which acts like 'the correct number of NOPs'.

At first, it sounds kinda awesome.

Who wants to wait 9 whole cycles on a SiByte core for interrupts to disable???

And it always kinda sucked.

It turns out that pretty much everyone is willing to wait 9 whole cycles to be absolutely certain interrupts are disabled. An unexpected interrupt or context switch is such bad news it just isn't worth the risk. I cannot say for certain that nobody takes advantage of the exposed pipeline hazards - in assembly it's actually quite easy to do so - but the fact that nobody has bothered teaching GCC really shows when you disassemble router firmware: there are a lot of NOPs. It also sucks that the instruction sets are basically incompatible; though these changes are really limited to a handful of kernel operations, it's still a pain for RTOS work.

Interrupt disablement hazards - ARM32

With ARM architecture version 6, ARM introduced two handy instructions: CPSID (disable interrupts) and CPSIE (enable interrupts). Previously, interrupt disabling was fully synchronous, always a read-modify-write sequence on the CPU flags register. Unlike MIPS, ARM also included an implicit hardware barrier in CPSID, but not in CPSIE. It turns out CPSID is the one you really want hardware support for. With CPSIE, interrupts usually come in a few instructions past the CPSIE, which you can think of as having sneaked a few instructions into time that would've otherwise been idle or spent on NOPs like on MIPS. Except for one rare, specific case.

It's kinda awesome.

By the time CPSID completes, you can assume interrupts are disabled. And an interrupt being delayed by a handful of CPU cycles is rarely anything to worry about. Codespace cost is a single instruction for each. Sanity stays at a maximum level. For the vast majority of compiler and assembly writers, this implementation of CPSID/CPSIE makes instruction scheduling a breeze...

...and then it sorta sucks.

Because there's this thing. The following code tries to reduce interrupt latency by manually scheduling interrupts for a convenient time in the instruction stream. Luckily, it's quite rare in hosted operating systems, but more common in RTOSes. It's also not guaranteed to work:

	wait_for_interrupt:         ;Assume interrupts are disabled
	    CPSIE i                 ;Enable interrupts. But don't bother waiting for it...
	    CPSID i                 ;Stall until interrupts are disabled
	    b wait_for_interrupt    ;rinse and repeat

Will another interrupt ever occur? If the processor design favors performance, it may finish the current pipeline of instructions before taking an interrupt. If the design favors latency, it could discard whatever is in the pipeline as soon as it can take an incoming interrupt. On Cortex-M0/0+/1, the processor can execute one more instruction with interrupts still disabled after the CPSIE. On Cortex-M3/4/4F, one or two instructions can occur. But there's a footnote in ARM's documentation that the above code will work on Cortex-M0/0+/1/3/4/4F, because the next CPSID should figure it all out. But this documentation predates the dual-issue Cortex-M7, as well as the new M23 and M33, whose documentation hadn't been published yet. Clear as mud? Adding an instruction synchronization barrier will always work:

	wait_for_interrupt:         ;Assume interrupts are disabled
	    CPSIE i                 ;Enable interrupts. But don't bother waiting for it...
	    ISB                     ;Pause instruction execution
	    CPSID i                 ;Stall until interrupts are disabled
	    b wait_for_interrupt    ;rinse and repeat

...but really, it's a pretty sensible tradeoff.

Older ARM cores, at least ARM7TDMI and Cortex-A9, supported a different mechanism for interrupt disablement, where all writes to the CPSR were synchronizing; this was often unnecessarily slow. Exposing only the enable hazard will almost never cause a problem.

Instruction Set Compatibility

As we've seen, there are lots of crazy ways to lock developers into archaic details of the CPU. Wouldn't it be great if the same executable could be run on future chip versions? ARM handles this with pipeline stalls in the appropriate places. After MIPS R3000, things are handled the same way in userspace, with a few oddities left for the kernel.

TI C66x, with its nearly fully asynchronous architecture, is left in a tough spot - improving instruction timings would break existing binaries and libraries, but handcuffing the CPU to an old pipeline doesn't make sense either. TI's solution is to implement an entire, distinct instruction set each time they improve processor throughput, leaving the old instruction encodings padded with stalls. Conveniently, compilers can now issue those stalled encodings when they need a result a few cycles later than originally expected.

Tensilica and ARC don't bother with instruction set compatibility. The main driver behind selecting one of these cores is that custom instructions can be tailor-made into them. These are available from the supplier, and ARC also allows adding your own. Simply put, when changing from one processor to another in this family, only the base instructions are guaranteed to be available - the DCT/IDCT might not be, the Viterbi decoder may or may not be, or the motion estimator, or the FPU, ...

Let's Talk About IC Interconnect Buses

Written 2017-07-23

Tags:Bus Nebraska AMBA 

Simple buses

In the beginning, CPUs had data buses for connecting to RAM and ROM. Often as simple as address and data pins, some simple external logic to decode an address and pick a chip was all that was needed to integrate a system. This architecture is still used in simple CPUs produced today, like the 8051 and AVR.

Multi-master simple buses

Pretty early on, someone needed to hook another bus master, often a DMA controller, to a simple CPU. And thus bus arbitration was born, attempting to answer the question: what happens when two masters want to talk to the same bus slave? One simple approach is to stall the CPU clock while the DMA accesses the bus. TI MSP430 DMAs work this way, as did Intel 8086/8088 CPUs, where the Intel 8237 DMA controller would set the HOLD pin on the CPU to stall it. Once a CPU gains pipelining and caches, it sometimes makes sense to stall the DMA while the CPU uses the bus. Slightly more complicated is letting the CPU program a priority, so that the CPU can be stalled by DMA0, and DMA1 only runs when DMA0 and the CPU are idle.

Multi-master, multi-slave, crossbar switch

Letting a CPU fetch instructions from ROM while a DMA copies into RAM is a lot more performant if both can be done at the same time. Borrowing a hint from 1915 telephone equipment, the crossbar switch connects a number of bus masters with a number of bus slaves through a number of connections.

Distributed crossbar switch

As chips gained more peripherals, connecting every bus master (CPU, a few DMA controllers, a GPU, maybe a few other things) to every bus slave (SPI, I2C, FlexBus, GPIO, USB, ...) gets a little bit excessive, and ends up being a bit of a mess in the silicon, as it requires routing all the internal buses corresponding to the masters and slaves to a single location. Additionally, a big crossbar gives a large fan-out, where one output signal has to drive a large number of, albeit unused, input signals.

This led to systems with a fast crossbar for the CPU, DMA, RAM, and ROM, and another smaller, slower switch for things like I2C, SPI, GPIO, that would then all be routed to one or two ports on the big crossbar. This architecture is common on ARM platforms, with a fast AHB and a slower APB for peripherals.

Things get a little more interesting with multicore microcontrollers. It's certainly possible to connect two microcontroller cores to the same main crossbar, but another approach is to give each core its own crossbar, route a slave port of each crossbar to a master port of the other, give each core some of its own RAM, and tie the slow peripheral bus into both crossbars. Both approaches work, have their own upsides and downsides, and are an evolutionary step toward...

Network-on-a-chip

As we break the main crossbar up into smaller and more numerous interconnects, the result is a multi-point network connecting each component to switches, both large and small, with connections scaled to the amount of expected data. NXP Vybrid architecture chips merge M4 and A5 cores along with DMA. All of the cores and peripherals are routed together using an ARM NIC-301 network interconnect, which supports connecting 18 bus masters to 10 bus slaves using five separate switches to emulate a single, giant, logical crossbar switch. This means that both the A5 and M4 cores can access the entire logical address space, but a few really neat options are implemented. Access to the Secure-RAM is limited so that only certain masters can reach it. These limitations were often available on earlier crossbar bus systems, but with an on-chip network the implementation is simpler - there is simply no route from disallowed masters to the Secure-RAM slave.

BikeFon

Written 2017-05-17

Tags:Bicycle USB Power Charger Phone DCDC 

My bike has a headlight with a huge battery pack and a DC barrel plug. With the addition of a simple DCDC converter, some waterproofing grease, and some heatshrink, I have built a cable that charges my phone from my existing bike battery.

IMG_20170517_235739

TempTale Direct Teardown

Written 2017-04-05

Tags:STM32 Teardown flash ARM 

A Handy Shipping Watchdog

Ever wonder if your Kobe beef shipped quickly enough? If your suspicious neighbor popped the lid on your coolerful of science experiment? Some other, simple, zany reason? TempTale makes a line of very handy environmental monitors. Drop one in when a sensitive shipment is packed, and the recipient can quickly verify that the package temperature was maintained through the shipment by opening an automatically generated PDF file on the device. The device appears as a USB mass-storage device with a single PDF file onboard.

But, how does it work?

Album link:
TempTaleDirect

Internally, it consists of an STM32 ARM microcontroller, external serial flash, a coin cell battery with solder terminals, some buttons, a high-speed and low-speed oscillator, and a bare LCD panel.

The bare LCD is neat - the static discharge from my fingers caused a shimmer of pixels when I first touched it. Also, pay attention to the front edge below: the glass has metal contacts that touch thin metal sheets embedded in the thick white-faced foam strip at the front. Contact pressure is maintained by the case and PCB.

TempTale direct teardown

Software-wise, only a few components are needed for something like this: a USB mass-storage driver, a filesystem driver, a PDF writer, and some custom logic to wake the CPU from a low-power state (likely via the RTC alarm), read the temperature, and store it to a circular buffer. Just before USB enumeration, the circular buffer is written into a PDF file. Possible example components are linked above.
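For flavor, here's a rough sketch of the wake-and-sample loop. Everything here is hypothetical - the two extern functions are stand-ins, not the real STM32 HAL - it's just the shape of the idea:

#include <stdint.h>

#define LOG_DEPTH 4096                  //Samples held in the circular buffer

static int16_t  log_buf[LOG_DEPTH];     //Temperature in 0.01 degC steps
static uint32_t log_head;               //Next slot to write

extern void    sleep_until_rtc_alarm( void );   //Hypothetical: enter low-power sleep until the RTC alarm fires
extern int16_t read_temperature( void );        //Hypothetical: sample the onboard temperature sensor

void logger_loop( void )
{
    for( ;; )
    {
        sleep_until_rtc_alarm();
        log_buf[ log_head ] = read_temperature();
        log_head = ( log_head + 1 ) % LOG_DEPTH;    //Wrap around: the oldest sample is overwritten
    }
}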

These are really handy little devices; I hope they become more widely deployed for shipment validation. Sensitech even makes models with GPS and accelerometers as well.
