Caldwell Ballistic Precision Chronograph Theory of Operation and Signal Path

Written 2018-12-04

Tags:Caldwell Chronograph 

Inside my chronograph we find this simple PCB. With only two chips and a handful of passive components, this device can measure speeds from 5 to nearly 10000 feet per second. Here is how it works: the chronograph waits for an object to pass the first light gate, starts a timer, waits for the object to pass the second light gate, records the finish time, divides the distance between the light gates by the elapsed time, and converts the result to human-readable units for display.
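
In C-like pseudocode, the core of that measurement is just distance over time. This is only a minimal sketch: the gate spacing, timer frequency, and function names below are placeholders of my own, not values or code taken from the actual firmware.

#include <stdint.h>

/* Assumed spacing between the two light gates - a placeholder, not a
 * measurement of the real housing. */
#define GATE_SPACING_FT 1.0

double speed_fps(uint32_t start_ticks, uint32_t stop_ticks, double timer_hz)
{
    double transit_s = (double)(stop_ticks - start_ticks) / timer_hz;
    return GATE_SPACING_FT / transit_s;   /* feet per second */
}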

Caldwell Chronograph

PCB

Caldwell Precision Ballistic Chronograph Teardown

Sensors

At each end of the device we see a large, black sensor module.

Caldwell Precision Ballistic Chronograph Teardown

Given the three wires running to each pod, each sensor is likely either a phototransistor, or a photodiode with a bias circuit to help return a signal over the long wires.

Sensor Amplifier

Connected to the sensors is a TI LM324 quad operational amplifier. This common component offers a lot of flexibility in the analog domain, but its purpose here is to amplify the signal coming from the sensors over a range of frequencies.

Analog to Digital Converter - ADC

Connected to the output of the LM324 is a Silicon Labs C8051F300 microcontroller. From the datasheet we can see that the ADC offers 10-bit resolution at up to 500 kilosamples per second.

ADC

This is a pretty simple, basic ADC design. The window-compare feature may be used to let the 8051 core sleep while waiting for a signal pulse to arrive, but given that the chrono does nothing else, it may not be utilized. One nice thing about this ADC is that it features differential inputs. This means that instead of attempting to read the signals from both sensors separately, we can subtract one from the other and feed the difference into a single ADC channel (this could also be done with the op-amp).

Signal Processing - what a puzzler

I frankly have no idea what the signal of an arrow or bullet passing over a phototransistor looks like after amplification by an unknown amplifier, but we can work out some basic requirements from the device specifications.

This is a problem, because the ADC can only run at 0.5 MHz. What does that mean? With a sample rate of 500 kHz, we can work out the error over speed like so:

error_plot

This means that either the device doesn't operate quite as it appears to, or Battenfeld Technologies / Caldwell cheated a little on their specifications. Not that anyone would notice - no bullet travels anywhere near 10000 feet per second, a rare few approach 5000 feet per second, and most are much, much slower than that. A .220 Swift might be 4000 feet per second, and at that speed the chrono would be accurate to around 0.8 percent. Sadly, the 0.25 percent accuracy figure is achieved only at 1000 feet per second and below.
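
As a rough check on those numbers, here's a minimal sketch of the timing-quantization estimate. The 1-foot gate spacing is my assumption (it happens to reproduce the ~0.8 percent figure at 4000 feet per second), not a measured value, and the real firmware may do something smarter than snapping to whole sample periods.

#include <stdio.h>

int main(void)
{
    const double spacing_ft = 1.0;    /* assumed distance between light gates */
    const double fs_hz = 500e3;       /* ADC sample rate */
    const double dt_s = 1.0 / fs_hz;  /* timing quantization: one sample period */

    for (double v = 1000.0; v <= 10000.0; v += 1000.0) {
        double t = spacing_ft / v;               /* true transit time */
        double v_fast = spacing_ft / (t - dt_s); /* speed if timing is off by one sample */
        printf("%5.0f ft/s -> %.2f%% worst-case error\n", v, 100.0 * (v_fast - v) / v);
    }
    return 0;
}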

Taking Apart the Caldwell Ballistic Precision Chronograph

Written 2018-12-04

Tags:Caldwell Chronograph 

Caldwell Chronograph

First we pull four massive screws out

Caldwell Precision Ballistic Chronograph Teardown

Flip over the whole assembly and remove the top cover

Caldwell Precision Ballistic Chronograph Teardown

On the top of the PCB we see a Silicon Labs 8051

Caldwell Precision Ballistic Chronograph Teardown

And an LM324 Quad OpAmp

Caldwell Precision Ballistic Chronograph Teardown

Unaligned Access on Cortex-M: Loads vs Stores, M4 vs M7

Written 2018-11-24

Tags:Cortex-M Memory ARM Cortex 

Unaligned Accesses

Any time a processor attempts to load or store to memory at an address that is not a multiple of the size of the transfer, the access is considered unaligned. For example, loading 4 bytes from address 1 is unaligned. There are two common ways a processor can react to this: fixing up the access so it works, or faulting so a programmer can fix the error. (Weirdly, ARMv5 would treat an unaligned load as a load-and-rotate, and implementing that behaviour was generally considered a bad idea.)
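
As a concrete illustration (a sketch of my own, not from any particular codebase), here is one unaligned and one alignment-agnostic way to read a 32-bit word starting at an odd address:

#include <stdint.h>
#include <string.h>

uint32_t read32_unaligned(const uint8_t *buf)
{
    /* A 4-byte load from buf + 1 has (address % 4) != 0 - an unaligned access.
     * Cortex-M3/4/7 fix it up in hardware (with the penalties discussed below);
     * cores without that support fault. */
    return *(const uint32_t *)(buf + 1);
}

uint32_t read32_safe(const uint8_t *buf)
{
    /* memcpy expresses the same read without relying on unaligned support;
     * the compiler emits a single load wherever the target allows it. */
    uint32_t v;
    memcpy(&v, buf + 1, sizeof v);
    return v;
}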

Unaligned Loads on Cortex-M3/4

On Cortex-M3/4, the core breaks up unaligned loads into the smallest usable aligned loads. For a 4-byte load, the cost by address works out as follows:

Address Modulo 4    How Done        Penalty Cycles
0                   4B              0
1                   1B + 2B + 1B    2
2                   2B + 2B         1
3                   1B + 2B + 1B    2

Generally, this is considered a pretty good deal, as it is faster than constructing the data in registers using byte loads and shifts.

Reference: Cortex-M4 Technical Reference Manual

Unaligned Loads on Cortex-M7

Another approach to unaligned loads is to always load two 32-bit words. This gives more deterministic timing, with one memory penalty cycle for all unaligned loads; however, it is slower when the memory being accessed sits behind a 16-bit or 8-bit bus, as is common for slower external RAMs.

Address Modulo 4    How Done    Penalty Cycles
0                   4B          0
1                   4B + 4B     1
2                   4B + 4B     1
3                   4B + 4B     1

Reference: Cortex-M7 Technical Reference Manual

A word about the store buffer

On Cortex-M3/4, stores are handed off to a unit known as the store buffer. It allows program execution to continue with the store instruction seemingly consuming only a single cycle; in reality, the store buffer defers the write until memory is ready to accept it. However, if too many stores are issued too quickly and memory is unable to keep up, the core may become blocked on the store buffer.

Unaligned Stores on Cortex-M3/4/7

When an unaligned store occurs, the processor pushes it into the store buffer and continues on as it does for all stores. Beneath the store buffer, the following occurs:

Address Modulo 4    How Done        Penalty Cycles
0                   4B              0
1                   1B + 2B + 1B    2
2                   2B + 2B         1
3                   1B + 2B + 1B    2

For unaligned accesses, ARM states "A STR or STRH cannot delay the processor because of the store buffer." That's not entirely true - again, if memory is slower than the processor, or there are simply a lot of stores in the instruction stream, the penalty cycles caused by unaligned accesses can bubble up and eventually stall the processor.

How this affects embedded developers

Occasionally, when benchmarking performance-critical code, I'll get strange, hard-to-explain results. A memset to one location might run slower than to another, even though both are in the same fixed-wait-state memory. Or a graphical operation might run slower depending on the coordinates used, even though the number of pixels is the same from run to run. The cause is often unaligned memory accesses, and the good news is that they can usually be at least partially optimized away, most often by aligning the buffers or the access pattern.

Optimizations for memset

For a write-only, memset-like function, it's simple enough to eliminate all unaligned accesses by hand-aligning the head and tail of the operation, and if the operation is large enough, it's almost always worth it.

Check out newlib's memset() for an example. Note that newlib is portable to platforms where unaligned accesses are undefined (they trap), so it must eliminate them entirely, and in doing so it ends up with a fast, portable memset.
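
For illustration, here's a minimal sketch of that head/body/tail structure. It's my own simplification, not newlib's actual source:

#include <stddef.h>
#include <stdint.h>

void *fill8(void *dst, int c, size_t n)
{
    uint8_t *p = dst;
    uint32_t word = (uint8_t)c * 0x01010101u;   /* replicate byte into a word */

    /* Head: byte stores until the pointer is word-aligned. */
    while (n && ((uintptr_t)p & 3u)) { *p++ = (uint8_t)c; n--; }

    /* Body: aligned word stores, so nothing unaligned ever reaches the bus. */
    for (; n >= 4; n -= 4, p += 4)
        *(uint32_t *)p = word;

    /* Tail: remaining bytes. */
    while (n--) *p++ = (uint8_t)c;
    return dst;
}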

Optimizations for memcpy

For a read-write, memcpy-like function, it's sometimes impossible to align both ends of the copy. In this case, it's best to align either the source or the destination, then use a mix of unaligned loads and aligned stores, or aligned loads and unaligned stores, to carry out the bulk of the work.

ARM authored Newlib's memcpy, and it has an excellent example of aligning at least one pointer, then proceeding even if the other isn't aligned.
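
As a rough sketch of that idea (again a simplification of mine, not newlib's code), aligning the destination and letting the loads stay unaligned looks something like this:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *copy8(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: byte copies until the destination is word-aligned. */
    while (n && ((uintptr_t)d & 3u)) { *d++ = *s++; n--; }

    /* Body: aligned word stores; the loads may still be unaligned, which the
     * Cortex-M3/4/7 core handles with the penalties discussed above. */
    for (; n >= 4; n -= 4, d += 4, s += 4) {
        uint32_t w;
        memcpy(&w, s, 4);       /* word load, possibly unaligned */
        *(uint32_t *)d = w;     /* aligned word store */
    }

    /* Tail: remaining bytes. */
    while (n--) *d++ = *s++;
    return dst;
}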

Optimizations for stream-mixing

For a read-read-write function, things are even more complicated. One example is mixing two 8-bit audio streams four samples at a time using LDR (stream A), LDR (stream B), QADD8, STR (stream out). In this case, you usually want to hand-align so that as many of the three streams as possible end up aligned; depending on the input pointers, that could be anywhere from all three down to none of them.
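
Here's a sketch of that inner loop in C, assuming a CMSIS environment that provides the __QADD8 and __SSAT intrinsics (Cortex-M4/M7 with DSP extensions); the alignment of the three stream pointers decides how many of the word accesses below end up unaligned:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include "device.h"   /* placeholder: any CMSIS device header providing __QADD8 / __SSAT */

void mix_q7(const int8_t *a, const int8_t *b, int8_t *out, size_t n)
{
    size_t i = 0;

    /* Four samples per iteration: two word loads, one QADD8, one word store.
     * Each memcpy compiles to a single LDR/STR on Cortex-M. */
    for (; i + 4 <= n; i += 4) {
        uint32_t wa, wb, wo;
        memcpy(&wa, a + i, 4);
        memcpy(&wb, b + i, 4);
        wo = __QADD8(wa, wb);    /* four lane-wise saturating 8-bit adds */
        memcpy(out + i, &wo, 4);
    }

    /* Scalar tail for the last few samples. */
    for (; i < n; i++)
        out[i] = (int8_t)__SSAT(a[i] + b[i], 8);
}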

Considerations for different memory subsystems

So far, the above hasn't really touched on different memory architectures and how they affect unaligned accesses. For example, many Cortex-M processors are equipped with flash ROM behind a prefetcher, so the extra consecutive accesses caused by an unaligned access cost only the penalty cycles. A slow, cacheless RAM, on the other hand, may take many CPU cycles to complete a single memory cycle, and an unaligned access now requires up to three of them. Stranger yet are stores to SDRAM, where the store buffer can hide entire DRAM refreshes, while refreshes stall loads immediately. Finally, memory rarely exists in a system with only a CPU - there are often other things contending for access to the same RAMs: DMAs, GPUs, display controllers, and more. Keep in mind that on Cortex-M the store buffer is the only way to hide memory latency, and if you're doing a lot of stores it can quickly become a bottleneck, doubly or triply so if they're unaligned.

Marsaglia's XORSHIFT32 in PSIMD

Written 2018-08-03

Tags:PRNG XORSHIFT PSIMD 

Here's a simple C program using PSIMD to implement Marsaglia's XORSHIFT32 PRNG with four parallel generators.

#include "psimd.h"
#include <stdio.h>

PSIMD_INTRINSIC psimd_u32 psimd_lsr_u32(psimd_u32 a, uint32_t b)  {return a >> b;}
PSIMD_INTRINSIC psimd_u32 psimd_lsl_u32(psimd_u32 a, uint32_t b)  {return a << b;}
PSIMD_INTRINSIC psimd_u32 psimd_xor_u32(psimd_u32 a, psimd_u32 b) {return a ^  b;}

psimd_u32 xorshift32_4( psimd_u32 x )
{
    x = psimd_xor_u32( x,  psimd_lsl_u32( x, 13 ) );
    x = psimd_xor_u32( x,  psimd_lsr_u32( x, 17 ) );
    x = psimd_xor_u32( x,  psimd_lsl_u32( x, 5  ) );
    return x;
}

uint32_t xorshift32(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

#define print_psimd_u32( _x ) printf("%08x %08x %08x %08x\n", _x[0], _x[1], _x[2], _x[3] )

int main()
{
    psimd_u32 state_start = { 1, 2, 3, 4 };
    print_psimd_u32( state_start );

    psimd_u32 state_vector = xorshift32_4( state_start );
    print_psimd_u32( state_vector );

    psimd_u32 state_scalar = { xorshift32(state_start[0]), xorshift32(state_start[1]), xorshift32(state_start[2]), xorshift32(state_start[3]) };
    print_psimd_u32( state_scalar );

    return 0;
}

And we get the same results for both PSIMD and scalar execution:

0x00000001 0x00000002 0x00000003 0x00000004    Starting States
0x00042021 0x00084042 0x000c6063 0x00108084    PSIMD States
0x00042021 0x00084042 0x000c6063 0x00108084    Scalar States

With GCC 6.3 at -O3, the assembly for the two algorithms comes out to the same number of instructions, even though one does four times the work.

xorshift32_4:
    .cfi_startproc
    movdqa  %xmm0, %xmm1
    pslld   $13, %xmm1
    pxor    %xmm0, %xmm1
    movdqa  %xmm1, %xmm0
    psrld   $17, %xmm0
    pxor    %xmm1, %xmm0
    movdqa  %xmm0, %xmm1
    pslld   $5, %xmm1
    pxor    %xmm1, %xmm0
    ret
    .cfi_endproc

xorshift32:
    .cfi_startproc
    movl    %edi, %eax
    sall    $13, %edi
    xorl    %edi, %eax
    movl    %eax, %edi
    shrl    $17, %edi
    xorl    %edi, %eax
    movl    %eax, %edi
    sall    $5, %edi
    xorl    %edi, %eax
    ret
    .cfi_endproc

PSIMD

Written 2018-08-03

Tags:SSE NEON ARM PSIMD Intel 

PSIMD is a set of Single Instruction, Multiple Data intrinsics portable across multiple CPU backends for GCC and Clang. These intrinsics map nicely onto 128-bit SIMD units like ARM NEON on Cortex-A, as well as SSE2 on Intel. It may also take advantage of SIMD units on other processors supported by GCC and Clang, but those are the only ones I'm familiar with.

PSIMD is the first attempt I'm aware of to fully abstract the underlying hardware into medium-level compiler intrinsics. For another post, I'm working on an example of Marsaglia's XORSHIFT PRNG implemented with PSIMD, and so far I haven't had to worry at all about whether it runs on ARM or Intel.
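
As a taste of what that portability looks like, here's a tiny sketch of my own. It relies on the psimd types being GCC/Clang vector extensions, so ordinary operators work on them and the compiler picks NEON or SSE2 as appropriate:

#include "psimd.h"
#include <stdio.h>

int main(void)
{
    psimd_u32 a = { 1, 2, 3, 4 };
    psimd_u32 b = { 10, 20, 30, 40 };
    psimd_u32 sum = a + b;   /* four 32-bit adds expressed as one operation */

    printf("%u %u %u %u\n", (unsigned)sum[0], (unsigned)sum[1],
                            (unsigned)sum[2], (unsigned)sum[3]);
    return 0;
}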
