Unaligned Access on Cortex-M: Loads vs Stores, M4 vs M7

Written 2018-11-24

Unaligned Accesses

An access is unaligned whenever a processor loads from or stores to an address that is not a multiple of the transfer size, that is, when the address modulo the transfer size is not zero. For example, loading 4 bytes from address 1 is unaligned. There are two common ways a processor can react to this: fix up the access so that it works, or fault so that the programmer can fix the error. (Weirdly, ARMv5 would treat an unaligned load as a load plus rotate, a behaviour that was generally considered a bad idea to implement.)
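As a minimal sketch of that definition, the check can be written in C as below. The function name is_unaligned is made up for illustration, and the code assumes the transfer size is a power of two (1, 2, 4 or 8 bytes), so the modulo reduces to a mask.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when an access of `size` bytes at `addr`
 * is unaligned, i.e. when addr modulo size is not zero.
 * Assumes `size` is a power of two, so the modulo becomes a mask. */
static bool is_unaligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1u)) != 0u;
}
```

With this helper, is_unaligned(1, 4) is true (the example above), while is_unaligned(8, 4) is false.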

Unaligned Loads on Cortex-M3/4

On Cortex-M3/4, the core breaks an unaligned load up into smaller loads that are each aligned. For a 4-byte load, the split depends on the address, as shown in the following table:

Address Modulo 4	How Done	Penalty Cycles
0	4B	0
1	1B+2B+1B	2
2	2B+2B	1
3	1B+2B+1B	2

Generally, this is considered a pretty good deal, as it is faster than constructing the data in registers using loads and shifts.

Reference: Cortex-M4 Technical Reference Manual
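For estimating costs while benchmarking, the Cortex-M3/4 table can be encoded as a small lookup. This is only a sketch of the table as given in this post. The struct and function names are invented here, and the penalty figures are the ones listed above, not independently measured values.

```c
#include <stdint.h>

/* Hypothetical description of how one 4-byte access is split. */
struct split {
    uint8_t count;     /* number of bus transfers */
    uint8_t bytes[3];  /* size of each transfer in bytes (0 = unused) */
    uint8_t penalty;   /* extra cycles versus an aligned access */
};

/* Encodes the Cortex-M3/4 table from this post for a 4-byte access. */
static struct split m4_split_word(uintptr_t addr)
{
    switch (addr & 3u) {
    case 0:  return (struct split){1, {4, 0, 0}, 0};  /* 4B       */
    case 2:  return (struct split){2, {2, 2, 0}, 1};  /* 2B+2B    */
    default: return (struct split){3, {1, 2, 1}, 2};  /* 1B+2B+1B */
    }
}
```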

Unaligned Loads on Cortex-M7

Another approach to unaligned loads is to always load two 32-bit words. This gives more deterministic timing, with 1 memory penalty cycle for every unaligned load. However, it is slower when the memory being accessed sits behind a 16-bit or 8-bit bus, as is common for slower external RAMs.

Address Modulo 4	How Done	Penalty Cycles
0	4B	0
1	4B+4B	1
2	4B+4B	1
3	4B+4B	1

Reference: Cortex-M7 Technical Reference Manual
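To make the two-word approach concrete, here is a software illustration of the same idea: read the two aligned words that straddle the requested bytes and combine them with shifts. It is not a description of the Cortex-M7's internal implementation. The function name is hypothetical, and little-endian byte order is assumed.

```c
#include <stdint.h>

/* Hypothetical software model of a 4-byte little-endian load at byte
 * offset `byte_off` within an array of aligned 32-bit words, using at
 * most two aligned word reads. The caller must ensure that
 * words[byte_off / 4 + 1] exists when byte_off is not a multiple of 4. */
static uint32_t load_two_words(const uint32_t *words, uint32_t byte_off)
{
    uint32_t index = byte_off >> 2;         /* word holding the first byte */
    uint32_t shift = (byte_off & 3u) * 8u;  /* bit offset inside that word */

    if (shift == 0u)
        return words[index];                /* aligned: one read, no penalty */

    uint32_t lo = words[index];             /* first aligned read */
    uint32_t hi = words[index + 1u];        /* second aligned read: the penalty */
    return (lo >> shift) | (hi << (32u - shift));
}
```

Every unaligned offset takes the same two reads, which is where the uniform single penalty cycle in the table comes from.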

A word about the store buffer

On Cortex-M3/4, stores are handed off to a unit known as the store buffer. It allows program execution to continue with the store instruction seemingly consuming only a single cycle; in reality, the store buffer defers the write until memory is ready to accept it. However, if too many stores are issued too quickly and memory is unable to keep up, the core may become blocked on the store buffer.

Unaligned Stores on Cortex-M3/4/7

When an unaligned store occurs, the processor pushes it into the store buffer and continues on as it does for all stores. Beneath the store buffer, the following occurs:

Address Modulo 4	How Done	Penalty Cycles
0	4B	0
1	1B+2B+1B	2
2	2B+2B	1
3	1B+2B+1B	2

For unaligned accesses, ARM states "A STR or STRH cannot delay the processor because of the store buffer." That's not entirely true: again, if memory is slower than the processor, or there are simply a lot of stores in the instruction stream, the penalty cycles caused by unaligned accesses can bubble up and eventually stall the processor.

How this affects embedded developers

Occasionally, when benchmarking performance-critical code, I'll get strange, hard-to-explain results. A memset to one location might run slower than a memset to another, even though both are in the same fixed-wait-state memory. Or a graphical operation might run slower depending on the coordinates used, even though the number of pixels is the same from run to run. The cause is often unaligned memory accesses, and the good news is that it can often be at least partially optimized, usually by aligning the accesses.

Optimizations for memset

For a write-only, memset-like function, it's simple enough to eliminate all unaligned accesses by hand-aligning the head and tail of the operation, and if the operation is large enough, it's almost always worth it.

Check out Newlib's memset() for an example. Note that Newlib is portable to platforms where unaligned accesses are undefined (they trap), so it must eliminate them, and in doing so it ends up being a fast, portable memset.
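A minimal sketch of the head/body/tail idea follows. This is not Newlib's code. The name memset_aligned is made up, 4-byte words are assumed, and the word stores through a uint32_t pointer assume the buffer may legally be accessed that way (or that the code is built with strict aliasing disabled, as low-level library code often is).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memset that never issues an unaligned word store. */
static void *memset_aligned(void *dst, int value, size_t len)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;

    /* Head: byte stores until p reaches a 4-byte boundary. */
    while (len > 0 && ((uintptr_t)p & 3u) != 0u) {
        *p++ = byte;
        len--;
    }

    /* Body: aligned word stores, each with the byte replicated 4 times. */
    uint32_t word = byte * 0x01010101u;
    uint32_t *w = (uint32_t *)(void *)p;
    while (len >= 4) {
        *w++ = word;
        len -= 4;
    }

    /* Tail: the remaining 0 to 3 bytes. */
    p = (unsigned char *)w;
    while (len > 0) {
        *p++ = byte;
        len--;
    }
    return dst;
}
```

A production version would typically unroll the body or use store-multiple instructions, but the alignment handling stays the same.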

Optimizations for memcpy

For a read-write, memcpy-like function, it's sometimes impossible to align both ends of the copy. In this case, it's best to align either the source or the destination, then use a mix of unaligned loads and aligned stores, or aligned loads and unaligned stores, to carry out the bulk of the work.

ARM authored Newlib's memcpy, and it has an excellent example of aligning at least one pointer, then proceeding even if the other isn't aligned.
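The sketch below shows one of the two variants: align the destination, then do possibly unaligned loads and aligned stores. It is not Newlib's implementation, and the name memcpy_dst_aligned is invented here. The fixed-size memcpy calls are the portable way to express a word access in C; compilers targeting cores with unaligned access support typically lower them to single load and store instructions, but that is an assumption about the toolchain and its flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical copy that aligns the destination and tolerates an
 * unaligned source. Regions must not overlap. */
static void *memcpy_dst_aligned(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is 4-byte aligned. */
    while (len > 0 && ((uintptr_t)d & 3u) != 0u) {
        *d++ = *s++;
        len--;
    }

    /* Body: 4-byte load (source may be unaligned), 4-byte store
     * (destination is aligned at this point). */
    while (len >= 4) {
        uint32_t word;
        memcpy(&word, s, sizeof word);
        memcpy(d, &word, sizeof word);
        d += 4;
        s += 4;
        len -= 4;
    }

    /* Tail: the remaining 0 to 3 bytes. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return dst;
}
```

Whether aligning the destination or the source is the better choice depends on the memories involved, so it is worth benchmarking both.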

Optimizations for stream-mixing

For a read-read-write function, things are even more complicated. One example of this is mixing two 8-bit audio streams four samples at a time using LDR(streamA), LDR(streamB), QADD8, STR(streamOut). In this case, you usually want to hand-align so that as many streams as possible are aligned. That number could end up being anywhere from 1 to 3, even if none of the streams is aligned to begin with.
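One way to pick the hand-alignment, sketched under the assumption that all three streams advance one byte per sample: try each possible head length (0 to 3 samples processed one at a time) and keep the one that leaves the most pointers 4-byte aligned. The function name best_head and its parameters are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: returns the number of leading samples (0..3) to
 * process bytewise so that as many of the three stream pointers as
 * possible are 4-byte aligned afterwards. Ties go to the smallest head.
 * If aligned_count is non-NULL, it receives how many streams end up
 * aligned (always between 1 and 3). */
static unsigned best_head(uintptr_t a, uintptr_t b, uintptr_t out,
                          unsigned *aligned_count)
{
    unsigned best = 0;
    unsigned best_count = 0;

    for (unsigned head = 0; head < 4; head++) {
        unsigned count = (((a + head) & 3u) == 0u)
                       + (((b + head) & 3u) == 0u)
                       + (((out + head) & 3u) == 0u);
        if (count > best_count) {
            best = head;
            best_count = count;
        }
    }

    if (aligned_count)
        *aligned_count = best_count;
    return best;
}
```

Each pointer becomes aligned for exactly one of the four head lengths, which is why at least one stream can always be aligned, even when none is aligned at the start.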

Considerations for different memory subsystems

So far, this post hasn't really touched on different memory architectures and how they affect unaligned accesses. For example, many Cortex-M processors are equipped with Flash ROM behind a prefetcher. This means that the consecutive accesses caused by an unaligned access cost only the penalty cycles. However, a slow, cacheless RAM may take many CPU cycles to complete a single memory cycle, and an unaligned access now requires up to 3 of those memory cycles! Stranger yet are stores to SDRAMs, where the store buffer can cover up entire DRAM refreshes, whereas refreshes stall loads immediately. Finally, only rarely does a memory exist in a system with only a CPU: there are often other things contending for access to the same RAMs, such as DMAs, GPUs, display controllers, and more. Keep in mind that on Cortex-M, the store buffer is the only way to hide memory latency, and if you're doing a lot of stores, it may quickly become a bottleneck, doubly or triply so if the stores are left unaligned.
