Structure of Arrays (SoA) vs Array of Structures (AoS) in C++: A Deep Dive into Cache-Optimized Performance

How Memory Layout Affects Performance at the Nanosecond Level

In performance-critical systems, such as high-frequency trading (HFT), real-time games, or scientific simulations, writing fast code isn't just about algorithms. It's about understanding how your data moves through memory, and how that affects the CPU's ability to keep up.

One of the most important, yet often overlooked, design choices is how you lay out your data in memory. This is the battle between:

  • Array of Structures (AoS): intuitive, object-oriented, but potentially cache-hostile.
  • Structure of Arrays (SoA): slightly more complex, but much more cache-friendly.

Today, you'll see how changing your data layout (and nothing else) can yield a 40-60% performance improvement in real workloads.


The Problem: Same Computation, Different Performance

Imagine you're building a system to track the best bid and ask prices for 10 million stocks. All you need to do is sum up all the bid prices for some statistical or analytic purpose.

Seems simple, right?

Let’s look at two ways of structuring this data in C++.

Option 1: Array of Structures (AoS)

#include <cstddef>
#include <vector>

struct StockData {
    double bid;
    double ask;
};

constexpr std::size_t num_elements = 10'000'000;
std::vector<StockData> data1(num_elements);

Here, each StockData contains a bid and an ask. You store 10 million of them in a vector.

In memory, this looks like:

[bid_0, ask_0, bid_1, ask_1, bid_2, ask_2, ...]

Option 2: Structure of Arrays (SoA)

std::vector<double> bidV(num_elements);
std::vector<double> askV(num_elements);

Here, bids and asks are separated into two vectors.

In memory:

[bid_0, bid_1, bid_2, bid_3, ...]
[ask_0, ask_1, ask_2, ask_3, ...]

The Task: Sum All Bid Prices

We'll write two functions:

#include <numeric>   // for std::accumulate

double sumAoS(const std::vector<StockData>& data) {
    double sum = 0.0;
    for (const auto& stock : data) {
        sum += stock.bid;   // only the bid half of each element is used
    }
    return sum;
}

double sumSoA(const std::vector<double>& bids) {
    return std::accumulate(bids.begin(), bids.end(), 0.0);
}

Each function does the same thing: it sums 10 million double-precision numbers.


The Benchmark

When timed in Release mode, the results are:

AoS Time: ~16.8 milliseconds  
SoA Time: ~11.7 milliseconds

A 43% performance improvement. Just from changing how the data is laid out in memory.

But why?


The Science Behind It: CPU Memory Hierarchy

To understand this, you need to know how modern CPUs work with memory.

When your program accesses memory, it doesn't go straight to RAM. That would be way too slow.

Instead, it works through a multi-level memory hierarchy:

Level           Size (Typical)   Latency
CPU Registers   < 1 KB           < 1 ns
L1 Cache        32-64 KB         ~1-2 ns
L2 Cache        256 KB-4 MB      ~5-10 ns
L3 Cache        8-64 MB          ~20-50 ns
RAM             GBs              ~100-200 ns

When you access a variable:

  • If it's in L1 cache: super fast.
  • If it's in L2/L3: slower.
  • If it's in RAM: painfully slow.

That’s why keeping your data cache-friendly is one of the best things you can do for performance.


What Are Cache Lines?

CPUs don't read individual bytes. They read memory in chunks called cache lines, typically 64 bytes.

A double is 8 bytes. So, one cache line holds 8 values.

  • If you access bid[0], the CPU loads bid[0] to bid[7] into L1 cache.
  • If your code then accesses bid[1], it’s already in cache -> cache hit.
  • If your code jumps to bid[100000], that’s a cache miss. The CPU stalls while it fetches data from RAM.
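The effect of access pattern can be made concrete with a small sketch: both functions below compute the same sum, but the strided version jumps a full cache line (8 doubles) at a time and then circles back, which defeats sequential prefetching. On most machines the strided walk is measurably slower, even though it touches exactly the same elements.

```cpp
#include <cstddef>
#include <vector>

// Sequential walk: every loaded cache line is fully consumed before
// the loop moves on to the next one.
double sumSequential(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Strided walk: touches the same elements, but jumps 8 doubles
// (one typical 64-byte cache line) per step, then comes back for
// the skipped ones -- far less prefetcher-friendly.
double sumStrided(const std::vector<double>& v, std::size_t stride = 8) {
    double s = 0.0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < v.size(); i += stride)
            s += v[i];
    return s;
}
```

Timing these two against each other (with the harness from earlier) is a good way to verify cache behavior on your own hardware.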

The Hidden Cost of AoS

Let’s now compare what happens under each layout.

Structure of Arrays (SoA):

[bid_0, bid_1, bid_2, ..., bid_7]  ← Cache Line 1

When you loop through bidV, you’re walking through memory sequentially. Every cache line is 100% full of useful data.

The hardware prefetcher sees this access pattern and starts preloading future cache lines. Your loop runs fast and smooth.


Array of Structures (AoS):

[bid_0, ask_0, bid_1, ask_1, bid_2, ask_2, ...]

Now each cache line includes both bids and asks.

If you're only summing bids:

  • Half of every cache line is wasted (ask values).
  • The CPU has to load more cache lines to reach the same amount of useful data.
  • Cache lines are being evicted faster.
  • Prefetching is less efficient.

End result: more cache misses, more memory traffic, more latency.


Real Cost of Cache Misses

Let’s say each bid access takes 1 ns with a cache hit, but 200 ns with a cache miss.

If you sum 10 million bids:

  • With 100% cache hits (SoA):
    10 million × 1 ns = 10 ms
  • With 50% cache misses (AoS):
    5 million × 1 ns + 5 million × 200 ns ≈ 1,005 ms

This is an extreme example, but it shows why memory layout matters more than you think.


When to Use SoA vs. AoS

Use Case                                      Recommended Layout
Processing all fields together (e.g., copy)   AoS
Processing one field (e.g., sum bids only)    SoA
SIMD/vectorization opportunities              SoA
Data serialization/deserialization            AoS
Tight loops and data locality optimizations   SoA
Object-oriented modeling and encapsulation    AoS

In data-intensive systems like trading platforms or simulation engines, the benefits of SoA can be massive, especially in hot code paths.
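One common middle ground is to hide the parallel vectors behind a single class, so call sites keep an AoS-like interface while the storage stays SoA. A hypothetical StockBook sketch (not from the code above) might look like:

```cpp
#include <cstddef>
#include <vector>

// SoA storage behind an AoS-style interface: each field lives in its
// own contiguous array, but callers add and read logical "rows".
class StockBook {
public:
    void push_back(double bid, double ask) {
        bids_.push_back(bid);
        asks_.push_back(ask);   // the two arrays stay in lockstep
    }
    std::size_t size() const { return bids_.size(); }

    double bid(std::size_t i) const { return bids_[i]; }
    double ask(std::size_t i) const { return asks_[i]; }

    // Hot loops get direct, cache-friendly access to a single field.
    const std::vector<double>& bids() const { return bids_; }

private:
    std::vector<double> bids_;
    std::vector<double> asks_;
};
```

A sum over book.bids() then runs over one dense array, so it keeps the cache and vectorization benefits of raw SoA without scattering two unrelated vectors through the codebase.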


Final Thoughts

If you're coming from a high-level language like Python, you've probably never had to think about cache lines, prefetchers, or memory layout. But in C++, these are first-class concerns, especially if you're writing code where nanoseconds matter.

In the world of HFT and systems programming:

  • SoA is your friend for performance.
  • AoS is your friend for readability and structure.

Understanding when to use each can be the difference between good code and code that screams.


“Write code for humans. Lay out data for machines.”
Modern C++ wisdom
