Snake Battle Royale: High-Frequency Game Server Optimization Project (Part 3)

Hot Path Optimization: When Cache Misses Become Your Biggest Enemy


The Cache Performance Reality Check

After the previous optimizations brought us from 10.422 μs to 2.562 μs, I was feeling pretty good. But then I bumped the grid size to 10,000×10,000 (~100 MB) and reality hit hard. Cache performance got worse, just as I had expected: the grid was now far too big to fit in my L3 cache.

Measuring Cache and Branch Metrics

Snakes    Cache Hit %    Cache Hits    Cache Misses    Branch Pred %    IPC     Iterations
100       53.36%         9,136         6,993           99.77%           2.85    263
300       53.32%         9,350         7,274           99.90%           3.93    223
500       54.02%         17,990        13,517          99.93%           4.35    327
700       76.03%         127,939       33,894          99.73%           4.64    1,615
900       71.62%         160,699       56,923          99.82%           4.65    1,615
1000      69.95%         226,756       88,068          99.87%           4.63    1,820

The pattern was clear: as snakes dispersed across the grid, cache misses became more frequent. At 1000 snakes, we were only hitting the cache ~70% of the time. That's a lot of expensive memory accesses.

Cache Thrashing!

The issue was in my object-oriented design philosophy. I had created clean, encapsulated methods for each class (Snake, Apple, etc.) that operated at a per-object level. While this made the code maintainable, it made global grid-level optimizations tricky.

Here's what was happening in each tick:

Per Snake (1000 snakes per tick):

  1. READ: grid.get_cell(&new_head) - Check for apple consumption
  2. READ: grid.get_cell(&new_head) - Check for collision
  3. WRITE: grid.set_cell(*tail, Cell::Empty) - Clear old tail
  4. WRITE: grid.set_cell(*head, Cell::Snake) - Set new head

Per Apple:

  1. WRITE: grid.set_cell(new_apple_pos, Cell::Apple) - Spawn new apple

With 1000 snakes, that's 4000+ random grid accesses per tick across a 10k×10k grid. We were essentially cache thrashing—jumping around randomly in a 100MB grid with no spatial locality.
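
To make that concrete, here's a minimal sketch of what the per-snake hot path boils down to. The get_cell/set_cell helpers and the Cell variants appear in the real code; the flat Vec<Cell> grid layout, the function, and everything else here is my own simplification:

#[derive(Clone, Copy, PartialEq)]
enum Cell { Empty, Snake, Apple }

#[derive(Clone, Copy)]
struct Point { x: usize, y: usize }

struct Grid { cells: Vec<Cell>, width: usize }

impl Grid {
    fn get_cell(&self, p: &Point) -> Cell {
        self.cells[p.y * self.width + p.x]
    }
    fn set_cell(&mut self, p: Point, c: Cell) {
        self.cells[p.y * self.width + p.x] = c;
    }
}

// Two reads and two writes per snake, each at an effectively random
// position in a ~100 MB buffer.
fn naive_step(grid: &mut Grid, new_head: Point, old_tail: Point) {
    let ate_apple = grid.get_cell(&new_head) == Cell::Apple; // READ 1: apple check
    let collided = grid.get_cell(&new_head) == Cell::Snake;  // READ 2: collision check
    if collided {
        return; // the snake dies; clean-up elided
    }
    if !ate_apple {
        grid.set_cell(old_tail, Cell::Empty);                // WRITE 1: clear old tail
    }
    grid.set_cell(new_head, Cell::Snake);                    // WRITE 2: place new head
}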

The Spatial Batching Solution

I needed to transform random access patterns into spatially-ordered operations. The key insight: collect all movement records, sort them by position, then process in cache-friendly order.

The New Algorithm

Phase 1: Collect Records (NO Grid Reads)

FOR each alive snake:
    calculate new_head_position
    create MovementRecord(snake_id, new_head_position, empty_cell)
    add to records list

Phase 2: Sort by Spatial Locality

SORT records by (y_coordinate, x_coordinate)
// This groups nearby operations for better cache utilization

Phase 3-5: Combined Loop (Read, Process, Write Immediately)

FOR each record in sorted order:
    READ cell_at_new_head from grid
    DEAL with Snake Collision
    DEAL with apple eating
    
    HANDLE Snake Race conditions within tick()
    
    WRITE Snake to new_head position
    IF not growing:
        WRITE Empty to old tail position
    UPDATE snake body
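
Translated into Rust, the phases look roughly like this, reusing the simplified Grid/Cell/Point types from the earlier sketch. MovementRecord is the real struct's name, but its fields here (and carrying the old tail inside the record) are illustrative guesses:

#[derive(Clone, Copy)]
struct MovementRecord {
    snake_id: u32,
    new_head: Point,
    old_tail: Point,
}

fn tick(grid: &mut Grid, mut records: Vec<MovementRecord>) {
    // Phase 2: sort by (y, x) so consecutive records touch nearby rows of the
    // flat grid buffer, i.e. nearby cache lines.
    records.sort_unstable_by_key(|r| (r.new_head.y, r.new_head.x));

    // Phases 3-5: read, resolve, and write while the line is still hot.
    for r in &records {
        let cell = grid.get_cell(&r.new_head);
        if cell == Cell::Snake {
            // collision: snake r.snake_id dies; clean-up elided here
            continue;
        }
        let growing = cell == Cell::Apple;     // eating an apple grows the snake
        grid.set_cell(r.new_head, Cell::Snake);
        if !growing {
            grid.set_cell(r.old_tail, Cell::Empty);
        }
        // the snake's body deque would be updated here
    }
}

The point of combining the phases is that the write to the new head lands on a cache line the read has usually just pulled in.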

1. Partitioning Instead of Full Sorting

Full sorting was expensive. Instead, I used spatial partitioning with 2^8 buckets:

CREATE 256 buckets (2^8)
FOR each movement record:
    calculate bucket_index = (y_coord >> 8) * 256 + (x_coord >> 8)
    add record to buckets[bucket_index]

This gave us most of the cache benefits with much less overhead.
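
Here's a sketch of that bucketing step against the same illustrative types. One caveat: with 8-bit shifts, a 10,000-wide grid yields roughly 40 tiles per axis, so this version sizes the bucket array from the grid dimensions rather than hard-coding 256; treat the exact layout as my reading of the scheme, not the project's actual code:

fn partition(
    records: Vec<MovementRecord>,
    grid_width: usize,
    grid_height: usize,
) -> Vec<Vec<MovementRecord>> {
    const SHIFT: usize = 8; // 2^8 = 256 cells per bucket side
    let buckets_per_row = (grid_width >> SHIFT) + 1;
    let bucket_rows = (grid_height >> SHIFT) + 1;
    let mut buckets = vec![Vec::new(); buckets_per_row * bucket_rows];

    for r in records {
        let idx = (r.new_head.y >> SHIFT) * buckets_per_row + (r.new_head.x >> SHIFT);
        buckets[idx].push(r);
    }
    // Draining the buckets in row-major order visits the grid one tile at a
    // time: most of the benefit of a full sort at a fraction of the cost.
    buckets
}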

2. Batched Tail Writes

The tail clearing writes were still causing cache misses. I collected these writes into buckets and processed them at the end:

DURING movement processing:
    IF snake not growing:
        add tail_position to tail_clears records

AT END of tick:
    FOR each tail_position in tail_clears records:
        WRITE Empty to tail_position
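
In code, the deferral looks something like the sketch below; the function name is mine, and the sort at the end simply stands in for the same bucketing used for the movement records:

fn flush_tail_clears(grid: &mut Grid, mut tail_clears: Vec<Point>) {
    // Order the deferred writes spatially so the flush itself stays cache-friendly.
    tail_clears.sort_unstable_by_key(|p| (p.y, p.x));
    for p in tail_clears {
        grid.set_cell(p, Cell::Empty);
    }
}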

Performance Results

Cache hit rate improved from ~70% with the random-access version to ~78% with spatial batching - a solid improvement, though not as dramatic as I'd hoped.

TinyDeque: The Stack-First Approach

After optimizing the grid access patterns, I turned my attention to snake body storage. Most snakes in the game are small (3-4 segments), so I suspected that heap allocation for every snake body was wasteful.

The VecDeque Problem

The standard VecDeque allocates each snake body on the heap, which means a separate allocation (and an extra pointer chase) for every snake, even the tiny 3-4 segment ones.
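
For context, here's a hypothetical VecDeque-backed body, just to show where the cost comes from; the names are illustrative and Point is the same sketch type as in the earlier snippets:

use std::collections::VecDeque;

struct SnakeBody {
    segments: VecDeque<Point>, // heap-allocated buffer, even for tiny snakes
}

impl SnakeBody {
    // Each tick the head advances; the tail only stays put when an apple was eaten.
    fn advance(&mut self, new_head: Point, growing: bool) {
        self.segments.push_front(new_head);
        if !growing {
            self.segments.pop_back();
        }
    }
}

The deque operations themselves are cheap; it's the per-snake heap buffer behind them that hurts when you have a thousand mostly 3-4 segment snakes.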

My Custom Deque Attempt

I initially tried implementing a custom deque using SmallVec as the backing store:

use smallvec::SmallVec;

pub struct Snake {
    pub body: SmallVec<[Point; 16]>,  // Stack-allocated for small snakes
    pub front_idx: usize,             // Circular buffer head
    pub end_idx: usize,               // Circular buffer tail
    // ...
}

The result? I realised it was going to be a bigger rabbit hole than I had imagined, with all sorts of performance and edge-case concerns. I was just about to spin it off into its own project, and then ...

Enter TinyDeque

I discovered the tinydeque crate - exactly what I needed!

use tinydeque::TinyDeque;

pub struct Snake {
    pub body: TinyDeque<[Point; 16]>,  // Stack-first, heap spill
    // ...
}

TinyDeque's magic: small snake bodies live entirely inline, and only snakes that outgrow the 16-point inline capacity spill over to a heap allocation.

Final Results: The Performance Journey

The Numbers

For the 10,000×10,000 grid:

Optimization                     Performance
Baseline                         ~18 μs
Spatial Batching + TinyDeque     ~8.6 μs

Total improvement: 52% faster - from ~18 μs down to ~8.6 μs per tick.

What Actually Mattered

The performance journey taught me some valuable lessons about optimization:

  1. Cache locality trumps algorithmic complexity - The spatial batching approach was more complex than the simple per-snake loop, but the cache benefits far outweighed the overhead.

  2. Measure, don't guess - Quite a few things I tried seemed really good in theory but never worked in practice.

  3. Premature optimization is still the root of all evil - ;)

But for now, I'm happy with this performance. I am parking this project as I've found other projects to bootstrap along the way!