Cached at:
06/27/26, 01:53 PM
# Shard your locks: benchmarking 6 Go cache designs
Source: [https://strebkov.dev/posts/shard-your-locks/](https://strebkov.dev/posts/shard-your-locks/)
I built the same in\-memory`string → string`cache six ways, using nothing but the Go standard library, and benchmarked them under read\-heavy, balanced, and write\-heavy load across 1 to 8 cores\. The rankings flip depending on the workload — and one of the “obvious” answers gets*slower*the more cores you give it\.
**TL;DR:**Shard your locks\. A 256\-way striped map \(`sharded`\) was the all\-around winner — up to**8× faster**than a single`sync\.Mutex`at 8 cores — and it’s about 15 lines of code\.`sync\.RWMutex`, the reflexive fix for “reads are contended,” is a trap: it barely helps reads past two cores and is*slower than a plain mutex*for writes\.
## The contenders
CacheIdeaOne\-liner`naive`Plain`map`, no lockingNot thread\-safe — concurrent writes crash the process\. Baseline only\.`mutex`One`sync\.Mutex`Simple, correct, doesn’t scale\.`rwmutex`One`sync\.RWMutex`Parallel reads, exclusive writes\.`syncmap``sync\.Map`The stdlib’s own concurrent map\.`sharded`256 shards, one mutex eachLock striping\. Keys routed by hash\.`cow`Copy\-on\-write via`atomic\.Pointer`Lock\-free reads; every write copies the whole map\.All six satisfy one interface, so a single harness drives them identically\. Code:[github\.com/kluyg/in\-memory\-cache](https://github.com/kluyg/in-memory-cache)\.
## How I measured \(the short version\)
`testing\.B`\+`b\.RunParallel`, 1,000,000 keys, GOMAXPROCS swept 1→8, on a 20\-core i7\-14700K\. Each data point is the median of 10 runs summarized with`benchstat`; variation was mostly ±0–3%\. Throughput below is`1000 / \(ns/op\)`in millions of ops/sec — higher is better\. I measured the cache*in\-process*, not behind HTTP: net/http \+ JSON cost microseconds, which would bury the nanosecond\-scale differences I’m chasing\.
The 14700K is a*hybrid*chip — 8 performance cores \(with hyperthreading\) plus 12 efficiency cores — so an unpinned sweep is a trap: as GOMAXPROCS rises, the OS can spill goroutines onto E\-cores or hyperthread siblings and migrate them mid\-run, which confounds the scaling curves\. So the process is pinned to one thread per physical P\-core \(affinity`0x5555`\); each GOMAXPROCS step adds a real P\-core\. Pinning shifted the absolute numbers by 10–25% in places but left every ranking and curve shape unchanged\.
One deliberate non\-axis:**value size doesn’t matter here\.**Go strings are immutable, so`Set`stores a 16\-byte header and never touches the value bytes — 64 B and 16 KB benchmark identically \(0 B/op\)\. Value size affects memory and GC, not op throughput\.
## The results

Read the slopes, not just the heights:
- **`sharded`and`cow`climb;`mutex`is flat\.**More cores, more throughput — unless you picked a single lock\.
- **`cow`owns read\-only**\(87 Mops/s at 8 cores, fully lock\-free reads\) and**vanishes the moment writes appear**— it’s pinned to ≈0 on the three write panels because every`Set`copies the entire million\-entry map\.
- **`sharded`is the only design that’s near the top in*every*panel\.**
### The obvious fix scales backwards
Normalize each design to its own single\-core throughput and the story gets sharper:

- **`mutex`is*below*1×**— at 8 cores it’s 0\.66× its single\-core speed\. Reads can’t run in parallel, and the cache line holding the lock ping\-pongs between cores\. You added hardware and lost performance\.
- **`rwmutex`plateaus around 2×\.**The shared reader counter becomes the new contention point; it stops improving after ~4 cores\.
- **`sharded`reaches 6\.9×, while`cow`and`syncmap`track — even slightly exceed — the ideal 8× line**\(lock\-free reads get a bonus from the larger aggregate cache\)\. Caveat:`syncmap`’s great*slope*flatters a poor baseline — it’s still slower in absolute terms than`sharded`\.
### Skew isn’t simply “worse”
Real caches see Zipfian access — a few hot keys take most of the traffic\. The common assumption is that skew hurts\. It’s more interesting than that:

Above 1× means*faster*under skew\.**Reads get faster almost everywhere**— the hot keys stay in CPU cache \(`mutex`reads speed up 1\.6×,`syncmap`1\.9×\)\. The striking exception is**`sharded`’s balanced mix at 0\.82× — skew makes it*slower***: hot keys collide on a few shards, so those locks contend while the rest sit idle\.
`cow`is the control case: its balanced\-mix bar sits at 1\.03×, essentially flat\. That’s the tell\-tale of a design whose write cost is*distribution\-independent*— it copies the whole map on every`Set`regardless of which key changed, so the key distribution can’t touch it\. Skew moves a number only where the distribution changes*where*work lands \(cache lines, shards\); it leaves`cow`’s uniform copy cost alone\. Which way it cuts depends on your design and write ratio\.
### The numbers \(8 cores, ns/op, lower is better\)
Uniform:
mixmutexrwmutexsyncmapshardedcowread\-only168533021**11\.5**read\-heavy16825937**22**12,000,000balanced19028257**24**46,500,000write\-heavy20822273**25**82,500,000Zipfian \(s=1\.1\):
mixmutexrwmutexsyncmapshardedcowread\-only106491617**7**read\-heavy11222524**24**9,040,000balanced12618346**29**45,100,000write\-heavy13114268**32**84,000,000Those eight\-figure`cow`write cells are real, and the whole column is ns: a write copies the entire million\-entry map, ~10⁶× slower than the alternatives\.`82,500,000`ns is**82 milliseconds**— per`Set`\. That’s the price of lock\-free reads\.
## The winner, in a few lines
`sharded`is just N independent maps, each behind its own lock\. A key’s hash picks the shard, so operations on different keys almost never touch the same lock — contention drops by roughly a factor of N:
```
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
```
```
const shards = 256 // power of two, so we can mask instead of modulo
type part struct {
mu sync.Mutex
m map[string]string
}
type Sharded struct{ parts [shards]*part }
func (c *Sharded) at(key string) *part {
h := uint64(14695981039346656037) // FNV-1a
for i := 0; i < len(key); i++ {
h = (h ^ uint64(key[i])) * 1099511628211
}
return c.parts[h&(shards-1)]
}
func (c *Sharded) Get(key string) (string, bool) {
p := c.at(key)
p.mu.Lock()
v, ok := p.m[key]
p.mu.Unlock() // explicit, not defer (+8% here — see note below)
return v, ok
}
func (c *Sharded) Set(key, value string) {
p := c.at(key)
p.mu.Lock()
p.m[key] = value
p.mu.Unlock()
}
```
That’s the whole idea\. The[real version](https://github.com/kluyg/in-memory-cache/blob/main/sharded.go)adds`Delete`/`Len`and pads each shard onto its own cache line \(so locking one shard doesn’t bounce a neighbor’s cache line\), but the engine is right here\.
About that explicit`Unlock`: Go 1\.14’s open\-coded defers made a single leaf`defer`*cheap*— zero extra allocations — but not free\. Measured on Go 1\.26,`defer p\.mu\.Unlock\(\)`here costs**\+8% \(≈1 ns\)**over an explicit unlock \(11\.6 → 12\.6 ns, p < 0\.001\)\. Tiny in isolation, but on a path this hot 8% is 8%, so the hot methods unlock explicitly\. Re\-run it yourself:[`defer\_test\.go`](https://github.com/kluyg/in-memory-cache/blob/main/defer_test.go)\.
### Why 256 shards?
Enough to kill contention, not so many you’re wasting memory\. Sweeping the shard count at 8 cores on the balanced mix, throughput climbs steeply to ~256 and then flattens:

One lock to 256 is a**9× jump**\(4\.6 → 43 Mops/s\)\. Past 256 the returns collapse: 1024 buys \+13%, 4096 only \+18% — for 4× and 16× the shards, each an extra map, mutex, and cache line\. 256 sits right in the knee\. \(Tune it for your core count and write ratio; the sweep is one benchmark away\.\)
## What to actually use
- **Default to`sharded`\.**Best or near\-best everywhere, scales with cores, trivial to write\. This is the answer for most concurrent maps\.
- **`cow`for read\-mostly\-to\-read\-only data**— config snapshots, routing tables, feature flags\. Unbeatable reads, but only if writes are rare and batched\. Never for write\-heavy load\.
- **`sync\.Map`only in its niche**— stable keys written once and read forever, or goroutines touching disjoint key sets\. Outside that it’s mediocre and it*allocates*\(interface boxing: 40–72 B/op\)\.
- **`sync\.RWMutex`: reach for it rarely\.**It only wins in the narrow read\-heavy, low\-core corner, and it’s worse than a plain mutex for writes\.
- **A plain`mutex`is fine**when contention is low or you’re at low core counts\. Don’t reach for cleverness you can’t measure a need for\.
- **Never the naive map across goroutines**— Go’s runtime will deliberately crash the process with a`concurrent map writes`fatal error\.
## Three things that surprised me
1. **More cores made the single\-mutex cache slower\.**Negative scaling is real, and it’s just cache\-line contention on the lock itself\.
2. **`RWMutex`is a half\-measure that backfires on writes\.**The reader\-counter bookkeeping costs more than it saves once writes enter the mix\.
3. **Skew is a double\-edged sword**, not a uniform penalty — it speeds up reads via cache locality while concentrating write contention\.
The full code, raw`benchstat`output, and a one\-command sweep to reproduce all of this on your own hardware are in the[repo](https://github.com/kluyg/in-memory-cache)\.