Optimizing LLVM's bump allocator

Lobsters Hottest Tools

Summary

This blog post details three recent optimizations to LLVM's BumpPtrAllocator, reducing fast-path overhead by removing redundant alignment, null pointer checks, and per-allocation accounting, resulting in improved performance for Clang, lld, and other LLVM components.

<p><a href="https://lobste.rs/s/ltc5ca/optimizing_llvm_s_bump_allocator">Comments</a></p>
Original Article
View Cached Full Text

Cached at: 06/29/26, 06:24 AM

# Optimizing LLVM's bump allocator Source: [https://maskray.me/blog/2026-06-28-optimizing-llvm-bump-allocator](https://maskray.me/blog/2026-06-28-optimizing-llvm-bump-allocator) `BumpPtrAllocator`is LLVM's bump allocator \(arena allocator\): each allocation bumps a pointer within a slab, and everything is freed at once when the allocator dies\. It backs Clang's`ASTContext`, lld's`make<T\>`object pools, TableGen records, and many other arenas\. Here is the fast path before three recent changes: ``` 12345678910111213141516 ``` ``` __attribute__((returns_nonnull)) void *Allocate(size_t Size, Align Alignment) { BytesAllocated += Size; uintptr_t AlignedPtr = alignAddr(CurPtr, Alignment); size_t SizeToAllocate = Size;#if LLVM_ADDRESS_SANITIZER_BUILD SizeToAllocate += RedZoneSize;#endif uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate; if (LLVM_LIKELY(AllocEndPtr <= uintptr_t(End) && CurPtr != nullptr)) { CurPtr = reinterpret_cast<char *>(AllocEndPtr); ... return reinterpret_cast<char *>(AlignedPtr); } return AllocateSlow(Size, SizeToAllocate, Alignment);} ``` Three changes streamline the three marked lines\. ## A minimum alignment skips the realign \([\#205240](https://github.com/llvm/llvm-project/pull/205240)\) `alignAddr\(CurPtr, Alignment\)`is wasteful: a freshly\-bumped pointer is usually aligned enough already\.[\#205240](https://github.com/llvm/llvm-project/pull/205240)rounds each size up to`MinAlign`\(default 8\), so the fast path realigns only for over\-aligned requests\. I've learned the trick from[Bump Allocation: Up or Down?](https://coredumped.dev/2024/03/25/bump-allocation-up-or-down/): ``` 1234567 ``` ``` SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign);uintptr_t AlignedPtr = uintptr_t(CurPtr);if (Alignment.value() > MinAlign) AlignedPtr = alignAddr(CurPtr, Alignment); ``` `SpecificBumpPtrAllocator<T\>`uses`MinAlign = 1`instead —`DestroyAll`strides at`sizeof\(T\)`, so it needs tight packing, not rounding\. I made a mistake in the first attempt:`nullptr`plus a non\-zero offset triggered a UBSan diagnostic\. Fixed by keeping the math in the`uintptr\_t`domain\. ## A sentinel End drops the null check \([\#205485](https://github.com/llvm/llvm-project/pull/205485)\) `\_\_attribute\_\_\(\(returns\_nonnull\)\)`specifies the return value is non\-null\. In a fresh allocator whose`CurPtr`and`End`are both null,`Allocate\(0\)`used to return null\. In 2022,[https://reviews\.llvm\.org/D125040](https://reviews.llvm.org/D125040)added the`&& CurPtr \!= nullptr`check to the fast path condition, which was not ideal\. I tried ``` 123 ``` ``` if (LLVM_LIKELY(AlignedPtr + SizeToAllocate - 1 < uintptr_t(End))) { ... } ``` but then adopted aengelke's suggestion\. Storing the end as a sentinel one past the real end \(`EndSentinel = realEnd \+ 1`, and`0`when there is no slab\) folds both conditions into one unsigned compare: ``` 1 ``` ``` if (LLVM_LIKELY(AllocEndPtr < EndSentinel)) { ... } ``` An empty allocator has`EndSentinel == 0`, so`AllocEndPtr < 0`is always false and the null case falls through to the slow path with no separate branch\. ## Dropping the per\-allocation accounting \([\#205711](https://github.com/llvm/llvm-project/pull/205711)\) `BytesAllocated \+= Size`was a read\-modify\-write to a member on every allocation, backing a`getBytesAllocated\(\)`that reported*requested*bytes — distinct from`getTotalMemory\(\)`'s slab capacity\. It had only stats/diagnostic consumers: lldb's ConstString memory report, a clangd debug log, TableGen's`dumpAllocationStats`, and one clang regression test\. Dropping the member and migrating those consumers \(mostly to`getTotalMemory\(\)`\) removes the hot\-path store\. **A detail: the red zone and ABI\.**The ASan red\-zone size is also a member\. Gating it on`\#if LLVM\_ADDRESS\_SANITIZER\_BUILD`to drop it in release builds would be an ABI footgun: that macro is*per translation unit*, so an ASan\-instrumented TU and a non\-ASan`libLLVM`would silently disagree on the struct layout\. The member is instead gated on`LLVM\_ENABLE\_ABI\_BREAKING\_CHECKS`, which is fixed per library build and link\-time\-enforced \(via the`EnableABIBreakingChecks`symbol\); the red\-zone arithmetic is then gated on both macros\. Combined, the fast path becomes: ``` 1234567891011121314151617 ``` ``` void *Allocate(size_t Size, Align Alignment) { size_t SizeToAllocate = Size;#if LLVM_ADDRESS_SANITIZER_BUILD && LLVM_ENABLE_ABI_BREAKING_CHECKS SizeToAllocate += RedZoneSize;#endif SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign); uintptr_t AlignedPtr = uintptr_t(CurPtr); if (Alignment.value() > MinAlign) AlignedPtr = alignAddr(CurPtr, Alignment); uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate; if (LLVM_LIKELY(AllocEndPtr < EndSentinel)) { CurPtr = reinterpret_cast<char *>(AllocEndPtr); ... return reinterpret_cast<char *>(AlignedPtr); } return AllocateSlow(Size, SizeToAllocate, Alignment);} ``` ## Generated assembly Allocating a typical arena object — a 24\-byte, 8\-aligned node via`Allocate<T\>\(\)`— compiles to a six\-instruction fast path \(`clang \-O2`, release\): ``` 123456 ``` ``` mov rax, [rdi] # CurPtr (also the return value)lea rcx, [rax + 0x18] # new = CurPtr + 24cmp rcx, [rdi + 0x8] # vs EndSentineljae .slowmov [rdi], rcx # CurPtr = newret ``` That matches the canonical bump fast path\. A*downward*\-bumping allocator would not need the`rax`/`rcx`distinction — one fewer live value, but the instruction count stays the same\. LLVM bumps upward by design:`identifyObject`, allocation order, and`SpecificBumpPtrAllocator::DestroyAll`'s forward`sizeof\(T\)`stride all assume it\. The remaining gap is space, not instructions\. ## Aggregate compile\-time impact These changes shrink`Allocate`below the inliner's cost threshold, so its callers \(e\.g\.`new \(Context\) T`\) inline at sites that previously called out of line\. Executed instructions fall — but as a*redistribution*: object files where the chain now inlines grow, while the rest shrink slightly from the dropped store\. The performance win is larger at stage2 \(built by stage1 Clang\) than at stage1 \(built by system GCC\)\. Reverting all three on top of`main`isolates their combined effect \([compare](https://llvm-compile-time-tracker.com/compare.php?from=dbd070fbd793c8a9129044abd669466e87d2ea8e&to=3a7d64a882421052101899d7d9c23685db5fd355&stat=instructions:u)\): Significant \(≥3σ vs\. measured noise\): 🟢 improvement\. Unmarked = within noise\. Configurationinstructions:umax\-rssstage1\-O3−0\.04%\+0\.04%stage1\-ReleaseThinLTO−0\.04%−0\.01%stage1\-ReleaseLTO\-g−0\.04%\+0\.06%stage1\-O0\-g🟢 −0\.09%\+0\.25%stage1\-aarch64\-O3−0\.04%\+0\.04%stage1\-aarch64\-O0\-g🟢 −0\.12%−0\.01%stage2\-O3🟢 −0\.14%−0\.15%stage2\-O0\-g🟢 −0\.36%−0\.06%## Takeaways - A bump allocator's fast path is a few instructions of real work wrapped in a realign and accounting; each can be hoisted out of the common case\. - Encoding "empty" as a`0`sentinel folds a null check into the bound compare\. - The measurable instruction\-count win is the inlining a cheaper`Allocate`unlocks, not the removed micro\-op — and it appears as a size*redistribution*, not a uniform shrink\. - A layout\-affecting member may key on`LLVM\_ENABLE\_ABI\_BREAKING\_CHECKS`\(link\-enforced\) but never on the per\-TU`LLVM\_ADDRESS\_SANITIZER\_BUILD`\.

Similar Articles

A faster bump allocator for rust

Lobsters Hottest

Stumpalo is a new high-performance bump allocator for Rust that benchmarks significantly faster than existing alternatives like blink and bumpalo across a wide range of allocation operations. It also features scoped stack support and is available as a Rust crate.

Performance improvements in libffi

Lobsters Hottest

This article details a performance improvement in libffi, where caching argument placement as a flat list of moves (a 'plan') eliminates redundant reclassification on every function call, offering significant speedups without resorting to JIT compilation.

Premature Optimization is Fun Sometimes

Lobsters Hottest

A blog post exploring the optimization of a ring buffer data structure for storing ping timestamps, discussing tagged unions, bitfields, and struct padding to reduce memory footprint.

When compilers surprise you

Lobsters Hottest

Matt Godbolt explores compiler optimizations that convert an O(n) summation loop into an O(1) closed-form solution, highlighting how Clang and GCC employ sophisticated techniques like loop unrolling and mathematical simplification to dramatically improve code performance.

mimalloc: A new, high-performance, scalable memory allocator for the modern era

Lobsters Hottest

mimalloc is an open-source, high-performance, scalable memory allocator that serves as a drop-in replacement for malloc and free. Designed for modern highly concurrent applications and large memory scales, it is used in major services like Bing and integrated into projects such as NoGIL CPython and Unreal Engine.