优化 LLVM 的 bump 分配器

Lobsters Hottest 2026/06/29 04:25 工具

low-level memory-allocation compiler llvm optimization performance allocator

摘要

这篇博客文章详细介绍了对 LLVM 的 BumpPtrAllocator 进行的三项近期优化，通过移除冗余对齐、空指针检查以及每次分配的记账开销来减少快速路径开销，从而提升了 Clang、lld 及其他 LLVM 组件的性能。

查看原文

查看缓存全文

缓存时间: 2026/06/29 06:24

# 优化 LLVM 的 bump allocator 来源：https://maskray.me/blog/2026-06-28-optimizing-llvm-bump-allocator `BumpPtrAllocator` 是 LLVM 的缓冲分配器（arena allocator）：每次分配时在一个 slab 内向前移动指针，并在分配器销毁时一次性释放所有内存。它支撑着 Clang 的 `ASTContext`、lld 的 `make` 对象池、TableGen 的记录以及其他许多 arena。以下是最近三次改动之前的快速路径代码： ```cpp __attribute__((returns_nonnull)) void *Allocate(size_t Size, Align Alignment) { BytesAllocated += Size; uintptr_t AlignedPtr = alignAddr(CurPtr, Alignment); size_t SizeToAllocate = Size; #if LLVM_ADDRESS_SANITIZER_BUILD SizeToAllocate += RedZoneSize; #endif uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate; if (LLVM_LIKELY(AllocEndPtr <= uintptr_t(End) && CurPtr != nullptr)) { CurPtr = reinterpret_cast<void*>(AllocEndPtr); ... return reinterpret_cast<void*>(AlignedPtr); } return AllocateSlow(Size, SizeToAllocate, Alignment); } ``` 三次改动对标注的三行进行了精简。 ## 最小对齐跳过重对齐（#205240 (https://github.com/llvm/llvm-project/pull/205240)） `alignAddr(CurPtr, Alignment)` 是浪费的：刚被移动过的指针通常已经足够对齐。#205240 (https://github.com/llvm/llvm-project/pull/205240) 将每个大小向上取整到 `MinAlign`（默认 8），这样快速路径只在对齐要求超过 `MinAlign` 时才进行重对齐。我从《Bump Allocation: Up or Down?》(https://coredumped.dev/2024/03/25/bump-allocation-up-or-down/) 中学到了这个技巧： ```cpp SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign); uintptr_t AlignedPtr = uintptr_t(CurPtr); if (Alignment.value() > MinAlign) AlignedPtr = alignAddr(CurPtr, Alignment); ``` `SpecificBumpPtrAllocator` 使用 `MinAlign = 1`——它的 `DestroyAll` 按 `sizeof(T)` 步进，因此需要紧凑排列，而不是取整。我在第一次尝试中犯了一个错误：`nullptr` 加上非零偏移量触发了 UBSan 诊断。通过在 `uintptr_t` 域中保留数学运算来解决。 ## 哨兵 End 消除了空指针检查（#205485 (https://github.com/llvm/llvm-project/pull/205485)） `__attribute__((returns_nonnull))` 指定返回值非空。在一个 `CurPtr` 和 `End` 均为空的分配器中，`Allocate(0)` 原本返回空。2022 年，https://reviews.llvm.org/D125040 在快速路径条件中增加了 `&& CurPtr != nullptr` 检查，这并不理想。我尝试过： ```cpp if (LLVM_LIKELY(AlignedPtr + SizeToAllocate - 1 < uintptr_t(End))) { ... } ``` 但后来采纳了 aengelke 的建议。将 end 存储为超过实际 end 的哨兵（`EndSentinel = realEnd + 1`，无 slab 时为 `0`），这样将两个条件合并为一个无符号比较： ```cpp if (LLVM_LIKELY(AllocEndPtr < EndSentinel)) { ... } ``` 空的分配器有 `EndSentinel == 0`，因此 `AllocEndPtr < 0` 始终为假，空情况直接落入慢路径，无需单独分支。 ## 去掉每次分配的统计（#205711 (https://github.com/llvm/llvm-project/pull/205711)） `BytesAllocated += Size` 在每次分配时对成员进行读-修改-写，它支持 `getBytesAllocated()` 报告 *请求的* 字节数——与 `getTotalMemory()` 的 slab 容量不同。它只有统计/诊断消费者：lldb 的 ConstString 内存报告、clangd 的调试日志、TableGen 的 `dumpAllocationStats`，以及一个 Clang 回归测试。去掉该成员并将这些消费者迁移（大多数迁移到 `getTotalMemory()`）消除了热路径存储。 **细节：红区和 ABI。** ASan 红区大小也是一个成员。用 `#if LLVM_ADDRESS_SANITIZER_BUILD` 在发行版中去掉它会成为 ABI 隐患：该宏是 *每个翻译单元* 的，因此一个 ASan 检测的 TU 和未启用 ASan 的 `libLLVM` 会在结构体布局上默默不一致。该成员改为用 `LLVM_ENABLE_ABI_BREAKING_CHECKS` 守卫，该宏在每个库构建中是固定的，并通过链接时强制（通过 `EnableABIBreakingChecks` 符号）；然后红区运算由这两个宏共同守卫。结合后的快速路径变为： ```cpp void *Allocate(size_t Size, Align Alignment) { size_t SizeToAllocate = Size; #if LLVM_ADDRESS_SANITIZER_BUILD && LLVM_ENABLE_ABI_BREAKING_CHECKS SizeToAllocate += RedZoneSize; #endif SizeToAllocate = alignToPowerOf2(SizeToAllocate, MinAlign); uintptr_t AlignedPtr = uintptr_t(CurPtr); if (Alignment.value() > MinAlign) AlignedPtr = alignAddr(CurPtr, Alignment); uintptr_t AllocEndPtr = AlignedPtr + SizeToAllocate; if (LLVM_LIKELY(AllocEndPtr < EndSentinel)) { CurPtr = reinterpret_cast<void*>(AllocEndPtr); ... return reinterpret_cast<void*>(AlignedPtr); } return AllocateSlow(Size, SizeToAllocate, Alignment); } ``` ## 生成的汇编分配一个典型的 arena 对象——通过 `Allocate()` 分配一个 24 字节、8 字节对齐的节点——编译成六条指令的快速路径（`clang -O2`，发行版）： ```asm mov rax, [rdi] # CurPtr（也是返回值） lea rcx, [rax + 0x18] # new = CurPtr + 24 cmp rcx, [rdi + 0x8] # 与 EndSentinel 比较 jae .slow mov [rdi], rcx # CurPtr = new ret ``` 这与经典的 bump 快速路径一致。*向下* 移动的分配器不需要 `rax`/`rcx` 的区别——少一个活跃值，但指令数相同。LLVM 设计为向上移动：`identifyObject`、分配顺序以及 `SpecificBptrPtrAllocator::DestroyAll` 的向前 `sizeof(T)` 步进都基于此。剩余的差距在于空间，而非指令。 ## 整体编译时影响这些改动使 `Allocate` 低于内联器的成本阈值，因此它的调用者（例如 `new (Context) T`）在之前调用非内联函数的位置现在可以内联。执行的指令数下降——但作为一种 *再分配*：内联链变长的目标文件变大，而其余文件因去掉了存储而略有缩小。性能提升在 stage2（由 stage1 Clang 构建）比 stage1（由系统 GCC 构建）更明显。在 `main` 之上回退所有三个改动可以隔离它们的组合效果（比较 (https://llvm-compile-time-tracker.com/compare.php?from=dbd070fbd793c8a9129044abd669466e87d2ea8e&to=3a7d64a882421052101899d7d9c23685db5fd355&stat=instructions:u)）：显著（≥3σ 相对于测量噪声）：🟢 改善。未标记 = 在噪声范围内。 | 配置 | instructions:u | max-rss | |------|----------------|---------| | stage1-O3 | −0.04% | +0.04% | | stage1-ReleaseThinLTO | −0.04% | −0.01% | | stage1-ReleaseLTO-g | −0.04% | +0.06% | | stage1-O0-g | 🟢 −0.09% | +0.25% | | stage1-aarch64-O3 | −0.04% | +0.04% | | stage1-aarch64-O0-g | 🟢 −0.12% | −0.01% | | stage2-O3 | 🟢 −0.14% | −0.15% | | stage2-O0-g | 🟢 −0.36% | −0.06% | ## 要点 - 一个 bump 分配器的快速路径本质上是几条指令的实质工作，包裹在重对齐和统计中；每一项都可以从常见情况中提取出来。 - 将“空”编码为 `0` 哨兵，将空指针检查折叠到边界比较中。 - 可测量的指令数收益来自于更廉价的 `Allocate` 解锁的内联，而非去除的微操作——并且它表现为大小的 *再分配*，而非统一缩小。 - 影响布局的成员可以以 `LLVM_ENABLE_ABI_BREAKING_CHECKS`（链接时强制）为键，但绝不能以每个 TU 的 `LLVM_ADDRESS_SANITIZER_BUILD` 为键。

优化 LLVM 的 bump 分配器

相似文章

一个更快的 Rust 内存块分配器

libffi 的性能改进

过早优化有时也挺有趣

当编译器让你惊喜

mimalloc：面向现代时代的新型高性能可扩展内存分配器

提交意见反馈