批量 memmove 会加速 std::remove_if 吗？（不会。）

Lobsters Hottest 2026/05/24 05:17 工具

c++ standard-library algorithm-optimization memmove remove-if performance benchmark

摘要

本文探讨了在 std::remove_if 中使用批量 memmove 是否比传统的逐元素移动能提升性能，结论是并不会，因为记账开销以及 memmove 的重叠检查带来了额外负担。

查看原文

查看缓存全文

缓存时间: 2026/05/24 06:54

# 批量 memmove 能加速 `std::remove_if` 吗？（不能。）来源：https://quuxplusone.github.io/blog/2026/05/23/chunked-remove/ 今天早上我在读无数个 std-proposals 讨论串中的某一个，提议某种 `unstable_remove`（https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0041r0.html）。我突然想到，基于 swap-and-pop 的 `unstable_remove` 有一个奇怪之处：它会通过**翻转**保留元素的方式，来替换大块的连续删除。例如（Godbolt（https://godbolt.org/z/f9doebhch））： ```cpp template<class BidirIt, class Pred> BidirIt unstable_remove_if(BidirIt first, BidirIt last, Pred pred) { while (true) { first = std::find_if(first, last, pred); if (first == last) return first; while (true) { --last; if (first == last) return first; if (!pred(*last)) break; } *first++ = std::move(*last); } } int main() { auto in234 = [](int x) { return 2 <= x && x <= 4; }; std::vector<int> v = {1,2,3,4,5,6,7,8,9}; v.erase(unstable_remove_if(v.begin(), v.end(), in234), v.end()); // v 现在是 {1,9,8,7,5,6} } ``` 如果能产生 `{1,7,8,9,5,6}` 会更美观一些：对于可平凡复制的元素，这只需要一次 memmove。某种意义上，“翻转元素”是额外的工作，我们或许可以避免它来节省时间。但要避免**那个**工作，我们可能不得不在簿记方面做**更多**工作。我还没有尝试对 `unstable_remove` 的任何此类改动进行基准测试。但同样的美学考虑也适用于普通的、稳定的 `std::remove_if`。传统实现是一个基于 `operator=` 的循环，像这样： ```cpp template<class FwdIt, class Pred> FwdIt smooth_remove_if(FwdIt first, FwdIt last, Pred pred) { FwdIt dfirst = std::find_if(first, last, pred); if (dfirst != last) { for (first = std::next(dfirst); first != last; ++first) { if (!pred(*first)) { *dfirst++ = std::move(*first); } } } return dfirst; } ``` 但如果我们“分块”写入 `*dfirst`，像这样呢？ ```cpp template<class FwdIt, class Pred> FwdIt chunky_remove_if(FwdIt first, FwdIt last, Pred pred) { FwdIt dfirst = std::find_if(first, last, pred); if (dfirst != last) { for (first = std::next(dfirst); first != last; ++first) { if (!pred(*first)) { FwdIt sfirst = first; first = std::find_if(std::next(first), last, pred); dfirst = std::move(sfirst, first, dfirst); if (first == last) break; } } } return dfirst; } ``` 将工作委托给 `std::move`（https://en.cppreference.com/cpp/algorithm/move）算法，使得库可以在元素类型恰好是可平凡复制的情况下对此使用 memmove。嵌套循环使生成的代码变得复杂（Godbolt（https://godbolt.org/z/ja48jh4nv）），但这样做值得吗？我们真的节省了 CPU 周期吗？嗯，如果“幸存者”序列的平均长度很长，那么我们可以预期一次大的 memmove 的开销会低于循环中使用 `operator=` 的开销。但在另一个极端，如果序列的平均长度只有 1 个元素，那么 memmove 不会节省任何东西；我们为簿记付出了所有成本，却没有得到任何好处。 > 所谓“簿记”不仅指更复杂算法带来的额外寄存器压力和指令缓存压力，还包括调用 memmove 本身的开销，而且 memmove 必然会先检查是否需要处理前向重叠。（这个分支是完全可预测的——不需要处理——但仍然带来了“平滑”版本中没有的开销。）我编写了一个小基准测试（备份（https://quuxplusone.github.io/blog/code/2026-05-23-chunked-remove-benchmark.cpp）），对包含一百万个整数的数组调用 `std::remove_if`，使用一个位掩码谓词，分别移除一半的元素；八分之一；一百二十八分之一；或者一千零二十四分之一。我特意设计了该基准测试来发挥“分块”算法的优势：可平凡复制的元素，大块连续移动。这里的“平滑”列只是为了证明我的 `smooth_remove_if` 实现本质上与 libc++ 的默认实现相同：我的 `chunky_remove_if` 不能将其失败归咎于“库厂商的魔法”。但它确实被击败了，而且是决定性的： | 移除的元素比例 | `std` | smooth | chunky | 惩罚（chunky vs smooth） | smooth 赋值次数 | chunky memmove 次数 | |----------|-------|--------|--------|------------------------|-----------------|---------------------| | 1 in 2 | 3122 us | 3157 us | 3797 us | +20% | 499549 | 24997 | | 1 in 8 | 1082 us | 1105 us | 1471 us | +33% | 874707 | 10972 | | 1 in 128 | 416 us | 420 us | 468 us | +11% | 992148 | 7750 | | 1 in 1024| 327 us | 326 us | 396 us | +21% | 998482 | 958 | 正如预期，当 memmove 的预期长度很短时，`chunky_remove_if` 会输。但即使 memmove 的预期长度很长，它也会输！而且，违反直觉的是，当我们移除**更少**的元素（因此在“平滑”情况下执行**更多**的单独赋值，在“分块”情况下执行**更少**的单独 memmove）时，`smooth_remove_if` 变得越来越快，而 `chunky_remove_if` 的性能惩罚几乎没有变化。我暂时将原因归咎于分支预测器。移除的元素越少，`pred(*first)` 的结果就越可预测。移除一半元素是 `smooth_remove_if` 的最坏情况，仅仅是因为这将其变成了一个紧密循环，其中只有一个完全不可预测的分支。使这个分支更可预测压倒了一切其他考虑因素，包括我们可能从 memmove 中获得的任何加速（毕竟，memmove 本身也是一个紧密循环，其中有一个基本可预测的分支：“我们已经数到了 `n` 吗？”）。结论：将 `remove_if` 移动的元素“分块”成更大的 memmove 并不是一种优化，而是一种劣化。这并不直接说明关于 `unstable_remove` 的任何结论，但确实建议，在对与 `remove` 相关的算法进行基准测试时，我们不应该仅仅在最小化移动赋值次数上钻牛角尖；这种优化可能被缓存和分支预测效应所淹没。

批量 memmove 会加速 std::remove_if 吗？（不会。）

相似文章

使用SIMD加速std::copy_if

也许将KV缓存卸载到RAM并不差

无分支快速排序：性能超越 std::sort 和 pdqsort，提供 C 和 C++ API

C++26 发布了一个无人要求的 SIMD 库

基于复合移动禁忌搜索的快速高效选区重划优化

提交意见反馈