Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

Hacker News Top Tools

Summary

A new branchless Quicksort implementation (blqsort) using sorting networks outperforms std::sort and pdqsort on Apple M1 and AMD Ryzen systems, available as single-header C and C++ libraries. It achieves speedups through branchless partitioning, median-of-medians pivot selection, and custom sorting networks for small arrays.

No content available
Original Article
View Cached Full Text

Cached at: 06/05/26, 02:10 AM

# Branchless Quicksort Source: [https://tiki.li/blog/blqsort](https://tiki.li/blog/blqsort) ## Fast Branchless Quicksort using Sorting\-Networks with C and C\+\+ Interface Performance results naturally depend on the underlying hardware\. The following benchmarks show the execution times for sorting 50 million`doubles`using different sorting implementations\. The measurements were taken on an*Apple M1*system using*Clang*and on an*AMD Ryzen 3 Linux*system using*GCC*, both compiled with the`\-O3`option\. ImplementationApple M1AMD Ryzenstd::sort1\.33s5\.56spdqsort1\.33s2\.81s**blqsort**\(single threaded\)0\.97s2\.06sFor a fair comparison, the[single\-threaded version](https://github.com/chkas/blqsort/blob/main/cpp/test_double.cpp)of`blqs`was used here\. On an M1, the threaded versions are another factor of 3 to 4 faster\. In terms of runtime, the C\+\+ versions differ only very little from the C version\. ### blqsort Full source code is included on[Github](https://github.com/chkas/blqsort)\.There are four implementations of**blqsort**here, each provided as a single header file\. FileDescription[blqs\.h](https://github.com/chkas/blqsort/blob/main/cpp/blqs.h)C\+\+ Single\-Threaded[blqs\_thr\.h](https://github.com/chkas/blqsort/blob/main/cpp/blqs_thr.h)C\+\+ Multi\-Threaded[blqsort\.h](https://github.com/chkas/blqsort/blob/main/c/blqsort.h)C Single\-Threaded[blqsort\_thr\.h](https://github.com/chkas/blqsort/blob/main/c/blqsort_thr.h)C Multi\-Threaded### Branchless programming On modern CPUs,**avoiding branch misprediction**is a key technique to speed up programs\. This branchless approach: ``` for (int i = 0; i < 1000; i++) { small_numbers[smlen] = numbers[i]; smlen += (numbers[i] < 500); } ``` is much faster than the conventional version with a conditional branch:``` for (int i = 0; i < 1000; i++) { if (numbers[i] < 500) { small_numbers[smlen] = numbers[i]; smlen += 1; } } ``` [This paper](https://arxiv.org/abs/1604.06697)by*Edelkamp*and*Weiß*shows how partitioning performance in Quicksort can be improved by avoiding conditional branches\.The strategy of using an auxiliary buffer for**branchless partitioning**is inspired by[fluxsort](https://github.com/scandum/fluxsort)\. The “auxiliary buffer” here means a 1024‑element stack array, not heap memory\. First, 1024 elements are copied from one side into an auxiliary buffer to make room for subsequent operations\. Then, we alternately copy a 1024\-element block to the left and right in a branchless manner\. The left pointer is only incremented if the element is smaller than the pivot, otherwise, the right pointer is incremented \- branchless, of course\. This involves more more than double the necessary copy operations\. For data types that are cheap to copy, however, this is much less expensive than the branch mispredictions that would otherwise occur\. ### Pivot strategy, bad input and sorting network To avoid the`O\(n²\)`runtime caused by bad input data, the program can group identical elements together and switch to*heapsort*for that specific part if it detects a big imbalance during partitioning\. The program also checks if a partition is already sorted\. For larger parts, it uses a median\-of\-medians strategy to find a good pivot\. In addition, critical partitioning loops are explicitly unrolled\. For 2 to 12 elements, the algorithm uses custom sorting networks\. This approach requires a separate code path for each size but sorts small subsets with very few swaps using a branchless*sort‑2*primitive\.[Source for sorting networks](https://bertdobbelaere.github.io/sorting_networks.html) ## C\+\+ For types with higher copy costs that are not`is\_trivially\_copyable`\(such as strings\), the buffer\-based branchless approach becomes less efficient\. In such cases, a**BlockQuicksort**\(*Edelkamp*and*Weiß*\) variant is used instead\. This processes only the element indices in a branchless manner and then moves the actual data with fewer swaps\. Some ideas are from[pdqsort](https://github.com/orlp/pdqsort)\. ### Usage You only need to include`blqs\.h`, and it can be used just as easily as*std::sort*\. ``` #include "blqs.h" double data[SIZE]; blqs::sort(data, data + SIZE); ``` For the C\+\+ multithreaded variant, which uses C\+\+ threads, include`blqs\_thr\.h`instead of`blqs\.h`\. The function call remains the same\.## C In C, the code specialized for the data type is generated using`\#define`\. ### Usage ``` #define BLQS_CMP(a, b) ((a) < (b)) #define BLQS_TYPE double #include "blqsort.h" double data[SIZE]; blqsort(data, SIZE); ``` For the C multithreaded variant, which uses POSIX threads, include`blqsort\_thr\.h`instead of`blqsort\.h`\. The function call remains the same here as well\.## Sorting Custom Data Structures In practice, we often need to sort custom data structures\. This is where*SIMD*libraries like*Google Highway*\- while very fast for simple numbers \- become difficult to use\. Using*std::sort*or*blqsort*gives you much more flexibility\. ### C\+\+ ``` #include "blqs.h" struct entry { int id; int value; bool operator<(const entry& other) const { return id < other.id; } }; struct entry data[SIZE]; blqs::sort(data, data + SIZE); ``` ### C ``` struct entry { int id; int value; }; #define BLQS_CMP(a, b) (((a).id) < ((b).id)) #define BLQS_TYPE struct entry #include "blqsort.h" struct entry data[SIZE]; blqsort(data, SIZE); ``` Execution times for sorting 50 million of these`structs`\.ImplementationApple M1AMD Ryzenstd::sort3\.46s4\.75spdqsort3\.46s4\.72s**blqsort**0\.96s2\.20s### Links [When ‘if’ slows you down, avoid it\.](https://tiki.li/blog/branchless) [Interactive sorting demo](https://tiki.li/sorting) --- christof\.kaser@gmail\.com

Similar Articles

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

Hugging Face Daily Papers

Metal-Sci introduces a 10-task benchmark for optimizing scientific computing kernels on Apple Silicon, paired with an evolutionary search framework driven by large language models. The study evaluates models like Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5, demonstrating significant speedups while using out-of-distribution testing to catch silent performance regressions.

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Reddit r/LocalLLaMA

A developer benchmarked 21 local LLMs on MacBook Air M5 using HumanEval+ and found Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Notably, Gemma 4 models significantly underperformed expectations (31.1% for 31B), possibly due to Q4_K_M quantization effects.

@no_stp_on_snek: got it here if ya want to try it out:

X AI KOLs Following

A fork of llama.cpp integrating TurboQuant+ for advanced KV-cache and weight quantization, with cross-backend kernel support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan) and used in production by LocalAI, Chronara, and AtomicChat.

Dual GPU llama.cpp speedup

Reddit r/LocalLLaMA

A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.