Leaving performance on the table

Lobsters Hottest Tools

Summary

A technical blog post demonstrating how Profile-Guided Optimization (PGO) with LLVM can significantly improve binary performance beyond standard -O3 and LTO, using SQLite as a benchmark.

<p><a href="https://lobste.rs/s/7lrk2t/leaving_performance_on_table">Comments</a></p>
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:54 AM

# Leaving performance on the table Source: [https://fzakaria.com/2026/05/23/leaving-performance-on-the-table](https://fzakaria.com/2026/05/23/leaving-performance-on-the-table) I have been working with LLVM at`$DAYJOB$`, and I have gotten to become familiar with the benefits of optimizing your workloads\. I tend to think of optimizing my binaries as thinking about whether I have attached`\-O3`to my compiler flags; maybe if I’m particularly advanced that day I’ll sprinkle in some`\-flto`\(link time optimziation\) and call it a day\. Turns out though that’s leaving lots of performance on the table\. Compilers work under the assumption that every branch is equally taken, unless you are hints like`\[\[likely\]\]`\([ref](https://en.cppreference.com/cpp/language/attributes/likely)\)\. If we can feed the compilers more information about the likely path that our workloads often take, then they can produce much more performant code\. There are two primary ways to optimize a binary: instrumented or statistical\. When we instrument our binary, we run our workload with an instrumented binary and capture*the exact paths*that are executed\. We will then optimize the binary perfectly tuned to that workload\. If our workloads however are varied, we can collect profiles via`perf`over a length of time and create an optimized binary based on the statistical occurence of call graphs\. Both approaches have their benefits however let’s start with the instrumented variant first, as it’s a little easier to follow and understand\. Let’s look at a very simple benchmark\. We will calculate**fibonocci**using SQL in[sqlite3](https://sqlite.org/)\. This is an ideal workload because it’s purely CPU\-bound and ripe for optimizing\. ``` -- fibonacci.sql (100 Million Iterations) WITH RECURSIVE fibonacci(n, a, b) AS ( -- Seed values SELECT 0, 0, 1 UNION ALL -- Loop 100,000,000 times. -- The modulo keeps the integers safely within 64-bit bounds. SELECT n + 1, b, (a + b) % 1000000007 FROM fibonacci WHERE n < 100000000 ) -- Only output the final computed value to avoid terminal I/O bottlenecks SELECT a FROM fibonacci WHERE n = 100000000; ``` We will compile`sqlite3`from source by downloading it\. ``` > wget https://sqlite.org/2026/sqlite-amalgamation-3530100.zip > unzip sqlite-amalgamation-3530100.zip ``` We can compile a “traditional” optimized binary that merely has`\-O3`and also a version that has LTO enabled since I was also keen to see how much LTO itself adds\. ``` > clang -O3 shell.c sqlite3.c -o sqlite3_base > clang -O3 -flto shell.c sqlite3.c -o sqlite3_lto # let's run it just to see some rough times > time ./sqlite3_base :memory: < ./fibbonoci.sql ╭───────────╮ │ a │ ╞═══════════╡ │ 908460138 │ ╰───────────╯ ________________________________________________________ Executed in 14.22 secs fish external usr time 14.16 secs 0.00 micros 14.16 secs sys time 0.00 secs 802.00 micros 0.00 secs ``` Ok, so it looks like our program takes roughly 14\-15 seconds to run\. Sounds ok? How much better can we do…\. 🤔 Next, we compile our program again but we*instrument the binary*, which effectively injects counters into the program to count invocations of functions\. We get very accurate counts of our calls but the binary itself now runs much slower, which can be a problem if your workload was already very slow\. Luckily for us, we are in a time domain \(~15 seconds\), where that is ok\. After we have our instrumented binary, we run our workload again to generate the profile data and rebuild the binary with that data\. ``` # 1. Build the instrumented version for Clang > clang -O3 -flto -fprofile-generate=. shell.c sqlite3.c -o sqlite3_instr # 2. Run the 100M Fibonacci loop to generate Clang profile data > ./sqlite3_instr :memory: < ../fibonacci.sql # 3. Merge the raw profile > ./llvm-profdata merge -output=sqlite3.profdata *.profraw # 4. Build the PGO-optimized binary, BUT preserve relocations for BOLT (-Wl,-q) ./ clang -O3 -flto -fprofile-use=sqlite3.profdata -Wl,-q shell.c sqlite3.c -o sqlite3_pgo ``` The last step will be to optimize with BOLT, which is a post\-link optimizer, which requires us to keep relocations so I’ve also added`\-Wl,\-q`\. When we run our workload with the final optimized binary, we see massive improvement already\! 🤯 ``` > time ./sqlite3_pgo :memory: < ./fibbonoci.sql ╭───────────╮ │ a │ ╞═══════════╡ │ 908460138 │ ╰───────────╯ ________________________________________________________ Executed in 10.86 secs fish external usr time 10.83 secs 0.00 micros 10.83 secs sys time 0.00 secs 770.00 micros 0.00 secs ``` We’ve cut our workload time down to ~10 seconds which is a nearly a**1\.5x**improvement\. Now let’s optimize the final binary with LLVM’s[BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/README.md)\. BOLT is a post\-link optimizer designed for “large applications”\. What this means, is that it largely works by shuffling code around the binary to keep code\-paths that have high temporal locality near each other \(spatial locality\)\. This can have positive impact on performance due to the instruction cache for instance\. ``` # 1. Instrument the PGO binary with BOLT > llvm-bolt sqlite3_pgo -o sqlite3_bolt_instr --instrument --instrumentation-file=bolt.fdata # 2. Run the workload AGAIN to trace the physical execution path > ./sqlite3_bolt_instr :memory: < ../fibonacci.sql # 3. Apply the final BOLT optimizations on top of the PGO binary > llvm-bolt sqlite3_pgo -o sqlite3_ultimate \ -data=bolt.fdata \ -reorder-blocks=ext-tsp \ -reorder-functions=hfsort+ \ -split-functions \ -dyno-stats > time ./sqlite3_ultimate :memory: < ./fibbonoci.sql ╭───────────╮ │ a │ ╞═══════════╡ │ 908460138 │ ╰───────────╯ ________________________________________________________ Executed in 10.52 secs fish external usr time 10.46 secs 591.00 micros 10.46 secs sys time 0.01 secs 244.00 micros 0.01 secs ``` Looks like it was a little faster but not much\. That makes sense since`sqlite3`itself is a pretty small binary \(~6MB\), but nontheless was good to run through\. Running a more*thorough*benchmark with`hyperfine`we can get a final tally of our results\. Build ConfigurationMean Time ± σMin … MaxRelative Speed \(vs Fastest\)PGO\+BOLT10\.536 s ± 0\.051 s10\.491 s … 10\.631 s1\.00PGO10\.805 s ± 0\.055 s10\.733 s … 10\.889 s1\.03 ± 0\.01LTO14\.252 s ± 0\.026 s14\.225 s … 14\.315 s1\.35 ± 0\.01Basic \(no LTO\)14\.292 s ± 0\.071 s14\.224 s … 14\.435 s1\.36 ± 0\.01Fedora package14\.496 s ± 0\.074 s14\.402 s … 14\.662 s1\.38 ± 0\.01Looks like the`sqlite3`I got from the Fedora ecosystem was the*slowest*\. When all the optimizations were applied I was able to get a maximum of**1\.38x**faster than what was available\. These optimizations would be even more dramatic for code\-bases that are a sprawl and can heavily vary\. Don’t worry also about getting the profile perfectly tuned to your workloads\. I have a coworker who often cites that even poor profiles are still much better than no profile at all\.

Similar Articles

Host-Tuned GCC for Faster Compilation

Lobsters Hottest

This blog post explains how to build a host-tuned GCC compiler using profile-guided optimization, LTO, and -O3 to achieve faster compilation times, with detailed instructions and benchmarks.

Notes from Optimizing CPU-Bound Go Hot Paths

Hacker News Top

The article discusses performance optimization techniques for CPU-bound Go code, highlighting the limitations of generics and interface abstractions due to lack of inlining, and advocates for code duplication in hot paths. It provides examples from a Brotli port and deep-dive benchmarking.

Journey in optimising Elixir application

Lobsters Hottest

A developer shares lessons learned while optimizing Elixir applications, particularly focusing on performance improvements to a Postgres connection pooler (Ultravisor). The article covers profiling techniques using flame graphs, call tracing, and tools like eFlambè and tprof.

When compilers surprise you

Lobsters Hottest

Matt Godbolt explores compiler optimizations that convert an O(n) summation loop into an O(1) closed-form solution, highlighting how Clang and GCC employ sophisticated techniques like loop unrolling and mathematical simplification to dramatically improve code performance.

Premature Optimization is Fun Sometimes

Lobsters Hottest

A blog post exploring the optimization of a ring buffer data structure for storing ping timestamps, discussing tagged unions, bitfields, and struct padding to reduce memory footprint.