Bcachefs 1.38.6 - the performance release

Lobsters Hottest 06/18/26, 07:41 AM Tools

bcachefs filesystem performance-release erasure-coding reconcile rust linux

Summary

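Bcachefs 1.38.6, the performance release, brings roughly 200 performance patches across the core btree code, the journal, and filesystem-level code. The announcement also confirms that the experimental label has been removed, recaps reconcile and erasure coding from the past six months, and notes that the userspace code has been converted to Rust.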

[Comments](https://lobste.rs/s/gcaiew/bcachefs_1_38_6_performance_release)

Original Article

Cached at: 06/18/26, 08:00 AM

# 1.38.6 - the performance release | Kent Overstreet

Source: [https://www.patreon.com/bcachefs/posts/1-38-6-release-161366372](https://www.patreon.com/bcachefs/posts/1-38-6-release-161366372)

[https://evilpiepirate.org/git/bcachefs-tools.git/tree/Changelog.mdwn](https://evilpiepirate.org/git/bcachefs-tools.git/tree/Changelog.mdwn)

I suppose it's been a while since I wrote a proper announcement. Ah, the busy life of a filesystem engineer/maintainer/do anything/whatever you want to call what I do these days.

So, some catch up:

- We're no longer experimental. I took the label off the website - a few months ago, I think, based on my usual "the incoming bug reports are slowing down and looking a lot less serious and easier to get through than they were". Consider this the belated official announcement :)

- Recapping the big feature news of the past six months, for those new or who missed previous announcements: reconcile, and erasure coding.

  Reconcile greatly simplified device and data management, with a state engine for tracking the status of all data vs. what it should be; it reacts to changes in options and device status, moving data around in the background, rereplicating, applying option changes (e.g. enabling erasure coding). It's designed to be fast; pending reconcile work for rotating devices is indexed in physical LBA order, which means we can do tricks like evacuates, where we find all the copies of the data on the device being evacuated on the other devices in the filesystem - parallelizing the evacuate across all your devices, in nice sequential order.

- Erasure coding: high performance with no write hole.

  The experimental label was lifted on erasure coding, and it's in use and seems to be working quite well. Bcachefs erasure coding does Reed-Solomon, the exact same algorithm as conventional RAID5/6, but with no write hole (no update in place on existing stripes) - and unlike ZFS, we do it without fragmenting incoming writes. Stripes are built up asynchronously and incoming writes are initially replicated, with those extra replicas being dropped as soon as the stripe completes. The allocator has optimizations so that the buckets for those extra replicas will be reused (and overwritten) right away provided there hasn't been a journal commit (i.e. nothing was doing fsyncs), which means that for big streaming writes the extra replicas only cost bus bandwidth - they'll be overwritten while still in the device write cache.

  And it's fully integrated with reconcile - enable the erasure coding option and it'll convert your data in the background; if a device fails, it'll handle it.

Documentation for both is in the user manual: [https://bcachefs.org/bcachefs-principles-of-operation.pdf](https://bcachefs.org/bcachefs-principles-of-operation.pdf)

And the past few months, with core feature work slowing down and incoming bug reports slowing down, I've been turning my attention to other things:

- CI improvements: the automated test infrastructure is now fully converted to testing DKMS builds, which has made it much faster (getting the most out of our cluster of 80 core Arm64 machines), and also enables automated testing of the DKMS module on distro kernels. This will be important as we start to push more into the distros in the near future - there've been reports of DKMS build failures on CachyOS that no one caught but the users, and that's a nasty surprise.

- Rust: As of 7.0, all major distributions that I've checked have now enabled CONFIG_RUST in their kernels. This is a historic moment :)

  The bcachefs userspace code has already been converted to Rust, and that work has included safe Rust interfaces for the core btree iterator API and quite a bit of utility code. The next release will pull these bindings into the DKMS module, and we'll start to convert core code. The initial conversion (already staged) will just be unit and performance tests, so we can roll out Rust kernel side as a soft dependency; we'll need to verify that we can deploy DKMS modules with mixed C/Rust before making it a hard dependency.

  Rust is going to mean a codebase that's much less fragile, more stable, and easier to work on, and it'll also make it much easier to bring in younger engineers - we don't have as many people learning C anymore!

  Even better, it opens up full formal verification: with Rust handling memory safety, the rest becomes tractable, in a real production codebase. And we want that: filesystems are full of invariants that today are checked by debug assertions, but actually exercising those debug assertions in testing requires massive amounts of hardware and testing, and many hours of waiting on the CI. A fully formally verified filesystem isn't going to happen overnight, or for many years, but being able to start to apply formal verification to the parts of the code that really need it is going to result in real improvements to reliability, and soon, and make life so much easier. It'll change how we do engineering.

- Performance work: It was nice getting to devote time to this. Performance work is something I prefer to do when I don't have distractions, and can shut everything else out for a while; it takes a while to build up a thorough and complete picture of what's going on, figure out exactly where the issues are and what's worth optimizing: running a variety of microbenchmarks to highlight different bottlenecks, comparing on different hardware and with different filesystems, endless staring at profiles. Bottlenecks have a way of hiding, and half the job is figuring out where they might be and making them visible.

Discovered some fun things along the way:

One fun one - gcc doesn't handle static_branch_unlikely() very well. Static branches are a kernel mechanism for making a branch "free", by using runtime patching to enable or disable them. Unfortunately, it seems gcc isn't smart enough to put the body of the cold branch at the end of the function or somewhere where it won't pollute the icache - so tracing and debug code was costing us a ridiculous amount of performance. This was unfortunate for the debug code, because it would be wonderful to have debug checks compiled in but disabled so that if a user suspects an issue they can be enabled at runtime without a debug build. Alas, debug code is now back behind CONFIG_BCACHEFS_DEBUG. Much swearing when I realized how much that was costing us.

In the end, benchmarking and profiling resulted in 200 patches all throughout core btree code, the journal, and filesystem level code. The core transaction commit hot path is now down to 4 KB of machine code, the btree code has some new tricks for avoiding lock contention, the journal flush path is now completely lockless, and lots more - check the git history for the full list: [https://evilpiepirate.org/git/bcachefs-tools.git/log/](https://evilpiepirate.org/git/bcachefs-tools.git/log/)

Performance on single device filesystems is now looking quite good. On the Epyc 9454 I've been testing on, 48 Zen4 cores, 1.38.6 is pushing 16.5 GB/sec through dbench with 48 clients - vs. 16 GB/sec for XFS. A few performance patches didn't make the release (some needed additional testing/debugging/design work, others are small on disk format changes that will wait for 1.39): with those, bcachefs is pushing **19 GB/sec** through dbench.

Testing 4k random writes with fio, bcachefs is now hitting **700k IOPS** on this hardware, vs. 1 million for XFS; both on all their default settings. In this scenario XFS is just remapping writes to the block device through pre laid out files with giant extents, and bcachefs is going through the full COW write path - data checksumming, btree update - for every write. Note - hitting these numbers requires btree sharding to be kicking in; for anyone trying to replicate, this requires multiple fio jobs where each fio process is creating the data file, not the master process.

But - I also didn't do any optimization specifically for this benchmark whatsoever, and from looking at the profile there was still room for improvement :) Always more to do.

Next few months I'm hoping to do some performance work specifically for multi device filesystems; some users are still having performance issues on giant arrays and I've got a list of things to fix there.

As always - join the IRC channel, get involved; this thing is a community and it's still growing. And at some point I really do need to start finding young engineers to teach - this is a fun project with a user community that's great to work with; if you like filesystems and you think you might have the skills and the dedication, come join the party.

Cheers, Kent
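To make the evacuate trick from the reconcile recap above a bit more concrete, here is a minimal Python sketch. It is not bcachefs code: the `Replica` type, the `plan_evacuate` function, and the device names are all hypothetical, and the "least queued work" tie-break is an assumption chosen only to show the load being spread. The idea it illustrates is the one described in the post: for data that lives on the device being evacuated, find a copy on another device, then read each source device in ascending LBA order.

```python
from typing import NamedTuple


class Replica(NamedTuple):
    """One copy of an extent: which device it is on and where (hypothetical model)."""
    device: str
    lba: int


def plan_evacuate(
    extents: dict[str, list[Replica]], evacuating: str
) -> dict[str, list[tuple[int, str]]]:
    """Return, per source device, the (lba, extent_id) reads sorted by LBA."""
    plan: dict[str, list[tuple[int, str]]] = {}
    for extent_id, replicas in extents.items():
        if not any(r.device == evacuating for r in replicas):
            continue  # this extent has no copy on the evacuating device
        others = [r for r in replicas if r.device != evacuating]
        if others:
            # Assumption: pick the surviving device with the least work queued.
            src = min(others, key=lambda r: (len(plan.get(r.device, [])), r.device))
        else:
            # The only copy is on the evacuating device, so read it from there.
            src = next(r for r in replicas if r.device == evacuating)
        plan.setdefault(src.device, []).append((src.lba, extent_id))
    # Sorting by LBA gives each source device a sequential read pattern.
    return {dev: sorted(work) for dev, work in plan.items()}


EXTENTS = {
    "a": [Replica("sda", 900), Replica("sdb", 50)],
    "b": [Replica("sda", 100), Replica("sdc", 700)],
    "c": [Replica("sda", 500), Replica("sdb", 10), Replica("sdc", 20)],
    "d": [Replica("sdb", 5), Replica("sdc", 6)],
}
PLAN = plan_evacuate(EXTENTS, "sda")
```

In this toy data set the reads for the evacuation of `sda` are served by `sdb` and `sdc`, each in LBA order, and extent `d` is left alone because it has no copy on `sda`.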
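The erasure coding write path described above (replicate incoming writes first, build the stripe asynchronously, drop the extra replicas once the stripe completes, never update a finished stripe in place) can be sketched as a toy model. This is an illustration under loud assumptions, not the bcachefs implementation: `StripeBuilder`, `xor_blocks`, and `reconstruct` are hypothetical names, and the sketch uses plain XOR parity, which is only the single-parity (RAID5-style) case; a second parity block as in RAID6 needs real Reed-Solomon arithmetic over a finite field, which is not shown.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    if not blocks or len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


class StripeBuilder:
    """Toy model: writes stay replicated until their stripe has parity."""

    def __init__(self, data_blocks_per_stripe: int) -> None:
        self.n = data_blocks_per_stripe
        self.pending: list[bytes] = []         # data blocks of the open stripe
        self.extra_replicas: list[bytes] = []  # temporary second copies
        self.stripes: list[tuple[list[bytes], bytes]] = []  # (data, parity)

    def write(self, block: bytes) -> None:
        self.pending.append(block)
        self.extra_replicas.append(block)  # redundancy until parity exists
        if len(self.pending) == self.n:
            parity = xor_blocks(self.pending)
            # Completed stripes are only appended, never modified in place.
            self.stripes.append((self.pending, parity))
            self.pending = []
            self.extra_replicas = []  # stripe complete: drop the extra copies


def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block of a stripe from the rest."""
    return xor_blocks(surviving + [parity])
```

A short usage example: after three writes into a three-block stripe, the extra replicas are gone and any one lost block can be rebuilt from the other two plus parity.

```python
sb = StripeBuilder(3)
for block in (b"aaaa", b"bbbb", b"cccc"):
    sb.write(block)
data, parity = sb.stripes[0]
recovered = reconstruct([data[0], data[2]], parity)  # equals data[1]
```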
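For readers who want the benchmark figures quoted above side by side, the arithmetic can be worked through in a few lines of Python. The only inputs are the numbers from the post; the one assumption is that "4k" means 4096-byte writes, which the post does not spell out, so the derived bandwidth figures are approximate under that assumption.

```python
# Throughput figures quoted in the post (dbench, 48 clients, GB/sec).
XFS_DBENCH = 16.0
BCACHEFS_DBENCH = 16.5
BCACHEFS_DBENCH_UNRELEASED = 19.0  # with patches that did not make 1.38.6

# fio 4k random write figures quoted in the post (IOPS).
BCACHEFS_IOPS = 700_000
XFS_IOPS = 1_000_000
BLOCK_BYTES = 4096  # assumption: "4k" means 4096 bytes


def ratio(value: float, baseline: float) -> float:
    """How many times the baseline a value is."""
    return value / baseline


def iops_to_gb_per_sec(iops: int, block_bytes: int = BLOCK_BYTES) -> float:
    """Convert IOPS at a fixed block size to decimal GB/sec."""
    return iops * block_bytes / 1e9


print(ratio(BCACHEFS_DBENCH, XFS_DBENCH))             # 1.03125, about 3% ahead
print(ratio(BCACHEFS_DBENCH_UNRELEASED, XFS_DBENCH))  # 1.1875, about 19% ahead
print(ratio(BCACHEFS_IOPS, XFS_IOPS))                 # 0.7
print(iops_to_gb_per_sec(BCACHEFS_IOPS))              # 2.8672
print(iops_to_gb_per_sec(XFS_IOPS))                   # 4.096
```

So on dbench the release is slightly ahead of XFS on this machine, while on 4k random writes it reaches 70% of the XFS rate, with bcachefs doing checksumming and a btree update on every write.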

Similar Articles

Bcachefs 1.38.6 Brings Many Performance Improvements

Bun's Rust rewrite has been merged

Content-defined chunking added to Bazel

QBE - Compiler Backend: Version 1.3

@LucSGeorges: perf packed release: safetensors 0.8.0 is out Main takeaways: - direct copy into metal MTLBuffers + dlpack for 0-copy h…
