Tag
This paper presents PivCo-Huffman, a new approach to Huffman coding using pivot coding from wavelet trees, enabling high-performance SIMD-friendly encoding and decoding. It consistently outperforms state-of-the-art Huffman codecs and shows how ANS coding can be selectively applied to skewed nodes to approach ANS compression ratios while preserving high decompression speeds.
Blog post analyzing and implementing a SIMD-accelerated version of std::copy_if using AVX-512 instructions on AMD Zen 4, with performance analysis and comparisons to compiler auto-vectorization.
An in-depth technical blog post explaining how to efficiently transpose matrices using SIMD instructions on modern x86_64 CPUs, focusing on AVX2 intrinsics like _mm256_shuffle_epi8.
Bun.Image is a zero-dependency chainable image pipeline for decoding, resizing, rotating, and re-encoding JPEG, PNG, WebP, HEIC, and AVIF, running off-thread and inspired by Sharp.
minc is a minimal programming language that compiles directly to native executables for multiple platforms without external tooling. It features modern syntax, built-in SIMD, and a bundled shader compiler.
The article criticizes the new std::simd library in C++26, arguing it is slower than scalar loops, compiles slowly, and is outperformed by auto-vectorizers and alternative libraries like Google Highway, questioning its value after a decade-long standardization process.
turbovec is an open-source Rust vector index using Google Research's TurboQuant algorithm, achieving 16x compression and faster search than FAISS, with integrations for RAG frameworks like LangChain, LlamaIndex, and Haystack.
The author details the third iteration of the bx library's cross-platform SIMD abstraction, advocating for a typeless approach and SSA-style coding to simplify low-level performance optimization across different CPU architectures.
planb-lpm is a portable, MIT-licensed C++17 library implementing efficient IPv6 longest-prefix-match (LPM) using a linearized B+-tree with AVX-512 SIMD, featuring dynamic FIB support, Python bindings, and comprehensive benchmarking against real BGP data.
This article explores the fastest methods for matching characters on ARM processors using SIMD instructions, comparing traditional NEON approaches with newer SVE2 capabilities available on modern ARM chips like AWS Graviton4, Google Axion, and others.
This lecture introduces GPU architecture as an evolution of the SIMD (vector/array) processor. It covers data parallelism, memory banking and bank conflicts, and the history of SIMD instruction sets such as MMX, with emphasis on how GPUs exploit data parallelism and cope with serial bottlenecks.