Tag
QuestDB introduces a dedicated WINDOW JOIN operator that is parallelized and vectorized, achieving up to 25x speedup over alternative databases for time-series aggregations around event timestamps.
This blog post describes a tiny compiler that demonstrates how to lower data-parallel kernels by converting for loops into vectorized loops with lanes and masks, implemented in ~180 lines of Python.
This blog post analyzes the PivCo-Huffman paper, which introduces 'merge' operations for parallel Huffman decoding, enabling efficient vectorized and GPU-friendly decoding without interleaving overhead.
This paper accelerates the NeurASP neurosymbolic AI framework by implementing vectorization, batch processing, and caching, achieving speedups of multiple orders of magnitude on larger tasks.
This blog post implements and analyzes a SIMD-accelerated version of std::copy_if using AVX-512 instructions on AMD Zen 4, with performance comparisons against compiler auto-vectorization.
This article explores the fastest methods for matching characters on ARM processors using SIMD instructions, comparing traditional NEON approaches with the newer SVE2 capabilities available on modern ARM chips such as AWS Graviton4 and Google Axion.
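
Several entries above rely on the same idea: a scalar loop is rewritten to process a fixed number of lanes per iteration, with a mask deciding which lanes are active. The following is a minimal sketch of that idea in plain Python, not code from any of the linked posts. The lane width `LANES`, the function names, and the example kernel (`out[i] = a[i] + b[i] if a[i] > 0 else b[i]`) are all assumptions chosen for illustration.

```python
LANES = 4  # hypothetical vector width, chosen for illustration


def scalar_kernel(a, b):
    """Reference scalar loop: out[i] = a[i] + b[i] if a[i] > 0 else b[i]."""
    out = []
    for x, y in zip(a, b):
        out.append(x + y if x > 0 else y)
    return out


def vectorized_kernel(a, b, lanes=LANES):
    """Same kernel, restructured into lane-wide steps with masks."""
    n = len(a)
    out = [0] * n
    for base in range(0, n, lanes):
        # Bounds mask: a lane is active only if its index is inside the array.
        active = [base + lane < n for lane in range(lanes)]

        # Masked loads: inactive lanes read a dummy 0 instead of going out of bounds.
        va = [a[base + lane] if active[lane] else 0 for lane in range(lanes)]
        vb = [b[base + lane] if active[lane] else 0 for lane in range(lanes)]

        # The branch becomes a per-lane condition mask.
        cond = [active[lane] and va[lane] > 0 for lane in range(lanes)]

        # Both sides of the branch are computed for every lane...
        vsum = [va[lane] + vb[lane] for lane in range(lanes)]
        # ...and a select (blend) picks the result per lane.
        vres = [vsum[lane] if cond[lane] else vb[lane] for lane in range(lanes)]

        # Masked store: only active lanes write back.
        for lane in range(lanes):
            if active[lane]:
                out[base + lane] = vres[lane]
    return out


if __name__ == "__main__":
    a = [1, -2, 3, 0, 5, -6]
    b = [10, 20, 30, 40, 50, 60]
    print(vectorized_kernel(a, b))
```

The input length (6) is deliberately not a multiple of the lane width, so the final step runs with two lanes masked off. On real hardware the per-lane list comprehensions would each correspond to a single vector instruction.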