A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on the AMD Ryzen AI MAX+ 395 iGPU with 128 GiB of unified memory, demonstrating significant speedups for Qwen3.6-27B.
The author details how removing a copy-on-write (`Cow`) data structure improved the performance of their JSON formatter, JJPWRGEM, by 42%, making it significantly faster than Prettier and Oxfmt.
This technical guide provides a step-by-step process for compiling Emacs from source on various Linux distributions to optimize performance through CPU-specific instruction sets and modern display protocols like Wayland. It also covers configuring dependencies and fine-tuning the native Lisp compiler for faster execution.
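For readers who want the gist without the full guide, a typical configure invocation on a Wayland desktop looks roughly like the following; the exact flags and `-march` choice are illustrative, so check `./configure --help` on your checkout.

```shell
# From an Emacs source checkout; build-dependency package names vary by distro.
./autogen.sh

# --with-native-compilation=aot compiles the bundled Lisp ahead of time;
# --with-pgtk builds the pure-GTK frontend for native Wayland;
# -march=native lets the compiler use this CPU's instruction-set extensions.
./configure --with-native-compilation=aot --with-pgtk \
    CFLAGS="-O2 -march=native"

make -j"$(nproc)"
sudo make install
```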
The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.
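Both flags are real llama.cpp options; a hedged example invocation (model filename, layer counts, and batch sizes are illustrative and need tuning per GPU) might look like:

```shell
# --n-cpu-moe N keeps the MoE expert weights of the first N layers on the CPU,
# freeing VRAM; -ub (micro-batch size) controls how many prompt tokens are
# pushed through the GPU per step, so raising it speeds up prompt processing.
./llama-server -m gpt-oss-120b-mxfp4.gguf \
    -ngl 99 \
    --n-cpu-moe 24 \
    -b 2048 -ub 2048
```

The trade-off: experts moved to the CPU slow token generation slightly, but the VRAM they free lets larger micro-batches fit, which is where the prompt-processing speedup comes from.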
The author details the third iteration of the bx library's cross-platform SIMD abstraction, advocating for a typeless approach and SSA-style coding to simplify low-level performance optimization across different CPU architectures.
The author details the process of optimizing custom matrix multiplication kernels in Swift to train a Large Language Model on Apple Silicon, aiming to outperform C implementations by leveraging CPU, SIMD, AMX, and GPU capabilities.
The article argues that software teams often over-optimize for micro-performance benchmarks at the expense of developer experience and engineering throughput, which are the true bottlenecks for long-term delivery speed and maintainability.
A technical deep-dive into achieving peak TOPS performance on the AMD Ryzen AI 7 350 NPU, comparing it to Xilinx AIE-ML v2 AI engines and explaining the hardware architecture for matrix multiplication workloads.
FractalBits introduces a specialized single-node KV storage engine that eliminates fsync calls to achieve significantly higher write throughput on NVMe SSDs by managing durability directly at the hardware level.
The article argues that AI inference poses unique challenges to cloud data infrastructure, likening its demand to high-concurrency OLTP systems rather than traditional human-speed applications. It emphasizes the need to optimize storage and data access layers to handle the 'AI data tsunami' driven by autonomous agents.
A blog post surveys fast hyperbolic-tangent approximations (Taylor series, Padé approximants, splines, and bit-level tricks) for neural-network and real-time audio use.
A developer shares lessons learned while optimizing Elixir applications, particularly focusing on performance improvements to a Postgres connection pooler (Ultravisor). The article covers profiling techniques using flame graphs, call tracing, and tools like eFlambè and tprof.
This paper proposes WORC, a weak-link optimization framework for multi-agent LLM systems that identifies and reinforces underperforming agents through meta-learning-based weight prediction and uncertainty-driven resource allocation, achieving 82.2% accuracy on reasoning benchmarks while improving system stability.
This article explores the fastest methods for matching characters on ARM processors using SIMD instructions, comparing traditional NEON approaches with newer SVE2 capabilities available on modern ARM chips like AWS Graviton4, Google Axion, and others.
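A minimal sketch of the NEON side of that comparison (function name and structure are mine, not the article's): find the first occurrence of a byte 16 bytes at a time, using the well-known aarch64 shift-right-and-narrow trick to collapse the comparison mask into a scalar, with a portable fallback so it compiles anywhere.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Return the index of the first occurrence of `c` in `buf` (length `n`),
 * or `n` if absent.  On ARM this scans 16 bytes per iteration with NEON;
 * elsewhere it falls back to a scalar loop. Assumes little-endian. */
size_t find_byte(const uint8_t *buf, size_t n, uint8_t c) {
    size_t i = 0;
#if defined(__ARM_NEON)
    uint8x16_t needle = vdupq_n_u8(c);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t chunk = vld1q_u8(buf + i);
        uint8x16_t eq = vceqq_u8(chunk, needle);  /* 0xFF where equal */
        /* shrn by 4 narrows the 128-bit mask to 64 bits: one nibble
         * per input byte, so ctz/4 recovers the byte index. */
        uint64_t mask = vget_lane_u64(
            vreinterpret_u64_u8(
                vshrn_n_u16(vreinterpretq_u16_u8(eq), 4)), 0);
        if (mask)
            return i + (size_t)(__builtin_ctzll(mask) >> 2);
    }
#endif
    for (; i < n; i++)
        if (buf[i] == c) return i;
    return n;
}
```

The `shrn` narrowing step is the standard replacement on aarch64 for x86's `pmovmskb`, which NEON lacks; SVE2, as the article discusses, sidesteps the issue with predicate registers and length-agnostic loops.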