A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on the AMD Ryzen AI MAX+ 395 iGPU with 128 GiB of unified memory, demonstrating significant speedups for Qwen3.6-27B.
The author details how removing a copy-on-write (`Cow`) data structure improved the performance of their JSON formatter, JJPWRGEM, by 42%, making it significantly faster than Prettier and Oxfmt.
This technical guide provides a step-by-step process for compiling Emacs from source on various Linux distributions to optimize performance through CPU-specific instruction sets and modern display protocols like Wayland. It also covers configuring dependencies and fine-tuning the native Lisp compiler for faster execution.
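For readers who want the gist without the full guide, a typical configure invocation on a Wayland desktop looks roughly like the following; the exact flags and `-march` choice are illustrative, so check `./configure --help` on your checkout.

```shell
# From an Emacs source checkout; build-dependency package names vary by distro.
./autogen.sh

# --with-native-compilation=aot compiles the bundled Lisp ahead of time;
# --with-pgtk builds the pure-GTK frontend for native Wayland;
# -march=native lets the compiler use this CPU's instruction-set extensions.
./configure --with-native-compilation=aot --with-pgtk \
    CFLAGS="-O2 -march=native"

make -j"$(nproc)"
sudo make install
```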
The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.
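Both flags are real llama.cpp options; a hedged example invocation (model filename, layer counts, and batch sizes are illustrative and need tuning per GPU) might look like:

```shell
# --n-cpu-moe N keeps the MoE expert weights of the first N layers on the CPU,
# freeing VRAM; -ub (micro-batch size) controls how many prompt tokens are
# pushed through the GPU per step, so raising it speeds up prompt processing.
./llama-server -m gpt-oss-120b-mxfp4.gguf \
    -ngl 99 \
    --n-cpu-moe 24 \
    -b 2048 -ub 2048
```

The trade-off: experts moved to the CPU slow token generation slightly, but the VRAM they free lets larger micro-batches fit, which is where the prompt-processing speedup comes from.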
The author details the third iteration of the bx library's cross-platform SIMD abstraction, advocating for a typeless approach and SSA-style coding to simplify low-level performance optimization across different CPU architectures.
The author details the process of optimizing custom matrix multiplication kernels in Swift to train a Large Language Model on Apple Silicon, aiming to outperform C implementations by leveraging CPU, SIMD, AMX, and GPU capabilities.
The article argues that software teams often over-optimize for micro-performance benchmarks at the expense of developer experience and engineering throughput, which are the true bottlenecks for long-term delivery speed and maintainability.
A technical deep-dive into achieving peak TOPS performance on the AMD Ryzen AI 7 350 NPU, comparing it to Xilinx AIE-ML v2 AI engines and explaining the hardware architecture for matrix multiplication workloads.
FractalBits introduces a specialized single-node KV storage engine that eliminates fsync calls to achieve significantly higher write throughput on NVMe SSDs by managing durability directly at the hardware level.
The article argues that AI inference poses unique challenges to cloud data infrastructure, likening its demand to high-concurrency OLTP systems rather than traditional human-speed applications. It emphasizes the need to optimize storage and data access layers to handle the 'AI data tsunami' driven by autonomous agents.
A blog post surveys fast hyperbolic-tangent approximations (Taylor series, Padé approximants, splines, and bit-level tricks) for neural-network and real-time audio use.
A developer shares lessons learned while optimizing Elixir applications, particularly focusing on performance improvements to a Postgres connection pooler (Ultravisor). The article covers profiling techniques using flame graphs, call tracing, and tools like eFlambè and tprof.
This paper proposes WORC, a weak-link optimization framework for multi-agent LLM systems that identifies and reinforces underperforming agents through meta-learning-based weight prediction and uncertainty-driven resource allocation, achieving 82.2% accuracy on reasoning benchmarks while improving system stability.
This article explores the fastest methods for matching characters on ARM processors using SIMD instructions, comparing traditional NEON approaches with newer SVE2 capabilities available on modern ARM chips like AWS Graviton4, Google Axion, and others.
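A minimal sketch of the NEON side of that comparison (function name and structure are mine, not the article's): find the first occurrence of a byte 16 bytes at a time, using the well-known aarch64 shift-right-and-narrow trick to collapse the comparison mask into a scalar, with a portable fallback so it compiles anywhere.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Return the index of the first occurrence of `c` in `buf` (length `n`),
 * or `n` if absent.  On ARM this scans 16 bytes per iteration with NEON;
 * elsewhere it falls back to a scalar loop. Assumes little-endian. */
size_t find_byte(const uint8_t *buf, size_t n, uint8_t c) {
    size_t i = 0;
#if defined(__ARM_NEON)
    uint8x16_t needle = vdupq_n_u8(c);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t chunk = vld1q_u8(buf + i);
        uint8x16_t eq = vceqq_u8(chunk, needle);  /* 0xFF where equal */
        /* shrn by 4 narrows the 128-bit mask to 64 bits: one nibble
         * per input byte, so ctz/4 recovers the byte index. */
        uint64_t mask = vget_lane_u64(
            vreinterpret_u64_u8(
                vshrn_n_u16(vreinterpretq_u16_u8(eq), 4)), 0);
        if (mask)
            return i + (size_t)(__builtin_ctzll(mask) >> 2);
    }
#endif
    for (; i < n; i++)
        if (buf[i] == c) return i;
    return n;
}
```

The `shrn` narrowing step is the standard replacement on aarch64 for x86's `pmovmskb`, which NEON lacks; SVE2, as the article discusses, sidesteps the issue with predicate registers and length-agnostic loops.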