Tag
EGG is an expert-guided agent framework that decomposes GPU kernel generation into algorithmic structure design and hardware-specific tuning, using a stage-aware multi-agent collaboration mechanism. It achieves a 2.13x average speedup over PyTorch on KernelBench and real-world workloads.
Graphsignal is a production-scale inference profiling platform that provides detailed timelines, LLM generation tracing, and system-level metrics to help engineers optimize AI performance across models, GPUs, and other accelerators.
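A detailed walkthrough of W4A4 CUDA kernel optimization for a voice cloning model, using INT4 quantization and fused LoRA to achieve inference 2.6x faster than FP16.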
A tweet reflecting on the article "How modern browsers work" that frames the browser as a modern operating system and offers 5 key insights for front-end and Agent developers, including multi-process architecture, the JS engine compilation pipeline, and performance optimization.
A user benchmarks thread counts for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, discovering an 80% performance uplift from using 16 threads instead of 6 on a hybrid-core CPU, and shares the optimal command configuration.
Josh Tobin teases Recursive_SI's automated researchers, showing early demos of performance optimization capabilities.
This paper accelerates the NeurASP neurosymbolic AI framework by implementing vectorization, batch processing, and caching, achieving speedups of multiple orders of magnitude on larger tasks.
Changing quantization from q4_k_m to q4_k_xl in llama.cpp doubles inference speed on the same GPU without hardware or driver changes, as demonstrated with Gemma 4 12B on an RTX 4060.
A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode more than doubled Mistral 128B throughput, from ~11 to ~24.7 tok/s.
KForge is a cross-platform framework that uses two collaborating LLM-based agents to automatically generate and optimize high-performance compute kernels for diverse AI accelerators, achieving significant speedups on NVIDIA B200 and Intel Arc B580 hardware.
GreptimeDB v1.0 introduces Pending Rows Batcher, a three-stage pipeline that moves CPU-intensive work off the Datanode's critical section, improving Prometheus remote write throughput from 1.20M to 2.17M points/sec and reducing Datanode CPU usage by 20%.
A developer shares learnings from building a 100K-line Rust-based multi-Paxos consensus engine using AI coding agents, achieving dramatic productivity gains and performance improvements.
A Chinese developer discusses a new Codex Skill called Complexity Optimizer that automatically detects performance issues such as O(n²) complexity in codebases, making advanced optimization skills accessible to more developers.
A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB unified memory.
The author details how removing a Copy-on-Write (Cow) data structure improved the performance of their JSON formatter, JJPWRGEM, by 42%, making it significantly faster than Prettier and Oxfmt.
This technical guide provides a step-by-step process for compiling Emacs from source on various Linux distributions to optimize performance through CPU-specific instruction sets and modern display protocols like Wayland. It also covers configuring dependencies and fine-tuning the native Lisp compiler for faster execution.
The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.
The author details the third iteration of the bx library's cross-platform SIMD abstraction, advocating for a typeless approach and SSA-style coding to simplify low-level performance optimization across different CPU architectures.
The author details the process of optimizing custom matrix multiplication kernels in Swift to train a Large Language Model on Apple Silicon, aiming to outperform C implementations by leveraging CPU, SIMD, AMX, and GPU capabilities.
The article argues that software teams often over-optimise for micro-performance benchmarks at the expense of developer experience and engineering throughput, which are the true bottlenecks for long-term delivery speed and maintainability.