performance-optimization

#performance-optimization

EGG: An Expert-Guided Agent Framework for Kernel Generation

arXiv cs.AI ↗ · 2h ago Cached

EGG is an expert-guided agent framework that decomposes GPU kernel generation into algorithmic structure design and hardware-specific tuning, using a stage-aware multi-agent collaboration mechanism. It achieves a 2.13x average speedup over PyTorch on KernelBench and real-world workloads.


Graphsignal (GitHub Repo)

TLDR AI ↗ · 2d ago Cached

Graphsignal is a production-scale inference profiling platform that provides detailed timelines, LLM generation tracing, and system-level metrics to help engineers optimize AI performance across models, GPUs, and other accelerators.


@charles_irl: https://x.com/charles_irl/status/2069113412869914944

X AI KOLs Timeline ↗ · 3d ago Cached

Details W4A4 CUDA kernel optimizations for a voice-cloning model: INT4 quantization combined with fused LoRA yields inference 2.6x faster than FP16.


@Cander_zhu: This is another article worth reading carefully: "How modern browsers work". After reading it, I had two strong feelings: 1. Browsers are actually the most undervalued "operating system" of modern times. 2. If front-end/Agent developers still treat the browser as a black box, they will only fall behind. From a product and...

X AI KOLs Timeline ↗ · 4d ago Cached

A tweet summarizing reflections on the article "How modern browsers work", emphasizing the value of the browser as a modern operating system, and providing 5 key insights for front-end and Agent developers, including multi-process architecture, JS engine compilation pipeline, performance optimization, etc.


PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

Reddit r/LocalLLaMA ↗ · 2026-06-12

A user benchmarks thread count for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, discovering an 80% performance uplift from using 16 threads instead of 6 on a hybrid-core CPU, and shares the optimal command configuration.


@josh_tobin_: Lots of people have been asking me what we're up to at @Recursive_SI. We still can't say much quite yet, but we thought…

X AI KOLs Following ↗ · 2026-06-11 Cached

Josh Tobin teases Recursive_SI's automated researchers, showing early demos of performance optimization capabilities.


Accelerating NeurASP with vectorization and caching

arXiv cs.AI ↗ · 2026-06-10 Cached

This paper accelerates the NeurASP neurosymbolic AI framework by implementing vectorization, batch processing, and caching, achieving speedups of multiple orders of magnitude on larger tasks.


@leopardracer: SAME GPU SAME MODEL SAME CONTEXT AND 2X THE SPEED rtx 4060, gemma 4 12b, 48k context just switched the quantization fro…

X AI KOLs Timeline ↗ · 2026-06-08 Cached

Changing quantization from q4_k_m to q4_k_xl in llama.cpp doubles inference speed on the same GPU without hardware or driver changes, as demonstrated with Gemma 4 12B on an RTX 4060.


I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

Reddit r/LocalLLaMA ↗ · 2026-06-04

A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode doubled Mistral 128B throughput from ~11 to ~24.7 tok/s.


KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

arXiv cs.LG ↗ · 2026-06-03 Cached

KForge is a cross-platform framework that uses two collaborating LLM-based agents to automatically generate and optimize high-performance compute kernels for diverse AI accelerators, achieving significant speedups on NVIDIA B200 and Intel Arc B580 hardware.


@Greptime: On Prometheus remote write, the bottleneck wasn't network or memtable — it was the Region Worker holding &mut while dec…

X AI KOLs Following ↗ · 2026-06-02 Cached

GreptimeDB v1.0 introduces Pending Rows Batcher, a three-stage pipeline that moves CPU-intensive work off the Datanode's critical section, improving Prometheus remote write throughput from 1.20M to 2.17M points/sec and reducing Datanode CPU usage by 20%.


Learnings from 100K lines of Rust with AI (2025)

Hacker News Top ↗ · 2026-05-20 Cached

A developer shares learnings from building a 100K-line Rust-based multi-Paxos consensus engine using AI coding agents, achieving dramatic productivity gains and performance improvements.


@AYi_AInotes: In my early years, I remember a type of person silently admired in the codebase — they could spot N+1 through ten layers of call stack, and point out in a flame graph which function was called three extra times. Today, the Codex Skill that Greg Brockman retweeted makes this no longer a privilege of the few…

X AI KOLs Timeline ↗ · 2026-05-16 Cached

A Chinese developer discusses a new Codex Skill called Complexity Optimizer that automatically detects performance issues such as O(n²) complexity in codebases, making advanced optimization skills accessible to more developers.


@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …

X AI KOLs Following ↗ · 2026-05-12

A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB unified memory.


Killing a `Cow` made my JSON formatter 42% faster

Lobsters Hottest ↗ · 2026-05-12 Cached

The author details how removing a Copy-on-Write (Cow) data structure improved the performance of their JSON formatter, JJPWRGEM, by 42%, making it significantly faster than Prettier and Oxfmt.


A Technical Guide to Compiling Emacs for Performance on Linux and Unix systems

Lobsters Hottest ↗ · 2026-05-12 Cached

This technical guide provides a step-by-step process for compiling Emacs from source on various Linux distributions to optimize performance through CPU-specific instruction sets and modern display protocols like Wayland. It also covers configuring dependencies and fine-tuning the native Lisp compiler for faster execution.


Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

Reddit r/LocalLLaMA ↗ · 2026-05-12

The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.


Making cross-platform SIMD code pleasant

Lobsters Hottest ↗ · 2026-05-11 Cached

The author details the third iteration of the bx library's cross-platform SIMD abstraction, advocating for a typeless approach and SSA-style coding to simplify low-level performance optimization across different CPU architectures.


Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

Hacker News Top ↗ · 2026-05-10 Cached

The author details the process of optimizing custom matrix multiplication kernels in Swift to train a Large Language Model on Apple Silicon, aiming to outperform C implementations by leveraging CPU, SIMD, AMX, and GPU capabilities.


Optimize for change not application performance

Hacker News Top ↗ · 2026-05-09 Cached

The article argues that software teams often over-optimise for micro-performance benchmarks at the expense of developer experience and engineering throughput, which are the true bottlenecks for long-term delivery speed and maintainability.
