kernel-optimization

Linux 7.2 Improves Anonymous/Unnamed Pipe Performance For Shell Pipelines & More

Lobsters Hottest · yesterday

The Linux 7.2 kernel merges a performance optimization for anonymous (unnamed) pipes that improves throughput by 6-48% and reduces latency by 17-33%. It pre-allocates pages outside the mutex lock to avoid contention.

@bingxu_: I started INT21 two months ago, and I’m proud to announce that we’re coming out of stealth today with our first product…

X AI KOLs Timeline · 2026-06-16

INT21 announced PTX Kernel Factory, a self-improving agent swarm that autonomously generates expert-level PTX GPU kernels, with open-source proof-of-concept implementations and beta access.

@levidiamode: 163/365 of GPU Programming Looking at a few different agentic GPU kernel optimization systems today. The two I'm most i…

X AI KOLs Timeline · 2026-06-15

A tweet discussing two agentic GPU kernel optimization systems: Auto GPU Kernel by @dogacel0 and Kernel Design Agents from @songhan_mit's lab, both winners at the MLSys Sparse Attention FlashInfer competition. The thread highlights different approaches using subagents and Claude skills for GPU programming.

@charles_irl: A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads. Inference is different from t…

X AI KOLs Following · 2026-06-11

Explains that inference kernels differ from training kernels, with the Flash Attention 4 work focusing on changing parallelism across KV and supporting small, irregular loads.

@charles_irl: Last fall, we shared our deep dive on FA4 internals. But we didn't stop at grokking the kernel. Since then, we've been …

X AI KOLs Following · 2026-06-11

A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.

@_akhaliq: GPU Forecasters Language Models as Selective Surrogates for Kernel Runtime Optimization

X AI KOLs Following · 2026-06-02

This paper proposes using language models as selective surrogates to optimize GPU kernel runtime, demonstrating a novel approach to performance forecasting.

Alibaba's Qwen3.7-Max Ran Autonomously for 35 Hours on Unfamiliar Hardware. It Still Kept Getting Better.

Reddit r/ArtificialInteligence · 2026-05-25

Alibaba's Qwen3.7-Max model autonomously optimized a production kernel on unfamiliar T-Head PPU hardware over 35 hours, making 1,158 tool calls and achieving a 10x speedup, demonstrating sustained autonomous agentic behavior without human guidance.

@ickma2311: Efficient AI Lecture 13: LLM Deployment Techniques The lecture helped me understand AWQ, vLLM, and FlashAttention very …

X AI KOLs Timeline · 2026-05-13

A lecture on LLM deployment techniques covering AWQ, vLLM, FlashAttention, quantization, and activation smoothing for efficient serving.

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

Hugging Face Daily Papers · 2026-05-10

Metal-Sci introduces a 10-task benchmark for optimizing scientific computing kernels on Apple Silicon, paired with an evolutionary search framework driven by large language models. The study evaluates models like Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5, demonstrating significant speedups while using out-of-distribution testing to catch silent performance regressions.

@xenovacom: Opus 4.7 just wrote a custom WebGPU kernel that runs Qwen3.5 up to 13x faster using a fused LinearAttention op! Agentic…

X AI KOLs Following · 2026-04-23

Opus 4.7 auto-generated a custom WebGPU kernel that accelerates Qwen3.5 inference up to 13× via fused LinearAttention, now shipping in Transformers.js v4.2.0.

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Hugging Face Daily Papers · 2026-04-15

AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation and an optimization memory. It raises the average percentage of peak throughput from 49% to 61% on AWS Trainium 1 and matches the kernel improvements of Claude Sonnet 4 at 26x lower cost.
