triton

#triton

@PyTorch: FBTriton is the Triton repo where Meta develops its experimental GPU optimization solutions (including TLX/torchTLX and…

X AI KOLs Timeline · 12h ago

Meta's FBTriton is a downstream fork of OpenAI's Triton compiler that enables rapid development of GPU optimizations like TLX and autoWS while staying in sync with upstream. The blog post details its continuous upstream ingestion strategy, its hierarchical L1/L2/L3 validation framework, and the practical challenges of balancing innovation with production stability.

#triton

MKEvolve: A Modular Multi-Agent Framework for Kernel Code Generation

arXiv cs.AI · 2026-07-24

Presents MKEvolve, a modular multi-agent framework that iteratively co-evolves modular decomposition and LLM-generated kernels for hardware accelerators, achieving improved correctness and speedup over direct synthesis while reducing token usage.

#triton

SonicSampler: Unified Tile-Aware Kernels for LLM Sampling and Speculative Verification

arXiv cs.AI · 2026-07-24

SonicSampler presents a unified suite of tile-aware Triton kernels that vertically fuse the entire LLM sampling pipeline, supporting dynamic per-request behaviors and speculative verification, achieving up to 16x speedup over state-of-the-art baselines.

#triton

Deepseek V4 Flash ~105 t/s on two Nvidia 4090d 48G (ada) in vLLM

Reddit r/LocalLLaMA · 2026-07-23

Technical post detailing how to run DeepSeek V4 Flash on two Nvidia 4090d GPUs using custom Triton kernels and vLLM, achieving ~105 tokens/second with a 262k context window.

#triton

@PyTorch: The PyTorch-Triton 3.7 release introduces the Triton Plugin Extensions system, a framework for dynamically loading cust…

X AI KOLs Following · 2026-07-15

The PyTorch-Triton 3.7 release introduces the Triton Plugin Extensions system, enabling dynamic loading of custom compiler passes and DSL extensions into upstream Triton without forking, with Meta's TLX now supported out of the box.

#triton

@elliotarledge: For those wondering why I use a Kimi Linear megakernel instead of Qwen 3.6, first look at the parameter counts. One is …

X AI KOLs Timeline · 2026-07-03

Elliot Arledge explains why he prefers a Kimi Linear megakernel over Qwen 3.6, comparing parameter counts, layer synchronization, hidden dimensions, and architecture-specific optimizations. The discussion highlights that the Kimi Linear architecture is better suited to a megakernel implementation, especially for batch-1 decode on RTX PRO 6000 Blackwell.

#triton

@h100envy: CMU PhD who built the kernels NVIDIA now ships in TensorRT-LLM explained fast attention in 68 minutes - better than $12…

X AI KOLs Timeline · 2026-07-02

A CMU PhD who developed the kernels now used by NVIDIA in TensorRT-LLM explains fast attention, covering fused CUDA kernels, FlashInfer, Triton, and paged-KV attention, techniques aimed at delivering more tokens per second on the same GPU.

#triton

@shreyansh_26: https://x.com/shreyansh_26/status/2069125463860302212

X AI KOLs Timeline · 2026-06-22

This post explains the Decompose-K technique for accelerating skinny, large-K matrix multiplications: split the K dimension into chunks, run the chunks as a batched matmul, and sum the partial products. It provides a PyTorch implementation and benchmarks showing significant speedups over standard torch.compile on poorly shaped matmuls.
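The three steps named in this summary (split K, batched matmul, sum partials) can be illustrated with a minimal sketch. This is not the post's PyTorch implementation: it uses NumPy to show the arithmetic only, and the function name `decompose_k_matmul` and the parameter `num_splits` are hypothetical names chosen for illustration. It also assumes K divides evenly by the number of splits.

```python
import numpy as np


def decompose_k_matmul(a, b, num_splits):
    """Compute a @ b by splitting the shared K dimension into chunks.

    Illustrative sketch only; names and the even-split requirement
    are assumptions, not taken from the original post.
    """
    m, k = a.shape
    k_b, n = b.shape
    if k != k_b:
        raise ValueError("inner dimensions must match")
    if k % num_splits != 0:
        raise ValueError("K must be divisible by num_splits in this sketch")
    chunk = k // num_splits

    # (M, K) -> (S, M, K/S): one slice of A's columns per split.
    a_chunks = a.reshape(m, num_splits, chunk).transpose(1, 0, 2)
    # (K, N) -> (S, K/S, N): the matching slice of B's rows per split.
    b_chunks = b.reshape(num_splits, chunk, n)

    # Batched matmul: S independent (M, K/S) @ (K/S, N) products.
    partials = np.matmul(a_chunks, b_chunks)  # shape (S, M, N)

    # Summing the partial products over the split axis recovers a @ b.
    return partials.sum(axis=0)


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 1024))   # skinny M, large K
b = rng.standard_normal((1024, 16))
out = decompose_k_matmul(a, b, num_splits=8)
matches_reference = np.allclose(out, a @ b)
```

On a GPU the appeal of this layout is that a small M by N output with a very large K exposes little parallelism, whereas the batched form gives each split its own independent piece of work. The NumPy version only demonstrates that the result equals the ordinary product; it says nothing about speed.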

#triton

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

Hugging Face Blog · 2026-06-11

This blog post continues the Profiling in PyTorch series, exploring nn.Linear, MLP blocks, and fusion techniques that use Triton kernels to improve performance.

#triton

@TheAhmadOsman: How to go about learning all of this? 1st: Start with the serving engine view - vLLM: PagedAttention, continuous batchi…

X AI KOLs Following · 2026-06-08

A detailed guide on learning AI inference engine internals, covering serving engines like vLLM and SGLang, low-level GPU kernel programming with Triton and CUTLASS, and a sequence of mini-projects to build hands-on expertise.

#triton

@PyTorch: More details about the tutorial https://pldi26.sigplan.org/details/pldi-2026-tutorials/1/Writing-Performance-Portable-K…

X AI KOLs Following · 2026-06-04

Helion is a Python DSL that compiles to optimized Triton code for performance-portable GPU kernels. This tutorial at PLDI 2026 covers Helion's architecture, autotuning, and CuteDSL backend.

#triton

@PyTorch: PyTorch member Meta just open-sourced a GPU kernel that makes attention 2.3x faster on NVIDIA Blackwell. TLX Block Atte…

X AI KOLs Following · 2026-05-26

Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, with up to 3.5x speedup when fused with rotary embeddings.

#triton

Schanuel's Conjecture and the Semantics of Triton's FPSan

Hacker News Top · 2026-05-16

FPSan is a Triton compiler pass that enables verification of algebraic equivalence of floating-point programs by replacing floating-point operations with integer operations, relying on Schanuel's conjecture for correctness.

#triton

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Hugging Face Daily Papers · 2026-05-06

KernelBench-X is a new benchmark for evaluating LLM-generated GPU kernels, revealing that task structure impacts correctness more than method design and that correctness does not guarantee hardware efficiency.

#triton

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

arXiv cs.CL · 2026-04-21

Researchers from Carnegie Mellon, University of Washington, and Arm propose AdaExplore, an LLM agent framework for GPU kernel code generation that achieves 3.12× and 1.72× speedups on KernelBench Level-2 and Level-3 benchmarks through failure-driven adaptation and diversity-preserving search, without additional fine-tuning.
