tensorrt-llm

#tensorrt-llm

@rickawsb: NVIDIA also believes storage is a bigger bottleneck than GPUs — Decoding NVIDIA's latest article. NVIDIA's newly released 'AI Model Co-Design' is a technical article introducing TensorRT-LLM and Blackwell, but also a roadmap for large model design and AI infrastructure in the coming years...

X AI KOLs Following · yesterday

This post interprets NVIDIA's newly released 'AI Model Co-Design' article, arguing that in AI inference, storage (memory bandwidth and weight reads) rather than GPU compute has become the primary bottleneck. It describes how TensorRT-LLM and the Blackwell architecture are designed around the Roofline model, and emphasizes that reducing data movement matters more than adding compute power.
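
The Roofline argument in this summary can be illustrated with a minimal sketch. It is not taken from the NVIDIA article: the function names are hypothetical, and the peak-compute and bandwidth figures are placeholder values, not Blackwell specifications. The sketch assumes a weight-bound decode step in which each weight is read once and used for one multiply-add per sequence in the batch.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bandwidth * intensity)


def decode_intensity(batch_size, bytes_per_weight):
    """FLOPs per byte for a weight-bound decode matmul (simplifying assumption:
    each weight is read once and does 2 FLOPs per sequence in the batch)."""
    return 2.0 * batch_size / bytes_per_weight


def is_memory_bound(peak_flops, mem_bandwidth, intensity):
    """True when intensity is below the ridge point peak_flops / mem_bandwidth."""
    return intensity < peak_flops / mem_bandwidth


# Placeholder hardware numbers (assumptions, not vendor specs).
PEAK = 1e15        # 1000 TFLOP/s
BANDWIDTH = 4e12   # 4 TB/s

# Batch 1 with 16-bit weights: far below the ridge point, so memory-bound.
fp16_b1 = decode_intensity(batch_size=1, bytes_per_weight=2.0)
# Batch 1 with 4-bit weights: fewer bytes moved per FLOP, so a higher bound.
fp4_b1 = decode_intensity(batch_size=1, bytes_per_weight=0.5)
# Large batch with 16-bit weights: enough reuse to reach the compute roof.
fp16_b512 = decode_intensity(batch_size=512, bytes_per_weight=2.0)
```

Under these assumptions, shrinking the bytes read per weight raises the attainable bound in the memory-bound regime without touching peak compute, which is the sense in which reducing data movement can matter more than adding FLOPs.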

@h100envy: CMU PhD who built the kernels NVIDIA now ships in TensorRT-LLM explained fast attention in 68 minutes - better than $12…

X AI KOLs Timeline · 2026-07-02

A CMU PhD who developed the kernels NVIDIA now ships in TensorRT-LLM explains fast attention, covering fused CUDA kernels, FlashInfer, Triton, and paged-KV attention: techniques that yield more tokens per second on the same GPU.

@TheAhmadOsman: How to go about learning all of this? 1st: Start with the serving engine view - vLLM: PagedAttention, continuous batchi…

X AI KOLs Following · 2026-06-08

A detailed guide on learning AI inference engine internals, covering serving engines like vLLM and SGLang, low-level GPU kernel programming with Triton and CUTLASS, and a sequence of mini-projects to build hands-on expertise.

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

arXiv cs.CL · 2026-05-13

This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage in LLM inference. Integrated with TensorRT-LLM, it reports significant throughput improvements on NVIDIA Ada GPUs, running up to 23.6% faster than vanilla TensorRT-LLM in commercial advertising systems.

Blackwell LLM Toolkit - NVFP4 Config +Wheels + Benchmarks for Blackwell GPUs via TensorRT-LLM - 270 tk/s Nemotron 3 Omni

Reddit r/LocalLLaMA · 2026-05-12

A developer toolkit that provides configurations, wheels, and benchmarks for running large language models with NVFP4 precision on NVIDIA Blackwell GPUs using TensorRT-LLM.
