@charles_irl: Somehow missed this one in the hustle and bustle. Very cool demo!

X AI KOLs Following 06/07/26, 08:48 PM Tools

machine-learning llm rust cuda framework training demo

Summary

A developer trained a 12M-parameter LLM from scratch on a custom ML framework with a Rust backend and custom CUDA kernels, including Flash Attention and AdamW. The transformer architecture and BPE tokenizer were also written from scratch.

Somehow missed this one in the hustle and bustle. Very cool demo! https://t.co/CWsyssqk09

Cached at: 06/08/26, 07:17 AM

Aadi Kulshrestha (@MankyDankyBanky): I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more.

Wrote the full transformer architecture and BPE tokenizer from scratch.

The framework features:

Custom CUDA kernels (Flash Attention, fused

Similar Articles

@h100envy: CMU PhD who built the kernels NVIDIA now ships in TensorRT-LLM explained fast attention in 68 minutes - better than $12…

X AI KOLs Timeline

A CMU PhD who developed the kernels NVIDIA now ships in TensorRT-LLM explains fast attention in a 68-minute talk covering fused CUDA kernels, FlashInfer, Triton, and paged-KV attention, with the goal of getting more tokens per second from the same GPU.

@no_stp_on_snek: My second and late Build Small submission. 10 days, 1 dev: a from-scratch Rust engine + custom GPU kernels vs vLLM on N…

X AI KOLs Following

A solo developer spent 10 days building a from-scratch Rust inference engine with custom GPU kernels and submitted it to the Build Small hackathon. On Nemotron-30B decoding it reaches 75.7 tok/s versus 57 tok/s for vLLM.

@ekzhang1: me looking at people like this guy who write real gpu kernels :)

X AI KOLs Timeline

Claude was used to write a FlashAttention forward kernel in the pyptx DSL, reaching near-parity with hand-tuned FlashAttention-4 on NVIDIA B200 hardware.

@levidiamode: 157/365 of GPU Programming Another FlashAttention4 resource that's been really helpful for me is the talk @charles_irl …

X AI KOLs Following

A daily GPU programming thread highlights a talk by @charles_irl that reverse-engineered the FlashAttention-4 code before the paper's release, and praises the Modal team's close reading of the code and their inferences about the forward pass.

@charles_irl: Last fall, we shared our deep dive on FA4 internals. But we didn't stop at grokking the kernel. Since then, we've been …

X AI KOLs Following

A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.
