@charles_irl: Somehow missed this one in the hustle and bustle. Very cool demo!

X AI KOLs Following Tools

Summary

A developer trained a 12M-parameter LLM from scratch on a custom ML framework with a Rust backend and custom CUDA kernels, including Flash Attention and AdamW. The transformer architecture and BPE tokenizer were also written from scratch.

Somehow missed this one in the hustle and bustle. Very cool demo! https://t.co/CWsyssqk09

Cached at: 06/08/26, 07:17 AM


Aadi Kulshrestha (@MankyDankyBanky): I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more.

Wrote the full transformer architecture, and BPE tokenizer from scratch.

The framework features:

  • Custom CUDA kernels (Flash Attention, fused
