@charles_irl: Somehow missed this one in the hustle and bustle. Very cool demo!
Summary
A developer wrote a custom ML framework with a Rust backend and CUDA kernels (including Flash Attention and AdamW) and used it to train a 12M-parameter LLM from scratch.
Cached at: 06/08/26, 07:17 AM
Somehow missed this one in the hustle and bustle.
Very cool demo! https://t.co/CWsyssqk09
Aadi Kulshrestha (@MankyDankyBanky): I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more.
Wrote the full transformer architecture, and BPE tokenizer from scratch.
The framework features:
- Custom CUDA kernels (Flash Attention, fused
Similar Articles
@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587
The author shares learnings from training a 160M parameter LLM from scratch, experimenting with architectures like multi-token prediction and hierarchical reasoning models. They emphasize the importance of fast iteration, simplifying ideas, and understanding why architectures work.
@ivanfioravanti: Apple M5 Max + MLX = raw power! Look at this demo I'm playing with "FasterLivePortrait-MLX" I started with MPS but resu…
The author demonstrates that migrating a LivePortrait implementation from MPS to Apple's MLX framework on an M5 Max chip makes it significantly faster.
Me train LLM on 8GB from Scratch. Me happy
The author built a repository for training a tiny language model (25M parameters) from scratch on 8GB of VRAM. It supports MTP, and the author notes limitations of mHC and BitNet.
235M param LLM from scratch on a single RTX 5080
A hobbyist trained a 235M-parameter LLM from scratch on a single RTX 5080, sharing the full PyTorch pipeline and open-sourcing Plasma 1.0.
@evanyou: https://x.com/evanyou/status/2060409444123729935
A developer shares an interesting use case for running LLMs in the browser to inspect internal workings, highlighting a meaningful scenario for client-side AI.