Built an LLM training framework that actually runs on older GPUs without crashing

Reddit r/ArtificialInteligence Tools

Summary

Introduces Picotron, a clean-room rewrite of Nanotron that eliminates mandatory GPU-specific dependencies, enabling LLM training on older GPUs like T4 and V100. It defaults to standard PyTorch SDPA but supports FlashAttention-2 at runtime.

Hey guys, I was playing around with Nanotron recently and got super frustrated by how many heavy, hardware-specific dependencies it imports at the module level (flash-attn, triton, functorch, etc.). If you try to run it on older or budget GPUs like a T4 or V100, it just crashes on import.

So I wrote Picotron (https://github.com/Syntropy-AI-Labs/picotron) to solve this. It's a clean-room rewrite that gets rid of all mandatory GPU-specific dependencies. It runs on pretty much any GPU that supports PyTorch (defaults to FP16 on older cards under compute capability 8.0, and BF16 on newer ones). It uses standard PyTorch SDPA by default, but still hooks into FlashAttention-2 at runtime if it detects you have it installed.

I used an AI assistant to write a lot of the boilerplate/code modules, but I've got it working locally and just trained a tiny 2M model on FineWeb-Edu.

Also added configs for:
• GQA / MLA (Multi-head Latent Attention)
• QK-Norm & logit soft-capping (Gemma 2 style)
• Parallel FFN/Attn runs
• ZeRO-1 wrapping on DDP

Roadmap is pretty short right now:
• MoE prep (routing capacity factors and load balancing loss)
• Making dataset prep easier than streaming manually

Check it out if you've been fighting with CUDA dependency hell: https://github.com/Syntropy-AI-Labs/picotron
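
For anyone curious what the precision default and the attention fallback described in the post could look like, here is a minimal sketch. It is not taken from the Picotron repo: the function names `pick_dtype_name` and `pick_attention_backend` and the backend labels are hypothetical, and the only assumptions are the ones stated in the post (FP16 below compute capability 8.0, BF16 at 8.0 and above, SDPA unless flash-attn is installed). The decision logic is kept free of any torch import so the module loads on any machine.

```python
import importlib.util


def pick_dtype_name(capability):
    """Pick a default training dtype name from a CUDA compute capability.

    `capability` is a (major, minor) tuple, the shape returned by
    torch.cuda.get_device_capability(). Hypothetical helper, not Picotron's API.
    """
    major, _minor = capability
    # The post's rule: FP16 under compute capability 8.0, BF16 from 8.0 up.
    return "bfloat16" if major >= 8 else "float16"


def pick_attention_backend(find_spec=importlib.util.find_spec):
    """Return "flash_attn_2" if the flash_attn package is importable, else "sdpa".

    The check happens at call time, so nothing GPU-specific is imported
    at module level. `find_spec` is injectable only to make this testable.
    """
    if find_spec("flash_attn") is not None:
        return "flash_attn_2"
    return "sdpa"


if __name__ == "__main__":
    # T4 is compute capability 7.5, V100 is 7.0, A100 is 8.0.
    for name, cap in [("T4", (7, 5)), ("V100", (7, 0)), ("A100", (8, 0))]:
        print(name, pick_dtype_name(cap))
    print("attention backend:", pick_attention_backend())
```

In a real training script the dtype name would presumably be resolved with something like `getattr(torch, name)`, and the "sdpa" branch would call `torch.nn.functional.scaled_dot_product_attention`. How Picotron actually wires this up may differ.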
