Built an LLM training framework that actually runs on older GPUs without crashing

Reddit r/ArtificialInteligence 06/27/26, 04:47 PM Tools

llm-training older-gpus gpu-dependency open-source pytorch nanotron picotron

Summary

Introduces Picotron, a clean-room rewrite of Nanotron that eliminates mandatory GPU-specific dependencies, enabling LLM training on older GPUs like T4 and V100. It defaults to standard PyTorch SDPA but supports FlashAttention-2 at runtime.

Hey guys,

I was playing around with Nanotron recently and got super frustrated by how many heavy, hardware-specific dependencies it imports at the module level (flash-attn, triton, functorch, etc.). If you try to run it on older or budget GPUs like a T4 or V100, it just crashes on import.

So I wrote Picotron (https://github.com/Syntropy-AI-Labs/picotron) to solve this. It's a clean-room rewrite that gets rid of all mandatory GPU-specific dependencies. It runs on pretty much any GPU that supports PyTorch (defaults to FP16 on older cards below compute capability 8.0, and BF16 on newer ones). It uses standard PyTorch SDPA by default, but still hooks into FlashAttention-2 at runtime if it detects you have it installed.

I used an AI assistant to write a lot of the boilerplate/code modules, but I've got it working locally and just trained a tiny 2M model on FineWeb-Edu.

Also added configs for:

• GQA / MLA (Multi-head Latent Attention)
• QK-Norm & logit soft-capping (Gemma 2 style)
• Parallel FFN/Attn runs
• ZeRO-1 wrapping on DDP

Roadmap is pretty short right now:

• MoE prep (routing capacity factors and load balancing loss)
• Making dataset prep easier than streaming manually

Check it out if you've been fighting with CUDA dependency hell: https://github.com/Syntropy-AI-Labs/picotron
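
For anyone wondering what the "no hard dependencies" pattern looks like in practice, here is a rough sketch of the idea. It is not Picotron's actual code. The function names (`has_module`, `pick_dtype_name`, `pick_attention_backend`, `attention`) are made up for illustration. The compute capability check before choosing FlashAttention-2 is also an assumption on my part, since the post only mentions detecting whether the package is installed. The point is that nothing heavy is imported at module level, so the file imports cleanly even on a machine without flash-attn.

```python
import importlib.util
from typing import Optional, Tuple


def has_module(name: str) -> bool:
    """Check whether a top-level package is importable without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_dtype_name(compute_capability: Tuple[int, int]) -> str:
    """FP16 below compute capability 8.0, BF16 at 8.0 and above."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


def pick_attention_backend(
    compute_capability: Tuple[int, int],
    flash_attn_installed: Optional[bool] = None,
) -> str:
    """Return "flash_attn_2" only when the package is present, else "sdpa"."""
    if flash_attn_installed is None:
        flash_attn_installed = has_module("flash_attn")
    # Assumption: also require capability >= 8.0 before using FlashAttention-2.
    if flash_attn_installed and compute_capability >= (8, 0):
        return "flash_attn_2"
    return "sdpa"


def attention(q, k, v, causal: bool = True, backend: str = "sdpa"):
    """q, k, v are tensors shaped (batch, heads, seq, head_dim)."""
    if backend == "flash_attn_2":
        # Imported lazily, so a missing package never breaks module import.
        from flash_attn import flash_attn_func

        # flash_attn_func expects (batch, seq, heads, head_dim).
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=causal
        )
        return out.transpose(1, 2)

    # Default path: PyTorch's built-in scaled dot-product attention (torch >= 2.0).
    import torch.nn.functional as F

    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


if __name__ == "__main__":
    # Capabilities are hard-coded here; on a real machine they would come
    # from torch.cuda.get_device_capability().
    for gpu, cap in [("V100", (7, 0)), ("T4", (7, 5)), ("A100", (8, 0))]:
        backend = pick_attention_backend(cap, flash_attn_installed=False)
        print(gpu, pick_dtype_name(cap), backend)
```

On a real GPU you would pass `torch.cuda.get_device_capability()` into the two `pick_*` functions once at startup and reuse the result, rather than checking on every forward pass.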


Similar Articles

@tom_doerr: Runs 70B LLMs on single 4GB GPU https://github.com/lyogavin/airllm

@tom_doerr: Trains billion-parameter LLMs from scratch on a single GPU https://github.com/FareedKhan-dev/train-llm-from-scratch…

Developing open source LLM from ground up from pretrain - rlhf(PPO/GRPO)

235M param LLM from scratch on a single RTX 5080
A hobbyist trained a 235M-parameter LLM from scratch on a single RTX 5080, sharing full PyTorch pipeline and open-sourcing Plasma 1.0.

Me train LLM on 8GB from Scratch. Me happy