Tag
MegaTrain enables full-precision training of 100B+ LLMs on a single GPU by treating VRAM as a transient stateless cache, inverting the memory hierarchy.
An update on the Ornith-1.0-35B GGUF model introduces a native MTP speculative-decode graft for faster inference on a single GPU, achieving ~1.3-1.35x decode speedup while maintaining near-identical token distribution. Benchmark numbers for throughput, TTFT, and long-context performance across multiple quants are provided.
This paper studies a staged promotion protocol for micro-pretraining, using escalating budgets from minutes to hours to filter configurations. It finds that early screens are useful but unstable, and that a staged approach can retain a long-horizon reference while identifying alternatives that fail continuation thresholds.
A methodology for autonomously training transformer language models on a single consumer GPU, structured in six stages with verification gates and AGENTS.md specs for orchestration frameworks like OpenClaw.
ModeSwitch-LLM is a lightweight controller that routes LLM inference requests to appropriate fixed modes (e.g., FP16, quantization, speculative decoding) on a single GPU, achieving up to 2.10× latency speedup and 51.7% energy reduction without retraining the model.
An open-source repository called train-llm-from-scratch enables training billion-parameter LLMs on a single GPU, with a configurable pipeline from raw text to inference, including dataset streaming and checkpointing, under MIT License.
TideGS introduces an out-of-core training framework that enables 3D Gaussian Splatting with over one billion primitives on a single GPU by managing parameters across SSD-CPU-GPU hierarchy via block-virtualization, asynchronous pipeline, and differential streaming techniques.
Hugging Face's PEFT library enables parameter-efficient fine-tuning of large models on a single GPU, reducing compute and storage costs while maintaining performance.
A GitHub repository provides scripts to train billion-parameter language models from scratch on a single GPU using PyTorch, based on the Transformer architecture.
Built an open-source pipeline that takes a single sentence and produces a cinematic reel with characters, animation, music, and narration, using FLUX.2, Wan2.2, and other models on a single AMD GPU. The pipeline includes a director agent, character generation, keyframe animation, vision critic, music, and narration stages.
AirLLM is an open-source tool that optimizes inference memory usage, enabling 70B LLMs to run on a single 4GB GPU without quantization, and supports 405B models on 8GB VRAM.
A single RTX 3090 pushes 134 tok/s on the fresh 27B Qwen 3.5 Dense and 73 tok/s on Qwen 3.6-27B via fused kernels plus speculative decoding, with GGUF drops the same evening.
AirLLM is an open-source library that enables running large language models (up to 405B) on a single 4GB GPU without quantization, distillation, or pruning, significantly lowering the hardware barrier for local LLM inference.