Anthropic's Mythos system achieved a 52x speedup when optimizing training code, compared with a 4x speedup achieved by a human working 4-8 hours on the same task. The caveat is that absolute multiples depend heavily on the quality of the starting code. The like-for-like comparison shows improvements of roughly 3x to 52x across models over the past year.
This paper proposes PAT, an adaptive tensor parallelism (TP) method that dynamically reconfigures TP during the generation stage of synchronous RLHF training to mitigate long-tail generation bottlenecks. Evaluations on LLaMA3.1-8B and Qwen3-14B show that it reduces generation latency by up to 34.6% and end-to-end iteration latency by up to 27.2%.
A new optimization technique for open-source RL training engines introduces prompt caching during training, achieving up to 7.5x speedup on long-prompt, short-response workloads by reducing redundant compute.
Photoroom's PRX Part 3 demonstrates training a text-to-image model in 24 hours by combining optimized architectural and training techniques including perceptual losses, token routing with TREAD, and the Muon optimizer.