The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.
This paper introduces SimReg, a regularization technique for LLM pretraining that uses embedding similarity to improve training convergence by over 30% and boost zero-shot performance.
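The summary does not spell out SimReg's exact loss, but an embedding-similarity regularizer of this kind is typically a penalty on pairwise similarity added to the language-modeling loss. A minimal illustrative sketch (the function name, penalty form, and weighting are assumptions, not the paper's formulation):

```python
import numpy as np

def embedding_similarity_penalty(emb: np.ndarray) -> float:
    """Hypothetical similarity regularizer: penalize high pairwise cosine
    similarity among a batch of embeddings, pushing representations apart."""
    # Normalize rows so the Gram matrix holds cosine similarities.
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = z @ z.T
    # Zero out self-similarity on the diagonal; penalize the rest.
    off_diag = sim - np.eye(len(z))
    return float((off_diag ** 2).mean())

# In training, the penalty would be added to the main objective, e.g.:
# total_loss = lm_loss + reg_weight * embedding_similarity_penalty(emb)
```

Orthogonal embeddings incur zero penalty, while near-duplicate embeddings are pushed apart, which is one plausible mechanism for the convergence gains the paper claims.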
This paper introduces NoiseRater, a meta-learning framework that assigns importance scores to individual noise samples during diffusion model training to improve efficiency and generation quality.
This paper introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
Tilde Research introduces Aurora, a new optimizer designed to prevent neuron death in MLP layers while maintaining orthogonality, achieving state-of-the-art results on nanoGPT benchmarks and 100x data efficiency on 1B models.
AdaPreLoRA is a novel LoRA optimizer that uses Adafactor diagonal Kronecker preconditioning to improve factor-space updates while maintaining low memory usage, demonstrating competitive performance across various LLMs and tasks.
This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.
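The bilevel framing the survey uses can be written schematically as follows (the symbols here are generic, not necessarily the survey's exact notation):

```latex
\min_{w \,\in\, \Delta^{K-1}} \; \mathcal{L}_{\mathrm{val}}\!\left(\theta^{*}(w)\right)
\quad \text{s.t.} \quad
\theta^{*}(w) = \arg\min_{\theta} \sum_{k=1}^{K} w_{k}\, \mathcal{L}_{k}(\theta)
```

where $w$ is a point on the simplex giving the mixing weights over $K$ domains, $\mathcal{L}_{k}$ is the training loss on domain $k$, and $\mathcal{L}_{\mathrm{val}}$ is a held-out validation loss. In this framing, static methods fix $w$ before the inner optimization begins, while dynamic methods adapt $w$ as the inner run progresses.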
Researchers from MIT CSAIL and other institutions introduce CompreSSM, a technique that compresses state-space AI models during training by removing unnecessary components early, resulting in faster training and smaller models without sacrificing performance.
OpenAI presents a simple data augmentation technique that enables autoregressive language models to perform fill-in-the-middle (FIM) text generation without harming left-to-right performance, with extensive ablations and best practices provided for training such models.
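The core FIM transformation is simple: split each training document into three spans and move the middle to the end, so a left-to-right model learns to generate the middle conditioned on both prefix and suffix. A minimal sketch of this prefix-suffix-middle (PSM) rearrangement, with placeholder sentinel strings standing in for the special vocabulary tokens the paper uses:

```python
import random

# Placeholder sentinels; the paper adds dedicated special tokens to the
# vocabulary rather than using literal strings like these.
PRE, SUF, MID = "<|pre|>", "<|suf|>", "<|mid|>"

def fim_transform(doc: str, rng: random.Random) -> str:
    """Rearrange a document into prefix-suffix-middle (PSM) order so an
    autoregressive model learns to generate the middle span last."""
    # Pick two random split points to carve the document into three spans.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Move the middle to the end; at inference time the model is prompted
    # with the prefix and suffix and asked to fill in the middle.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(fim_transform("def add(a, b):\n    return a + b\n", rng))
```

Applying this transformation to a fraction of the pretraining corpus is what lets the model acquire infilling ability "for free," which is why the paper finds no degradation in ordinary left-to-right performance.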
OpenAI analyzes trends in AI algorithmic efficiency, showing that compute required to reach AlexNet-level performance has halved roughly every 16 months since 2012, outpacing hardware gains. The study draws comparisons across domains like DNA sequencing and transistor density to contextualize AI progress.
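As a quick sanity check on what a 16-month halving time implies, the gain compounds over the study's roughly seven-year window (2012 onward); rounding the doubling time to 16 months gives a factor in the high thirties, close to the ~44x reduction the study reports:

```python
# Back-of-the-envelope compounding of the efficiency-doubling claim.
months = 84            # roughly seven years, 2012 to 2019
doubling_time = 16     # months per 2x efficiency gain (study's rounded figure)
factor = 2 ** (months / doubling_time)
print(f"~{factor:.0f}x less compute for AlexNet-level performance")
```

The small gap between this rounded estimate and the reported 44x simply reflects that the measured doubling time is slightly under 16 months.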