training-efficiency

#training-efficiency

The Context-Ready Transformer

arXiv cs.CL ↗ · 3d ago Cached

The paper introduces the context-ready transformer, a recurrent architecture that pre-contextualizes tokens before the transformer block, achieving significant inference speedups (e.g., 1.7x on A100) while matching or exceeding standard transformer performance with fewer layers.

Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

Hugging Face Daily Papers ↗ · 3d ago Cached

This paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model for high-resolution text-to-image synthesis, introducing a token-editing mechanism and grouped cross-entropy objective to improve token refinement and training efficiency.

LISA: Likelihood Score Alignment for Visual-condition Controllable Generation

Hugging Face Daily Papers ↗ · 2026-06-25 Cached

This paper introduces LISA, a regularization method that aligns the intermediate features of a side network with an approximated likelihood score to improve training efficiency and the quality of visual-condition controllable generation in score-based generative models.

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

Hugging Face Daily Papers ↗ · 2026-06-23 Cached

Introduces Holistic Data Scheduler (HDS), a reinforcement learning-based framework that dynamically adjusts data mixtures during LLM pre-training using a multi-objective reward function, achieving 44% fewer iterations to reach target perplexity and a 7.2% improvement on MMLU.

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

Introduces Bayesian Manifold Curriculum (BMC), an adaptive curriculum learning method for LLMs that leverages the model's latent geometry to allocate training effort across diverse problem types, improving efficiency beyond traditional difficulty-based curricula.

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

Hugging Face Daily Papers ↗ · 2026-06-05 Cached

Introduces PACI, a bubble-free asynchronous pipeline parallel training method that bounds forward/backward weight inconsistency using local gradient accumulation, achieving higher throughput and faster time-to-accuracy without sacrificing stability or memory usage.

Spectral Scaling Laws of Muon

arXiv cs.LG ↗ · 2026-06-04 Cached

This paper presents the first systematic study of singular value spectral behavior in Muon optimizer momentum matrices during LLM training, discovering clean power-law scaling relationships across model sizes (77M–2.8B parameters). The findings provide practitioners with principled, layer-aware guidelines for configuring Newton–Schulz iterations to maintain orthonormalization quality at frontier scale without unnecessary computation.
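For readers unfamiliar with the Newton–Schulz step mentioned in this summary, here is a minimal NumPy sketch of the idea: repeatedly applying a matrix polynomial pushes every singular value of a matrix toward 1, which approximately orthonormalizes it without computing an SVD. The coefficients are the textbook cubic ones (1.5 and -0.5), chosen for clarity; they are an assumption for illustration and are not taken from the paper or from any particular Muon implementation. The function names are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=5):
    """Approximately orthonormalize `m` with the classic cubic Newton-Schulz iteration.

    Illustrative only: the coefficients below are the textbook ones,
    not a tuned variant from any specific optimizer.
    """
    # Dividing by the Frobenius norm bounds all singular values by 1,
    # which keeps them inside the iteration's convergence region.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, whose fixed point is 1.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


def orthogonality_error(x):
    """Frobenius distance between the smaller Gram matrix of `x` and the identity."""
    gram = x @ x.T if x.shape[0] <= x.shape[1] else x.T @ x
    return float(np.linalg.norm(gram - np.eye(gram.shape[0])))


if __name__ == "__main__":
    momentum = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 0.0]])
    for steps in (2, 6, 12):
        err = orthogonality_error(newton_schulz_orthogonalize(momentum, steps))
        print(steps, err)
```

More steps give a result closer to orthonormal at the cost of extra matrix multiplications, which is the quality-versus-compute trade-off the summary refers to.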

Why Muon Outperforms Adam: A Curvature Perspective

Hugging Face Daily Papers ↗ · 2026-06-03 Cached

This paper investigates why the Muon optimizer outperforms Adam in large language model training, showing from a curvature perspective that Muon incurs a smaller curvature penalty due to lower normalized directional sharpness, with advantages amplified by data imbalance.

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

arXiv cs.LG ↗ · 2026-06-02 Cached

This paper systematically analyzes the sensitivity of tool-calling evaluations to minor implementation choices such as random seeds and multi-turn templates, revealing that these can cause substantial performance variation. It also identifies sources of computational waste in RL-based tool-calling training and introduces techniques to accelerate training without sacrificing performance.

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

arXiv cs.LG ↗ · 2026-05-21 Cached

The paper proposes FBOS-RL, a feedback-driven bi-objective synergistic reinforcement learning framework that improves training efficiency and performance ceiling over GRPO in LLM alignment and reasoning by using feedback-guided exploration and two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment and Exploration-oriented Capability Cultivation.

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Hugging Face Daily Papers ↗ · 2026-05-20 Cached

This paper proposes Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive residual replacement that improves cross-layer information flow in Diffusion Transformers, leading to significant training acceleration and quality improvements.

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Hugging Face Daily Papers ↗ · 2026-05-20 Cached

Lens is a compact 3.8B-parameter text-to-image model from Microsoft that achieves competitive performance with larger models while requiring significantly less training compute, using dense captions, multi-resolution batching, and an efficient architecture.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv cs.LG ↗ · 2026-05-18 Cached

Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.

@HowToAI_: NVIDIA has done the impossible and nobody's talking about it. They trained a 12 BILLION parameter LLM in 4-bit precisio…

X AI KOLs Timeline ↗ · 2026-05-15

The post claims that NVIDIA trained a 12-billion-parameter LLM in 4-bit precision using the new NVFP4 format with micro-scaling, with near-zero loss in model quality while halving memory usage and tripling arithmetic speed.

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Hugging Face Daily Papers ↗ · 2026-05-15 Cached

Flash-GRPO improves training efficiency for video diffusion models by addressing temporal variance and gradient inconsistency through iso-temporal grouping and temporal gradient rectification, achieving state-of-the-art alignment quality with substantial training acceleration.

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Hugging Face Daily Papers ↗ · 2026-05-14 Cached

This paper investigates the impact of subword tokenization on LLM training efficiency and performance through controlled byte-level pretraining experiments. It separates out the factors behind the benefits of subword tokenization, such as training throughput and the role of subword boundaries as linguistic priors.

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

arXiv cs.CL ↗ · 2026-05-12 Cached

The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

arXiv cs.CL ↗ · 2026-05-12 Cached

This paper introduces SimReg, a regularization technique for LLM pretraining that uses embedding similarity to improve training convergence by over 30% and boost zero-shot performance.

NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

arXiv cs.LG ↗ · 2026-05-12 Cached

This paper introduces NoiseRater, a meta-learning framework that assigns importance scores to individual noise samples during diffusion model training to improve efficiency and generation quality.

Gradient Extrapolation-Based Policy Optimization

arXiv cs.LG ↗ · 2026-05-11 Cached

This paper introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It reports improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
