Tag
The paper introduces Ember, a lightweight optimizer for embedding and LM-head matrices that exploits gradient geometry to improve efficiency and performance across supervised finetuning, RL, and pretraining, while using far less optimizer state than Adam.
Introduces Duplicated Latent Residual (DLR), a training-only, parameter-free plug-in for low-rank pre-training that improves perplexity across LLaMA models from 60M to 7B parameters, and can be folded into the model after training with zero inference cost.
This paper introduces FRESCO, an Echo State Network architecture operating entirely in the frequency domain to achieve O(N) complexity for dense recurrent updates, matching state-of-the-art performance on benchmarks while reducing computational costs.
ViQ presents a visual quantization framework that balances semantic richness and detail preservation in discrete representations, enabling efficient multimodal training with native-resolution inputs by using text-aligned pre-training and proximal representation learning.
This paper introduces ARIA, a framework that adaptively allocates training effort across regions of the conditioning space for distilling conditional diffusion models, improving performance on unseen and underrepresented conditions.
Inspired by Gemma 4 12B, researchers trained a vision-language model without a vision encoder for only $100, achieving a 30% reduction in end-to-end latency on an M3 Pro MacBook.
This paper proposes PauseRec, a lightweight implicit reasoning paradigm for LLM-based generative recommendation that outperforms explicit chain-of-thought methods while significantly reducing training and inference costs.
Proposes LC-QAT, a 2-bit weight-only vector-quantization-aware training framework for LLMs that uses a learned affine mapping to enable end-to-end training, achieving state-of-the-art results with only 0.1%-10% of the training data.
DOG-DPO is a training-free data selection framework that treats preference pairs as structured geometric signals, decomposing multi-dataset preference geometry into anchor and residual subspaces to select diverse subsets for safety alignment. It achieves strong utility-robustness trade-offs using only 11% of preference pairs across six safety benchmarks.
MaskAlign proposes a token-subset representation alignment method that improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment under perturbations.
BIN16 replaces all floating-point operations with boolean operations (XNOR+popcount) for neural network training and inference, enabling direct computation in off-the-shelf DRAM with zero floats, gradients, or hyperparameter tuning. It achieves 82% accuracy on MNIST in a single epoch, using only 220 lines of C.
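As a rough illustration of the XNOR+popcount idea mentioned above, the sketch below computes the dot product of two vectors of +1/-1 values using only bitwise operations. It is a generic textbook identity (dot = 2 * matches - n), not BIN16's actual code, and the helper names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(signs):
    """Pack a sequence of +1/-1 values into an int; bit i is set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors, given their packed bits."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR, restricted to the low n bits
    matches = bin(agree).count("1")    # popcount: positions where the signs agree
    # Each agreeing position contributes +1, each disagreeing one contributes -1.
    return 2 * matches - n


# Illustrative inputs (made up for this sketch).
a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

result = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))  # ordinary integer dot product
```

Here `result` and `reference` agree, which is the property that lets a multiply-accumulate over binarized values be replaced with boolean operations.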
Proposes AttentionPO, a token-weighted direct preference optimization method that uses attention from the LLM itself to estimate token weights, improving alignment performance on AlpacaEval, MT-Bench, and ArenaHard without requiring a separate reward model.
Sapient Intelligence has released HRM-Text, a 1B-parameter text generation model trained on only 0.04 trillion tokens at a cost of approximately $1,000. It surpasses much larger models trained on 100-1000 times more data on multiple reasoning benchmarks, which is presented as the beginning of a new paradigm for AI training.
HRM-Text is a 1B-parameter hierarchical reasoning language model proposed by Sapient Intelligence. It reasons efficiently in an internal latent space, outperforming most models of the same size at an extremely low training cost.
Sapient Intelligence introduces HRM-Text, a 1B-parameter reasoning language model trained on only 40B tokens with a budget of $1,000, achieving competitive performance while drastically reducing data and compute requirements.
This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures, enabling efficient and unified data mixing across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.
Microsoft releases Lens, a 3.8B-parameter foundational text-to-image model with efficient training and fast high-resolution generation, featuring dense-caption pre-training and mixed-resolution learning.
EndPrompt proposes a method for extending the context window of large language models using only short training sequences, by anchoring a terminal prompt with target-length positional indices. It achieves strong benchmark results with substantially less computation than full-length fine-tuning.
Microsoft releases Lens, a 3.8B-parameter foundational text-to-image model designed for efficient training and fast high-resolution generation, achieving competitive quality with reduced compute.
GRACE proposes a gradient-aligned method that scores individual reasoning steps to select the most valuable data for post-training, achieving 108.8% of full-data performance with only 20% of the data.