Tag
This paper introduces ARIA, a framework that adaptively allocates training effort across regions of the conditioning space for distilling conditional diffusion models, improving performance on unseen and underrepresented conditions.
Inspired by Gemma 4 12B, researchers trained a vision-language model without a vision encoder for only $100, achieving a 30% reduction in end-to-end latency on an M3 Pro MacBook.
This paper proposes PauseRec, a lightweight implicit reasoning paradigm for LLM-based generative recommendation that outperforms explicit chain-of-thought methods while significantly reducing training and inference costs.
Proposes LC-QAT, a 2-bit weight-only vector-quantization-aware training framework for LLMs that uses a learned affine mapping to enable end-to-end training, achieving state-of-the-art results with only 0.1%-10% of the training data.
DOG-DPO is a training-free data selection framework that treats preference pairs as structured geometric signals, decomposing multi-dataset preference geometry into anchor and residual subspaces to select diverse subsets for safety alignment. It achieves strong utility-robustness trade-offs using only 11% of preference pairs across six safety benchmarks.
MaskAlign proposes a token-subset representation alignment method that improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment under perturbations.
BIN16 replaces all floating-point operations with boolean operations (XNOR+popcount) for neural network training and inference, enabling direct computation in off-the-shelf DRAM with no floats, gradients, or hyperparameter tuning. It achieves 82% accuracy on MNIST in a single epoch, with an implementation of only 220 lines of C.
Proposes AttentionPO, a token-weighted direct preference optimization method that uses attention from the LLM itself to estimate token weights, improving alignment performance on AlpacaEval, MT-Bench, and ArenaHard without requiring a separate reward model.
Sapient Intelligence has released HRM-Text, a 1B-parameter text generation model trained on only 0.04 trillion tokens at a cost of approximately $1,000. It surpasses much larger models trained on 100-1000 times more data on multiple reasoning benchmarks, which the release presents as the beginning of a new paradigm for AI training.
HRM-Text is a 1B-parameter hierarchical reasoning language model proposed by Sapient Intelligence. It reasons in an internal latent space and outperforms most models of the same size at an extremely low training cost.
Sapient Intelligence introduces HRM-Text, a 1B-parameter reasoning language model trained on only 40B tokens with a budget of $1,000, achieving competitive performance while drastically reducing data and compute requirements.
This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures, enabling efficient and unified data mixing across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.
Microsoft releases Lens, a 3.8B-parameter foundational text-to-image model with efficient training and fast high-resolution generation, featuring dense-caption pre-training and mixed-resolution learning.
EndPrompt proposes a method for extending the context window of large language models using only short training sequences, by anchoring a terminal prompt with target-length positional indices. It achieves strong benchmark results with substantially less computation than full-length fine-tuning.
Microsoft releases Lens, a 3.8B-parameter foundational text-to-image model designed for efficient training and fast high-resolution generation, achieving competitive quality with reduced compute.
GRACE proposes a gradient-aligned method that scores individual reasoning steps to select the most valuable data for post-training, achieving 108.8% of full-data performance with only 20% of the data.
Moonshot AI founder Yang Zhilin released a 40-minute video detailing the training process of the Kimi K2 model, which cost only $4.6 million. In an 8-model real-time programming competition, Kimi K2 took first place, defeating GPT-5.5 and others, demonstrating how a small team can overturn the traditional compute-stacking paradigm through architecture optimization.
This paper introduces jina-embeddings-v5-omni, a suite of multimodal embedding models that extend text embeddings to image, audio, and video using frozen-tower composition. The method trains only 0.35% of the total weights, preserving the text embedding geometry while achieving performance competitive with the state of the art at significantly lower computational cost.
This paper introduces CapVector, a method that decouples auxiliary training objectives from standard supervised finetuning in Vision-Language-Action models. By extracting transferable capability vectors and applying orthogonal regularization, it enhances model performance and generalization while significantly reducing computational overhead.
Lighthouse Attention is a training-only hierarchical selection-based attention algorithm that reduces computational complexity for long sequence training of causal transformers, enabling faster pre-training with competitive final loss after a recovery phase.