This paper explores structured pruning and knowledge distillation techniques for compressing large Mixture-of-Experts (MoE) models during pre-training. It demonstrates that progressive pruning and combined distillation strategies, such as multi-token prediction distillation, improve downstream performance, exemplified by compressing Qwen3-Next-80A3B to a more efficient 23A2B model.
This paper introduces ADAPT, an online reweighting framework for LLM data curation that dynamically adjusts sample importance during training via loss weighting, outperforming offline selection and mixing methods in cross-benchmark generalization.
Zyphra releases ZAYA1-74B-Preview, a 74-billion-parameter base model trained on AMD hardware, highlighting strong pre-RL reasoning capabilities and agentic performance signals.
Lighthouse Attention is a training-only hierarchical selection-based attention algorithm that reduces computational complexity for long sequence training of causal transformers, enabling faster pre-training with competitive final loss after a recovery phase.
Token-Superposition Training (TST) improves LLM pre-training efficiency by combining contiguous tokens into bags during a superposition phase with a multi-hot cross-entropy objective, achieving up to 2.5x reduction in training time without architectural changes.
MiniCPM4 is a highly efficient large language model designed for end devices, achieving strong performance with 0.5B and 8B parameter versions through innovations in sparse attention, data filtering, training algorithms, and inference systems.
This book covers foundational concepts of large language models, including pre-training, generative models, prompting, and alignment. It serves as a reference for students and practitioners in NLP.
OpenAI publishes an explainer on its core technology, detailing how language models like GPT-4 are developed through pre-training (learning from vast text data) and post-training (alignment with human values and safety practices). The article emphasizes OpenAI's nonprofit mission structure and explains the distinction between raw base models and refined, usable versions.
OpenAI describes the pre-training data filtering and active learning techniques used to reduce harmful content in DALL·E 2, while also addressing unintended bias amplification caused by data filtering—particularly demographic biases in generated images.
OpenAI introduces GPT-3, a 175-billion-parameter autoregressive language model that demonstrates strong few-shot learning capabilities across diverse NLP tasks without gradient updates or fine-tuning, representing a paradigm shift in how language models can be applied to new tasks through text interactions alone.
OpenAI presents a two-stage approach for improving language understanding: pre-training a transformer model on large unsupervised datasets using language modeling, then fine-tuning on smaller supervised datasets for specific tasks. The method achieves state-of-the-art results across diverse tasks including commonsense reasoning, semantic similarity, and reading comprehension with minimal hyperparameter tuning.