pre-training

#pre-training

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Hugging Face Daily Papers · 2026-05-09

This paper explores structured pruning and knowledge distillation techniques for compressing large Mixture-of-Experts (MoE) models during pre-training. It demonstrates that progressive pruning and combined distillation strategies, such as multi-token prediction distillation, improve downstream performance, exemplified by compressing Qwen3-Next-80A3B to a more efficient 23A2B model.

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

arXiv cs.LG · 2026-05-08

This paper introduces ADAPT, an online reweighting framework for LLM data curation that dynamically adjusts sample importance during training via loss weighting, outperforming offline selection and mixing methods in cross-benchmark generalization.
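As a rough illustration of what "adjusting sample importance via loss weighting" can look like, here is a minimal sketch. It is not ADAPT's actual rule: the softmax-over-losses weighting, the `temperature` parameter, and the function names are all assumptions chosen only to show the general shape of online reweighting.

```python
import math

def online_weights(losses, temperature=1.0):
    """Hypothetical rule: softmax over per-sample losses, so that
    higher-loss samples receive larger weights. Not taken from the paper."""
    m = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]

def reweighted_loss(losses, temperature=1.0):
    """Batch loss as a weighted sum of per-sample losses; the weights are
    recomputed for every batch instead of being fixed offline."""
    weights = online_weights(losses, temperature)
    return sum(w * l for w, l in zip(weights, losses))

# Per-sample losses for one illustrative batch.
batch_losses = [0.5, 1.0, 2.0]
weights = online_weights(batch_losses)
loss = reweighted_loss(batch_losses)
```

The contrast with offline selection is that the weights here depend on the model's current losses, so they change as training progresses rather than being decided once before training.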

ZAYA1-74B-Preview: Scaling Pretraining on AMD

Reddit r/LocalLLaMA · 2026-05-07

Zyphra releases ZAYA1-74B-Preview, a 74-billion-parameter base model trained on AMD hardware, highlighting strong pre-RL reasoning capabilities and agentic performance signals.

Long Context Pre-Training with Lighthouse Attention

Hugging Face Daily Papers · 2026-05-07

Lighthouse Attention is a training-only hierarchical selection-based attention algorithm that reduces computational complexity for long sequence training of causal transformers, enabling faster pre-training with competitive final loss after a recovery phase.

Efficient Pre-Training with Token Superposition

Hugging Face Daily Papers · 2026-05-07

Token-Superposition Training (TST) improves LLM pre-training efficiency by combining contiguous tokens into bags during a superposition phase with a multi-hot cross-entropy objective, achieving up to 2.5x reduction in training time without architectural changes.
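To make "bags of contiguous tokens with a multi-hot cross-entropy objective" concrete, here is a minimal sketch under stated assumptions. The uniform target over the tokens in a bag, the fixed `bag_size`, and the helper names are hypothetical; the paper's exact formulation may differ.

```python
import math

def make_bags(tokens, bag_size):
    """Group contiguous token ids into bags of at most bag_size."""
    return [tokens[i:i + bag_size] for i in range(0, len(tokens), bag_size)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy against a target that puts equal mass on every token
    in the bag (an assumed form of the multi-hot objective)."""
    logp = log_softmax(logits)
    return -sum(logp[t] for t in bag) / len(bag)

# Five token ids grouped into bags of two; the last bag is shorter.
bags = make_bags([3, 1, 4, 1, 5], bag_size=2)
# Uniform logits over a 4-word vocabulary, scored against the bag {0, 2}.
uniform_loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], [0, 2])
```

With a bag of size one, this reduces to ordinary next-token cross-entropy, which is consistent with the summary's point that no architectural change is needed.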

MiniCPM4: Ultra-Efficient LLMs on End Devices

Papers with Code Trending · 2025-06-09

MiniCPM4 is a highly efficient large language model designed for end devices, achieving strong performance with 0.5B and 8B parameter versions through innovations in sparse attention, data filtering, training algorithms, and inference systems.

Foundations of Large Language Models

Papers with Code Trending · 2025-01-16

This book covers foundational concepts of large language models, including pre-training, generative models, prompting, and alignment. It serves as a reference for students and practitioners in NLP.

OpenAI’s technology explained

OpenAI Blog · 2023-10-11

OpenAI publishes an explainer on its core technology, detailing how language models like GPT-4 are developed through pre-training (learning from vast text data) and post-training (alignment with human values and safety practices). The article emphasizes OpenAI's nonprofit mission structure and explains the distinction between raw base models and refined, usable versions.

DALL·E 2 pre-training mitigations

OpenAI Blog · 2022-06-28

OpenAI describes the pre-training data filtering and active learning techniques used to reduce harmful content in DALL·E 2, while also addressing unintended bias amplification caused by data filtering—particularly demographic biases in generated images.

Language models are few-shot learners

OpenAI Blog · 2020-05-28

OpenAI introduces GPT-3, a 175-billion-parameter autoregressive language model that shows strong few-shot learning across diverse NLP tasks without gradient updates or fine-tuning. Tasks and demonstrations are specified purely through text interaction with the model, a paradigm shift in how language models are applied to new tasks.

Improving language understanding with unsupervised learning

OpenAI Blog · 2018-06-11

OpenAI presents a two-stage approach for improving language understanding: pretraining a transformer model on large unsupervised datasets using language modeling, then fine-tuning on smaller supervised datasets for specific tasks. The method achieves state-of-the-art results across diverse tasks including commonsense reasoning, semantic similarity, and reading comprehension with minimal hyperparameter tuning.
