This paper introduces Rotation-Preserving Supervised Fine-Tuning (RPSFT), a method that improves out-of-domain generalization by preserving projected rotations in pretrained singular subspaces during fine-tuning.
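A minimal sketch of the constraint, under one plausible reading (keep the top-k pretrained singular directions fixed so fine-tuning can rescale, but not rotate, the dominant subspace); the cutoff k and the projection rule are illustrative, not the paper's exact algorithm:

```python
import torch

def rpsft_project(W_pre, W_ft, k=8):
    # Decompose the pretrained weight and pin its top-k singular directions.
    U0, S0, V0h = torch.linalg.svd(W_pre, full_matrices=False)
    Uk, Vkh = U0[:, :k], V0h[:k, :]        # pretrained top-k subspaces
    core = Uk.T @ W_ft @ Vkh.T             # fine-tuned update seen in that basis
    # Keep only the diagonal of the core: the subspace may be rescaled,
    # but not rotated, by fine-tuning.
    constrained = Uk @ torch.diag(torch.diagonal(core)) @ Vkh
    residual = W_ft - Uk @ core @ Vkh      # everything outside the top-k block
    return constrained + residual
```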
The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.
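As a rough illustration of guidance-seeded exploration (the env/policy interfaces, the 30% guidance rate, and the prefix length are assumptions, not the paper's setup):

```python
import random

def guided_rollout(env, policy, human_traces, p_guide=0.3, prefix_len=4):
    # With probability p_guide, replay a short prefix of a human action
    # trace before handing control to the learned policy, so RL can reach
    # states it would rarely discover through random exploration alone.
    traj, state, done = [], env.reset(), False
    if human_traces and random.random() < p_guide:
        for action in random.choice(human_traces)[:prefix_len]:
            state, reward, done, _ = env.step(action)
            traj.append((action, reward))
            if done:
                break
    while not done:
        action = policy(state)
        state, reward, done, _ = env.step(action)
        traj.append((action, reward))
    return traj
```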
This paper introduces YFPO, a neuron-guided preference optimization framework that uses internal activation signals to improve mathematical reasoning in large language models.
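A hedged sketch of what "neuron-guided" preference optimization could look like: a DPO-style loss whose per-example weight comes from an internal activation signal. The weighting rule and `act_signal` are assumptions, not YFPO's actual mechanism:

```python
import torch.nn.functional as F

def neuron_weighted_dpo(logp_w, logp_l, ref_w, ref_l, act_signal, beta=0.1):
    # Standard DPO margin between chosen (w) and rejected (l) completions,
    # relative to a frozen reference model.
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    # Upweight examples where the internal activation signal (e.g. mean
    # activation of reasoning-linked neurons) is strong.
    return (act_signal * -F.logsigmoid(margin)).mean()
```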
This paper proposes LayerTracer, an interpretable framework for layer allocation in continued pre-training, demonstrating that freezing deep layers while training shallow ones outperforms full-parameter fine-tuning. It offers a low-cost, actionable strategy for resource-constrained teams optimizing Large Language Models.
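The reported allocation strategy is easy to reproduce in PyTorch; a sketch assuming a Llama-style transformers model, with the 50% split point chosen purely for illustration:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
n_layers = model.config.num_hidden_layers
cutoff = n_layers // 2  # illustrative split; the paper allocates per task

for idx, layer in enumerate(model.model.layers):
    trainable = idx < cutoff            # shallow layers keep training
    for p in layer.parameters():
        p.requires_grad = trainable     # deep layers are frozen
```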
The article introduces a method for making OPD more lightweight, enabling efficient post-training of Large Language Models.
This paper introduces On-Policy Harness Self-Distillation (OPHSD), a method that internalizes the capabilities of inference-time reasoning harnesses into the base model through self-distillation. The approach improves standalone performance on complex reasoning tasks, allowing the model to retain reasoning scaffolds without permanent external dependencies.
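A sketch of the data-collection half of such self-distillation; `harness` and `judge` are stand-ins for the paper's scaffold and verifier, not its API:

```python
def ophsd_dataset(model, harness, prompts, judge):
    traces = []
    for prompt in prompts:
        # Run the model *inside* its reasoning harness (tools, retries,
        # structured scratchpads) to get a scaffolded rollout.
        answer, trace = harness.run(model, prompt)
        if judge(prompt, answer):      # keep only verified successes
            traces.append({"prompt": prompt, "completion": trace})
    # The bare, unscaffolded model is then fine-tuned on these pairs so it
    # no longer depends on the harness at inference time.
    return traces
```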
The article introduces DataArc-SynData-Toolkit, an open-source framework designed to simplify multi-path, multimodal, and multilingual synthetic data generation. It aims to lower technical barriers and improve usability for training large language models through a unified, configuration-driven pipeline.
This paper introduces Pion, a novel spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers.
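The spectrum-preserving property rests on a basic fact: multiplying W on both sides by orthogonal matrices leaves its singular values unchanged. A minimal sketch (the skew-symmetric parametrization is an assumption, not Pion's exact update rule):

```python
import torch

def orthogonal_equivalence_step(W, A, B, lr=1e-3):
    # The matrix exponential of a skew-symmetric matrix is orthogonal, so
    # U and V are orthogonal by construction, and U @ W @ V.T has exactly
    # the same singular values as W.
    U = torch.matrix_exp(lr * (A - A.T))
    V = torch.matrix_exp(lr * (B - B.T))
    return U @ W @ V.T
```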
Unsloth, an open-source library for efficient LLM training and inference, has officially joined the PyTorch Ecosystem to enhance accessibility and performance. The announcement highlights new features like Unsloth Studio and optimized kernels for reduced VRAM usage.
This paper proposes Shadow Mask Distillation (SMD) to solve the off-policy bias caused by KV cache compression during reinforcement learning post-training for large language models. It introduces a mechanism that ensures on-policy alignment and improves memory efficiency for long-context reasoning tasks.
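One way to read the on-policy alignment mechanism (heavily hedged; the interface below is hypothetical): recompute training log-probs under the same compression mask that produced the rollout, so the policy being updated matches the one that actually generated the data:

```python
def shadow_masked_logprobs(model, tokens, kv_keep_mask):
    # kv_keep_mask marks which cached positions survived compression during
    # generation; reusing it here keeps the training-time distribution
    # consistent with the rollout-time one (interface is a sketch).
    out = model(tokens, attention_mask=kv_keep_mask)
    return out.logits.log_softmax(dim=-1)
```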
This paper introduces a training-free diagnostic framework to analyze per-token distillation signals for reasoning models, revealing that guidance is more beneficial on incorrect rollouts and depends on student capacity and task context.
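The core diagnostic quantity is cheap to compute; a sketch scoring each token of a student rollout by teacher-to-student KL (variable names are mine):

```python
import torch.nn.functional as F

def per_token_kl(student_logits, teacher_logits):
    # KL(teacher || student) at every position: high values mark tokens
    # where teacher guidance would move the student most.
    s = F.log_softmax(student_logits, dim=-1)
    t = F.log_softmax(teacher_logits, dim=-1)
    return (t.exp() * (t - s)).sum(dim=-1)
```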
This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.
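As a guess at the flavor of signal involved (an assumption, not G-Zero's actual reward): predictive entropy of the model's own token distributions is one verifier-free quantity derivable from internal distributional dynamics:

```python
import torch

def intrinsic_reward(logits):
    # Mean negative token entropy: confident generations score higher.
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return -entropy.mean()
```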
This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.
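A hedged sketch of the sign-flip idea: on verified-correct rollouts, push the student away from the teacher where it already deviated successfully, instead of pulling it back. The exact gating is an assumption:

```python
def rlrt_loss(student_logp, teacher_logp, rollout_correct):
    # Per-token reverse KL between student and teacher over the vocabulary.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    sign = -1.0 if rollout_correct else 1.0  # reverse the signal on wins
    return sign * kl.mean()
```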
This article analyzes post-training methods for language models through a distributional perspective, comparing how SFT, RL, and on-policy distillation reshape model distributions and impact phenomena like catastrophic forgetting.
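The distributional framing is concrete: SFT minimizes a forward KL to a fixed data distribution, while on-policy distillation minimizes a reverse KL to a teacher evaluated on the student's own samples. A sketch of the two losses (this decomposition is standard; the code specifics are mine):

```python
import torch.nn.functional as F

def sft_loss(student_logits, data_tokens):
    # Forward KL to the data distribution: mode-covering, off-policy.
    return F.cross_entropy(student_logits, data_tokens)

def on_policy_distill_loss(student_logp, teacher_logp):
    # Reverse KL on student samples: mode-seeking, on-policy.
    p = student_logp.exp()
    return (p * (student_logp - teacher_logp)).sum(dim=-1).mean()
```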
The article claims that Stanford has released a free technique for training LLMs to adhere strictly to prompts, a skill Anthropic reportedly pays high salaries for. It urges readers to bookmark the resource before it is removed.
The author details the process of optimizing custom matrix multiplication kernels in Swift to train a Large Language Model on Apple Silicon, aiming to outperform C implementations by leveraging CPU, SIMD, AMX, and GPU capabilities.
MiniMax published a technical blog post providing an in-depth analysis of the systematic vocabulary degradation issue behind its M2 series large models' inability to output specific personal names. It reveals parameter shifts caused by a disconnect in data coverage between pre-training and post-training stages, and proposes an effective solution involving full-scale synthetic data for remediation.
This article recommends a UCLA-led online course on Reinforcement Learning for Large Language Models, covering theory, algorithms like PPO and RLHF, and practical coding exercises.
This paper introduces Entrocraft, a rejection-sampling method for RL that controls entropy schedules to prevent performance saturation in LLMs. It demonstrates improved generalization and training longevity, allowing smaller models to outperform larger baselines.
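A minimal sketch of entropy-gated rejection sampling (the threshold rule and names are assumptions; the paper's schedule is more elaborate):

```python
import torch

def entropy_filter(rollouts, logits, min_entropy):
    # Mean token entropy per rollout; drop rollouts that would collapse
    # policy entropy below the scheduled floor.
    probs = torch.softmax(logits, dim=-1)                          # (B, T, V)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean(-1)  # (B,)
    keep = ent >= min_entropy
    return [r for r, k in zip(rollouts, keep.tolist()) if k]
```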
This research investigates how task geometry influences continual post-training in LLMs, identifying 'geometry conflict' as a cause of forgetting and a mechanism for controlling update integration. The authors propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free method that improves retention and performance across various model sizes.
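A toy version of the conflict measure (illustrative; GCWM's Wasserstein-based merge itself is not shown): cosine similarity between two tasks' weight deltas, where strongly negative values flag updates pulling the same parameters in opposing directions:

```python
import torch

def geometry_conflict(delta_a, delta_b):
    # delta_* are flattened weight updates (task weights minus base weights).
    a, b = delta_a.flatten(), delta_b.flatten()
    return torch.dot(a, b) / (a.norm() * b.norm() + 1e-12)
```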