kl-divergence

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

arXiv cs.LG · 4d ago

This paper reveals that aggressive post-training quantization of reasoning models leads to increased overthinking errors, where models reach correct intermediate answers but fail to finalize them. A simple logit penalty on overthinking markers reduces chain-of-thought length by 12-23% while improving accuracy, especially for quantized models.
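
As a rough illustration of what a logit penalty on marker tokens could look like, the sketch below subtracts a fixed amount from the logits of chosen token ids before the softmax. The token ids, the penalty size, and the idea that id 3 stands for a marker such as "Wait" are all hypothetical; the summary does not say which markers or penalty values the paper uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def penalize_markers(logits, marker_ids, penalty):
    """Return a copy of `logits` with `penalty` subtracted at each marker id."""
    adjusted = list(logits)
    for token_id in marker_ids:
        adjusted[token_id] -= penalty
    return adjusted

logits = [2.0, 1.0, 0.5, 1.5]  # toy vocabulary of four tokens
marker_ids = {3}               # hypothetical id of an overthinking marker
before = softmax(logits)
after = softmax(penalize_markers(logits, marker_ids, penalty=2.0))
print(f"marker probability: {before[3]:.3f} -> {after[3]:.3f}")
```

The penalty only lowers the chance of sampling a marker; it does not forbid it, so the model can still revisit its reasoning when the evidence for doing so is strong.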

KL Zero: KL divergence intuition game

Hacker News Top · 2026-05-30

KL Zero is an interactive browser game where players draw a probability distribution to match a target KL divergence value, helping users intuitively understand the concept of KL divergence in machine learning.
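
For readers who want the quantity behind the game, here is a minimal sketch of KL divergence between two discrete distributions, in nats. The function name and the example distributions are illustrative and not taken from the game itself.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions of equal length."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # a term with p_i = 0 contributes nothing
        if qi == 0.0:
            return math.inf   # p puts mass where q puts none
        total += pi * math.log(pi / qi)
    return total

target = [0.9, 0.1]   # made-up example distributions
guess = [0.5, 0.5]
print(kl_divergence(target, guess))
print(kl_divergence(guess, target))
```

The two printed values differ, which is the main point to internalize: KL divergence is not symmetric, so it matters which distribution plays the role of the reference.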

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

arXiv cs.AI · 2026-05-29

EKSFT is a selective fine-tuning method for large language models that masks tokens with high entropy or high KL divergence from a reference model, preserving the pre-trained distribution while injecting task knowledge. In experiments on mathematical reasoning benchmarks, it outperforms standard SFT and improves subsequent RL fine-tuning.

Trust Region Q Adjoint Matching

Hugging Face Daily Papers · 2026-05-26

Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies. TRQAM consistently outperforms prior methods on 50 OGBench tasks, reaching a 68% success rate in offline RL versus 46% for the strongest baseline.

On-Policy Distillation (5 minute read)

TLDR AI · 2026-05-26

This paper introduces on-policy distillation, which trains a student model on its own trajectories with teacher token-level KL supervision to fix train-inference mismatch, unifying forward-KL, reverse-KL, and JSD losses, with reverse-KL favored for smaller students.
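
The three losses named above can be compared on a single token position with a small sketch. It assumes the common distillation convention that forward KL is KL(teacher || student) and reverse KL is KL(student || teacher); the probabilities are made up, and a real on-policy setup would average these terms over tokens sampled from the student.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def forward_kl(teacher, student):
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return kl(teacher, student)

def reverse_kl(teacher, student):
    """KL(student || teacher): penalizes student mass the teacher finds unlikely."""
    return kl(student, teacher)

def jsd(teacher, student):
    """Jensen-Shannon divergence: symmetric mix of the two directions."""
    mix = [0.5 * (t + s) for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)

teacher = [0.70, 0.20, 0.10]  # made-up next-token distributions
student = [0.40, 0.40, 0.20]
print(forward_kl(teacher, student), reverse_kl(teacher, student), jsd(teacher, student))
```

Reverse KL is often described as mode-seeking, which is one intuition for why it might suit a smaller student that cannot cover every mode of the teacher.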

@maximelabonne: This is so neat! Dynamic Fine-Tuning (DFT) reweights the SFT loss by the model's own token probability, which creates a…

X AI KOLs Following · 2026-05-20

Dynamic Fine-Tuning (DFT) reweights the SFT loss by the model's own token probability, which creates a feedback loop, and adds a forward-KL term to penalize tokens that the base model finds likely but the policy has pushed toward zero probability. The tweet is skeptical of how SFT papers hold up in practice but praises the attempt.

Measuring Goodhart’s law

OpenAI Blog · 2022-04-13

OpenAI research formally analyzes Goodhart's law through best-of-n sampling, providing efficient estimators for measuring how well proxy objectives track true objectives and quantifying optimization effort via KL divergence.
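
To give a feel for how KL divergence measures optimization effort in best-of-n sampling, the sketch below evaluates the commonly quoted closed form log n − (n − 1)/n for the KL between the best-of-n distribution and the base distribution. Treat it as an illustration: the expression assumes an idealized setting with no ties between samples, and the function name is made up here.

```python
import math

def best_of_n_kl(n):
    """KL (in nats) between best-of-n and the base policy: log n - (n - 1) / n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.log(n) - (n - 1) / n

for n in (1, 2, 4, 16, 64):
    print(n, round(best_of_n_kl(n), 3))
```

The value grows only logarithmically in n, so even a large number of samples corresponds to a modest move away from the base policy when measured in KL.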
