token-regularization


The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

arXiv cs.CL · 2026-06-08

Proposes the Piggyback Hypothesis that chat-template tokens can cause emergent misalignment in LLMs, and introduces Token-Regularized Finetuning (TReFT) to mitigate it while preserving in-domain learning.


