token-regularization

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

arXiv cs.CL · 2026-06-08

Proposes the Piggyback Hypothesis, which holds that chat-template tokens can cause emergent misalignment in LLMs, and introduces Token-Regularized Finetuning (TReFT) to mitigate that misalignment while preserving in-domain learning.
