Tag
This paper introduces a method to predict activation steering effectiveness in language models from early decoding states using a Gradient Boosting Decision Trees (GBDT) classifier, enabling efficient steering strength optimization without full rollouts.
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.