Tag
This paper identifies a 'positional copying' shortcut in which small language models answer arithmetic questions by copying the last number before the answer delimiter instead of reasoning. The shortcut explains why performance survives shuffling of chain-of-thought (CoT) steps, and it accounts for 89-92% of teacher-forcing accuracy in 1-3B-parameter models on GSM8K.
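The positional copying shortcut can be illustrated with a minimal sketch of a copy-only baseline. It is not the paper's code: the function name `copy_last_number`, the number regex, and the use of GSM8K's `####` marker as the answer delimiter are all assumptions made for illustration.

```python
import re

# Assumption: the answer delimiter is GSM8K's "####" marker.
DELIMITER = "####"

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def copy_last_number(text, delimiter=DELIMITER):
    """Return the last number that appears before the delimiter, or None.

    This is a hypothetical baseline that does no arithmetic at all:
    it only copies by position.
    """
    prefix = text.split(delimiter, 1)[0]
    matches = NUMBER.findall(prefix)
    if not matches:
        return None
    # Drop thousands separators so "1,200" compares equal to "1200".
    return matches[-1].replace(",", "")


steps = [
    "Tom has 3 boxes with 4 apples each, so 3 * 4 = 12 apples.",
    "He gives away 5, so 12 - 5 = 7 apples remain.",
]
print(copy_last_number(" ".join(steps) + " #### "))
```

Scoring a baseline like this against gold answers gives a rough estimate of how much accuracy is reachable by position alone, which is the kind of comparison the summary's 89-92% figure refers to.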
SDSR proposes lightweight, self-describing structured data with dual-layer guidance that exploits LLM primacy bias, achieving 100% routing accuracy without a vector database.