Tag
This paper tests whether the standard recipe for identifying attention-head circuits, by task-pattern selectivity and causal ablation, yields consistent mechanistic claims across different 1B-class language model families (Pythia, OLMo, OLMoE). It finds that no two (task, model) cells share the same primary causal screen, and it introduces a five-category taxonomy of screen outcomes. The MoE model, OLMoE, shows a distinct previous-token positional substrate.
This paper proposes Byte-Level Distillation (BLD), a simple method for cross-tokenizer knowledge transfer in language models that operates at a shared byte-level interface. Across 1B-8B parameter models, it performs competitively with, or better than, more complex existing approaches.