Introduces Duplicated Latent Residual (DLR), a training-only, parameter-free plug-in for low-rank pre-training. DLR improves perplexity across LLaMA models from 60M to 7B parameters and can be folded into the model after training at zero inference cost.
The Prism Transformer replaces uniform multi-head attention with a progressive head schedule that increases head count across layers, enabling a local-to-global hierarchy without extra parameters or FLOPs. It consistently outperforms standard Transformers on language modeling and zero-shot benchmarks at 124M, 354M, and 757M scales.
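A minimal sketch of why a per-layer head schedule need not add parameters, assuming the model width stays fixed and each head's dimension is the width divided by the head count. The staged doubling schedule and the names `head_schedule` and `attention_proj_params` are hypothetical illustrations, not the schedule or code from the Prism Transformer paper.

```python
from typing import List


def head_schedule(num_layers: int, min_heads: int, max_heads: int) -> List[int]:
    """Hypothetical schedule: head count doubles in evenly sized stages of layers."""
    stages = []
    heads = min_heads
    while heads <= max_heads:
        stages.append(heads)
        heads *= 2
    # Map each layer index to a stage so early layers get few heads, late layers many.
    return [stages[min(i * len(stages) // num_layers, len(stages) - 1)]
            for i in range(num_layers)]


def attention_proj_params(d_model: int, num_heads: int) -> int:
    """Weights in the Q, K, V, and output projections (biases ignored)."""
    assert d_model % num_heads == 0, "head count must divide the model width"
    head_dim = d_model // num_heads
    # Each projection maps d_model to num_heads * head_dim, which equals d_model.
    return 4 * d_model * num_heads * head_dim


schedule = head_schedule(num_layers=12, min_heads=2, max_heads=16)
params_per_layer = [attention_proj_params(768, h) for h in schedule]
print(schedule)
print(set(params_per_layer))
```

Under these assumptions, changing the head count only changes how the fixed width is split across heads, so every layer's projection matrices keep the same size.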