zero-overhead

DLR: Zero-Inference-Cost Latent Residuals for Low-Rank Pre-Training

arXiv cs.LG · 5d ago

Introduces Duplicated Latent Residual (DLR), a training-only, parameter-free plug-in for low-rank pre-training that improves perplexity across LLaMA models from 60M to 7B parameters, and can be folded into the model after training with zero inference cost.

Prism Transformer: Progressive Head Schedules for Hierarchical Attention Processing

arXiv cs.LG · 6d ago

The Prism Transformer replaces uniform multi-head attention with a progressive head schedule that increases head count across layers, enabling a local-to-global hierarchy without extra parameters or FLOPs. It consistently outperforms standard Transformers on language modeling and zero-shot benchmarks at 124M, 354M, and 757M scales.
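
The "no extra parameters" claim follows from how standard multi-head attention is sized: when the model width is fixed, changing the head count only changes how that width is split. The sketch below illustrates this under stated assumptions. The staged doubling schedule, the function names, and the sizes (12 layers, width 768, 2 to 16 heads) are hypothetical choices for illustration, not the schedule or configuration from the paper.

```python
def attention_param_count(d_model: int, n_heads: int) -> int:
    """Weights in the Q, K, V and output projections of one standard
    multi-head attention layer (biases ignored)."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = d_model // n_heads
    # Q, K and V each map d_model -> n_heads * head_dim (= d_model).
    qkv = 3 * d_model * (n_heads * head_dim)
    # The output projection maps the concatenated heads back to d_model.
    out = (n_heads * head_dim) * d_model
    return qkv + out


def progressive_head_schedule(n_layers: int, min_heads: int, max_heads: int) -> list[int]:
    """Hypothetical schedule: the head count doubles from min_heads up to
    max_heads, with layers split into roughly equal stages."""
    stages = []
    heads = min_heads
    while heads <= max_heads:
        stages.append(heads)
        heads *= 2
    # Map layer i to a stage index, so early layers get few heads, late layers many.
    return [stages[min(i * len(stages) // n_layers, len(stages) - 1)]
            for i in range(n_layers)]


# Assumed sizes for illustration only.
schedule = progressive_head_schedule(n_layers=12, min_heads=2, max_heads=16)
per_layer_params = [attention_param_count(768, h) for h in schedule]
print(schedule)
print(set(per_layer_params))
```

Every layer in this sketch reports the same projection parameter count, because fewer heads simply means wider heads. How the paper chooses the head counts per layer, and how it accounts for FLOPs, is not stated in the summary above.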
