Tag
Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than cumulative hidden states, achieving 1.7–8.2% validation perplexity gains across scales from 220M to 7.6B parameters.