layer-routing

Delta Attention Residuals

Hugging Face Daily Papers · 2026-05-13

Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than to cumulative hidden states. The reported improvement in validation perplexity is 1.7–8.2% across model scales from 220M to 7.6B parameters.
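To make the idea of attending to deltas concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the summary does not specify the architecture, so the function name `delta_attention_residual`, the projection matrices `w_q` and `w_k`, the choice of the last hidden state as the query, and the treatment of the embedding output as the first "delta" are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def delta_attention_residual(hidden_states, w_q, w_k):
    """Route over per-layer changes instead of cumulative hidden states.

    hidden_states: list [h_0, ..., h_L] of arrays with shape (seq, d),
                   where h_0 is the embedding output (an assumption).
    w_q, w_k:      hypothetical projection matrices of shape (d, d_k).
    Returns (routed, weights) with shapes (seq, d) and (seq, L + 1).
    """
    h = np.stack(hidden_states)                      # (L+1, seq, d)
    # Delta of layer l is h_l - h_{l-1}; h_0 is kept as the first entry
    # so that the deltas sum back to the final hidden state.
    deltas = np.concatenate([h[:1], h[1:] - h[:-1]], axis=0)

    q = h[-1] @ w_q                                  # (seq, d_k)
    k = deltas @ w_k                                 # (L+1, seq, d_k)

    # One attention distribution over depth for every token position.
    scores = np.einsum("sd,lsd->sl", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)               # (seq, L+1)

    # Weighted mix of the per-layer changes.
    routed = np.einsum("sl,lsd->sd", weights, deltas)
    return routed, weights
```

In this sketch, attending over the cumulative states `h` instead of `deltas` would be the baseline the summary contrasts against. Because the deltas telescope, uniform attention weights return the final hidden state scaled by `1 / (L + 1)`.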
