#sparse-transformers

Adaptive Computation Depth via Learned Token Routing in Transformers

arXiv cs.LG · 2d ago

This paper presents Token-Selective Attention (TSA), a differentiable token-routing mechanism that learns to skip unnecessary computation per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.
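The paper's exact routing formulation isn't reproduced here, but a minimal PyTorch sketch illustrates the general idea of learned, differentiable per-token skipping: a small router scores each token, and tokens gated off bypass the expensive sublayer through the residual path. The class name `TokenSkipLayer`, the sigmoid router, and the straight-through threshold are illustrative assumptions, not TSA's actual design.

```python
import torch
import torch.nn as nn

class TokenSkipLayer(nn.Module):
    """Sketch of per-token layer skipping with a learned gate.

    A linear router scores each token; a hard threshold decides whether
    the token goes through the sublayer or keeps only its residual.
    A straight-through estimator keeps the binary decision differentiable.
    (Hypothetical sketch -- not the paper's TSA formulation.)
    """

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.router = nn.Linear(d_model, 1)  # per-token skip score
        self.sublayer = sublayer             # e.g. an attention or MLP block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        soft = torch.sigmoid(self.router(x))   # (B, T, 1), soft gate in (0, 1)
        hard = (soft > 0.5).float()            # binary routing decision
        gate = hard + soft - soft.detach()     # straight-through: hard forward, soft gradient
        # Gated tokens pass through the sublayer; skipped tokens keep the residual only.
        return x + gate * self.sublayer(x)

# Usage with a toy MLP sublayer:
layer = TokenSkipLayer(
    d_model=64,
    sublayer=nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)),
)
y = layer(torch.randn(2, 16, 64))  # (2, 16, 64)
```

Note that this naive form still evaluates the sublayer for every token and only masks the output; to actually save the 14–23% of token-layer operations the abstract claims, an implementation would gather the routed tokens and run the sublayer on that smaller batch.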
