attention-sink

P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

arXiv cs.AI · 3d ago

This paper analyzes the precision loss that the attention sink phenomenon causes in FP8 attention when the softmax output P is cast to FP8 (E4M3). It shows that iterating over the KV sequence in forward order makes the non-sink attention values underflow, and it proposes reverse iteration together with a static scaling factor S = 256 to eliminate the underflow, achieving a 3-10x MSE improvement.
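The underflow and the effect of a static scale can be illustrated with a small simulation. This is a minimal sketch and not the paper's code: it assumes the common E4M3 variant with 3 mantissa bits, a smallest subnormal of 2^-9, and a largest finite value of 448, and it models the cast with round-to-nearest in NumPy. The names `quantize_e4m3` and `non_sink_mass_kept`, and the toy probability row, are hypothetical. Reverse iteration is not modeled here.

```python
import numpy as np

# Assumed E4M3 parameters (not taken from the paper).
E4M3_MAX = 448.0          # largest finite value
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal number
E4M3_MANTISSA_BITS = 3


def quantize_e4m3(x):
    """Simulate casting non-negative values to E4M3 (round to nearest)."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, E4M3_MAX)
    _, exp = np.frexp(x)  # x = m * 2**exp with m in [0.5, 1)
    # Subnormals share the spacing of the smallest normal binade.
    e = np.maximum(exp - 1, E4M3_MIN_NORMAL_EXP)
    step = np.exp2((e - E4M3_MANTISSA_BITS).astype(np.float64))
    return np.round(x / step) * step


def non_sink_mass_kept(p, scale):
    """Fraction of non-sink probability mass surviving the cast.

    Index 0 is treated as the sink token (an assumption of this toy setup).
    """
    q = quantize_e4m3(p * scale) / scale
    return q[1:].sum() / p[1:].sum()


# Toy softmax row: one sink holding 97% of the mass, 100 small entries.
p = np.concatenate(([0.97], np.full(100, 0.0003)))

print("kept without scaling:", non_sink_mass_kept(p, 1.0))
print("kept with S = 256:   ", non_sink_mass_kept(p, 256.0))
```

In this toy row every non-sink entry is below half of the smallest subnormal, so the unscaled cast rounds all of them to zero, while multiplying by 256 before the cast moves them into the representable range. Under the assumed format, 256 is also the largest power of two that keeps a scaled probability of 1.0 within the finite range, since 512 exceeds 448; whether this is the paper's optimality argument is not stated in the summary.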
