operator-fusion

Operator Fusion for LLM Inference on the Tensix Architecture

arXiv cs.LG · 5d ago

This paper proposes an operator fusion strategy for LLM inference on Tenstorrent's Tensix architecture, fusing RMSNorm with matrix multiplications to improve data locality and reduce DRAM accesses. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP.
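
To make the idea of fusing RMSNorm with a matrix multiplication concrete, here is a minimal pure-Python sketch of the algebraic identity that permits such a fusion. It is not the paper's Tensix kernel and does not model the hardware: the function names, the `EPS` value, and the choice to fold the RMSNorm weight `gamma` into the projection matrix are all assumptions made for illustration. The point is only that the normalized activations never need to be written out as a separate tensor, which is the kind of intermediate traffic a fused operator can avoid.

```python
import math

EPS = 1e-6  # assumed epsilon; the summary does not state one


def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as lists of rows."""
    return [
        [sum(row[t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for row in a
    ]


def inv_rms(row):
    """Reciprocal root mean square of one activation row."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + EPS)


def rmsnorm(x, gamma):
    """Standard RMSNorm: normalize each row, then scale elementwise by gamma."""
    return [[v * inv_rms(row) * g for v, g in zip(row, gamma)] for row in x]


def unfused(x, gamma, w):
    """Two operators: the normalized tensor is materialized before the matmul."""
    return matmul(rmsnorm(x, gamma), w)


def fold_gamma(gamma, w):
    """One-time preprocessing: scale row i of w by gamma[i]."""
    return [[g * v for v in row] for g, row in zip(gamma, w)]


def fused(x, w_folded):
    """One pass: matmul on the raw activations, then a per-row rescale."""
    y = matmul(x, w_folded)
    return [[v * inv_rms(row_x) for v in row_y] for row_x, row_y in zip(x, y)]


# Hypothetical toy inputs: 2 tokens, hidden size 4, output size 3.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
gamma = [1.0, 0.5, 2.0, 1.5]
w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]

reference = unfused(x, gamma, w)
result = fused(x, fold_gamma(gamma, w))
max_diff = max(
    abs(a - b) for r_ref, r_out in zip(reference, result) for a, b in zip(r_ref, r_out)
)
```

Both paths compute the same projection up to floating-point rounding, so `max_diff` should be tiny. In the fused path the per-row scale factor depends only on the input row, so it can be computed alongside the matmul and applied to the output, rather than producing a normalized copy of the activations first.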
