This paper proposes an operator fusion strategy for LLM inference on Tenstorrent's Tensix architecture, fusing RMSNorm with matrix multiplications to improve data locality and reduce DRAM accesses. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP.
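As a rough illustration of why fusing RMSNorm with a following matrix multiplication can save memory traffic, the sketch below relies on a standard algebraic identity: the per-row factor 1/rms(x) can be applied after the matmul instead of before it. This is not the paper's Tensix kernel. The function names (`rmsnorm`, `matvec`, `unfused`, `fused`), the `eps` value, and the pure-Python loops are assumptions made only for this example.

```python
import math

def rmsnorm(x, gamma, eps=1e-6):
    """Reference RMSNorm over one row vector x with learned scale gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gamma)]

def matvec(x, w):
    """Row vector x (length d) times matrix w (d rows, n columns)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]

def unfused(x, gamma, w, eps=1e-6):
    # Two steps: the normalized activation is materialized as an
    # intermediate list, then read again by the matmul.
    return matvec(rmsnorm(x, gamma, eps), w)

def fused(x, gamma, w, eps=1e-6):
    # Hypothetical fused variant: a single pass over x accumulates both the
    # sum of squares and the gamma-scaled dot products, then the result is
    # rescaled once by 1/rms. No normalized intermediate is stored.
    d, n = len(x), len(w[0])
    sq = 0.0
    acc = [0.0] * n
    for i in range(d):
        sq += x[i] * x[i]
        xg = x[i] * gamma[i]
        for j in range(n):
            acc[j] += xg * w[i][j]
    inv_rms = 1.0 / math.sqrt(sq / d + eps)
    return [a * inv_rms for a in acc]
```

Both functions compute the same values up to floating-point rounding. In a real kernel the benefit would come from keeping the intermediate in on-chip memory rather than writing it to DRAM and reading it back, which is the data-locality effect the abstract describes.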