TRINE is a single-bitstream FPGA accelerator and compiler for end-to-end multimodal inference, unifying diverse layers and incorporating runtime-adaptive compute modes, token pruning, and dependency-aware offloading, achieving up to 22.57x latency reduction over an RTX 4090 at 20-21 W.
This paper presents a systematic benchmark of token pruning, a compression technique that removes the tokens and embeddings of irrelevant languages, applied to Korean-centric LLM tasks. The study evaluates popular multilingual models (Qwen3, Gemma-3, Llama-3, Aya) across different vocabulary configurations and finds that token pruning significantly improves generation stability and reduces memory footprint for domain-specific deployments.
This paper introduces STOP (Super Token for Pruning), a lightweight method that learns to prune unpromising reasoning paths early during parallel decoding by appending learnable tokens and reading KV cache states, achieving 70% token reduction while improving performance on AIME and GPQA benchmarks.