token-pruning

TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

arXiv cs.AI · 2026-06-01

TRINE is a single-bitstream FPGA accelerator and compiler for end-to-end multimodal inference. It unifies diverse layers and incorporates runtime-adaptive compute modes, token pruning, and dependency-aware offloading, achieving up to a 22.57x latency reduction over an RTX 4090 while drawing 20-21 W.

Optimizing Korean-Centric LLMs via Token Pruning

arXiv cs.CL · 2026-04-20

This paper presents a systematic benchmark of token pruning, a compression technique that removes the tokens and embeddings of irrelevant languages, applied to Korean-centric LLM tasks. The study evaluates popular multilingual models (Qwen3, Gemma-3, Llama-3, Aya) across different vocabulary configurations and finds that token pruning significantly improves generation stability and reduces memory footprint for domain-specific deployments.
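
As a rough illustration of what removing tokens and their embeddings involves, the sketch below drops vocabulary entries that fail a language filter, keeps the matching embedding rows, and builds an old-to-new ID map. It is a minimal sketch under stated assumptions, not the paper's procedure: `prune_vocabulary`, `is_korean_or_ascii`, and the toy vocabulary are all hypothetical, and a real model would also need its tokenizer and output projection updated.

```python
def prune_vocabulary(vocab, embeddings, keep):
    """Drop every token for which keep(token) is False.

    vocab:      dict mapping token string -> old id (ids are 0..V-1)
    embeddings: list of embedding rows, indexed by old id
    Returns (new_vocab, new_embeddings, old_to_new).
    """
    # Keep surviving tokens in their original id order.
    kept = sorted((old_id, tok) for tok, old_id in vocab.items() if keep(tok))
    new_vocab, new_embeddings, old_to_new = {}, [], {}
    for new_id, (old_id, tok) in enumerate(kept):
        new_vocab[tok] = new_id
        new_embeddings.append(embeddings[old_id])  # reuse the existing row
        old_to_new[old_id] = new_id
    return new_vocab, new_embeddings, old_to_new


def is_korean_or_ascii(token):
    # Hypothetical filter: keep ASCII characters and Hangul syllables only.
    return all(ord(ch) < 128 or 0xAC00 <= ord(ch) <= 0xD7A3 for ch in token)


# Toy vocabulary and 2-dimensional embeddings (illustrative values only).
vocab = {"<s>": 0, "hello": 1, "안녕": 2, "你好": 3, "こんにちは": 4, "세계": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

new_vocab, new_embeddings, old_to_new = prune_vocabulary(
    vocab, embeddings, is_korean_or_ascii
)
print(len(vocab), "->", len(new_vocab), "tokens")
```

The memory saving comes from the smaller embedding table: every dropped token removes one row, and the ID map lets existing token sequences be re-encoded against the pruned vocabulary.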

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

Hugging Face Daily Papers · 2026-04-17

This paper introduces STOP (Super Token for Pruning), a lightweight method that learns to prune unpromising reasoning paths early during parallel decoding by appending learnable tokens and reading KV cache states, achieving 70% token reduction while improving performance on AIME and GPQA benchmarks.
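
The general shape of early path pruning can be sketched as follows: score each partial reasoning path, keep the most promising fraction, and stop decoding the rest. This is only a sketch under loud assumptions. In STOP the score is learned from KV cache states through appended tokens; here a plain numeric `score` field stands in for that signal, and `prune_paths`, `keep_ratio`, and the sample data are hypothetical.

```python
def prune_paths(paths, score_fn, keep_ratio=0.5):
    """Return the highest-scoring fraction of partial reasoning paths.

    paths:      list of partial paths (any objects score_fn understands)
    score_fn:   callable mapping a path to a number, higher is better
    keep_ratio: fraction of paths to keep; at least one path survives
    """
    if not paths:
        return []
    n_keep = max(1, int(len(paths) * keep_ratio))
    ranked = sorted(paths, key=score_fn, reverse=True)
    return ranked[:n_keep]


# Four partial paths with stand-in scores (illustrative values only).
paths = [
    {"id": "a", "score": 0.9},
    {"id": "b", "score": 0.2},
    {"id": "c", "score": 0.7},
    {"id": "d", "score": 0.1},
]

survivors = prune_paths(paths, score_fn=lambda p: p["score"], keep_ratio=0.5)
print([p["id"] for p in survivors])
```

Pruning early matters because every discarded path saves all the tokens it would otherwise have generated, which is where a token reduction of the kind the summary reports would come from.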
