EarlyTom: Early Token Compression Completes Fast Video Understanding
Summary
EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining accuracy, achieving up to 2.65x TTFT reduction.
Cached at: 05/29/26, 03:02 PM
Source: https://huggingface.co/papers/2605.30010
Abstract
EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining model accuracy.
Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.
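To make the abstract's argument concrete, the following is a minimal back-of-the-envelope sketch of why dropping tokens partway through a transformer encoder saves encoder compute, whereas compressing only after the encoder saves none. It is not EarlyTom's algorithm: the per-layer FLOP formula is a common textbook approximation, and the function names (`layer_flops`, `encoder_flops`), the parameters `compress_at` and `keep_ratio`, and all numeric values are hypothetical choices for illustration, not figures from the paper.

```python
from typing import Optional


def layer_flops(n_tokens: int, d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate FLOPs of one transformer layer (1 multiply-add = 2 FLOPs)."""
    proj = 8 * n_tokens * d_model ** 2              # Q, K, V and output projections
    attn = 4 * n_tokens ** 2 * d_model              # QK^T scores and weighted sum of V
    mlp = 4 * mlp_ratio * n_tokens * d_model ** 2   # two linear layers of the MLP
    return proj + attn + mlp


def encoder_flops(
    n_tokens: int,
    d_model: int,
    n_layers: int,
    compress_at: Optional[int] = None,  # hypothetical: layer index where tokens are dropped
    keep_ratio: float = 1.0,            # hypothetical: fraction of tokens kept
) -> int:
    """Sum layer FLOPs, shrinking the token count from layer `compress_at` onward."""
    total = 0
    n = n_tokens
    for layer in range(n_layers):
        if compress_at is not None and layer == compress_at:
            n = max(1, int(n * keep_ratio))
        total += layer_flops(n, d_model)
    return total


# Illustrative values only (assumptions, not taken from the paper).
N_TOKENS, D_MODEL, N_LAYERS = 1000, 1024, 24

# Compressing after the encoder leaves the encoder's own cost unchanged.
full = encoder_flops(N_TOKENS, D_MODEL, N_LAYERS)
# Compressing inside the encoder makes every later layer cheaper.
early = encoder_flops(N_TOKENS, D_MODEL, N_LAYERS, compress_at=6, keep_ratio=0.25)

print(f"encoder FLOPs without in-encoder compression: {full:.3e}")
print(f"encoder FLOPs with compression at layer 6:    {early:.3e}")
print(f"relative encoder FLOPs saved:                 {1 - early / full:.1%}")
```

Under this approximation, the saving grows the earlier the compression point is placed, since more layers run on the reduced token set. The estimate covers encoder compute only and says nothing about accuracy, which depends on which tokens are kept.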
Similar Articles
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
LiteFrame proposes a lightweight video encoder with Compressed Token Distillation training that reduces latency and enables processing 8x more frames for long-form video understanding in Video LLMs, improving accuracy while reducing compute.
LiteFrame Scales Video LLM Efficiency (6 minute read)
LiteFrame introduces a highly efficient video encoder for Video LLMs that uses Compressed Token Distillation to enable up to 8x more frames and 35% latency reduction while maintaining accuracy, setting a new Pareto frontier for long-form video understanding.
AdaCodec: A Predictive Visual Code for Video MLLMs
AdaCodec reduces video encoding redundancy in multimodal LLMs by transmitting full visual tokens only when scene prediction fails, otherwise using compact inter-frame change descriptions. It outperforms per-frame RGB baselines at matched token budgets and achieves better or comparable results with significantly fewer tokens, reducing time-to-first-token from 9.26s to 1.62s.
Balancing Image Compression and Generation with Bootstrapped Tokenization
Introduces SelfBootTok, a self-bootstrapped tokenization method that separates global and local information, reducing generator computation by ~40% and achieving a new state-of-the-art gFID of 1.56 with only 64 tokens.
Efficient Pre-Training with Token Superposition
Token-Superposition Training (TST) improves LLM pre-training efficiency by combining contiguous tokens into bags during a superposition phase with a multi-hot cross-entropy objective, achieving up to 2.5x reduction in training time without architectural changes.