EarlyTom: Early Token Compression Completes Fast Video Understanding

Hugging Face Daily Papers

Summary

EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining accuracy, achieving up to 2.65x TTFT reduction.

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency of processing massive numbers of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding accounts for a large portion of the time-to-first-token (TTFT). Performing compression inside the encoder, rather than only after it, therefore still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.
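
To illustrate why compressing inside the encoder can save more compute than compressing after it, the following is a minimal sketch of a FLOP estimate for a transformer vision encoder. It is not the EarlyTom method: the function names, the uniform `keep_ratio`, the layer index `compress_at`, and all sizes (1024 tokens, width 1024, 24 layers) are assumptions chosen for illustration, and the cost model ignores normalization, softmax, and patch embedding.

```python
# Toy FLOP model for a transformer vision encoder (illustrative only).
# All sizes below are hypothetical and are not taken from the EarlyTom paper.

def layer_flops(n_tokens: int, d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate forward FLOPs of one transformer layer (2 FLOPs per multiply-add)."""
    proj = 4 * n_tokens * d_model ** 2              # Q, K, V and output projections
    attn = 2 * n_tokens ** 2 * d_model              # QK^T scores and attention-weighted sum
    mlp = 2 * mlp_ratio * n_tokens * d_model ** 2   # two MLP matmuls
    return 2 * (proj + attn + mlp)


def encoder_flops(n_tokens, d_model, n_layers, compress_at=None, keep_ratio=1.0):
    """Sum layer FLOPs, dropping tokens just before layer `compress_at` (0-indexed)."""
    total, n = 0, n_tokens
    for layer in range(n_layers):
        if compress_at is not None and layer == compress_at:
            # Keep only a fraction of the tokens from this layer onward.
            n = max(1, int(n * keep_ratio))
        total += layer_flops(n, d_model)
    return total


if __name__ == "__main__":
    # Compressing only after the encoder leaves the encoder cost at `full`.
    full = encoder_flops(1024, 1024, 24)
    # Hypothetical early compression: keep 25% of tokens from layer 6 onward.
    early = encoder_flops(1024, 1024, 24, compress_at=6, keep_ratio=0.25)
    print(f"encoder FLOPs kept with early compression: {early / full:.1%}")
```

In this toy model, any compression applied after the last encoder layer leaves the encoder cost unchanged, whereas moving `compress_at` earlier shrinks every subsequent layer. How tokens are actually selected, and at which layers, is specific to the paper and is not modeled here.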

Cached at: 05/29/26, 03:02 PM

Source: https://huggingface.co/papers/2605.30010

Abstract

EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining model accuracy.

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency of processing massive numbers of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding accounts for a large portion of the time-to-first-token (TTFT). Performing compression inside the encoder, rather than only after it, therefore still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

Similar Articles

LiteFrame Scales Video LLM Efficiency (6 minute read)

TLDR AI

LiteFrame introduces a highly efficient video encoder for Video LLMs that uses Compressed Token Distillation to enable up to 8x more frames and 35% latency reduction while maintaining accuracy, setting a new Pareto frontier for long-form video understanding.

AdaCodec: A Predictive Visual Code for Video MLLMs

Hugging Face Daily Papers

AdaCodec reduces video encoding redundancy in multimodal LLMs by transmitting full visual tokens only when scene prediction fails, otherwise using compact inter-frame change descriptions. It outperforms per-frame RGB baselines at matched token budgets and achieves better or comparable results with significantly fewer tokens, reducing time-to-first-token from 9.26s to 1.62s.

Efficient Pre-Training with Token Superposition

Hugging Face Daily Papers

Token-Superposition Training (TST) improves LLM pre-training efficiency by combining contiguous tokens into bags during a superposition phase with a multi-hot cross-entropy objective, achieving up to 2.5x reduction in training time without architectural changes.
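
The TTFT figures quoted on this page come in two forms: a ratio (EarlyTom, up to 2.65x) and a before/after pair (AdaCodec, 9.26s to 1.62s). As a small sketch for putting them on a common footing, the helpers below convert between the two. The function names are hypothetical, and the conversion assumes "Nx reduction" means the baseline TTFT divided by the new TTFT. The papers use different models and setups, so the resulting numbers are not directly comparable.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Speedup factor implied by a before/after latency pair (in seconds)."""
    return before_s / after_s


def remaining_fraction(speedup_factor: float) -> float:
    """Fraction of the baseline latency that remains after an Nx speedup."""
    return 1.0 / speedup_factor


if __name__ == "__main__":
    # AdaCodec: 9.26s -> 1.62s, as quoted in the summary above.
    print(f"AdaCodec TTFT speedup: {speedup(9.26, 1.62):.2f}x")
    # EarlyTom: up to 2.65x, read here as baseline TTFT / new TTFT.
    print(f"EarlyTom TTFT remaining: {remaining_fraction(2.65):.1%} of baseline")
```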