EarlyTom: Early Token Compression Completes Fast Video Understanding

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

Summary

EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining accuracy, achieving up to 2.65x TTFT reduction.

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency of processing massive numbers of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding accounts for a large portion of the time-to-first-token (TTFT). Compressing visual tokens only after the vision encoder therefore leaves this cost untouched, while compression inside the encoder remains largely unexplored. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.
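To make the argument about encoder cost concrete, here is a rough cost model comparing compression after the encoder with compression partway through it. This is only a sketch: the function names, the per-block FLOP approximation, and the numbers used (1024 tokens, width 1024, 24 layers, compression at layer 6, 25% retention) are illustrative assumptions, not values or code from the paper.

```python
def layer_flops(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate forward FLOPs of one transformer block (1 multiply-add = 2 FLOPs)."""
    projections = 4 * 2 * n_tokens * dim * dim          # Q, K, V and output projections
    attention = 2 * 2 * n_tokens * n_tokens * dim       # QK^T plus attention-weighted sum of V
    mlp = 2 * 2 * n_tokens * dim * (mlp_ratio * dim)    # two linear layers of the MLP
    return projections + attention + mlp


def encoder_flops(n_tokens, dim, n_layers, compress_at=None, keep_ratio=1.0):
    """Total encoder FLOPs; tokens are reduced once, just before layer `compress_at`."""
    total = 0
    tokens = n_tokens
    for layer in range(n_layers):
        if compress_at is not None and layer == compress_at:
            tokens = max(1, int(tokens * keep_ratio))
        total += layer_flops(tokens, dim)
    return total


# Hypothetical encoder: 1024 tokens, width 1024, 24 layers.
late = encoder_flops(1024, 1024, 24)  # compress only after the encoder: encoder cost unchanged
early = encoder_flops(1024, 1024, 24, compress_at=6, keep_ratio=0.25)

print(f"encoder FLOPs, late compression : {late:.3e}")
print(f"encoder FLOPs, early compression: {early:.3e}")
print(f"encoder FLOPs saved             : {1 - early / late:.1%}")
```

Under this model, any compression applied after the last encoder layer leaves the encoder term unchanged, which is why a method that only prunes during prefilling cannot reduce the vision-encoding share of TTFT.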

Original Article

Cached at: 05/29/26, 03:02 PM

Paper page - EarlyTom: Early Token Compression Completes Fast Video Understanding

Source: https://huggingface.co/papers/2605.30010

Abstract

EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining model accuracy.

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.
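The abstract does not say how EarlyTom scores or selects tokens, so the following is only a generic sketch of what training-free token selection can look like: rank tokens by a fixed saliency score and keep the top fraction, with no learned parameters. The function names and the L2-norm score are hypothetical stand-ins, not the paper's decoupled spatial token selection strategy.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token embedding (placeholder saliency score)."""
    return math.sqrt(sum(x * x for x in vec))


def select_tokens(tokens, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    Returns (kept_indices, kept_tokens). Nothing here is trained; the score is
    computed directly from the token values.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    scores = [l2_norm(t) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore spatial/temporal order
    return kept, [tokens[i] for i in kept]


# Four toy 2-dimensional "tokens"; keep half of them.
toy_tokens = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 1.0]]
kept_indices, kept_tokens = select_tokens(toy_tokens, keep_ratio=0.5)
print(kept_indices, kept_tokens)
```

Applied between encoder layers rather than after the final one, a step like this shrinks the sequence that every later layer has to process, which is the general idea the abstract describes.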

Get this paper in your agent:

hf papers read 2605.30010

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

LiteFrame Scales Video LLM Efficiency (6 minute read)

AdaCodec: A Predictive Visual Code for Video MLLMs
