Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]

Reddit r/MachineLearning 06/11/26, 09:32 AM Papers

Summary

This paper introduces an adaptive video tokenisation method that exploits temporal redundancy in the latent space of a frozen continuous tokeniser to allocate tokens dynamically, without auxiliary routing networks. The proposed Latent Inpainting Transformer (LIT) reconstructs the dropped positions. The authors report a 31x inference-time speedup over ElasticTok-CV and a 2x speedup over InfoTok.

link - [https://arxiv.org/abs/2606.06158](https://arxiv.org/abs/2606.06158)

Abstract: Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a 31x inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and a 2x speedup over the discrete information-theoretic baseline (InfoTok).
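For anyone trying to picture the masking step, here is a rough NumPy sketch of thresholding per-position temporal-L1 differences. It is not the authors' code. The function names, the `(T, H, W, C)` latent layout, averaging the L1 difference over channels, comparing each frame to its immediate predecessor, always keeping the first frame, and the threshold value are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def temporal_redundancy_mask(latents, threshold):
    """Return a boolean keep-mask of shape (T, H, W).

    latents: array of shape (T, H, W, C) from a (hypothetical) frozen encoder.
    A position is kept when its channel-averaged L1 change from the
    previous frame exceeds `threshold` (assumed rule, see lead-in).
    """
    latents = np.asarray(latents, dtype=np.float64)
    # Per-position L1 difference between consecutive frames, averaged over channels.
    diff = np.abs(latents[1:] - latents[:-1]).mean(axis=-1)
    keep = np.ones(latents.shape[:-1], dtype=bool)  # first frame is always kept
    keep[1:] = diff > threshold
    return keep

def kept_fraction(keep):
    """Fraction of latent positions retained, i.e. the emergent token rate."""
    return float(keep.mean())

rng = np.random.default_rng(0)
# Static clip: the same latent frame repeated 5 times.
static = np.repeat(rng.normal(size=(1, 4, 4, 8)), 5, axis=0)
# Dynamic clip: independent random latents in every frame.
dynamic = rng.normal(size=(5, 4, 4, 8))

static_keep = temporal_redundancy_mask(static, threshold=0.1)
dynamic_keep = temporal_redundancy_mask(dynamic, threshold=0.1)
print(kept_fraction(static_keep), kept_fraction(dynamic_keep))
```

In this toy setup the static clip keeps only its first frame while the dynamic clip keeps nearly everything, which is the content-driven behaviour the abstract describes. Reconstructing the dropped positions is the job of LIT and is not sketched here.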

Similar Articles

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

EarlyTom: Early Token Compression Completes Fast Video Understanding

Adaptive Computation Depth via Learned Token Routing in Transformers

Generic Triple-Latent Compression with Gated Associative Retrieval

Efficient Pre-Training with Token Superposition
