Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Summary
This paper presents a systematic study of long-context continued pre-training for vision-language models, achieving generalization beyond 128K context with an efficient data mixture design and introducing the MMProLong model.
Cached at: 05/14/26, 04:16 AM
Source: https://huggingface.co/papers/2605.13831
Abstract
Long-context continued pre-training enhances vision-language models’ ability to handle extended documents while maintaining performance across diverse contexts through strategic data mixture design.
Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
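To make finding i) concrete, the following is a minimal sketch of what a length-balanced sample could look like, as opposed to one concentrated near the 128K target. It is not the paper's data pipeline: the bucket boundaries, the `num_tokens` field, and the helper names `bucket_of` and `balanced_sample` are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical length buckets in tokens, spanning up to a 128K window.
# The abstract does not state the actual boundaries used in the paper.
LENGTH_BUCKETS = [(0, 32_000), (32_000, 64_000), (64_000, 96_000), (96_000, 128_000)]


def bucket_of(num_tokens):
    """Return the index of the bucket whose [low, high) range holds num_tokens."""
    for index, (low, high) in enumerate(LENGTH_BUCKETS):
        if low <= num_tokens < high:
            return index
    # Anything outside the listed ranges falls into the last bucket.
    return len(LENGTH_BUCKETS) - 1


def balanced_sample(examples, per_bucket, seed=0):
    """Draw up to per_bucket examples from every length bucket."""
    rng = random.Random(seed)
    by_bucket = {index: [] for index in range(len(LENGTH_BUCKETS))}
    for example in examples:
        by_bucket[bucket_of(example["num_tokens"])].append(example)
    sample = []
    for items in by_bucket.values():
        rng.shuffle(items)
        sample.extend(items[:per_bucket])
    return sample


# Toy pool of examples with evenly spread lengths (illustrative only).
pool = [{"num_tokens": i * 317} for i in range(400)]
sample = balanced_sample(pool, per_bucket=50)
counts = Counter(bucket_of(example["num_tokens"]) for example in sample)
print(sorted(counts.items()))
```

A target-length-focused mixture would instead draw mostly from the last bucket. The abstract reports that the balanced variant performs better, which this sketch only illustrates at the sampling level.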
Get this paper in your agent:
hf papers read 2605.13831
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens is a new benchmark for evaluating memory capabilities in large vision-language models through multi-session conversations. It compares long-context and memory-augmented approaches, revealing limitations in both and motivating hybrid architectures.
Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time
This paper introduces a retrieval-augmented vision-language-action policy that eliminates per-task fine-tuning by using pre-trained models with indexed demonstrations, enabling efficient cross-embodiment generalization and task adaptation at test time.
End-to-End Context Compression at Scale
This paper presents Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that efficiently handle long contexts through architectural search and large-scale pretraining, outperforming traditional KV cache methods in accuracy, speed, and memory usage.
Generalization Dynamics of LM Pre-training (17 minute read)
This paper reveals that during pre-training, language models frequently and suddenly switch between pattern-matching and generalization behaviors, a phenomenon called mode-hopping, and presents a toy evaluation suite to study it.
Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation
This paper proposes a reinforcement learning approach to enable large language models to translate unseen languages by leveraging in-context linguistic knowledge, outperforming in-context learning and supervised fine-tuning.