One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA
Summary
Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks.
Cached at: 06/10/26, 05:45 AM
Source: https://huggingface.co/papers/2606.10572
Abstract
External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs, resulting in high token consumption and storage pressure, making it unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
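The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the compressor is not modeled at all, the latent vectors and raw token counts are made-up toy values, cosine similarity is an assumed scoring function, and the names cosine, retrieve, and generator_tokens are hypothetical. It only shows the general idea of embedding a query into the same space as one-vector-per-item memory, picking the top-k items, and comparing the generator prompt cost of raw evidence against one latent token per item.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Vector, memory: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Return the k (index, score) pairs with the highest similarity to the query."""
    scored = [(i, cosine(query, latent)) for i, latent in enumerate(memory)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def generator_tokens(raw_lengths: List[int], retrieved_ids: List[int],
                     question_tokens: int) -> Tuple[int, int]:
    """Prompt size when passing raw evidence vs. one latent token per item."""
    raw_cost = question_tokens + sum(raw_lengths[i] for i in retrieved_ids)
    latent_cost = question_tokens + len(retrieved_ids)
    return raw_cost, latent_cost


# Toy memory: one latent vector per evidence item (values are invented;
# in the paper these would come from the trained compressor LLM/VLM).
memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical token lengths of the raw evidence behind each latent vector.
raw_lengths = [120, 95, 140, 80]

# Toy query embedding, assumed to live in the same latent space.
query = [0.9, 0.1, 0.0]

top = retrieve(query, memory, k=2)
ids = [index for index, _ in top]
raw_cost, latent_cost = generator_tokens(raw_lengths, ids, question_tokens=20)

print("retrieved items:", ids)
print("generator tokens with raw evidence:", raw_cost)
print("generator tokens with latent memory:", latent_cost)
```

With these toy numbers the two closest items are retrieved and the prompt shrinks from 280 to 22 tokens. The ratio here is an artifact of the invented lengths and says nothing about the paper's reported 3x to 10x savings.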
Similar Articles
ElasticMem: Latent Memory as a Learnable Resource for LLM Agents
ElasticMem introduces a learnable latent memory mechanism for LLM agents that adaptively allocates variable budgets to retrieved memories, improving performance on memory-intensive QA and embodied agent tasks while reducing token costs.
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
DeferMem introduces a long-term memory framework for LLM agents that decouples memory into high-recall candidate retrieval and query-conditioned evidence distillation using reinforcement learning, achieving state-of-the-art QA accuracy with faster runtime.
Generic Triple-Latent Compression with Gated Associative Retrieval
This paper introduces generic triple-latent recurrent models that compress token pair interactions into a latent state, and a gated associative retrieval variant that improves exact recall. The hybrid model outperforms Transformers on byte-level WikiText-2 and a tokenized language benchmark, achieving up to 41.9% associative recall versus 25%.
S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
S3Mem proposes a structured spatiotemporal scene-event memory framework for long-horizon interactive question answering, using anchor-sensitive retrieval and token-budget-aware evidence interface to outperform standard RAG in multiple environments.
Rethinking LoRA Memory Through the Lens of KV Cache Compression
This paper studies the interaction between parameter-side memory (LoRA adapters) and context-side memory (KV cache) in document-level question answering. It finds that document LoRA becomes most valuable when the KV cache is heavily compressed, recovering up to 13–21 ROUGE-L points, and that QA-supervised adapters outperform next-token-prediction.