One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Hugging Face Daily Papers 06/09/26, 08:36 AM Papers

Summary

Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks.

External memory effectively grounds question answering (QA) with large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms store each memory item as raw text or images, so retrieval-based systems must pass the retrieved text or images to the generator LLM/VLM. This results in high token consumption and storage pressure, which makes such systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are fed directly as prompt input to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
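The retrieval step described above (embed the query into the latent space, then fetch the closest latent tokens) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the function names `l2_normalize` and `retrieve_latent_tokens` are hypothetical, random vectors stand in for the output of the trained compressor, and cosine similarity is assumed as the matching score because the abstract does not name one.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_latent_tokens(query_vec, memory, k=3):
    """Return indices, vectors, and scores of the k memory tokens closest to the query.

    memory has shape (num_items, dim): one latent token per evidence item.
    Cosine similarity is an assumed scoring choice, not the paper's stated one.
    """
    q = l2_normalize(query_vec)
    m = l2_normalize(memory)
    scores = m @ q                      # one similarity score per memory item
    k = min(k, memory.shape[0])
    top = np.argsort(-scores)[:k]       # highest scores first
    return top, memory[top], scores[top]


# Stand-in data: in the real system a compressor LLM/VLM would produce these.
rng = np.random.default_rng(0)
dim = 64
memory = rng.normal(size=(1000, dim))                  # 1000 evidence items, one token each
query_vec = memory[42] + 0.1 * rng.normal(size=dim)    # query lying near item 42

idx, latents, scores = retrieve_latent_tokens(query_vec, memory, k=3)
print("retrieved items:", idx.tolist())
print("tokens handed to the generator:", latents.shape[0])
```

In this sketch the generator would receive `k` vectors in total, one per retrieved evidence item, rather than the full text or image tokens of each item. That is the source of the token savings the abstract reports.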

Original Article

Cached at: 06/10/26, 05:45 AM

Source: https://huggingface.co/papers/2606.10572

Similar Articles

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Generic Triple-Latent Compression with Gated Associative Retrieval

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

Rethinking LoRA Memory Through the Lens of KV Cache Compression
