SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory
Summary
Introduces SMMBench, a benchmark to evaluate multimodal agents' ability to retrieve, align, and compose evidence scattered across independently originated sources like conversations, tables, and documents. Experiments show current systems struggle with this source-distributed memory composition task.
# SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory
Source: [https://arxiv.org/html/2605.15710](https://arxiv.org/html/2605.15710)
Huacan Chai¹, Yukai Wang¹, Yingxuan Yang¹, Dan Peng², Yuanyi Song¹, Zhihui Fu², Weiwen Liu¹,∗, Jianghao Lin¹,∗, Jun Wang²,∗, Weinan Zhang¹. ¹Shanghai Jiao Tong University, China; ²OPPO, China. {fatcat, wwliu, linjianghao, wnzhang}@sjtu.edu.cn, wangjun7@oppo.com
###### Abstract
Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that *source-distributed memory composition* is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce the *Source-distributed Multimodal Memory Benchmark* (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1,877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle on these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at [https://huggingface.co/datasets/HuacanChai/SMMBench](https://huggingface.co/datasets/HuacanChai/SMMBench).
## 1 Introduction
Multimodal agents are increasingly expected to act as persistent assistants in productivity, desktop, and enterprise settings \[[41](https://arxiv.org/html/2605.15710#bib.bib60), [21](https://arxiv.org/html/2605.15710#bib.bib63), [8](https://arxiv.org/html/2605.15710#bib.bib61), [38](https://arxiv.org/html/2605.15710#bib.bib55)\], where most real-world tasks are inherently *cross-source*: the information needed to answer questions or execute actions is typically accumulated over time across chats, tables, documents, and other artifacts, rather than packaged in a single context \[[16](https://arxiv.org/html/2605.15710#bib.bib62), [35](https://arxiv.org/html/2605.15710#bib.bib50), [1](https://arxiv.org/html/2605.15710#bib.bib54)\]. This setting reveals a challenge in agent memory: the difficulty is often not merely reading a long input, but using evidence that is distributed across *independent sources* created at different times and for different purposes, rather than reasoning over one pre-assembled context prepared for the final query \[[43](https://arxiv.org/html/2605.15710#bib.bib64)\].
We argue that source-distributed memory composition is an under-evaluated bottleneck in multimodal agent memory. *Source-distributed* means that the evidence is fragmented across multiple independently originated sources, such as separate group or private chats, profiles, tables, and documents, each with its own main purpose and local context. This brings challenges that are qualitatively different from reasoning over a single curated context. First, relevant evidence is distributed across multiple sources, and no single source is sufficient to determine the final answer on its own. For example, as illustrated in Figure [1](https://arxiv.org/html/2605.15710#S1.F1), an agent may need to connect evidence from a department chat, a meeting-location table, and a phone screenshot to infer that ‘John will fly to New York for Meeting A on Nov. 13’; no single source states this answer directly. Second, necessary evidence is often distributed across independently originated sources with different purposes and local contexts. Because these sources are created independently for different purposes rather than jointly organized for the query, their local contexts compartmentalize partial clues and make them harder to connect. This creates a distinct memory bottleneck: the agent must identify sources and bridge across their contextual boundaries to compose the answer. Third, information from different sources may conflict with one another, requiring the agent to update evidence and resolve conflicts by reasoning over their different authority levels or time states. Generally, the key challenge is not merely remembering isolated facts, but composing distributed evidence into answers or actions.
Prior benchmarks have made important progress on multimodal long-context and memory settings, but most of them still evaluate reasoning within a single pre-assembled context. Multimodal long-context benchmarks such as MILEBench \[[26](https://arxiv.org/html/2605.15710#bib.bib12)\] and Mementos \[[31](https://arxiv.org/html/2605.15710#bib.bib13)\] evaluate whether MLLMs can retrieve, compare, and reason over long text-image contexts or visual streams. Mem-Gallery \[[2](https://arxiv.org/html/2605.15710#bib.bib6)\] further moves toward conversational memory over coherent multimodal interaction traces. However, these benchmarks primarily evaluate evidence use within a coherent context or unified retrieval corpus. They therefore leave under-evaluated whether agent memory systems can compose multimodal evidence distributed across independently originated sources.
Figure 1: In real-world tasks, the necessary evidence is often distributed across multiple sources with distinct purposes and local contexts, even though these sources may share overlapping entities. Because no single source is sufficient, agents must retrieve and compose fragmented evidence across sources, making source-distributed memory a key bottleneck for memory-grounded responses and actions.
To evaluate this gap, we introduce the *Source-distributed Multimodal Memory Bench* (SMMBench), a benchmark for multimodal agent memory in which relevant evidence is intentionally distributed across multiple independently originated sources rather than provided as one pre-assembled context. The benchmark covers representative artifact types that arise in real-world persistent assistant scenarios, including conversations, profiles, tables, images, and documents, and organizes evaluation around four core capabilities: cross-source reasoning, conflict resolution, preference reasoning, and memory-grounded action prediction. It provides fine-grained evidence annotations together with both open-book and retrieval-based evaluation settings, enabling analysis of not only end-task accuracy but also how systems use distributed memory under different access conditions. Overall, the benchmark contains 1,877 evaluation samples across 5 task types and 264 sources. Experimental results show that even the strongest evaluated systems still perform poorly in this setting, highlighting source-distributed memory as an important and still under-evaluated challenge for multimodal agents. Our contributions are as follows:
- Problem Identification. We identify *source-distributed memory composition* as an under-evaluated bottleneck in multimodal agent memory, where the key challenge is composing evidence across independently originated sources rather than reasoning within one prepared context.
- Challenge Characterization. We clarify how source-distributed memory differs from standard long-context reasoning by characterizing the distinct challenges introduced by independent sources, including source-level incompleteness, cross-source context bridging, and conflict resolution under different authority levels or time states.
- Benchmark Construction. We introduce the *Source-distributed Multimodal Memory Bench*, a benchmark that operationalizes this challenge through source objects such as conversations, profiles, screenshots, tables, images, and documents, and evaluates cross-source reasoning, conflict resolution, preference reasoning, and memory-grounded action prediction.
- Empirical Findings. We provide fine-grained evidence annotations and evaluation experiments under open-book and retrieval-based settings. Experiments on 1,877 samples, 5 task types, and 264 sources show that current representative methods remain far from effective.
Table 1: Comparison with representative memory benchmarks. ✓: Satisfies; ✗: Does not satisfy.

| Benchmark | MM | Ev.M. | Src.M. | M.Ev. | Indep. Src. | X.S. | C.R. | P.R. | A.P. |
|---|---|---|---|---|---|---|---|---|---|
| LongMemEval \[[33](https://arxiv.org/html/2605.15710#bib.bib4)\] | ✗ | T | C | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| MemoryAgentBench \[[9](https://arxiv.org/html/2605.15710#bib.bib20)\] | ✗ | T | C/D | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ |
| LoCCO \[[11](https://arxiv.org/html/2605.15710#bib.bib53)\] | ✗ | T | C | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| LoCoMo \[[19](https://arxiv.org/html/2605.15710#bib.bib18)\] | ✓ | T/I | C | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| Mementos \[[31](https://arxiv.org/html/2605.15710#bib.bib13)\] | ✓ | I | I | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| MMDU \[[18](https://arxiv.org/html/2605.15710#bib.bib17)\] | ✓ | T/I | C | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| MMRC \[[36](https://arxiv.org/html/2605.15710#bib.bib21)\] | ✓ | T/I | C | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ |
| MultiHaystack \[[35](https://arxiv.org/html/2605.15710#bib.bib50)\] | ✓ | D/I/V | D/I | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Mem-Gallery \[[2](https://arxiv.org/html/2605.15710#bib.bib6)\] | ✓ | T/I | C | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| SMMBench | ✓ | T/I/D/Tab. | C/D/I | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

MM: contains multimodal inputs; Ev.M. & Src.M.: modality types of evidence and sources, T for text, I for image, D for document, Tab. for table, C for conversation; M.Ev.: problems cannot be solved without combining multiple pieces of evidence; Indep. Src.: evidence distributed across heterogeneous and independent sources. The following columns show the evaluated capabilities. X.S.: cross-source reasoning; C.R.: conflict resolution; P.R.: preference reasoning; A.P.: action prediction.
## 2 Related Work
### 2.1 Multimodal Agent Memory Benchmark
Recent benchmarks on multimodal agent memory have largely centered on the performance bottlenecks induced by long input contexts. MILEBench \[[26](https://arxiv.org/html/2605.15710#bib.bib12)\] evaluates the long-context understanding ability of MLLMs, while Mementos \[[31](https://arxiv.org/html/2605.15710#bib.bib13)\] focuses on reasoning over long image sequences. Mem-Gallery \[[2](https://arxiv.org/html/2605.15710#bib.bib6)\] moves closer to the agent memory setting by emphasizing memory maintenance in multi-session conversations, yet its conversational trajectories are still largely coherent rather than distributed across independently originated sources. However, these benchmarks are still insufficient for evaluating the source-distributed setting targeted by SMMBench. They mainly assume a *coherent context*, such as a long conversation, image stream, or unified interaction history. By contrast, SMMBench evaluates whether a system can identify and compose answer-critical evidence scattered across *independently originated sources* with different purposes and local contexts. This source-level fragmentation is not reducible to ordinary long-context reasoning.
### 2.2 Multimodal Agent RAG Benchmark
Multimodal RAG benchmarks have developed along similar lines. M2RAG \[[17](https://arxiv.org/html/2605.15710#bib.bib52)\] evaluates how effectively MLLMs retrieve and use multimodal documents for open-domain tasks such as captioning. MultiHaystack \[[35](https://arxiv.org/html/2605.15710#bib.bib50)\] emphasizes multimodal evidence under noisy retrieval settings. Nevertheless, most multimodal RAG benchmarks assume a *coherent corpus retrieval* setting, where evidence is retrieved from a shared repository and the main challenge is locating relevant items under scale or noise. They therefore test retrieval of relevant multimodal evidence, but not composition across independently originated sources with separate contextual boundaries. By contrast, SMMBench evaluates whether systems can identify relevant sources, recover partial clues from each, and compose them into a coherent answer or executable action.
## 3 SMMBench Benchmark
Figure 2: Overview of SMMBench. Top: Dataset construction pipeline. Bottom left: Agents interact with heterogeneous memory sources, where answer-critical evidence is distributed across independent sources. Bottom right: Given the constructed environments, agents retrieve from memory and are evaluated on multiple task types, including single-/multi-hop QA, conflict resolution, preference reasoning, and function calling.
### 3.1 Problem Formulation
We formulate SMMBench as a memory-grounded question answering and action prediction benchmark over *source-distributed* multimodal evidence. In SMMBench, a *source* is an independently originated memory object with its own local context and organizational boundary, such as a group or private chat, a profile page, an image, a table, or a document, rather than an artificial retrieval chunk obtained by splitting a larger object. Formally, each evaluation sample contains a set of sources
$$\mathcal{S}=\{S_{1},S_{2},\dots,S_{m}\},\tag{1}$$
where each source $S_{i}$ consists of one or more evidence-bearing items
$$S_{i}=\{o_{i,1},o_{i,2},\dots,o_{i,n_{i}}\},\quad o_{i,j}=\langle x_{i,j},s_{i},\tau_{i,j}\rangle,\tag{2}$$
with content $x_{i,j}$, shared source identity $s_{i}$, and timestamp or local temporal position $\tau_{i,j}$. The content $x_{i,j}$ may include text, images, tables, document pages, or other multimodal evidence.
A sample is considered *source-distributed* only if its answer-critical evidence satisfies two conditions: (1) the required evidence comes from at least two distinct sources, and (2) no single source alone is sufficient to determine the gold answer. Let $\mathcal{E}^{*}(q)$ be the minimal evidence set required for question $q$, and let $s(e)$ denote the source of evidence item $e$. We require
$$|\{s(e)\mid e\in\mathcal{E}^{*}(q)\}|\geq 2.\tag{3}$$
Thus, each source provides only partial information, and the final answer must be obtained by composing evidence across source boundaries rather than by reading one locally complete source.
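As a hedged illustration, the two conditions above can be checked mechanically once every evidence item carries its source identity. The sketch below is not the benchmark's own tooling: the `Evidence` record and the `answerable_from` callback are hypothetical names, and the callback merely stands in for whatever sufficiency judgment (human or model-based) decides condition (2).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record for one item e in the minimal evidence set E*(q)."""
    content: str
    source_id: str  # s(e): identity of the source the item comes from


def distinct_sources(evidence: Iterable[Evidence]) -> Set[str]:
    """The set {s(e) | e in E*(q)} from Eq. (3)."""
    return {e.source_id for e in evidence}


def is_source_distributed(
    evidence: Iterable[Evidence],
    answerable_from: Callable[[List[Evidence]], bool],
) -> bool:
    """Check condition (1) via Eq. (3) and condition (2) via an assumed oracle."""
    items = list(evidence)
    sources = distinct_sources(items)
    if len(sources) < 2:  # Eq. (3): at least two distinct sources
        return False
    for source in sources:  # condition (2): no single source is sufficient
        local = [e for e in items if e.source_id == source]
        if answerable_from(local):
            return False
    return True


# Toy instance modeled on the Figure 1 example; the oracle is an assumption
# that says the answer needs all three clues together.
clues = [
    Evidence("John attends Meeting A", "department_chat"),
    Evidence("Meeting A is held in New York", "location_table"),
    Evidence("Flight booked for Nov. 13", "phone_screenshot"),
]
needs_all_three = lambda local: len(local) == 3
result = is_source_distributed(clues, needs_all_three)
```

Note that Eq. (3) alone only formalizes condition (1); the sufficiency test is kept as a separate pluggable function for that reason.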
Given the source set $\mathcal{S}$, the agent incrementally observes items from these sources and maintains an external memory state $M_{t}$. For conversational sources, the observations follow their turn order; for non-conversational sources such as documents or images, the observations correspond to their associated source items. We denote the overall observation stream as
$$\mathcal{O}=\{o_{1},o_{2},\ldots,o_{T}\},\tag{4}$$
where each observation retains its source identity. As each observation arrives, the memory is updated by the memory update operator $\Phi$:
$$M_{t+1}=\Phi(M_{t},o_{t}).\tag{5}$$
After ingesting the full sources, the agent receives a question $q$, retrieves relevant memory units, and generates the final answer:
$$M_{\mathrm{ret}}=R(M_{T},q),\quad y=G(q,M_{\mathrm{ret}}).\tag{6}$$
Under this formulation, success requires more than recalling isolated facts from a long input. A successful system must (1) preserve source-aware memory over heterogeneous source objects, (2) retrieve evidence spanning the right source boundaries, and (3) compose or reconcile these pieces into a coherent final answer or action. Therefore, SMMBench evaluates *source-distributed memory composition* rather than only long-context multimodal recall.
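The update, retrieve, and generate steps in Eqs. (5) and (6) can be sketched as a small loop. This is a minimal toy under stated assumptions, not any baseline's implementation: `update` is an append-only stand-in for $\Phi$, `retrieve` ranks by word overlap as a stand-in for $R$, and `generate` only concatenates source-tagged evidence where a real system would call a backbone model for $G$. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Observation:
    content: str    # x: item content (text only in this toy sketch)
    source_id: str  # s: source identity, retained in memory
    timestamp: int  # tau: timestamp or local temporal position


def update(memory: List[Observation], obs: Observation) -> List[Observation]:
    """Phi: append-only update (assumption; real systems may summarize or overwrite)."""
    return memory + [obs]


def retrieve(memory: List[Observation], query: str, k: int = 2) -> List[Observation]:
    """R: keep the k memory units with the largest word overlap with the query."""
    query_words = set(query.lower().split())

    def overlap(obs: Observation) -> int:
        return len(query_words & set(obs.content.lower().split()))

    return sorted(memory, key=overlap, reverse=True)[:k]


def generate(query: str, retrieved: List[Observation]) -> str:
    """G: placeholder that lists the retrieved evidence with its source tags."""
    return " | ".join(f"[{o.source_id}] {o.content}" for o in retrieved)


stream = [
    Observation("John will attend Meeting A", "dept_chat", 1),
    Observation("lunch menu changed today", "dept_chat", 2),
    Observation("Meeting A location New York", "location_table", 3),
]

memory: List[Observation] = []
for obs in stream:                 # Eq. (5), applied once per observation
    memory = update(memory, obs)

question = "Where is Meeting A for John"
retrieved = retrieve(memory, question)   # Eq. (6), retrieval step
answer = generate(question, retrieved)   # Eq. (6), generation step
```

Because each memory unit keeps its `source_id`, the retrieved set here spans two sources, which is the property the benchmark is designed to stress.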
### 3.2 Benchmark Construction
We construct SMMBench through a three-stage pipeline: QA preparation, conversational source synthesis, and source-aware evidence insertion. This pipeline turns curated multimodal QA instances into memory-grounded evaluation samples whose answer-critical evidence is distributed across multiple sources. For the detailed building process, please refer to Appendix [B](https://arxiv.org/html/2605.15710#A2).
#### QA Preparation
We collect raw samples from diverse public multimodal benchmarks and convert each into a question-answer pair along with a unified set of evidence units, which serve as the inputs to the following stages. Detailed preparation and verification procedures are deferred to Appendix [B.2](https://arxiv.org/html/2605.15710#A2.SS2).
#### Conversational Source Synthesis
Next, we construct a multi-source conversational environment that will later host the prepared evidence. Concretely, we instantiate agents with predefined profiles and organize them into several conversational sources, including both group and private chats, with some participants recurring across sources. Each conversational source has its own participants and continuous communicative topics, yielding multiple parallel yet related interaction streams rather than a single monolithic conversation. We then simulate ordinary multi-turn conversations within each source using high-level topical cues abstracted from the sampled evidence, so that the resulting conversations remain relevant to the later grounding stage without exposing the evidence itself.
#### Source\-Aware Evidence Insertion
Finally, we ground the prepared evidence into the generated multi-source environment. A key design choice is to place complementary evidence units into different sources, rather than merely spreading them across distant positions within the same source, which emphasizes the challenge of identifying relevant sources and composing information across them. Specifically, an LLM-based source-aware inserter routes each evidence unit to a suitable source and local position according to the source context and the evidence content, while preserving the readability and continuity of the host source after insertion. For evidence that is complementary, updated, or conflicting, we further check cross-source placement consistency and preserve temporal dependencies, so that newer information appears in a later and coherent position relative to the earlier evidence it supplements or revises. The result is a set of samples that require agents to absorb and compose evidence from distributed sources.
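The two placement constraints described above (complementary units land in different sources, and revising evidence appears after the evidence it revises) lend themselves to a simple post-hoc check. The sketch below is an assumption about how such a check could look; it does not reproduce the LLM-based inserter, and the `Placement` representation and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: evidence unit id -> (source_id, global timestamp).
Placement = Tuple[str, int]


def temporal_violations(
    placements: Dict[str, Placement],
    revises: Sequence[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (earlier, later) pairs where the later unit is not placed strictly after."""
    return [
        (earlier, later)
        for earlier, later in revises
        if placements[later][1] <= placements[earlier][1]
    ]


def spans_multiple_sources(
    placements: Dict[str, Placement],
    unit_ids: Sequence[str],
) -> bool:
    """True if the given evidence units are hosted by at least two distinct sources."""
    return len({placements[u][0] for u in unit_ids}) >= 2


# Toy placement: a meeting time announced in a group chat and later revised
# in a private chat, plus a room entry in a table.
placements = {
    "meeting_time_v1": ("group_chat", 10),
    "meeting_time_v2": ("private_chat", 42),
    "meeting_room": ("location_table", 30),
}
revises = [("meeting_time_v1", "meeting_time_v2")]

ok_order = temporal_violations(placements, revises)
ok_spread = spans_multiple_sources(placements, ["meeting_time_v1", "meeting_room"])
```

A placement that fails either check would be a candidate for re-insertion before the sample is accepted.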
Illustration of categories in SMMBench.
![[Uncaptioned image]](https://arxiv.org/html/2605.15710v1/x3.png)
Statistics of SMMBench. ‘Avg.’ means ‘Average’; ‘Avg. Evidence Int.’ means ‘Average evidence interval turns in conversational sources’; ‘Avg. Sources’ means the average number of input sources needed to solve one QA pair.

| Source Statistics: Metric per Source | Value | QA Evidence Statistics: Metric per QA | Value |
|---|---|---|---|
| #Sources | 264 | Avg. Evidence | 4.54 |
| Avg. Turns | 831 | Avg. Sources | 2.82 |
| Avg. Images | 14.99 | Avg. Texts | 2.31 |
| Avg. Docs | 2.82 | Avg. Images | 0.95 |
| Avg. Evidence Int. | 45.43 | Avg. Docs | 1.24 |

### 3.3 Benchmark Analysis
#### Task Introduction
To better reflect real-world agent applications and the four core capabilities discussed, we organize the benchmark into 5 task types: (1) & (2) Single-Hop QA (S.H.) & Multi-Hop QA (M.H.), where the agent must retrieve multiple pieces of multimodal evidence from different sources and perform a single- or multi-step reasoning process to answer the question; (3) Conflict Resolution (C.R.), where the environment contains outdated evidence and the agent must identify the newer evidence from another source to override it; (4) Preference Reasoning (P.R.), where the agent must infer user preferences by integrating implicit personalized cues across multiple sources; and (5) Function Call (F.C.), which is designed to closely resemble realistic agent use cases and requires the agent to learn workflows from different sources, remember detailed parameters in multimodal content, and produce the exact tool invocation, serving as a precise action-prediction task.
#### Benchmark Statistics and Framework
Figure [3.2](https://arxiv.org/html/2605.15710#S3.SS2.SSS0.Px3) and Table [3.2](https://arxiv.org/html/2605.15710#S3.SS2.SSS0.Px3) summarize the overall composition of SMMBench. SMMBench has three notable properties. (1) Source-distributed and complementary evidence. The benchmark is explicitly multi-source: each QA instance requires evidence from 2.82 sources on average, with 4.54 supporting evidence items per sample, showing that answers typically depend on evidence composition across multiple sources rather than local retrieval within a single context. Additional experiments in Appendix [C.5](https://arxiv.org/html/2605.15710#A3.SS5) confirm that these evidence pieces are complementary rather than redundant. (2) Rich multimodal evidence. SMMBench contains complementary textual and non-textual information, with each QA instance involving 2.31 text evidence items, 0.95 image evidence items, and 1.24 document evidence items on average, indicating that many cases require joint use of different modalities. (3) Sparse long-horizon evidence placement. The conversational sources are long and the relevant clues are intentionally sparse: each source contains 831 turns on average, while relevant evidence appears only every 45.43 turns on average. This design reduces shortcut solving from locally clustered clues and makes memory retrieval depend on sustained tracking over long interaction histories.
Each sample is evaluated as a memory-update process following Section [3.2](https://arxiv.org/html/2605.15710#S3.SS2). During memory construction, conversational turns from different sources are merged by timestamp and fed into the target agent memory sequentially; non-conversational sources are also inserted according to their designated temporal positions in the same global sequence. Once memory construction is complete, the benchmark question is used as a query to trigger memory retrieval, and the recalled items are concatenated with the question as the final context for the backbone LLM. The exact memory update and retrieval mechanisms are left to each agent memory system. For F.C. tasks, candidate tools are provided and the agent is required to generate the exact function invocation; for all other tasks, responses are evaluated in a multiple-choice format.
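The timestamp-based merge described above can be sketched with the standard library's `heapq.merge`, which lazily interleaves already-sorted streams. This is only an illustration of the ordering step: the tuple layout `(timestamp, source_id, content)` and the function name are assumptions, and ties on timestamp are broken by source id purely as a side effect of tuple comparison.

```python
import heapq
from typing import Dict, List, Tuple

Item = Tuple[int, str, str]  # (timestamp, source_id, content); hypothetical layout


def merge_sources(sources: Dict[str, List[Tuple[int, str]]]) -> List[Item]:
    """Merge per-source (timestamp, content) lists into one global, time-ordered stream."""
    streams = (
        # Each per-source stream is sorted first, as heapq.merge requires sorted inputs.
        [(ts, source_id, content) for ts, content in sorted(items)]
        for source_id, items in sources.items()
    )
    return list(heapq.merge(*streams))


sources = {
    "group_chat": [(1, "Kickoff is next week"), (4, "Kickoff moved to Friday")],
    "private_chat": [(2, "Can you book the room?")],
    # Non-conversational source with a designated temporal position.
    "location_table": [(3, "Room B, 3rd floor")],
}
merged = merge_sources(sources)
timestamps = [ts for ts, _, _ in merged]
```

Each merged item would then be passed, in order, to the memory system's own update routine, which the benchmark leaves unspecified.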
## 4 Experiment
We conduct extensive experiments on SMMBench to answer the following research questions\.
1. RQ1: How do representative memory/retrieval baselines perform on our benchmark across different task categories?
2. RQ2: Does distributed supporting evidence across multiple sources make memory reasoning more difficult?
3. RQ3: How does the number of required sources affect benchmark difficulty?
4. RQ4: How does the number of retrieved items affect the overall performance?
5. RQ5: What are the main failure modes of current systems on source-distributed multimodal memory tasks?
### 4.1 Baselines and Evaluation Setup
We evaluate representative baselines from two families: *memory-style* methods and *RAG-style* methods. The memory-style baselines include Short-Term Mem. \[[40](https://arxiv.org/html/2605.15710#bib.bib41)\], Reflexion Mem. \[[25](https://arxiv.org/html/2605.15710#bib.bib30)\], Gen. Agent Mem. \[[24](https://arxiv.org/html/2605.15710#bib.bib31)\], Self Controlled Mem. \[[28](https://arxiv.org/html/2605.15710#bib.bib32)\], MIRIX \[[32](https://arxiv.org/html/2605.15710#bib.bib33)\], MemGPT \[[23](https://arxiv.org/html/2605.15710#bib.bib34)\], MemVerse \[[14](https://arxiv.org/html/2605.15710#bib.bib35)\], Mem0 \[[4](https://arxiv.org/html/2605.15710#bib.bib36)\], and OmniSimpleMem \[[13](https://arxiv.org/html/2605.15710#bib.bib37)\]. The RAG-style baselines include Native RAG \[[40](https://arxiv.org/html/2605.15710#bib.bib41)\], HMRAG \[[15](https://arxiv.org/html/2605.15710#bib.bib38)\], UniversalRAG \[[39](https://arxiv.org/html/2605.15710#bib.bib39)\], and VRAG \[[30](https://arxiv.org/html/2605.15710#bib.bib40)\]. Among these methods, MemGPT, MIRIX, UniversalRAG, and VRAG provide native multimodal memory/retrieval, while other baselines are evaluated in a *Text+Caption* setting, where non-textual evidence is converted into textual captions before storage and retrieval. In addition, MemGPT and MIRIX are also evaluated in a textual-memory setting for direct comparison across modality-access settings. We also include two reference baselines. The Random Baseline represents chance-level performance on the multiple-choice tasks and serves as a lower anchor for basic answerability, while the Golden Evidence Baseline directly provides supporting evidence to the backbone model, serving as a high-performance reference.
For evaluation, we use gpt-4.1 \[[22](https://arxiv.org/html/2605.15710#bib.bib56)\] to generate captions for non-text evidence in the Text+Caption setting. For metrics, we report multiple-choice accuracy for Single-Hop QA, Multi-Hop QA, Conflict Resolution, and Preference Reasoning, and exact match for Function Call, together with the overall average. For all baselines, we use qwen3-vl-235b-instruct \[[37](https://arxiv.org/html/2605.15710#bib.bib57)\] as the backbone model. For baselines that include retrieval/recall pipelines, the default number of retrieved items is set to 20. Detailed evaluation prompt templates are provided in Appendix [E](https://arxiv.org/html/2605.15710#A5). Additional details and configurations are listed in Appendix [C](https://arxiv.org/html/2605.15710#A3).
Table 2: Main results on SMMBench. Within each metric column, the top three values among benchmarked baselines with complete entries, excluding the two reference rows, are marked ¹ (highest), ² (second), and ³ (third). S.H., M.H., C.R., P.R., and F.C. denote Single-Hop QA, Multi-Hop QA, Conflict Resolution, Preference Reasoning, and Function Call. All scores are averaged over 3 runs and Overall reports the unweighted average score.

| Modal | Method | Baseline | S.H. | M.H. | C.R. | P.R. | F.C. | Overall |
|---|---|---|---|---|---|---|---|---|
| - | - | Random Baseline | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 0.0000 | 0.2000 |
| - | - | Golden Evidence Baseline | 0.8753 | 0.7768 | 0.8365 | 0.9699 | 0.2778 | 0.7473 |
| Text + Caption | Memory | Short-Term Mem. | 0.2990 | 0.2292 | 0.2246 | 0.2702 | 0.0093 | 0.2064 |
| Text + Caption | Memory | Reflexion Mem. | 0.5385 | 0.4861 | 0.3962 | 0.3064 | 0.0278 | 0.3510 |
| Text + Caption | Memory | Gen. Agent Mem. | 0.3164 | 0.2708 | 0.2754 | 0.2238 | 0.0185 | 0.2210 |
| Text + Caption | Memory | Self Controlled Mem. | 0.3129 | 0.3403 | 0.2881 | 0.2203 | 0.0185 | 0.2360 |
| Text + Caption | Memory | MIRIX | 0.3601 | 0.2569 | 0.3072 | 0.2960 | 0.0278 | 0.2496 |
| Text + Caption | Memory | MemGPT | 0.5734 | 0.4722 | 0.3326 | 0.4389¹ | 0.0370 | 0.3708 |
| Text + Caption | Memory | MemVerse | 0.3549 | 0.3264 | 0.2691 | 0.2754 | 0.0093 | 0.2470 |
| Text + Caption | Memory | Mem0 | 0.7430³ | 0.5903³ | 0.3665 | 0.2909 | 0.0398 | 0.4061³ |
| Text + Caption | RAG | Native RAG | 0.7797² | 0.6667² | 0.4068² | 0.3683² | 0.0741³ | 0.4591² |
| Text + Caption | RAG | HMRAG | 0.8129¹ | 0.7153¹ | 0.5191¹ | 0.3081³ | 0.1111¹ | 0.4933¹ |
| MM | Memory | MemGPT | 0.3497 | 0.5139 | 0.2691 | 0.1997 | 0.0241 | 0.2713 |
| MM | Memory | MIRIX | 0.5507 | 0.3472 | 0.4025³ | 0.2031 | 0.0769² | 0.3161 |
| MM | Memory | OmniSimpleMem | 0.2311 | 0.0375 | 0.2153 | 0.1840 | 0.0185 | 0.1373 |
| MM | RAG | UniversalRAG | 0.3322 | 0.3403 | 0.2606 | 0.1566 | 0.0370 | 0.2253 |
| MM | RAG | VRAG | 0.4913 | 0.4514 | 0.3919 | 0.1842 | 0.0463 | 0.3130 |

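To make the metric definitions concrete, the sketch below computes multiple-choice accuracy, a strict exact-match test for function calls, and the unweighted Overall score. The function names are hypothetical, and the whitespace stripping in `exact_match` is an assumption, since the text does not state how invocations are normalized. The per-task scores used in the example are copied from the two rows of Table 2 named in the comments.

```python
from typing import Dict, Sequence

TASKS = ("S.H.", "M.H.", "C.R.", "P.R.", "F.C.")


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Multiple-choice accuracy: fraction of predicted options equal to the gold option."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and equally long")
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def exact_match(predicted_call: str, gold_call: str) -> bool:
    """Exact match for a function invocation (whitespace stripping is an assumption)."""
    return predicted_call.strip() == gold_call.strip()


def overall(scores: Dict[str, float]) -> float:
    """Unweighted average over the five task scores, rounded to four decimals."""
    return round(sum(scores[t] for t in TASKS) / len(TASKS), 4)


# Per-task scores of the Golden Evidence Baseline row in Table 2.
golden = {"S.H.": 0.8753, "M.H.": 0.7768, "C.R.": 0.8365, "P.R.": 0.9699, "F.C.": 0.2778}
# Per-task scores of the HMRAG row (Text + Caption) in Table 2.
hmrag = {"S.H.": 0.8129, "M.H.": 0.7153, "C.R.": 0.5191, "P.R.": 0.3081, "F.C.": 0.1111}

golden_overall = overall(golden)  # 0.7473, matching the Overall column
hmrag_overall = overall(hmrag)    # 0.4933, matching the Overall column
```

Because the average is unweighted across tasks, the low F.C. scores pull every Overall value down by the same one-fifth weight regardless of how many samples each task contains.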
### 4.2 RQ1: Main Results
The following insights can be drawn from Table [2](https://arxiv.org/html/2605.15710#S4.T2). Additional analysis is provided in Appendix [C.2](https://arxiv.org/html/2605.15710#A3.SS2).
#### Existing methods still struggle on source\-distributed, multimodal memory tasks\.
The reference baselines show that the benchmark is solvable, yet current systems are still far from performing strongly on it. Strong baselines clearly outperform the Random Baseline, but there remains a clear gap to the Golden Evidence Baseline. Even baselines with relatively mature memory systems like MIRIX or MemGPT still struggle to retrieve, align, and compose the needed evidence under realistic multi-source access, indicating that current methods remain insufficient for source-distributed multimodal memory settings. Performance is also uneven across task families: systems that are relatively strong on some QA-style tasks often remain clearly weaker on preference reasoning or function calling, and no single method is consistently strong across the board.
#### Native multimodal access alone does not resolve the challenge of source\-distributed memory\.
Notably, even without observing native multimodal content, some memory systems such as Reflexion Mem. still perform reasonably well. A clean comparison comes from paired results within the same memory systems. MIRIX improves from 0.2496 to 0.3161 overall when moving from the Text+Caption setting to the native MM setting, whereas MemGPT drops from 0.3708 to 0.2713. This suggests that the challenge in the source-distributed setting is not solely determined by access to native multimodal content, but also by the ability to compose distributed evidence across sources.
#### Source\-distributed difficulty is amplified on precision\-sensitive tasks\.
Function Call remains the most challenging setting in our benchmark. The best F.C. score is only 0.1111, and even the Golden Evidence Baseline reaches only 0.2778, far below its scores on the other tasks. This pattern suggests that the difficulty of source-distributed memory is especially amplified when the task requires precise argument recovery from dispersed multimodal evidence. In other words, source-distributed settings not only make evidence harder to find, but also make it harder to accurately align and assemble fine-grained information into executable actions. The same gap is also visible within individual systems: MemGPT and MIRIX both achieve much stronger performance on QA than on F.C. Therefore, as a stress-test subset, Function Call can evaluate whether remembered information can be converted into downstream actions under source-distributed conditions.
### 4.3 RQ2: Impact of Source-Concentrated vs. Source-Distributed Evidence
Figure 3: Comparison between Source-Concentrated and Source-Distributed Settings
Figure 4: Experiment on the Different Source Numbers
We compare *source-concentrated* and *source-distributed* evidence settings to evaluate whether dispersing the same supporting evidence across multiple sources increases the difficulty of memory-grounded reasoning. To keep the comparison fair, we fix the QA pairs and their gold evidence content, and only reorganize where that gold evidence is placed. Concretely, for each QA sample, we swap the chunks containing its gold evidence with chunks from other sources so that all supporting evidence is concentrated into one source in the *source-concentrated* condition, while preserving its original placement across multiple sources in the *source-distributed* condition; we then lightly polish the local scaffold for fluency without changing the evidence content. This design changes only whether evidence is concentrated in one source or distributed across sources, while keeping the context budget matched across settings.
According to the overall scores in Figure [4](https://arxiv.org/html/2605.15710#S4.F4), the performance of all three representative baselines consistently drops from the *source-concentrated* setting to the *source-distributed* setting. This result supports the view that the primary challenge comes from the *source-distributed* design itself. In the source-concentrated setting, the required evidence remains accessible within a single local context. By contrast, in the source-distributed setting, models must locate the relevant independent sources, connect their separate local contexts, and compose evidence that is individually incomplete but jointly sufficient. This highlights that the difficulty is not reading more content, but coordinating reasoning across independently originated sources. The task-level breakdowns in Appendix [C.3](https://arxiv.org/html/2605.15710#A3.SS3) follow the same pattern.
### 4\.4 RQ3: Impact of the Number of Sources
To better investigate the effect of source distribution, we analyze the performance of representative baselines on samples with different numbers of required sources\. Evaluation samples with different source numbers are interleaved within the same benchmark environments, and are therefore evaluated against source pools of similar overall length and distracting content, while differing primarily in how the critical evidence is distributed across those sources\. Since the task coverage is not identical across evidence number buckets, we restrict this analysis to the overlapping task subset \(S\.H\., M\.H\., and P\.R\.\) and report the unweighted average scores over shared tasks\. The results are shown in Figure [4](https://arxiv.org/html/2605.15710#S4.F4)\.
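The restriction to shared tasks and the unweighted averaging can be sketched as follows\. This is an illustrative sketch only: the function name `shared_task_average` is hypothetical, and the scores are made\-up placeholders, not results from the paper\.

```python
def shared_task_average(scores):
    """Average each source-count bucket over the tasks present in every bucket.

    scores: dict mapping bucket (number of required sources) -> {task: score}
    Returns (shared_tasks, averages), where each average weights tasks equally.
    """
    # Tasks covered by all buckets; others are dropped to keep buckets comparable.
    shared = sorted(set.intersection(*(set(tasks) for tasks in scores.values())))
    averages = {
        bucket: sum(tasks[t] for t in shared) / len(shared)
        for bucket, tasks in scores.items()
    }
    return shared, averages


# Placeholder numbers for illustration only (NOT taken from the paper).
scores = {
    2: {"S.H.": 0.50, "M.H.": 0.40, "P.R.": 0.60, "C.R.": 0.70},
    3: {"S.H.": 0.40, "M.H.": 0.30, "P.R.": 0.50},
}
shared, averages = shared_task_average(scores)
```

Averaging per task before comparing buckets keeps a task with many samples from dominating a bucket, and dropping non\-shared tasks \(here the placeholder `C.R.` entry\) avoids comparing buckets with different task mixes\.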
Although the absolute performance levels differ across baselines due to their different algorithmic architectures, the direction of change is fully consistent: all methods perform worse as the number of required sources increases\. This consistency suggests that dispersion of sources itself is an important difficulty factor in multimodal agent memory, rather than very long context alone\. As the evidence required by a question becomes more distributed across more independent sources, the agent must retrieve, preserve, and integrate increasingly dispersed evidence, which makes the task substantially harder\. In our benchmark, the challenge is not merely reading more multimodal content, but identifying which sources jointly matter, recovering partial clues scattered across them, and composing these fragments into a coherent final answer\.
### 4\.5 RQ4: Impact of Retrieval Budget
We vary the retrieval budget by changing the number of retrieved entries $K$ to examine the trade\-off between evidence coverage and retrieval noise\. Table [3](https://arxiv.org/html/2605.15710#S4.T3) shows a clear first\-order pattern across representative retrieval\-based baselines: as $K$ increases, Recall@$K$ generally rises, while Precision@$K$ declines\. This trend indicates that larger retrieval pools make it easier to surface relevant evidence from sources, but also introduce more irrelevant or weakly relevant candidates into the retrieved set\. In summary, larger retrieval budgets improve evidence coverage but also increase retrieval noise, making downstream evidence filtering and composition more difficult\.
Table 3: Retrieval Recall@$K$ and Precision@$K$ by baseline and retrieval budget $K$\.

| Baseline | Recall@10 | Recall@20 | Recall@50 | Recall@100 | Precision@10 | Precision@20 | Precision@50 | Precision@100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reflexion Mem\. | 0\.2832 | 0\.2835 | 0\.2835 | 0\.2801 | 0\.0746 | 0\.0374 | 0\.0149 | 0\.0073 |
| Native RAG | 0\.4336 | 0\.6355 | 0\.7313 | 0\.7903 | 0\.1143 | 0\.0837 | 0\.0389 | 0\.0207 |
| HMRAG | 0\.2677 | 0\.4099 | 0\.5159 | 0\.5244 | 0\.0705 | 0\.0540 | 0\.0272 | 0\.0138 |
| VRAG | 0\.0981 | 0\.2181 | 0\.2489 | 0\.2485 | 0\.0260 | 0\.0287 | 0\.0131 | 0\.0066 |

This trade\-off is especially important under source\-distributed settings, where the supporting evidence is scattered across multiple independent sources rather than concentrated in a single local context\. With a small retrieval budget, systems can easily miss crucial clues because the evidence needed for one question may be fragmented across different chats, files, documents, or other sources\. Increasing $K$ helps improve coverage over these dispersed sources and therefore reduces the risk of source\-level omission\. However, it also introduces more irrelevant or weakly related candidates, making downstream source selection and evidence composition harder\. From the perspective of a source\-distributed memory environment, larger retrieval budgets are therefore not unconditionally better: they improve the chance of covering the necessary sources, but at the cost of a noisier candidate pool that places greater demands on subsequent filtering, grounding, and cross\-source composition\. End\-task results in Appendix [C\.4](https://arxiv.org/html/2605.15710#A3.SS4) are broadly consistent with this insight, suggesting that larger $K$ often helps in source\-distributed evaluation, although the gains depend on how effectively each system can identify and combine the truly relevant sources\.
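The coverage\-versus\-noise trade\-off can be made concrete with a small sketch of Recall@$K$ and Precision@$K$\. The exact definitions used for Table 3 are not spelled out in this section, so the sketch assumes the common entry\-level form \(hits divided by the number of gold items, and hits divided by $K$\); the function name and the toy ranking are hypothetical\.

```python
def recall_precision_at_k(ranked_ids, gold_ids, k):
    """Return (Recall@k, Precision@k) for one query.

    ranked_ids: retrieved entry ids, best first
    gold_ids:   ids of the gold evidence entries
    Assumes Recall@k = hits / |gold| and Precision@k = hits / k.
    """
    hits = len(set(ranked_ids[:k]) & set(gold_ids))
    recall = hits / len(gold_ids) if gold_ids else 0.0
    precision = hits / k
    return recall, precision


# Toy ranking: three gold entries scattered among ten retrieved candidates.
ranked = ["e2", "e7", "e1", "e9", "e5", "e3", "e8", "e4", "e6", "e0"]
gold = {"e1", "e2", "e3"}
for k in (2, 5, 10):
    r, p = recall_precision_at_k(ranked, gold, k)
    print(f"K={k:<2} recall={r:.2f} precision={p:.2f}")
```

In this toy ranking, recall rises as $K$ grows while precision falls, mirroring the first\-order pattern in Table 3\. If Precision@$K$ is indeed hits divided by $K$, a Precision@10 of 0\.1143 would correspond to roughly one relevant entry among the ten retrieved on average\.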
### 4\.6 RQ5: Failure Analysis
Figure 5: Error Diagnosis Categories
To better understand where current systems fail on SMMBench, we use GPT\-4\.1 as an LLM judge to diagnose 600 sampled error cases aggregated from the results of the above baselines, as shown in Figure [5](https://arxiv.org/html/2605.15710#S4.F5.1)\. The dominant pattern is still evidence access failure: most diagnosed errors arise because the system cannot surface the needed evidence from distributed memory\. Two additional patterns also stand out\. First, a non\-trivial portion of errors comes from evidence utilization rather than evidence access alone: even when some relevant information is available, agents still fail to correctly compose clues across sources or to prioritize updated evidence over stale or conflicting records\. Second, a smaller but clear portion of failures appears at the action stage, where models do not convert remembered information into correct executable outputs\.
Taken together, these diagnostics suggest that the main challenge of SMMBench is layered: systems first struggle to recover the right distributed evidence, and then continue to fail on cross\-source composition, temporal prioritization, and action grounding\. Details of the LLM\-judge setup are provided in Appendix [C](https://arxiv.org/html/2605.15710#A3)\. Failure cases are shown in Appendix [C\.6](https://arxiv.org/html/2605.15710#A3.SS6)\.
## 5 Conclusion
We introduced the *Source\-distributed Multimodal Memory Benchmark* \(SMMBench\), containing 1,877 samples and 5 types of tasks, to address an under\-evaluated challenge in multimodal agent memory: whether systems can retrieve, align, and compose multimodal evidence distributed across independently originated sources\. Experiments on representative memory\-style and retrieval\-based baselines show that current systems remain far from effective, with substantial room for improvement even for the strongest methods in a source\-distributed evaluation environment\. Overall, our findings suggest that progress in multimodal agent memory should be evaluated not only by performance within coherent or pre\-assembled contexts, but also by whether systems can retrieve and compose source\-distributed evidence into reliable answers and actions\.
## References
- \[1\] A\. Abdallah, M\. D\. Mounis, M\. Abdalla, M\. S\. Kasem, M\. F\. Senussi, M\. Mahmoud, M\. Ali, A\. Jatowt, and H\. Kang \(2026\) MM\-bright: a multi\-task multimodal benchmark for reasoning\-intensive retrieval\. External Links: 2601\.09562, [Link](https://arxiv.org/abs/2601.09562)\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p1.1)\.
- \[2\] Y\. Bei, T\. Wei, X\. Ning, Y\. Zhao, Z\. Liu, X\. Lin, Y\. Zhu, H\. Hamann, J\. He, and H\. Tong \(2026\) Mem\-gallery: benchmarking multimodal long\-term conversational memory for mllm agents\. arXiv preprint arXiv:2601\.03515\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.10.1), [§1](https://arxiv.org/html/2605.15710#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.15710#S2.SS1.p1.1)\.
- \[3\] J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\) BGE m3\-embedding: multi\-lingual, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\. External Links: 2402\.03216\. Cited by: [§C\.1](https://arxiv.org/html/2605.15710#A3.SS1.p2.2)\.
- \[4\] P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\) Mem0: building production\-ready ai agents with scalable long\-term memory\. arXiv preprint arXiv:2504\.19413\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[5\] K\. Dong, Y\. Chang, S\. Huang, Y\. Wang, R\. Tang, and Y\. Liu \(2025\) Benchmarking retrieval\-augmented multimodal generation for document question answering\. arXiv preprint arXiv:2505\.16470\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p2.1)\.
- \[6\] Y\. Du, K\. Jiang, Z\. Gao, C\. Shi, Z\. Zheng, S\. Qi, and Q\. Li \(2025\) Mmke\-bench: a multimodal editing benchmark for diverse visual knowledge\. arXiv preprint arXiv:2502\.19870\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p4.1)\.
- \[7\] F\. Duan, X\. Huang, and Z\. Wei \(2026\) LifeSim: long\-horizon user life simulator for personalized assistant evaluation\. arXiv preprint arXiv:2603\.12152\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p5.1)\.
- \[8\] Z\. Guo, Z\. Chen, X\. Nie, J\. Lin, Y\. Zhou, and W\. Zhang \(2026\) SkillProbe: security auditing for emerging agent skill marketplaces via multi\-agent collaboration\. External Links: 2603\.21019, [Link](https://arxiv.org/abs/2603.21019)\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p1.1)\.
- \[9\] Y\. Hu, Y\. Wang, and J\. McAuley \(2025\) Evaluating memory in llm agents via incremental multi\-turn interactions\. arXiv preprint arXiv:2507\.05257\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.3.1)\.
- \[10\] Y\. Jia, K\. Jiang, Y\. Liang, Q\. Ren, Y\. Xin, R\. Yang, F\. Feng, M\. Chen, H\. Lu, H\. Wang, X\. Qu, D\. Liu, L\. Cui, and Y\. Du \(2025\) Benchmarking multimodal knowledge conflict for large multimodal models\. ArXiv abs/2505\.19509\. External Links: [Link](https://api.semanticscholar.org/CorpusID:278904487)\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p4.1)\.
- \[11\] Z\. Jia, Q\. Liu, H\. Li, Y\. Chen, and J\. Liu \(2025\-07\) Evaluating the long\-term memory of large language models\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 19759–19777\. External Links: [Link](https://aclanthology.org/2025.findings-acl.1014/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1014), ISBN 979\-8\-89176\-256\-5\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.4.1)\.
- \[12\] J\. Kim, W\. Kim, W\. Park, and J\. Do \(2025\) MMPB: it’s time for multi\-modal personalization\. arXiv preprint arXiv:2509\.22820\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p5.1)\.
- \[13\] J\. Liu, Z\. Ling, S\. Qiu, Y\. Liu, S\. Han, P\. Xia, H\. Tu, Z\. Zheng, C\. Xie, C\. Fleming, M\. Ding, and H\. Yao \(2026\) Omni\-SimpleMem: autoresearch\-guided discovery of lifelong multimodal agent memory\. arXiv preprint arXiv:2604\.01007\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[14\] J\. Liu, Y\. Sun, W\. Cheng, H\. Lei, Y\. Chen, L\. Wen, X\. Yang, D\. Fu, P\. Cai, N\. Deng, et al\. \(2025\) Memverse: multimodal memory for lifelong learning agents\. arXiv preprint arXiv:2512\.03627\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[15\] P\. Liu, X\. Liu, R\. Yao, J\. Liu, S\. Meng, D\. Wang, and J\. Ma \(2025\) Hm\-rag: hierarchical multi\-agent multimodal retrieval augmented generation\. In Proceedings of the 33rd ACM international conference on multimedia, pp\. 2781–2790\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[16\] W\. Liu, J\. Qin, X\. Huang, X\. Zeng, Y\. Xi, J\. Lin, C\. Wu, Y\. Wang, L\. Shang, R\. Tang, D\. Lian, Y\. Yu, and W\. Zhang \(2026\) Position: the real barrier to llm agent usability is agentic roi\. External Links: 2505\.17767, [Link](https://arxiv.org/abs/2505.17767)\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p1.1)\.
- \[17\] Z\. Liu, X\. Zhu, T\. Zhou, X\. Zhang, X\. Yi, Y\. Yan, G\. Yu, and M\. Sun \(2025\) Benchmarking retrieval\-augmented generation in multi\-modal contexts\. Proceedings of the 33rd ACM International Conference on Multimedia\. External Links: [Link](https://api.semanticscholar.org/CorpusID:276575528)\. Cited by: [§2\.2](https://arxiv.org/html/2605.15710#S2.SS2.p1.1)\.
- \[18\] Z\. Liu, T\. Chu, Y\. Zang, X\. Wei, X\. Dong, P\. Zhang, Z\. Liang, Y\. Xiong, Y\. Qiao, D\. Lin, et al\. \(2024\) Mmdu: a multi\-turn multi\-image dialog understanding benchmark and instruction\-tuning dataset for lvlms\. Advances in Neural Information Processing Systems 37, pp\. 8698–8733\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.7.1)\.
- \[19\] A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\) Evaluating very long\-term conversational memory of llm agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 13851–13870\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.5.1)\.
- \[20\] A\. Masry, M\. S\. Islam, M\. Ahmed, A\. Bajaj, F\. Kabir, A\. Kartha, M\. T\. R\. Laskar, M\. Rahman, S\. Rahman, M\. Shahmohammadi, M\. Thakkar, Md\. R\. Parvez, E\. Hoque, and S\. R\. Joty \(2025\) ChartQAPro: a more diverse and challenging benchmark for chart question answering\. In Annual Meeting of the Association for Computational Linguistics\. External Links: [Link](https://api.semanticscholar.org/CorpusID:277626813)\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p2.1)\.
- \[21\] X\. Nie, Z\. Guo, Z\. Cui, J\. Yang, Z\. Chen, L\. De, Y\. Zhang, J\. Liao, B\. Huang, Y\. Yang, et al\. \(2026\) Holos: a web\-scale llm\-based multi\-agent system for the agentic web\. arXiv preprint arXiv:2604\.02334\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p1.1)\.
- \[22\] OpenAI \(2025\) Introducing GPT\-4\.1 in the api\. Note: [https://openai\.com/index/gpt\-4\-1/](https://openai.com/index/gpt-4-1/), Accessed: 2026\-05\-05\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p2.1)\.
- \[23\] C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\) MemGPT: towards llms as operating systems\. arXiv preprint arXiv:2310\.08560\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[24\] J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp\. 1–22\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[25\] N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. Advances in neural information processing systems 36, pp\. 8634–8652\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[26\] D\. Song, S\. Chen, G\. H\. Chen, F\. Yu, X\. Wan, and B\. Wang \(2024\) MileBench: benchmarking mllms in long context\. arXiv preprint arXiv:2404\.18532\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.15710#S2.SS1.p1.1)\.
- \[27\] R\. Tanaka, K\. Nishida, K\. Nishida, T\. Hasegawa, I\. Saito, and K\. Saito \(2023\) SlideVQA: a dataset for document visual question answering on multiple images\. ArXiv abs/2301\.04883\. External Links: [Link](https://api.semanticscholar.org/CorpusID:255749397)\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p2.1)\.
- \[28\] B\. Wang, X\. Liang, J\. Yang, H\. Huang, S\. Wu, P\. Wu, L\. Lu, Z\. Ma, and Z\. Li \(2023\) Enhancing large language model with self\-controlled memory framework\. arXiv preprint arXiv:2304\.13343\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[29\] H\. Wang, A\. Rangapur, X\. Xu, Y\. Liang, H\. Gharwi, C\. Yang, and K\. Shu \(2025\-01\) Piecing it all together: verifying multi\-hop multimodal claims\. In Proceedings of the 31st International Conference on Computational Linguistics, O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\), Abu Dhabi, UAE, pp\. 7453–7469\. External Links: [Link](https://aclanthology.org/2025.coling-main.498/)\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p3.1)\.
- \[30\] Q\. Wang, R\. Ding, Y\. Zeng, Z\. Chen, L\. Chen, S\. Wang, P\. Xie, F\. Huang, and F\. Zhao \(2025\) Vrag\-rl: empower vision\-perception\-based rag for visually rich information understanding via iterative reasoning with reinforcement learning\. arXiv preprint arXiv:2505\.22019\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[31\] X\. Wang, Y\. Zhou, X\. Liu, H\. Lu, Y\. Xu, F\. He, J\. Yoon, T\. Lu, G\. Bertasius, M\. Bansal, H\. Yao, and F\. Huang \(2024\) Mementos: a comprehensive benchmark for multimodal large language model reasoning over image sequences\. arXiv preprint arXiv:2401\.10529\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.6.1), [§1](https://arxiv.org/html/2605.15710#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.15710#S2.SS1.p1.1)\.
- \[32\] Y\. Wang and X\. Chen \(2025\) Mirix: multi\-agent memory system for llm\-based agents\. arXiv preprint arXiv:2507\.07957\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[33\] D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2024\) LongMemEval: benchmarking chat assistants on long\-term interactive memory\. arXiv preprint arXiv:2410\.10813\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.2.1)\.
- \[34\] S\. Xiao, Z\. Liu, P\. Zhang, and N\. Muennighoff \(2023\) C\-pack: packaged resources to advance general chinese embedding\. External Links: 2309\.07597\. Cited by: [§C\.1](https://arxiv.org/html/2605.15710#A3.SS1.p2.2)\.
- \[35\] D\. Xu, Z\. Yang, J\. Chen, Y\. Yuan, M\. Hu, L\. Sun, L\. Van Gool, D\. P\. Paudel, and C\. Feng \(2026\) MultiHaystack: benchmarking multimodal retrieval and reasoning over 40k images, videos, and documents\. arXiv preprint arXiv:2603\.05697\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.9.1), [§1](https://arxiv.org/html/2605.15710#S1.p1.1), [§2\.2](https://arxiv.org/html/2605.15710#S2.SS2.p1.1)\.
- \[36\] H\. Xue, F\. Tang, M\. Hu, Y\. Liu, Q\. Huang, Y\. Li, C\. Liu, Z\. Xu, C\. Zhang, C\. Feng, et al\. \(2025\) Mmrc: a large\-scale benchmark for understanding multimodal large language model in real\-world conversation\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 22477–22503\. Cited by: [Table 1](https://arxiv.org/html/2605.15710#S1.T1.5.8.1)\.
- \[37\] A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p2.1)\.
- \[38\] Y\. Yang, H\. Chai, S\. Shao, Y\. Song, S\. Qi, R\. Rui, and W\. Zhang \(2025\) AgentNet: decentralized evolutionary coordination for llm\-based multi\-agent systems\. External Links: 2504\.00587, [Link](https://arxiv.org/abs/2504.00587)\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p1.1)\.
- \[39\] W\. Yeo, K\. Kim, S\. Jeong, J\. Baek, and S\. J\. Hwang \(2025\) UniversalRAG: retrieval\-augmented generation over corpora of diverse modalities and granularities\. arXiv preprint arXiv:2504\.20734\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[40\] Z\. Zhang, Q\. Dai, X\. Chen, R\. Li, Z\. Li, and Z\. Dong \(2025\) MemEngine: a unified and modular library for developing advanced memory of llm\-based agents\. arXiv preprint arXiv:2505\.02099\. Cited by: [§4\.1](https://arxiv.org/html/2605.15710#S4.SS1.p1.1)\.
- \[41\] C\. Zhou, H\. Chai, W\. Chen, Z\. Guo, R\. Shan, Y\. Song, T\. Xu, Y\. Yang, A\. Yu, W\. Zhang, et al\. \(2026\) Externalization in llm agents: a unified review of memory, skills, protocols and harness engineering\. arXiv preprint arXiv:2604\.08224\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p1.1)\.
- \[42\] Y\. Zhou, M\. Zhao, Z\. Wang, D\. Gu, B\. Guo, R\. Ye, L\. Han, C\. Jin, and D\. N\. Metaxas \(2025\) M^3\-bench: multi\-modal, multi\-hop, multi\-threaded tool\-using mllm agent benchmark\. arXiv preprint arXiv:2511\.17729\. Cited by: [§B\.1](https://arxiv.org/html/2605.15710#A2.SS1.p6.1)\.
- \[43\] J\. Zhu, M\. Zhu, R\. Rui, R\. Shan, C\. Zheng, B\. Chen, Y\. Xi, J\. Lin, W\. Liu, R\. Tang, Y\. Yu, and W\. Zhang \(2025\) Evolutionary perspectives on the evaluation of llm\-based ai agents: a comprehensive survey\. External Links: 2506\.11102, [Link](https://arxiv.org/abs/2506.11102)\. Cited by: [§1](https://arxiv.org/html/2605.15710#S1.p1.1)\.
## Appendix A Problem Formulation
We formulate SMMBench as a memory\-grounded question answering and action prediction benchmark over *source\-distributed* multimodal evidence\. The key distinction from standard long\-context evaluation is that the relevant evidence is not merely far apart within one long input, but dispersed across multiple *independently originated sources* that must be identified and composed\.
#### Source Objects\.
A *source* in our benchmark is a coherent memory object with its own local context and organizational boundary, such as a group chat, a private thread, a profile page, a document, a table, an image, or another file\-like artifact\. Each source is internally meaningful on its own, but is not constructed to directly present the final answer for the benchmark query\. In this sense, a source is not just a chunk of a larger context; it is an independently originated object that may differ from other sources in participants, purpose, format, temporal role, and informational scope\.
Formally, each sample contains a set of sources
$$\mathcal{S}=\{S_{1},S_{2},\dots,S_{m}\}, \tag{7}$$
where each source $S_{i}$ consists of one or more evidence\-bearing items:
$$S_{i}=\{o_{i,1},o_{i,2},\dots,o_{i,n_{i}}\}. \tag{8}$$
Each item is represented as
$$o_{i,j}=\langle x_{i,j},s_{i},\tau_{i,j}\rangle, \tag{9}$$
where $x_{i,j}$ is the content of the item, $s_{i}$ is the source identity shared by all items in source $S_{i}$, and $\tau_{i,j}$ is its timestamp or local temporal position\. The content $x_{i,j}$ may include text, images, tables, document pages, or other forms of multimodal evidence\.
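As a rough illustration of Equations \(7\) to \(9\), the source and item objects could be represented with plain data classes\. This is a sketch under assumed names \(`Item`, `Source`, `add`\); the benchmark's released data format may differ\.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class Item:
    """One evidence-bearing item o_{i,j} = <x_{i,j}, s_i, tau_{i,j}>."""
    content: Any      # x_{i,j}: text, image reference, table, document page, ...
    source_id: str    # s_i: identity shared by all items of the same source
    timestamp: float  # tau_{i,j}: timestamp or local temporal position


@dataclass
class Source:
    """One source S_i: a coherent memory object holding its own items."""
    source_id: str
    items: List[Item] = field(default_factory=list)

    def add(self, content: Any, timestamp: float) -> Item:
        # Every item inherits the identity of the source it belongs to.
        item = Item(content, self.source_id, timestamp)
        self.items.append(item)
        return item


# A sample's source set S = {S_1, ..., S_m} is then just a list of sources.
chat = Source("group_chat")
chat.add("Let's finalize the budget on Friday.", 1.0)
doc = Source("project_doc")
doc.add("budget_table.png", 2.0)
source_set = [chat, doc]
```

Keeping `source_id` on every item means the source boundary survives even after items from different sources are merged into one observation stream\.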
#### What Counts as Source\-Distributed\.
We say that a benchmark sample is *source\-distributed* if its answer\-critical evidence cannot be recovered from a single source alone, and instead must be composed from evidence distributed across at least two independently originated sources\. Let $\mathcal{E}(q)$ denote the set of minimal evidence items required to answer question $q$\. A sample is source\-distributed if
$$\left|\{\,s(o)\mid o\in\mathcal{E}(q)\,\}\right|\geq 2, \tag{10}$$
that is, the required evidence spans at least two distinct source identities\. This criterion distinguishes our setting from ordinary long\-context reasoning within a single conversation or document, where evidence may be far apart but still belongs to one unified source\.
Under this definition, long\-range dependencies within one chat or one document are not sufficient by themselves to qualify as source\-distributed\. They remain challenging, but they are treated as *within\-source long\-context* difficulty rather than *cross\-source memory composition*\. By contrast, when one clue appears in a private chat, another in a project document, and a final update in a profile or image, the agent must reason across source boundaries, which is the target difficulty of SMMBench\.
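The criterion in Equation \(10\) amounts to counting distinct source identities in the minimal evidence set\. A minimal sketch, with a hypothetical function name and toy evidence represented as `(content, source_id)` pairs:

```python
def is_source_distributed(minimal_evidence):
    """Check Eq. (10): the minimal evidence spans at least two distinct sources.

    minimal_evidence: iterable of (content, source_id) pairs forming E(q).
    """
    return len({source_id for _, source_id in minimal_evidence}) >= 2


# Long-range dependency inside ONE chat: within-source, not source-distributed.
within_source = [
    ("turn 3: the deadline is Friday", "chat_a"),
    ("turn 97: the deadline moved to Monday", "chat_a"),
]
# Clues split across a chat and a document: source-distributed.
cross_source = [
    ("the deadline is Friday", "chat_a"),
    ("updated schedule table", "project_doc"),
]
```

The first example stays within one source no matter how far apart the turns are, so it fails the check; the second crosses a source boundary and passes\.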
#### Memory Construction and Query Answering\.
Given the source set $\mathcal{S}$, the agent incrementally observes items from these sources and maintains an external memory state $M_{t}$\. For conversational sources, the observations follow their turn order; for non\-conversational sources such as documents or images, the observations correspond to their associated source items\. We denote the overall observation stream as
$$\mathcal{O}=\{o_{1},o_{2},\ldots,o_{T}\}, \tag{11}$$
where each observation retains its source identity\. As each observation arrives, the memory is updated by
$$M_{t+1}=\Phi(M_{t},o_{t}), \tag{12}$$
where $\Phi$ is the memory update operator\.
After the full source environment has been ingested, the agent receives a question $q$ and retrieves relevant memory units:
$$M_{\mathrm{ret}}=R(M_{T},q). \tag{13}$$
The final answer is then generated as
$$y=G(q,M_{\mathrm{ret}}). \tag{14}$$
Under this formulation, success requires more than recalling isolated facts from a long input\. A successful system must \(1\) preserve source\-aware memory over heterogeneous source objects, \(2\) retrieve evidence spanning the right source boundaries, and \(3\) compose or reconcile these pieces into a coherent final answer or action\. In this sense, SMMBench evaluates *source\-distributed memory composition* rather than only long\-context multimodal recall\.
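The ingest, retrieve, and generate loop of Equations \(11\) to \(14\) can be sketched with toy stand\-ins for the three operators\. Everything here is an assumption for illustration: `update` \(for $\Phi$\) simply appends, `retrieve` \(for $R$\) ranks by word overlap, and `generate` \(for $G$\) concatenates the retrieved items; real baselines use learned retrievers and LLM generators\.

```python
def update(memory, observation):
    """Toy Phi: append the observation, keeping its source identity."""
    return memory + [observation]


def retrieve(memory, question, k=2):
    """Toy R: rank memory units by word overlap with the question."""
    q_tokens = set(question.lower().split())
    ranked = sorted(
        memory,
        key=lambda o: len(q_tokens & set(o["content"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate(question, retrieved):
    """Toy G: expose retrieved evidence with its source, instead of an LLM call."""
    return " | ".join(f'{o["source_id"]}: {o["content"]}' for o in retrieved)


# Observation stream O = {o_1, ..., o_T}; each observation keeps its source.
stream = [
    {"content": "the launch deadline is friday", "source_id": "private_chat", "t": 1},
    {"content": "lunch menu for today", "source_id": "group_chat", "t": 2},
    {"content": "launch deadline moved to monday", "source_id": "project_doc", "t": 3},
]

memory = []
for obs in stream:                 # M_{t+1} = Phi(M_t, o_t)
    memory = update(memory, obs)

question = "when is the launch deadline"
retrieved = retrieve(memory, question)   # M_ret = R(M, q)
answer = generate(question, retrieved)   # y = G(q, M_ret)
```

Even in this toy case the two relevant observations come from different sources and disagree, so a real generator would still need to reconcile the stale record with the update, which is the composition step the formulation emphasizes\.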
## Appendix B Benchmark Construction Details
### B\.1 Sources of Benchmark
All basic QA supervision in SMMBench is derived from public open\-source datasets and benchmarks with reusable images, documents, or textual annotations\. We group the sources by the target capability they primarily support\.
**Single\-Hop QA\.** Single\-hop samples are mainly collected from ChartQA\_Pro \([https://github\.com/vis\-nlp/ChartQAPro](https://github.com/vis-nlp/ChartQAPro)\) \[[20](https://arxiv.org/html/2605.15710#bib.bib14)\], SlideVQA \([https://github\.com/nttmdlab\-nlp/SlideVQA](https://github.com/nttmdlab-nlp/SlideVQA)\) \[[27](https://arxiv.org/html/2605.15710#bib.bib22)\], and MMDocRAG \([https://github\.com/MMDocRAG/MMDocRAG](https://github.com/MMDocRAG/MMDocRAG)\) \[[5](https://arxiv.org/html/2605.15710#bib.bib23)\]\. These sources provide strong multimodal supervision over charts, slides, and document pages, and naturally expose image\- and document\-centric evidence objects\. For SlideVQA and MMDocRAG, we prefer samples with multiple gold pages or multiple gold evidence items so that the final benchmark instance remains compatible with our multi\-source construction goal\.
**Multi\-Hop QA\.** Multi\-hop samples are mainly sourced from MMCV \([https://github\.com/mmcv\-dataset/MMCV](https://github.com/mmcv-dataset/MMCV)\) \[[29](https://arxiv.org/html/2605.15710#bib.bib24)\]\. We focus on subsets whose reasoning chain already spans multiple supporting items, which makes them suitable for later conversion into cross\-source memory questions\. In practice, we prioritize longer\-hop subsets because they better stress evidence aggregation after insertion into long\-running conversations\.
**Conflict Resolution\.** Conflict\-oriented samples are mainly sourced from MLLMKC \([https://github\.com/MLLMKCBENCH/MLLMKC](https://github.com/MLLMKCBENCH/MLLMKC)\) \[[10](https://arxiv.org/html/2605.15710#bib.bib25)\] and MMKE \([https://github\.com/MMKE\-Bench\-ICLR/MMKE\-Bench](https://github.com/MMKE-Bench-ICLR/MMKE-Bench)\) \[[6](https://arxiv.org/html/2605.15710#bib.bib26)\]\. These datasets are useful because they explicitly encode knowledge updates, contradictions, or stale\-vs\.\-new evidence\. We use them to construct settings in which a model must decide which memory item is valid after multiple partially conflicting mentions have been woven into the interaction history\.
**Preference Reasoning\.** Preference\-focused samples are mainly sourced from MMPB \([https://github\.com/AIDASLab/MMPB](https://github.com/AIDASLab/MMPB)\) \[[12](https://arxiv.org/html/2605.15710#bib.bib27)\] and LifeSim \([https://github\.com/dfy37/lifesim](https://github.com/dfy37/lifesim)\) \[[7](https://arxiv.org/html/2605.15710#bib.bib28)\]\. MMPB provides explicit preference statements together with question\-relevant multimodal context, while LifeSim is particularly useful for implicit preference signals that unfold over dialogue\. Their combination helps us cover both explicitly stored user constraints and preference information that must be inferred from recurring behavior\.
**Function Calling\.** Function\-calling samples are mainly sourced from M3Bench \([https://github\.com/EtaYang10th/Open\-M3\-Bench](https://github.com/EtaYang10th/Open-M3-Bench)\) \[[42](https://arxiv.org/html/2605.15710#bib.bib29)\]\. From these examples we extract the candidate tool set, the multimodal or file\-based operating context, and the gold function\-call output\. We only retain samples whose action target can be evaluated as a single grounded call rather than a long cascading workflow\.
In addition to QA supervision, we construct lightweight agent profiles and conversational identities from public persona\-style and profile\-style resources\. These materials are not used as evaluation targets themselves; instead, they provide the surrounding user and collaborator context required for long\-horizon conversation simulation\.
### B\.2 Open\-Source Dataset Preprocessing
Figure 6: Open\-Source Datasets of SMMBench
Before entering the three\-stage construction pipeline, all open\-source samples are normalized into a shared schema consisting of a question, a gold answer, and multiple evidence units annotated by source type and modality\. This preprocessing step is necessary because the raw sources expose supervision in heterogeneous forms: some use highlighted document spans, some point to images or pages, some provide structured preference statements, and others specify tool arguments or state variables\.
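The shared schema described above \(a question, a gold answer, and multiple evidence units annotated by source type and modality\) might look like the following sketch\. The class names, field names, and the raw record layout are assumptions for illustration, not the released data format\.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvidenceUnit:
    source_type: str  # e.g. "chat", "document", "table", "image"
    modality: str     # e.g. "text", "image"
    content: str      # raw text, or a path/reference to the artifact


@dataclass
class NormalizedSample:
    question: str
    gold_answer: str
    evidence: List[EvidenceUnit]


def normalize(raw):
    """Map one raw record (hypothetical dict layout) onto the shared schema."""
    evidence = [
        EvidenceUnit(e["source_type"], e["modality"], e["content"])
        for e in raw["evidence"]
    ]
    if len(evidence) < 2:
        # The schema expects multiple evidence units per sample.
        raise ValueError("expected multiple evidence units per sample")
    return NormalizedSample(raw["question"], raw["answer"], evidence)


raw_record = {
    "question": "Which quarter had the highest revenue under the stated condition?",
    "answer": "Q3",
    "evidence": [
        {"source_type": "chat", "modality": "text", "content": "Only count online sales."},
        {"source_type": "image", "modality": "image", "content": "charts/revenue.png"},
    ],
}
sample = normalize(raw_record)
```

A single schema of this kind lets later pipeline stages treat chart, document, preference, and function\-calling samples uniformly while still recording where each piece of evidence came from\.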
For ChartQA\_Pro, we ask a strong LLM to derive a concise textual condition from the original question, answer, and chart image\. The generated condition captures the intended interpretation angle or constraint without leaking the chart content verbatim\. We then rewrite the original QA pair so that solving it requires jointly using the textual condition and the chart itself, rather than exploiting surface shortcuts from either side alone\.
For MMDocRAG, we retain retrieved pages and golden evidence quotes, then partition long documents into smaller file\-like units aligned with the gold supporting regions\. This conversion makes the source more compatible with a memory benchmark in which evidence should appear as realistic files, pages, or attachments rather than as one monolithic context block\.
For LifeSim, we keep preference instances whose supervision is expressed implicitly across dialogue\. To make them compatible with multimodal evaluation, we replace some preference\-bearing entities or activities with corresponding image evidence and lightly rewrite the surrounding turns to avoid directly leaking the image content through text\. This turns implicit preference reasoning into a jointly text\-and\-image grounded task\.
For M3Bench, we retain single grounded function\-call instances and remove cases that require long cascaded execution chains\. We additionally synthesize short procedural or contextual textual evidence so that the final benchmark sample contains both task\-oriented action requirements and memory\-relevant descriptive context\.
For SlideVQA, MMCV, MLLMKC, MMKE, and MMPB, preprocessing is comparatively light\. Beyond filtering for samples with sufficiently rich evidence and converting the raw inputs into the unified schema, we preserve their native supervision with minimal rewriting\.
After task\-specific normalization, all samples pass through three shared quality\-control stages\. First, we generate distractor options for the multiple\-choice tasks; these distractors are designed to be semantically plausible and partially related to the evidence rather than trivial negatives\. Second, we run answer verification, where a strong LLM re\-solves the rewritten item under majority\-vote prompting to detect malformed or ambiguous samples\. Third, we perform evidence verification by masking one evidence unit at a time and retaining only samples whose answerability degrades as expected, which helps ensure that the retained evidence items contain genuinely necessary information\. The surviving pool is then passed to manual spot checking before conversation generation\.
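The answer\-verification and evidence\-verification stages can be sketched as a majority vote plus a leave\-one\-out masking check\. This is one possible reading of "answerability degrades as expected", namely that the full evidence yields the gold answer and masking any single unit breaks it; the paper's actual criterion may be softer\. The `solve` callable stands in for the LLM solver, and `toy_solve` is a hypothetical rule\-based substitute\.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among repeated solver runs."""
    return Counter(answers).most_common(1)[0][0]


def passes_evidence_check(solve, question, evidence, gold_answer):
    """Keep a sample only if every evidence unit is necessary.

    solve: callable (question, evidence_list) -> answer; stands in for an LLM.
    """
    if solve(question, evidence) != gold_answer:
        return False  # not answerable even with all evidence
    for i in range(len(evidence)):
        masked = evidence[:i] + evidence[i + 1:]  # mask one unit at a time
        if solve(question, masked) == gold_answer:
            return False  # unit i was not needed, so the sample is rejected
    return True


def toy_solve(question, evidence):
    # Hypothetical solver: needs both the deadline mention and the update.
    text = " ".join(evidence)
    return "monday" if "deadline" in text and "moved to monday" in text else "unknown"


question = "when is the deadline"
necessary = ["the deadline was friday", "it was moved to monday"]
redundant = ["the deadline was moved to monday", "lunch is at noon"]
```

Here `necessary` passes because removing either unit changes the answer, while `redundant` fails because its second unit can be masked without affecting the answer\.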
We construct each benchmark sample through a staged pipeline that explicitly separates evidence construction, conversation simulation, and evidence grounding\. This design is motivated by the goal of benchmarking source\-distributed, multimodal agent memory: the evaluation sample should not merely contain long conversations, but should also require the model to integrate evidence distributed across heterogeneous sources such as chats, images, tables, documents, and file\-like artifacts\. Accordingly, our construction process starts from question\-answer pairs and evidence collected from open\-source datasets and benchmarks, then grows a profile\-consistent long\-horizon conversation environment around them, and finally inserts the processed evidence back into the conversation stream with locally coherent transitions\.
### B\.3 QA Preprocessing
In this stage, we build the QA pool by collecting samples from a diverse set of public datasets and benchmarks spanning multi\-hop reasoning, conflict resolution, preference inference, document QA, and workflow\-oriented function calling\. Preprocessing at this stage is aimed at deriving, from these open\-source resources, a collection of controllable evaluation QA samples: each sample comprises a question\-answer pair, and multiple multimodal evidence items, which allows us to preserve answerability while substantially enriching the surrounding memory context\. Introductory descriptions of each constituent dataset and benchmark, together with the corresponding basic preprocessing steps, are provided in the appendix\.
We first normalize each sample sourced from the open\-source datasets and benchmarks\. Concretely, we reorganize the raw supporting information for every sample into unified evidence units, each annotated with its modality and corresponding multimodal content\. This preprocessing is essential because the original datasets expose supervision in very different forms: some provide textual snippets, some provide image references, some involve tables or schedules, some surface evidence through documents or file\-like artifacts typical of office and collaboration workflows, and some require state variables for function\-calling decisions\. By converting them into a common schema of multimodal memory objects, we can control how they are woven into long\-horizon conversations while still supporting well\-defined multi\-source evaluation in later pipeline stages\.
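A minimal sketch of what such a unified evidence unit could look like is given below. The class names, field names, and modality labels are hypothetical and chosen for illustration only; they are not taken from the released schema, and the example sample is a toy item.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical modality labels; the released data may use different names.
MODALITIES = {"text", "image", "table", "json", "document"}


@dataclass
class EvidenceUnit:
    unit_id: str
    modality: str  # one of MODALITIES
    content: str   # raw text, or a path/reference for non-text objects

    def __post_init__(self):
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality}")


@dataclass
class QASample:
    question: str
    answer: str
    evidence: List[EvidenceUnit] = field(default_factory=list)

    def modalities(self) -> List[str]:
        """Sorted list of distinct modalities that support this sample."""
        return sorted({e.modality for e in self.evidence})


# Toy sample, invented for illustration.
sample = QASample(
    question="Which room was finally booked?",
    answer="Room B",
    evidence=[
        EvidenceUnit("e1", "text", "Alice: the first room is unavailable."),
        EvidenceUnit("e2", "image", "files/booking_screenshot.png"),
    ],
)
```

Keeping the modality as an explicit field is what makes the later modality-level masking and source routing straightforward to express.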
After normalization, we apply modality\-aware verification and refinement\. We first filter out samples that remain answerable from a single evidence item or a single modality alone, since these are too simple for multi\-source, multimodal settings\. For the remaining samples, which contain multiple multimodal evidence items, we run a modality\-aware ablation test: we mask all evidence from one modality at a time, prompt a strong LLM to answer using only the unmasked modalities, and retain a sample in the high\-quality pool only if the model fails under every such mask; that is, each modality contributes information that cannot be compensated by the others\. Samples rejected by the initial screen are not necessarily discarded; we instead ask a strong LLM to rewrite the question and answer so that the pair does not trivially leak surface cues from the evidence, which recovers additional usable samples while reducing accidental shortcut solutions\.
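The modality-aware ablation test can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: `passes_modality_ablation` and `toy_solver` are hypothetical names, and the solver is a deterministic stand-in for the strong LLM.

```python
from typing import Callable, Dict, List

Evidence = Dict[str, str]  # e.g. {"modality": "image", "content": "..."}


def passes_modality_ablation(
    evidence: List[Evidence],
    answers_correctly: Callable[[List[Evidence]], bool],
) -> bool:
    """Return True only if the solver fails whenever any one modality is masked."""
    modalities = sorted({e["modality"] for e in evidence})
    if len(modalities) < 2:
        return False  # single-modality samples are screened out
    for masked in modalities:
        visible = [e for e in evidence if e["modality"] != masked]
        if answers_correctly(visible):
            return False  # the remaining modalities compensate for the masked one
    return True


# Toy stand-in for the LLM: it succeeds only when both image and text are visible.
def toy_solver(visible: List[Evidence]) -> bool:
    seen = {e["modality"] for e in visible}
    return {"image", "text"} <= seen


evidence = [
    {"modality": "image", "content": "chart.png"},
    {"modality": "text", "content": "The Q3 column is the one under discussion."},
]
keep = passes_modality_ablation(evidence, toy_solver)
```

In this toy case `keep` is true because masking either modality makes the solver fail; a sample that a text-only solver could still answer would be rejected.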
In addition, we prompt a strong LLM to generate lightweight structured metadata for each retained sample, including coarse topical descriptors and modality\-aware captions that briefly describe non\-textual evidence\. These fields support later filtering, inspection, and conditioning during long\-horizon dialogue synthesis without substituting for the underlying multimodal content\. The output of this stage is a curated pool of QA samples with verified multimodal evidence that can later be woven into simulated agent conversational environments\.
### B\.4 Conversational Source Synthesis
Once evidence\-augmented QA samples are ready, we build the conversational sources in which their evidence will eventually reside\. We assemble each pre\-configured cast into multiple group channels and private one\-to\-one threads, then simulate several independent multi\-turn dialogues in which those agents interact under their assigned profiles\. This yields a pool of long\-horizon chat trajectories that later serve as the sources for evidence insertion\.
Agent profiles are assembled from public persona\-style attributes, lightweight demographic or workplace\-style metadata, and sampled preference statements\. These profiles constrain later dialogue generation, but they are not intended to dominate the surface form of every utterance\. Instead, they act as soft behavioral anchors that keep recurring participants reasonably consistent over time\.
After the participants are fixed, we generate topic plans for each channel\. These topic plans are derived from evidence\-bearing QA samples selected for the current batch, but they are phrased as ordinary project, personal, scheduling, or collaboration discussions rather than as explicit question\-answering prompts\. Each topic usually unfolds over multiple turns and is connected to adjacent topics through short transition spans, which helps the final history read as an organic conversation rather than a sequence of isolated evidence dumps\.
After the participating agents and conversation channels for a cluster have been finalized, we randomly sample a handful of the high\-quality evidence\-augmented QA instances from the Stage 1 pool\. For each sample, we propose one or more high\-level conversation themes derived only from its coarse topical metadata and non\-evidence cues, explicitly avoiding any wording that would reveal the associated multimodal evidence or shortcut the eventual answer\. This separation is essential: conditioning dialogue synthesis on gold evidence too early yields unnaturally goal\-directed exchanges and erodes the meandering style typical of everyday chat\.
For each channel–theme assignment, we instruct agents to simulate ordinary multi\-turn interactions with diverse discourse intents—clarifying, questioning, supporting, disagreeing, elaborating, or pivoting the focus—and we modulate per\-turn response length to discourage rhythmic templates\. Throughout generation, the simulator may attend only to prior turns in the same channel, each agent’s profile\-based preferences and constraints, and the active theme; it is not given access to QA evidence\. The resulting long\-horizon trajectories therefore interleave task\-adjacent material with unrelated small talk, yielding a realistic conversational memory environment whose raw text and images still contain no direct leakage of the curated evidence objects\.
Conversation generation proceeds in batches\. Each batch instantiates a shared pool of agents, their lightweight profiles, and a set of candidate QA samples whose evidence will later be inserted into one or more channels\. We create multiple parallel channels per batch, including group\-like discussions and private exchanges, so that some agents naturally reappear across several sources\. This design lets a final evaluation item draw on evidence that is distributed over heterogeneous but partially overlapping histories\.
Utterances within a topic are generated turn by turn\. Before each turn, the selected speaker receives the recent local context, profile constraints, the current topic intent, and a discourse act sampled from a small inventory such as agreement, rebuttal, elaboration, asking for clarification, summarization, or topic shifting\. We also vary the expected utterance length\. This control mechanism improves diversity and reduces the risk that all speakers collapse into the same tone or interaction pattern\.
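The per-turn control described above could be assembled roughly as follows. This is only a sketch under stated assumptions: `plan_turn`, the `LENGTH_BUCKETS` values, the context window of six turns, and the toy profiles are all hypothetical, and the LLM call that would consume the plan is omitted.

```python
import random

# Discourse-act inventory paraphrased from the text above.
DISCOURSE_ACTS = [
    "agreement", "rebuttal", "elaboration",
    "clarification_request", "summarization", "topic_shift",
]
LENGTH_BUCKETS = ["short", "medium", "long"]  # hypothetical length control


def plan_turn(profiles, history, topic_intent, rng, context_window=6):
    """Assemble the conditioning for one turn; the generation call is omitted."""
    speaker = rng.choice(sorted(profiles))
    return {
        "speaker": speaker,
        "profile": profiles[speaker],
        "recent_context": history[-context_window:],
        "topic_intent": topic_intent,
        "discourse_act": rng.choice(DISCOURSE_ACTS),
        "target_length": rng.choice(LENGTH_BUCKETS),
    }


rng = random.Random(0)  # seeded so the sketch is reproducible
profiles = {"alice": "prefers concise updates", "bob": "detail-oriented planner"}
history = [f"turn {i}" for i in range(10)]
plan = plan_turn(profiles, history, "plan the team offsite", rng)
```

Sampling the discourse act and length independently of the speaker is one simple way to avoid every participant converging on the same interaction pattern.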
Evidence insertion is performed after the base dialogue skeleton is available\. Rather than letting agents freely choose when to reveal evidence, we first synthesize a short local scaffold around each evidence item, including its conversational motivation and a small amount of before/after bridging context\. We then insert the evidence object together with the scaffold into a temporally plausible location in one of the channels\. This post\-hoc insertion strategy gives us fine\-grained control over where evidence appears, how it is mentioned, and which other sources remain available as distractors or complementary memories\.
At the close of this stage, we hold a collection of long multi\-turn conversations produced by role\-playing the personalized agents across the configured channels\. None of these conversations yet contains QA evidence: all multimodal support objects remain outside the conversation stream until the following dedicated insertion\.
### B\.5 Source\-Aware Evidence Insertion
In this final stage, we weave preprocessed multimodal QA evidence into the long multi\-turn conversations so that the augmented sources read as organically continued conversations rather than stitched attachments\.
A core design choice is multi\-source dispersion: we deliberately route different evidence units to distinct interaction sources, such as separate group or private channels, documents or files\. For QA samples whose gold evidence encodes information updates \(superseding or conflicting facts over time\), we further constrain insertion timestamps so that each newer evidence item is scheduled at least a minimum number of conversational turns after the older material it revises, which keeps chronologically fresh content strictly downstream of stale records\. For other evidence, we still stagger timestamps so that supporting material never collapses into a single tidy exposition block\. Once an interaction source has been chosen, we resolve fine\-grained assignment by prompting a strong LLM with chunked excerpts of the host material together with the candidate evidence; the model proposes the insertion offset \(e\.g\., a turn index for chat logs or a paragraph anchor for documents and files\)\.
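The minimum-gap constraint for superseding evidence can be illustrated with a small scheduling helper. The text does not state the actual gap, so `min_gap` is a free parameter here, and the function names and the evenly spaced placement are assumptions made for the sketch.

```python
def schedule_update_chain(first_turn, num_versions, min_gap, history_len):
    """Assign turn indices to successive versions of a fact, oldest first.

    Each newer version is placed exactly `min_gap` turns after the previous
    one, which is the tightest placement that satisfies the constraint.
    """
    positions = [first_turn + i * min_gap for i in range(num_versions)]
    if positions and positions[-1] >= history_len:
        raise ValueError("history too short for the requested spacing")
    return positions


def respects_min_gap(positions, min_gap):
    """Check that every newer item sits at least `min_gap` turns downstream."""
    return all(b - a >= min_gap for a, b in zip(positions, positions[1:]))


chain = schedule_update_chain(first_turn=10, num_versions=3, min_gap=25, history_len=200)
```

A check such as `respects_min_gap` can also be applied to offsets proposed by the LLM, so that a proposal violating the temporal order is rejected before insertion.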
Rather than committing each evidence unit to an arbitrary turn, we first synthesize a compact, evidence\-centered textual micro\-thread: a localized multi\-turn aside that situates the forthcoming artifact and its pragmatic stakes while withholding answer\-bearing surface cues\. These micro\-threads function as insertion scaffolds, supplying register\-matched dialogue through which multimodal evidence can enter the stream without clashing with topical momentum or channel\-specific speaking style\. After candidate anchors are screened for temporal plausibility, discourse coherence, and source diversity, we write both the evidence object and its adjoining scaffold into the history, then lightly generate a short span of bridging turns immediately before and after the inserted block so that the augmented segment blends into the pre\-existing trajectory rather than reading as a pasted fragment\. Taken together, this post\-hoc scaffolding affords fine\-grained control over where and in what conversational guise each multimodal item appears, while sustaining the phenomenology of an ongoing interaction\.
Through this three\-stage pipeline, each final benchmark sample contains \(i\) a question derived from curated open\-source supervision, \(ii\) evidence distributed across multiple sources and modalities, and \(iii\) a profile\-consistent long\-context interaction history into which that evidence has been smoothly woven\. This construction protocol is well aligned with our benchmark objective: evaluating whether an agent can use heterogeneous multi\-source multimodal memories that emerge from realistic long\-running interactions rather than from a single explicitly organized context\.
### B\.6 Data Quality Control
| Source | Stage | \#Input | \#Output | Main Reason |
| --- | --- | --- | --- | --- |
| ChartQA\_Pro | Correctness | 1948 | 1048 | Non\-unique answer |
| | Evidence | 1048 | 415 | Redundant evidence |
| SlideVQA | Correctness | 454 | 388 | Incorrect answer |
| | Evidence | 388 | 37 | Redundant evidence |
| MMDocRAG | Correctness | 419 | 311 | Incorrect answer |
| | Evidence | 311 | 120 | Redundant evidence |
| MMCV | Correctness | 1687 | 290 | Incorrect answer |
| | Evidence | 290 | 144 | Redundant evidence |
| MMKE | Correctness | 1910 | 545 | Non\-derivable answer |
| | Evidence | 545 | 231 | Redundant evidence |
| MLLMKC | Correctness | 520 | 518 | Incorrect answer |
| | Evidence | 518 | 241 | Redundant evidence |
| MMPB | Correctness | 9827 | 6190 | Incorrect / non\-derivable answer |
| | Evidence | 6190 | 548 | Redundant evidence |
| LifeSim | Correctness | 119 | 38 | Non\-derivable answer |
| | Evidence | 38 | 33 | Redundant evidence |
| M3\_Bench | Correctness | 231 | 126 | Non\-unique answer |
| | Evidence | 126 | 108 | Redundant evidence |

Table 4: Dataset construction funnel for data quality control\. Each source benchmark first undergoes correctness verification and evidence necessity verification before entering the final curated pool\.

To ensure correctness, answerability, evidence dependency, and environmental naturalness, we adopt a two\-stage data quality control pipeline: \(*i*\) open\-source QA preprocessing and \(*ii*\) post\-insertion validation\. Table [4](https://arxiv.org/html/2605.15710#A2.T4) summarizes the corresponding construction funnel\. Across the source benchmarks used in this work, we begin with 17,115 raw candidate QA instances, retain 9,424 after correctness verification, and finally keep 1,877 after evidence necessity verification\. The first stage therefore removes incorrect, unanswerable, non\-unique, or weakly grounded QA candidates, while the second stage verifies that inserted evidence and its local scaffold are naturally integrated into the target conversational or memory environment, without introducing obvious artifacts, semantic inconsistencies, or shortcut leakage\.
Together, these procedures are designed to ensure that the final retained samples are not only answer\-correct and evidence\-dependent, but also natural and evaluable in a multi\-source multimodal memory setting\.
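The funnel totals can be recomputed from the per-source rows of Table 4, as in the short check below. The dictionary simply transcribes the table; the variable names are ours. The raw and final totals reproduce the 17,115 and 1,877 quoted above, while the intermediate column as printed sums to 9,454 rather than the 9,424 quoted in the text.

```python
# (raw input, after correctness verification, after evidence verification)
FUNNEL = {
    "ChartQA_Pro": (1948, 1048, 415),
    "SlideVQA": (454, 388, 37),
    "MMDocRAG": (419, 311, 120),
    "MMCV": (1687, 290, 144),
    "MMKE": (1910, 545, 231),
    "MLLMKC": (520, 518, 241),
    "MMPB": (9827, 6190, 548),
    "LifeSim": (119, 38, 33),
    "M3_Bench": (231, 126, 108),
}

raw_total = sum(raw for raw, _, _ in FUNNEL.values())
# Sum of the per-row "after correctness" values exactly as printed in Table 4.
after_correctness = sum(mid for _, mid, _ in FUNNEL.values())
final_total = sum(final for _, _, final in FUNNEL.values())

# Fraction of raw candidates that survive both filters, per source.
retention = {name: final / raw for name, (raw, _, final) in FUNNEL.items()}
overall_retention = final_total / raw_total
```

Overall, roughly 11% of the raw candidates survive both filters, and the per-source `retention` values show how unevenly the two stages prune the different benchmarks.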
In the preprocessing stage, we first perform LLM\-based correctness verification on all candidate QA instances\. For each sample, we provide the evidence, question, and gold answer to a strong verifier model \(gpt\-4\.1\) and ask whether the answer is *correct and derivable* from the given evidence\. We define a positive judgment as “the answer is correct and derivable” and a negative judgment as “the answer is incorrect or not derivable\.” The verification prompt also explicitly asks the model to consider answer uniqueness and decisiveness, so as to exclude cases with obvious ambiguity, multiple plausible answers, or insufficiently constrained evidence\. To reduce single\-run variance, each sample is verified three times independently, and we retain it only if at least two runs return positive judgments under majority voting\. As shown in Table [4](https://arxiv.org/html/2605.15710#A2.T4), the main reasons for removal at this stage are incorrect answers, non\-derivable answers, and non\-unique answers, depending on the source benchmark\.
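The three-run majority-vote rule can be written compactly as below. This is a sketch rather than the authors' implementation: `majority_vote_verify` and `make_stub` are hypothetical names, and the stub replays fixed judgments in place of real gpt-4.1 calls.

```python
from itertools import cycle


def majority_vote_verify(verify_once, sample, runs=3, min_positive=2):
    """Keep a sample only if at least `min_positive` of `runs` judgments are positive."""
    positives = sum(1 for _ in range(runs) if verify_once(sample))
    return positives >= min_positive


def make_stub(outcomes):
    """Toy verifier that replays a fixed sequence of judgments (no LLM involved)."""
    replay = cycle(outcomes)
    return lambda sample: next(replay)


# Toy sample, invented for illustration.
sample = {"question": "toy question", "answer": "toy answer", "evidence": ["e1", "e2"]}
kept = majority_vote_verify(make_stub([True, False, True]), sample)
dropped = majority_vote_verify(make_stub([False, True, False]), sample)
```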
We then conduct evidence necessity verification\. For each sample that passes correctness checking, we iteratively mask one evidence item at a time and submit the remaining evidence, together with the original question and answer, to gpt\-4\.1 under the same three\-run majority\-vote protocol\. We retain a sample only if masking any single evidence item makes the answer no longer stably verifiable, i\.e\., at least half of the verification runs become negative under that ablation\. This step removes instances with clearly redundant evidence or instances that can be solved from only a subset of the annotated support, and preferentially keeps samples in which each evidence item contributes materially to the final answer\. The dominant removal reason at this stage is redundant or weakly grounded evidence, as also summarized in Table [4](https://arxiv.org/html/2605.15710#A2.T4)\.
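The leave-one-out check described above can be sketched as follows, assuming a verifier callable that returns a boolean judgment for a given evidence subset. The function name and the deterministic toy verifiers are illustrative assumptions; in the described pipeline the judgment comes from gpt-4.1.

```python
def evidence_is_necessary(evidence, verify_once, runs=3):
    """Return True only if masking any single item breaks verifiability.

    An ablation counts as "broken" when at least half of the `runs`
    verification calls on the remaining evidence come back negative.
    """
    for i in range(len(evidence)):
        remaining = evidence[:i] + evidence[i + 1:]
        negatives = sum(1 for _ in range(runs) if not verify_once(remaining))
        if 2 * negatives < runs:
            return False  # answer still stably verifiable without item i
    return True


# Toy verifiers: the answer is derivable only if all required items are present.
def needs_all(remaining):
    return {"e1", "e2", "e3"} <= set(remaining)


def needs_two(remaining):
    return {"e1", "e2"} <= set(remaining)


evidence = ["e1", "e2", "e3"]
all_needed = evidence_is_necessary(evidence, needs_all)
has_redundant_item = not evidence_is_necessary(evidence, needs_two)
```

With three runs, "at least half" means two or more negative judgments, which mirrors the majority rule used in the correctness stage.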
After automatic filtering, we further perform human spot\-check review on the retained candidates\. Two expert annotators independently inspect a 10% sample from each source benchmark and verify correctness, derivability, evidence dependency, and answerability\. Concretely, the review checks whether \(*i*\) the question and gold answer are correctly matched, \(*ii*\) the answer can be reasonably derived from the annotated evidence, and \(*iii*\) removing a key evidence item materially weakens answerability\. The reviewers also inspect whether the sample contains ambiguity, multiple plausible answers, or shortcut cues that make the answer recoverable without the intended evidence\. A sample is counted as passing only if it satisfies these criteria\. This first\-stage human audit achieves a pass rate of 96\.3%, providing additional support that the automatically retained samples are generally well\-formed and properly grounded\.
After evidence insertion, we conduct a second round of human post\-insertion review to assess whether the inserted evidence and its scaffold are naturally integrated into the target memory environment\. Two expert annotators inspect 5% of the post\-insertion samples, focusing on potential evidence leakage, contextual coherence, speaker and temporal consistency, and naturalness of local tone and phrasing\. In particular, the review checks whether inserted content introduces shortcut cues, abrupt factual insertions, template\-like stitching, or anomalous content blocks unrelated to the surrounding context\. We also explicitly inspect whether captions of multimodal evidence reveal the answer too directly, so that captionized inputs do not create artificial shortcut solutions unavailable in the original source object\. The post\-insertion audit yields a pass rate of 92\.8%, suggesting that most inserted evidence and scaffolds remain natural and contextually compatible after integration\.
Our leakage and ambiguity checks are therefore carried out in two complementary ways\. At the preprocessing stage, ambiguity is controlled through the correctness\-verification prompt, which explicitly rejects non\-unique or insufficiently constrained answers, and through human review, which checks for multiple plausible answers or missing evidence support\. At the post\-insertion stage, leakage is controlled by manually inspecting whether local scaffold text, nearby turns, or generated captions reveal the answer more directly than intended, or whether they introduce shortcut cues that bypass the intended memory composition process\.
### B\.7 Benchmark Statistics
We present additional statistics in this section\.
Figure 7: Distribution of conversation lengths across conversational sources in SMMBench\. Most conversation sources fall into either short \(1–250\) or long \(1001–1500\) ranges, indicating substantial variation in source length and context density\.

The distribution of conversation lengths in Figure [7](https://arxiv.org/html/2605.15710#A2.F7) shows that conversational sources in the benchmark are not concentrated within a single length range, but instead cover both relatively short and relatively long contexts\. In particular, the 1–250 and 1001–1500 ranges account for a large portion of the conversation sources, indicating that the dataset includes both compact local dialogues and long conversations with denser information and stronger cross\-turn dependencies\. This distribution has two implications\. First, the benchmark does not reduce source\-distributed memory to evidence retrieval in only short conversations\. Second, even when an individual source is already long, the core challenge is still not merely long\-context reading, but locating, relating, and integrating evidence across multiple independent sources\.
Figure 8: Distribution of the number of supporting evidence items per sample\. Most samples require two or three evidence items, while a smaller but non\-negligible subset requires four or more, forming a long\-tailed composition difficulty\.

Figure 9: Distribution of modality combinations in SMMBench\. Image\+text samples dominate the benchmark, while other combinations such as table\+text, image\-only, JSON\-based, and mixed multimodal settings provide additional diversity\. The y\-axis is shown in log scale\.

The distribution of the number of supporting evidence items per sample in Figure [8](https://arxiv.org/html/2605.15710#A2.F8) shows that most samples require two or three pieces of supporting evidence to answer correctly\. At the same time, there is a clear long tail, where a smaller subset of samples requires four or more evidence items\. This pattern suggests that the dominant difficulty in SMMBench is not single\-fact extraction, but multi\-evidence composition\. More importantly, when these pieces of evidence are further distributed across different sources, agent memory must do more than identify a single relevant clue: it must cover all necessary evidence and compose it at answer time\. From a statistical perspective, this result supports the central design of the benchmark, namely that the main difficulty comes from the need to combine multiple local clues rather than solving the problem from a single source or a single evidence item\.
The distribution of modality combinations in Figure [9](https://arxiv.org/html/2605.15710#A2.F9) reveals a head\-heavy but diverse pattern in the benchmark\. Image\+text is by far the most dominant modality combination, indicating that joint visual\-language understanding forms the main scenario in the dataset\. At the same time, combinations such as table\+text, image, json, json\+text, and image\+json are also explicitly represented\. This distribution helps prevent the benchmark from being dominated by only one modality structure\. On the one hand, the dominant combinations ensure that the benchmark remains grounded in common multimodal agent interaction settings\. On the other hand, the tail of modality combinations introduces more heterogeneous source forms, requiring models to adapt to different information carriers and compose evidence across them\. From the perspective of source\-distributed memory, such modality diversity further increases the complexity of cross\-source composition, because different sources are not only independent, but may also express complementary evidence in different modalities\.
## Appendix C Additional Experiments and Analysis
### C\.1 Experiment Details
In the main experiments, all baselines are evaluated under a unified outer\-loop protocol\. For every sample in SMMBench, we first prepare the memory environment in the format expected by the baseline, then issue the task query, collect the model output, and finally score it against the gold target\.
For retrieval\-oriented baselines, we serialize the memory environment into retrievable units\. Depending on the baseline, these units may be chunked messages, document pages, image captions, or multimodal records\. In our implementation, all retrieval experiments use the same embedding models, namely bge\-large\-en\-v1\.5 \([https://huggingface\.co/BAAI/bge\-large\-en\-v1\.5](https://huggingface.co/BAAI/bge-large-en-v1.5)\) \[[34](https://arxiv.org/html/2605.15710#bib.bib58)\] and bge\-m3 \([https://huggingface\.co/BAAI/bge\-m3](https://huggingface.co/BAAI/bge-m3)\) \[[3](https://arxiv.org/html/2605.15710#bib.bib59)\]\. Unless otherwise specified, the retriever selects the top\-$K$ candidates with $K=20$, which is also the setting used for all main\-table results\. For text\-only or Text\+Caption settings, non\-textual objects are replaced by captions generated in advance; for native multimodal settings, the original multimodal objects are preserved whenever the baseline supports them\.
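The top-K selection step can be illustrated with plain cosine similarity over precomputed vectors. This sketch assumes cosine scoring and uses tiny hand-written vectors as placeholders for real embeddings; the unit identifiers and function names are hypothetical, and no embedding model is loaded.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, unit_vecs, k=20):
    """Return the ids of the k units most similar to the query, best first."""
    ranked = sorted(
        unit_vecs,
        key=lambda uid: cosine(query_vec, unit_vecs[uid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings" standing in for real model outputs.
units = {
    "chat_chunk_3": [0.9, 0.1, 0.0],
    "doc_page_7": [0.1, 0.9, 0.0],
    "image_caption_2": [0.7, 0.7, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], units, k=2)
```

The default `k=20` matches the main-table setting; with fewer than `k` units in the store, the function simply returns all of them in ranked order.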
For memory\-style baselines, we first replay the conversation history, files, and evidence objects into the model\-specific memory interface\. Some methods store memory as textual notes or summaries, some maintain structured records, and some combine persistent memory with later recall modules\. For MIRIX, MemGPT, Mem0, and MemVerse, conversational memory is chunked into blocks of 12 dialogue turns with at most 2 images, and each chunk is further capped at 64,000 characters\. Once the environment has been written, we ask the downstream question and record the model output under the baseline’s native memory\-access mechanism\.
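One way to realize the chunking limits above is a greedy pass over the turns. The limits (12 turns, 2 images, 64,000 characters) come from the text, but the closing policy is an assumption of this sketch: a chunk is closed as soon as adding the next turn would exceed any limit, and a single oversized turn is kept as its own chunk rather than truncated. The turn format with `text` and `images` keys is also hypothetical.

```python
def chunk_turns(turns, max_turns=12, max_images=2, max_chars=64_000):
    """Greedily group dialogue turns into chunks that respect all three limits."""
    chunks, current, images, chars = [], [], 0, 0
    for turn in turns:
        turn_images = len(turn.get("images", []))
        turn_chars = len(turn["text"])
        would_overflow = (
            len(current) >= max_turns
            or images + turn_images > max_images
            or chars + turn_chars > max_chars
        )
        if current and would_overflow:
            chunks.append(current)
            current, images, chars = [], 0, 0
        current.append(turn)
        images += turn_images
        chars += turn_chars
    if current:
        chunks.append(current)
    return chunks


text_only = [{"text": "hi", "images": []} for _ in range(30)]
image_heavy = [{"text": "see attached", "images": ["img.png"]} for _ in range(5)]
sizes_text = [len(c) for c in chunk_turns(text_only)]
sizes_image = [len(c) for c in chunk_turns(image_heavy)]
```

In the toy runs, 30 text-only turns split by the turn limit alone, while five single-image turns split earlier because the image limit binds first.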
To reduce prompt\-induced variance, we keep the evaluation instruction templates unified across methods at the answering stage\. Whether a baseline answers from retrieved context concatenated into a reader prompt or returns an answer through its client\-side memory interface, the final task prompt is kept the same in content and output\-format constraints\. The prompt specifies the task type, the answer format, and any restrictions such as returning only the multiple\-choice option or only the final function call\. Sample prompts are provided in Appendix[E](https://arxiv.org/html/2605.15710#A5)\.
For function\-calling evaluation, we parse model outputs by direct JSON decoding and compare the resulting structured call against the gold target under exact match\. Outputs that fail JSON parsing are counted as incorrect\. For captionized settings, all non\-text evidence captions are generated uniformly with gpt\-4\.1; the caption prompts are listed later in Appendix [E](https://arxiv.org/html/2605.15710#A5)\. The embedding\-based components of our experiments were run on NVIDIA A100 80GB GPUs\. Beyond the experiments reported in the paper, we did not use substantial additional compute resources for large\-scale extra runs\.
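The function-call scoring rule can be sketched as below. The gold-call layout with `name` and `arguments` keys and the `set_alarm` example are hypothetical; the text only specifies direct JSON decoding followed by exact match, with parse failures scored as incorrect.

```python
import json


def score_function_call(model_output, gold_call):
    """Return 1 if the output decodes to exactly the gold call, else 0."""
    try:
        predicted = json.loads(model_output)
    except (json.JSONDecodeError, TypeError):
        return 0  # unparseable outputs count as incorrect
    return int(predicted == gold_call)


# Toy gold call, invented for illustration.
gold = {"name": "set_alarm", "arguments": {"time": "07:30", "repeat": "daily"}}

exact = score_function_call(
    '{"name": "set_alarm", "arguments": {"time": "07:30", "repeat": "daily"}}', gold
)
reordered = score_function_call(
    '{"arguments": {"repeat": "daily", "time": "07:30"}, "name": "set_alarm"}', gold
)
wrong_arg = score_function_call(
    '{"name": "set_alarm", "arguments": {"time": "08:30", "repeat": "daily"}}', gold
)
with_prose = score_function_call('Sure! {"name": "set_alarm"}', gold)
```

Comparing decoded objects rather than raw strings makes the match insensitive to key order and whitespace, while any surrounding prose still causes a parse failure and therefore a score of 0.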
Unless otherwise noted, all reported results are obtained with the decoding temperature set to 0\.01\. Each evaluation is run three times independently, and we report the mean score across the three runs to reduce sampling noise from the backbone model and generation pipeline\. The same scoring protocol is used for both open\-book and retrieval\-based settings\.
### C\.2 Main Results
Table 5: Main results on SMMBench\. S\.H\., M\.H\., C\.R\., P\.R\., and F\.C\. denote Single\-Hop QA, Multi\-Hop QA, Conflict Resolution, Preference Reasoning, and Function Call\. All scores are averaged over 3 runs; standard deviations are shown for S\.H\., M\.H\., C\.R\., and P\.R\. The Overall column reports the unweighted average score\.

| Baseline | S\.H\. | M\.H\. | C\.R\. | P\.R\. | F\.C\. | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Short\-Term Mem\. | 0\.2990 ± 0\.0139 | 0\.2292 ± 0\.0199 | 0\.2246 ± 0\.1648 | 0\.2702 ± 0\.0097 | 0\.0093 | 0\.2064 |
| Reflexion Mem\. | 0\.5385 ± 0\.0108 | 0\.4861 ± 0\.0096 | 0\.3962 ± 0\.0046 | 0\.3064 ± 0\.0028 | 0\.0278 | 0\.3510 |
| Gen\. Agent Mem\. | 0\.3164 ± 0\.0107 | 0\.2708 ± 0\.0170 | 0\.2754 ± 0\.0209 | 0\.2238 ± 0\.0106 | 0\.0185 | 0\.2210 |
| Self Controlled Mem\. | 0\.3129 ± 0\.0133 | 0\.3403 ± 0\.0196 | 0\.2881 ± 0\.0154 | 0\.2203 ± 0\.0104 | 0\.0185 | 0\.2360 |
| MIRIX \(T\+C\) | 0\.3601 ± 0\.0100 | 0\.2569 ± 0\.0284 | 0\.3072 ± 0\.0017 | 0\.2960 ± 0\.0134 | 0\.0278 | 0\.2496 |
| MemGPT \(T\+C\) | 0\.5734 ± 0\.0108 | 0\.4722 ± 0\.0118 | 0\.3326 ± 0\.0108 | 0\.4389 ± 0\.0142 | 0\.0370 | 0\.3708 |
| MemVerse | 0\.3549 ± 0\.0057 | 0\.3264 ± 0\.0173 | 0\.2691 ± 0\.0170 | 0\.2754 ± 0\.0205 | 0\.0093 | 0\.2470 |
| Mem0 | 0\.7430 ± 0\.0033 | 0\.5903 ± 0\.0087 | 0\.3665 ± 0\.0210 | 0\.2909 ± 0\.0050 | 0\.0398 | 0\.4061 |
| Native RAG | 0\.7797 ± 0\.0046 | 0\.6667 ± 0\.0087 | 0\.4068 ± 0\.0167 | 0\.3683 ± 0\.0150 | 0\.0741 | 0\.4591 |
| HMRAG | 0\.8129 ± 0\.0014 | 0\.7153 ± 0\.0000 | 0\.5191 ± 0\.0069 | 0\.3081 ± 0\.0008 | 0\.1111 | 0\.4933 |
| MemGPT \(MM\) | 0\.3497 ± 0\.0021 | 0\.5139 ± 0\.0100 | 0\.2691 ± 0\.0115 | 0\.1997 ± 0\.0123 | 0\.0241 | 0\.2713 |
| MIRIX \(MM\) | 0\.5507 ± 0\.0079 | 0\.3472 ± 0\.0147 | 0\.4025 ± 0\.0020 | 0\.2031 ± 0\.0061 | 0\.0769 | 0\.3161 |
| OmniSimpleMem | 0\.2311 ± 0\.0014 | 0\.0375 ± 0\.0065 | 0\.2153 ± 0\.0036 | 0\.1840 ± 0\.0125 | 0\.0185 | 0\.1373 |
| UniversalRAG | 0\.3322 ± 0\.0109 | 0\.3403 ± 0\.0017 | 0\.2606 ± 0\.0134 | 0\.1566 ± 0\.0064 | 0\.0370 | 0\.2253 |
| VRAG | 0\.4913 ± 0\.0141 | 0\.4514 ± 0\.0118 | 0\.3919 ± 0\.0185 | 0\.1842 ± 0\.0077 | 0\.0463 | 0\.3130 |
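As a quick consistency check, the Overall column can be recomputed as the unweighted mean of the five task scores. The snippet below transcribes four rows of Table 5 (task scores in the order S.H., M.H., C.R., P.R., F.C.); the variable and function names are ours.

```python
# name -> ([S.H., M.H., C.R., P.R., F.C.], reported Overall), copied from Table 5
ROWS = {
    "HMRAG": ([0.8129, 0.7153, 0.5191, 0.3081, 0.1111], 0.4933),
    "Native RAG": ([0.7797, 0.6667, 0.4068, 0.3683, 0.0741], 0.4591),
    "Mem0": ([0.7430, 0.5903, 0.3665, 0.2909, 0.0398], 0.4061),
    "MemGPT (T+C)": ([0.5734, 0.4722, 0.3326, 0.4389, 0.0370], 0.3708),
}


def overall(task_scores):
    """Unweighted average across the five task families."""
    return sum(task_scores) / len(task_scores)


# Absolute difference between the recomputed and the reported Overall score.
deviations = {
    name: abs(overall(scores) - reported)
    for name, (scores, reported) in ROWS.items()
}
```

All four recomputed values agree with the reported Overall scores to within rounding at the fourth decimal place, so each task family carries equal weight regardless of how many samples it contains.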
#### Existing Methods Still Struggle on Source\-Distributed, Multimodal Memory Tasks\.
The reference baselines again show that the benchmark is solvable in principle but far from solved in practice\. On the one hand, the Golden Evidence Baseline reaches 0\.7473 Overall when the model is given the gold supporting evidence directly, confirming that the answer space itself is learnable once the required evidence is made available\. On the other hand, realistic systems remain substantially below this reference condition\. The strongest benchmarked baseline, HMRAG, reaches only 0\.4933 Overall, leaving a gap of 0\.2540 to the gold\-evidence setting\. This gap is large enough to suggest that the main difficulty does not lie in answer existence alone, but in whether a system can successfully retrieve, preserve, align, and compose the needed evidence under realistic source\-distributed access\.
The task\-wise results further show that current methods are not uniformly weak in the same way, but instead exhibit clear capability imbalance across task families\. In the Text\+Caption setting, Mem0 is quite strong on factual QA, reaching 0\.7430 on S\.H\. and 0\.5903 on M\.H\., yet it drops to 0\.2909 on P\.R\. and only 0\.0398 on F\.C\. Native RAG and HMRAG also perform relatively well on S\.H\. and M\.H\. \(e\.g\., 0\.7797/0\.6667 for Native RAG and 0\.8129/0\.7153 for HMRAG\), but remain much weaker on F\.C\. at 0\.0741 and 0\.1111\. A similar pattern appears in the MM setting: MIRIX reaches 0\.5507 on S\.H\. and 0\.4025 on C\.R\., but still only 0\.0769 on F\.C\. These differences indicate that no single system is consistently strong across all task families, and that source\-distributed multimodal memory remains a broad, unsolved challenge rather than a narrow weakness confined to one task type\.
#### Native Multimodal Access Alone Does Not Resolve the Source\-Distributed Challenge\.
A clean comparison comes from paired results within the same memory system, where the primary change is whether the system has access to native multimodal memory or only Text\+Caption representations\. Under this paired view, the impact of native multimodal access is clearly not uniform\. MIRIX improves from 0\.2496 Overall in the Text\+Caption setting to 0\.3161 in the native MM setting, a gain of 0\.0665\. By contrast, MemGPT drops from 0\.3708 to 0\.2713 under the same switch, a decrease of 0\.0995\. These opposite trends suggest that native multimodal access is not by itself sufficient to overcome the benchmark difficulty\.
This pattern is consistent with the interpretation that the core challenge is not merely whether a system can “see” native multimodal content, but whether it can use that content effectively once the required evidence is scattered across independent sources\. Native multimodal inputs may preserve richer information than caption conversion in principle, but this advantage only becomes useful if the system can still retrieve the right sources, recover the right clues from them, and align those clues across source boundaries\. In this sense, source\-distributed memory places a stronger demand than modality access alone: it requires not only multimodal perception, but also robust cross\-source memory access and evidence composition\. The fact that native MM access helps MIRIX but hurts MemGPT further suggests that improvements in source\-distributed settings depend on the interaction between modality access and the system’s underlying retrieval\-and\-memory architecture, rather than on modality exposure in isolation\.
#### Function Calling Exposes a Persistent Memory\-to\-Action Gap\.
Function calling remains the most challenging setting in the benchmark\. Across all baselines, the best F\.C\. score is only 0\.1111, achieved by HMRAG, and even the Golden Evidence Baseline reaches only 0\.2778\. This is far below its performance on the other task families, such as 0\.8753 on S\.H\., 0\.7768 on M\.H\., 0\.8365 on C\.R\., and 0\.9699 on P\.R\. The gap suggests that function calling is not difficult merely because it uses a structured output format, but because it requires models to recover precise arguments from source\-distributed multimodal evidence and then convert that evidence into executable actions\.
This difficulty is also clearly visible within individual systems rather than only across overall rankings\. In the Text\+Caption setting, MemGPT reaches 0\.5734 and 0\.4722 on S\.H\. and M\.H\., but only 0\.0370 on F\.C\. Mem0 reaches 0\.7430 and 0\.5903 on S\.H\. and M\.H\., but only 0\.0398 on F\.C\. In the MM setting, MIRIX achieves 0\.5507 on S\.H\., 0\.3472 on M\.H\., and 0\.4025 on C\.R\., but only 0\.0769 on F\.C\. These row\-wise contrasts show that relatively strong performance on recognition\- or answer\-oriented tasks does not readily translate into success on memory\-grounded action prediction\. Put differently, the source\-distributed setting does not only make evidence harder to find; it also makes it harder to assemble fine\-grained fields into an exact action specification\. This is precisely why F\.C\. serves as a particularly strong stress test for memory systems that are expected to support downstream agent behavior\.
#### Stronger Overall Performance Remains Highly Profile\-Dependent\.
Beyond the overall ranking, the main results show that stronger systems succeed through noticeably different capability profiles rather than through a common recipe\. HMRAG achieves the best Overall score, 0\.4933, with comparatively balanced performance across S\.H\. \(0\.8129\), M\.H\. \(0\.7153\), and C\.R\. \(0\.5191\), although it still remains weak on P\.R\. \(0\.3081\) and especially F\.C\. \(0\.1111\)\. Mem0, by contrast, reaches a lower Overall score of 0\.4061 but is unusually strong on factual QA, especially S\.H\. and M\.H\., before dropping sharply on P\.R\. and F\.C\. MemGPT is not among the strongest factual systems overall, but in the Text\+Caption setting it remains one of the relatively stronger baselines on P\.R\. at 0\.4389, suggesting that preference\-sensitive memory use may favor a different storage\-and\-recall profile than evidence\-heavy factual aggregation\.
These contrasts indicate that the benchmark is separating systems not only by average accuracy, but by the*shape*of their capability profile under source\-distributed multimodal memory demands\. Some methods appear to benefit more from retrieval\-oriented factual access, while others remain relatively more stable on personalized or user\-sensitive reasoning\. This is a useful property of the benchmark, because it suggests that source\-distributed multimodal memory is not a single\-axis capability\. Instead, it contains multiple interacting sub\-problems, including factual evidence composition, preference\-sensitive recall, conflict handling, and action\-oriented assembly\.
#### Memory\-Style and Retrieval\-Style Methods Fail Differently\.
A second clear pattern is that retrieval\-oriented baselines tend to dominate the more factual and conflict\-heavy parts of the benchmark, while memory\-style systems are only occasionally competitive in narrower regions\. In the Text\+Caption setting, both Native RAG and HMRAG outperform most memory\-style baselines on S\.H\., M\.H\., and C\.R\. Native RAG reaches 0\.7797/0\.6667/0\.4068 on these three tasks, and HMRAG reaches 0\.8129/0\.7153/0\.5191\. By contrast, several memory baselines such as Short\-Term Mem\., Gen\. Agent Mem\., and Self Controlled Mem\. remain near the floor across most task families, with Overall scores of 0\.2064, 0\.2210, and 0\.2360 respectively\.
This gap suggests that when the supporting evidence is distributed across multiple independent sources, explicit retrieval remains a strong practical bias because it increases the chance that the downstream model can at least access the necessary evidence\. However, the results also make clear that retrieval alone is not sufficient\. Even the strongest retrieval\-style systems remain far below the Golden Evidence Baseline, and they still degrade sharply on P\.R\. and F\.C\. For example, HMRAG reaches only 0\.3081 on P\.R\. and 0\.1111 on F\.C\., despite being the strongest Overall baseline\. This suggests that source\-distributed difficulty includes at least two layers: first, retrieving the right sources; and second, aligning and composing their contents into the final answer or action\. Retrieval\-oriented methods help most clearly with the first layer, but the second remains a substantial bottleneck\.
### C\.3 Source\-Distribution Experiment
For the comparison between source\-centralized and source\-distributed in Section[4\.3](https://arxiv.org/html/2605.15710#S4.SS3), we construct a matched control from the same underlying QA instance rather than comparing different questions\. Starting from the original multi\-source sample, we identify the chunks containing the gold evidence and swap them with non\-gold chunks from other sources so that all gold evidence is relocated into a single source in the concentrated condition\. We then apply light scaffold editing to maintain local fluency and speaker/context consistency after the swap\. Importantly, this procedure keeps the question, answer, gold evidence content, and the surrounding context budget as stable as possible across the two conditions, while changing only whether the relevant evidence is dispersed across sources or concentrated within one source\.
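The relocation step described above can be sketched as follows\. This is a minimal illustration rather than the authors' pipeline: the `Chunk` type, the `concentrate_gold` function, and the way the target source is chosen are all hypothetical, and the light scaffold editing step is omitted\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    is_gold: bool  # True if the chunk carries gold evidence


def concentrate_gold(
    sources: dict[str, list[Chunk]], target: str
) -> dict[str, list[Chunk]]:
    """Swap gold chunks outside `target` with non-gold chunks inside it.

    Every swap exchanges one chunk for one chunk, so each source keeps its
    original number of chunks (a rough proxy for a stable context budget).
    The input mapping is left unmodified.
    """
    result = {name: list(chunks) for name, chunks in sources.items()}
    # Positions in the target source that currently hold non-gold chunks.
    free_slots = [i for i, c in enumerate(result[target]) if not c.is_gold]
    for name, chunks in result.items():
        if name == target:
            continue
        for i, chunk in enumerate(chunks):
            if not chunk.is_gold:
                continue
            if not free_slots:
                raise ValueError("target source has too few non-gold chunks")
            j = free_slots.pop(0)
            # Move the gold chunk into the target; send the filler back.
            chunks[i], result[target][j] = result[target][j], chunk
    return result


# Hypothetical two-source sample with one gold chunk in each source.
sources = {
    "chat_a": [Chunk("a0", False), Chunk("gold-1", True)],
    "chat_b": [Chunk("gold-2", True), Chunk("b1", False)],
}
concentrated = concentrate_gold(sources, target="chat_a")
```

In the concentrated condition all gold chunks sit in `chat_a`, while the question, answer, and total number of chunks per source are unchanged\.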
Figure 10: Mem0
Figure 11: MemGPT
Figure 12: MIRIX
This subsection reports the task\-wise and model\-wise breakdown for the single\-source versus multi\-source comparison summarized in the main text\. In particular, we provide the per\-task scores underlying Figure[4](https://arxiv.org/html/2605.15710#S4.F4)\.
The task\-wise results show that the effect of source dispersion is not perfectly uniform across methods or task types, but several patterns are clear\. First, all three representative systems suffer an Overall drop when moving from single\-source to multi\-source evidence, confirming that the degradation in the main paper is not driven by only one baseline\. Second, the clearest and most consistent decreases appear on Single\-Hop QA and Conflict Resolution: Mem0, MemGPT, and MIRIX all perform worse in the multi\-source condition on these two tasks, suggesting that even apparently direct recall becomes less reliable once the supporting clue is no longer concentrated in one source, and that conflict\-sensitive cases become harder when updated evidence must be located across sources\.
Taken together, these task\-wise comparisons refine the main\-paper conclusion\. Source dispersion does not degrade every task in exactly the same way, but it most reliably harms tasks that depend on locating one decisive clue or resolving updated evidence across sources, while its effect on multi\-hop and preference reasoning is more interaction\-dependent with the baseline’s memory and composition strategy\.
### C\.4 Retrieval Experiment

| Baseline | K | S\.H\. | M\.H\. | C\.R\. | P\.R\. | F\.C\. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reflexion Mem\. | 10 | 0\.5297 | 0\.4444 | 0\.3983 | 0\.3029 | 0\.0463 | 0\.3443 |
| | 20 | 0\.5385 | 0\.4861 | 0\.3962 | 0\.3064 | 0\.0278 | 0\.3510 |
| | 50 | 0\.5647 | 0\.4792 | 0\.3644 | 0\.3012 | 0\.0463 | 0\.3512 |
| | 100 | 0\.5524 | 0\.4931 | 0\.3877 | 0\.3029 | 0\.0463 | 0\.3565 |
| Native RAG | 10 | 0\.7517 | 0\.5833 | 0\.4280 | 0\.3528 | 0\.0278 | 0\.4287 |
| | 20 | 0\.7797 | 0\.6667 | 0\.4068 | 0\.3683 | 0\.0741 | 0\.4591 |
| | 50 | 0\.8323 | 0\.6585 | 0\.4690 | 0\.3341 | 0\.1167 | 0\.4821 |
| | 100 | 0\.8636 | 0\.7222 | 0\.5106 | 0\.3873 | 0\.1481 | 0\.5264 |
| HMRAG | 10 | 0\.7955 | 0\.6597 | 0\.4513 | 0\.2410 | 0\.0278 | 0\.4350 |
| | 20 | 0\.8129 | 0\.7153 | 0\.5191 | 0\.3081 | 0\.1111 | 0\.4933 |
| | 50 | 0\.7972 | 0\.7361 | 0\.5763 | 0\.3442 | 0\.1296 | 0\.5167 |
| | 100 | 0\.7308 | 0\.4028 | 0\.5657 | 0\.3005 | 0\.0882 | 0\.4176 |
| VRAG | 10 | 0\.4336 | 0\.4621 | 0\.3470 | 0\.2651 | 0\.0278 | 0\.3071 |
| | 20 | 0\.4913 | 0\.4514 | 0\.3919 | 0\.1842 | 0\.0463 | 0\.3130 |
| | 50 | 0\.4983 | 0\.5069 | 0\.3898 | 0\.2065 | 0\.0000 | 0\.3203 |
| | 100 | 0\.4825 | 0\.4444 | 0\.3962 | 0\.2203 | 0\.0000 | 0\.3087 |

Table 6: Detailed retrieval\-budget results for the representative baselines used in RQ3\.

This subsection provides the complete retrieval\-budget results for all evaluated values of K, including the overall scores summarized in the main text and the corresponding Recall@K/Precision@K statistics\. We also include per\-task breakdowns where available\.
Figure 13: Reflexion Mem\.
Figure 14: Native RAG
Figure 15: HMRAG
Figure 16: VRAG
The appendix results further clarify the trade\-off discussed in the main text\. Across the retrieval\-based methods, increasing K generally improves Recall@K while lowering Precision@K, but the end\-task benefit depends strongly on the method\. Native RAG shows the clearest monotonic gain: as K increases, recall rises substantially and the Overall score continues to improve, indicating that this pipeline is able to convert the additional retrieved evidence into useful downstream gains despite the accompanying precision drop\.
HMRAG follows a different pattern\. Its recall continues to increase with larger K, and task\-wise scores improve from K=10 to K=50 on several categories, especially Multi\-Hop QA, Conflict Resolution, and Function Calling\. However, performance drops sharply at K=100, even though recall remains high\. This case makes the coverage–noise trade\-off especially visible: beyond a moderate retrieval budget, additional evidence appears to introduce enough distractors to outweigh the benefit of higher coverage\.
VRAG and Reflexion Mem\. show more limited sensitivity to retrieval budget\. For VRAG, recall improves from K=10 to K=50, but the Overall gain remains modest and reverses slightly at K=100, suggesting that larger candidate pools do not translate into proportional downstream improvements\. Reflexion Mem\. changes the least: its recall is nearly flat across budgets, precision steadily decreases, and the Overall score moves only slightly\. Taken together, these detailed results reinforce that larger retrieval pools help only when the downstream pipeline can effectively filter and use the extra evidence, rather than merely accumulate more candidates\.
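For reference, the coverage versus noise trade\-off can be made concrete with the usual set\-based definitions of Recall@K and Precision@K\. The sketch below is an assumption about how these statistics are computed, since this section does not spell out the exact definitions; the function name and the toy identifiers are hypothetical\.

```python
def recall_precision_at_k(
    ranked_ids: list[str], gold_ids: set[str], k: int
) -> tuple[float, float]:
    """Return (Recall@K, Precision@K) for a single query.

    Assumes each retrieved item has a unique id and that precision is
    normalised by the budget K rather than by the number of items returned.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = len(set(top_k) & gold_ids)
    recall = hits / len(gold_ids) if gold_ids else 0.0
    precision = hits / k
    return recall, precision


# Toy ranking: "g*" are gold evidence items, "d*" are distractors.
ranked = ["g1", "d1", "g2", "d2", "d3", "d4"]
gold = {"g1", "g2", "g3"}
for k in (1, 4):
    print(k, recall_precision_at_k(ranked, gold, k))
```

In this toy ranking, moving from K=1 to K=4 raises recall from 1/3 to 2/3 while precision falls from 1\.0 to 0\.5, which is the pattern described above: a larger budget surfaces more gold evidence but also more distractors\.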
### C\.5 Modality Experiment
Figure 17: Experiments on different modality ablation settings\. ‘Qwen’ denotes qwen3\-vl\-235b\-instruct and ‘GPT’ denotes gpt\-4\.1\.
We conduct a modality ablation study by directly providing different evidence subsets to the backbone LLM; the results are shown in Figure[17](https://arxiv.org/html/2605.15710#A3.F17)\. Full uses all available evidence; Caption replaces multimodal evidence with captions while keeping the textual evidence unchanged; w/o text removes textual evidence while preserving native multimodal \(non\-textual\) evidence inputs; w/o MM removes native multimodal evidence while keeping the textual evidence; and w/o Text \+ Caption removes textual evidence and replaces multimodal evidence with captions\.
The clearest result is that textual evidence contributes a large share of benchmark performance\. Across both backbones, removing textual evidence leads to the largest drop relative to Full, indicating that many answer\-critical cues are carried by conversations, documents, and other text\-bearing memory objects\. At the same time, multimodal evidence also contributes non\-trivially: both w/o MM and Caption remain below Full, showing that replacing or removing raw multimodal content incurs a measurable loss\. The comparison between Full and Caption is particularly informative\. Caption\-based inputs remain much closer to Full than to the text\-removed settings, suggesting that the captioning pipeline preserves a large portion of the answer\-relevant visual information\. However, the remaining gap between Caption and Full shows that captioning is not lossless, and that native multimodal evidence still contains useful information beyond textualized summaries\. Additional task\-wise breakdowns are provided in the appendix\.
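The five ablation settings amount to a filter over the evidence list\. The sketch below illustrates that reading only: the `Evidence` record, its `caption` field, and the lowercase setting strings are assumptions, not the benchmark's actual data format\.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    kind: str                      # "text" or "mm" (native multimodal)
    content: str                   # raw text, or a path/handle to the media
    caption: Optional[str] = None  # caption available for multimodal items


def select_evidence(items: list[Evidence], setting: str) -> list[Evidence]:
    """Build the evidence subset for one ablation setting."""
    keep_text = setting in {"full", "caption", "w/o mm"}
    if setting in {"full", "w/o text"}:
        mm_mode = "native"
    elif setting in {"caption", "w/o text + caption"}:
        mm_mode = "caption"
    elif setting == "w/o mm":
        mm_mode = "drop"
    else:
        raise ValueError(f"unknown setting: {setting}")

    selected = []
    for item in items:
        if item.kind == "text":
            if keep_text:
                selected.append(item)
        elif mm_mode == "native":
            selected.append(item)
        elif mm_mode == "caption":
            # The multimodal item is replaced by a text item with its caption.
            selected.append(Evidence("text", item.caption or ""))
        # mm_mode == "drop": the multimodal item is omitted entirely.
    return selected


items = [
    Evidence("text", "chat turn"),
    Evidence("mm", "chart.png", caption="a bar chart"),
]
caption_only = select_evidence(items, "w/o text + caption")
```

Under this reading, `caption_only` contains a single text item carrying the caption, which matches the description of w/o Text \+ Caption as the most reduced setting\.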
### C\.6 Diagnosis Experiment
To better understand where current systems fail on SMMBench, we conduct an error\-diagnosis experiment on failed predictions from Table[2](https://arxiv.org/html/2605.15710#S4.T2)\. We randomly sample 600 failed cases aggregated across all benchmarked baselines and use GPT\-4\.1 as an LLM judge to assign each case a single primary failure label\. The judge is given the question, model raw response, gold answer, available context, and optional conflict\-resolution reference when applicable\.
We use a six\-way fine\-grained taxonomy\. Distributed Evidence Missing denotes cases where the provided context lacks one or more key facts needed for the gold answer\. Updated Evidence Missing denotes cases where the gold answer depends on newer, corrected, or authoritative evidence that is absent from the context\. Outdated or Conflicting Evidence Misused denotes cases where both outdated/conflicting and updated/authoritative evidence are present, but the model follows the wrong one\. Preference Inference Error denotes cases where the context contains enough cues to infer a preference or convention, but the model fails to infer it correctly\. Function Execution Error denotes cases where the context is sufficient, but the model fails at the executable output layer, such as using the wrong function, wrong parameters, or malformed output\. Cross\-Source Composition Error denotes cases where the relevant clues are present, but the model fails to combine evidence across sources, modalities, or artifacts\.
For higher\-level analysis, we further organize these six labels into a two\-level hierarchy with three coarse stages: Access, Utilization, and Action\. The Access stage includes Distributed Evidence Missing and Updated Evidence Missing, capturing failures to recover the necessary evidence from distributed memory\. The Utilization stage includes Cross\-Source Composition Error, Outdated or Conflicting Evidence Misused, and Preference Inference Error, capturing failures to correctly integrate, prioritize, or interpret already available evidence\. The Action stage includes Function Execution Error, capturing failures in converting remembered information into executable outputs\.
Figure[5](https://arxiv.org/html/2605.15710#S4.F5.1) summarizes the diagnosis results\. The dominant failure type is Distributed Evidence Missing, which accounts for 61\.8% of the diagnosed errors, showing that the main bottleneck still lies in recovering the required evidence from source\-distributed memory\. The next largest category is Cross\-Source Composition Error at 16\.5%, indicating that even when some relevant clues are available, systems often fail to combine them correctly across sources\. Smaller but still meaningful portions come from Function Execution Error \(8\.2%\), Outdated or Conflicting Evidence Misused \(6\.8%\), and Updated Evidence Missing \(6\.7%\)\.
Overall, these results suggest that the difficulty of SMMBench is layered\. Most failures occur first at the Access stage, where systems do not surface the needed distributed evidence at all\. A second group of failures then appears at the Utilization stage, where models fail to compose clues across sources or prioritize updated evidence over stale or conflicting records\. Finally, a smaller but clear portion of failures arises at the Action stage, where remembered information is not converted into correct executable outputs\. Together, these patterns reinforce that source\-distributed multimodal memory is challenging not only because evidence is hard to retrieve, but also because it must be correctly integrated and grounded after retrieval\.
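The two\-level hierarchy is a label\-to\-stage mapping, so the stage\-level shares follow from the fine\-grained percentages by simple addition\. The sketch below uses only the five shares quoted in the text; no share is quoted for Preference Inference Error, so it is left out rather than guessed, and the stage totals should be read as derived from the reported labels only\. The names `STAGE_OF`, `REPORTED`, and `stage_shares` are illustrative\.

```python
# Fine-grained label -> coarse stage, as described in the taxonomy.
STAGE_OF = {
    "Distributed Evidence Missing": "Access",
    "Updated Evidence Missing": "Access",
    "Cross-Source Composition Error": "Utilization",
    "Outdated or Conflicting Evidence Misused": "Utilization",
    "Preference Inference Error": "Utilization",
    "Function Execution Error": "Action",
}

# Percentages quoted in the text (Preference Inference Error is not quoted).
REPORTED = {
    "Distributed Evidence Missing": 61.8,
    "Cross-Source Composition Error": 16.5,
    "Function Execution Error": 8.2,
    "Outdated or Conflicting Evidence Misused": 6.8,
    "Updated Evidence Missing": 6.7,
}


def stage_shares(label_shares: dict[str, float]) -> dict[str, float]:
    """Sum fine-grained percentages into the three coarse stages."""
    totals = {"Access": 0.0, "Utilization": 0.0, "Action": 0.0}
    for label, share in label_shares.items():
        totals[STAGE_OF[label]] += share
    return {stage: round(total, 1) for stage, total in totals.items()}


print(stage_shares(REPORTED))
# {'Access': 68.5, 'Utilization': 23.3, 'Action': 8.2}
```

On the reported labels, roughly two thirds of diagnosed failures fall in the Access stage, consistent with the layered reading above\.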
Table 7: Illustrative ‘Conflicting Evidence Misused’ case\. Red content means misleading or conflicting evidence content\. Golden content means golden evidence or ground truths\. In this case, HMRAG recalled both misleading and true evidence, but used the misleading one to answer the question and output a wrong answer\.

| Case & ID | MMKE\_cc30 from HMRAG |
| --- | --- |
| Retrieved items | ⋯ |
| | Cardamine bulbosa, commonly called bulbous bittercress or fall cress, is a perennial plant in the rose family\. It is native to a widespread area of western South America, in both Chile and Argentina\. Its natural habitat is dry soils of highland forests and tundras, often in acidic areas\. In late summer and early fall, flowers are produced well above the foliage\. Its leaves are edible and have a sweet taste\. |
| | ⋯ |
| | …/MMKE\_449df6d90ec74289\.png |
| | ⋯ |
| | Cardamine bulbosa, commonly called bulbous bittercress or spring cress, is a perennial plant in the mustard family\. It is native to a widespread area of eastern North America, in both Canada and the United States\. Its natural habitat is moist soils of bottomland forests and swamps, often in calcareous areas\. In late spring and early summer, white flowers are produced well above the foliage\. Its leaves are edible, and have a peppery taste\. |
| QA | Based on Fig\. 9e33599e, Flowers of the species shown in the image are produced during which seasons? |
| | \(A\) Mid autumn and early winter \(B\) Winter and early spring \(C\) Late summer and early fall \(D\) Late spring and early summer |

Table 8: Illustrative ‘Updated Evidence Missing’ case\. Red content means misleading or conflicting evidence content\. Golden content means ground truths\. In this case, HMRAG recalled only misleading evidence, causing a wrong answer\.

| Case & ID | MMKE\_4723 from HMRAG |
| --- | --- |
| Retrieved items | ⋯ |
| | …/MMKE\_b9faf19096e44ffc\.png |
| | ⋯ |
| | Oxythyrea funesta, known as the "White spotted rose beetle," is a phytophagous beetle from the Cetoniidae family, Cetoniinae subfamily\. This beetle is found throughout most of Europe, the eastern Palearctic realm, and the Near East\. Larvae feed on plant roots and can stay in the soil until the next spring, growing up to 30 mm long\. Adults emerge in early spring, mostly seen from May to July\. They are considered pests as they damage floral organs, especially targeting light\-colored buds and flowers\. These beetles are black, sometimes bronzed, with typically six white spots on the pronotum and several on the elytra\. They are covered in white pubescence, but older beetles often lose these hairs over time\. |
| QA | Based on Fig\. 0ee4a299, What specific plant is associated with the common name of the species shown in the image? |
| | \(A\) Rose \(B\) Oak tree \(C\) Lavender \(D\) Daisy |
## Appendix D Cases
### D\.1 Single\-Hop QA
This subsection presents representative Single\-Hop QA samples, including the reconstructed memory environment, the gold evidence items, the multiple\-choice options, and the gold answer\. The examples are selected to illustrate cases where one decisive clue is embedded in a broader multi\-source context\.
### D\.2 Multi\-Hop QA
This subsection provides example Multi\-Hop QA instances whose answers require combining evidence distributed across multiple memory objects\. We include both the final evaluation form and the source\-level evidence annotations to show how reasoning chains are preserved after conversation insertion\.
### D\.3 Conflict Resolution
This subsection contains conflict\-resolution examples in which stale, contradictory, or superseded memory items coexist in the environment\. The examples highlight how the benchmark requires not only retrieval of relevant facts, but also correct prioritization of the updated or authoritative source\.
### D\.4 Preference Reasoning
This subsection shows examples for user\- and project\-preference reasoning\. We include both explicit preference cases and implicit preference cases whose signals are distributed across several turns or sources, demonstrating the range of preference representations covered by the benchmark\.
### D\.5 Function Call
This subsection provides representative function\-calling cases, including the memory context, candidate tool schema, and gold executable output\. These examples illustrate why the task is more stringent than multiple\-choice QA: the model must recover and assemble the exact action arguments rather than merely identify a correct option\.
Table 9: Illustrative Single\-Hop QA case\. Violet text indicates scaffold content surrounding the evidence\. Golden content means golden evidence or ground truths\.

| Case Source & ID | QA\_sample\_5f42a925 from ChartQA\_Pro |
| --- | --- |
| Sources | Group chat: group\_chat\_food\_environment\_lifestyle\_6d738a64 |
| | ⋯ |
| | Miya Cruz \(2023\-04\-13 11:29:27\): Speaking of tourism, it’s wild how some cities get totally transformed by visitors…\. |
| | Briley Hanson \(2023\-04\-13 11:30:25\): Look at this figure: Fig\. 04b3f139 |
| | Briley Hanson \(2023\-04\-13 11:40:07\): ![[Uncaptioned image]](https://arxiv.org/html/2605.15710v1/figure/case/ChartQA_Pro_467c9c18.png) |
| | Ricardo Bruce \(2023\-04\-13 11:46:57\): Wow, I had no idea Croatia had so many more tourists than locals …\. |
| | ⋯ |
| | Group chat: group\_chat\_films\_transportation\_others\_3bc60e19 |
| | ⋯ |
| | Axel Hart \(2023\-04\-11 09:14:33\): I think they started with soap, right? I remember reading somewhere that Lever was all about making affordable soap for everyone, and then Unilever just kept expanding into other stuff like food and personal care\. |
| | Miya Cruz \(2023\-04\-11 09:25:23\): Speaking of things that have been around forever, it’s kind of like how some tourist destinations have changed so much over time, too\. |
| | Bridget Deleon \(2023\-04\-11 09:25:37\): Yeah, whether it’s brands or places, it’s interesting to see how they adapt to stay popular\. |
| | ⋯ |
| | Guillermo Lynn \(2023\-04\-11 09:58:36\): A country is considered to have a high tourism density if the ratio of tourists to locals exceeds 2\.8\. Additionally, a high tourism influence is defined as having more than 70% of the population being tourists in the given data\. |
| | Miya Cruz \(2023\-04\-11 09:59:06\): So basically, if a country has way more tourists than locals, it’s considered high density, and if most people there are tourists, that’s high influence? That’s wild—I wonder which countries actually hit those numbers\. |
| | ⋯ |
| QA | Based on Fig\. 04b3f139, the definition, which country has "high tourism density" but does not have "high tourism influence"? |
| | \(A\) Czech Republic \(B\) Denmark \(C\) Spain \(D\) Croatia |

Table 10: Illustrative Multi\-Hop QA case\. Violet text indicates scaffold content surrounding the evidence\. Golden content means golden evidence or ground truths\.

| Case Source & ID | QA\_sample\_9c56672c from MMCV |
| --- | --- |
| Sources | Group chat: group\_chat\_nature\_films\_fashion\_f3c49653 |
| | Kenya Decker \(2023\-04\-05 18:04:00\): Yeah, it’s kind of like how Coresoft jumped from … |
| | Colt Kemp \(2023\-04\-05 18:04:03\): It’s funny, because whether it’s travel or game design, being open to surprises seems to make things more memorable\. |
| | Selina Gonzalez \(2023\-04\-05 18:22:14\): Look at this table: Table\. e45b6ac3 |
| | Selina Gonzalez \(2023\-04\-05 18:22:14\): ![[Uncaptioned image]](https://arxiv.org/html/2605.15710v1/figure/case/MMCV.png) |
| | Harper Clark \(2023\-04\-05 18:27:50\): Wow, I had no idea Coresoft made so many different types of games—everything from Magic: The Gathering to fishing and even Cake Mania\. That’s a pretty wild range\. |
| | ⋯ |
| | Group chat: group\_chat\_lifestyle\_food\_nature\_eace8e98 |
| | Andy Stewart \(2023\-04\-03 23:09:41\): Absolutely—what we’re calling ‘regional trends’ are often just the fallout… |
| | Joselyn Moss \(2023\-04\-03 23:10:22\): Speaking of platforms and what people actually want, it’s kind of like how board games get reinvented to fit what’s fun for the group, not just what the rules say\. |
| | Andy Stewart \(2023\-04\-03 23:18:11\): Squander \(written as “$QUANDER” on the box and in the rules\) is an Avalon Hill board game published in 1965\. It is based loosely on the game Monopoly, but in reverse\. … |
| | Harper Clark \(2023\-04\-04 00:06:24\): That actually sounds hilarious—I love the idea of trying …\. |
| | ⋯ |
| | Group chat: group\_chat\_environment\_animals\_economy\_4b965250 |
| | Briley Hanson \(2023\-04\-06 07:26:54\): True, and just like mixing Lego sets, maybe brands need to mix up their strategies for each festival instead of sticking to one formula\. |
| | Maggie Rachael \(2023\-04\-06 07:36:05\): Avalon Hill Games Inc\. is a game company that specializes in wargames and strategic board games\. … |
| | Joselyn Moss \(2023\-04\-06 07:39:51\): I remember playing some Avalon Hill games with my dad when I was younger—those strategy ones could go on for hours, but they were always a blast\. |
| QA | Based on Table\. e45b6ac3, Which company that owns a subsidiary focused on strategic games was the publisher of a video game series in 2004 and created a board game with reversed Monopoly rules in 1965? |
| | \(A\) Parker Brothers \(B\) Avalon Hill \(C\) Wizards of the Coast \(D\) Hasbro |

Table 11: Illustrative Conflict Resolution case\. Violet text indicates scaffold content surrounding the evidence\. Golden content means golden evidence or ground truths\.

| Case Source & ID | QA\_sample\_1ce15338 from MLLMKC |
| --- | --- |
| Sources | Group chat: group\_chat\_films\_art\_and\_design\_music\_dccdfbfa |
| | Immanuel Goodwin \(2023\-04\-04 19:03:41\): That reminds me—have you noticed how some athletes transition into acting and bring a whole new energy to the screen?…\. |
| | Kayden Soto \(2023\-04\-04 19:09:59\): John Cena was born on April 23, 1977\. John Cena is Canadian\. John Cena is a professional wrestler and actor\. |
| | George Villegas \(2023\-04\-04 19:28:09\): I always forget that John Cena is Canadian\! It’s wild how he’s managed to balance wrestling and acting since he was born in 1977\. |
| | ⋯ |
| | Group chat: group\_chat\_religion\_sports\_animals\_e78b964a |
| | George Villegas \(2023\-04\-04 14:42:03\): Speaking of merch and visual identity, have you all noticed how wrestling belts and shirts have become iconic symbols too? |
| | Kane Owen \(2023\-04\-04 14:46:31\): Look at this figure: Fig\. 0fbc5062 |
| | Kane Owen \(2023\-04\-04 14:47:11\): ![[Uncaptioned image]](https://arxiv.org/html/2605.15710v1/figure/case/MLLMKC_ab9969e7.jpg) |
| | Maggie Rachael \(2023\-04\-04 14:48:20\): That championship belt looks awesome\! I love how intense the background is—it really makes the whole image pop\. |
| | ⋯ |
| | Group chat: group\_chat\_architecture\_history\_technology\_4856238e |
| | Alan Woods \(2023\-04\-05 12:37:49\): Yeah, it’s wild how some people start in one world—like sports or activism—and then totally pivot to something unexpected, but still keep their influence\. |
| | ⋯ |
| | Zander Aguilar \(2023\-04\-05 12:59:37\): John Cena was born on April 23, 1977\. John Cena is American\. John Cena is a professional wrestler and actor\. |
| | ⋯ |
| QA | Based on Fig\. 0fbc5062, what is the nationality of the person in the picture? |
| | \(A\) American \(B\) Canadian \(C\) Panamanian \(D\) Mexican |

Table 12: Illustrative Preference Reasoning case\. Violet text indicates scaffold content surrounding the evidence\. Golden content means golden evidence or ground truths\.

| Case Source & ID | QA\_sample\_fbaa3802\_1 from MMPB |
| --- | --- |
| Sources | Group chat: group\_chat\_sports\_music\_literature\_dc339f8b |
| | Bridget Deleon \(2023\-04\-04 08:04:32\): That’s a good point, Kayden\. I’ve noticed a lot of crime dramas lately really get into the psychology behind systems and decisions—kind of like what we’re talking about here\. |
| | Joselyn Moss \(2023\-04\-04 08:06:18\): In terms of entertainment, I enjoys crime thrillers, historical dramas, but dislikes sports TV, romantic dramas |
| | Bridget Deleon \(2023\-04\-04 08:06:19\): Have you seen “Mindhunter”? It’s a crime thriller with a bit of history mixed in, might be right up your alley\. |
| | ⋯ |
| | Group chat: group\_chat\_music\_politics\_animals\_74b7030e |
| | Bridget Deleon \(2023\-04\-04 06:09:59\): Speaking of surprising roles, I feel like streaming platforms have really changed how we discover actors in new lights\. |
| | ⋯ |
| | Bridget Deleon \(2023\-04\-04 06:20:21\): Look at this figure: Fig\. d760cff8 |
| | Bridget Deleon \(2023\-04\-04 06:21:23\): ![[Uncaptioned image]](https://arxiv.org/html/2605.15710v1/figure/case/MMPB_81ebaeeb.png) |
| | Miles Roberts \(2023\-04\-04 06:21:44\): I just finished watching the Loki season finale and it was wild\! Disney\+ has so much good stuff lately, I might check out Black Widow next\. |
| | ⋯ |
| QA | Based on Fig\. d760cff8, among the activities that could reasonably occur in the given image, which one is Joselyn Moss least likely to be doing? |
| | \(A\) Watching movie marathons \(B\) Watching sports TV \(C\) Practicing meditation \(D\) Streaming the latest series |

Table 13: Illustrative Function Call case\. Violet text indicates scaffold content surrounding the evidence\. Golden content means golden evidence or ground truths\.

| Case Source & ID | FC\_sample\_c29ba5bd from M3\_Bench |
| --- | --- |
| Sources | Group chat: group\_chat\_business\_nature\_health\_9981760b |
| | Kara Yates \(2023\-04\-21 19:32:23\): You know, all this talk about shelf placement and visual cues got me thinking—what if we had a digital tool that did the same thing? Like, before you even walk into the store? |
| | Linda Anderson \(2023\-04\-21 19:48:59\): I’ve used those kinds of apps before for supplements, but never for actual meds\. If it pulled straight from FDA data, I’d trust it way more than random Google results\. |
| | Kayden Soto \(2023\-04\-21 19:49:06\): FDA drug lookup and comparison\. This workflow begins by retrieving official drug information for a specific medication…\. |
| | Guillermo Lynn \(2023\-04\-21 19:49:13\): That sounds super useful, especially if you’re trying… |
| | ⋯ |
| | Group chat: group\_chat\_religion\_sports\_television\_0e949769 |
| | Amiah Sweeney \(2023\-04\-22 23:52:27\): Alright, so we’re set on “The Usual Suspects” for tonight\. Can’t wait to see if it holds up\! |
| | Justice Clark \(2023\-04\-22 23:56:07\): Yeah, I’m looking forward to it\. But before we get too comfy, anyone else always end up bringing random essentials to movie nights? |
| | Colt Kemp \(2023\-04\-23 00:02:55\): Haha, I do\! I usually have snacks and a mini first aid kit in my bag, just in case\. You never know what’ll happen\. |
| | Asia Rivers \(2023\-04\-23 00:15:59\): \(Image; see Fig\. 8dc8ab97\.\) ![[Uncaptioned image]](https://arxiv.org/html/2605.15710v1/figure/case/M3_bench_c29ba5bd.png) |
| | Immanuel Goodwin \(2023\-04\-23 00:34:02\): I’ve actually used that cream before and it works pretty well for bug bites and rashes\. It’s nice that you get multiple tubes in one pack too\. |
| | ⋯ |
| Candidate Tools | \{"name": "HEALTHCARE\_MCP\_drug\_lookup", "description": "Look up FDA drug information by name\.", "args": \{"drug\_name": "str"\}\} |
| | ⋯ |
| QA | Based on Fig\. 8dc8ab97, what information is available about the medication shown in this image? Invoke functions to solve this questions\. |
| | HEALTHCARE\_MCP\_drug\_lookup\("drug\_name"="hydrocortisone cream 1%"\) |
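To make the strictness of the function\-calling task concrete, the sketch below compares a predicted call with a gold call by exact match on the function name and arguments\. The scoring rule, the `ToolCall` type, and the whitespace/case normalisation are assumptions for illustration; this appendix does not specify how F\.C\. is actually scored\.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def _normalise(value):
    """Collapse whitespace and case for strings; leave other values as-is."""
    if isinstance(value, str):
        return " ".join(value.split()).lower()
    return value


def exact_match(pred: ToolCall, gold: ToolCall) -> bool:
    """True only if the function name, argument names, and values all agree."""
    if pred.name != gold.name:
        return False
    if pred.args.keys() != gold.args.keys():
        return False
    return all(
        _normalise(pred.args[key]) == _normalise(gold.args[key])
        for key in gold.args
    )


# Gold call taken from the Table 13 example.
gold = ToolCall("HEALTHCARE_MCP_drug_lookup", {"drug_name": "hydrocortisone cream 1%"})
partial = ToolCall("HEALTHCARE_MCP_drug_lookup", {"drug_name": "hydrocortisone"})
print(exact_match(gold, gold), exact_match(partial, gold))  # True False
```

Under such a rule, a prediction that picks the right tool but recovers only part of the product name from the image earns no credit, which is one plausible reason the task is harder than selecting among four options\.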
## Appendix E Prompts
This section collects the main prompt templates used in construction and evaluation\. It includes prompts for evidence normalization, question rewriting, distractor generation, evidence verification, conversation generation, memory insertion, and final evaluation\. We separate prompts by stage so that future users of the benchmark can reproduce or adapt individual components without reusing the full pipeline verbatim\.
### E\.1 Correctness Verification
Prompt for Correctness Verification
\#\#\# Task

You are a multimodal answer verifier\. You will be given images, pieces of text, a question, and a proposed answer\. Carefully check whether the proposed answer is correct given all the information from both the images and the texts\.

If the answer is completely correct, output YES\. If the answer is partially correct, incorrect, incomplete, or contradicted by the text or the image, output NO\.

\#\#\# Output Format

Output exactly one word: YES or NO\.

Figure 18: Prompt template for correctness verification\.
### E\.2 Evidence Necessity Verification
Prompt for Evidence Necessity Verification
\#\#\# Task

You are an assistant tasked with verifying whether a given question can be correctly answered using the provided evidence\. You will be given a question\-answer pair and a set of evidence items, which may include text, images, and tables\. One potentially relevant piece of evidence is missing\.

\#\#\# Important Rules

1\. Judge only based on the provided evidence\.
2\. Do not rely on background knowledge or external reasoning\.
3\. If the correct answer cannot be uniquely determined, output No\.
4\. If the evidence is sufficient to correctly and unambiguously support the answer, output Yes\.

\#\#\# Output Format

Output only one word: Yes or No\.

Figure 19: Prompt template for evidence necessity verification\.
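Because the prompt states that one potentially relevant piece of evidence is missing, it reads naturally as the inner step of a leave\-one\-out check\. The sketch below assumes that reading, which this section does not confirm: `judge` is a hypothetical callable standing in for the LLM call with the prompt above, and an item is treated as necessary when the judge answers No once that item is removed\.

```python
from typing import Callable, Sequence

# judge(question, answer, evidence) -> raw model output, expected "Yes" or "No".
Judge = Callable[[str, str, Sequence[str]], str]


def necessary_evidence(
    question: str, answer: str, evidence: Sequence[str], judge: Judge
) -> list[bool]:
    """For each evidence item, report whether removing it breaks the answer."""
    flags = []
    for i in range(len(evidence)):
        remaining = [item for j, item in enumerate(evidence) if j != i]
        verdict = judge(question, answer, remaining).strip().lower()
        # "No" means the answer is no longer supported, so item i was needed.
        flags.append(verdict == "no")
    return flags


def toy_judge(question: str, answer: str, evidence: Sequence[str]) -> str:
    """Stub judge: the answer is supported only if both A and B are present."""
    return "Yes" if {"A", "B"} <= set(evidence) else "No"


print(necessary_evidence("q", "a", ["A", "B", "C"], toy_judge))
# [True, True, False]
```

With the stub judge, items A and B are flagged as necessary and C as dispensable; a real pipeline would replace `toy_judge` with a model call and would need to handle outputs other than Yes or No\.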
### E\.3 Distractor Generation
Prompt for Distractor Generation
\#\#\# TaskYou are an assistant tasked with creating distractor options for a multiple\-choice question\. You will be given images, pieces of text, a question, and its correct answer\. Generate three distractor options that are related to the question but incorrect\.\#\#\# Additional Instructions1\. The distractors should be plausible enough to be considered potential answers, but not the correct one\.2\. The distractors should have a similar format and length to the original answer, but be clearly different\.3\. The distractors should not be trivial or too far\-fetched\.\#\#\# Output FormatOutput one distractor option per line\. Do not provide explanations or extra text\.
Figure 20: Prompt template for distractor generation\.
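The one-per-line output format can be turned into a four-option question with a small amount of post-processing. The sketch below is illustrative only: `parse_distractors` and `build_options` are hypothetical helpers, and the deduplication, the fixed seed, and the shuffling step are assumptions rather than the authors' documented pipeline.

```python
import random
from typing import Dict, List, Tuple


def parse_distractors(raw: str, correct: str, expected: int = 3) -> List[str]:
    """Split the model output into distractors, one per non-empty line."""
    seen = {correct.strip().casefold()}
    distractors = []
    for line in raw.splitlines():
        text = line.strip()
        # Skip blank lines, repeats, and anything equal to the correct answer.
        if not text or text.casefold() in seen:
            continue
        seen.add(text.casefold())
        distractors.append(text)
    if len(distractors) != expected:
        raise ValueError(f"expected {expected} distractors, got {len(distractors)}")
    return distractors


def build_options(
    correct: str, distractors: List[str], seed: int = 0
) -> Tuple[Dict[str, str], str]:
    """Shuffle the answer among the distractors and return (options, gold label)."""
    options = [correct, *distractors]
    random.Random(seed).shuffle(options)  # seeded so the layout is reproducible
    labels = ["(A)", "(B)", "(C)", "(D)"]
    gold = labels[options.index(correct)]
    return dict(zip(labels, options)), gold


raw_output = "Paris\n\nLyon\nMarseille\n"
distractors = parse_distractors(raw_output, correct="Nice")
options, gold = build_options("Nice", distractors)
print(options[gold])  # Nice
```

Raising on an unexpected count, rather than padding or truncating, keeps malformed generations visible so they can be regenerated.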
### E\.4 Metadata and Caption Generation
Prompt for Metadata and Caption Generation
\#\#\# Task

You are an assistant tasked with topic classification for multimodal content\. You will be given a question\-answer pair, evidence items including text, images, and tables, along with a list of candidate topics\. Select one or more topics from the candidate list that best match the overall content and intent\. If none of the candidate topics are suitable, output others and additionally provide a short topic description\.

\#\#\# Candidate Topics

films, science, politics, sports, video games, transportation, television, music, animals, history, literature, architecture, art and design, fashion, food, health, lifestyle, nature, religion, travel, business, education, environment, government, economy, technology\.

\#\#\# Output Rules

1\. Output all selected topics on a single line, separated by commas\.
2\. If using others, it must appear at the beginning of the output\.
3\. Do not output explanations or extra text\.
Figure 21: Prompt template for metadata and caption generation\.
### E\.5 Conversation Theme Planning
Prompt for Conversation Theme Planning
\#\#\# Task

You are a group chat topic control agent\. Your job is to help guide and control the flow of the group chat conversation based on the provided inputs\. You will be given the potential QA including multimodal content such as text, images, and tables\.

\#\#\# Important Instructions

1\. Encourage natural use of the multimodal content, but do not explicitly mention it as a gold clue\.
2\. Phrase sub\-topics mainly as declarative sentences\.
3\. Encourage productive discussion, decision making, and problem solving\.
4\. Keep the group on track and avoid derailment\.
5\. Do not mention the question\-answer pair directly\.

\#\#\# Output Format

Generate 10–20 sub\-topics, one per line, mainly as declarative sentences\.
Figure 22: Prompt template for conversation theme planning\.
### E\.6 Turn\-Level Conversation Generation
Prompt for Turn\-Level Conversation Generation
\#\#\# Task

You are an assistant tasked with replying to a group chat\. You will be given:

1\. A series of messages from the group chat\.
2\. A specific topic to be aware of when replying\.
3\. Your personal preferences\.

\#\#\# Step 1

Internally decide whether the reply should be SHORT \(1–2 sentences\) or LONG \(5–8 sentences\) based on the conversation length\.

\#\#\# Step 2

Generate the reply accordingly\.

\#\#\# Requirements

\- Respond relevantly to the ongoing conversation\.
\- Break echo chambers by introducing new perspectives when needed\.
\- Avoid repetition and mere agreement\.
\- Keep the reply conversational and casual\.
\- Stay consistent with the provided preferences\.

\#\#\# Output Format

Output a single, natural group\-chat message and nothing else\.
Figure 23: Prompt template for turn\-level conversation generation\.
### E\.7 Final Evaluation Prompt
Prompt for Final Evaluation

\#\#\# Task

You are a long\-context multimodal assistant\. You will be given a conversation history that may include a large amount of text and images\. Your task is to carefully read the entire provided conversation, understand the user’s question, and answer it as accurately as possible\.

\#\#\# Specific Instructions

1\. Use only the information available in the conversation and images\.
2\. Pay attention to earlier parts of the conversation if they contain necessary definitions, assumptions, or details\.
3\. If the question cannot be answered from the given context and images alone, do not guess\.
4\. Some runs may provide only captions or descriptions instead of the original multimodal evidence\. If that information is insufficient, answer cautiously\.

\#\#\# Output Format

Output exactly one choice from \(A\), \(B\), \(C\), or \(D\), and nothing else\. If the question cannot be answered from the available information, output: No, I can not answer this question based on the available information
Figure 24: Prompt template for final evaluation\.
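Because the prompt allows exactly two kinds of output, a choice label or a fixed abstention sentence, scoring can begin with a strict parser. The sketch below is a hypothetical helper (`parse_final_output` is not from the source). Two choices in it are assumptions rather than documented scoring rules: bare letters and a trailing period after the abstention sentence are accepted, and everything else is treated as unparseable.

```python
import re
from typing import Optional

# Abstention sentence copied verbatim from the prompt's output format.
ABSTAIN_TEXT = "No, I can not answer this question based on the available information"

# Accepts "(B)" as requested by the prompt, and also a bare "B" (assumption).
_CHOICE = re.compile(r"^\(?([ABCD])\)?$")


def parse_final_output(raw: str) -> Optional[str]:
    """Map a model response to 'A'..'D', 'ABSTAIN', or None if unparseable."""
    text = raw.strip()
    if text.rstrip(".") == ABSTAIN_TEXT:
        return "ABSTAIN"
    match = _CHOICE.match(text)
    return match.group(1) if match else None


print(parse_final_output("(B)"))          # B
print(parse_final_output(ABSTAIN_TEXT))   # ABSTAIN
print(parse_final_output("I think (B)"))  # None
```

Returning None for free-form text, instead of searching for a letter inside it, avoids crediting responses that ignore the output format.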
### E\.8 Insertion Anchor Selection
Prompt for Insertion Anchor Selection
\#\#\# Task

You are an expert at analyzing group chat conversations\. Your task is to determine the best turn to insert a given scaffold conversation segment into an existing conversation\. You will see:

\- A conversation window with turns labeled by index\.
\- A scaffold conversation segment that needs to be inserted\.

Determine which turn index is the most suitable insertion point\.

\#\#\# Consider

1\. Topic relevance\.
2\. Natural flow\.
3\. Timing\.

\#\#\# Output Format

Output only a single integer, or NONE if no position is suitable\.
Figure 25: Prompt template for insertion anchor selection\.
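The integer-or-NONE output can be validated against the turn indices shown in the conversation window before any insertion happens. This is a minimal sketch: `parse_anchor` is a hypothetical helper, and rejecting indices outside the window is an assumption about how invalid outputs might be handled, not a step described in the source.

```python
from typing import Iterable, Optional


def parse_anchor(raw: str, window_indices: Iterable[int]) -> Optional[int]:
    """Return the chosen turn index, or None when the model outputs NONE.

    Raises ValueError for anything that is neither NONE nor an index
    that actually appears in the conversation window.
    """
    text = raw.strip()
    if text.upper() == "NONE":
        return None
    try:
        index = int(text)
    except ValueError:
        raise ValueError(f"not an integer or NONE: {text!r}") from None
    if index not in set(window_indices):
        raise ValueError(f"index {index} is outside the conversation window")
    return index


window = range(10, 20)  # turn indices shown to the model (illustrative)
print(parse_anchor("12", window))    # 12
print(parse_anchor("NONE", window))  # None
```

Keeping NONE distinct from an invalid output lets a caller skip the scaffold in the first case and retry the model call in the second.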
### E\.9 Scaffold and Smoothing
Prompt for Scaffold and Smoothing
You are a conversational conversation generator for group chats\. The user will provide:

1\. A central topic, which may be a piece of text, an image, or a table\.
2\. A conversation history in "SpeakerName: message" format, involving multiple speakers\.
3\. The next speaker’s name \- you must generate a message as if this person is speaking\.

Your task is to generate a single message from the specified speaker’s perspective:

\- The message should naturally continue the discussion based on the existing turns and the given topic\.
\- Write as if you are that specific speaker: use first\-person perspective \(I, my, etc\.\) or respond naturally as that person would in a group chat\.
\- The message should be substantial and relevant to the given topic\.
\- If the theme involves a particular person’s preferences, you can shape the conversation around the things they like\.

\#\#\# Constraints

1\. The generated message should be casual, realistic, and human\-like\. Do not output emojis\.
2\. Reply in a relaxed, natural tone—like how a real person would talk\.
3\. Keep it concise: one or two sentences is appropriate, not too long\.
4\. Direct descriptions of images and tables are forbidden; you can extend the discussion around their content\.
5\. Do NOT include the speaker’s name in your output—output only the message content, as if the speaker is typing it\.

\#\#\# Output format

Output ONLY the message text that the specified speaker would say\. No prefix, no "Name:", no quotes\. Just the raw message\.
Figure 26: Prompt template for scaffold and smoothing\.
### E\.10 Non\-Text Captioning Prompt
Prompt for Non\-Text Captioning
\#\#\# Task

You are an expert at describing images for retrieval and QA\. Provide a single detailed caption in English: cover main subjects, layout, colors, text, charts/tables/axes if any, and fine\-grained spatial relations\. Be factual and exhaustive without speculation\.
Figure 27: Prompt template for non\-text captioning\.
### E\.11 LLM\-Judge Error Diagnosis
Prompt for LLM\-Judge Error Diagnosis
\#\#\# Task

You are an expert error\-analysis judge for a multimodal, multi\-source memory benchmark\. Given a question, model response, gold answer, available context, and optional conflict\-resolution reference, choose the single most likely primary error type\.

\#\#\# Taxonomy

1\. Distributed Evidence Missing: The context is missing one or more key facts needed for the gold answer\.
2\. Updated Evidence Missing: The gold answer depends on newer, corrected, or authoritative evidence that is absent from the context\.
3\. Outdated or Conflicting Evidence Misused: Both outdated/conflicting and updated/authoritative evidence are present, but the model follows the wrong one\.
4\. Preference Inference Error: The context contains enough cues to infer a preference or convention, but the model fails to infer it correctly\.
5\. Function Execution Error: The context is sufficient, but the model fails at the action layer, such as wrong function, wrong parameters, or malformed output\.
6\. Cross\-Source Composition Error: The clues are present, but the model fails to combine evidence across sources, modalities, or artifacts\.

\#\#\# Decision Principles

\- Prefer missing\-evidence labels when key support is absent\.
\- Prefer Updated Evidence Missing when the missing fact is specifically the latest or authoritative one\.
\- Prefer Outdated or Conflicting Evidence Misused when both sides are present and the model follows the wrong one\.
\- Prefer Preference Inference Error for implicit preference questions\.
\- Prefer Function Execution Error for function\-call or action\-prediction tasks when evidence is sufficient\.
\- Prefer Cross\-Source Composition Error when integration is the main failure\.

\#\#\# Output Format

\{"label": "<one of the six labels exactly as written\>", "reason": "<brief explanation, 1\-3 sentences\>", "confidence": <a float between 0 and 1\>\}
Figure 28: Prompt template for LLM\-judge error diagnosis\.
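Since the judge is asked for a JSON object with a label drawn from a closed set, its output can be checked mechanically before error types are tallied. The sketch below is illustrative: `parse_diagnosis` is a hypothetical helper, the example outputs are invented for demonstration, and rejecting (rather than repairing) malformed outputs is an assumption.

```python
import json
from collections import Counter

# The six labels, copied from the taxonomy in the prompt.
LABELS = (
    "Distributed Evidence Missing",
    "Updated Evidence Missing",
    "Outdated or Conflicting Evidence Misused",
    "Preference Inference Error",
    "Function Execution Error",
    "Cross-Source Composition Error",
)


def parse_diagnosis(raw: str) -> dict:
    """Parse one judge output and check it against the requested schema."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("judge output must be a JSON object")
    label = obj.get("label")
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    confidence = obj.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        raise ValueError(f"confidence must be a number in [0, 1]: {confidence!r}")
    return {"label": label, "reason": str(obj.get("reason", "")), "confidence": float(confidence)}


# Invented judge outputs, for illustration only.
outputs = [
    '{"label": "Cross-Source Composition Error", "reason": "Clues not combined.", "confidence": 0.8}',
    '{"label": "Updated Evidence Missing", "reason": "Latest fact absent.", "confidence": 0.6}',
    '{"label": "Cross-Source Composition Error", "reason": "Table and chat not linked.", "confidence": 0.7}',
]
counts = Counter(parse_diagnosis(o)["label"] for o in outputs)
print(counts.most_common(1))  # [('Cross-Source Composition Error', 2)]
```

Matching labels exactly, as the prompt requires, means a paraphrased label raises an error and is not silently folded into a neighbouring category.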