DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection
Summary
Introduces DMV-Bench, an interactive benchmark for evaluating visual memory in multimodal agents using incidental visual cues from product images, and proposes DualMem, a dual-coding memory architecture that outperforms text-only and other multimodal baselines across various chain lengths.
# DMV-Bench: Diagnosing Long-Horizon Multimodal Agents’ Visual Memory with Incidental Cue Injection
Source: [https://arxiv.org/html/2606.27499](https://arxiv.org/html/2606.27499)
Yujin Tang Chenming Shang Ruize Xu Nikhil Singh Dartmouth College \{yujin\.tang\.gr, nikhil\.singh\}@dartmouth\.edu
###### Abstract
Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, *when* an agent genuinely needs to remember what it *saw* rather than what it could write down. We introduce DMV-Bench (code: [https://github.com/yyyujintang/DMV-Bench](https://github.com/yyyujintang/DMV-Bench)), the first interactive benchmark for multimodal-agent visual memory. DMV-Bench is built on a controlled home-furnishing e-commerce catalogue of 1,000 product variants in which a text-leakage contract keeps the discriminative signal of each task in the pixels alone. Across a chain of autonomous shopping sessions, every visited product image carries a unique, pre-rendered *incidental cue*, and the agent is later asked to recall a particular cued product and navigate to its URL. Inspired by dual-coding theory, we propose DualMem, a memory architecture that maintains a visual and a verbal code in parallel. On DMV-Bench, DualMem outperforms a caption baseline and three recent multimodal agent-memory systems at every chain length $J \in \{5, 10, 15, 50\}$ on both Gemini 2.5 Flash and Qwen2.5-VL-7B. The lead survives controls for memory-bank size and encoding-position bias, and ablations reveal an *asymmetric dual-coding* regime in which vision carries the cue end-to-end while the verbal channel plays a smaller query-grounding role.
Figure 1: Why interactive visual memory matters. A shopping agent helps a user furnish a room across products spanning *chair*, *lamp*, and *vase* categories. When the user later returns and refers to “the lamp with the alarm clock,” a *text-only* memory has stored only nameable attributes (mushroom shape, frosted glass, cream lampshade) with no record of the incidental alarm-clock cue, and the agent gets stuck. A *visual* memory preserves the cue and lets the agent locate the correct lamp and complete the request.

## 1 Introduction
Much of what humans remember from a long-past experience is recovered not by deliberate rehearsal but by a cue: an incidental perceptual detail (like the colour of a wrapper, or the pattern on a hat) that was not flagged as important at the time, yet later acts as the key that unlocks the rest of the episode. This idea has a long theoretical history. Encoding specificity (Tulving and Thomson, [1973](https://arxiv.org/html/2606.27499#bib.bib38)) holds that a memory is retrievable to the extent that cues present at encoding are reinstated at retrieval, and incidental-encoding studies (Hyde and Jenkins, [1969](https://arxiv.org/html/2606.27499#bib.bib39); Craik and Lockhart, [1972](https://arxiv.org/html/2606.27499#bib.bib40)) show that such cues are routinely laid down without intent to memorise. In humans these cues are disproportionately visual, and the hippocampal mechanism that exploits them, pattern completion from a partial cue to a full episode (Marr, [1971](https://arxiv.org/html/2606.27499#bib.bib41); Nakazawa et al., [2002](https://arxiv.org/html/2606.27499#bib.bib42)), has recently begun to inspire memory systems for LM agents (Gutiérrez et al., [2024](https://arxiv.org/html/2606.27499#bib.bib14)).
Multimodal web agents do not yet remember this way. Working through a task, an agent may stream past hundreds of product images, and unless a detail is flagged as relevant to the current sub-goal it has little reason to encode it. When something is committed to memory, most current systems write it down as text (Packer et al., [2023](https://arxiv.org/html/2606.27499#bib.bib9); Zhong et al., [2024](https://arxiv.org/html/2606.27499#bib.bib10); Xu et al., [2025](https://arxiv.org/html/2606.27499#bib.bib11); Gutiérrez et al., [2024](https://arxiv.org/html/2606.27499#bib.bib14)). So if a user later refers back to something by a visual detail (e.g. *the lamp that had the triangular brass base*), a text memory can confirm that a lamp was seen, but may have nothing to say about which one.
Figure 2: *(a) DMV-Bench.* Each visited product carries a unique incidental cue baked into its image and barred from every text channel by the L2-leakage contract. *(b) DualMem Architecture.* Each observation is dual-coded into a visual embedding and a verbal embedding, stored as four channels in one bank; at retrieval, visual and verbal top-$k$ scores are fused with a tunable weight $\alpha$ before the VLM agent emits an action.

At the same time, carrying every pixel forward is neither feasible nor needed. The pertinent question is *when*: which tasks genuinely require an agent to remember what it *saw*, and for which would a text note have served equally well? Existing benchmarks make this question difficult to settle, because they typically combine visual and textual signals rather than isolating the contribution of each. We build DMV-Bench to make this question answerable.
#### Testing visual recall via incidental cue injection\.
DMV-Bench reduces the question to one task and one mechanism. An agent runs a chain of ordinary comparison-shopping sessions on a realistic storefront. Every product the storefront serves carries a unique, pre-rendered visual cue, e.g. a small object in a particular color baked into the product image at build time. The agent is told to comparison-shop within a category and is given no instruction to attend to or remember any visual detail; cues are present on every visited product but are never mentioned by the task. Between sessions its in-context conversation is wiped, so only its memory architecture carries anything forward; an eval-only agent is later asked to navigate back to a particular cued product. Because the cue lives in the pixels and not in any text channel, a text memory can answer only if its captioner happened to describe an object the task did not explicitly point out. The axis of interest is *recall reach*: how many session boundaries separate the visit from the probe. Sweeping reach turns a single accuracy into a *retention curve*, a direct readout of how long a visual cue survives in a given memory.
#### Why existing benchmarks cannot answer this\.
Three properties of current benchmarks make this question hard to settle. First, they conflate textual and visual recall: in VisualWebArena (Koh et al., [2024](https://arxiv.org/html/2606.27499#bib.bib30)), WebArena (Zhou et al., [2024](https://arxiv.org/html/2606.27499#bib.bib29)), and most long-video QA (Fu et al., [2024](https://arxiv.org/html/2606.27499#bib.bib27); Li et al., [2024](https://arxiv.org/html/2606.27499#bib.bib28)), an agent can solve ostensibly visual tasks by reading captions or alt-text. Second, when visual recall is genuinely required, the discriminative detail is usually nameable (a red sofa versus a blue one), so a text memory is not put under real pressure. Third, the evidence is almost always *flagged* in advance and probed at short range, leaving the question of whether an *unflagged* detail survives a long, multi-session horizon largely unmeasured. The agentic-memory literature has matured quickly, but on the text side: MemoryArena (He et al., [2026](https://arxiv.org/html/2606.27499#bib.bib24)), for instance, rigorously stresses cross-session dependence, yet its observations are textual and it does not ask whether a *visual* detail survives a session boundary.
Overall, our contributions are:
1. We instantiate DMV-Bench, to our knowledge the first benchmark for *interactive, multi-session, visual* agent memory: a realistic e-commerce environment with a calibrated 1,000-variant catalogue in which every visited product image carries a unique, baked-in incidental cue.
2. We frame the *when* question for multi-session agentic visual memory and introduce per-item incidental cue injection as the protocol that operationalizes it: the agent encounters cues throughout each session without any instruction to attend to them.
3. We propose the recall-reach retention diagnostic, which probes recall as a function of how many session boundaries a cue survived, evaluated efficiently over a shared-prefix rollout tree.
4. We propose DualMem, a dual-coding-inspired memory architecture that maintains a visual and a verbal signal in parallel and fuses them at retrieval and injection, and audit it against six baselines including three recent multimodal external memory systems.
## 2 Related Work
#### Text\-side memory systems\.
An explicit read/write/inject machinery is well established for purely textual agents, from operating-system-style hierarchies and Ebbinghaus-inspired forgetting (Packer et al., [2023](https://arxiv.org/html/2606.27499#bib.bib9); Zhong et al., [2024](https://arxiv.org/html/2606.27499#bib.bib10); Shinn et al., [2023](https://arxiv.org/html/2606.27499#bib.bib15)) to autonomous memory operations (Xu et al., [2025](https://arxiv.org/html/2606.27499#bib.bib11); Wang and Chen, [2025](https://arxiv.org/html/2606.27499#bib.bib12); Chhikara et al., [2025](https://arxiv.org/html/2606.27499#bib.bib13)) and hippocampal-style retrieval (Gutiérrez et al., [2024](https://arxiv.org/html/2606.27499#bib.bib14)). A more recent line distils trajectories into reusable units the agent can later compose: Agent Workflow Memory (Wang et al., [2024](https://arxiv.org/html/2606.27499#bib.bib16)) induces program-form workflows from past successes, and ReasoningBank (Ouyang et al., [2026](https://arxiv.org/html/2606.27499#bib.bib17)) extracts strategy-level reasoning items from both successes and failures. Across these systems the unit of memory is textual (a sentence, a fact, a graph node, a workflow, a reasoning step), so diagnosing a failure reduces to a text-retrieval-quality question. DMV-Bench targets the regime where that assumption breaks: the unit becomes visual.
#### Vision\-side memory systems\.
Once observations are images, the design space widens. *In-model* multimodal memories tie storage to a fixed visual encoder: caption-based entity graphs (M3-Agent (Long et al., [2025](https://arxiv.org/html/2606.27499#bib.bib1)), MA-LMM (He et al., [2024](https://arxiv.org/html/2606.27499#bib.bib6)), EgoLife/EgoRAG (Yang et al., [2025](https://arxiv.org/html/2606.27499#bib.bib4))), continuous-token memory via a Q-Former (CoMEM (Wu et al., [2025b](https://arxiv.org/html/2606.27499#bib.bib2); Li et al., [2023](https://arxiv.org/html/2606.27499#bib.bib33))), and discrete-continuous hybrids (HSE-Mem (Zhu et al., [2026](https://arxiv.org/html/2606.27499#bib.bib3))); these are bound to the host model and do not transfer as drop-in modules. We instead focus on *external* multimodal memories that any agent can query: WorldMM (Yeo et al., [2026](https://arxiv.org/html/2606.27499#bib.bib5)) adaptively retrieves across parallel episodic, semantic, and visual modules; M2A (Feng et al., [2026](https://arxiv.org/html/2606.27499#bib.bib7)) couples a raw-message store with a semantic-abstraction store, routed by paired chat and memory-manager agents; MMA (Lu et al., [2026](https://arxiv.org/html/2606.27499#bib.bib8)) reweights retrieved items by source credibility, temporal decay, and conflict-aware consensus; MemVerse (Liu et al., [2025](https://arxiv.org/html/2606.27499#bib.bib18)) maintains a hierarchical multimodal knowledge graph that is periodically distilled back into the host model. These four are the comparison set we benchmark directly against DualMem. Evaluation across both waves stays end-to-end, with little direct measurement of how long a visual entry actually survives a multi-session horizon, the quantity DMV-Bench measures along its reach axis.
#### Agent memory benchmarks\.
On the text side, LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2606.27499#bib.bib19)), LongMemEval (Wu et al., [2025a](https://arxiv.org/html/2606.27499#bib.bib20)), and MemoryAgentBench (Hu et al., [2026](https://arxiv.org/html/2606.27499#bib.bib21)) evaluate long-term conversational memory; MemoryArena (He et al., [2026](https://arxiv.org/html/2606.27499#bib.bib24)) makes the multi-session agentic dimension explicit, but its observations remain textual and it does not test whether a *visual* detail survives a session boundary. On the visual side, FindingDory (Yadav et al., [2025](https://arxiv.org/html/2606.27499#bib.bib23)) stresses embodied long-trajectory agents and EMemBench (Li et al., [2026](https://arxiv.org/html/2606.27499#bib.bib22)) probes VLM episodic memory, while the contemporaneous MemEye (Guo et al., [2026](https://arxiv.org/html/2606.27499#bib.bib26)) evaluates visual-centric multimodal-agent memory at multiple levels of evidence granularity; MemEye, however, is a static QA benchmark rather than an interactive environment in which the agent acts and is scored on what it does. Realistic web-agent environments (Zhou et al., [2024](https://arxiv.org/html/2606.27499#bib.bib29); Koh et al., [2024](https://arxiv.org/html/2606.27499#bib.bib30)) provide the interactive setting, but they do not isolate the agent’s visual memory as a measurement; in VisualWebArena in particular, screenshots are observations but no probe targets long-horizon visual retention. DMV-Bench occupies the intersection these miss (Table [1](https://arxiv.org/html/2606.27499#S2.T1)): an interactive web environment whose evaluation isolates long-horizon *visual* retention along a controlled reach axis.
Table 1: Agent-memory benchmarks contemporaneous with DMV-Bench. To the best of our knowledge, DMV-Bench is the first benchmark designed specifically for *interactive, multi-session, visual* agent memory: prior memory benchmarks are either QA-style, GUI-interactive on mobile screenshots, or mixed web-and-reasoning. None probes the multi-session retention of *visual* cues an agent saw incidentally inside a live environment. For DMV-Bench, the # Tasks cell “46,265/18,588” reports recall-probe tasks on *Gemini 2.5 Flash* / *Qwen2.5-VL-7B*.
## 3 DMV-Bench
DMV-Bench is a diagnostic benchmark for long-horizon visual memory in multimodal agents.
### 3.1 A controlled e-commerce environment
The benchmark lives inside a realistic modern-furniture storefront with hero pages, category grids, product detail pages, breadcrumbs, ratings, and “related items” carousels. Ten product categories (sofas, lamps, rugs, cushions, chairs, side tables, vases, bookshelves, wall art, plant pots) appear in ten interior-design styles (modern, minimalist, mid-century, Scandinavian, industrial, vintage, rustic, bohemian, art deco, Japandi), with ten variants per collection, giving a catalogue of $10 \times 10 \times 10 = 1{,}000$ variants, each bound to the storefront by a frozen `urlHash`. A storefront screenshot of the four navigation levels is given in Appendix [A](https://arxiv.org/html/2606.27499#A1).
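The catalogue arithmetic above can be made concrete with a minimal sketch. The slug spellings and the `url_hash` function are assumptions made for illustration: the paper states only that each variant is bound to the storefront by a frozen `urlHash`, not how that hash is derived.

```python
import hashlib
from itertools import product

# Category and style names follow the paper; the slug spellings are illustrative.
CATEGORIES = ["sofa", "lamp", "rug", "cushion", "chair",
              "side-table", "vase", "bookshelf", "wall-art", "plant-pot"]
STYLES = ["modern", "minimalist", "mid-century", "scandinavian", "industrial",
          "vintage", "rustic", "bohemian", "art-deco", "japandi"]
VARIANTS_PER_COLLECTION = 10


def url_hash(category, style, variant):
    """Hypothetical stand-in for the frozen urlHash: a truncated SHA-256 digest."""
    key = f"{category}/{style}/{variant}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def build_catalogue():
    """Map each urlHash to its (category, style, variant index) triple."""
    return {
        url_hash(c, s, v): (c, s, v)
        for c, s, v in product(CATEGORIES, STYLES, range(VARIANTS_PER_COLLECTION))
    }


catalogue = build_catalogue()
print(len(catalogue))  # one entry per variant as long as no two digests collide
```

Deriving the hash from the variant's identity alone keeps URLs stable across rebuilds, which is what exact-URL scoring needs.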
#### Variant generation\.
For each variant we first synthesize a natural-language prompt naming the product class and the collection’s style. For cued variants the prompt also names a unique *color–object* pair from a bijective cue vocabulary, so every cue is globally unique. Nano-Banana (Google DeepMind, [2025](https://arxiv.org/html/2606.27499#bib.bib36)) renders the base studio photograph and then performs the cue overlay edit, keeping cue rendering consistent across categories and styles. A VLM-as-judge filters generations whose product class drifts.
#### The L2\-leakage contract\.
The primary signal for every task is the cue: a small colored object present only in the pixels of one product image. The L2-leakage contract keeps this signal out of language: the cue vocabulary (object types $\times$ colours) appears in no text channel surrounding a product (not in its title, description, alt-text, URL slug, meta-tags, or template reviews), and a pre-release audit rejects any such occurrence. A text-only memory system therefore has no text source from which the cue could be recorded, which makes the task a genuine test of *visual* memory.
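A pre-release audit of this kind could be sketched as below. This is an illustration under stated assumptions, not the released audit: the channel names, the two-entry vocabulary, and the product dictionaries are hypothetical, and the sketch conservatively flags any single colour or object term, whereas the paper does not say whether its audit matches single terms or full colour-object pairs.

```python
import re

# Hypothetical, tiny cue vocabulary; the real one covers every cued product.
CUE_OBJECTS = ["sleep mask", "alarm clock"]
CUE_COLOURS = ["teal", "crimson"]

# Text channels named in the contract (the field names are assumptions).
TEXT_CHANNELS = ("title", "description", "alt_text", "url_slug", "meta_tags", "reviews")


def cue_terms(objects, colours):
    """All vocabulary terms that must never appear in text."""
    return sorted(set(objects) | set(colours))


def find_leaks(product, terms):
    """Return (channel, term) pairs where a cue term occurs in a text channel."""
    leaks = []
    for channel in TEXT_CHANNELS:
        # Normalise slugs such as "teal-sleep-mask" to space-separated words.
        text = re.sub(r"[-_]+", " ", str(product.get(channel, "")).lower())
        for term in terms:
            if re.search(rf"\b{re.escape(term)}\b", text):
                leaks.append((channel, term))
    return leaks


terms = cue_terms(CUE_OBJECTS, CUE_COLOURS)
clean = {"title": "Mushroom table lamp", "url_slug": "mushroom-table-lamp"}
leaky = {"title": "Mushroom table lamp", "url_slug": "teal-sleep-mask-lamp"}
print(find_leaks(clean, terms))  # []
print(find_leaks(leaky, terms))  # [('url_slug', 'sleep mask'), ('url_slug', 'teal')]
```

A build would be rejected whenever `find_leaks` returns a non-empty list for any product.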
### 3.2 The incidental-cue task
Figure 3: The DMV-Bench task. *Phase 1 (Encoding):* a chain of $J$ sessions $S_0, \ldots, S_{J-1}$ in which a memoryless ReAct agent comparison-shops across at least three product categories (e.g., chair, lamp, vase); cues appear as unique visual patterns in product images but never in text. Sessions are cached and shared across rollouts (§[3.3](https://arxiv.org/html/2606.27499#S3.SS3)). *Phase 2 (Retrieval):* after the chain completes, $k$ probes per visited session ask a VLM navigator to re-locate a cued product by its visual description (e.g., “take me back to the product with the visual cue in $S_2$”); the example probes $S_2$ from $S_4$ at recall reach $r = 2$. Scoring is exact-match on the emitted product URL (§[3.4](https://arxiv.org/html/2606.27499#S3.SS4)).

Every instance in DMV-Bench is an *incidental-cue* (IC) task, as shown in Figure [3](https://arxiv.org/html/2606.27499#S3.F3): a chain of autonomous shopping sessions into which a unique visual cue is injected, followed by recall probes at controlled reach.
#### The session chain\.
A task is a chain of $J$ *sessions*, each one a brief shopping task (*“I’m furnishing a room; find me a chair, a lamp, and a vase”*) that a ReAct agent fulfils over 22–28 steps of free browsing. The open-ended shopping list sustains a long trajectory of unrelated observations through which an injected cue must survive. Within a session the agent runs with no memory. Trajectories are generated once and replayed into each memory baseline, so every baseline sees an identical observation stream.
#### Per\-product incidental cue injection\.
Every product carries one unique pre-rendered cue with three important properties: *(i) unannounced*: the session prompt never mentions cues; *(ii) identity-bound and unknowable*: each cue is fixed at build time and the agent cannot know which product will be probed; *(iii) text-leakage-free*: the L2-leakage contract keeps the cue out of every text channel.
#### Cue uniqueness\.
Cues are drawn from a bijective object–color vocabulary designed to be globally unique across the catalogue, such that a recall query of the form *“the product with the teal sleep mask”* resolves to exactly one product, a necessary condition for deterministic (e.g. URL-based) evaluation.
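One way to obtain such a one-to-one assignment is sketched below. The function name, the product identifiers, and the tiny vocabulary are hypothetical; the paper states only that the vocabulary is bijective, not how pairs are allotted to products.

```python
from itertools import product


def assign_cues(product_ids, colours, objects):
    """Give every product a distinct (colour, object) pair and build the inverse map."""
    if len(set(product_ids)) != len(product_ids):
        raise ValueError("product ids must be unique")
    pairs = list(product(colours, objects))  # every pair appears exactly once
    if len(pairs) < len(product_ids):
        raise ValueError("cue vocabulary too small for the catalogue")
    cue_of = dict(zip(product_ids, pairs))                   # product -> cue
    product_of = {cue: pid for pid, cue in cue_of.items()}   # cue -> product
    return cue_of, product_of


ids = [f"p{i}" for i in range(6)]
cue_of, product_of = assign_cues(ids, ["teal", "amber"], ["sleep mask", "alarm clock", "whistle"])
print(product_of[("teal", "sleep mask")])  # p0
```

Because both maps have the same size, a probe that names a cue resolves to exactly one product, which is what makes exact-URL scoring well defined.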
#### Context wipe and recall probes\.
Between sessions the agent’s in-context conversation is *wiped*; only the memory bank crosses the boundary. After the encoding chain, a read-only ReAct agent issues recall probes against $(\text{visited product}, \text{recall session})$ pairs: each probe states the cue (*“take me back to the product with the teal sleep mask”*) and the agent must navigate to it. Success is exact URL match.
#### Recall reach\.
The diagnostic axis is *recall reach* $r = (\text{recall session}) - (\text{visit session})$: a reach-$1$ probe recalls a product seen in the immediately preceding session, a reach-$4$ probe recalls one whose cue survived four context wipes. Because trajectories are cached, $J$ is freely extensible; we report $J \in \{5, 10, 15\}$ and a Monte Carlo pilot at $J = 50$.
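The reach definition can be illustrated with a short sketch that enumerates candidate probes for a chain. The `visits` mapping and the exhaustive enumeration are assumptions for illustration; the benchmark itself samples a fixed number of probes per visited session rather than every pair.

```python
def enumerate_probes(visits, num_sessions):
    """List every (visited product, later recall session) pair with its reach.

    visits maps a product URL to the index of the session in which it was seen.
    """
    probes = []
    for url, visit in visits.items():
        for recall in range(visit + 1, num_sessions):
            probes.append({"url": url, "visit": visit,
                           "recall": recall, "reach": recall - visit})
    return probes


# Two products seen in S0 and S2 of a J = 5 chain (hypothetical URLs).
probes = enumerate_probes({"/p/a": 0, "/p/b": 2}, num_sessions=5)
print(len(probes))                      # 6
print(max(p["reach"] for p in probes))  # 4
```

With sessions indexed $S_0, \ldots, S_{J-1}$, the largest possible reach is $J - 1$, which matches the reach range of 1 to 49 reported for $J = 50$.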
### 3.3 Efficient evaluation: the rollout tree
Long sessions are expensive, and re-running a full $J$-session chain for every recall probe wastes the shared early sessions. DMV-Bench instead evaluates over a *shared-prefix rollout tree* (annotated in Figure [3](https://arxiv.org/html/2606.27499#S3.F3)): the first session is run once, then $B$ child sessions branch from its end-of-session memory, each branching $B$ ways in turn to depth $J$. A node is executed exactly once and all descendants reuse its memory snapshot, so a tree of depth $J$ and branching factor $B$ costs $(B^{J} - 1)/(B - 1)$ runs while yielding on the order of $B^{J-1}$ distinct recall paths, roughly a $J\times$ saving over flat re-runs at $B = 5$. Snapshot reuse is sound because a memory bank is a deterministic function of its ordered encode sequence. Children are assigned probes spanning different reaches; each leaf contributes one recall instance tagged with visit session, recall session, reach $r$, and bank size.
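The cost comparison can be checked numerically. The sketch below assumes that a flat re-run executes all $J$ sessions once per root-to-leaf path; under that assumption the saving approaches $J(B-1)/B$, i.e. about $0.8J$ at $B = 5$, which is the sense in which it is roughly $J\times$.

```python
def tree_runs(branching, depth):
    """Sessions executed by the shared-prefix tree: one per node, B^d nodes at level d."""
    return sum(branching ** level for level in range(depth))


def flat_runs(branching, depth):
    """Sessions executed if each of the B^(J-1) leaf paths is re-run from scratch."""
    return depth * branching ** (depth - 1)


B, J = 5, 5
print(tree_runs(B, J))                               # 781
print(flat_runs(B, J))                               # 3125
print(round(flat_runs(B, J) / tree_runs(B, J), 2))   # 4.0
```

The sum over levels equals the closed form $(B^{J} - 1)/(B - 1)$ for $B > 1$, so the two expressions can be used interchangeably.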
### 3.4 Evaluation metrics
We treat every recall probe as an independent task. Each probe $p$ resolves to a unique ground-truth product URL; let $y_p \in \{0, 1\}$ equal $1$ iff the agent’s final `navigate` action matches it exactly. We report a single metric, task success rate

$$\mathrm{TSR} = \frac{1}{|P|} \sum_{p \in P} y_p, \tag{1}$$

optionally stratified by reach $r_p$ to expose how retention degrades with horizon. A deterministic URL match, rather than an LLM judge, keeps evaluator noise out of the diagnostic; the bijective cue vocabulary makes each ground-truth URL unique.
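Equation (1) and its per-reach stratification amount to a few lines of bookkeeping. The record layout below (`reach`, `emitted`, `target`) is a hypothetical one chosen for the sketch, not the benchmark's log format.

```python
from collections import defaultdict


def success(probe):
    """y_p: 1 iff the emitted URL matches the ground-truth URL exactly."""
    return int(probe["emitted"] == probe["target"])


def tsr(probes):
    """Task success rate over a non-empty collection of probes, as in Eq. (1)."""
    return sum(success(p) for p in probes) / len(probes)


def tsr_by_reach(probes):
    """TSR stratified by recall reach, giving one point per reach on a retention curve."""
    buckets = defaultdict(list)
    for p in probes:
        buckets[p["reach"]].append(p)
    return {reach: tsr(group) for reach, group in sorted(buckets.items())}


probes = [
    {"reach": 1, "emitted": "/p/a", "target": "/p/a"},
    {"reach": 1, "emitted": "/p/x", "target": "/p/b"},
    {"reach": 2, "emitted": "/p/c", "target": "/p/c"},
    {"reach": 4, "emitted": "/p/y", "target": "/p/d"},
]
print(tsr(probes))           # 0.5
print(tsr_by_reach(probes))  # {1: 0.5, 2: 1.0, 4: 0.0}
```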
## 4 Baselines
A memory architecture is a choice at three stages: encode (what the bank stores), retrieve (how the recall query is matched), and inject (what is re-presented to the VLM). We audit seven architectures along this interface: three reference baselines, three recent multimodal external memories from the literature, and DualMem (ours). The side-by-side placement of all seven in this common coordinate system, with the per-system adapter details, is in Appendix [D](https://arxiv.org/html/2606.27499#A4) (Table [6](https://arxiv.org/html/2606.27499#A4.T6)).
#### DualMem\.
Our architecture (bottom panel of Figure [2](https://arxiv.org/html/2606.27499#S1.F2)) follows *dual-coding theory* (Paivio, [1971](https://arxiv.org/html/2606.27499#bib.bib37)): memory is most robust when information is held in a visual and a verbal signal at once, each retrievable on its own. At encoding, every observed product page $o$ is dual-coded into a visual signal $v_o$ via SigLIP-2 (Tschannen et al., [2025](https://arxiv.org/html/2606.27499#bib.bib35)) and a verbal signal $t_o$ via SBERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.27499#bib.bib34)) over the page’s VLM-generated caption; both are $L_2$-normalised. At a recall query $q$, the same two encoders embed the query into $q_v$ and $q_t$, and for each bank entry $i$ we score the two channels by inner products $s_v^{(i)} = \langle q_v, v_i \rangle$ and $s_t^{(i)} = \langle q_t, t_i \rangle$. We combine them after min-max normalisation within the bank, applied to each channel $x \in \{s_v, s_t\}$, so the two channels are commensurate even when their raw similarity ranges differ:

$$\widehat{x}^{(i)} = \frac{x^{(i)} - \min_j x^{(j)}}{\max_j x^{(j)} - \min_j x^{(j)}}, \tag{2}$$

$$s^{(i)} = \alpha\, \widehat{s}_v^{(i)} + (1 - \alpha)\, \widehat{s}_t^{(i)}, \tag{3}$$

with $\alpha = 0.75$ in our runs. The top entry $e^* = \arg\max_i s^{(i)}$ is then injected back into the VLM as both the raw image $I_{e^*}$ and the caption $c_{e^*}$.
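Equations (2) and (3) can be sketched in plain Python as follows. The toy two-dimensional vectors stand in for SigLIP-2 and SBERT embeddings, which are not computed here, and returning all zeros for a constant score channel is an assumption, since the paper does not say how a zero range is handled.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_max(scores):
    """Eq. (2): rescale scores to [0, 1] within the bank (all zeros if constant)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def retrieve(q_v, q_t, bank_v, bank_t, alpha=0.75):
    """Eq. (3): fuse normalised visual and verbal scores; return (argmax index, scores)."""
    s_v = min_max([dot(q_v, v) for v in bank_v])
    s_t = min_max([dot(q_t, t) for t in bank_t])
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(s_v, s_t)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Toy unit vectors; entry 1 is the visual match, entry 0 the verbal match.
bank_v = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
bank_t = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
best, fused = retrieve([0.0, 1.0], [1.0, 0.0], bank_v, bank_t, alpha=0.75)
print(best)  # 1
```

With $\alpha = 0$ the same call returns entry 0, the verbal match, which shows how the weight shifts retrieval between the two codes.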
## 5 Experiments
Table 2: Task success rate (%) across chain length and VLM back-end. The table is split into two side-by-side sub-blocks, one per back-end. Within each block the columns sweep $J = 5, 10, 15$ and a Monte Carlo pilot at $J = 50$. The Gemini and Qwen agents visit different numbers of products, so probe counts $n_r$ differ; we therefore report each back-end in its own block. Bold = best per column, underline = second-best.

We audit the seven memory architectures of §[4](https://arxiv.org/html/2606.27499#S4) on the incidental-cue task, sweeping chain length $J \in \{5, 10, 15\}$ plus a Monte Carlo pilot at $J = 50$ ($N = 5$ chains, sparse reach sampling over $1$–$49$, $n_r = 2{,}407$). Each run is executed in parallel on Gemini 2.5 Flash (Gemini Team, Google, [2024](https://arxiv.org/html/2606.27499#bib.bib31)) and Qwen2.5-VL-7B-Instruct (Bai et al., [2025](https://arxiv.org/html/2606.27499#bib.bib32)).
#### Why $n_r$ differs across back-ends.
The two VLM back-ends share an identical task setup, yet the probe counts $n_r$ differ at every reach. Every *distinct product visited* during encoding becomes a recall probe, so fewer products visited means fewer probes. Across the 550 encoded sessions per back-end (Table [3](https://arxiv.org/html/2606.27499#S5.T3)), both agents take essentially the same number of steps, but Gemini 2.5 Flash visits $11.02 \pm 1.11$ distinct products per session versus only $4.21 \pm 2.09$ for Qwen2.5-VL-7B. Despite a system-prompt directive to visit “at least 3 product categories,” $33.1\%$ of Qwen sessions ($182/550$) fall below this floor, including $18$ that visit a single product; Gemini never violates it. This instruction-following gap (Figure [4](https://arxiv.org/html/2606.27499#S5.F4)) explains the smaller $n_r$ for Qwen and is orthogonal to the recall-accuracy axis in Table [2](https://arxiv.org/html/2606.27499#S5.T2).
Table 3: Per-session encoding statistics over $N = 550$ sessions per back-end.

Figure 4: Per-session activity for both back-ends ($N = 550$ each). (a) Agent steps per session. (b) Distinct products visited per session for Qwen2.5-VL-7B (blue) and Gemini 2.5 Flash (orange).
#### DualMem is the strongest architecture\.
Table [2](https://arxiv.org/html/2606.27499#S5.T2) reports TSR across $J$ on both back-ends. First, *DualMem leads at every $J$ on both back-ends*; all DualMem results in this table use the visual-dominant retrieval weight $\alpha = 0.75$ (see the ablation in Figure [7](https://arxiv.org/html/2606.27499#S5.F7)). Second, M2A is the consistent runner-up, and the ranking among Caption, MMA, and WorldMM is less stable across cells. Finally, the *verbal floors (NoMemory, TextOnly) sit at $0\%$ everywhere*, confirming that the L2-leakage contract holds and visual information is necessary. Per-reach breakdowns for all four chain lengths are in Appendix [F](https://arxiv.org/html/2606.27499#A6).
#### Memory\-bank and positional checks\.
Figures [5](https://arxiv.org/html/2606.27499#S5.F5) and [6](https://arxiv.org/html/2606.27499#S5.F6) stratify TSR along the two axes that most naturally explain a memory-architecture gap, with one figure per back-end. First, *memory-bank size* (top row of each figure): DualMem stays high across the full sweep, while baselines degrade as the bank grows, so its lead in Table [2](https://arxiv.org/html/2606.27499#S5.T2) is not the artefact of a smaller bank. Second, *encoding position $t$* (bottom row): DualMem is essentially flat across $t$ on both back-ends, while baselines exhibit position drifts, so the lead is not driven by remembering only the most recent or earliest sessions. DualMem’s robustness alongside the baselines’ degradation attributes the gap to the memory architecture itself.
Figure 5: Two confound checks, all five memory architectures, Qwen2.5-VL-7B. *Top:* TSR vs. memory-bank size at recall. *Bottom:* TSR by encoding position $t$. DualMem (blue) stays roughly flat across both axes at every $J$; baselines degrade as the bank grows and exhibit weak position drifts.

Figure 6: Two confound checks, all five memory architectures, Gemini 2.5 Flash. *Top:* TSR vs. memory-bank size at recall. *Bottom:* TSR by encoding position $t$. Same legend and same conclusions as Figure [5](https://arxiv.org/html/2606.27499#S5.F5), on the Gemini back-end: DualMem (orange) is roughly flat along both axes while baselines degrade.
#### Asymmetric dual coding: vision contains the key and text grounds the query\.
The L2\-leakage contract places every cue in vision only, such that the two channels are asymmetric by construction\.
*Retrieval.* Vision does the heavy lifting; the $\alpha$ sweep in Figure [7](https://arxiv.org/html/2606.27499#S5.F7) rises monotonically to $\alpha = 0.75$ ($82.7\%$), with pure-visual retrieval at $80.1\%$ and pure-verbal retrieval collapsing to $59.5\%$. The interior peak, at a verbal weight of $0.25$, indicates that the verbal channel helps as a query-grounding scaffold rather than as a cue carrier.
*Injection.* The captioner is unconstrained (not filtered against the cue vocabulary) but is prompted to focus on product attributes, so it verbalises the product’s actual incidental cue (both its colour and object name) in only $16.5\%$ of the 1,000 `with_cue` captions; most cues do not survive the visual-to-text compression. Image-only injection ($75.9$) therefore essentially ties image+caption ($76.9$), while caption-only collapses ($65.1$). The bottom sub-block of Table [4](https://arxiv.org/html/2606.27499#S5.T4) shows this asymmetry *widens* when retrieval is solved (image $79.0$ vs. caption $64.0$ under visual-only retrieval), isolating the injection bottleneck cleanly.
*Encoder.* Replacing SigLIP-2 with CLIP costs $11.5$ points ($76.9 \to 65.4$) because both retrieval and injection depend on visual-code discriminability.
Together these results describe*asymmetric dual coding*: vision carries the cue end\-to\-end while text plays a smaller query\-grounding role\.
Table 4: DualMem ablations at $J = 5$ on Gemini 2.5 Flash. Bold = best SR; underline = second-best.
#### Fine-grained $\alpha$ sweep on hybrid retrieval.
The asymmetric-dual-coding picture motivates a finer sweep over $\alpha$ in $s = \alpha\, \widehat{s}_v + (1 - \alpha)\, \widehat{s}_t$. Figure [7](https://arxiv.org/html/2606.27499#S5.F7) reports SR at five evenly spaced $\alpha$ values, with the encoder (SigLIP-2) and injection format (image+caption) fixed. The endpoints reproduce the verbal-only ($\alpha = 0$, $59.5\%$) and visual-only ($\alpha = 1$, $80.1\%$) rows of Table [4](https://arxiv.org/html/2606.27499#S5.T4); the curve rises monotonically to a peak of $82.7\%$ at $\alpha = 0.75$ before dropping $2.6$ points at $\alpha = 1$. A verbal weight of $0.25$ thus beats pure-visual retrieval, consistent with the verbal channel acting as query grounding in the asymmetric regime. We adopt $\alpha = 0.75$ as the operating point in Table [2](https://arxiv.org/html/2606.27499#S5.T2).
Figure 7: $\alpha$ sweep on Gemini 2.5 Flash at $J = 5$. Encoder fixed at SigLIP-2 and injection at image+caption. Endpoints $\alpha = 0$ and $\alpha = 1$ recover verbal-only and visual-only retrieval; the peak at $\alpha = 0.75$ (bold) exceeds the visual endpoint by $2.6$ pts.
## 6 Conclusion
For all the progress we have made in giving agents the ability to see, we have largely treated their visual inputs as momentary observations to be acted on and then discarded\. We envision agents with a kind of perceptual continuity, wherein a persistent visual map of their environment can grow richer and fuller over time and power the small acts of recognition and familiarity that make assistance useful over the long haul\. This might in turn facilitate agents that better reflect our preferences and goals\. DMV\-Bench takes a first step toward this by isolating and precisely measuring visual memory\. We invite the community to take perceptual continuity seriously as a design target in its own right, alongside reasoning, planning, and dialog\.
## Limitations
The synthetic modern-furniture catalogue leaves transfer to other visual domains untested, and the main grid uses two back-ends (Gemini 2.5 Flash, Qwen2.5-VL-7B); broader cross-VLM coverage and a human ceiling are deferred. We sweep $\alpha$ at evenly-spaced values on Gemini 2.5 Flash at $J=5$ (Figure [7](https://arxiv.org/html/2606.27499#S5.F7)), then apply the same $\alpha=0.75$ to Qwen2.5-VL-7B without a separate sweep. The consistent DualMem lead across both back-ends in Table 2 indicates the choice transfers reasonably well, but per-back-end tuning could yield additional gains. A natural follow-up is a more adaptive vision/verbal fusion (per-query weighting or a learned gate conditioned on the query and candidate set). We leave both to future work.
## References
- S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, and J\. Lin \(2025\)Qwen2\.5\-VL technical report\.InarXiv preprint arXiv:2502\.13923,Cited by:[§5](https://arxiv.org/html/2606.27499#S5.p1.6)\.
- Mem0: building production\-ready AI agents with scalable long\-term memory\.InarXiv preprint arXiv:2504\.19413,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- F\. I\. M\. Craik and R\. S\. Lockhart \(1972\)Levels of processing: a framework for memory research\.InJournal of Verbal Learning and Verbal Behavior,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p1.1)\.
- J\. Feng, B\. Xu, J\. Chen, M\. Dai, C\. Wu, H\. Li, B\. Zeng, Y\. Xie, H\. Liang, M\. Lu, and W\. Zhang \(2026\)M2A: multimodal memory agent with dual\-layer hybrid memory for long\-term personalized interactions\.InarXiv preprint arXiv:2602\.07624,Cited by:[Appendix D](https://arxiv.org/html/2606.27499#A4.SS0.SSS0.Px2.p1.1),[Table 6](https://arxiv.org/html/2606.27499#A4.T6.1.1.9.9.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Fu, Y\. Dai, Y\. Luo, L\. Li, S\. Ren, R\. Zhang, Z\. Wang, C\. Zhou, Y\. Shen, M\. Zhang, P\. Chen, Y\. Li, S\. Lin, S\. Zhao, K\. Li, T\. Xu, X\. Zheng, E\. Chen, C\. Shan, R\. He, and X\. Sun \(2024\)Video\-MME: the first\-ever comprehensive evaluation benchmark of multi\-modal LLMs in video analysis\.InarXiv preprint arXiv:2405\.21075,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.SS0.SSS0.Px2.p1.1)\.
- Gemini Team, Google \(2024\)Gemini: a family of highly capable multimodal models\.InarXiv preprint arXiv:2312\.11805,Cited by:[§5](https://arxiv.org/html/2606.27499#S5.p1.6)\.
- Google DeepMind \(2025\)Introducing gemini 2\.5 flash image \(nano\-banana\), our state\-of\-the\-art image model\.Note:[https://developers\.googleblog\.com/en/introducing\-gemini\-2\-5\-flash\-image/](https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/)Google Developers BlogCited by:[Appendix B](https://arxiv.org/html/2606.27499#A2.p1.1),[§3\.1](https://arxiv.org/html/2606.27499#S3.SS1.SSS0.Px1.p1.1)\.
- M\. Guo, Q\. Jiao, Z\. Shi, Y\. Quan, B\. Zhang, D\. Li, L\. Che, W\. Xu, S\. Liu, Z\. Liu, M\. Kapadia, V\. Pavlovic, J\. Liu, M\. Wang, Y\. Shi, D\. N\. Metaxas, and R\. Tang \(2026\)MemEye: a visual\-centric evaluation framework for multimodal agent memory\.InarXiv preprint arXiv:2605\.15128,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27499#S2.T1.5.5.10.4.1)\.
- B\. J\. Gutiérrez, Y\. Shu, Y\. Gu, M\. Yasunaga, and Y\. Su \(2024\)HippoRAG: neurobiologically inspired long\-term memory for large language models\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p1.1),[§1](https://arxiv.org/html/2606.27499#S1.p2.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- B\. He, H\. Li, Y\. K\. Jang, M\. Jia, X\. Cao, A\. Shah, A\. Shrivastava, and S\. Lim \(2024\)MA\-LMM: memory\-augmented large multimodal model for long\-term video understanding\.InCVPR,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. He, Y\. Wang, C\. Zhi, Y\. Hu, T\. Chen, L\. Yin, Z\. Chen, T\. A\. Wu, S\. Ouyang, Z\. Wang, J\. Pei, J\. McAuley, Y\. Choi, and A\. Pentland \(2026\)MemoryArena: benchmarking agent memory in interdependent multi\-session agentic tasks\.InarXiv preprint arXiv:2602\.16313,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27499#S2.T1.5.5.8.2.1)\.
- Y\. Hu, Y\. Wang, and J\. McAuley \(2026\)Evaluating memory in LLM agents via incremental multi\-turn interactions\.InICLR,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27499#S2.T1.5.5.9.3.1)\.
- T\. S\. Hyde and J\. J\. Jenkins \(1969\)Differential effects of incidental tasks on the organization of recall of a list of highly associated words\.InJournal of Experimental Psychology,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p1.1)\.
- J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. C\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried \(2024\)VisualWebArena: evaluating multimodal agents on realistic visual web tasks\.InACL,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Li, D\. Li, S\. Savarese, and S\. Hoi \(2023\)BLIP\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models\.InICML,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Li, Y\. Wang, Y\. He, Y\. Li, Y\. Wang, Y\. Liu, Z\. Wang, J\. Xu, G\. Chen, P\. Luo, L\. Wang, and Y\. Qiao \(2024\)MVBench: a comprehensive multi\-modal video understanding benchmark\.InCVPR,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.SS0.SSS0.Px2.p1.1)\.
- X\. Li, Z\. Zhu, S\. Liu, Y\. Ma, Y\. Zang, Y\. Cao, and A\. Sun \(2026\)EMemBench: interactive benchmarking of episodic memory for VLM agents\.InarXiv preprint arXiv:2601\.16690,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Liu, P\. Zhao, Y\. Liang, Q\. Luo, S\. Tang, Y\. Chai, W\. Lin, H\. Xiao, W\. Wang, S\. Chen, Z\. Lu, G\. Wu, H\. Wang, L\. Liu, and Y\. Liu \(2026\)MemGUI\-Bench: benchmarking memory of mobile GUI agents in dynamic environments\.InarXiv preprint arXiv:2602\.06075,Cited by:[Table 1](https://arxiv.org/html/2606.27499#S2.T1.5.5.7.1.1)\.
- J\. Liu, Y\. Sun, W\. Cheng, H\. Lei, Y\. Chen, L\. Wen, X\. Yang, D\. Fu, P\. Cai, N\. Deng, Y\. Yu, S\. Hu, B\. Shi, and D\. Wang \(2025\)MemVerse: multimodal memory for lifelong learning agents\.InarXiv preprint arXiv:2512\.03627,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Long, Y\. He, W\. Ye, Y\. Pan, Y\. Lin, H\. Li, J\. Zhao, and W\. Li \(2025\)Seeing, listening, remembering, and reasoning: a multimodal agent with long\-term memory\.InarXiv preprint arXiv:2508\.09736,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.27499#S2.T1.4.4.4.2)\.
- Y\. Lu, W\. Cheng, Z\. Zhang, and H\. Tang \(2026\)MMA: multimodal memory agent\.InarXiv preprint arXiv:2602\.16493,Cited by:[Appendix D](https://arxiv.org/html/2606.27499#A4.SS0.SSS0.Px2.p1.1),[Table 6](https://arxiv.org/html/2606.27499#A4.T6.1.1.8.8.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\)Evaluating very long\-term conversational memory of LLM agents\.InACL,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27499#S2.T1.2.2.2.3)\.
- D\. Marr \(1971\)Simple memory: a theory for archicortex\.InPhilosophical Transactions of the Royal Society of London\. Series B,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p1.1)\.
- K\. Nakazawa, M\. C\. Quirk, R\. A\. Chitwood, M\. Watanabe, M\. F\. Yeckel, L\. D\. Sun, A\. Kato, C\. A\. Carr, D\. Johnston, M\. A\. Wilson, and S\. Tonegawa \(2002\)Requirement for hippocampal CA3 NMDA receptors in associative memory recall\.InScience,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p1.1)\.
- S\. Ouyang, J\. Yan, I\. Hsu, Y\. Chen, K\. Jiang, Z\. Wang, R\. Han, L\. T\. Le, S\. Daruki, X\. Tang, V\. Tirumalashetty, G\. Lee, M\. Rofouei, H\. Lin, J\. Han, C\. Lee, and T\. Pfister \(2026\)ReasoningBank: scaling agent self\-evolving with reasoning memory\.InICLR,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\)MemGPT: towards LLMs as operating systems\.InarXiv preprint arXiv:2310\.08560,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p2.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Paivio \(1971\)Imagery and verbal processes\.InHolt, Rinehart and Winston,Cited by:[§4](https://arxiv.org/html/2606.27499#S4.SS0.SSS0.Px1.p1.10)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-BERT: sentence embeddings using siamese BERT\-networks\.InEMNLP,Cited by:[§4](https://arxiv.org/html/2606.27499#S4.SS0.SSS0.Px1.p1.10)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Tschannen, A\. Gritsenko, X\. Wang, M\. F\. Naeem, I\. Alabdulmohsin, N\. Parthasarathy, T\. Evans, L\. Beyer, Y\. Xia, B\. Mustafa, O\. Hénaff, J\. Harmsen, A\. Steiner, and X\. Zhai \(2025\)SigLIP 2: multilingual vision\-language encoders with improved semantic understanding, localization, and dense features\.InarXiv preprint arXiv:2502\.14786,Cited by:[§4](https://arxiv.org/html/2606.27499#S4.SS0.SSS0.Px1.p1.10)\.
- E\. Tulving and D\. M\. Thomson \(1973\)Encoding specificity and retrieval processes in episodic memory\.InPsychological Review,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p1.1)\.
- Y\. Wang and X\. Chen \(2025\)MIRIX: multi\-agent memory system for LLM\-based agents\.InarXiv preprint arXiv:2507\.07957,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2024\)Agent workflow memory\.InarXiv preprint arXiv:2409\.07429,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025a\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.InICLR,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27499#S2.T1.3.3.3.2)\.
- W\. Wu, K\. Zhou, R\. Yuan, V\. Yu, S\. Wang, Z\. Hu, and B\. Huang \(2025b\)Auto\-scaling continuous memory for GUI agent\.InarXiv preprint arXiv:2510\.09038,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\)A\-MEM: agentic memory for LLM agents\.InarXiv preprint arXiv:2502\.12110,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p2.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Yadav, Y\. Ali, G\. Gupta, Y\. Gal, and Z\. Kira \(2025\)FindingDory: a benchmark to evaluate memory in embodied agents\.InarXiv preprint arXiv:2506\.15635,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Yang, S\. Liu, H\. Guo, Y\. Dong, X\. Zhang, S\. Zhang, P\. Wang, Z\. Zhou, B\. Xie, Z\. Wang, B\. Ouyang, Z\. Lin, M\. Cominelli, Z\. Cai, B\. Li, Y\. Zhang, P\. Zhang, F\. Hong, J\. Widmer, F\. Gringoli, L\. Yang, and Z\. Liu \(2025\)EgoLife: towards egocentric life assistant\.InCVPR,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Yeo, K\. Kim, J\. Yoon, and S\. J\. Hwang \(2026\)WorldMM: dynamic multimodal memory agent for long video reasoning\.InCVPR,Cited by:[Appendix D](https://arxiv.org/html/2606.27499#A4.SS0.SSS0.Px2.p1.1),[Table 6](https://arxiv.org/html/2606.27499#A4.T6.1.1.7.7.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\)MemoryBank: enhancing large language models with long\-term memory\.InAAAI,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.p2.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\)WebArena: a realistic web environment for building autonomous agents\.InICLR,Cited by:[§1](https://arxiv.org/html/2606.27499#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Zhu, W\. Wu, K\. Zhou, S\. Wang, and B\. Huang \(2026\)Hybrid self\-evolving structured memory for GUI agents\.InarXiv preprint arXiv:2603\.10291,Cited by:[§2](https://arxiv.org/html/2606.27499#S2.SS0.SSS0.Px2.p1.1)\.
## Appendix A DMV-Bench storefront layout
DMV-Bench is served as a live e-commerce site (Next.js + Playwright); the agent’s observations are real DOM snapshots and rendered images, not curated thumbnails. The site exposes four navigation levels (Figure [8](https://arxiv.org/html/2606.27499#A1.F8)):

- (a) Homepage: a single hero panel and a 10-cell “Shop by category” grid (chair, sofa, lamp, cushion, vase, rug, side\_table, bookshelf, plant\_pot, wall\_art).
- (b) Category page: the 10 style-coherent collections that live under a category, each preview card showing the collection name, item count, and price range.
- (c) Style page: the 10 individual product variants in one collection, each with its rendered photo, name, and price.
- (d) Product detail page: the variant’s main image, price, an L2-compliant attribute summary (*colour: n/a, material: varied*), an Add to wishlist button (the agent’s terminal action), customer reviews, and a “More from this collection” carousel.

Together these four levels instantiate the 10 categories $\times$ 10 styles $\times$ 10 variants $=$ 1,000 products.
Two design features the figure makes visible are load\-bearing for the diagnostic in §[3\.1](https://arxiv.org/html/2606.27499#S3.SS1)\. First, the*L2\-leakage contract*: every visible textual surface \(titles, prices, attribute labels, breadcrumbs, footer links\) carries only the product class and a collection name\. The discriminative incidental cue baked into each variant’s image \(e\.g\. the red bow on the back of Lumen Chair 01\) appears nowhere in text, so a memory architecture that compresses observations into language cannot recover it\. Second, the*no\-cross\-page\-persistence*\(NCP\) invariant: the small “Recently viewed” strip at the bottom of the category and style pages renders only thumbnails of products visited within the current Playwright tenancy and is reset between sessions, so the storefront UI never leaks a previous\-session observation back to the agent\. The only path that bridges sessions is the memory architecture under test\.
Figure 8:The four navigation levels of the DMV\-Bench storefront\.\(a\) Homepage with the 10\-category shop grid; \(b\) Category page listing the 10 collections in a category; \(c\) Style page listing the 10 variants of one collection; \(d\) Product detail page with image, L2\-compliant attributes, “Add to wishlist” \(the agent’s terminal action\), and a “More from this collection” carousel\.
## Appendix B Cue edit prompts
Every `with_cue` variant is rendered by Nano-Banana (Google DeepMind, [2025](https://arxiv.org/html/2606.27499#bib.bib36)) as an image-edit on the base studio photograph, instructed by a single templated prompt. The template is:
Add a small \{color\} \{object\_name\}, \{placement\}, in a naturally placed way\. The object should be subtle and modest in size, clearly visible but not dominating the scene\. Keep the \{cat\_noun\} itself and the background completely unchanged\. Photographic realism, no text, no watermark, no caption overlay\.
Slot fills are deterministic functions of $(\text{cat},\text{style},\text{prod\_idx})$. `{color}` is drawn from a fixed 10-colour palette (red, blue, green, yellow, white, black, brown, beige, orange, purple), keyed by `prod_idx`; `{object_name}` and `{placement}` are drawn from a per-category, per-style vocabulary keyed by `style_idx`, so the (object, colour) pair is bijective across the whole catalogue. Table [5](https://arxiv.org/html/2606.27499#A2.T5) lists one representative cue per category to show the vocabulary’s flavour.
Table 5: Cue vocabulary (one representative per category). Each category provides 10 objects (one per style) and each is paired with a placement clause appropriate to that product class. Combined with the 10 colours, this yields the 1,000 bijective $(\text{cat},\text{style},\text{prod})\to(\text{object},\text{colour})$ assignments. Filling the template above with one row of this table and one colour gives the exact prompt shipped to Nano-Banana.
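A minimal sketch of the deterministic slot filling, under stated assumptions: the palette order follows the order in which the colours are listed above, indices are 0-based, `cat_noun` is derived by replacing underscores with spaces, and the vocabulary entries are hypothetical (the single concrete entry borrows the "bow on the back of the chair" example from Appendix A). The real per-category vocabulary is the one summarised in Table 5.

```python
from typing import Dict, Tuple

# Assumption: palette order follows the order listed in the text.
PALETTE = ["red", "blue", "green", "yellow", "white",
           "black", "brown", "beige", "orange", "purple"]

TEMPLATE = (
    "Add a small {color} {object_name}, {placement}, in a naturally placed way. "
    "The object should be subtle and modest in size, clearly visible but not "
    "dominating the scene. Keep the {cat_noun} itself and the background "
    "completely unchanged. Photographic realism, no text, no watermark, "
    "no caption overlay."
)

# (category, style_idx) -> (object_name, placement); hypothetical entry.
Vocab = Dict[Tuple[str, int], Tuple[str, str]]
EXAMPLE_VOCAB: Vocab = {("chair", 0): ("bow", "tied to the back of the chair")}


def cue_pair(cat: str, style_idx: int, prod_idx: int, vocab: Vocab) -> Tuple[str, str]:
    """Return the (object, colour) pair assigned to one product variant."""
    object_name, _ = vocab[(cat, style_idx)]
    return object_name, PALETTE[prod_idx]


def cue_prompt(cat: str, style_idx: int, prod_idx: int, vocab: Vocab) -> str:
    """Fill the edit-prompt template for one product variant."""
    object_name, placement = vocab[(cat, style_idx)]
    return TEMPLATE.format(
        color=PALETTE[prod_idx],
        object_name=object_name,
        placement=placement,
        cat_noun=cat.replace("_", " "),  # assumption: noun derived from slug
    )


# Synthetic 10 x 10 vocabulary with distinct placeholder objects, used only
# to check that (object, colour) is unique across all 1,000 variants.
CATS = ["chair", "sofa", "lamp", "cushion", "vase",
        "rug", "side_table", "bookshelf", "plant_pot", "wall_art"]
SYNTHETIC: Vocab = {(c, s): (f"object-{c}-{s}", "next to the item")
                    for c in CATS for s in range(10)}
PAIRS = {cue_pair(c, s, p, SYNTHETIC)
         for c in CATS for s in range(10) for p in range(10)}
```

The uniqueness check only holds if the 100 per-category, per-style objects are themselves distinct, which is what the bijectivity claim in the text requires.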
## Appendix C Sample session dialogue
Every agent back-end in DMV-Bench (Gemini 2.5 Flash and Qwen2.5-VL-7B) receives an identical system prompt and the same ReAct-format user message at every step. We show one encoding-session step and one recall-session step end-to-end so the prompt structure is visible. Long lines are abridged with … for space; the rendered images and Playwright DOM are passed alongside but not reproduced here. The same harness, the same prompts, and the same memory injectors are used for both back-ends; only the model weights differ.
### System prompt \(sent once per session, both VLMs\)
System

You are a shopping-assistant agent in an e-commerce website. You receive (i) the customer’s instruction, (ii) the conversation so far, (iii) memory context from prior sessions (may be empty), and (iv) the current page. Decide ONE action.

URL patterns that exist on the site: `/`, `/category/<slug>`, `/collection/<slug>-<style>`, `/product/<8-hex-hash>`, `/wishlist`, `/cart`.

Valid category slugs: chair, sofa, lamp, cushion, vase, rug, table, bookshelf, plant\_pot, wall\_art.

Valid styles: modern, minimalist, vintage, industrial, scandinavian, bohemian, mid\_century, rustic, japandi, art\_deco.

Action vocabulary: `navigate("/...")`, `click_index(N)` on a listing page ($N=0\ldots3$), `add_to_wishlist` (terminal), `done` (failure exit).

Some user requests may ask you to navigate to a product the customer has previously visited. You cannot see earlier sessions in the conversation; only the \[Memory context\] block bridges them.

Strategy hints:

- The \[Memory context\] block lists products recalled from prior sessions, each with its `/product/<hash>` URL and image. If a remembered product matches what the customer is asking for, `navigate("/product/<hash>")` DIRECTLY; this is the fastest and intended path. You DO know the exact hash in that case.
- Inspect the memory image(s): pick the product whose image matches what the customer is describing, then navigate to that product’s hash.
- Only if memory gives you nothing usable: from a category page use `navigate("/collection/<slug>-<style>")` then `click_index(N)` on the N-th product card. You can NOT guess `/product/<hash>` URLs you have never seen.
- Once on the `/product/<hash>` page the customer asked for, emit `add_to_wishlist`: that ends the session successfully.

Reply in ReAct format, ONE Thought line + ONE Action line: `Thought: <one sentence reasoning> Action: <action>`
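To make the reply contract concrete, the following sketch parses a ReAct-style reply into one of the four actions named in the system prompt. It is an illustration only: `Action` and `parse_reply` are hypothetical names, not part of the DMV-Bench harness, and the check that `click_index` stays within 0 to 3 simply mirrors the range stated in the prompt.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    name: str                  # navigate | click_index | add_to_wishlist | done
    arg: Optional[str] = None  # URL path or card index, when applicable


_REPLY = re.compile(r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.+)", re.DOTALL)
_NAVIGATE = re.compile(r'navigate\("(/[^"]*)"\)')
_CLICK = re.compile(r"click_index\((\d+)\)")


def parse_reply(reply: str) -> Action:
    """Extract the single action from a 'Thought: ... Action: ...' reply."""
    match = _REPLY.search(reply)
    if match is None:
        raise ValueError("reply is not in 'Thought: ... Action: ...' format")
    action = match.group("action").strip()
    if action in ("add_to_wishlist", "done"):
        return Action(action)
    nav = _NAVIGATE.fullmatch(action)
    if nav:
        return Action("navigate", nav.group(1))
    click = _CLICK.fullmatch(action)
    if click:
        if int(click.group(1)) > 3:  # the prompt allows N = 0..3 only
            raise ValueError("click_index out of range")
        return Action("click_index", click.group(1))
    raise ValueError(f"unknown action: {action!r}")


example = parse_reply('Thought: Rank-1 matches. Action: navigate("/product/a7c1e9b2")')
```

The parser accepts the Thought and Action parts on one line or on two, since the flattened prompt text does not settle which layout the harness expects.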
### Encoding session \(Session 3 of 10, step 7/25\)
User

Step 7/25.

\[Subtask\] Comparison-shop for a chair: open a wide range (step 7/25). So far: 5 distinct products viewed across 3 styles. Aim for ≥3 styles and ∼12 products; open NEW products you have not seen yet.

\[Conversation so far\]

- User: I’m furnishing a reading nook; show me some chairs.
- Agent: Thought: start on the chair category. Action: `navigate("/category/chair")`
- Agent: Thought: open the modern collection first. Action: `navigate("/collection/chair-modern")`
- Agent: Thought: click the first card to inspect it. Action: `click_index(0)`
- … (4 turns elided) …

\[Memory context retrieved this step\] (no memory retrieved this step)

\[Current page\] URL: `/collection/chair-vintage` Title: Vintage Chair Collection Description: … four product cards …

Reply with one Thought + one Action.

Agent (Gemini / Qwen)

Thought: I haven’t seen the vintage style yet; open the second vintage card to widen coverage. Action: `click_index(1)`
*Behind the scenes:*this step lands on a product page whose image carries a unique pre\-rendered incidental cue \(e\.g\. a teal sleep mask resting on the chair\)\. The encoding agent never sees the cue mentioned; the memory architecture under test ingests the page autonomously\.
### Recall session (Session 7, step 2/25, $r=4$ from encoding)
User

Step 2/25.

\[Subtask\] Recall: go to the product with the teal sleep mask.

\[Conversation so far\]

- User: Take me to the product with the teal sleep mask I saw earlier.
- Agent: Thought: I should check memory first. Action: `navigate("/")`

\[Memory context retrieved this step\] (rendered through the memory injector below)

\[Current page\] URL: `/` Title: Home Description: …

Reply with one Thought + one Action.

Memory context

5 item(s) recalled from earlier sessions, most relevant first.

- Rank 1. URL `/product/a7c1e9b2`: vintage chair (memory image attached)
- Rank 2. URL `/product/3f88c401`: modern chair (memory image attached)
- Rank 3. URL `/product/9d44a7e0`: minimalist chair (memory image attached)

Two further entries omitted for space.

Agent (Gemini / Qwen)

Thought: Rank-1’s attached image shows a teal sleep mask on the chair seat, matching the customer’s description. Action: `navigate("/product/a7c1e9b2")`

The next step lands on `/product/a7c1e9b2`, and the agent emits `add_to_wishlist`, terminating the session successfully. Both back-ends use this exact dialogue surface; the only behavioural difference between Gemini and Qwen is how each parses the attached memory image against the customer’s verbal description (“teal sleep mask”), which is exactly the visual-memory capability DMV-Bench is designed to measure. Crucially, the system prompt itself never primes the agent to attend to incidental details during encoding; the agent must surface the cue from memory at recall using only the customer’s natural-language reference and the images its memory bank chose to retain.
## Appendix D Memory architectures
#### Reference baselines.
NoMemory discards every entry, so any score above it is attributable to memory. TextOnly indexes the bare product class of a page; Caption indexes a VLM-generated caption. Both recover a cue only if it was put into words. Caption is the strongest text-only baseline and serves as the reference against which the gain from visual encoding is interpreted.
#### Prior multimodal external memory\.
WorldMM (Yeo et al., [2026](https://arxiv.org/html/2606.27499#bib.bib5)) maintains parallel episodic / semantic / visual memories and selects across them with an adaptive iterative retriever. M2A (Feng et al., [2026](https://arxiv.org/html/2606.27499#bib.bib7)) couples a raw-message store with a semantic-abstraction store, routed by chat and memory-manager agents. MMA (Lu et al., [2026](https://arxiv.org/html/2606.27499#bib.bib8)) augments retrieval with per-item reliability scores combining source credibility, temporal decay, and conflict-aware consensus. We adapt each to operate inside the DMV-Bench harness under the shared `encode` / `retrieve` / `inject` interface.

Table [6](https://arxiv.org/html/2606.27499#A4.T6) places all seven audited memory architectures on a common `encode` / `retrieve` / `inject` coordinate system. Reading the rows top-to-bottom traces the progression from no memory through verbal-only baselines, three recent multimodal external memories from the literature, and DualMem (ours). The DMV-Bench adapter for each external system preserves its paper’s protocol on every axis where preservation is feasible.

| Memory | Encode | Retrieve | Inject |
| --- | --- | --- | --- |
| *Reference baselines* | | | |
| NoMemory | none | none | none |
| TextOnly | class text | verbal | text |
| Caption | VLM caption | verbal | caption |
| *Prior multimodal external memory* | | | |
| WorldMM (Yeo et al., [2026](https://arxiv.org/html/2606.27499#bib.bib5)) | episodic+semantic+visual | adaptive iterative | retrieved ctx |
| MMA (Lu et al., [2026](https://arxiv.org/html/2606.27499#bib.bib8)) | items + reliability scores | reliability-weighted | scored items |
| M2A (Feng et al., [2026](https://arxiv.org/html/2606.27499#bib.bib7)) | raw log + semantic abstr. | agent-routed (dual-layer) | text snippets |
| DualMem (ours) | image + caption | hybrid (SigLIP-2+SBERT) | image + caption |

Table 6: The seven memory architectures, as choices over `encode`, `retrieve`, and `inject`. The reference baselines establish whether memory must be visual at all. The three prior multimodal external memories are recent state-of-the-art systems adapted to the DMV-Bench harness. DualMem (ours) is the only entry that carries an unreduced visual code and a verbal code through every stage. Visual retrieval is SigLIP-2 cross-modal; verbal is SBERT over captions; hybrid fuses both.
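The shared `encode` / `retrieve` / `inject` interface could look roughly like the sketch below. All type and method signatures here are assumptions made for illustration; the source names only the three stages. The `TextOnly` retriever uses simple word overlap as a stand-in for the verbal retrieval actually used in the paper.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Page:
    url: str
    product_class: str  # e.g. "vintage chair"; the only text that TextOnly indexes
    image: bytes        # rendered product image (unused by text-only memories)


class Memory(Protocol):
    def encode(self, page: Page) -> None: ...
    def retrieve(self, query: str, k: int) -> List[Page]: ...
    def inject(self, items: List[Page]) -> str: ...


class NoMemory:
    """Discards every entry, so it never bridges sessions."""

    def encode(self, page: Page) -> None:
        pass

    def retrieve(self, query: str, k: int) -> List[Page]:
        return []

    def inject(self, items: List[Page]) -> str:
        return "(no memory retrieved this step)"


class TextOnly:
    """Indexes only the product class; word overlap stands in for verbal retrieval."""

    def __init__(self) -> None:
        self._pages: List[Page] = []

    def encode(self, page: Page) -> None:
        self._pages.append(page)

    def retrieve(self, query: str, k: int) -> List[Page]:
        words = set(query.lower().split())
        scored = [(len(words & set(p.product_class.lower().split())), p)
                  for p in self._pages]
        ranked = sorted(scored, key=lambda sp: sp[0], reverse=True)
        return [p for score, p in ranked if score > 0][:k]

    def inject(self, items: List[Page]) -> str:
        if not items:
            return "(no memory retrieved this step)"
        return "\n".join(f"Rank {i}. URL {p.url}: {p.product_class}"
                         for i, p in enumerate(items, 1))


memory = TextOnly()
memory.encode(Page("/product/a7c1e9b2", "vintage chair", b""))
memory.encode(Page("/product/3f88c401", "modern chair", b""))
hits = memory.retrieve("the chair with the teal sleep mask", k=2)
```

In this toy run both chairs tie on the word "chair", because the cue never appears in any indexed text; that is the failure mode the text-only rows of Table 6 are meant to expose.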
## Appendix E Cluster-aware statistical analysis
The shared-prefix rollout tree (§[3.3](https://arxiv.org/html/2606.27499#S3.SS3)) means that probes nested in the same chain trunk share encoding prefix and are not independent. A naive iid bootstrap or iid $t$-test on the per-probe vector therefore understates the variance of any cell mean and inflates the apparent significance of any cell-to-cell gap. We give both fixes below.
#### Cluster bootstrap by chain trunk\.
For each (back-end, $J$, architecture) cell we resample chain trunks with replacement (one resample $=$ a multiset of $N$ trunks out of the $N$ in that cell). Within each resample we concatenate all probes of the sampled trunks and recompute the TSR; the 2.5/97.5 percentiles of 1,000 such resamples give the cluster-aware 95% CI. Tables [7](https://arxiv.org/html/2606.27499#A5.T7) and [8](https://arxiv.org/html/2606.27499#A5.T8) report both the naive iid bootstrap CI (probe-level resampling, the *wrong* one) and the cluster bootstrap CI (the right one), for Qwen2.5-VL-7B and Gemini 2.5 Flash respectively. Cluster CIs are wider than naive CIs in essentially every cell, with the largest inflation on the Gemini back-end at $J=15$ for the M2A baseline (naive $[56.0, 57.1]$, cluster $[49.7, 62.5]$). DualMem’s cluster CIs are tight at every $J$ on both back-ends, reflecting that its lead is consistent across chain trunks rather than carried by a handful of outliers.

Table 7: Cluster-aware 95% CIs, Qwen2.5-VL-7B. Naive CIs resample probes iid; cluster CIs resample chain trunks with replacement, the correct unit of independence under shared-prefix rollouts. 1,000 bootstrap resamples; point estimates match Table [2](https://arxiv.org/html/2606.27499#S5.T2).

Table 8: Cluster-aware 95% CIs, Gemini 2.5 Flash. Same protocol as Table [7](https://arxiv.org/html/2606.27499#A5.T7). Largest naive-vs-cluster gap is M2A at $J=15$ (naive $[56.0, 57.1]$, cluster $[49.7, 62.5]$), showing how much the iid assumption can understate variance on the Gemini back-end.
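The difference between the two bootstraps can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' analysis code: the data layout (a dict from trunk id to a list of 0/1 probe outcomes), the function names, and the linear-interpolation percentile are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def percentile(sorted_vals: List[float], q: float) -> float:
    """Linear-interpolation percentile of an already sorted list (q in [0, 100])."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def tsr(outcomes: List[int]) -> float:
    """Task success rate in percent over 0/1 probe outcomes."""
    return 100.0 * sum(outcomes) / len(outcomes)


def cluster_ci(trunks: Dict[str, List[int]],
               n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """95% CI from resampling whole chain trunks with replacement."""
    rng = random.Random(seed)
    ids = sorted(trunks)
    stats = []
    for _ in range(n_resamples):
        sampled = rng.choices(ids, k=len(ids))  # multiset of N trunks out of N
        pooled = [o for t in sampled for o in trunks[t]]
        stats.append(tsr(pooled))
    stats.sort()
    return percentile(stats, 2.5), percentile(stats, 97.5)


def naive_ci(trunks: Dict[str, List[int]],
             n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """95% CI from resampling individual probes iid (ignores the nesting)."""
    rng = random.Random(seed)
    probes = [o for t in sorted(trunks) for o in trunks[t]]
    stats = sorted(tsr(rng.choices(probes, k=len(probes)))
                   for _ in range(n_resamples))
    return percentile(stats, 2.5), percentile(stats, 97.5)


# Toy data with strong within-trunk correlation: three trunks succeed on
# every probe, three fail on every probe (50 probes each).
toy = {f"trunk{i}": ([1] * 50 if i < 3 else [0] * 50) for i in range(6)}
cluster_lo, cluster_hi = cluster_ci(toy)
naive_lo, naive_hi = naive_ci(toy)
```

On such correlated toy data the cluster interval comes out far wider than the naive one, which is the qualitative effect the appendix reports for the real cells.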
#### Paired cluster permutation test \(DualMem vs M2A\)\.
Because every memory architecture sees the same replayed trajectories on the same probes, we can compare DualMem and the runner-up M2A at the *probe level*: for each probe present under both architectures we record the difference $d_{i}=\mathbf{1}[\text{DualMem correct}_{i}]-\mathbf{1}[\text{M2A correct}_{i}]\in\{-1,0,+1\}$ and report the mean $\bar{d}$ in percentage points. We test $H_{0}\!:\,E[\bar{d}]=0$ by a cluster permutation: independently for each chain trunk we flip the sign of all $d_{i}$ in that trunk with probability 0.5, repeat 1,000 times, and compute the two-sided $p$-value $\Pr(|\bar{d}^{\text{perm}}|\geq|\bar{d}^{\text{obs}}|)$ under the null. Permuting at the trunk level rather than the probe level keeps the within-trunk correlation structure intact, so the null distribution respects the same nesting that the data has. Table [9](https://arxiv.org/html/2606.27499#A5.T9) reports the result. The DualMem lead over M2A is significant at $p\leq 0.003$ on all six $J\in\{5,10,15\}$ cells across both back-ends. On the two Monte Carlo $J=50$ pilots (only five trunks each by design), the test is underpowered: on Qwen the +8.5 pp lead reaches $p=0.057$ (borderline), and on Gemini the +0.4 pp gap is, in line with Table [2](https://arxiv.org/html/2606.27499#S5.T2), not distinguishable from zero ($p=0.69$).

Table 9: Paired cluster permutation test, DualMem (ours) vs M2A (runner-up). $\bar{d}$ is the mean per-probe outcome difference in percentage points; $p$-values from 1,000 trunk-level sign permutations. The DualMem lead is significant ($p\leq 0.003$) on all six $J\in\{5,10,15\}$ cells across both back-ends. The Monte Carlo $J=50$ cells have only five trunks each and are underpowered; the +8.5 pp Qwen gap is borderline ($p=0.057$), and the Gemini cell where DualMem and M2A coincide to within 0.5 pp is not significant ($p=0.69$).
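The trunk-level sign-flip test described above can be sketched as follows. This is a minimal illustration under assumptions: the input layout (trunk id mapped to the list of per-probe differences $d_i$) and the function name are hypothetical, and the Monte Carlo $p$-value is computed exactly as stated in the text, without any small-sample correction.

```python
import random
from typing import Dict, List, Tuple


def cluster_permutation_test(diffs_by_trunk: Dict[str, List[int]],
                             n_perm: int = 1000,
                             seed: int = 0) -> Tuple[float, float]:
    """Return (mean difference in percentage points, two-sided p-value).

    Each d_i is in {-1, 0, +1}. Under the null, the sign of every d_i in a
    trunk is flipped together with probability 0.5, independently per trunk.
    """
    rng = random.Random(seed)
    trunk_sums = [sum(d) for d in diffs_by_trunk.values()]
    n_probes = sum(len(d) for d in diffs_by_trunk.values())
    observed_total = sum(trunk_sums)
    hits = 0
    for _ in range(n_perm):
        # Flipping all signs in a trunk just negates that trunk's sum.
        perm_total = sum(s if rng.random() < 0.5 else -s for s in trunk_sums)
        # Compare integer totals; dividing both by n_probes would not change
        # the comparison and would only add floating-point noise.
        if abs(perm_total) >= abs(observed_total):
            hits += 1
    return 100.0 * observed_total / n_probes, hits / n_perm


# Toy case 1: a consistent lead in every one of 30 trunks.
consistent = {f"t{i}": [1, 1, 1, 0, -1] for i in range(30)}
lead_pp, lead_p = cluster_permutation_test(consistent)

# Toy case 2: leads and deficits cancel exactly, so the observed mean is 0.
balanced = {f"t{i}": ([1, 1] if i % 2 == 0 else [-1, -1]) for i in range(20)}
null_pp, null_p = cluster_permutation_test(balanced)
```

With only a handful of trunks the number of distinct sign patterns is small (five trunks give $2^5 = 32$), which limits how small the $p$-value can get and is in line with the text's remark that the $J=50$ cells are underpowered.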
## Appendix F More results: per-reach task success rate
Tables [10](https://arxiv.org/html/2606.27499#A6.T10)–[13](https://arxiv.org/html/2606.27499#A6.T13) (Qwen2.5-VL-7B) and Tables [14](https://arxiv.org/html/2606.27499#A6.T14)–[17](https://arxiv.org/html/2606.27499#A6.T17) (Gemini 2.5 Flash) give the full per-reach task success rate (TSR) for all four chain-length settings, one table per $J$ per back-end. Rows are memory architectures; columns are reaches $r$ (number of session boundaries between visit and probe). For the Monte Carlo $J=50$ pilot, we bin reaches $r\in[1,49]$ into seven contiguous groups of seven; the underlying per-reach values are sparse (10 probes per reach per chain).
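The reach binning for the $J=50$ pilot is simple arithmetic, sketched here under the assumption that bins are 0-indexed and cover reaches 1 to 7, 8 to 14, and so on up to 43 to 49. The function names and the `(reach, correct)` input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def reach_bin(r: int, width: int = 7, max_reach: int = 49) -> int:
    """Map a reach r in [1, max_reach] to a 0-indexed contiguous bin."""
    if not 1 <= r <= max_reach:
        raise ValueError("reach out of range")
    return (r - 1) // width


def bin_label(b: int, width: int = 7) -> str:
    """Human-readable reach range covered by bin b, e.g. '1-7'."""
    lo = b * width + 1
    return f"{lo}-{lo + width - 1}"


def binned_tsr(probes: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Aggregate (reach, correct) pairs into TSR (%) per reach bin."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for r, ok in probes:
        b = reach_bin(r)
        totals[b] += 1
        correct[b] += ok
    return {bin_label(b): 100.0 * correct[b] / totals[b] for b in sorted(totals)}


# Made-up probe outcomes, only to show the aggregation.
example = binned_tsr([(1, 1), (7, 0), (8, 1), (49, 1)])
```

Pooling sparse per-reach cells this way trades resolution along the reach axis for more probes per reported number.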
Table 10: Per-reach TSR (%) on Qwen2.5-VL-7B, $J=5$ ($n_{r}=1{,}053$).

Table 11: Per-reach TSR (%) on Qwen2.5-VL-7B, $J=10$ ($n_{r}=4{,}821$).

Table 12: Per-reach TSR (%) on Qwen2.5-VL-7B, $J=15$ ($n_{r}=10{,}307$).

Table 13: Per-reach TSR (%) on Qwen2.5-VL-7B, Monte Carlo $J=50$, reach-binned ($n_{r}=2{,}407$).

Table 14: Per-reach TSR (%) on Gemini 2.5 Flash, $J=5$ ($n_{r}=2{,}762$).

Table 15: Per-reach TSR (%) on Gemini 2.5 Flash, $J=10$ ($n_{r}=12{,}344$).

Table 16: Per-reach TSR (%) on Gemini 2.5 Flash, $J=15$ ($n_{r}=28{,}710$).

Table 17: Per-reach TSR (%) on Gemini 2.5 Flash, Monte Carlo $J=50$, reach-binned ($n_{r}=2{,}449$).