Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
Summary
This paper introduces RNG-Bench, a benchmark suite for evaluating multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games (Matching Pairs and 3D Maze) with controlled difficulty parameters and a memory gap metric to distinguish forgetting from poor decision-making.
Cached at: 06/18/26, 03:56 AM
Source: https://huggingface.co/papers/2606.19338
Abstract
A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models’ ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory gap metric to distinguish forgetting from poor decision-making.
Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model’s ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
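To make the non-Markov setting concrete, the following is a minimal text-only sketch of a Matching Pairs game in Python. It is not the RNG-Bench harness: the class `MatchingPairs`, the helpers `find_known_pair`, `play_with_memory`, and `play_memoryless`, and the step-count comparison are all hypothetical names and choices made for illustration. The abstract does not define the Memory Gap metric, so the step difference computed here is only an illustrative stand-in for the idea of separating forgetting from action selection.

```python
import random


class MatchingPairs:
    """Toy game (assumption, not the benchmark): a flipped card is visible only for that step."""

    def __init__(self, n_pairs, seed=0):
        rng = random.Random(seed)
        self.cards = [i for i in range(n_pairs) for _ in range(2)]
        rng.shuffle(self.cards)
        self.matched = [False] * len(self.cards)
        self.steps = 0

    def done(self):
        return all(self.matched)

    def flip_pair(self, a, b):
        """Flip two distinct unmatched positions and return the identities seen."""
        assert a != b and not self.matched[a] and not self.matched[b]
        self.steps += 1
        if self.cards[a] == self.cards[b]:
            self.matched[a] = self.matched[b] = True
        return self.cards[a], self.cards[b]


def find_known_pair(seen, matched):
    """Return two remembered, still-unmatched positions with equal identity, if any."""
    by_identity = {}
    for pos, ident in seen.items():
        if matched[pos]:
            continue
        if ident in by_identity:
            return by_identity[ident], pos
        by_identity[ident] = pos
    return None


def play_with_memory(env):
    """Policy that remembers every past observation (position -> identity)."""
    seen = {}
    while not env.done():
        pair = find_known_pair(seen, env.matched)
        if pair is None:
            # No remembered match: explore the first two never-flipped positions.
            unseen = [p for p in range(len(env.cards)) if p not in seen]
            pair = (unseen[0], unseen[1])
        a, b = pair
        seen[a], seen[b] = env.flip_pair(a, b)
    return env.steps


def play_memoryless(env, seed=0):
    """Policy that ignores history and flips two random unmatched positions."""
    rng = random.Random(seed)
    while not env.done():
        open_positions = [p for p, m in enumerate(env.matched) if not m]
        a, b = rng.sample(open_positions, 2)
        env.flip_pair(a, b)
    return env.steps


if __name__ == "__main__":
    # Same board for both policies, so the difference reflects memory only.
    with_memory = play_with_memory(MatchingPairs(4, seed=1))
    without_memory = play_memoryless(MatchingPairs(4, seed=1), seed=2)
    print("steps with memory:", with_memory)
    print("steps without memory:", without_memory)
    print("illustrative step gap:", without_memory - with_memory)
```

Both policies are run on an identically seeded board, loosely echoing the abstract's point about controlling for instance-level variance. The current observation alone never tells the agent where a card's partner is, so any advantage of the first policy comes from reconstructing earlier observations.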
Similar Articles
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
This paper introduces a multi-turn interactive framework for reasoning evaluation where LLMs must query a hidden environment and integrate partial observations, instantiated as a benchmark of 474 executable games across five difficulty levels, showing discriminative power and exposing differences in reasoning.
Evaluating Large Language Models in a Complex Hidden Role Game
This paper introduces an open-source framework to evaluate LLMs' reasoning, persuasion, and deception capabilities in the hidden role game Secret Hitler, finding that current models fail at sustained multi-turn manipulation while rule-based agents outperform them.
GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models
This paper introduces GENSTRAT, a benchmark that uses procedurally generated strategic environments to evaluate LLMs' strategic reasoning across multiple axes, addressing limitations of fixed game suites.
MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models
MCBench is a new benchmark for assessing the safety of omnimodal large language models across vision, audio, and text modalities. It includes 1196 scenarios and finds current models struggle with cross-modal safety reasoning.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens is a new benchmark for evaluating memory capabilities in large vision-language models through multi-session conversations. It compares long-context and memory-augmented approaches, revealing limitations in both and motivating hybrid architectures.