mllm

#mllm

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Hugging Face Daily Papers · 2026-05-18

MementoGUI introduces a plug-in agentic memory framework for GUI agents that uses learned controllers for selective memory management and retrieval, improving performance on long-horizon tasks with compressed visual and textual representations.


#mllm

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Hugging Face Daily Papers · 2026-05-17

This paper introduces Omni-DuplexEval, a benchmark and automatic evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.


#mllm

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

arXiv cs.AI · 2026-05-14

This paper proposes VeGAS, a test-time framework for MLLM-based embodied agents that samples multiple candidate actions and uses a generative verifier to select the most reliable one, achieving up to 36% relative improvement over chain-of-thought (CoT) baselines on challenging tasks.
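
As a rough illustration of the sample-then-verify idea in this summary, the sketch below draws several candidate actions and keeps the one a verifier scores highest. It is not the VeGAS implementation: `select_action`, `sample_action`, `score_action`, and the toy verifier scores are hypothetical names and values chosen for illustration. In a real system the sampler would presumably be an MLLM policy and the verifier a generative model.

```python
from typing import Callable


def select_action(
    sample_action: Callable[[], str],
    score_action: Callable[[str], float],
    num_candidates: int = 5,
) -> str:
    """Sample `num_candidates` actions and return the highest-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [sample_action() for _ in range(num_candidates)]
    # The verifier acts as a scoring function; ties keep the earliest candidate.
    return max(candidates, key=score_action)


def toy_verifier(action: str) -> float:
    """Hypothetical stand-in for a verifier: fixed scores per action."""
    scores = {"move_forward": 0.5, "pick_up_key": 0.9, "open_door": 0.2}
    return scores.get(action, 0.0)


if __name__ == "__main__":
    # Deterministic stand-in for a stochastic policy: yields a fixed sequence.
    proposals = iter(["move_forward", "pick_up_key", "open_door"])
    chosen = select_action(lambda: next(proposals), toy_verifier, num_candidates=3)
    print(chosen)  # prints "pick_up_key"
```

The verifier is passed in as a plain callable so that the selection logic stays independent of how candidates are generated or scored.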


#mllm

Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs

arXiv cs.AI · 2026-05-08

This paper analyzes the reconstruction-concealment tradeoff in intent-obfuscation jailbreak attacks on Multimodal Large Language Models (MLLMs). It proposes concealment-aware variant construction and keyword-related distractor images to exploit model vulnerabilities more effectively.


#mllm

RemoteZero: Geospatial Reasoning with Zero Human Annotations

Hugging Face Daily Papers · 2026-05-06

RemoteZero is a framework that removes the need for human-annotated box supervision in geospatial reasoning. It uses the semantic verification capabilities of multimodal large language models (MLLMs) to enable self-evolving localization from unlabeled remote sensing data.


#mllm

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

Hugging Face Daily Papers · 2026-04-20

This paper introduces MM-JudgeBias, a benchmark that exposes systematic compositional biases in multimodal large language models used as automatic judges, testing 26 state-of-the-art MLLMs across 1,800 samples.
