EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
Summary
EventVLA introduces a sparse visual evidence memory framework for long-horizon robotic manipulation, achieving an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
Cached at: 06/24/26, 09:47 AM
Source: https://huggingface.co/papers/2606.20092
Abstract
EventVLA addresses long-horizon robotic manipulation by introducing a sparse visual evidence memory framework that combines visual anchors with a dynamic Keyframe Evidence Memory (KEM) module to improve task performance.
Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA’s latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
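The memory layout described in the abstract (an initial anchor, a short-term window, and sparse keyframes selected by a predicted probability) can be illustrated with a toy buffer. This is only a sketch under stated assumptions, not the paper's implementation: the class name `SparseEvidenceMemory`, the fixed threshold rule, the buffer sizes, and the oldest-first eviction are all hypothetical, and the keyframe probability is passed in as a plain number rather than predicted from the VLA's latent embeddings.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, List, Optional


@dataclass
class SparseEvidenceMemory:
    """Toy buffer: initial anchor + short-term window + sparse keyframes.

    All parameters below are illustrative assumptions, not values from the paper.
    """
    short_term_len: int = 2   # hypothetical size of the short-term anchor window
    max_keyframes: int = 4    # hypothetical cap on stored keyframe evidence
    threshold: float = 0.5    # hypothetical cutoff on the keyframe probability
    initial: Optional[Any] = None
    recent: Deque[Any] = field(default_factory=deque)
    keyframes: List[Any] = field(default_factory=list)

    def update(self, frame: Any, keyframe_prob: float) -> None:
        # Anchor 1: keep the very first observation for the whole episode.
        if self.initial is None:
            self.initial = frame
        # Anchor 2: keep a sliding window of the most recent observations.
        self.recent.append(frame)
        while len(self.recent) > self.short_term_len:
            self.recent.popleft()
        # Evidence: store the frame only if it is predicted to matter later.
        if keyframe_prob >= self.threshold:
            self.keyframes.append(frame)
            if len(self.keyframes) > self.max_keyframes:
                self.keyframes.pop(0)  # assumed policy: evict the oldest keyframe

    def context(self) -> Dict[str, Any]:
        # What a policy would condition on at the current step.
        return {
            "initial": self.initial,
            "keyframes": list(self.keyframes),
            "recent": list(self.recent),
        }


# Stand-in probabilities; in the paper these come from the KEM module.
memory = SparseEvidenceMemory()
for t, prob in enumerate([0.1, 0.9, 0.2, 0.05, 0.7, 0.1]):
    memory.update(f"frame_{t}", prob)
print(memory.context())
```

With six frames, only the two that cross the assumed threshold are kept as evidence, so the context stays small regardless of episode length, which is the contrast with an unselective buffer that the abstract draws.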
Similar Articles
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned vision-language-action framework that improves the stability of robot imitation learning by encoding short-horizon intents from visual observations, addressing the challenges of partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.
AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
AtlasVA is a teacher-free visual skill memory framework for vision-language model agents that uses spatial heatmaps, visual exemplars, and symbolic text skills to improve spatial decision-making in long-horizon tasks, outperforming baselines on several benchmarks.