Tag
RoboSemanticBench is a benchmark that diagnoses semantic grounding in action prediction for vision-language-action models. It reveals that while robots can grasp objects, they fail to select the semantically correct target specified by the instruction.
This paper introduces the PiSAR benchmark for screen-conditioned action prediction and compares supervised fine-tuned models against frontier zero-shot baselines. A fine-tuned Qwen3-VL-8B achieves 0.783 semantic similarity, significantly outperforming Claude Opus 4.7 (0.459) and GPT-5.5 (0.482). The same fine-tuning recipe applied to a larger reasoning-tuned Gemma model yields only 0.441, indicating a model-recipe mismatch.
MementoGUI introduces a plug-in agentic memory framework for GUI agents that uses learned controllers for selective memory management and retrieval, improving performance on long-horizon tasks with compressed visual and textual representations.