Tag
This paper presents Semantic Action RL, which uses reinforcement learning over Vision-Language-Action (VLA) prompts to enable robots to learn new tasks quickly in the real world.
Introduces Neuro-Symbolic Drive, a framework that uses rule-grounded reasoning traces from classical planners to fine-tune a driving VLA (Qwen3.5-4B), achieving significant reductions in average displacement error and miss rate compared to standard CoT reasoning.
This paper introduces PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, enabling style-diverse non-ego agents for closed-loop simulation and improving driving scores on Bench2Drive.
Robot world models and simulation platforms are experiencing open-source acceleration: NVIDIA launched the Cosmos 3 and Isaac GR00T physical AI foundation models, AGIBOT released Genie Sim 3.0, a fully open-source simulation platform, and VLA models have become mainstream for manipulation policies. Together, these developments lower the entry barrier for the robotics field.
AffordanceVLA introduces a unified framework using structured affordance forecasting as an intermediate representation to improve perception-action mapping in robotic manipulation, leveraging vision-language models and a Mixture-of-Transformer architecture.
X Square Robot releases Wall-OSS-0.5, a 4B open-source VLA robot foundation model evaluated on a 17-task real-robot zero-shot suite without task-specific fine-tuning, aiming to directly measure pretraining capability.
Release of Wall-OSS-0.5, an open-weights vision-language-action model that achieves over 80% task progress on 4 of 17 real-robot tasks with zero fine-tuning, including on a deformable rope task not seen during pretraining. The model preserves general vision-language ability while improving embodied grounding.
FrameSkip is a data-layer frame selection method that improves Vision-Language-Action (VLA) policy training by prioritizing high-importance frames based on action variation and visual-coherence metrics, achieving a macro-average success rate of 76.15% across three benchmarks while using only 20% of unique frames.
NVIDIA's Jim Fan spoke at Sequoia AI Ascent 2026, declaring the VLA architecture obsolete and proposing World Action Models (WAM) as a new paradigm for robotics. He introduced key technologies including DreamZero, EgoScale, and the neural simulator Dream Dojo.
Allen AI releases MolmoAct2, an open-weight Vision-Language-Action model designed for real-world robotic deployment, featuring new datasets, an open action tokenizer, and adaptive reasoning to reduce latency.
NVIDIA and Hugging Face publish a hands-on demo showing Gemma 4 running as a vision-language-action model entirely on the Jetson Orin Nano Super, using local STT/TTS and webcam input.
FlashDrive reduces the inference latency of reasoning vision-language-action models from 716 ms to 159 ms on an RTX PRO 6000 (up to 5.7× faster) with zero accuracy loss, enabling real-time autonomous applications.
LeRobot v0.5.0 is a major release featuring support for Unitree G1 humanoid robots, new policy architectures (Pi0-FAST VLAs, Real-Time Chunking), streaming video encoding for 3x faster training, and EnvHub for loading simulation environments from Hugging Face Hub.
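As a rough illustration of the frame-selection idea in the FrameSkip entry above, the following is a minimal sketch that assumes importance is measured only by action variation, that is, the L2 change between consecutive action vectors. The visual-coherence metric mentioned in the summary is not modeled here, and the function names and the `keep_fraction` parameter are hypothetical, not taken from the FrameSkip release.

```python
import math


def action_variation_scores(actions):
    """Score each frame by the L2 change in action from the previous frame.

    The first frame has no predecessor, so it gets a score of 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(actions, actions[1:]):
        scores.append(math.dist(prev, cur))
    return scores


def select_frames(actions, keep_fraction=0.2):
    """Return the indices of the highest-scoring frames, in temporal order.

    keep_fraction is a hypothetical knob; 0.2 mirrors the "20% of unique
    frames" figure quoted in the summary, but this is not the paper's rule.
    """
    if not actions:
        return []
    scores = action_variation_scores(actions)
    k = max(1, math.ceil(keep_fraction * len(actions)))
    # Rank by descending score, breaking ties by earlier frame index.
    ranked = sorted(range(len(actions)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


# Toy trajectory: the action changes only at frames 2 and 6.
demo_actions = [(0, 0), (0, 0), (1, 0), (1, 0), (1, 0),
                (1, 0), (1, 3), (1, 3), (1, 3), (1, 3)]
selected = select_frames(demo_actions, keep_fraction=0.2)
```

In this toy trajectory only the two frames where the action changes are kept. A real pipeline would presumably combine this score with a visual term before ranking.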