RLDX-1 is a general-purpose robotic policy for dexterous manipulation that uses a Multi-Stream Action Transformer architecture to integrate heterogeneous modalities, outperforming existing VLA models in real-world tasks.
FastVLA, an open-source Vision-Language-Action model, now runs robotic control at 5 Hz on a single NVIDIA L4 GPU.
Cortex 2.0 introduces a plan-and-act control framework that uses visual latent space trajectory generation to enable reliable long-horizon robotic manipulation in complex industrial environments, outperforming reactive Vision-Language-Action models.
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world-model decoders, achieving state-of-the-art trajectory-prediction accuracy for autonomous driving at the inference latency of answer-only models. It is the first latent-CoT method to surpass explicit CoT across four benchmarks.
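The core idea of compressing a long reasoning trace into a few latent tokens can be sketched as cross-attention pooling: a small set of learned queries attends over the full chain-of-thought hidden states. This is a minimal NumPy sketch, not OneVL's actual architecture; all names, shapes, and the single-head attention form are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_latent_tokens(hidden, latent_queries, W_q, W_k, W_v):
    """Cross-attention pooling: K learned latent queries attend over the
    T chain-of-thought hidden states and pool them into K latent tokens
    (hypothetical names; single head for clarity)."""
    q = latent_queries @ W_q                         # (K, d)
    k = hidden @ W_k                                 # (T, d)
    v = hidden @ W_v                                 # (T, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (K, T) attention weights
    return attn @ v                                  # (K, d) latent tokens

rng = np.random.default_rng(0)
d, T, K = 32, 128, 4      # hidden size, CoT trace length, latent token count
hidden = rng.normal(size=(T, d))
latents = compress_to_latent_tokens(
    hidden,
    rng.normal(size=(K, d)),
    *(rng.normal(size=(d, d), scale=d ** -0.5) for _ in range(3)),
)
print(latents.shape)  # (4, 32): a 128-step trace compressed to 4 tokens
```

At inference only the K latent tokens are produced and consumed, which is why the latency matches answer-only decoding while still carrying a distilled reasoning signal.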
FlashDrive reduces the inference latency of a reasoning vision-language-action model from 716 ms to 159 ms on an NVIDIA RTX PRO 6000 (up to 5.7× faster) with no accuracy loss, enabling real-time autonomous applications.
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.
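The cascaded cross-attention described for the DiT action expert can be sketched as two sequential attention stages: noisy action tokens first attend to the planner's subtask embeddings, and the updated tokens then attend to visual features. This is a minimal NumPy sketch under that assumption; the function names, identity projections, and token counts are illustrative, not HiVLA's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head cross-attention with identity projections (sketch only),
    returning a residual update of the query tokens."""
    scores = softmax(queries @ context.T / np.sqrt(queries.shape[-1]))
    return queries + scores @ context

def cascaded_cross_attention(action_tokens, plan_tokens, visual_tokens):
    """Cascade: actions attend to the VLM planner's subtask tokens first,
    then the plan-conditioned actions attend to visual features."""
    x = cross_attend(action_tokens, plan_tokens)
    x = cross_attend(x, visual_tokens)
    return x

rng = np.random.default_rng(1)
d = 16
actions = rng.normal(size=(8, d))    # noisy action chunk (diffusion input)
plan = rng.normal(size=(5, d))       # planner subtask embeddings
vision = rng.normal(size=(64, d))    # image patch features
out = cascaded_cross_attention(actions, plan, vision)
print(out.shape)  # (8, 16): one refined token per action in the chunk
```

Ordering the cascade plan-first grounds each action token in its subtask before it queries the scene, which mirrors the semantic-planning/motor-control split the framework is built around.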
Google DeepMind introduces Gemini Robotics 1.5 and Gemini Robotics-ER 1.5, advancing physical AI agents that can perceive, plan, think, and act to complete complex multi-step tasks. Gemini Robotics-ER 1.5 is now available to developers via the Gemini API.
Google DeepMind introduces Gemini Robotics On-Device, an efficient VLA model optimized to run locally on robotic devices, enabling low-latency operation and offline capability while maintaining strong dexterous manipulation and task generalization. The model can be fine-tuned with as few as 50-100 demonstrations and comes with an SDK for developers.
Google DeepMind introduces Gemini Robotics, a Gemini 2.0-based vision-language-action model designed to control physical robots with improved generality, interactivity, and dexterity. The company also launches Gemini Robotics-ER for spatial reasoning and partners with Apptronik to develop humanoid robots.