RLDX-1 is a general-purpose robotic policy for dexterous manipulation that uses a Multi-Stream Action Transformer architecture to integrate heterogeneous modalities, outperforming existing VLA models in real-world tasks.
FastVLA, an open-source Vision-Language-Action model, now runs robotic control at 5 Hz on a single NVIDIA L4 GPU.
Cortex 2.0 introduces a plan-and-act control framework that uses visual latent space trajectory generation to enable reliable long-horizon robotic manipulation in complex industrial environments, outperforming reactive Vision-Language-Action models.
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world-model decoders, achieving state-of-the-art trajectory-prediction accuracy for autonomous driving at the inference latency of answer-only models. It is the first latent-CoT method to surpass explicit CoT across four benchmarks.
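The core idea of compressing a long reasoning trace into a few latent tokens can be sketched as cross-attention pooling: a small set of learned queries attends over the full chain-of-thought hidden states. This is a minimal NumPy sketch, not OneVL's actual architecture; all names, shapes, and the single-head attention form are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_latent_tokens(hidden, latent_queries, W_q, W_k, W_v):
    """Cross-attention pooling: K learned latent queries attend over the
    T chain-of-thought hidden states and pool them into K latent tokens
    (hypothetical names; single head for clarity)."""
    q = latent_queries @ W_q                         # (K, d)
    k = hidden @ W_k                                 # (T, d)
    v = hidden @ W_v                                 # (T, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (K, T) attention weights
    return attn @ v                                  # (K, d) latent tokens

rng = np.random.default_rng(0)
d, T, K = 32, 128, 4      # hidden size, CoT trace length, latent token count
hidden = rng.normal(size=(T, d))
latents = compress_to_latent_tokens(
    hidden,
    rng.normal(size=(K, d)),
    *(rng.normal(size=(d, d), scale=d ** -0.5) for _ in range(3)),
)
print(latents.shape)  # (4, 32): a 128-step trace compressed to 4 tokens
```

At inference only the K latent tokens are produced and consumed, which is why the latency matches answer-only decoding while still carrying a distilled reasoning signal.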
FlashDrive reduces the inference latency of a reasoning vision-language-action model from 716 ms to 159 ms on an NVIDIA RTX PRO 6000 (up to 5.7× faster) with no accuracy loss, enabling real-time autonomous applications.
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.
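The cascaded cross-attention described for the DiT action expert can be sketched as two sequential attention stages: noisy action tokens first attend to the planner's subtask embeddings, and the updated tokens then attend to visual features. This is a minimal NumPy sketch under that assumption; the function names, identity projections, and token counts are illustrative, not HiVLA's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head cross-attention with identity projections (sketch only),
    returning a residual update of the query tokens."""
    scores = softmax(queries @ context.T / np.sqrt(queries.shape[-1]))
    return queries + scores @ context

def cascaded_cross_attention(action_tokens, plan_tokens, visual_tokens):
    """Cascade: actions attend to the VLM planner's subtask tokens first,
    then the plan-conditioned actions attend to visual features."""
    x = cross_attend(action_tokens, plan_tokens)
    x = cross_attend(x, visual_tokens)
    return x

rng = np.random.default_rng(1)
d = 16
actions = rng.normal(size=(8, d))    # noisy action chunk (diffusion input)
plan = rng.normal(size=(5, d))       # planner subtask embeddings
vision = rng.normal(size=(64, d))    # image patch features
out = cascaded_cross_attention(actions, plan, vision)
print(out.shape)  # (8, 16): one refined token per action in the chunk
```

Ordering the cascade plan-first grounds each action token in its subtask before it queries the scene, which mirrors the semantic-planning/motor-control split the framework is built around.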
Google DeepMind introduces Gemini Robotics 1.5 and Gemini Robotics-ER 1.5, advancing physical AI agents that can perceive, plan, think, and act to complete complex multi-step tasks. Gemini Robotics-ER 1.5 is now available to developers via the Gemini API.
Google DeepMind introduces Gemini Robotics On-Device, an efficient VLA model optimized to run locally on robotic devices, enabling low-latency operation and offline capability while maintaining strong dexterous manipulation and task generalization. The model can be fine-tuned with as few as 50-100 demonstrations and comes with an SDK for developers.
Google DeepMind introduces Gemini Robotics, a Gemini 2.0-based vision-language-action model designed to control physical robots with improved generality, interactivity, and dexterity. The company also launches Gemini Robotics-ER for spatial reasoning and partners with Apptronik to develop humanoid robots.