LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Hugging Face Daily Papers 06/11/26, 12:00 AM Papers

vision-language-action robotics laboratory-automation simulation pretraining flow-matching ai-in-science

Summary

LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.
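The data-engine steps named above (compose a workflow from atomic skills, validate and filter rollouts, export structured demonstrations per robot profile) can be pictured with a toy sketch. This is not the RoboGenesis API: every name below (`Rollout`, `run_workflow`, `filter_rollouts`, `export_demonstrations`, the skill names, and the robot profiles) is a hypothetical stand-in, and the "simulator" is a one-line rule chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rollout:
    """One simulated episode of a workflow on one robot profile (hypothetical schema)."""
    workflow: str
    robot_profile: str
    skills: Tuple[str, ...]
    success: bool


def run_workflow(name: str, skills: List[str], robot_profile: str,
                 can_execute: Callable[[str, str], bool]) -> Rollout:
    # Toy stand-in for a simulator: the rollout succeeds only if every
    # atomic skill in the sequence can be executed by this robot profile.
    success = all(can_execute(robot_profile, skill) for skill in skills)
    return Rollout(name, robot_profile, tuple(skills), success)


def filter_rollouts(rollouts: List[Rollout]) -> List[Rollout]:
    # Validation step: keep only rollouts that reached their goal.
    return [r for r in rollouts if r.success]


def export_demonstrations(rollouts: List[Rollout]) -> List[Dict[str, object]]:
    # Export one structured record per validated rollout.
    return [
        {"workflow": r.workflow, "robot": r.robot_profile, "skills": list(r.skills)}
        for r in rollouts
    ]


def can_execute(robot_profile: str, skill: str) -> bool:
    # Assumption made up for this toy: the "mobile_base" profile cannot pour.
    return not (robot_profile == "mobile_base" and skill == "pour_liquid")


# A workflow is an ordered list of atomic skills (names are illustrative).
pour_transfer = ["pick_beaker", "pour_liquid", "place_beaker"]
profiles = ["single_arm", "mobile_base"]

rollouts = [run_workflow("pour_transfer", pour_transfer, p, can_execute) for p in profiles]
demos = export_demonstrations(filter_rollouts(rollouts))
```

In this toy run only the rollout that completes every skill survives filtering, which is the property a validate-and-filter stage is meant to guarantee before demonstrations reach training.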

Original Article

Cached at: 06/12/26, 02:52 AM

Source: https://huggingface.co/papers/2606.13578

Abstract

LabVLA, a vision-language-action model trained with a two-stage approach combining action token pretraining and flow matching, demonstrates superior performance on laboratory automation tasks through simulated data generation and robot-specific learning.

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.
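For readers unfamiliar with the flow matching stage mentioned above, the following is a minimal sketch of one common formulation (linear interpolation between Gaussian noise and the target action, with the network regressing the constant velocity between them). It is not LabVLA's training code: the abstract does not specify the interpolation convention, and the function names, the flat list representation of an action, and the `oracle` velocity field are all assumptions made for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(noise: Vector, action: Vector, t: float) -> Vector:
    # Point on the straight path from noise (t = 0) to the action (t = 1).
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise: Vector, action: Vector) -> Vector:
    # Time derivative of the straight path; constant in t.
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity: VelocityFn, action: Vector,
                       rng: random.Random) -> float:
    # Sample noise and a time in [0, 1), then regress the predicted
    # velocity at the interpolated point onto the target velocity (MSE).
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = interpolate(noise, action, t)
    target = target_velocity(noise, action)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(action)


def sample_action(predict_velocity: VelocityFn, dim: int,
                  rng: random.Random, num_steps: int = 10) -> Vector:
    # Inference: start from noise and integrate the learned velocity
    # field from t = 0 to t = 1 with fixed-step Euler updates.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical "perfect" velocity field for a single known target action,
# standing in for a trained action expert.
goal = [0.5, -0.2, 0.1]


def oracle(x: Vector, t: float) -> Vector:
    return [(g - xi) / (1.0 - t) for g, xi in zip(goal, x)]


rng = random.Random(0)
loss = flow_matching_loss(oracle, goal, rng)
sampled = sample_action(oracle, len(goal), rng)
```

With the oracle field the loss is zero up to floating-point error and Euler integration lands on `goal`; a learned action expert would approximate such a field conditioned on images, language, and robot state.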

Get this paper in your agent:

hf papers read 2606.13578

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

CLAP: Direct VLM-to-VLA Adaptation via Language-Action Grounding

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
