LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

June 11, 2026
Authors: Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen
cs.AI

Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet executing those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action-token pre-training first makes the Qwen3-VL-4B-Instruct backbone action-aware before any continuous control is learned, and flow-matching post-training then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves a higher average success rate than all evaluated baselines under both in-distribution and out-of-distribution settings.
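
The abstract names flow matching as the second training stage but does not state the objective. As a rough illustration only, the sketch below shows the common straight-path form of flow matching on a flat action vector: a velocity predictor is regressed onto the constant velocity of the line from a noise sample to the demonstrated action, and actions are then generated by Euler integration. The time convention (noise at t = 0, action at t = 1), the function names, and the plain-list representation are all assumptions made for this example. The DiT action expert, the Qwen3-VL-4B-Instruct backbone, and knowledge insulation are not modeled here, and LabVLA's actual formulation may differ.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Constant velocity of that straight path: d x_t / dt = action - noise."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, action, rng):
    """Single-sample estimate of the flow matching regression loss.

    predict_velocity is a hypothetical stand-in for the learned network:
    it maps (x_t, t) to a velocity vector of the same length as x_t.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)


def sample_action(predict_velocity, dim, rng, steps=10):
    """Generate an action by Euler-integrating the velocity field from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With a single fixed target action `a`, the exact velocity field is `(a - x) / (1 - t)`, so passing that function as `predict_velocity` should give a loss near zero and make `sample_action` land on `a`. This is a convenient sanity check before swapping in a trained model.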