LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely beyond those systems' reach. AI can help read literature, generate hypotheses, and plan protocols, yet executing those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models offer one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment, alongside model design, as central bottlenecks. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action-token pretraining first makes the Qwen3-VL-4B-Instruct backbone action-aware before any continuous control is learned, and flow-matching post-training then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines in both in-distribution and out-of-distribution settings.
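
For readers unfamiliar with flow matching, the sketch below shows how a single training example is commonly constructed for such an objective. It is a generic illustration, not the LabVLA implementation: the abstract does not state the interpolation path or time convention, so the straight-line path from Gaussian noise to the demonstrated action is an assumption, and the names `flow_matching_example` and `mse` are hypothetical. In a setup like the one described, an action expert would be trained to predict `target_velocity` from `x_t`, `t`, and the backbone's features.

```python
import random


def flow_matching_example(action, rng):
    """Build one training example for a linear-path flow-matching objective.

    `action` is a flat list of floats (e.g. a flattened action chunk).
    Assumption: straight-line path x_t = (1 - t) * noise + t * action.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # time sampled uniformly from [0, 1)
    # Point on the straight line between the noise sample and the action.
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # Time derivative of x_t along that line; this is the regression target.
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity


def mse(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


rng = random.Random(0)
action = [0.1, -0.2, 0.3]  # made-up action values, for illustration only
x_t, t, target_velocity = flow_matching_example(action, rng)

# A model that predicts all zeros incurs a positive loss unless the target is zero.
loss_for_zero_prediction = mse([0.0] * len(action), target_velocity)
```

Under this path, following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on the demonstrated action, which is why a well-trained velocity predictor can be integrated from noise to an action at inference time.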