LabVLA: Integrating Vision-Language-Action Models into Scientific Laboratories

Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside the reach of those systems. AI can help read literature, generate hypotheses, and plan protocols, yet executing those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action-token pretraining first makes the Qwen3-VL-4B-Instruct backbone action-aware before any continuous control is learned, and flow-matching post-training then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves a higher average success rate than all evaluated baselines under both in-distribution and out-of-distribution settings.
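
The abstract names flow matching as the objective for the second training stage but does not spell out its form. As a rough illustration only, the sketch below shows one common variant: a straight-line path from Gaussian noise (t = 0) to the demonstrated action (t = 1), with a mean-squared error on the predicted velocity. The function names, the path convention, and the callable `predict_velocity` (standing in for the DiT action expert) are all assumptions made for this example and are not taken from the paper.

```python
import random


def flow_matching_pair(action, noise, t):
    """Build one training pair for a straight-line noise-to-action path.

    Assumed convention (not from the paper): t = 0 is pure noise and
    t = 1 is the demonstrated action.
    """
    # Point on the straight line between the noise sample and the action.
    x_t = [(1.0 - t) * n + t * a for a, n in zip(action, noise)]
    # Velocity of that line; it is constant in t.
    target_velocity = [a - n for a, n in zip(action, noise)]
    return x_t, target_velocity


def flow_matching_loss(predict_velocity, action, rng):
    """Mean-squared velocity error for a single action vector.

    predict_velocity is a hypothetical callable (x_t, t) -> list of floats
    that plays the role of the action expert in this sketch.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t, target = flow_matching_pair(action, noise, t)
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(action)
```

In a real pipeline the velocity predictor would also be conditioned on the vision-language features, and the loss would be averaged over batches of action chunks; both are omitted here to keep the objective itself visible.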