EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

Abstract

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting introduces further cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
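
The two-stage use of one shared constraint set can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the abstract does not specify how constraints are represented, scored, or optimized. Here constraints are modeled as hypothetical cost functions over a 2D waypoint trajectory (stay_above, end_near), selection picks the lowest-cost candidate, and refinement is plain finite-difference gradient descent started from the selected candidate.

```python
from typing import Callable, List, Sequence

# Assumption: a trajectory is a list of [x, y] waypoints, and a constraint
# maps a trajectory to a non-negative violation cost (0 means satisfied).
Trajectory = List[List[float]]
Constraint = Callable[[Trajectory], float]


def stay_above(min_height: float) -> Constraint:
    """Hypothetical constraint: every waypoint keeps y >= min_height."""
    def cost(traj: Trajectory) -> float:
        return sum(max(0.0, min_height - y) ** 2 for _, y in traj)
    return cost


def end_near(goal_x: float, goal_y: float) -> Constraint:
    """Hypothetical constraint: the final waypoint reaches the goal."""
    def cost(traj: Trajectory) -> float:
        x, y = traj[-1]
        return (x - goal_x) ** 2 + (y - goal_y) ** 2
    return cost


def total_cost(traj: Trajectory, constraints: Sequence[Constraint]) -> float:
    """Sum of violation costs over the whole constraint set."""
    return sum(c(traj) for c in constraints)


def select_rollout(candidates: Sequence[Trajectory],
                   constraints: Sequence[Constraint]) -> Trajectory:
    """Stage 1 (toy version): keep the candidate with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(t, constraints))


def refine(traj: Trajectory, constraints: Sequence[Constraint],
           steps: int = 200, lr: float = 0.1, eps: float = 1e-4) -> Trajectory:
    """Stage 2 (toy version): gradient descent on the same constraint costs,
    initialized from the selected rollout, using central finite differences."""
    traj = [list(p) for p in traj]  # work on a copy
    for _ in range(steps):
        grads = []
        for i in range(len(traj)):
            row = []
            for d in range(2):
                orig = traj[i][d]
                traj[i][d] = orig + eps
                up = total_cost(traj, constraints)
                traj[i][d] = orig - eps
                down = total_cost(traj, constraints)
                traj[i][d] = orig
                row.append((up - down) / (2 * eps))
            grads.append(row)
        for i in range(len(traj)):
            for d in range(2):
                traj[i][d] -= lr * grads[i][d]
    return traj


# Usage: two made-up candidate trajectories stand in for retargeted rollouts.
constraints = [stay_above(0.0), end_near(1.0, 0.0)]
candidates = [
    [[0.0, 0.5], [0.5, -0.2], [1.0, 0.1]],   # dips below the height limit
    [[0.0, 0.5], [0.5, 0.3], [0.9, 0.05]],   # stays above, ends slightly off
]
selected = select_rollout(candidates, constraints)
refined = refine(selected, constraints)
```

In this sketch the same constraint list drives both stages, mirroring the structure described in the abstract: selection discards the candidate that violates the height limit, and refinement then reduces the remaining end-point error of the selected trajectory.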
