EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

Abstract

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting introduces further cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
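The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that each constraint is a Python function returning a non-negative violation cost for a trajectory of 3D points, and it swaps in plain finite-difference gradient descent for the trajectory optimizer. All names here (`total_cost`, `select_rollout`, `refine_trajectory`, `above_table`, `reaches_goal`) and all numeric values are hypothetical.

```python
from typing import Callable, List, Sequence

Point = List[float]                          # hypothetical: one 3D end-effector position
Trajectory = List[Point]                     # hypothetical: a sequence of positions
Constraint = Callable[[Trajectory], float]   # assumed form: violation cost >= 0


def total_cost(traj: Trajectory, constraints: Sequence[Constraint]) -> float:
    """Sum of violation costs; 0 means every constraint is satisfied."""
    return sum(c(traj) for c in constraints)


def select_rollout(rollouts: Sequence[Trajectory],
                   constraints: Sequence[Constraint]) -> Trajectory:
    """Stage 1 (sketch): keep the candidate with the lowest total violation."""
    return min(rollouts, key=lambda t: total_cost(t, constraints))


def refine_trajectory(init: Trajectory, constraints: Sequence[Constraint],
                      steps: int = 200, lr: float = 0.05,
                      eps: float = 1e-4) -> Trajectory:
    """Stage 2 (sketch): descend the same constraint costs from the selected rollout."""
    traj = [list(p) for p in init]           # copy so the input is left untouched
    for _ in range(steps):
        base = total_cost(traj, constraints)
        grad = [[0.0] * len(p) for p in traj]
        for i, p in enumerate(traj):
            for j in range(len(p)):
                old = p[j]
                p[j] = old + eps             # forward finite difference
                grad[i][j] = (total_cost(traj, constraints) - base) / eps
                p[j] = old
        for i, p in enumerate(traj):
            for j in range(len(p)):
                p[j] -= lr * grad[i][j]
    return traj


# Hypothetical constraints standing in for what a VLM might extract.
def above_table(traj: Trajectory, z_min: float = 0.05) -> float:
    return sum(max(0.0, z_min - p[2]) ** 2 for p in traj)


def reaches_goal(traj: Trajectory, goal=(0.4, 0.0, 0.1)) -> float:
    return sum((a - b) ** 2 for a, b in zip(traj[-1], goal))


constraints = [above_table, reaches_goal]
rollouts = [
    [[0.0, 0.0, 0.2], [0.2, 0.0, -0.1], [0.4, 0.0, 0.1]],    # dips below the table
    [[0.0, 0.0, 0.2], [0.2, 0.0, 0.15], [0.35, 0.0, 0.1]],   # stays above, stops short
]
best = select_rollout(rollouts, constraints)
refined = refine_trajectory(best, constraints)
```

In this toy setup the second rollout is selected because it never violates the height constraint, and refinement then pulls its final waypoint toward the goal. Reusing one constraint set for both scoring and optimization mirrors the structure the abstract describes; the cost functions and the optimizer are placeholders.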
