EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation
March 5, 2026
Authors: Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, Qi Zhu
cs.AI
Abstract
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.