
EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

March 5, 2026
Authors: Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, Qi Zhu
cs.AI

Abstract

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution; it improves the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
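The two-stage use of the constraint set can be pictured with a minimal Python/NumPy sketch, assuming each constraint is a callable that maps a candidate trajectory (a T x DoF array) to a nonnegative violation cost. All names here (total_violation, select_rollout, optimize_trajectory) and the finite-difference descent are hypothetical stand-ins for illustration, not the paper's actual method or API.

import numpy as np

def total_violation(traj, constraints):
    # Sum of per-constraint violation costs; 0 means fully satisfied.
    return sum(c(traj) for c in constraints)

def select_rollout(retargeted_trajs, constraints):
    # Stage 1 (constraint-guided rollout selection): score each
    # retargeted rollout and keep the most physically plausible one,
    # i.e. the candidate with the lowest total constraint violation.
    costs = [total_violation(t, constraints) for t in retargeted_trajs]
    return retargeted_trajs[int(np.argmin(costs))]

def optimize_trajectory(init_traj, constraints, lr=1e-2, steps=100, eps=1e-4):
    # Stage 2 (constraint-based trajectory optimization): starting from
    # the selected rollout, reduce the same violation objective to
    # correct retargeting errors. Finite-difference gradient descent
    # stands in for whatever solver the paper actually uses.
    traj = init_traj.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(traj)
        base = total_violation(traj, constraints)
        for idx in np.ndindex(traj.shape):
            traj[idx] += eps
            grad[idx] = (total_violation(traj, constraints) - base) / eps
            traj[idx] -= eps
        traj -= lr * grad
    return traj

Because both stages share one constraint set, the trajectory that survives selection is already a good initialization for the refinement step, which is what the abstract's pipeline relies on.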