Act2Goal: From World Model To General Goal-conditioned Policy
December 29, 2025
Authors: Pengfei Zhou, Liliang Chen, Shengcong Chen, Di Chen, Wenzhi Zhao, Rongjun Jin, Guanghui Ren, Jianlan Luo
cs.AI
Abstract
Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/
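The core mechanism described above is Multi-Scale Temporal Hashing: the imagined trajectory is split into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The following is a minimal sketch of one way such a split could be computed; the function name `multiscale_split`, the frame counts, and the uniform spacing of distal frames are illustrative assumptions, not the paper's MSTH implementation, which further couples these frames to motor control through cross-attention.

```python
import numpy as np

def multiscale_split(trajectory, num_proximal=8, num_distal=4):
    """Split an imagined trajectory of shape (T, H, W, C) into dense
    proximal frames and sparse distal anchor frames.

    NOTE: illustrative sketch only; frame counts and the uniform spacing
    of distal frames are assumptions, not the paper's MSTH specification.
    """
    T = len(trajectory)
    # Dense proximal frames: the earliest imagined frames, kept at full
    # temporal resolution for fine-grained closed-loop control.
    proximal = trajectory[:min(num_proximal, T)]
    # Sparse distal frames: uniformly spaced samples from the remainder,
    # anchoring global task consistency up to the goal state.
    if T > num_proximal:
        idx = np.linspace(num_proximal, T - 1, num_distal).astype(int)
        distal = trajectory[idx]
    else:
        distal = trajectory[-1:]
    return proximal, distal


# Example: a dummy 64-frame imagined rollout of 128x128 RGB frames.
rollout = np.zeros((64, 128, 128, 3), dtype=np.float32)
prox, dist = multiscale_split(rollout)
print(prox.shape, dist.shape)  # (8, 128, 128, 3) (4, 128, 128, 3)
```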
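The reward-free online adaptation rests on hindsight goal relabeling: whatever visual state an autonomous rollout actually reaches is treated as the goal for that rollout, yielding supervised (observation, goal, action) tuples with no external reward. The sketch below shows the relabeling step only, under assumptions: the field names and the choice of the final frame as the relabeled goal are illustrative, and in the paper the relabeled data updates LoRA adapters rather than the full policy.

```python
def hindsight_relabel(episode):
    """Turn one autonomous rollout into goal-conditioned training tuples.

    Assumption: `episode` is a dict with lists `obs` (length T+1) and
    `actions` (length T); the achieved final observation stands in as
    the goal. Illustrative sketch, not the paper's exact pipeline.
    """
    achieved_goal = episode["obs"][-1]
    return [
        {"obs": obs, "goal": achieved_goal, "action": act}
        for obs, act in zip(episode["obs"][:-1], episode["actions"])
    ]


# The relabeled tuples would then finetune only the LoRA adapter weights,
# keeping the pretrained world model and policy backbone frozen.
```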