
Active Intelligence in Video Avatars via Closed-loop World Modeling

December 23, 2025
Authors: Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen
cs.AI

Abstract

Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture in which System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a partially observable Markov decision process (POMDP) and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
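To make the closed-loop idea concrete, the following is a minimal toy sketch of an Observe-Think-Act-Reflect loop next to an open-loop baseline. It is not the authors' implementation: `ToyGenerator`, `run_otar`, `run_open_loop`, the `p_success` parameter, and the caption format are all hypothetical names introduced here for illustration, and the video generator is replaced by a coin flip that decides whether an action caption actually takes effect.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyGenerator:
    """Hypothetical stand-in for a stochastic video generator.

    A requested subgoal only takes effect with probability `p_success`.
    """
    p_success: float
    rng: random.Random
    done: set = field(default_factory=set)

    def generate(self, subgoal: str, caption: str) -> None:
        # The caption is ignored in this toy model; only chance matters.
        if self.rng.random() < self.p_success:
            self.done.add(subgoal)

    def observe(self) -> set:
        return set(self.done)


def to_caption(subgoal: str) -> str:
    """Assumed 'System 1' step: turn an abstract subgoal into an action caption."""
    return f"the avatar performs: {subgoal}"


def run_otar(subgoals, env, max_steps=50):
    """Closed loop: a subgoal enters the belief only after it is verified."""
    belief = set()  # subgoals the agent believes are completed
    for step in range(max_steps):
        observation = env.observe()                          # Observe
        pending = [g for g in subgoals if g not in belief]   # Think
        if not pending:
            return True, step
        target = pending[0]
        predicted = observation | {target}                   # predicted next state
        env.generate(target, to_caption(target))             # Act
        actual = env.observe()                               # Reflect
        if predicted <= actual:
            belief.add(target)  # prediction confirmed, update belief
        # otherwise the belief is unchanged and the subgoal is retried
    return all(g in belief for g in subgoals), max_steps


def run_open_loop(subgoals, env):
    """Open-loop baseline: issue each caption once and never check the result."""
    for goal in subgoals:
        env.generate(goal, to_caption(goal))
    return all(g in env.observe() for g in subgoals)
```

In this sketch the open-loop baseline fails as soon as any single generation does not take effect, whereas the closed loop retries until the predicted state matches the observed one or the step budget runs out.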