FlowAct-R1: Towards Interactive Humanoid Video Generation
January 15, 2026
Authors: Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang, Tianshu Hu, Shiyang Qin, Mingshuang Luo, Jiaxu Zhang, Xin Chen, Yulong Wang, Zerong Zheng, Jianwen Jiang, Chao Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao
cs.AI
Abstract
Interactive humanoid video generation aims to synthesize lifelike visual agents that engage with humans through continuous, responsive video. Despite recent advances in video synthesis, existing methods struggle to deliver high-fidelity synthesis and real-time interaction at the same time. In this paper, we propose FlowAct-R1, a framework designed specifically for real-time interactive humanoid video generation. Built on an MMDiT architecture, FlowAct-R1 streams video of arbitrary duration while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. Through efficient distillation and system-level optimizations, the framework sustains a stable 25 fps at 480p resolution with a time-to-first-frame (TTFF) of about 1.5 seconds. The method provides both holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results show that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism while generalizing robustly across diverse character styles.
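
The 25 fps figure implies a fixed compute budget for each streamed chunk. The following is a minimal sketch of that arithmetic only, not of the FlowAct-R1 pipeline. The chunk length of 8 frames is a hypothetical value (the abstract does not state one), and the function names are illustrative.

```python
def realtime_budget(fps: float, chunk_frames: int) -> dict:
    """Return the per-frame and per-chunk time budgets in milliseconds."""
    frame_ms = 1000.0 / fps  # playback time covered by one frame
    return {"frame_ms": frame_ms, "chunk_ms": frame_ms * chunk_frames}


def keeps_up(gen_ms_per_chunk: float, fps: float, chunk_frames: int) -> bool:
    """True if generating one chunk takes no longer than playing it back."""
    return gen_ms_per_chunk <= realtime_budget(fps, chunk_frames)["chunk_ms"]


# Hypothetical chunk length of 8 frames at the 25 fps reported in the abstract.
budget = realtime_budget(fps=25, chunk_frames=8)
print(budget["frame_ms"], budget["chunk_ms"])
```

Under these assumptions, a generator must finish each chunk within the chunk's playback time to stream indefinitely without stalling. The time-to-first-frame is a separate, one-time startup cost.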