
LPM 1.0: Video-based Character Performance Model

April 9, 2026
Authors: Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong, Yikang Li, Yuchen Sun, Yue Zhao, Yuhan Lu, Yuwei Li, Zane Zhang, Zeshi Yang, Zi Ye
cs.AI

Abstract

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.