LPM 1.0: Video-based Character Performance Model
April 9, 2026
Authors: Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong, Yikang Li, Yuchen Sun, Yue Zhao, Yuhan Lu, Yuwei Li, Zane Zhang, Zeshi Yang, Zi Ye
cs.AI
Abstract
Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
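The inference setting described above — a causal streaming generator conditioned on a character image, identity-aware references, audio chunks, and optional text prompts — can be sketched as a minimal full-duplex loop. This is an illustrative assumption of what such an interface might look like, not the authors' actual API; all class and method names (`OnlineLPM`, `step`, `mode`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class OnlineLPM:
    """Hypothetical causal streaming generator: audio chunks in, frames out.

    A real system would run a distilled causal Diffusion Transformer here;
    this sketch only records the conditioning signals for each frame.
    """
    character_image: str                                # reference character image
    identity_refs: list = field(default_factory=list)   # identity-aware references
    frames: list = field(default_factory=list)          # generated frame log

    def step(self, audio_chunk: bytes, mode: str, prompt: str = "") -> str:
        # mode is "listen" (user audio drives a listening video) or
        # "speak" (synthesized agent audio drives a speaking video);
        # an optional text prompt steers motion.
        frame = f"{mode}:{len(audio_chunk)}:{prompt}"
        self.frames.append(frame)
        return frame

model = OnlineLPM("avatar.png", identity_refs=["ref_a.png", "ref_b.png"])
model.step(b"\x00" * 320, mode="listen")                 # user is talking
model.step(b"\x00" * 320, mode="speak", prompt="smile")  # agent replies
print(len(model.frames))  # 2 frames emitted so far
```

Because generation is causal and chunked, the same loop can run indefinitely, which is what enables the low-latency, infinite-length interaction the abstract claims.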