LPM 1.0: Video-based Character Performance Model

Abstract

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
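The inference setup described above (listening video driven by user audio, speaking video driven by synthesized audio, with a text prompt for motion control) can be illustrated with a minimal routing sketch. This is not the authors' implementation or API: the names `AudioChunk`, `Conditioning`, and `route_chunks` are hypothetical, and the rule that synthesized audio takes priority when both streams are active is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class AudioChunk:
    """One time slice of the conversation (hypothetical container)."""
    user: Optional[bytes]   # user (microphone) audio for this slice, if any
    agent: Optional[bytes]  # synthesized (TTS) audio for this slice, if any


@dataclass
class Conditioning:
    """Per-chunk conditioning a streaming generator might receive (assumed shape)."""
    mode: str                     # "speak" or "listen"
    user_audio: Optional[bytes]   # kept in both modes so reactions stay possible
    agent_audio: Optional[bytes]
    prompt: str                   # text prompt for motion control


def route_chunks(chunks: Iterable[AudioChunk], prompt: str = "") -> Iterator[Conditioning]:
    """Label each chunk as speaking or listening.

    Assumption: the character speaks whenever synthesized audio is present
    and listens otherwise. Both audio streams are passed through unchanged.
    """
    for chunk in chunks:
        mode = "speak" if chunk.agent else "listen"
        yield Conditioning(mode, chunk.user, chunk.agent, prompt)


if __name__ == "__main__":
    stream = [
        AudioChunk(user=b"hello", agent=None),    # user talks, character listens
        AudioChunk(user=None, agent=b"hi there"), # character replies
    ]
    for cond in route_chunks(stream, prompt="nod gently"):
        print(cond.mode)
```

In a real system each `Conditioning` would be consumed chunk by chunk by a causal generator together with the character image and identity references; that part is omitted here because the abstract does not specify the interface.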
