
GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

April 8, 2026
Authors: Yiqian Wu, Rawal Khirodkar, Egor Zakharov, Timur Bagautdinov, Lei Xiao, Zhaoen Su, Shunsuke Saito, Xiaogang Jin, Junxuan Li
cs.AI

Abstract
We present GenLCA, a diffusion-based generative model for creating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs while supporting high-fidelity facial and full-body animation. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos provide only partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, which inherently preserves the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse, high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.
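The visibility-aware training strategy described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, shapes, and the choice of squared error are all assumptions. It shows the two ingredients the abstract names: invalid token positions are substituted with a shared learnable placeholder, and the training loss is averaged only over valid positions.

```python
import numpy as np

def visibility_aware_loss(pred, target, valid_mask, learnable_token):
    """Hypothetical sketch of visibility-aware masked training.

    pred, target    : (N, D) arrays of predicted / reference 3D tokens
    valid_mask      : (N,) array of 0/1 flags, 1 where the body region
                      was actually observed in the video
    learnable_token : (D,) shared placeholder for unobserved regions
    """
    # Replace invalid target tokens with the learnable placeholder, so the
    # model is never supervised toward blurry or transparent artifacts.
    target = np.where(valid_mask[:, None] != 0, target,
                      learnable_token[None, :])
    # Per-token squared error (squared error is an assumption here).
    err = ((pred - target) ** 2).mean(axis=-1)
    # Average the loss only over valid positions.
    return (err * valid_mask).sum() / max(valid_mask.sum(), 1)
```

In an actual diffusion setup the same mask would also gate the flow-matching objective, but the masking pattern is the same: placeholder tokens in, loss on observed regions only.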