DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
June 12, 2025
Authors: Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang, Zerong Zheng, Ming Zhou
cs.AI
Abstract
In e-commerce and digital marketing, generating high-fidelity human-product
demonstration videos is important for effective product presentation. However,
most existing frameworks either fail to preserve the identities of both humans
and products or lack an understanding of human-product spatial relationships,
leading to unrealistic representations and unnatural interactions. To address
these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our
method simultaneously preserves human identities and product-specific details,
such as logos and textures, by injecting paired human-product reference
information and utilizing an additional masked cross-attention mechanism. We
employ a 3D body mesh template and product bounding boxes to provide precise
motion guidance, enabling intuitive alignment of hand gestures with product
placements. Additionally, structured text encoding is used to incorporate
category-level semantics, enhancing 3D consistency during small rotational
changes across frames. Trained on a hybrid dataset with extensive data
augmentation strategies, our approach outperforms state-of-the-art techniques
in maintaining the identity integrity of both humans and products and
generating realistic demonstration motions. Project page:
https://submit2025-dream.github.io/DreamActor-H1/.
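The abstract's masked cross-attention mechanism restricts which reference tokens each video token may attend to, so human regions draw on the human reference and product regions on the product reference. The paper does not publish code for this module; the sketch below is a minimal, hypothetical NumPy illustration of the general idea (the function name, shapes, and the choice to reuse the key projection as values are assumptions, not the authors' implementation):

```python
import numpy as np

def masked_cross_attention(query, kv, mask, num_heads=4):
    """Illustrative masked cross-attention (not the paper's code).

    query: (Nq, D) video latent tokens
    kv:    (Nk, D) concatenated human + product reference tokens
    mask:  (Nq, Nk) binary; 1 where a query token may attend to a kv token
    """
    Nq, D = query.shape
    Nk = kv.shape[0]
    dh = D // num_heads
    # Split into heads: (H, N, dh)
    q = query.reshape(Nq, num_heads, dh).transpose(1, 0, 2)
    k = kv.reshape(Nk, num_heads, dh).transpose(1, 0, 2)
    v = k  # sketch simplification: values share the key projection
    # Scaled dot-product scores per head: (H, Nq, Nk)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    # Block disallowed query/reference pairs before the softmax
    scores = np.where(mask[None] == 1, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v  # (H, Nq, dh)
    return out.transpose(1, 0, 2).reshape(Nq, D)
```

With a block-diagonal mask (e.g. human-region tokens masked to the human reference, product-region tokens to the product reference), edits to one reference's tokens cannot leak into the other's output regions, which is how the mechanism preserves both identities independently.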