

DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

June 12, 2025
Authors: Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang, Zerong Zheng, Ming Zhou
cs.AI

Abstract

In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
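The abstract describes injecting paired human and product reference tokens into the DiT backbone through a masked cross-attention mechanism, with masks tied to spatial regions such as the product bounding box. The sketch below illustrates one plausible way such a layer could be structured; it is not the authors' implementation, and the class name, tensor layout, and mask semantics are assumptions for illustration only.

```python
# Minimal sketch (assumed, not the paper's code): masked cross-attention that
# injects human + product reference tokens into DiT latent tokens, where a
# region mask (e.g., derived from the product bounding box) controls which
# latent positions may attend to which reference tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedReferenceCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, latent, ref_tokens, region_mask):
        # latent:      (B, N, C) video latent tokens from the DiT stream
        # ref_tokens:  (B, M, C) concatenated human + product reference tokens
        # region_mask: (B, N, M) boolean, True where a latent token is allowed
        #              to attend to a reference token (every latent position
        #              should have at least one True entry to avoid NaNs)
        B, N, C = latent.shape
        H = self.num_heads
        q = self.to_q(latent).view(B, N, H, C // H).transpose(1, 2)   # (B, H, N, C/H)
        k, v = self.to_kv(ref_tokens).chunk(2, dim=-1)
        k = k.view(B, -1, H, C // H).transpose(1, 2)                  # (B, H, M, C/H)
        v = v.view(B, -1, H, C // H).transpose(1, 2)
        attn_mask = region_mask.unsqueeze(1)                          # broadcast over heads
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
        out = out.transpose(1, 2).reshape(B, N, C)
        return latent + self.proj(out)                                # residual injection
```

In this reading, human reference tokens could be left attendable everywhere while product reference tokens are restricted to latent positions inside the product bounding box, which is one way the stated goal of preserving logos and textures without disturbing the rest of the frame might be realized.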