MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
February 3, 2025
Authors: Yiren Song, Cheng Liu, Mike Zheng Shou
cs.AI
Abstract
A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the Diffusion Transformer (DiT), which leverages fine-tuning to activate the in-context capabilities of DiT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
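The asymmetric-LoRA idea in the abstract is concrete enough to sketch. Below is a minimal PyTorch illustration of one plausible reading: the low-rank down-projection A (the "encoder" side of the adapter) is frozen to preserve priors shared across domains, while the up-projection B (the "decoder" side) remains trainable for task-specific tuning. The rank, scaling, initialization, and the class name `AsymmetricLoRALinear` are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class AsymmetricLoRALinear(nn.Module):
    """Frozen base weight W plus a low-rank update: y = W x + (alpha / r) * B A x.

    Asymmetry (assumed reading of the abstract): the down-projection A is
    frozen for cross-domain generalization, while the up-projection B stays
    trainable for task-specific adaptation.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone stays frozen
        # "Encoder" side of the adapter: frozen (hypothetically pretrained on shared data).
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01,
                              requires_grad=False)
        # "Decoder" side: trainable, zero-initialized so the adapter starts as a no-op.
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t()) @ self.B.t()


if __name__ == "__main__":
    layer = AsymmetricLoRALinear(nn.Linear(64, 64))
    y = layer(torch.randn(2, 64))  # identical to the base layer's output at init
    trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
    print(y.shape, trainable)      # torch.Size([2, 64]) ['B']
```

Under this reading, only the B matrices would need to be stored and swapped per task, while a single frozen A (and the frozen backbone) is shared, which is one way the method could trade task-specific capacity against generalization.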