Alchemist: Turning Public Text-to-Image Data into Generative Gold
May 25, 2025
Authors: Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin
cs.AI
Abstract
Pre-training equips text-to-image (T2I) models with broad world knowledge,
but this alone is often insufficient to achieve high aesthetic quality and
alignment. Consequently, supervised fine-tuning (SFT) is crucial for further
refinement. However, its effectiveness depends heavily on the quality of the
fine-tuning dataset. Existing public SFT datasets frequently target narrow
domains (e.g., anime or specific art styles), and the creation of high-quality,
general-purpose SFT datasets remains a significant challenge. Current curation
methods are often costly and struggle to identify truly impactful samples. This
challenge is further complicated by the scarcity of public general-purpose
datasets, as leading models often rely on large, proprietary, and poorly
documented internal data, hindering broader research progress. This paper
introduces a novel methodology for creating general-purpose SFT datasets by
leveraging a pre-trained generative model as an estimator of high-impact
training samples. We apply this methodology to construct and release Alchemist,
a compact (3,350 samples) yet highly effective SFT dataset. Experiments
demonstrate that Alchemist substantially improves the generative quality of
five public T2I models while preserving diversity and style. Additionally, we
release the fine-tuned models' weights to the public.
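
The abstract does not spell out how the pre-trained model scores candidates, but the selection step it implies can be illustrated with a minimal sketch: score every candidate (image, caption) pair with the pre-trained model's estimator, then keep only the top-scoring subset (Alchemist keeps 3,350 samples). Everything below is hypothetical scaffolding: `estimate_impact` and the random feature matrix stand in for the paper's actual estimator and data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the pre-trained T2I model's estimator.
# In the paper's setting, a pre-trained generative model assigns each
# candidate a score reflecting its expected impact as an SFT sample;
# here a random linear scorer keeps the sketch runnable end to end.
def estimate_impact(candidate_features: np.ndarray) -> np.ndarray:
    weights = rng.normal(size=candidate_features.shape[1])
    return candidate_features @ weights

# Toy candidate pool: one feature vector per (image, caption) pair.
num_candidates, feat_dim = 100_000, 64
features = rng.normal(size=(num_candidates, feat_dim))

# Score all candidates and keep the k highest-impact ones
# (Alchemist retains a compact subset of 3,350 samples).
scores = estimate_impact(features)
k = 3_350
top_k_idx = np.argsort(scores)[-k:]
sft_subset = features[top_k_idx]

print(f"Selected {sft_subset.shape[0]} of {num_candidates} candidates for SFT.")
```

If the real scoring works anything like this, the appeal is that curation cost is dominated by one scoring pass per candidate rather than by manual review, which would explain how a compact, high-impact dataset can be assembled from large public pools.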