

Alchemist: Turning Public Text-to-Image Data into Generative Gold

May 25, 2025
Authors: Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin
cs.AI

Abstract

Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.
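
At a high level, the described methodology scores a large pool of candidate image-text pairs with a pre-trained generative model and keeps only the highest-impact samples for supervised fine-tuning. The sketch below is a minimal, hypothetical illustration of that selection loop only; `toy_score`, the tensor shapes, and the scoring criterion are placeholders and do not reproduce the paper's actual estimator.

```python
# Hypothetical sketch: rank a candidate pool with a scoring function derived
# from a pre-trained model, then keep the top-k pairs as the SFT dataset.
from typing import Callable, List, Tuple
import torch

Sample = Tuple[torch.Tensor, torch.Tensor]  # (image tensor, caption embedding)


def select_top_k(
    score_fn: Callable[[torch.Tensor, torch.Tensor], float],
    pool: List[Sample],
    k: int = 3350,  # Alchemist's reported dataset size
) -> List[Sample]:
    """Score every candidate pair and keep the k highest-scoring ones."""
    ranked = sorted(pool, key=lambda s: score_fn(*s), reverse=True)
    return ranked[:k]


def toy_score(image: torch.Tensor, caption_emb: torch.Tensor) -> float:
    """Stand-in scorer (NOT the paper's estimator): a real implementation
    would query a frozen pre-trained generative model; here we just use
    image variance so the example runs end to end."""
    return image.var().item()


if __name__ == "__main__":
    # Toy candidate pool of random (image, caption-embedding) pairs.
    pool = [(torch.randn(3, 64, 64), torch.randn(77, 768)) for _ in range(10)]
    subset = select_top_k(toy_score, pool, k=3)
    print(f"kept {len(subset)} of {len(pool)} candidates")
```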
