Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
October 28, 2025
Authors: William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
cs.AI
Abstract
Text-to-image (T2I) models are increasingly used for synthetic dataset
generation, but generating effective synthetic training data for classification
remains challenging. Fine-tuning a T2I model with a few real examples can help
improve the quality of synthetic training data; however, it may also cause
overfitting and reduce diversity in the generated samples. We propose BOB
(BeyondOBjects), a fine-tuning strategy that mitigates these concerns for
fine-grained classification. Given a small set of real examples, we first
extract class-agnostic attributes such as scene background and object pose. We
then explicitly condition on these attributes during fine-tuning of the T2I
model and marginalize them out during generation. This design mitigates
overfitting, preserves the T2I model's generative prior, reduces estimation
errors, and further minimizes unintended inter-class associations. Extensive
experiments across multiple T2I models, backbones, and datasets show that our
method achieves state-of-the-art performance in low-shot fine-grained
classification when augmented with synthetic data. Concretely, BOB outperforms
DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning
a CLIP classifier with five real images augmented with 100 synthetic images).
In three of the four benchmarks, fine-tuning downstream models with 5 real
images augmented with BOB achieves better performance than fine-tuning with 10
real images. Collectively, BOB outperforms prior art in 18 of 24 experimental
settings, with accuracy improvements of more than 2% in 14 of these settings.
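The abstract's core idea — explicitly conditioning on class-agnostic attributes during fine-tuning, then marginalizing them out at generation time — can be illustrated at the prompt level. The sketch below is a hypothetical simplification (the function names, prompt template, and attribute dictionary format are illustrative assumptions, not the paper's actual implementation): during fine-tuning, each real example's extracted background and pose are spelled out in its caption so the class token need not absorb them; during generation, those attributes are sampled independently of the class, restoring diversity.

```python
import random

# Hypothetical sketch of BOB-style prompting. Assumes class-agnostic
# attributes (scene background, object pose) were already extracted from
# the few real examples, e.g. via a captioning model.

def finetune_prompts(class_name, attributes):
    """Build fine-tuning captions that explicitly condition on each real
    example's class-agnostic attributes, so the T2I model does not fold
    them into the class concept (mitigating overfitting)."""
    return [
        f"a photo of a {class_name}, {attr['background']}, {attr['pose']}"
        for attr in attributes
    ]

def generation_prompts(class_name, attribute_pool, n, seed=0):
    """Build generation prompts that marginalize the attributes out by
    sampling background and pose independently of the class, reducing
    unintended class-attribute associations in the synthetic data."""
    rng = random.Random(seed)
    backgrounds = [a["background"] for a in attribute_pool]
    poses = [a["pose"] for a in attribute_pool]
    return [
        f"a photo of a {class_name}, {rng.choice(backgrounds)}, {rng.choice(poses)}"
        for _ in range(n)
    ]
```

For example, with attributes extracted from two real aircraft photos, `generation_prompts("Boeing 737", pool, 100)` would recombine backgrounds and poses across examples rather than reproducing the original pairings.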