Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
October 28, 2025
Authors: William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
cs.AI
Abstract
Text-to-image (T2I) models are increasingly used for synthetic dataset
generation, but generating effective synthetic training data for classification
remains challenging. Fine-tuning a T2I model with a few real examples can help
improve the quality of synthetic training data; however, it may also cause
overfitting and reduce diversity in the generated samples. We propose BOB
(BeyondOBjects), a fine-tuning strategy that mitigates these concerns for
fine-grained classification. Given a small set of real examples, we first
extract class-agnostic attributes such as scene background and object pose. We
then explicitly condition on these attributes during fine-tuning of the T2I
model and marginalize them out during generation. This design mitigates
overfitting, preserves the T2I model's generative prior, reduces estimation
errors, and further minimizes unintended inter-class associations. Extensive
experiments across multiple T2I models, backbones, and datasets show that our
method achieves state-of-the-art performance in low-shot fine-grained
classification when training data is augmented with synthetic images.
Concretely, BOB outperforms
DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning
a CLIP classifier with five real images augmented with 100 synthetic images).
On three of the four benchmarks, fine-tuning downstream models on 5 real
images augmented with BOB-generated synthetic images outperforms fine-tuning
on 10 real images. Collectively, BOB outperforms prior art in 18 of 24
experimental settings, with accuracy improvements of 2% or more in 14 of them.
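The condition-then-marginalize idea described above can be illustrated with a minimal prompt-construction sketch. This is not the authors' code: the attribute lists, function names, and prompt template are all assumptions made for illustration. The point is only the structure: during fine-tuning, class-agnostic attributes (background, pose) are made explicit in the conditioning prompt so the model does not fold them into the class token; at generation time, the same attributes are sampled independently of the class, approximately marginalizing them out.

```python
import random

# Hypothetical class-agnostic attributes extracted from the few real examples
# (e.g., scene background and object pose). The concrete values here are
# illustrative placeholders, not attributes from the paper.
ATTRIBUTES = {
    "background": ["on a runway", "in the sky", "parked at a gate"],
    "pose": ["side view", "front view", "taking off"],
}

def finetune_prompt(class_name: str, pose: str, background: str) -> str:
    """Prompt used while fine-tuning the T2I model: the attributes are stated
    explicitly, so the class token is not forced to absorb them."""
    return f"a photo of a {class_name}, {pose}, {background}"

def generation_prompts(class_name: str, n: int, seed: int = 0) -> list[str]:
    """At generation time, marginalize over the attributes by sampling them
    uniformly, decoupling class identity from incidental context."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(n):
        pose = rng.choice(ATTRIBUTES["pose"])
        background = rng.choice(ATTRIBUTES["background"])
        prompts.append(finetune_prompt(class_name, pose, background))
    return prompts

if __name__ == "__main__":
    for prompt in generation_prompts("Boeing 747", 3):
        print(prompt)
```

In an actual pipeline, `finetune_prompt` would caption the few real images (with their extracted attributes) for T2I fine-tuning, and `generation_prompts` would drive sampling of the synthetic training set, spreading each class across many attribute combinations rather than the handful seen during fine-tuning.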