Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

Abstract

Diffusion-based models have shown great potential in generating high-quality images with various layouts, which can benefit downstream perception tasks. However, fully automatic layout generation driven only by language, and a suitable metric for measuring multiple generated instances, have not been well explored. In this work, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality multi-modal training examples to augment perception and multi-modal training. Starting with a simple list of natural language concepts, we prompt large language models (LLMs) to generate a detailed description and design reasonable layouts. Next, we use an off-the-shelf text-to-image model to generate multiple images. The generated data are then refined using a comprehensively designed metric to ensure quality. In particular, we present a new metric, Composite Layout and Image Score (CLIS), to evaluate the generated images fairly. By customizing the initial concept list, our synthetic high-quality examples boost performance in various scenarios, especially in addressing challenges associated with long-tailed distributions and imbalanced datasets. Experimental results on downstream tasks demonstrate that Auto Cherry-Picker can significantly improve the performance of existing models. In addition, we have thoroughly investigated the correlation between CLIS and performance gains in downstream tasks, and we find that a better CLIS score leads to better performance. This finding highlights the potential role of evaluation metrics in various visual perception and multi-modal large language model (MLLM) tasks. Code will be made available.
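The generate-then-filter flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define how CLIS is computed, so the scorer is passed in as an opaque callable, and every name here (`Candidate`, `cherry_pick`, `describe_and_layout`, `generate_images`, `num_images`, `min_score`, and the `fake_*` stubs) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """One generated training example (all fields are illustrative)."""
    description: str  # detailed description written by the LLM
    layout: list      # e.g. [(label, (x0, y0, x1, y1)), ...]
    image_id: str     # handle to one generated image


def cherry_pick(
    concepts: Sequence[str],
    describe_and_layout: Callable,  # stands in for the LLM step
    generate_images: Callable,      # stands in for the text-to-image step
    score: Callable,                # stands in for a quality metric such as CLIS
    num_images: int = 4,
    min_score: float = 0.5,
) -> List[Candidate]:
    # Step 1: concept list -> detailed description and layout.
    description, layout = describe_and_layout(concepts)
    # Step 2: generate several images for the same description and layout.
    image_ids = generate_images(description, layout, num_images)
    candidates = [Candidate(description, layout, i) for i in image_ids]
    # Step 3: keep only candidates that pass the quality threshold, best first.
    kept = [c for c in candidates if score(c) >= min_score]
    return sorted(kept, key=score, reverse=True)


# Stub components so the sketch runs without any model (assumptions only).
def fake_llm(concepts):
    description = "A photo of " + " and ".join(concepts)
    layout = [(c, (0.1 * i, 0.1, 0.1 * i + 0.3, 0.6)) for i, c in enumerate(concepts)]
    return description, layout


def fake_t2i(description, layout, n):
    return [f"img_{k}" for k in range(n)]


def fake_score(candidate):
    # Hard-coded scores purely for demonstration.
    return {"img_0": 0.2, "img_1": 0.9, "img_2": 0.6, "img_3": 0.4}[candidate.image_id]


picked = cherry_pick(["dog", "frisbee"], fake_llm, fake_t2i, fake_score)
print([c.image_id for c in picked])  # ['img_1', 'img_2']
```

Passing the three stages in as callables keeps the sketch independent of any particular LLM, image generator, or metric; in a real setting the stubs would be replaced by actual model calls and by whatever scoring function the framework defines.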
