Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
March 5, 2024
Authors: Imad Eddine Toubal, Aditya Avinash, Neil Gordon Alldrin, Jan Dlabal, Wenlei Zhou, Enming Luo, Otilia Stretcu, Hao Xiong, Chun-Ta Lu, Howard Zhou, Ranjay Krishna, Ariel Fuxman, Tom Duerig
cs.AI
Abstract
From content moderation to wildlife conservation, the number of applications
that require models to recognize nuanced or subjective visual concepts is
growing. Traditionally, developing classifiers for such concepts requires
substantial manual effort measured in hours, days, or even months to identify
and annotate data needed for training. Even with recently proposed Agile
Modeling techniques, which enable rapid bootstrapping of image classifiers,
users are still required to spend 30 minutes or more of monotonous, repetitive
data labeling just to train a single classifier. Drawing on Fiske's Cognitive
Miser theory, we propose a new framework that alleviates manual effort by
replacing human labeling with natural language interactions, reducing the total
effort required to define a concept by an order of magnitude: from labeling
2,000 images to only 100 plus some natural language interactions. Our framework
leverages recent advances in foundation models, both large language models and
vision-language models, to carve out the concept space through conversation and
by automatically labeling training data points. Most importantly, our framework
eliminates the need for crowd-sourced annotations. Moreover, our framework
ultimately produces lightweight classification models that are deployable in
cost-sensitive scenarios. Across 15 subjective concepts and 2 public
image classification datasets, our trained models outperform traditional Agile
Modeling as well as state-of-the-art zero-shot classification models like
ALIGN, CLIP, CuPL, and large visual question-answering models like PaLI-X.
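
The pipeline the abstract describes — an LLM carves a subjective concept into concrete visual sub-questions, a VLM answers those questions for each unlabeled image, and the resulting automatic labels replace crowd-sourced annotation — can be illustrated with a minimal sketch. The `llm_decompose` and `vlm_answer` functions below are hypothetical stand-ins for real model calls (the paper does not specify these interfaces); only the overall control flow follows the abstract's description.

```python
# Hypothetical stand-in for an LLM call that decomposes a subjective
# concept into concrete, verifiable yes/no visual sub-questions.
def llm_decompose(concept: str) -> list[str]:
    canned = {
        "gourmet tuna": [
            "Is the dish plated carefully?",
            "Does the photo show tuna?",
        ],
    }
    return canned.get(concept, [f"Does the image show {concept}?"])

# Hypothetical stand-in for a VLM answering one yes/no question about
# one image; here it just looks up canned per-image attributes.
def vlm_answer(image: dict, question: str) -> bool:
    return question in image["true_facts"]

def auto_label(concept: str, images: list[dict]) -> list[tuple[dict, int]]:
    """Label an image positive (1) only if the VLM affirms every
    sub-question produced by the LLM -- no human annotation involved.
    The (image, label) pairs would then train a lightweight classifier."""
    questions = llm_decompose(concept)
    return [
        (img, int(all(vlm_answer(img, q) for q in questions)))
        for img in images
    ]

# Toy "unlabeled" images with ground-truth attributes for the stub VLM.
images = [
    {"id": "a", "true_facts": {"Is the dish plated carefully?",
                               "Does the photo show tuna?"}},
    {"id": "b", "true_facts": {"Does the photo show tuna?"}},
]
labels = auto_label("gourmet tuna", images)
```

In this sketch, image "a" satisfies both sub-questions and is auto-labeled positive, while "b" fails one and is labeled negative; in the actual system these weak labels, plus a small human-verified seed set, would supervise a deployable lightweight model.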