

Localized Symbolic Knowledge Distillation for Visual Commonsense Models

December 8, 2023
Authors: Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi
cs.AI

Abstract

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.
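
The sampling-and-filtering pipeline described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration only: the function names (`build_knowledge_prompt`, `distill_corpus`), the `llm.sample` / `critic.score` interfaces, and the captioner callables are assumptions for illustration, not APIs from the paper's released code.

```python
# Hypothetical sketch of the localized knowledge-sampling pipeline from the abstract:
# global caption + per-region captions -> LLM prompt -> critic-filtered corpus.
# All object interfaces below (llm.sample, critic.score, captioners) are assumed.

from typing import List


def build_knowledge_prompt(global_caption: str, region_captions: List[str]) -> str:
    """Combine a global image description with per-region descriptions into a
    prompt that asks an LLM for commonsense inferences grounded in regions."""
    regions = "\n".join(
        f"Region {i}: {caption}" for i, caption in enumerate(region_captions)
    )
    return (
        "Image description: " + global_caption + "\n"
        + regions + "\n"
        + "Generate commonsense inferences that refer to regions by their indices."
    )


def distill_corpus(images, llm, critic, global_captioner, region_captioner,
                   samples_per_image: int = 5, threshold: float = 0.5):
    """Sample localized commonsense statements from the LLM and keep only those
    the critic model scores above a quality threshold."""
    corpus = []
    for image in images:
        prompt = build_knowledge_prompt(
            global_captioner(image),   # global literal image description
            region_captioner(image),   # list of local literal region descriptions
        )
        for statement in llm.sample(prompt, n=samples_per_image):
            if critic.score(image, statement) >= threshold:
                corpus.append((image, statement))
    # The filtered corpus is then used to fine-tune a VL model so that it
    # accepts region references as input.
    return corpus
```

The resulting corpus plays the role of the distillation target: an existing VL model is fine-tuned on it so that, at inference time, users can point to specific regions instead of describing them in text.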