Localized Symbolic Knowledge Distillation for Visual Commonsense Models
December 8, 2023
作者: Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi
cs.AI
Abstract
Instruction following vision-language (VL) models offer a flexible interface
that supports a broad range of multimodal tasks in a zero-shot fashion.
However, interfaces that operate on full images do not directly enable the user
to "point to" and access specific regions within images. This capability is
important not only to support reference-grounded VL benchmarks, but also for
practical applications that require precise within-image reasoning. We build
Localized Visual Commonsense models, which allow users to specify (multiple)
regions as input. We train our model by sampling localized commonsense
knowledge from a large language model (LLM): specifically, we prompt an LLM to
collect commonsense knowledge given a global literal image description and a
local literal region description automatically generated by a set of VL models.
With a separately trained critic model that selects high-quality examples, we
find that training on the localized commonsense corpus can successfully distill
existing VL models to support a reference-as-input interface. Empirical results
and human evaluations in a zero-shot setup demonstrate that our distillation
method results in more precise VL models of reasoning compared to a baseline of
passing a generated referring expression to an LLM.
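
The abstract outlines a data-generation loop: verbalize the image globally and per region with VL models, prompt an LLM for commonsense statements grounded in specific regions, and keep only the samples a separately trained critic rates highly before distillation training. The sketch below is a hypothetical Python rendering of that loop, not the authors' released code; the function names, prompt wording, and the llm/critic callables are illustrative assumptions.

# Minimal sketch of the localized knowledge sampling and critic filtering
# described in the abstract. `llm` and `critic` are hypothetical callables
# standing in for any instruction-following LLM and a separately trained
# quality classifier; the prompt format is illustrative only.
from typing import Callable, Dict, List


def build_prompt(global_caption: str, region_captions: Dict[str, str]) -> str:
    """Combine a global literal image description with local, per-region
    descriptions so the LLM can refer to specific regions by ID."""
    region_lines = "\n".join(
        f"Region [{rid}]: {desc}" for rid, desc in region_captions.items()
    )
    return (
        "Image description:\n"
        f"{global_caption}\n\n"
        "Region descriptions:\n"
        f"{region_lines}\n\n"
        "Write a commonsense question and answer that refers to the regions "
        "by their IDs (e.g. [0], [1])."
    )


def sample_localized_knowledge(
    global_caption: str,
    region_captions: Dict[str, str],
    llm: Callable[[str], str],       # wrapper around an LLM completion call
    critic: Callable[[str], float],  # quality score in [0, 1]
    num_samples: int = 5,
    threshold: float = 0.8,
) -> List[str]:
    """Sample candidate region-grounded QA pairs from the LLM and keep only
    those the critic scores above the threshold; the retained examples form
    the corpus used to distill a localized visual commonsense model."""
    prompt = build_prompt(global_caption, region_captions)
    candidates = [llm(prompt) for _ in range(num_samples)]
    return [c for c in candidates if critic(c) >= threshold]


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    dummy_llm = lambda p: "Q: Why is [0] standing near [1]? A: They are waiting for the rain to stop."
    dummy_critic = lambda text: 0.9
    kept = sample_localized_knowledge(
        "A person stands on a rainy street next to a bus stop.",
        {"0": "a person in a raincoat", "1": "a covered bus stop"},
        dummy_llm,
        dummy_critic,
    )
    print(kept)

In this sketch the critic is applied per generated sample, so the same filtering function can be reused both to clean the training corpus and to rerank model outputs at inference time.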