CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

February 6, 2024
作者: Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, which further results in failures on meticulous visual problems and unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems with a series of manipulations, where each manipulation refers to an operation on the visual input, drawn either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zooming in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning and permits users to trace error causes along interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture, endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with this data swiftly yields competitive performance. The code and data are publicly available at https://github.com/THUDM/CogCoM.
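
To make the mechanism concrete, below is a minimal sketch of what a chain of manipulations could look like at inference time. This is an illustrative toy, not the CogCoM implementation: the helpers `detect_region` and `answer_from_image` are hypothetical placeholders for the model's grounding and answering abilities, and only the crop-and-zoom step uses a real library call (Pillow).

```python
# Toy chain-of-manipulations loop (illustrative sketch, not the CogCoM code).
# `detect_region` and `answer_from_image` are hypothetical stand-ins for
# model-internal abilities; only the zoom step uses real Pillow calls.
from PIL import Image


def detect_region(image: Image.Image, phrase: str) -> tuple[int, int, int, int]:
    """Placeholder grounding manipulation: return a bounding box (x0, y0, x1, y1).
    A real VLM would predict this box from the image and the referring phrase."""
    w, h = image.size
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)  # dummy central box


def crop_and_zoom(image: Image.Image,
                  box: tuple[int, int, int, int],
                  scale: int = 2) -> Image.Image:
    """Zoom-in manipulation: crop the predicted box and upsample it."""
    region = image.crop(box)
    return region.resize((region.width * scale, region.height * scale))


def answer_from_image(image: Image.Image, question: str) -> str:
    """Placeholder for the model's final answer over the manipulated evidence."""
    return f"(answer derived from a {image.size} close-up for: {question!r})"


if __name__ == "__main__":
    img = Image.new("RGB", (640, 480))            # stands in for the input photo
    question = "What is written on the small sign?"

    # Chain of manipulations: ground the referenced region, zoom in, then answer.
    box = detect_region(img, "the small sign")    # manipulation 1: grounding
    close_up = crop_and_zoom(img, box)            # manipulation 2: zoom in
    print(answer_from_image(close_up, question))  # final, evidence-backed answer
```

Each intermediate result (the box, the close-up crop) is kept as traceable evidence, which is what lets users follow the interpretable path described in the abstract.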