CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
February 6, 2024
Authors: Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang
cs.AI
Abstract
Vision-Language Models (VLMs) have demonstrated their widespread viability
thanks to extensive training in aligning visual instructions to answers.
However, this conclusive alignment leads models to ignore critical visual
reasoning, which further results in failures on meticulous visual problems
and unfaithful responses. In this paper, we propose Chain of Manipulations, a
mechanism that enables VLMs to solve problems with a series of manipulations,
where each manipulation refers to an operation on the visual input, either from
intrinsic abilities (e.g., grounding) acquired through prior training or from
imitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs
to generate faithful responses with evidential visual reasoning, and permits
users to trace error causes along interpretable paths. We thus train CogCoM, a
general 17B VLM with a memory-based compatible architecture endowed with this
reasoning mechanism. Experiments show that our model achieves the
state-of-the-art performance across 8 benchmarks from 3 categories, and that
a limited number of training steps on the data swiftly yields competitive
performance. The code and data are publicly available at
https://github.com/THUDM/CogCoM.
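
The abstract only names the mechanism, so below is a minimal, hypothetical sketch of what a chain-of-manipulations inference loop could look like. Every name here (`chain_of_manipulations`, `run_vlm_step`, the `MANIPULATIONS` registry) is an illustrative assumption rather than the CogCoM implementation; see the repository above for the authors' actual code.

```python
# Minimal sketch of a chain-of-manipulations loop (assumed names, not the CogCoM API).
from dataclasses import dataclass, field
from typing import Callable

from PIL import Image


@dataclass
class ReasoningState:
    """Accumulates the evidential reasoning path for one question."""
    image: Image.Image
    question: str
    steps: list[str] = field(default_factory=list)


def zoom_in(img: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop to a region of interest and enlarge it (a human-like manipulation)."""
    return img.crop(box).resize(img.size)


# Registry of manipulations the model may request by name (hypothetical).
MANIPULATIONS: dict[str, Callable] = {"zoom_in": zoom_in}


def run_vlm_step(state: ReasoningState):
    """Placeholder for one VLM call: returns either a (name, args) manipulation
    request or a final answer string. Stubbed here; not the real model."""
    raise NotImplementedError


def chain_of_manipulations(image: Image.Image, question: str, max_steps: int = 5) -> str:
    """Iteratively apply model-requested manipulations to the visual input
    until the model emits a final answer or the step budget runs out."""
    state = ReasoningState(image=image, question=question)
    for _ in range(max_steps):
        action = run_vlm_step(state)
        if isinstance(action, str):           # final answer reached
            return action
        name, args = action                   # e.g. ("zoom_in", {"box": (10, 10, 200, 200)})
        state.image = MANIPULATIONS[name](state.image, **args)
        state.steps.append(f"{name}({args})")  # keeps the path interpretable for tracing errors
    return "no answer within step budget"
```

The point of the sketch is the control flow the abstract describes: each step either manipulates the visual input (grounding, zooming) or answers, and the recorded steps form the interpretable path users can inspect when the answer is wrong.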