CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Abstract

Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, which in turn results in failures on meticulous visual problems and in unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems through a series of manipulations, where each manipulation is an operation on the visual input, drawn either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zooming in). This mechanism encourages VLMs to generate faithful responses backed by evidential visual reasoning, and lets users trace the causes of errors along interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that it swiftly attains competitive performance with a limited number of training steps on the data. The code and data are publicly available at https://github.com/THUDM/CogCoM.
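To make the idea of a chain of manipulations more concrete, the following is a minimal toy sketch, not the CogCoM implementation. It assumes an image is a plain nested list of pixel values and that grounding can be stubbed by a simple locator function. All names here (`Chain`, `Step`, `ground`, `zoom_in`, `brightest_pixel_box`) are hypothetical and chosen only for illustration. The sketch shows how each manipulation operates on the visual input and leaves a record in a trace that can be inspected afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


@dataclass
class Step:
    """One recorded manipulation: its name, arguments, and a short result note."""
    name: str
    args: dict
    note: str


@dataclass
class Chain:
    """Holds the current view of the image plus an inspectable trace of steps."""
    image: Image
    trace: List[Step] = field(default_factory=list)

    def ground(self, locate: Callable[[Image], Box], target: str) -> Box:
        # `locate` is a hypothetical stand-in for a model's grounding ability.
        box = locate(self.image)
        self.trace.append(Step("grounding", {"target": target}, f"box={box}"))
        return box

    def zoom_in(self, box: Box, factor: int) -> Image:
        # Crop to the box, then upsample by repeating each pixel `factor` times
        # horizontally and each row `factor` times vertically.
        top, left, bottom, right = box
        cropped = [row[left:right] for row in self.image[top:bottom]]
        zoomed = [
            [px for px in row for _ in range(factor)]
            for row in cropped
            for _ in range(factor)
        ]
        self.image = zoomed
        width = len(zoomed[0]) if zoomed else 0
        self.trace.append(
            Step("zoom_in", {"box": box, "factor": factor},
                 f"size={len(zoomed)}x{width}")
        )
        return zoomed


def brightest_pixel_box(image: Image) -> Box:
    """Toy locator (assumption): a 1x1 box around the brightest pixel."""
    row, col = max(
        ((r, c) for r in range(len(image)) for c in range(len(image[0]))),
        key=lambda rc: image[rc[0]][rc[1]],
    )
    return (row, col, row + 1, col + 1)


image = [
    [0, 0, 0, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
]
chain = Chain(image)
box = chain.ground(brightest_pixel_box, target="bright spot")
view = chain.zoom_in(box, factor=2)

# The trace is the interpretable path: each step can be checked in order.
for step in chain.trace:
    print(step.name, step.args, step.note)
```

Keeping the trace separate from the image is a deliberate choice in this sketch: a wrong final answer can be traced back to the step that went wrong, for example a misplaced grounding box, by reading the recorded steps in sequence.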
