CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Abstract

Vision-Language Models (VLMs) have demonstrated broad viability thanks to extensive training that aligns visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, which in turn results in failures on meticulous visual problems and in unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems through a series of manipulations, where each manipulation is an operation on the visual input that comes either from intrinsic abilities acquired through prior training (e.g., grounding) or from imitating human-like behaviors (e.g., zooming in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning, and it lets users trace the causes of errors along interpretable paths. We train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with the data swiftly yields competitive performance. The code and data are publicly available at https://github.com/THUDM/CogCoM.
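
To make the idea of a manipulation chain concrete, the following is a minimal sketch in plain Python, not CogCoM's actual implementation or API. The names `Chain`, `crop_and_zoom`, and `toy_locator` are hypothetical, the image is a toy grid of integers, and the grounding step is a stand-in function rather than a model call. It only illustrates the two example manipulations the abstract names (grounding and zooming in) and how recording each step gives a path that can be inspected afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def crop_and_zoom(image: Image, box: Box, factor: int) -> Image:
    """Crop `box` from `image` and enlarge it by nearest-neighbour repetition."""
    x0, y0, x1, y1 = box
    cropped = [row[x0:x1] for row in image[y0:y1]]
    zoomed: Image = []
    for row in cropped:
        wide = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


@dataclass
class Chain:
    """Hypothetical container that logs every manipulation applied to an image."""
    image: Image
    steps: List[str] = field(default_factory=list)

    def ground(self, phrase: str, locator: Callable[[Image, str], Box]) -> Box:
        # `locator` stands in for the model's grounding ability (an assumption).
        box = locator(self.image, phrase)
        self.steps.append(f"grounding({phrase!r}) -> {box}")
        return box

    def zoom(self, box: Box, factor: int) -> "Chain":
        self.image = crop_and_zoom(self.image, box, factor)
        self.steps.append(f"crop_and_zoom({box}, {factor})")
        return self


def toy_locator(image: Image, phrase: str) -> Box:
    """Toy grounding: bounding box of all non-zero pixels; `phrase` is ignored."""
    points = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


image = [
    [0, 0, 0, 0],
    [0, 7, 9, 0],
    [0, 0, 0, 0],
]
chain = Chain(image)
box = chain.ground("the small logo", toy_locator)
chain.zoom(box, factor=2)

print(chain.image)
for step in chain.steps:
    print(step)
```

Keeping the step log separate from the image is one possible design choice: the final answer would be produced from the zoomed region, while the log is what a user would read to find the step where an error entered.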
