Improve Vision Language Model Chain-of-thought Reasoning
October 21, 2024
Authors: Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
cs.AI
Abstract
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial
for improving interpretability and trustworthiness. However, current training
recipes lack robust CoT reasoning data, relying on datasets dominated by short
annotations with minimal rationales. In this work, we show that training VLMs on
short answers does not generalize well to reasoning tasks that require more
detailed responses. To address this, we propose a two-fold approach. First, we
distill rationales from the GPT-4o model to enrich the training data and fine-tune
VLMs, boosting their CoT performance. Second, we apply reinforcement learning
to further calibrate reasoning quality. Specifically, we construct positive
(correct) and negative (incorrect) pairs of model-generated reasoning chains
by comparing their predictions with annotated short answers. Using this
pairwise data, we apply the Direct Preference Optimization algorithm to refine
the model's reasoning abilities. Our experiments demonstrate significant
improvements in CoT reasoning on benchmark datasets and better generalization
to direct answer prediction as well. This work emphasizes the importance of
incorporating detailed rationales in training and leveraging reinforcement
learning to strengthen the reasoning capabilities of VLMs.
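The sketch below (not the authors' released code) illustrates how such preference pairs might be assembled: sampled CoT chains whose extracted final answer matches the annotated short answer are treated as "chosen" responses, mismatching chains as "rejected", and the resulting (prompt, chosen, rejected) triples can then be fed to a standard DPO trainer. The `extract_answer` parser, the "Answer:" marker convention, and the toy inputs are assumptions for illustration only.

```python
# Minimal sketch: build DPO preference pairs from sampled CoT chains by
# comparing each chain's extracted answer against the annotated short answer.
import itertools
import re
from typing import Iterable


def extract_answer(chain: str) -> str:
    """Hypothetical parser: grab the text after an 'Answer:' marker (case-insensitive)."""
    match = re.search(r"answer\s*:\s*(.+)", chain, flags=re.IGNORECASE)
    return (match.group(1) if match else chain).strip().lower()


def build_dpo_pairs(prompt: str, chains: Iterable[str], short_answer: str,
                    max_pairs: int = 4) -> list[dict]:
    """Pair every correct chain with every incorrect chain for the same prompt."""
    gold = short_answer.strip().lower()
    correct = [c for c in chains if extract_answer(c) == gold]
    incorrect = [c for c in chains if extract_answer(c) != gold]
    pairs = [
        {"prompt": prompt, "chosen": pos, "rejected": neg}
        for pos, neg in itertools.product(correct, incorrect)
    ]
    return pairs[:max_pairs]


if __name__ == "__main__":
    # Toy example: hand-written chains stand in for VLM-sampled reasoning.
    prompt = "How many birds are in the image?"
    sampled_chains = [
        "There are two birds on the left branch and one on the right. Answer: 3",
        "I see a single bird near the feeder. Answer: 1",
    ]
    for pair in build_dpo_pairs(prompt, sampled_chains, short_answer="3"):
        print(pair)
```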