

Improve Vision Language Model Chain-of-thought Reasoning

October 21, 2024
Authors: Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
cs.AI

Abstract

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains, by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
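As a rough illustration of the preference-pair construction step described in the abstract, the sketch below shows how sampled reasoning chains might be split into positive (correct) and negative (incorrect) sets by comparing their final predictions with the annotated short answer. This is a minimal, text-only simplification (the paper's setting also involves images), and every function, field, and answer format here is a hypothetical assumption, not taken from the authors' code.

```python
import re
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # the question (image input omitted in this text-only sketch)
    chosen: str    # reasoning chain whose final answer matches the annotation
    rejected: str  # reasoning chain whose final answer does not match


def extract_answer(reasoning_chain: str) -> str:
    """Pull the final short answer out of a generated chain-of-thought.

    Assumes the chain ends with a line like 'Answer: <text>' (hypothetical format).
    """
    match = re.search(r"Answer:\s*(.+?)\s*$", reasoning_chain, flags=re.IGNORECASE)
    return match.group(1).strip().lower() if match else reasoning_chain.strip().lower()


def build_pairs(prompt: str, sampled_chains: list[str], gold_answer: str) -> list[PreferencePair]:
    """Split sampled reasoning chains into correct/incorrect sets by comparing
    their predicted answers with the annotated short answer, then pair them up."""
    gold = gold_answer.strip().lower()
    correct = [c for c in sampled_chains if extract_answer(c) == gold]
    incorrect = [c for c in sampled_chains if extract_answer(c) != gold]
    # One (chosen, rejected) pair per index; unmatched extras are simply dropped.
    return [PreferencePair(prompt, pos, neg) for pos, neg in zip(correct, incorrect)]


if __name__ == "__main__":
    chains = [
        "The sign in the image reads 'EXIT'. Answer: exit",
        "The sign appears to say 'STOP'. Answer: stop",
    ]
    for pair in build_pairs("What does the sign say?", chains, "exit"):
        print(pair)
```

The resulting (prompt, chosen, rejected) triples could then be passed to a standard DPO trainer (for example, the DPOTrainer in Hugging Face's TRL library) to run the preference-optimization stage the abstract describes.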
