
Sherlock: Self-Correcting Reasoning in Vision-Language Models

May 28, 2025
Authors: Yi Ding, Ruqi Zhang
cs.AI

Abstract

Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic beta for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
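The abstract mentions a "dynamic beta for preference tuning" applied to preference pairs built via visual perturbation. As a minimal sketch of what a per-sample dynamic beta could look like in a DPO-style objective, the snippet below scales beta by an assumed perturbation-strength signal; the scaling rule, function name, and parameters are hypothetical, not the paper's actual formulation.

```python
import math


def dpo_loss_dynamic_beta(logp_chosen, logp_rejected,
                          ref_logp_chosen, ref_logp_rejected,
                          base_beta=0.1, perturb_strength=1.0):
    """DPO-style preference loss with a per-sample dynamic beta.

    Hypothetical illustration: beta is scaled by the strength of the
    visual perturbation used to construct the rejected trajectory
    (an assumption; Sherlock's actual beta schedule may differ).
    """
    beta = base_beta * perturb_strength  # assumed scaling rule
    # Implicit-reward margin between chosen and rejected trajectories,
    # each measured relative to the frozen reference policy.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # Standard Bradley-Terry objective: -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

A larger margin (the policy prefers the chosen trajectory more strongly than the reference does) drives the loss toward zero, while the dynamic beta controls how sharply each pair is weighted.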

