

Sherlock: Self-Correcting Reasoning in Vision-Language Models

May 28, 2025
Authors: Yi Ding, Ruqi Zhang
cs.AI

Abstract

Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic beta for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated examples, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
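
The abstract mentions preference tuning with a dynamic beta. The sketch below shows one way a per-example beta can be plugged into a standard DPO-style preference loss; the function name, the beta schedule, and the "quality gap" signal are illustrative assumptions for exposition, not Sherlock's actual implementation.

```python
# Illustrative sketch only: a DPO-style preference loss where beta varies
# per preference pair ("dynamic beta"). All names and hyperparameters here
# are hypothetical and not taken from the Sherlock paper.
import torch
import torch.nn.functional as F

def dynamic_beta_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x), shape (B,)
    beta: torch.Tensor,                   # per-example beta, shape (B,)
) -> torch.Tensor:
    """Standard DPO logits, except beta is a per-pair tensor rather than a scalar."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy usage: random log-probs for 4 preference pairs; beta is scaled by a
# hypothetical per-pair quality gap (an assumption made purely for illustration).
torch.manual_seed(0)
pc, pr = torch.randn(4), torch.randn(4)
rc, rr = torch.randn(4), torch.randn(4)
quality_gap = torch.rand(4)          # hypothetical signal of how clear the preference is
beta = 0.1 + 0.4 * quality_gap       # dynamic beta in [0.1, 0.5]
print(dynamic_beta_dpo_loss(pc, pr, rc, rr, beta))
```

The design intuition, under these assumptions, is that pairs with a clearer preference signal can tolerate a larger beta (stronger pull away from the reference model), while noisier pairs keep beta small.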
