Self-Improving VLM Judges Without Human Annotations
December 2, 2025
Authors: Inna Wanyin Lin, Yushi Hu, Shuyue Stella Li, Scott Geng, Pang Wei Koh, Luke Zettlemoyer, Tim Althoff, Marjan Ghazvininejad
cs.AI
Abstract
Effective judges of Vision-Language Models (VLMs) are crucial for model development. Current methods for training VLM judges mainly rely on large-scale human preference annotations. However, such an approach is costly, and the annotations easily become obsolete as models rapidly improve. In this work, we present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method is iterative and has three stages: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments for each pair, filtering out those that do not match the expected quality levels, and (3) train on the correct judge answers and their reasoning traces. We evaluate the resulting judge on Multimodal RewardBench and VL-RewardBench across domains: correctness, preference, reasoning, safety, and visual question-answering. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on VL-RewardBench, often outperforming much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in the general, hallucination, and reasoning dimensions. The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.
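The three-stage loop described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: all function names (`generate_pairs`, `judge`, `self_train_iteration`) and the two-level quality scheme are assumptions for illustration, and the generation and judging steps are stubbed with placeholders where a VLM would be called.

```python
# Hypothetical sketch of the iterative three-stage self-training loop:
# (1) synthesize responses at known quality levels, (2) judge each one and
# keep only judgments matching the intended quality level, (3) fine-tune
# the judge on the kept reasoning traces (fine-tuning is out of scope here).
import random
from dataclasses import dataclass

QUALITY_LEVELS = ["good", "bad"]  # assumed two-level scheme for illustration

@dataclass
class Example:
    instruction: str
    response: str
    target_quality: str   # quality level the response was synthesized at
    reasoning: str = ""
    judgment: str = ""

def generate_pairs(n: int) -> list[Example]:
    """Stage 1 (stub): synthesize instruction-response pairs at varying,
    known quality levels; a real system would prompt a VLM here."""
    return [Example(f"instruction {i}", f"response {i}",
                    random.choice(QUALITY_LEVELS)) for i in range(n)]

def judge(example: Example) -> tuple[str, str]:
    """Stage 2 (stub): the current judge emits a reasoning trace and a
    verdict; a real system would run the judge model here."""
    verdict = random.choice(QUALITY_LEVELS)
    return f"reasoning trace for '{example.instruction}'", verdict

def self_train_iteration(n_samples: int) -> list[Example]:
    """One iteration: keep only examples whose judgment agrees with the
    quality level they were synthesized at, as training data for stage 3."""
    pool = generate_pairs(n_samples)
    kept = []
    for ex in pool:
        ex.reasoning, ex.judgment = judge(ex)
        if ex.judgment == ex.target_quality:   # filtering step of stage 2
            kept.append(ex)
    return kept   # stage 3 would fine-tune the judge on these examples
```

The key self-supervision signal is that the synthesizer controls each response's intended quality level, so the judge's verdict can be checked without human labels: only agreements survive to the training set.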