Flex-Judge: Think Once, Judge Anywhere

May 24, 2025
Authors: Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun
cs.AI

Abstract

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluation. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the cost of manual annotation, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to generalize robustly across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

