Flex-Judge: Think Once, Judge Anywhere

Abstract

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to generalize robustly across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling effective transfer to multimodal judgments, e.g., those involving images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.
