Flex-Judge: Think Once, Judge Anywhere

Abstract

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluation. Large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the cost of manual annotation, but they typically require extensive modality-specific training data and generalize poorly across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to generalize robustly across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling effective transfer to multimodal judgments, e.g., those involving images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge shows broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge evaluation.