Flex-Judge: Think Once, Judge Anywhere

Abstract

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluation. Using large language models (LLMs) as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduces the cost of manual annotation, but such judges typically require extensive modality-specific training data and generalize poorly across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to generalize robustly across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling effective transfer to multimodal judgments, e.g., those involving images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.