A Unified Agentic Framework for Evaluating Conditional Image Generation

Abstract

Conditional image generation has gained significant attention for its ability to personalize content. However, the field still lacks evaluation metrics that are task-agnostic, reliable, and explainable. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval uses large multimodal models (LMMs) as its core, integrates a multi-functional toolbox, and establishes a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, which enables smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's ability to identify subtle issues related to subject consistency and adherence to control guidance, indicating its potential for automating the evaluation of image generation tasks with human-level reliability.
