RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Abstract

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description while preserving the visual identity of a referenced subject image. Despite its broad downstream applicability, ranging from enhanced personalization in image generation to consistent character representation in video rendering, progress in this field is limited by the lack of reliable automatic evaluation. Existing methods assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving gains of up to 6.4 points in textual alignment and 8.5 points in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
