

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

April 24, 2025
作者: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
cs.AI

Abstract

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
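The abstract frames RefVNLI as a single model call that scores a (reference image, prompt, generated image) triple on two axes at once. Below is a minimal sketch of what such an interface could look like; the class, function, and score names are illustrative assumptions rather than the paper's released API, and the scorer body is a stub standing in for actual model inference.

```python
from dataclasses import dataclass
from PIL import Image


@dataclass
class RefVNLIScores:
    """Two judgments returned in a single prediction, as the abstract describes."""
    textual_alignment: float     # in [0, 1]: does the generated image match the prompt?
    subject_preservation: float  # in [0, 1]: is the reference subject's identity kept?


def evaluate(reference: Image.Image, prompt: str, generated: Image.Image) -> RefVNLIScores:
    """Hypothetical scorer illustrating the metric's input/output contract.

    A real implementation would run a trained model over the image pair and
    prompt; fixed values are returned here so the sketch executes end to end.
    """
    # Model inference would go here (single forward pass yielding both scores).
    return RefVNLIScores(textual_alignment=0.0, subject_preservation=0.0)


if __name__ == "__main__":
    ref = Image.new("RGB", (256, 256))  # stand-ins for a real subject image
    gen = Image.new("RGB", (256, 256))  # and a generated candidate
    scores = evaluate(ref, "a corgi surfing at sunset", gen)
    print(scores)
```

Packing both judgments into one prediction is what makes the metric cost-effective relative to running separate text-alignment and subject-similarity evaluators, per the abstract's framing.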
