RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

April 24, 2025
Authors: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
cs.AI

Abstract

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
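To make the evaluation setup concrete, below is a minimal Python sketch of the triplet-scoring interface the abstract describes: one model call over (reference image, prompt, generated image) that returns both a textual-alignment score and a subject-preservation score. All names here (`RefVNLIScorer`, `RefScores`, `score`) are hypothetical illustrations for this summary, not the paper's released API.

```python
# Minimal sketch of a RefVNLI-style dual-score interface.
# All class and function names are hypothetical; the abstract does not
# expose the paper's actual model or API.

from dataclasses import dataclass
from PIL import Image


@dataclass
class RefScores:
    textual_alignment: float     # does the generated image match the prompt?
    subject_preservation: float  # is the reference subject's identity kept?


class RefVNLIScorer:
    """Hypothetical wrapper illustrating single-prediction evaluation:
    the triplet (reference image, prompt, generated image) is scored
    jointly, rather than by two separate metrics."""

    def score(self, reference: Image.Image, prompt: str,
              generated: Image.Image) -> RefScores:
        # Stand-in for the trained model's forward pass. A real
        # implementation would encode all three inputs together and
        # emit both scores from a single prediction.
        raise NotImplementedError("replace with the trained RefVNLI model")


if __name__ == "__main__":
    scorer = RefVNLIScorer()
    ref = Image.new("RGB", (256, 256))   # placeholder reference image
    gen = Image.new("RGB", (256, 256))   # placeholder generated image
    try:
        print(scorer.score(ref, "a corgi surfing at sunset", gen))
    except NotImplementedError:
        print("plug in the trained model to get real scores")
```

The single `score` call mirrors the abstract's central claim: both criteria come from one prediction, instead of pairing a separate text-image metric with a separate identity metric and reconciling the two.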