

AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

November 25, 2025
作者: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
cs.AI

Abstract

Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
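To make the sentence-level evaluation setup concrete, below is a minimal sketch of scoring each caption sentence against an image with a CLIP model through the Hugging Face transformers library. This is an illustration only, not the AlignBench evaluation code; the model checkpoint, image path, and sentences are placeholders.

```python
# Illustrative sketch (not the AlignBench implementation): score each sentence of a
# detailed caption against an image with a CLIP-style model, the kind of per-sentence
# alignment judgment that AlignBench's correctness annotations let one evaluate.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image from an image-caption pair
sentences = [
    "A brown dog lies on a striped couch.",      # placeholder caption sentences,
    "A red ball rests next to the dog's paws.",  # not actual AlignBench data
]

with torch.no_grad():
    inputs = processor(text=sentences, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # Image-text similarity scaled by CLIP's logit scale; one score per sentence.
    scores = outputs.logits_per_image.squeeze(0)

for sentence, score in zip(sentences, scores.tolist()):
    print(f"{score:6.2f}  {sentence}")
```

A thresholded version of such per-sentence scores could serve as a simple baseline detector of incorrect sentences, which is the kind of evaluator behavior the benchmark probes.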