AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
November 25, 2025
Authors: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
cs.AI
Abstract
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
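To make the evaluation setting concrete, the sketch below illustrates the kind of per-sentence image-text scoring that a CLIP-style detector performs on AlignBench-style data: each sentence of a detailed caption is scored against the image and thresholded into match/mismatch. This is a minimal illustration, not the paper's evaluation code; the model ID, example sentences, file path, and threshold are assumptions chosen for the sketch.

```python
# Minimal sketch: sentence-level image-text scoring with an off-the-shelf CLIP model.
# The model ID, file names, and the acceptance threshold below are illustrative
# choices, not part of AlignBench itself.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("generated_image.jpg")   # e.g. output of a text-to-image model (placeholder path)
sentences = [                               # a detailed caption split into sentences (hypothetical examples)
    "A man in a red jacket is riding a bicycle.",
    "A dog is sitting in the bicycle's front basket.",
]

inputs = processor(text=sentences, images=image, return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image has shape (1, num_sentences): image-text cosine similarities
# scaled by CLIP's learned logit scale.
scores = out.logits_per_image.squeeze(0)
threshold = 20.0                            # arbitrary cutoff, for illustration only
for sent, score in zip(sentences, scores.tolist()):
    verdict = "match" if score > threshold else "mismatch"
    print(f"{score:6.2f}  {verdict}  {sent}")
```

Finding (i) in the abstract concerns exactly this kind of scorer: according to the authors, even CLIP variants tailored for compositional reasoning remain nearly blind, i.e. their similarity scores barely separate correct from incorrect sentences.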