AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation
March 31, 2026
Authors: Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu
cs.AI
Abstract
Although rapid progress in image generation has enabled a wide range of applications, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or evaluating an illustration with a vision-language model (VLM) is the naive approach, but it requires oracle-level multimodal understanding, which is unreliable for long, complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses visual question answering (VQA) to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. Specifically, we design four levels of questions derived from a logic diagram summarized from the method section of each paper; these questions probe whether the generated illustration is consistent with the paper at different scales. Our VQA-based approach yields more accurate and fine-grained evaluation of visual-logical consistency while relying less on the ability of the judge VLM. Using the high-quality AIBench, we conduct extensive experiments and find that the performance gap between models on this task is significantly larger than on general tasks, reflecting differences in their complex reasoning and high-density generation abilities. Moreover, logic and aesthetics are hard to optimize simultaneously in the way handcrafted illustrations achieve both. Additional experiments show that test-time scaling on both abilities significantly boosts performance on this task.
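As a rough illustration of how VQA answers at four question levels could be turned into a logic score, the sketch below computes per-level accuracy and an unweighted mean across levels. It is not the authors' implementation: the `VQAItem` record, the integer level numbering, the case-insensitive exact-match rule, and the unweighted averaging are all assumptions made for this example, since the abstract does not specify how answers are matched or aggregated.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class VQAItem:
    """One question about a generated illustration (hypothetical record)."""
    level: int      # question level, assumed here to be numbered 1..4
    expected: str   # reference answer derived from the paper's logic diagram
    predicted: str  # answer the judge VLM gave when shown the illustration


def per_level_accuracy(items):
    """Return {level: fraction of questions answered as expected}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.level] += 1
        # Assumed matching rule: case-insensitive exact match after trimming.
        if item.predicted.strip().lower() == item.expected.strip().lower():
            correct[item.level] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}


def logic_score(items):
    """Unweighted mean of the per-level accuracies (assumed aggregation)."""
    accuracy = per_level_accuracy(items)
    return sum(accuracy.values()) / len(accuracy) if accuracy else 0.0
```

Averaging per level rather than per question keeps a level with many questions from dominating the score; whether AIBench weights levels this way is not stated in the abstract.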