
AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

March 31, 2026
Authors: Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu
cs.AI

Abstract

Although rapid progress in image generation has enabled a wide range of applications, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or evaluating illustrations with a vision-language model (VLM) is straightforward but requires oracle-level multi-modal understanding, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses visual question answering (VQA) to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. Specifically, we design four levels of questions derived from a logic diagram summarized from the method section of each paper; these questions check whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and find that the performance gap between models on this task is significantly larger than on general tasks, reflecting differences in their complex reasoning and high-density generation abilities. Furthermore, logic and aesthetics are hard to optimize simultaneously, as handcrafted illustrations do. Additional experiments show that test-time scaling on both abilities significantly boosts performance on this task.
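
The abstract does not spell out how answers to the four levels of questions are aggregated, so the following is only a minimal sketch of how VQA-based scoring of this kind could work. Everything in it is an assumption for illustration: the `Question` record, the `score_by_level` helper, the `answer_fn` callback standing in for the judge VLM, the integer level labels, and the exact-match comparison against a reference answer are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Question(NamedTuple):
    """Hypothetical record for one VQA item (not the benchmark's real schema)."""
    level: int      # 1..4; the abstract names four levels but not their labels
    text: str       # question derived from the paper's logic diagram
    expected: str   # reference answer, e.g. "yes" or "no"


def score_by_level(
    questions: Iterable[Question],
    answer_fn: Callable[[str], str],
) -> Dict[int, float]:
    """Return per-level accuracy of the judge's answers against the references.

    `answer_fn` stands in for a judge VLM that looks at the generated
    illustration and answers one question; here it only receives the text.
    """
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for q in questions:
        total[q.level] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if answer_fn(q.text).strip().lower() == q.expected.strip().lower():
            correct[q.level] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}


if __name__ == "__main__":
    # Toy questions and a stub judge, purely for demonstration.
    demo = [
        Question(1, "Is there an encoder block?", "yes"),
        Question(1, "Is there a decoder block?", "yes"),
        Question(2, "Does the encoder feed the decoder?", "yes"),
    ]
    stub_answers = {
        "Is there an encoder block?": "Yes",
        "Is there a decoder block?": "no",
        "Does the encoder feed the decoder?": "yes",
    }
    print(score_by_level(demo, lambda text: stub_answers[text]))
```

Reporting accuracy per level rather than one pooled number keeps the different scales of consistency visible, which matches the abstract's emphasis on querying the illustration "at different scales"; how the benchmark actually combines the levels is not stated in this section.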