
Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

January 17, 2026
Authors: Honglin Lin, Chonghan Qin, Zheng Liu, Qizhi Pei, Yu Li, Zhanping Zhong, Xin Gao, Yanfeng Wang, Conghui He, Lijun Wu
cs.AI

Abstract

While synthetic data has proven effective for improving scientific reasoning in the text domain, multimodal reasoning remains constrained by the difficulty of synthesizing scientifically rigorous images. Existing Text-to-Image (T2I) models often produce outputs that are visually plausible yet scientifically incorrect, resulting in a persistent visual-logic divergence that limits their value for downstream reasoning. Motivated by recent advances in next-generation T2I models, we conduct a systematic study of scientific image synthesis across generation paradigms, evaluation, and downstream use. We analyze both direct pixel-based generation and programmatic synthesis, and propose ImgCoder, a logic-driven framework that follows an explicit "understand-plan-code" workflow to improve structural precision. To rigorously assess scientific correctness, we introduce SciGenBench, which evaluates generated images along two dimensions: information utility and logical validity. Our evaluation reveals systematic failure modes in pixel-based models and highlights a fundamental trade-off between expressiveness and precision. Finally, we show that fine-tuning Large Multimodal Models (LMMs) on rigorously verified synthetic scientific images yields consistent reasoning gains, with potential scaling trends analogous to those in the text domain, validating high-fidelity scientific synthesis as a viable path to unlocking large-scale multimodal reasoning.
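
As a concrete illustration of what a programmatic "understand-plan-code" synthesis step can look like, the sketch below renders a figure with Python/matplotlib. The spec dictionary, the projectile-motion example, and the helper names are illustrative assumptions for this sketch, not the paper's ImgCoder implementation; the point is only that the figure's geometry is derived from the stated quantities before any drawing occurs, so the visual content cannot drift from the underlying logic.

```python
# Minimal sketch of a programmatic "understand -> plan -> code" synthesis step.
# The spec fields and the projectile-motion scenario are illustrative assumptions,
# not the authors' ImgCoder pipeline.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# 1) Understand: a structured reading of the textual problem statement.
spec = {
    "concept": "projectile motion",
    "quantities": {"v0": 20.0, "angle_deg": 45.0, "g": 9.8},
    "required_elements": ["trajectory curve", "apex annotation", "axis labels"],
}

# 2) Plan: compute the values the figure must depict before any drawing happens,
#    so the rendered geometry is constrained by the physics rather than by pixels.
v0 = spec["quantities"]["v0"]
theta = np.deg2rad(spec["quantities"]["angle_deg"])
g = spec["quantities"]["g"]
t_flight = 2 * v0 * np.sin(theta) / g
t = np.linspace(0.0, t_flight, 200)
x = v0 * np.cos(theta) * t
y = v0 * np.sin(theta) * t - 0.5 * g * t**2
apex = (x[np.argmax(y)], y.max())

# 3) Code: emit the figure, including every element the plan requires.
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, y, label="trajectory")
ax.annotate(f"apex ≈ ({apex[0]:.1f} m, {apex[1]:.1f} m)",
            xy=apex, xytext=(apex[0], apex[1] + 2),
            ha="center", arrowprops={"arrowstyle": "->"})
ax.set_xlabel("horizontal distance (m)")
ax.set_ylabel("height (m)")
ax.legend()
fig.savefig("projectile.png", dpi=150)
```

In this sketch the "code" stage only draws quantities computed in the "plan" stage, which is one way to avoid the visual-logic divergence the abstract attributes to direct pixel-based generation.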