SO-Bench: A Structural Output Evaluation of Multimodal LLMs

November 23, 2025
Authors: Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
cs.AI

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of the visual structural output capabilities of MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains (UI screens, natural images, documents, and charts), SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in producing accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We plan to make the benchmark available to the community.
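
To make the notion of a schema-compliant output concrete, the sketch below shows how a model's prediction can be checked against a JSON Schema. This is a minimal, hypothetical illustration: the chart schema, the field names, and the use of the Python `jsonschema` package are our own assumptions and are not taken from SO-Bench itself.

```python
# Hypothetical illustration of schema-grounded structured output.
# The schema and model output below are invented for exposition; SO-Bench's
# actual schemas and evaluation protocol are described in the paper.
from jsonschema import validate, ValidationError

# An example JSON Schema that a benchmark item might pair with a chart image.
chart_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "x_axis": {"type": "string"},
        "series": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string"},
                    "values": {"type": "array", "items": {"type": "number"}},
                },
                "required": ["label", "values"],
            },
        },
    },
    "required": ["title", "series"],
}

# A model's structured prediction for the image.
model_output = {
    "title": "Quarterly revenue",
    "series": [{"label": "2024", "values": [1.2, 1.4, 1.1, 1.6]}],
}

try:
    validate(instance=model_output, schema=chart_schema)
    print("Output is schema-compliant.")
except ValidationError as err:
    print(f"Schema violation: {err.message}")
```

Note that schema validity alone is not the whole evaluation: the abstract emphasizes that outputs must be both accurate and schema-compliant, so a well-formed prediction with wrong values would still count against a model.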