
START: Spatial and Textual Learning for Chart Understanding

December 8, 2025
Authors: Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu
cs.AI

Abstract

Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START (Spatial and Textual learning for chART understanding). Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset, generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a large language model (LLM) to ascertain the positions of the chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains over the base models across model sizes and benchmarks, and surpasses the prior state of the art by a clear margin. Code, data, and models will be publicly available.