

RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

March 26, 2026
Authors: Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu, Leqi Zheng, Yiran Yang, Jianke Zhang, Qingbin Li, Shannan Yan, Zhetong Li, Changguo Jia, Junfei Wu, Zilei Wang, Qiang Liu, Liang Wang
cs.AI

Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at https://github.com/Speakn0w/RealChart2Code.