RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and to assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting the models' struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at https://github.com/Speakn0w/RealChart2Code.
