Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction

Abstract

Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires an MLLM to generate executable code that reproduces a given chart, which demands not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instructions. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description instructions and difference instructions. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages, initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
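
The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every callable (`describe`, `generate`, `render`, `diff`, `refine`, `score`) is a hypothetical stand-in for an MLLM call or a chart renderer, and the keep-the-best-candidate acceptance rule, the similarity score, and the fixed round limit are assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RefineResult:
    code: Any      # best code found so far
    score: float   # similarity of its rendering to the reference
    rounds: int    # number of refinement rounds that were run


def iterative_refine(
    reference: Any,
    describe: Callable[[Any], Any],          # reference chart -> description instruction
    generate: Callable[[Any, Any], Any],     # (reference, description) -> initial code
    render: Callable[[Any], Any],            # code -> rendered chart
    diff: Callable[[Any, Any], Any],         # (reference, rendered) -> difference instruction
    refine: Callable[[Any, Any, Any], Any],  # (code, description, difference) -> new code
    score: Callable[[Any, Any], float],      # (reference, rendered) -> similarity (assumed)
    max_rounds: int = 3,                     # assumed fixed budget
) -> RefineResult:
    # Stage 1: initial code generation guided by the description instruction.
    description = describe(reference)
    best_code = generate(reference, description)
    best_chart = render(best_code)
    best_score = score(reference, best_chart)

    # Stage 2: iterative refinement guided by the difference instruction.
    rounds = 0
    for _ in range(max_rounds):
        difference = diff(reference, best_chart)
        candidate = refine(best_code, description, difference)
        chart = render(candidate)
        candidate_score = score(reference, chart)
        rounds += 1
        # Assumption: keep a candidate only if it scores strictly higher.
        if candidate_score > best_score:
            best_code, best_chart, best_score = candidate, chart, candidate_score
    return RefineResult(best_code, best_score, rounds)


def toy_demo() -> RefineResult:
    """Run the loop with toy stand-ins: charts and 'code' are plain dicts."""
    reference = {"type": "bar", "title": "Sales", "color": "blue"}

    def describe(ref):
        return f"chart type: {ref['type']}"

    def generate(ref, desc):
        # The toy 'model' only recovers the chart type at first.
        return {"type": desc.split(": ")[1]}

    def render(code):
        return dict(code)

    def diff(ref, chart):
        # List the properties that do not match the reference.
        return [k for k in ref if chart.get(k) != ref[k]]

    def refine(code, desc, difference):
        fixed = dict(code)
        if difference:
            key = difference[0]
            fixed[key] = reference[key]  # toy shortcut: fix one discrepancy per round
        return fixed

    def score(ref, chart):
        return sum(chart.get(k) == v for k, v in ref.items()) / len(ref)

    return iterative_refine(reference, describe, generate, render, diff, refine, score)
```

In the toy run, each round repairs one mismatched property, so the score rises step by step. In a real setting the stand-ins would wrap calls to a model such as Qwen2-VL or GPT-4o and execute the generated plotting code to obtain the rendered chart.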
