VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation

Abstract

Large language models (LLMs) often struggle with visualization tasks such as plotting diagrams and charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction-tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, which enable models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
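
The self-debug idea mentioned in the abstract (run the generated code, and on failure hand the runtime error back for another attempt) can be illustrated with a minimal sketch. This is not the paper's evaluation harness: the names `run_snippet`, `self_debug`, and `toy_repair`, as well as the `max_rounds=3` default, are assumptions made for illustration, and `toy_repair` is a hypothetical stand-in for a call to the model. A real setup would presumably execute plotting code in a sandbox and also check the rendered figure.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_snippet(code: str) -> Optional[str]:
    """Execute code in a fresh namespace.

    Returns None on success, otherwise the traceback text.
    """
    try:
        exec(compile(code, "<generated>", "exec"), {"__name__": "__generated__"})
        return None
    except Exception:
        return traceback.format_exc()


def self_debug(
    code: str,
    repair: Callable[[str, str], str],
    max_rounds: int = 3,  # assumed round budget, not taken from the abstract
) -> Tuple[str, bool, int]:
    """Re-run failing code, passing each traceback to `repair`, up to max_rounds times."""
    error = run_snippet(code)
    rounds = 0
    while error is not None and rounds < max_rounds:
        code = repair(code, error)
        rounds += 1
        error = run_snippet(code)
    return code, error is None, rounds


def toy_repair(code: str, error: str) -> str:
    """Hypothetical stand-in for an LLM repair call: patches one known typo."""
    if "NameError" in error:
        return code.replace("vlaues", "values")
    return code


# A snippet with a misspelled variable name fails on first execution.
broken = "values = [1, 2, 3]\ntotal = sum(vlaues)\n"
fixed, ok, rounds = self_debug(broken, toy_repair)
```

Keeping the repair step as a plain callable means the same loop can wrap any model or rule-based fixer, and the returned round count gives a simple way to report how many attempts a repair needed.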
