Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs

Abstract

Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. However, reasoning capabilities remain limited, particularly for smaller VLMs, while those of large language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA, our method obtains state-of-the-art performance when applied to the PaLI3-5B VLM by chen2023pali3, while also enabling much better performance on PlotQA and FigureQA. We first improve the chart representation by continuing the pre-training stage using an improved version of the chart-to-table translation task by liu2023deplot. We then propose constructing a dataset 20x larger than the original training set. To improve general reasoning capabilities and numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned using the multitask loss introduced by hsieh2023distilling. Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant compared to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt (chen2023program), our model outperforms the recently introduced Gemini Ultra and GPT-4V.
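As a rough illustration of the multitask fine-tuning objective mentioned in the abstract, the sketch below combines an answer-prediction loss with a rationale-generation loss. It assumes the commonly described form of such an objective, a sum of the two terms with a weight on the rationale term. The abstract does not state the weighting or implementation, so the function names, the `rationale_weight` parameter, and the use of per-token probabilities as inputs are all hypothetical choices made for illustration.

```python
import math


def token_nll(token_probs):
    """Mean negative log-likelihood of a target sequence.

    token_probs: probabilities the model assigned to each target token
    (hypothetical input format chosen for this sketch).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def multitask_loss(answer_token_probs, rationale_token_probs, rationale_weight=1.0):
    """Answer loss plus a weighted rationale loss.

    Assumption: the two tasks share one model and their losses are summed;
    rationale_weight is an illustrative hyperparameter, not a value from the paper.
    """
    answer_loss = token_nll(answer_token_probs)
    rationale_loss = token_nll(rationale_token_probs)
    return answer_loss + rationale_weight * rationale_loss


# Example with made-up probabilities for a short answer and a longer rationale.
loss = multitask_loss([0.9, 0.8], [0.7, 0.6, 0.9], rationale_weight=0.5)
print(f"combined loss: {loss:.4f}")
```

Under this formulation the rationale is only a training target. At inference time the model can be asked for the answer alone, which would be consistent with the abstract's claim that inference time stays constant relative to the baseline.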
