Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs

Abstract

Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. However, their reasoning capabilities remain limited, particularly for smaller VLMs, while those of large language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA, our method obtains state-of-the-art performance when applied to the PaLI3-5B VLM by chen2023pali3, while also enabling much better performance on PlotQA and FigureQA. We first improve the chart representation by continuing the pre-training stage using an improved version of the chart-to-table translation task by liu2023deplot. We then propose constructing a dataset 20x larger than the original training set. To improve general reasoning capabilities and numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned using the multitask loss introduced by hsieh2023distilling. Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant compared to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt (chen2023program), our model outperforms the recently introduced Gemini Ultra and GPT-4V.
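The multitask fine-tuning step can be illustrated with a small example. The sketch below assumes the objective is a weighted sum of two cross-entropy terms, one for predicting the answer and one for generating the rationale, in the spirit of hsieh2023distilling. The function names, the `rationale_weight` parameter, and the toy token probabilities are hypothetical and are not taken from the paper.

```python
import math


def mean_nll(token_probs):
    """Mean negative log-likelihood of the reference tokens.

    token_probs: probabilities the model assigned to each reference token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def multitask_loss(answer_probs, rationale_probs, rationale_weight=1.0):
    """Assumed form: loss = answer loss + weight * rationale loss."""
    answer_loss = mean_nll(answer_probs)
    rationale_loss = mean_nll(rationale_probs)
    return answer_loss + rationale_weight * rationale_loss


# Toy example: made-up probabilities for a 2-token answer
# and a 4-token rationale.
loss = multitask_loss([0.9, 0.8], [0.7, 0.6, 0.9, 0.5], rationale_weight=0.5)
print(f"combined loss: {loss:.4f}")
```

Under a setup of this kind the rationale term acts only as a training signal, which would be consistent with the claim that inference time stays the same as the baseline when only the answer is decoded.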
