Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
March 19, 2024
Authors: Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma
cs.AI
Abstract
Vision-language models (VLMs) are achieving increasingly strong performance
on multimodal tasks. However, reasoning capabilities remain limited,
particularly for smaller VLMs, while those of large language models (LLMs) have
seen numerous improvements. We propose a technique to transfer capabilities
from LLMs to VLMs. On the recently introduced ChartQA, our method obtains
state-of-the-art performance when applied to the PaLI3-5B VLM
[chen2023pali3], while also enabling much better performance on PlotQA
and FigureQA.
We first improve the chart representation by continuing the pre-training
stage using an improved version of the chart-to-table translation task
[liu2023deplot]. We then propose constructing a dataset 20x larger than
the original training set. To improve general reasoning capabilities and
numerical operations, we synthesize reasoning traces using the table
representation of charts. Lastly, our model is fine-tuned using the multitask
loss introduced in [hsieh2023distilling].
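To make the multitask objective more concrete, here is a minimal sketch of its general shape: one loss term for predicting the answer, plus a weighted term for generating the rationale. This is an illustration under assumptions, not the authors' implementation. The function names, the `rationale_weight` parameter and its default value, and the use of per-token target probabilities as input are all hypothetical.

```python
import math


def token_nll(target_probs):
    """Mean negative log-likelihood over a sequence.

    `target_probs` holds the probability the model assigned to each
    ground-truth token (hypothetical input format for this sketch).
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def multitask_loss(answer_probs, rationale_probs, rationale_weight=1.0):
    """Answer-prediction loss plus a weighted rationale-generation loss.

    `rationale_weight` is an assumed hyperparameter, not a value from the paper.
    """
    answer_loss = token_nll(answer_probs)
    rationale_loss = token_nll(rationale_probs)
    return answer_loss + rationale_weight * rationale_loss


# Example: a confident answer and a less confident rationale.
loss = multitask_loss([0.9, 0.8], [0.5, 0.5, 0.5], rationale_weight=1.0)
```

In a setup like this, the rationale term only shapes training. A model trained this way can still be asked for the answer alone at inference time.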
Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B
without using an upstream OCR system, while keeping inference time constant
compared to the PaLI3-5B baseline. When rationales are further refined with a
simple program-of-thought prompt [chen2023program], our model outperforms
the recently introduced Gemini Ultra and GPT-4V.
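As a rough illustration of the program-of-thought idea, the sketch below treats a generated rationale as a short Python program and executes it to obtain the numeric answer, instead of trusting arithmetic written out in free text. The helper `run_program_of_thought`, the variable name `ans`, and the chart values are all hypothetical and are not taken from the paper.

```python
def run_program_of_thought(program: str, result_name: str = "ans"):
    """Execute a generated program and return the variable holding the answer.

    Emptying the builtins limits what the program can call, but this is NOT
    a real sandbox; untrusted code needs proper isolation.
    """
    namespace = {}
    exec(program, {"__builtins__": {}}, namespace)
    return namespace[result_name]


# Hypothetical program a model might emit for the question
# "By what percentage did sales grow from 2019 to 2020?"
generated_program = """
sales_2019 = 120.0
sales_2020 = 150.0
ans = (sales_2020 - sales_2019) / sales_2019 * 100
"""

growth = run_program_of_thought(generated_program)
print(growth)  # 25.0
```

Handing the arithmetic to an interpreter means the model only has to read the right values off the chart and set up the computation.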