Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
March 19, 2024
作者: Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma
cs.AI
Abstract
Vision-language models (VLMs) are achieving increasingly strong performance
on multimodal tasks. However, reasoning capabilities remain limited,
particularly for smaller VLMs, while those of large language models (LLMs) have
seen numerous improvements. We propose a technique to transfer capabilities
from LLMs to VLMs. On the recently introduced ChartQA, our method obtains
state-of-the-art performance when applied to the PaLI3-5B VLM of
Chen et al. (2023), while also enabling much better performance on PlotQA
and FigureQA.
We first improve the chart representation by continuing the pre-training
stage using an improved version of the chart-to-table translation task of
Liu et al. (2023). We then propose constructing a 20x larger dataset than
the original training set. To improve general reasoning capabilities and
numerical operations, we synthesize reasoning traces using the table
representation of charts. Lastly, our model is fine-tuned using the multitask
loss introduced by Hsieh et al. (2023).
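The multitask objective above can be sketched as follows. This is a minimal illustration of the answer-plus-rationale formulation of Hsieh et al. (2023); the function name `multitask_loss` and the mixing weight `lam` (default 1.0) are assumptions for illustration, not values from the paper:

```python
# Minimal sketch of a distilling-step-by-step style multitask loss
# (Hsieh et al., 2023): the model is supervised on both the final answer
# and the synthesized rationale, and the two loss terms are combined
# with a mixing weight. `lam = 1.0` is an assumed default.
def multitask_loss(answer_loss: float, rationale_loss: float, lam: float = 1.0) -> float:
    # L = L_answer + lam * L_rationale
    return answer_loss + lam * rationale_loss
```

In training, `answer_loss` and `rationale_loss` would each be a cross-entropy term over the model's two output targets; the combined scalar is what gets backpropagated.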
Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B
without using an upstream OCR system, while keeping inference time constant
compared to the PaLI3-5B baseline. When rationales are further refined with a
simple program-of-thought prompt (Chen et al., 2023), our model outperforms
the recently introduced Gemini Ultra and GPT-4V.
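A program-of-thought refinement step of this kind can be sketched as below. The emitted rationale string and the convention of storing the result in `ans` are hypothetical illustrations, not output or code from the paper:

```python
# Sketch of program-of-thought refinement (Chen et al., 2023): the model
# emits a short Python program as its rationale, and the final numeric
# answer comes from executing that program rather than from free-text
# arithmetic, which reduces calculation errors.
def execute_rationale(code: str) -> float:
    scope = {}
    exec(code, {}, scope)  # run the emitted program in an isolated namespace
    return scope["ans"]    # assumed convention: answer is stored in `ans`

# Illustrative emitted rationale for "average of the two bars (12.5, 7.5)":
rationale = "total = 12.5 + 7.5\nans = total / 2"
```

Here the execution of `rationale` yields 10.0, so the numeric answer is computed by the interpreter instead of by the model's text generation.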