Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
March 19, 2024
作者: Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma
cs.AI
Abstract
Vision-language models (VLMs) are achieving increasingly strong performance
on multimodal tasks. However, reasoning capabilities remain limited,
particularly for smaller VLMs, while those of large language models (LLMs) have
seen numerous improvements. We propose a technique to transfer capabilities
from LLMs to VLMs. On the recently introduced ChartQA, our method obtains
state-of-the-art performance when applied to the PaLI3-5B VLM of
Chen et al. (2023), while also enabling much better performance on PlotQA
and FigureQA.
We first improve the chart representation by continuing the pre-training
stage using an improved version of the chart-to-table translation task of
Liu et al. (2023). We then propose constructing a 20x larger dataset than
the original training set. To improve general reasoning capabilities and
numerical operations, we synthesize reasoning traces using the table
representation of charts. Lastly, our model is fine-tuned using the multitask
loss introduced by Hsieh et al. (2023).
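The multitask objective above can be sketched as follows. This is a minimal illustration of the answer-plus-rationale formulation of Hsieh et al. (2023); the function name `multitask_loss` and the mixing weight `lam` (default 1.0) are assumptions for illustration, not values from the paper:

```python
# Minimal sketch of a distilling-step-by-step style multitask loss
# (Hsieh et al., 2023): the model is supervised on both the final answer
# and the synthesized rationale, and the two loss terms are combined
# with a mixing weight. `lam = 1.0` is an assumed default.
def multitask_loss(answer_loss: float, rationale_loss: float, lam: float = 1.0) -> float:
    # L = L_answer + lam * L_rationale
    return answer_loss + lam * rationale_loss
```

In training, `answer_loss` and `rationale_loss` would each be a cross-entropy term over the model's two output targets; the combined scalar is what gets backpropagated.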
Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B
without using an upstream OCR system, while keeping inference time constant
compared to the PaLI3-5B baseline. When rationales are further refined with a
simple program-of-thought prompt (Chen et al., 2023), our model outperforms
the recently introduced Gemini Ultra and GPT-4V.
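A program-of-thought refinement step of this kind can be sketched as below. The emitted rationale string and the convention of storing the result in `ans` are hypothetical illustrations, not output or code from the paper:

```python
# Sketch of program-of-thought refinement (Chen et al., 2023): the model
# emits a short Python program as its rationale, and the final numeric
# answer comes from executing that program rather than from free-text
# arithmetic, which reduces calculation errors.
def execute_rationale(code: str) -> float:
    scope = {}
    exec(code, {}, scope)  # run the emitted program in an isolated namespace
    return scope["ans"]    # assumed convention: answer is stored in `ans`

# Illustrative emitted rationale for "average of the two bars (12.5, 7.5)":
rationale = "total = 12.5 + 7.5\nans = total / 2"
```

Here the execution of `rationale` yields 10.0, so the numeric answer is computed by the interpreter instead of by the model's text generation.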