POINTS: Improving Your Vision-language Model with Affordable Strategies
September 7, 2024
Authors: Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou
cs.AI
Abstract
In recent years, vision-language models have made significant strides,
excelling in tasks like optical character recognition and geometric
problem-solving. However, several critical issues remain: 1) Proprietary models
often lack transparency about their architectures, while open-source models
need more detailed ablations of their training strategies. 2) Pre-training data
in open-source works is under-explored, with datasets added empirically, making
the process cumbersome. 3) Fine-tuning often focuses on adding datasets,
leading to diminishing returns. To address these issues, we propose the
following contributions: 1) We trained a robust baseline model using the latest
advancements in vision-language models, introducing effective improvements and
conducting comprehensive ablation and validation for each technique. 2)
Inspired by recent work on large language models, we filtered pre-training data
using perplexity, selecting the lowest perplexity data for training. This
approach allowed us to train on a curated 1M dataset, achieving competitive
performance. 3) During visual instruction tuning, we used model soup on
different datasets when adding more datasets yielded marginal improvements.
These innovations resulted in a 9B parameter model that performs competitively
with state-of-the-art models. Our strategies are efficient and lightweight,
making them easily adoptable by the community.
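
The perplexity-based filtering in contribution 2 amounts to scoring each candidate pre-training sample with a language model and keeping only the lowest-perplexity portion. Below is a minimal sketch of that idea, assuming a HuggingFace causal LM; the scoring model (`gpt2`), the `keep` budget, and the function names are illustrative assumptions, not the paper's actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative scoring model; the paper does not specify which LM scores the data.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Compute the language-model perplexity of one text sample."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # With labels == input_ids, the model returns the mean cross-entropy loss,
    # whose exponential is the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def select_lowest_perplexity(samples: list[str], keep: int) -> list[str]:
    """Rank candidate pre-training texts by perplexity; keep the lowest-scoring ones."""
    return sorted(samples, key=perplexity)[:keep]

# Hypothetical usage: curate a 1M-sample subset from a larger candidate pool.
# curated = select_lowest_perplexity(candidate_texts, keep=1_000_000)
```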
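Likewise, "model soup" in contribution 3 refers to averaging the weights of several models fine-tuned on different instruction-tuning dataset mixtures, rather than stacking ever more data into a single run. A minimal sketch of uniform weight averaging follows; the helper name and checkpoint handling are assumptions, not the authors' code.

```python
import torch

def model_soup(state_dicts: list[dict]) -> dict:
    """Uniform model soup: average the parameters of checkpoints
    fine-tuned on different instruction-tuning dataset mixtures."""
    souped = {}
    for key in state_dicts[0]:
        # Cast to float so integer buffers can be averaged; in practice such
        # buffers are often copied from one checkpoint instead.
        souped[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return souped

# Hypothetical usage: average checkpoints from different fine-tuning mixes.
# soup = model_soup([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(soup)
```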