POINTS: Improving Your Vision-language Model with Affordable Strategies
September 7, 2024
Authors: Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou
cs.AI
Abstract
In recent years, vision-language models have made significant strides,
excelling in tasks like optical character recognition and geometric
problem-solving. However, several critical issues remain: 1) Proprietary models
often lack transparency about their architectures, while open-source models
need more detailed ablations of their training strategies. 2) Pre-training data
in open-source works is under-explored, with datasets added empirically, making
the process cumbersome. 3) Fine-tuning often focuses on adding datasets,
leading to diminishing returns. To address these issues, we propose the
following contributions: 1) We trained a robust baseline model using the latest
advancements in vision-language models, introducing effective improvements and
conducting comprehensive ablation and validation for each technique. 2)
Inspired by recent work on large language models, we filtered pre-training data
using perplexity, selecting the lowest perplexity data for training. This
approach allowed us to train on a curated 1M dataset, achieving competitive
performance. 3) During visual instruction tuning, we used model soup on
different datasets when adding more datasets yielded marginal improvements.
These innovations resulted in a 9B parameter model that performs competitively
with state-of-the-art models. Our strategies are efficient and lightweight,
making them easily adoptable by the community.
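
The perplexity-based filtering in contribution 2 amounts to scoring each candidate pre-training sample with a language model and keeping only the lowest-perplexity portion. Below is a minimal sketch of that idea, assuming a HuggingFace causal LM; the scoring model (`gpt2`), the `keep` budget, and the function names are illustrative assumptions, not the paper's actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative scoring model; the paper does not specify which LM scores the data.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Compute the language-model perplexity of one text sample."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # With labels == input_ids, the model returns the mean cross-entropy loss,
    # whose exponential is the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def select_lowest_perplexity(samples: list[str], keep: int) -> list[str]:
    """Rank candidate pre-training texts by perplexity; keep the lowest-scoring ones."""
    return sorted(samples, key=perplexity)[:keep]

# Hypothetical usage: curate a 1M-sample subset from a larger candidate pool.
# curated = select_lowest_perplexity(candidate_texts, keep=1_000_000)
```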
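Likewise, "model soup" in contribution 3 refers to averaging the weights of several models fine-tuned on different instruction-tuning dataset mixtures, rather than stacking ever more data into a single run. A minimal sketch of uniform weight averaging follows; the helper name and checkpoint handling are assumptions, not the authors' code.

```python
import torch

def model_soup(state_dicts: list[dict]) -> dict:
    """Uniform model soup: average the parameters of checkpoints
    fine-tuned on different instruction-tuning dataset mixtures."""
    souped = {}
    for key in state_dicts[0]:
        # Cast to float so integer buffers can be averaged; in practice such
        # buffers are often copied from one checkpoint instead.
        souped[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return souped

# Hypothetical usage: average checkpoints from different fine-tuning mixes.
# soup = model_soup([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(soup)
```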