POINTS: Improving Your Vision-language Model with Affordable Strategies
September 7, 2024
Authors: Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou
cs.AI
Abstract
In recent years, vision-language models have made significant strides,
excelling in tasks like optical character recognition and geometric
problem-solving. However, several critical issues remain: 1) Proprietary models
often lack transparency about their architectures, while open-source models
need more detailed ablations of their training strategies. 2) Pre-training data
in open-source works is under-explored, with datasets added empirically, making
the process cumbersome. 3) Fine-tuning often focuses on adding datasets,
leading to diminishing returns. To address these issues, we propose the
following contributions: 1) We trained a robust baseline model using the latest
advancements in vision-language models, introducing effective improvements and
conducting comprehensive ablation and validation for each technique. 2)
Inspired by recent work on large language models, we filtered pre-training data
using perplexity, selecting the lowest-perplexity data for training. This
approach allowed us to train on a curated 1M-sample dataset, achieving competitive
performance. 3) During visual instruction tuning, we used model soup on
different datasets when adding more datasets yielded marginal improvements.
These innovations resulted in a 9B parameter model that performs competitively
with state-of-the-art models. Our strategies are efficient and lightweight,
making them easily adoptable by the community.
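The two strategies named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, perplexity is computed from per-token log-probabilities assumed to come from a reference language model, and the "model soup" is the standard uniform weight average over checkpoints fine-tuned on different datasets.

```python
import math

def filter_by_perplexity(examples, token_log_probs, keep_fraction=0.5):
    """Rank pre-training examples by perplexity and keep the lowest-
    perplexity fraction. `token_log_probs[i]` holds the per-token
    log-probabilities a reference LM assigns to `examples[i]`."""
    scored = []
    for text, lps in zip(examples, token_log_probs):
        nll = -sum(lps) / len(lps)          # mean negative log-likelihood
        scored.append((math.exp(nll), text))  # perplexity = exp(mean NLL)
    scored.sort(key=lambda pair: pair[0])   # ascending perplexity
    cutoff = max(1, int(len(scored) * keep_fraction))
    return [text for _, text in scored[:cutoff]]

def model_soup(checkpoints):
    """Uniform model soup: average the weights of several models
    fine-tuned on different instruction datasets. Each checkpoint is
    modeled here as a dict mapping parameter names to lists of floats."""
    soup = {}
    for name in checkpoints[0]:
        stacked = [ckpt[name] for ckpt in checkpoints]
        soup[name] = [sum(vals) / len(vals) for vals in zip(*stacked)]
    return soup
```

For example, with `keep_fraction=0.5` the filter discards the noisiest half of the corpus, and `model_soup` turns several instruction-tuned checkpoints into a single model at no extra inference cost, which is the appeal when adding further fine-tuning datasets yields only marginal gains.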