What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
July 5, 2023
Authors: Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
cs.AI
Abstract
Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions grounded in images. However, the performance of these models relies heavily on design choices such as network structure, training data, and training strategy, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, both quantitative and qualitative, of training such models. We implement over 20 variants under controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute what is, to the best of our knowledge, the first comprehensive evaluation set covering both image and video tasks, built through crowd-sourcing. Based on our findings, we present Lynx, which achieves the most accurate multi-modal understanding while retaining the best multi-modal generation ability among existing open-source GPT4-style models.