What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
July 5, 2023
Authors: Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
cs.AI
Abstract
Recent advances in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions about images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, both quantitative and qualitative, of training such models. We implement over 20 variants under controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data composition and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute, to the best of our knowledge, the first comprehensive evaluation set covering both image and video tasks, built through crowd-sourcing. Based on our findings, we present Lynx, which achieves the most accurate multi-modal understanding while retaining the best multi-modal generation ability among existing open-source GPT4-style models.
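
For readers unfamiliar with the GPT4-style model family this study ablates, the sketch below illustrates one common design in this family: image features are resampled into a fixed number of "visual tokens", projected into the LLM's embedding space, and prepended to the text embeddings before standard next-token prediction. This is a minimal illustration of the general template, not the paper's actual Lynx implementation; the module names, dimensions, and attention-based resampler are all assumptions.

```python
import torch
import torch.nn as nn

class PrefixMultimodalLM(nn.Module):
    """Illustrative sketch of a GPT4-style multimodal LM: visual features
    become a soft prefix for a decoder-only LLM. All sizes and module
    choices here are assumptions, not the Lynx configuration."""

    def __init__(self, llm, vision_encoder,
                 vision_dim=1024, llm_dim=4096, num_visual_tokens=32):
        super().__init__()
        self.llm = llm                        # decoder-only LLM (HF-style API assumed)
        self.vision_encoder = vision_encoder  # e.g. a ViT returning patch features
        # Learned queries + cross-attention act as a resampler that maps a
        # variable number of patches to a fixed number of visual tokens.
        self.query = nn.Parameter(torch.randn(num_visual_tokens, vision_dim))
        self.resampler = nn.MultiheadAttention(vision_dim, num_heads=8,
                                               batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, pixel_values, input_ids):
        # 1) Encode the image into patch features: (B, P, vision_dim).
        patches = self.vision_encoder(pixel_values)
        # 2) Resample to a fixed-length visual prefix: (B, N, vision_dim).
        q = self.query.unsqueeze(0).expand(patches.size(0), -1, -1)
        visual_tokens, _ = self.resampler(q, patches, patches)
        # 3) Project into the LLM embedding space and prepend to text embeddings.
        prefix = self.proj(visual_tokens)                      # (B, N, llm_dim)
        text_emb = self.llm.get_input_embeddings()(input_ids)  # (B, T, llm_dim)
        inputs_embeds = torch.cat([prefix, text_emb], dim=1)
        # 4) Standard language modeling over the concatenated sequence.
        return self.llm(inputs_embeds=inputs_embeds)
```

Variations on this template, such as which vision encoder and LLM backbone to use, how visual features are injected, what data is sampled during training, and how instruction prompts are phrased, are exactly the kinds of design choices the paper's 20+ controlled variants compare.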