What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Abstract

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models relies heavily on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, both quantitative and qualitative, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute what is, to the best of our knowledge, the first comprehensive evaluation set including both image and video tasks, collected through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-source GPT4-style models.
