POINTS: Improving Your Vision-language Model with Affordable Strategies

Abstract

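In recent years, vision-language models have made significant strides, excelling at tasks such as optical character recognition and geometric problem solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source work is under-explored, with datasets added empirically, which makes the process cumbersome. 3) Fine-tuning often focuses on adding datasets, which leads to diminishing returns. To address these issues, we make the following contributions: 1) We trained a robust baseline model using the latest advances in vision-language models, introduced effective improvements, and conducted comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data by perplexity, selecting the lowest-perplexity data for training. This approach allowed us to train on a curated 1M dataset and achieve competitive performance. 3) During visual instruction tuning, we applied model soup to models fine-tuned on different datasets when adding more datasets yielded only marginal improvements. These innovations resulted in a 9B-parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easy for the community to adopt.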
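
The abstract names two lightweight strategies, perplexity-based data filtering and model soup, without giving implementation details. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes per-token log-probabilities have already been computed by some reference language model, represents checkpoints as plain dictionaries of float lists, and uses uniform weight averaging as the soup variant. The abstract does not say which scoring model or which soup variant was used, and the function names `perplexity`, `select_lowest_perplexity`, and `uniform_soup` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_lowest_perplexity(samples: List[dict], keep: int) -> List[dict]:
    """Return the `keep` samples with the lowest perplexity.

    Assumption: each sample carries a "token_logprobs" list produced
    beforehand by a reference language model.
    """
    ranked = sorted(samples, key=lambda s: perplexity(s["token_logprobs"]))
    return ranked[:keep]


def uniform_soup(
    checkpoints: List[Dict[str, List[float]]],
) -> Dict[str, List[float]]:
    """Average parameters element-wise across checkpoints.

    Assumption: all checkpoints share the same architecture, so they
    have identical parameter names and shapes. Uniform averaging is one
    possible soup variant, chosen here only for illustration.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share parameter names")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in names
    }


if __name__ == "__main__":
    # Toy data: made-up log-probabilities, not results from the paper.
    samples = [
        {"id": "a", "token_logprobs": [-0.1, -0.2, -0.3]},
        {"id": "b", "token_logprobs": [-2.0, -1.0]},
        {"id": "c", "token_logprobs": [-0.5, -0.5]},
    ]
    print([s["id"] for s in select_lowest_perplexity(samples, keep=2)])

    # Two toy "models" fine-tuned on different instruction datasets.
    print(uniform_soup([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

In a real pipeline the filter would run once over the candidate pre-training pool, and the soup would be built from checkpoints that start from the same initialization, since averaging weights is only meaningful when the models share an architecture.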


