Chameleon: Mixed-Modal Early-Fusion Foundation Models

Abstract

We present Chameleon, a family of early-fusion, token-based, mixed-modal models capable of understanding and generating images and text in arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored to the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini Pro, and performs non-trivial image generation, all in a single model. According to human judgments on a new long-form mixed-modal generation evaluation, in which either the prompt or the outputs contain mixed sequences of images and text, it also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V. Chameleon marks a significant step forward in the unified modeling of full multimodal documents.
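As a rough illustration of what "early-fusion, token-based" can mean in practice, the following minimal sketch maps text token ids and discrete image codes into one shared id space and concatenates them into a single sequence. It is not the Chameleon implementation: the vocabulary sizes, the sentinel tokens `BOI` and `EOI`, and the functions `fuse` and `split` are all hypothetical names and values chosen for this example.

```python
from typing import List, Tuple

# All constants below are illustrative assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 1000       # hypothetical number of text token ids
IMAGE_CODEBOOK_SIZE = 512    # hypothetical number of discrete image codes
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image sentinel
EOI = BOI + 1                                # hypothetical end-of-image sentinel

Segment = Tuple[str, List[int]]  # ("text" | "image", ids local to that modality)


def fuse(segments: List[Segment]) -> List[int]:
    """Flatten interleaved text/image segments into one shared-vocabulary sequence."""
    out: List[int] = []
    for kind, ids in segments:
        if kind == "text":
            if any(not 0 <= i < TEXT_VOCAB_SIZE for i in ids):
                raise ValueError("text id out of range")
            out.extend(ids)
        elif kind == "image":
            if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in ids):
                raise ValueError("image code out of range")
            out.append(BOI)
            # Shift image codes past the text ids so both share one id space.
            out.extend(TEXT_VOCAB_SIZE + c for c in ids)
            out.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return out


def split(sequence: List[int]) -> List[Segment]:
    """Invert fuse(): recover the interleaved segments from a fused sequence."""
    segments: List[Segment] = []
    i = 0
    while i < len(sequence):
        tok = sequence[i]
        if tok == BOI:
            end = sequence.index(EOI, i + 1)  # raises ValueError if unterminated
            codes = [t - TEXT_VOCAB_SIZE for t in sequence[i + 1:end]]
            segments.append(("image", codes))
            i = end + 1
        else:
            if not segments or segments[-1][0] != "text":
                segments.append(("text", []))
            segments[-1][1].append(tok)
            i += 1
    return segments


doc = [("text", [5, 7]), ("image", [0, 511]), ("text", [9])]
fused = fuse(doc)
```

Under these assumptions, a single sequence model could be trained on `fused` with one next-token objective, which is why text and images can appear in any order in both the prompt and the output.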
