DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

Abstract

Most existing multi-modal models cannot adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues. As a result, they face substantial constraints in resource allocation for training and in data accessibility, which limits their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to ensure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability, up to a language model size of 70B parameters, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.
