TextBind: Multi-turn Interleaved Multimodal Instruction-following

Abstract

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models generalize exceptionally well to a variety of real-world tasks through their natural language interfaces. However, their performance relies heavily on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated in multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.
