
TextBind: Multi-turn Interleaved Multimodal Instruction-following

September 14, 2023
Authors: Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi
cs.AI

Abstract

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.
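The core idea described above — using captions as textual stand-ins for images so a text-only language model can author interleaved multimodal conversations — can be sketched as follows. This is an illustrative sketch, not the authors' actual pipeline; the function and tag names (`build_conversation`, `<imgN>`, `fake_llm`) are hypothetical, and the prompt wording is invented for demonstration.

```python
# Hypothetical sketch of TextBind-style data construction: each image is
# represented to a text-only LLM by its caption, the LLM writes a multi-turn
# dialogue referring to the images via placeholder tags, and the tags are
# then mapped back to the real image identifiers.
from typing import Callable, List, Tuple


def build_conversation(
    image_caption_pairs: List[Tuple[str, str]],  # (image_id, caption)
    generate: Callable[[str], str],              # any text-only LLM call
) -> str:
    # 1. Describe each image to the LLM through its caption only.
    caption_block = "\n".join(
        f"<img{i}>: {caption}"
        for i, (_, caption) in enumerate(image_caption_pairs)
    )
    prompt = (
        "You are given descriptions of images. Write a multi-turn dialogue "
        "between a user and an assistant that naturally refers to them, "
        "inserting <imgN> wherever an image appears:\n" + caption_block
    )
    dialogue = generate(prompt)
    # 2. Swap the placeholder tags back to real image ids so the result
    #    can be paired with the actual images as training data.
    for i, (image_id, _) in enumerate(image_caption_pairs):
        dialogue = dialogue.replace(f"<img{i}>", f"<image:{image_id}>")
    return dialogue


# Toy stand-in for the LLM call, for illustration only.
def fake_llm(prompt: str) -> str:
    return "User: What is in <img0>?\nAssistant: A cat resting on a mat."


print(build_conversation([("0001.jpg", "a cat sitting on a mat")], fake_llm))
```

A real pipeline would additionally filter low-quality generations and, as the paper notes, require almost no human annotation beyond the initial image-caption pairs.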