
TextBind: Multi-turn Interleaved Multimodal Instruction-following

September 14, 2023
Authors: Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi
cs.AI

Abstract

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models generalize exceptionally well, tackling a wide range of real-world tasks through their natural language interfaces. However, their performance relies heavily on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated in multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.