
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

November 1, 2023
Authors: Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li
cs.AI

Abstract

LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can hold multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses. Importantly, LLaVA-Interactive goes beyond language prompts: visual prompts are also supported, so that users can align the system with their intent during the interaction. The development of LLaVA-Interactive is extremely cost-efficient because the system combines three multimodal skills from pre-built AI models without additional model training: visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN. A diverse set of application scenarios is presented to demonstrate the promise of LLaVA-Interactive and to inspire future research on multimodal interactive systems.
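
The composition described in the abstract, three pre-built skills combined behind one multi-turn interface with no extra training, can be illustrated with a minimal sketch. Everything below is hypothetical: the `Turn` and `Session` types, the `step` function, and the three stub handlers are placeholders invented for illustration and do not reflect the actual LLaVA, SEEM, or GLIGEN APIs or the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Turn:
    """One user turn: which skill to invoke, plus language and visual prompts."""
    skill: str                                            # "chat", "segment", or "generate"
    text: str = ""                                        # language prompt
    visual_prompt: Optional[List[Tuple[int, int]]] = None  # e.g. points from a user stroke


@dataclass
class Session:
    """State shared across turns: the current image and the reply history."""
    image: Optional[str] = None                  # placeholder for real image data
    history: List[str] = field(default_factory=list)


# Stub handlers standing in for the pre-built models (assumptions, not real APIs).
def chat_stub(session: Session, turn: Turn) -> str:
    return f"[chat] answer about {session.image}: {turn.text}"


def segment_stub(session: Session, turn: Turn) -> str:
    n_points = len(turn.visual_prompt or [])
    return f"[segment] mask from {n_points} visual prompt point(s)"


def generate_stub(session: Session, turn: Turn) -> str:
    # Generation or editing replaces the session image, so later turns see the result.
    session.image = f"edited({session.image})"
    return f"[generate] new image: {session.image}"


SKILLS: Dict[str, Callable[[Session, Turn], str]] = {
    "chat": chat_stub,
    "segment": segment_stub,
    "generate": generate_stub,
}


def step(session: Session, turn: Turn) -> str:
    """Route one turn to the requested skill and record the reply."""
    reply = SKILLS[turn.skill](session, turn)
    session.history.append(reply)
    return reply


if __name__ == "__main__":
    session = Session(image="img0")
    print(step(session, Turn("chat", text="what is this?")))
    print(step(session, Turn("segment", visual_prompt=[(1, 2), (3, 4)])))
    print(step(session, Turn("generate", text="add a boat")))
```

In this sketch the only glue is the shared `Session` state and the dispatch table, which mirrors the abstract's claim that the skills are combined without additional model training; a real system would replace each stub with a call into the corresponding model.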