LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Abstract

LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can hold multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses. Importantly, LLaVA-Interactive goes beyond language prompts: visual prompts are also supported to align human intent during the interaction. The development of LLaVA-Interactive is extremely cost-efficient because the system combines three multimodal skills from pre-built AI models without additional model training: visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN. A diverse set of application scenarios is presented to demonstrate the promise of LLaVA-Interactive and to inspire future research in multimodal interactive systems.
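The abstract's central design claim, three pre-built skills combined without additional training, can be illustrated with a minimal sketch. This is not the LLaVA-Interactive implementation: `Turn`, `Session`, the stub functions, and the string-based `Image` placeholder are all hypothetical names introduced here, and the stubs merely stand in for LLaVA, SEEM, and GLIGEN. The sketch assumes only that each skill takes the current image plus the user's turn and returns an updated image and a reply, so the image state carries across turns.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: an image is represented by a plain string so the sketch
# runs without any model or imaging dependency.
Image = str


@dataclass
class Turn:
    """One user turn: which skill to invoke plus language/visual prompts."""
    skill: str                            # "chat", "segment", or "generate"
    text: str = ""                        # language prompt
    visual_prompt: Optional[str] = None   # e.g. a stroke or box drawn by the user


# A skill maps (current image, turn) to (updated image, reply text).
Skill = Callable[[Image, Turn], Tuple[Image, str]]


@dataclass
class Session:
    """Routes each turn to a frozen, pre-built skill and keeps shared state."""
    skills: Dict[str, Skill]
    image: Image
    history: List[str] = field(default_factory=list)

    def step(self, turn: Turn) -> str:
        if turn.skill not in self.skills:
            raise ValueError(f"unknown skill: {turn.skill}")
        # The selected skill sees the image left behind by earlier turns.
        self.image, reply = self.skills[turn.skill](self.image, turn)
        self.history.append(reply)
        return reply


# Hypothetical stubs standing in for the three pre-built models.
def chat_stub(image: Image, turn: Turn) -> Tuple[Image, str]:
    return image, f"[chat] {turn.text} about {image}"


def segment_stub(image: Image, turn: Turn) -> Tuple[Image, str]:
    return f"{image}+mask({turn.visual_prompt})", "[segment] mask selected"


def generate_stub(image: Image, turn: Turn) -> Tuple[Image, str]:
    return f"{image}+edit({turn.text})", "[generate] image updated"


def build_demo_session(image: Image = "photo") -> Session:
    return Session(
        skills={"chat": chat_stub, "segment": segment_stub, "generate": generate_stub},
        image=image,
    )
```

In this sketch, a user could select a region with a visual prompt, edit it with a language prompt, and then ask a question about the result, with each `step` call operating on the image produced by the previous one. No component is trained or fine-tuned; the only new logic is the routing and the shared state.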
