LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

November 1, 2023
Authors: Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li
cs.AI

Abstract

LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can hold multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses. Importantly, LLaVA-Interactive goes beyond language prompts: visual prompts are also supported, to align with human intent in the interaction. The development of LLaVA-Interactive is extremely cost-efficient, as the system combines three multimodal skills of pre-built AI models without additional model training: visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN. A diverse set of application scenarios is presented to demonstrate the promise of LLaVA-Interactive and to inspire future research in multimodal interactive systems.