LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
November 1, 2023
Authors: Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li
cs.AI
Abstract
LLaVA-Interactive is a research prototype for multimodal human-AI
interaction. The system can have multi-turn dialogues with human users by
taking multimodal user inputs and generating multimodal responses. Importantly,
LLaVA-Interactive goes beyond language prompts: visual prompts are also supported
to align with human intent during the interaction. The development of LLaVA-Interactive
is extremely cost-efficient as the system combines three multimodal skills of
pre-built AI models without additional model training: visual chat from LLaVA,
image segmentation from SEEM, as well as image generation and editing from
GLIGEN. A diverse set of application scenarios is presented to demonstrate the
promise of LLaVA-Interactive and to inspire future research in multimodal
interactive systems.