LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Abstract

LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can hold multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses. Importantly, LLaVA-Interactive goes beyond language prompts: visual prompts are also enabled, so that the system can better align with human intent during the interaction. Developing LLaVA-Interactive is extremely cost-efficient, because the system combines three multimodal skills from pre-built AI models without any additional model training: visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN. A diverse set of application scenarios is presented to demonstrate the promise of LLaVA-Interactive and to inspire future research on multimodal interactive systems.
