LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Abstract

LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can hold multi-turn dialogues with human users, taking multimodal inputs and generating multimodal responses. Importantly, LLaVA-Interactive goes beyond language prompts: visual prompting is enabled to align with human intent during the interaction. The development of LLaVA-Interactive is extremely cost-efficient, as the system combines three multimodal skills of pre-built AI models without additional model training: visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN. A diverse set of application scenarios is presented to demonstrate the promise of LLaVA-Interactive and to inspire future research on multimodal interactive systems.
