
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

November 9, 2023
Authors: Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
cs.AI

Abstract

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.