LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Abstract

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and activates the relevant tools based on the user's input to complete real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire tool-use abilities covering visual understanding, generation, external knowledge retrieval, and compositions of these skills. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and remains actively engaged throughout the entire human-AI interaction session, which significantly improves tool-use performance and enables new scenarios.
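
The abstract's idea of a skill repository whose tools are activated by the user's input can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Tool`, `SkillRepository`, `plan_tool`, and `answer` are hypothetical, the tools are stubs rather than real vision models, and the keyword-based planner only stands in for the trained model, which in LLaVA-Plus learns tool selection from instruction-following data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Tool:
    """One entry in the skill repository (hypothetical structure)."""
    name: str
    description: str
    run: Callable[[str, str], str]  # (image_path, instruction) -> tool output


class SkillRepository:
    """Maps tool names to callables standing in for pre-trained models."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def get(self, name: str) -> Optional[Tool]:
        return self._tools.get(name)


def plan_tool(instruction: str) -> Optional[str]:
    """Toy planner: keyword matching, NOT the learned selection in the paper."""
    text = instruction.lower()
    if "detect" in text or "find" in text:
        return "detection"
    if "caption" in text or "describe" in text:
        return "captioning"
    return None  # no tool needed; answer directly


def answer(repo: SkillRepository, image_path: str, instruction: str) -> str:
    """Pick a tool from the instruction, run it on the image, and report."""
    tool_name = plan_tool(instruction)
    tool = repo.get(tool_name) if tool_name else None
    if tool is None:
        return f"[no tool] answering directly: {instruction}"
    # The image is passed along with every call, so it stays part of the query.
    tool_output = tool.run(image_path, instruction)
    return f"[{tool.name}] {tool_output}"


# Stub tools; a real system would wrap actual vision models here.
repo = SkillRepository()
repo.register(Tool("detection", "locate objects",
                   lambda img, q: f"boxes for '{q}' in {img}"))
repo.register(Tool("captioning", "describe the image",
                   lambda img, q: f"caption of {img}"))

print(answer(repo, "cat.jpg", "Detect the cat"))
print(answer(repo, "cat.jpg", "What is 2+2?"))
```

Keeping the repository as a plain name-to-callable mapping means new skills can be registered without touching the dispatch logic, which is one simple way to read the "skill repository" described above.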
