LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
November 9, 2023
Authors: Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
cs.AI
Abstract
LLaVA-Plus is a general-purpose multimodal assistant that expands the
capabilities of large multimodal models. It maintains a skill repository of
pre-trained vision and vision-language models and can activate relevant tools
based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on
multimodal instruction-following data to acquire the ability to use tools,
covering visual understanding, generation, external knowledge retrieval, and
compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in
existing capabilities and exhibits new ones. It is distinct in that the image
query is directly grounded and actively engaged throughout the entire human-AI
interaction session, significantly improving tool-use performance and enabling
new scenarios.
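The core loop the abstract describes, maintaining a skill repository of pre-trained models and activating the relevant tool for each user input, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the tool names and the keyword matching are invented for clarity, whereas the actual LLaVA-Plus model is trained on multimodal instruction-following data to plan tool use itself.

```python
# Hypothetical sketch of a skill repository with tool dispatch.
# Stub lambdas stand in for pre-trained vision / vision-language models;
# the keyword matcher stands in for the trained model's own tool planning.
from typing import Callable, Dict

# Skill repository: tool name -> callable model stub.
SKILLS: Dict[str, Callable[[str], str]] = {
    "detect":   lambda image: f"boxes for objects in {image}",
    "segment":  lambda image: f"masks for regions in {image}",
    "generate": lambda prompt: f"image synthesized from '{prompt}'",
    "retrieve": lambda query: f"external knowledge for '{query}'",
}

# Invented keyword triggers, only to make the sketch runnable.
KEYWORDS = {
    "detect":   ["find", "locate", "detect"],
    "segment":  ["segment", "outline"],
    "generate": ["draw", "create", "generate"],
    "retrieve": ["who is", "what is", "background"],
}

def dispatch(instruction: str, image: str) -> str:
    """Activate the first matching tool from the repository,
    or answer directly when no tool applies."""
    text = instruction.lower()
    for tool, words in KEYWORDS.items():
        if any(w in text for w in words):
            arg = instruction if tool == "generate" else image
            return SKILLS[tool](arg)
    return "answer directly without tools"

print(dispatch("Please detect the dogs", "photo.jpg"))
```

In the real system the repository covers visual understanding, generation, external knowledge retrieval, and their compositions, and the image stays grounded across the whole session rather than being passed to a single tool call.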