LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
November 9, 2023
Authors: Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
cs.AI
Abstract
LLaVA-Plus is a general-purpose multimodal assistant that expands the
capabilities of large multimodal models. It maintains a skill repository of
pre-trained vision and vision-language models and can activate relevant tools
based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on
multimodal instruction-following data to acquire the ability to use tools,
covering visual understanding, generation, external knowledge retrieval, and
compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in
existing capabilities and exhibits new ones. It is distinct in that the image
query is directly grounded and actively engaged throughout the entire human-AI
interaction session, significantly improving tool-use performance and enabling
new scenarios.
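The core loop the abstract describes, maintaining a skill repository of pre-trained models and activating the relevant tool for each user input, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the tool names and the keyword matching are invented for clarity, whereas the actual LLaVA-Plus model is trained on multimodal instruction-following data to plan tool use itself.

```python
# Hypothetical sketch of a skill repository with tool dispatch.
# Stub lambdas stand in for pre-trained vision / vision-language models;
# the keyword matcher stands in for the trained model's own tool planning.
from typing import Callable, Dict

# Skill repository: tool name -> callable model stub.
SKILLS: Dict[str, Callable[[str], str]] = {
    "detect":   lambda image: f"boxes for objects in {image}",
    "segment":  lambda image: f"masks for regions in {image}",
    "generate": lambda prompt: f"image synthesized from '{prompt}'",
    "retrieve": lambda query: f"external knowledge for '{query}'",
}

# Invented keyword triggers, only to make the sketch runnable.
KEYWORDS = {
    "detect":   ["find", "locate", "detect"],
    "segment":  ["segment", "outline"],
    "generate": ["draw", "create", "generate"],
    "retrieve": ["who is", "what is", "background"],
}

def dispatch(instruction: str, image: str) -> str:
    """Activate the first matching tool from the repository,
    or answer directly when no tool applies."""
    text = instruction.lower()
    for tool, words in KEYWORDS.items():
        if any(w in text for w in words):
            arg = instruction if tool == "generate" else image
            return SKILLS[tool](arg)
    return "answer directly without tools"

print(dispatch("Please detect the dogs", "photo.jpg"))
```

In the real system the repository covers visual understanding, generation, external knowledge retrieval, and their compositions, and the image stays grounded across the whole session rather than being passed to a single tool call.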