AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
June 14, 2023
Authors: Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
cs.AI
Abstract
Recent research on Large Language Models (LLMs) has led to remarkable
advancements in general NLP AI assistants. Some studies have further explored
the use of LLMs for planning and invoking models or APIs to address more
general multi-modal user queries. Despite this progress, complex vision-based
tasks remain challenging due to the diverse nature of visual tasks. This
diversity is reflected in two aspects: 1) Reasoning paths. For many real-life
applications, it is hard to accurately decompose a query simply by examining
the query itself. Planning based on the specific visual content and the results
of each step is usually required. 2) Flexible inputs and intermediate results.
Input forms can be flexible in in-the-wild cases, involving not only a
single image or video but also a mixture of videos and images, e.g., a user-view
image with some reference videos. Moreover, a complex reasoning process will
also generate diverse multimodal intermediate results, e.g., video narrations,
segmented video clips, etc. To address such general cases, we propose a
multi-modal AI assistant, AssistGPT, with an interleaved code and language
reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate
LLMs with various tools. Specifically, the Planner is capable of using natural
language to plan which tool in the Executor should act next based on the current
reasoning progress. The Inspector is an efficient memory manager that assists the
Planner in feeding the proper visual information into a specific tool. Finally, since
the entire reasoning process is complex and flexible, a Learner is designed to
enable the model to autonomously explore and discover the optimal solution. We
conducted experiments on A-OKVQA and NExT-QA benchmarks, achieving
state-of-the-art results. Moreover, qualitative showcases demonstrate the ability of our
system to handle questions far more complex than those found in the benchmarks.
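To make the Plan, Execute, Inspect, and Learn (PEIL) flow described in the abstract concrete, below is a minimal Python sketch of such an interleaved reasoning loop. The class and method names (`peil_loop`, `plan`, `run`, `record`, the summary-based `Memory`) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a PEIL-style reasoning loop (Plan, Execute,
# Inspect, Learn). Interfaces are assumptions, not the paper's code.

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Inspector-managed store of multimodal intermediate results."""
    records: list = field(default_factory=list)

    def add(self, name: str, summary: str) -> None:
        # Keep a short textual summary so the Planner can reference
        # large visual artifacts (clips, narrations) by name.
        self.records.append((name, summary))

    def describe(self) -> str:
        return "\n".join(f"{name}: {summary}" for name, summary in self.records)


def peil_loop(query: str, planner, executor, learner, max_steps: int = 10):
    """Interleave planning, tool execution, and memory inspection."""
    memory = Memory()
    trace = []
    for _ in range(max_steps):
        # Plan: the LLM chooses the next tool and its arguments in
        # natural language, conditioned on the memory summaries so far.
        step = planner.plan(query, memory.describe())
        if step.is_final:
            # Learn: record whether this reasoning path succeeded so
            # that future plans for similar queries can be refined.
            learner.record(query, trace, step.answer)
            return step.answer
        # Execute: run the chosen tool (e.g., captioner, detector).
        result = executor.run(step.tool, step.arguments)
        # Inspect: summarize the result and file it in memory.
        memory.add(step.tool, result.summary)
        trace.append(step)
    return None
```

The key design point this sketch illustrates is that planning is incremental: each step is chosen after inspecting the summaries of earlier results, rather than decomposing the whole query up front.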