AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
June 14, 2023
Authors: Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
cs.AI
Abstract
Recent research on Large Language Models (LLMs) has led to remarkable
advancements in general NLP AI assistants. Some studies have further explored
the use of LLMs for planning and invoking models or APIs to address more
general multi-modal user queries. Despite this progress, complex vision-based
tasks remain challenging due to the diverse nature of visual tasks. This
diversity is reflected in two aspects: 1) Reasoning paths. For many real-life
applications, it is hard to accurately decompose a query simply by examining
the query itself. Planning based on the specific visual content and the results
of each step is usually required. 2) Flexible inputs and intermediate results.
Input forms can be flexible in in-the-wild cases, involving not only a
single image or video but also a mixture of videos and images, e.g., a user-view
image with some reference videos. Moreover, a complex reasoning process will
also generate diverse multimodal intermediate results, e.g., video narrations,
segmented video clips, etc. To address such general cases, we propose a
multi-modal AI assistant, AssistGPT, with an interleaved code and language
reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate
LLMs with various tools. Specifically, the Planner is capable of using natural
language to plan which tool in the Executor should act next based on the current
reasoning progress. The Inspector is an efficient memory manager that assists the
Planner in feeding the proper visual information into a specific tool. Finally, since
the entire reasoning process is complex and flexible, a Learner is designed to
enable the model to autonomously explore and discover the optimal solution. We
conducted experiments on A-OKVQA and NExT-QA benchmarks, achieving
state-of-the-art results. Moreover, qualitative showcases demonstrate the ability of our
system to handle questions far more complex than those found in the benchmarks.
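To make the Plan, Execute, Inspect, and Learn (PEIL) flow described in the abstract concrete, below is a minimal Python sketch of such an interleaved reasoning loop. The class and method names (`peil_loop`, `plan`, `run`, `record`, the summary-based `Memory`) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a PEIL-style reasoning loop (Plan, Execute,
# Inspect, Learn). Interfaces are assumptions, not the paper's code.

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Inspector-managed store of multimodal intermediate results."""
    records: list = field(default_factory=list)

    def add(self, name: str, summary: str) -> None:
        # Keep a short textual summary so the Planner can reference
        # large visual artifacts (clips, narrations) by name.
        self.records.append((name, summary))

    def describe(self) -> str:
        return "\n".join(f"{name}: {summary}" for name, summary in self.records)


def peil_loop(query: str, planner, executor, learner, max_steps: int = 10):
    """Interleave planning, tool execution, and memory inspection."""
    memory = Memory()
    trace = []
    for _ in range(max_steps):
        # Plan: the LLM chooses the next tool and its arguments in
        # natural language, conditioned on the memory summaries so far.
        step = planner.plan(query, memory.describe())
        if step.is_final:
            # Learn: record whether this reasoning path succeeded so
            # that future plans for similar queries can be refined.
            learner.record(query, trace, step.answer)
            return step.answer
        # Execute: run the chosen tool (e.g., captioner, detector).
        result = executor.run(step.tool, step.arguments)
        # Inspect: summarize the result and file it in memory.
        memory.add(step.tool, result.summary)
        trace.append(step)
    return None
```

The key design point this sketch illustrates is that planning is incremental: each step is chosen after inspecting the summaries of earlier results, rather than decomposing the whole query up front.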