AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Abstract

Recent research on Large Language Models (LLMs) has led to remarkable advances in general NLP AI assistants. Some studies have further explored using LLMs to plan and invoke models or APIs to address more general multi-modal user queries. Despite this progress, complex visual tasks remain challenging because visual tasks are so diverse. This diversity is reflected in two aspects: 1) Reasoning paths. For many real-life applications, it is hard to accurately decompose a query simply by examining the query itself. Planning based on the specific visual content and the results of each step is usually required. 2) Flexible inputs and intermediate results. Input forms can be flexible in in-the-wild cases, involving not only a single image or video but also a mixture of videos and images, e.g., a user-view image with some reference videos. In addition, a complex reasoning process generates diverse multimodal intermediate results, e.g., video narrations and segmented video clips. To address such general cases, we propose a multi-modal AI assistant, AssistGPT, with an interleaved code and language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. Specifically, the Planner uses natural language to plan which tool in the Executor should act next, based on the current reasoning progress. The Inspector is an efficient memory manager that helps the Planner feed the proper visual information into a specific tool. Finally, since the entire reasoning process is complex and flexible, a Learner is designed to enable the model to autonomously explore and discover the optimal solution. We conducted experiments on the A-OKVQA and NExT-QA benchmarks, achieving state-of-the-art results. Moreover, showcases demonstrate the ability of our system to handle questions far more complex than those found in the benchmarks.
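
The division of labor among the four PEIL roles can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (Inspector, Executor, Learner, scripted_planner, run_peil) is hypothetical. The planner is a scripted stand-in for the LLM, the tools are placeholder functions, and the Learner's rule of keeping traces that match a known answer is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (tool name, argument); a hypothetical step format


@dataclass
class Inspector:
    """Toy memory manager: tracks inputs and intermediate results by name."""
    memory: Dict[str, str] = field(default_factory=dict)

    def record(self, name: str, description: str) -> None:
        self.memory[name] = description


@dataclass
class Executor:
    """Routes a planned step to a tool; the tools here are stand-in functions."""
    tools: Dict[str, Callable[[str], str]]

    def run(self, tool: str, arg: str) -> str:
        if tool not in self.tools:
            return f"error: unknown tool {tool}"
        return self.tools[tool](arg)


@dataclass
class Learner:
    """Keeps traces that reached an expected answer (assumed to be known)."""
    examples: List[List[Step]] = field(default_factory=list)

    def update(self, trace: List[Step], answer: str, expected: str) -> bool:
        ok = answer == expected
        if ok:
            self.examples.append(trace)
        return ok


def scripted_planner(query: str, memory: Dict[str, str]) -> Optional[Step]:
    """Stand-in for the LLM Planner: chooses the next step from memory state."""
    if "caption" not in memory:
        return ("caption", "image-0")
    if "qa" not in memory:
        return ("qa", query)
    return None  # nothing left to plan


def run_peil(query: str,
             planner: Callable[[str, Dict[str, str]], Optional[Step]],
             executor: Executor,
             inspector: Inspector,
             max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Interleave planning, execution, and inspection until the planner stops."""
    trace: List[Step] = []
    result = ""
    for _ in range(max_steps):
        step = planner(query, inspector.memory)  # Plan
        if step is None:
            break
        tool, arg = step
        result = executor.run(tool, arg)         # Execute
        inspector.record(tool, result)           # Inspect: store the result
        trace.append(step)
    return result, trace


# Usage with placeholder tools that return fixed strings.
inspector = Inspector()
inspector.record("image-0", "user-provided image")
executor = Executor(tools={
    "caption": lambda ref: "a dog on a beach",
    "qa": lambda question: "dog",
})
answer, trace = run_peil("What animal is shown?", scripted_planner,
                         executor, inspector)
learner = Learner()
learner.update(trace, answer, expected="dog")    # Learn
```

In this sketch the planner sees only the Inspector's memory, which mirrors the abstract's point that the next step depends on intermediate results rather than on the query alone.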
