InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language

Abstract

We present an interactive visual framework named InternChat, or iChat for short. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions, such as pointing movements, that let users directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternChat stands for interaction, nonverbal, and chatbots. Unlike existing interactive systems that rely on language alone, iChat incorporates pointing instructions, which significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenes that contain more than two objects. In addition, iChat uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model named Husky is fine-tuned for high-quality multi-modal dialogue (impressing ChatGPT-3.5-turbo with 93.89% GPT-4 quality). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternChat.

