InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language
May 9, 2023
Authors: Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao
cs.AI
Abstract
We present an interactive visual framework named InternChat, or iChat for
short. The framework integrates chatbots that have planning and reasoning
capabilities, such as ChatGPT, with non-verbal instructions like pointing
movements that enable users to directly manipulate images or videos on the
screen. Pointing movements (including gestures, cursors, etc.) provide greater
flexibility and precision for vision-centric tasks that require fine-grained
control, editing, and generation of visual content. The name InternChat stands
for interaction, nonverbal, and chatbots. Unlike existing interactive systems
that rely on language alone, the proposed iChat incorporates pointing
instructions, significantly improving the efficiency of communication between
users and chatbots, as well as the accuracy of chatbots on vision-centric
tasks, especially in complicated visual scenarios containing more than two
objects. Additionally, iChat uses an auxiliary control mechanism to improve
the controllability of the LLM, and fine-tunes a large vision-language model
termed Husky for high-quality multi-modal dialogue (impressing
ChatGPT-3.5-turbo with 93.89% GPT-4 quality).
We hope this work can spark new ideas and directions for future interactive
visual systems. The code is available at
https://github.com/OpenGVLab/InternChat.
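
To make the idea of combining pointing actions with a chatbot planner more concrete, here is a minimal, hypothetical sketch of such an interaction loop. All names below (PointingEvent, segment_at_point, plan_with_llm, the tool registry) are illustrative assumptions and do not correspond to the actual InternChat/iChat API; a real system would delegate tool selection to the chatbot (e.g., ChatGPT) rather than a keyword rule.

```python
# Hypothetical sketch: a non-verbal pointing action (screen coordinates) is
# combined with a language instruction, and a planner chooses a vision tool.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PointingEvent:
    """A click or gesture on the displayed image, in pixel coordinates."""
    x: int
    y: int


def segment_at_point(image_path: str, event: PointingEvent) -> str:
    """Placeholder: return an identifier for the region under the cursor."""
    return f"mask(image={image_path}, x={event.x}, y={event.y})"


def plan_with_llm(instruction: str, region: str,
                  tools: Dict[str, Callable[[str], str]]) -> str:
    """Placeholder planner: a real system would ask the chatbot to pick a
    tool; here a trivial keyword rule stands in for that reasoning step."""
    name = "inpaint" if "remove" in instruction.lower() else "describe"
    return tools[name](region)


tools = {
    "inpaint": lambda region: f"removed object in {region}",
    "describe": lambda region: f"caption for {region}",
}

# Usage: the user points at an object and types a short command.
event = PointingEvent(x=412, y=230)
region = segment_at_point("street.jpg", event)
print(plan_with_llm("remove this car", region, tools))
```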