InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language
May 9, 2023
Authors: Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao
cs.AI
Abstract
We present an interactive visual framework named InternChat, or iChat for
short. The framework integrates chatbots that have planning and reasoning
capabilities, such as ChatGPT, with non-verbal instructions like pointing
movements that enable users to directly manipulate images or videos on the
screen. Pointing movements (including gestures, cursors, etc.) provide greater
flexibility and precision for vision-centric tasks that require fine-grained
control, editing, and generation of visual content. The name InternChat stands
for interaction, nonverbal, and chatbots. Unlike existing interactive systems
that rely on language alone, the proposed iChat incorporates pointing
instructions, significantly improving the efficiency of communication between
users and chatbots, as well as the accuracy of chatbots on vision-centric
tasks, especially in complicated visual scenarios containing more than two
objects. Additionally, iChat uses an auxiliary control mechanism to improve
the controllability of the LLM, and fine-tunes a large vision-language model
termed Husky for high-quality multi-modal dialogue (impressing
ChatGPT-3.5-turbo with 93.89% GPT-4 quality).
We hope this work can spark new ideas and directions for future interactive
visual systems. The code is available at
https://github.com/OpenGVLab/InternChat.
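
To make the idea of combining pointing actions with a chatbot planner more concrete, here is a minimal, hypothetical sketch of such an interaction loop. All names below (PointingEvent, segment_at_point, plan_with_llm, the tool registry) are illustrative assumptions and do not correspond to the actual InternChat/iChat API; a real system would delegate tool selection to the chatbot (e.g., ChatGPT) rather than a keyword rule.

```python
# Hypothetical sketch: a non-verbal pointing action (screen coordinates) is
# combined with a language instruction, and a planner chooses a vision tool.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PointingEvent:
    """A click or gesture on the displayed image, in pixel coordinates."""
    x: int
    y: int


def segment_at_point(image_path: str, event: PointingEvent) -> str:
    """Placeholder: return an identifier for the region under the cursor."""
    return f"mask(image={image_path}, x={event.x}, y={event.y})"


def plan_with_llm(instruction: str, region: str,
                  tools: Dict[str, Callable[[str], str]]) -> str:
    """Placeholder planner: a real system would ask the chatbot to pick a
    tool; here a trivial keyword rule stands in for that reasoning step."""
    name = "inpaint" if "remove" in instruction.lower() else "describe"
    return tools[name](region)


tools = {
    "inpaint": lambda region: f"removed object in {region}",
    "describe": lambda region: f"caption for {region}",
}

# Usage: the user points at an object and types a short command.
event = PointingEvent(x=412, y=230)
region = segment_at_point("street.jpg", event)
print(plan_with_llm("remove this car", region, tools))
```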