JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

June 10, 2026
Authors: Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang
cs.AI

Abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
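
The per-second choice among staying silent, responding, and delegating can be pictured as a small control loop. The sketch below is only an illustration under stated assumptions, not the released system: the names `Action`, `Decision`, `run_interaction`, `toy_policy`, and `toy_background` are hypothetical, frames are represented as plain text descriptions instead of video, and the keyword rules merely stand in for the model's internal decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional


class Action(Enum):
    """The three per-tick choices named in the abstract."""
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    # Reply to speak (RESPOND) or query to hand off (DELEGATE); None when silent.
    text: Optional[str] = None


# Hypothetical stand-in for the interaction model: (frame, history) -> decision.
Policy = Callable[[str, List[str]], Decision]
# Hypothetical stand-in for the pluggable background brain (any API or agent).
Background = Callable[[str], str]


def run_interaction(frames: Iterable[str], policy: Policy,
                    background: Background) -> List[str]:
    """Process one frame per tick and collect everything the system says."""
    history: List[str] = []
    spoken: List[str] = []
    for frame in frames:
        decision = policy(frame, history)
        if decision.action is Action.RESPOND and decision.text:
            spoken.append(decision.text)
        elif decision.action is Action.DELEGATE and decision.text:
            # Hard cases go to the background model; its answer is relayed.
            spoken.append(background(decision.text))
        # SILENT produces no output, but the frame is still remembered.
        history.append(frame)
    return spoken


def toy_policy(frame: str, history: List[str]) -> Decision:
    """Assumed keyword rules, used only to make the loop runnable."""
    if "fire" in frame:
        return Decision(Action.RESPOND, "Warning: fire detected.")
    if frame.endswith("?"):
        return Decision(Action.DELEGATE, frame)
    return Decision(Action.SILENT)


def toy_background(query: str) -> str:
    """Placeholder for an external API or agent."""
    return f"[background] answer to: {query}"


demo = run_interaction(
    ["empty hallway", "fire in the kitchen", "what is this error code?"],
    toy_policy,
    toy_background,
)
print(demo)
```

In this toy run the first frame yields silence, the second triggers a direct response, and the third is delegated. Passing the background brain in as a plain callable mirrors the pluggable design the abstract describes, though the actual interfaces of the released system may differ.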