JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world the way a person is. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-source contributions. First, we release JoyAI-VL-Interaction, an 8B-parameter, vision-first vision-language interaction model. The model makes the response decision internally, choosing each second whether to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities emerge that we never explicitly trained for, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any live video into the model, making it genuinely present in the world. All other components are pluggable, including the ASR/TTS modules, the memory system, the visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open-source, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
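The per-second choice among staying silent, responding, and delegating can be pictured as a small control loop. The sketch below is only an illustration under stated assumptions: `Action`, `Decision`, `run_interaction`, `toy_policy`, and `toy_background` are hypothetical names that do not come from the released code, frames are replaced by short text descriptions, and a keyword rule stands in for the model. Delegation is shown as a synchronous call for simplicity; the abstract does not say how the released system schedules it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Action(Enum):
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # reply for RESPOND, task description for DELEGATE


# Hypothetical signatures (assumptions, not the released API):
# a policy maps (second, frame) to a Decision, and a pluggable
# background "brain" maps a task description to an answer.
Policy = Callable[[int, str], Decision]
Background = Callable[[str], str]


def run_interaction(
    frames: Iterable[str], policy: Policy, background: Background
) -> List[Tuple[int, str]]:
    """Query the policy once per one-second frame and collect what is spoken."""
    spoken: List[Tuple[int, str]] = []
    for second, frame in enumerate(frames):
        decision = policy(second, frame)
        if decision.action is Action.SILENT:
            continue  # nothing worth saying at this second
        if decision.action is Action.RESPOND:
            spoken.append((second, decision.text))
        else:  # Action.DELEGATE: hand the hard task to the background brain
            spoken.append((second, background(decision.text)))
    return spoken


def toy_policy(second: int, frame: str) -> Decision:
    # Stand-in for the model: keyword rules on a text description of the frame.
    if "fire" in frame:
        return Decision(Action.RESPOND, "Warning: fire visible on camera.")
    if "equation" in frame:
        return Decision(Action.DELEGATE, f"solve: {frame}")
    return Decision(Action.SILENT)


def toy_background(task: str) -> str:
    # Stand-in for any API or agent plugged in as the background brain.
    return f"[background] handled '{task}'"


if __name__ == "__main__":
    frames = ["empty hallway", "empty hallway", "fire near door", "whiteboard equation"]
    for second, text in run_interaction(frames, toy_policy, toy_background):
        print(f"t={second}s: {text}")
```

Because the background brain is passed in as a plain callable, swapping it for a different API or agent needs no change to the loop, which mirrors the pluggable design the abstract describes.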