JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-source contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first vision-language interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, a visualization UI, and a background "brain" that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open-source, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
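
The per-second choice among staying silent, responding, and delegating can be pictured with a small sketch. This is only an illustration under stated assumptions, not the released system's interface: `Action`, `Decision`, `run_interaction`, `toy_policy`, and `toy_backend` are hypothetical names, frames are stood in for by plain string labels, and the rule-based policy merely imitates a decision that the actual model makes internally.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Action(Enum):
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # spoken reply, or the query handed to the background model


# Hypothetical interface: given everything seen so far, return one decision.
Policy = Callable[[List[str]], Decision]
# Hypothetical background "brain": any API or agent that answers a query.
Backend = Callable[[str], str]


def run_interaction(
    frames: Iterable[str], policy: Policy, backend: Backend
) -> List[Tuple[int, str]]:
    """Feed one frame per second to the policy and collect what gets spoken."""
    history: List[str] = []
    spoken: List[Tuple[int, str]] = []
    for second, frame in enumerate(frames):
        history.append(frame)
        decision = policy(history)
        if decision.action is Action.RESPOND:
            spoken.append((second, decision.text))
        elif decision.action is Action.DELEGATE:
            # Assumption: delegation is synchronous here for simplicity.
            spoken.append((second, backend(decision.text)))
        # Action.SILENT: say nothing and keep watching.
    return spoken


def toy_policy(history: List[str]) -> Decision:
    """Stand-in rules; the real model makes this choice internally."""
    latest = history[-1]
    if latest == "fire":
        return Decision(Action.RESPOND, "Warning: fire detected.")
    if latest == "hard_question":
        return Decision(Action.DELEGATE, "explain the chart on screen")
    return Decision(Action.SILENT)


def toy_backend(query: str) -> str:
    """Placeholder for a pluggable background model or agent."""
    return f"[background answer to: {query}]"


if __name__ == "__main__":
    stream = ["idle", "idle", "fire", "idle", "hard_question"]
    for second, utterance in run_interaction(stream, toy_policy, toy_backend):
        print(f"t={second}s: {utterance}")
```

In this toy run the loop speaks only at the "fire" frame and hands the "hard_question" frame to the backend, staying silent otherwise. A deployed system would presumably handle delegation without blocking the stream; the sketch keeps it inline only to stay short.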