JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence
June 10, 2026
Authors: Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang
cs.AI
Abstract
Many moments in the real world do not wait for a user to ask: a fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design. They answer only when addressed, and even video-call apps that appear interactive still operate as question-answering systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world the way a person is. It continuously watches what is happening, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when a problem is hard. To advance interaction models and their adoption across domains, we make two fully open-source contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first vision-language interaction model. The model makes the response decision internally, choosing each second whether to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never explicitly trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around the model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including the ASR/TTS modules, memory, a visualization UI, and a background "brain" that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open-source, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
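The per-second choice between staying silent, responding, and delegating can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions: the names `Action`, `Decision`, `run_interaction_loop`, `toy_decide`, and `toy_background` are hypothetical and do not come from the released system, the keyword rules stand in for the model's learned internal decision, and delegation is shown as a blocking call even though a real deployment would likely handle it asynchronously.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional


class Action(Enum):
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    # Spoken reply for RESPOND, or the query handed over for DELEGATE.
    text: Optional[str] = None


def run_interaction_loop(
    frames: Iterable[str],
    decide: Callable[[str], Decision],
    background: Callable[[str], str],
) -> List[str]:
    """Consume one observation per second and collect everything spoken."""
    spoken: List[str] = []
    for second, frame in enumerate(frames):
        decision = decide(frame)
        if decision.action is Action.SILENT:
            continue  # nothing worth saying this second
        if decision.action is Action.RESPOND:
            spoken.append(f"[{second}s] {decision.text}")
        elif decision.action is Action.DELEGATE:
            # Assumption: a blocking call to a pluggable background model.
            answer = background(decision.text or "")
            spoken.append(f"[{second}s] {answer}")
    return spoken


def toy_decide(frame: str) -> Decision:
    """Hypothetical stand-in policy: keyword rules, not a learned model."""
    if "fire" in frame:
        return Decision(Action.RESPOND, "Warning: fire detected.")
    if "question" in frame:
        return Decision(Action.DELEGATE, frame)
    return Decision(Action.SILENT)


def toy_background(query: str) -> str:
    """Hypothetical background model that just echoes the query."""
    return f"(background answer to: {query})"


# Text descriptions stand in for video frames, one per second.
transcript = run_interaction_loop(
    ["empty hallway", "fire in the corner", "hard question about invoice"],
    toy_decide,
    toy_background,
)
for line in transcript:
    print(line)
```

In this toy run the first second produces no output, the second triggers a direct response, and the third is routed to the background callable. Passing `decide` and `background` as plain callables mirrors the pluggable design the abstract describes, without claiming anything about the actual interfaces.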