JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet most of today's large models remain turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answering systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates hard problems to a background model. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, a visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
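The per-second choice among staying silent, responding, and delegating can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions and is not the released system: `Action`, `Decision`, `run_interaction_loop`, `toy_decide`, and `toy_background` are hypothetical names, short strings stand in for one-second video chunks, and a keyword rule stands in for the model's internal response decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Action(Enum):
    """The three per-second choices described in the abstract."""
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # reply text, or the query handed to the background model


def run_interaction_loop(
    frames: Iterable[str],
    decide: Callable[[int, str], Decision],
    background: Callable[[str], str],
) -> List[Tuple[int, str, str]]:
    """Step through the stream one second at a time and collect spoken events.

    `decide` is a stand-in for the interaction model (assumption);
    `background` is a stand-in for the pluggable background brain (assumption).
    """
    events: List[Tuple[int, str, str]] = []
    for second, frame in enumerate(frames):
        decision = decide(second, frame)
        if decision.action is Action.SILENT:
            continue  # nothing worth saying this second
        if decision.action is Action.RESPOND:
            events.append((second, "model", decision.text))
        else:  # Action.DELEGATE: hand the hard problem to the background model
            events.append((second, "background", background(decision.text)))
    return events


def toy_decide(second: int, frame: str) -> Decision:
    """Hypothetical keyword policy; a real model would decide from pixels."""
    if "fire" in frame:
        return Decision(Action.RESPOND, "Fire detected on camera.")
    if "equation" in frame:
        return Decision(Action.DELEGATE, "Solve the equation on the whiteboard.")
    return Decision(Action.SILENT)


def toy_background(query: str) -> str:
    """Hypothetical background brain that just echoes the query."""
    return f"[background answer to: {query}]"


if __name__ == "__main__":
    stream = ["empty room", "fire in kitchen", "whiteboard equation", "empty room"]
    for event in run_interaction_loop(stream, toy_decide, toy_background):
        print(event)
```

In this sketch the loop never waits for a user prompt: every second produces a decision, and most decisions are silence. Swapping `toy_background` for a call to any API or agent mirrors the pluggable background brain described above.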