JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
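The per-second decision described above (stay silent, respond, or delegate to a background model) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the released system's API: `Action`, `Decision`, `run_interaction_loop`, `toy_decide`, and `toy_background` are illustrative names, frames are plain strings instead of video, and the keyword-based policy only takes the place of the model's internal decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional


class Action(Enum):
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    # Reply text for RESPOND, or the query handed to the background model for DELEGATE.
    text: Optional[str] = None


def run_interaction_loop(
    frames: Iterable[str],
    decide: Callable[[str], Decision],
    background: Callable[[str], str],
) -> List[str]:
    """Step once per simulated second and collect everything that is spoken."""
    spoken: List[str] = []
    for second, frame in enumerate(frames):
        decision = decide(frame)
        if decision.action is Action.SILENT:
            continue  # nothing worth saying at this second
        if decision.action is Action.RESPOND:
            spoken.append(f"[{second}s] {decision.text}")
        elif decision.action is Action.DELEGATE:
            # Assumption: this sketch blocks on the background call for simplicity.
            answer = background(decision.text or "")
            spoken.append(f"[{second}s] {answer}")
    return spoken


def toy_decide(frame: str) -> Decision:
    """Hypothetical keyword policy standing in for the model's internal decision."""
    if "fire" in frame:
        return Decision(Action.RESPOND, "Warning: fire visible on the monitor.")
    if "equation" in frame:
        return Decision(Action.DELEGATE, "solve the equation on screen")
    return Decision(Action.SILENT)


def toy_background(query: str) -> str:
    """Hypothetical background brain; a real one could wrap any API or agent."""
    return f"(background) result for: {query}"


frames = ["empty room", "fire starts", "empty room", "whiteboard equation"]
transcript = run_interaction_loop(frames, toy_decide, toy_background)
for line in transcript:
    print(line)
```

Passing `decide` and `background` as plain callables loosely mirrors the pluggable design the abstract describes: the loop does not need to know which component answers a delegated query.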