Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI

Abstract

AI Video Chat is emerging as a new paradigm for Real-time Communication (RTC), in which one peer is not a human but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, it poses a significant latency challenge: MLLM inference takes up most of the response time, leaving very little time for video streaming. Because networks are uncertain and unstable, transmission latency becomes a critical bottleneck that prevents AI from behaving like a real person. To address this, we propose Artic, an AI-oriented Real-time Communication framework, which explores the shift in network requirements from "humans watching video" to "AI understanding video". To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming, which recognizes how important each video region is to the chat and allocates bitrate almost exclusively to chat-important regions. To avoid packet retransmission, we propose Loss-Resilient Adaptive Frame Rate, which uses previous frames to substitute for lost or delayed frames while avoiding bitrate waste. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark for this purpose, the Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat.
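The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not Artic's actual algorithm, which the abstract does not specify. The function names, the per-region importance scores, the bitrate floor, and the proportional split are all assumptions made purely for illustration. The first function gives nearly all of a bitrate budget to regions with high importance scores. The second reuses the most recent received frame in place of a lost or delayed one instead of requesting retransmission.

```python
from typing import List, Optional, Sequence


def allocate_bitrate(importance: Sequence[float], total_kbps: float,
                     floor_kbps: float = 1.0) -> List[float]:
    """Split total_kbps across regions, favoring chat-important ones.

    Hypothetical scheme: every region gets a small floor, and the
    remainder is shared in proportion to the importance scores.
    """
    n = len(importance)
    if n == 0:
        return []
    if any(score < 0 for score in importance):
        raise ValueError("importance scores must be non-negative")
    weight_sum = sum(importance)
    if weight_sum == 0:
        # No region stands out: fall back to an even split.
        return [total_kbps / n] * n
    floor = min(floor_kbps, total_kbps / n)
    remaining = total_kbps - floor * n
    return [floor + remaining * score / weight_sum for score in importance]


def substitute_lost_frames(frames: Sequence[Optional[str]]) -> List[Optional[str]]:
    """Replace each lost frame (None) with the most recent received frame.

    A frame lost before anything has arrived stays None.
    """
    result: List[Optional[str]] = []
    last: Optional[str] = None
    for frame in frames:
        if frame is not None:
            last = frame
        result.append(last)
    return result
```

For example, with importance scores `[0.0, 0.9, 0.1]` and a 100 kbps budget, the second region receives most of the budget while the first receives only the floor. In a real system the scores would have to come from some model of what the current chat context needs, which this sketch leaves out.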

