Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
July 14, 2025
Authors: Jiangkai Wu, Zhiyuan Ren, Liming Liu, Xinggong Zhang
cs.AI
Abstract
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC),
where one peer is not a human, but a Multimodal Large Language Model (MLLM).
This makes interaction between humans and AI more intuitive, as if chatting
face-to-face with a real person. However, this poses significant challenges to
latency, because the MLLM inference takes up most of the response time, leaving
very little time for video streaming. Because the network is uncertain and
unstable, transmission latency becomes a critical bottleneck that prevents the
AI from responding like a real person. To address this, we propose Artic, an
AI-oriented Real-time Communication framework, exploring the network
requirement shift from "humans watching video" to "AI understanding video". To
reduce bitrate dramatically while maintaining MLLM accuracy, we propose
Context-Aware Video Streaming that recognizes the importance of each video
region for chat and allocates bitrate almost exclusively to chat-important
regions. To avoid packet retransmission, we propose Loss-Resilient Adaptive
Frame Rate that leverages previous frames to substitute for lost/delayed frames
while avoiding bitrate waste. To evaluate the impact of video streaming quality
on MLLM accuracy, we build the first benchmark, named Degraded Video
Understanding Benchmark (DeViBench). Finally, we discuss some open questions
and ongoing solutions for AI Video Chat.
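The idea behind Context-Aware Video Streaming can be illustrated with a minimal sketch: given per-region importance scores, give every region a small decodability floor and divide the remaining budget in proportion to importance, so nearly all bitrate flows to chat-important regions. The region names, the floor value, and the numbers below are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of importance-weighted bitrate allocation, in the
# spirit of Context-Aware Video Streaming. Region names and numbers are
# invented for illustration.

def allocate_bitrate(importance, total_kbps, floor_kbps=5.0):
    """Split a bitrate budget across video regions by importance.

    Each region first receives a small floor so it stays decodable;
    the remaining budget is divided in proportion to importance score.
    """
    regions = list(importance)
    remaining = total_kbps - floor_kbps * len(regions)
    if remaining < 0:
        raise ValueError("budget too small for the per-region floor")
    total_score = sum(importance.values()) or 1.0
    return {
        r: floor_kbps + remaining * importance[r] / total_score
        for r in regions
    }

# Example: the speaker's face dominates; the background gets only the floor.
alloc = allocate_bitrate(
    {"face": 0.9, "hands": 0.1, "background": 0.0},
    total_kbps=500,
)
```

With these toy scores, the face region receives 441.5 kbps of the 500 kbps budget while the background keeps only the 5 kbps floor, matching the abstract's claim that bitrate is allocated almost exclusively to chat-important regions.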
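The frame-substitution idea behind Loss-Resilient Adaptive Frame Rate can likewise be sketched: when a frame is lost or arrives past its deadline, the receiver feeds the last successfully delivered frame to the MLLM instead of waiting for a retransmission. The class name and frame fields below are assumptions for illustration only.

```python
# Hypothetical sketch of frame substitution under packet loss. A None
# input models a frame that was lost or missed its playout deadline;
# the previous frame is reused so no retransmission is needed.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    seq: int
    data: bytes


class FrameSubstituter:
    def __init__(self) -> None:
        self.last_frame: Optional[Frame] = None

    def deliver(self, frame: Optional[Frame]) -> Optional[Frame]:
        """Return the frame to feed the MLLM for this tick."""
        if frame is not None:
            self.last_frame = frame      # fresh frame: remember and pass on
        return self.last_frame           # loss/delay: repeat the last frame

sub = FrameSubstituter()
out = [sub.deliver(f) for f in
       [Frame(0, b"a"), None, Frame(2, b"c"), None, None]]
seqs = [f.seq if f else None for f in out]  # [0, 0, 2, 2, 2]
```

Repeating a frame costs no extra bitrate and keeps the MLLM's input stream continuous, which is the trade-off the abstract points to when it says the technique substitutes for lost/delayed frames while avoiding bitrate waste.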