MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

April 30, 2026
Authors: Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, Yuan Yao
cs.AI

Abstract

Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet existing models still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps through real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computational efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12 GB of RAM.
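To make the Omni-Flow idea concrete, here is a minimal toy sketch (not the actual MiniCPM-o 4.5 implementation; all names are hypothetical) of what "aligning inputs and outputs along a shared temporal axis" means: instead of consuming a full user turn before replying, input chunks (vision, audio) and output chunks (speech) are merged into one stream ordered purely by timestamp, so response chunks can overlap in time with incoming perception.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    t: int          # position on the shared time axis (e.g. 100 ms steps)
    stream: str     # "vision_in", "audio_in", "speech_out", ...
    payload: str

def interleave(inputs, outputs):
    """Merge input and output chunks into one time-ordered stream.

    A turn-based system would consume all inputs before emitting any
    output; here chunks are ordered only by timestamp, so perception
    and response overlap on the same axis.
    """
    return sorted(inputs + outputs, key=lambda c: c.t)

inputs = [
    Chunk(0, "vision_in", "frame#0"),
    Chunk(1, "audio_in", "user: 'what is'"),
    Chunk(2, "audio_in", "user: 'this?'"),
    Chunk(3, "vision_in", "frame#1"),
]
outputs = [
    Chunk(2, "speech_out", "model: 'It looks'"),
    Chunk(3, "speech_out", "model: 'like a cat.'"),
]

timeline = interleave(inputs, outputs)
# At steps t=2 and t=3 the stream carries both an input and an output
# chunk: the model is speaking while still receiving audio and video.
```

The point of the sketch is only the data layout: once everything lives on one time axis, full-duplex behavior (and proactive output chunks with no preceding user request) falls out of the same representation, rather than requiring a separate turn-taking controller.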
PDF · May 8, 2026