MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
April 30, 2026
Authors: Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, Yuan Yao
cs.AI
Abstract
Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet these models still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs to adjust their output during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in an evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these limitations through real-time full-duplex omni-modal interaction. The model can see, listen, and speak simultaneously in real time, while also exhibiting proactive behaviors, such as issuing reminders or comments, based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to emerge within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers higher-quality speech generation with significantly greater computational efficiency. Thanks to its efficient architecture and inference optimizations, the model can perform real-time full-duplex omni-modal interaction on edge devices using less than 12 GB of RAM.
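The abstract describes Omni-Flow only at a high level. To make the idea of a shared temporal axis concrete, below is a minimal toy sketch in Python of a full-duplex, time-aligned interaction loop. Everything in it, including the FullDuplexModel class, the per-tick step interface, and the barge-in and reminder heuristics, is a hypothetical illustration of the paradigm the paper describes, not the actual MiniCPM-o 4.5 implementation.

```python
# Hypothetical sketch of a full-duplex, time-aligned interaction loop in the
# spirit of Omni-Flow. None of these names come from MiniCPM-o 4.5; the paper
# describes the paradigm (input and output streams aligned on a shared
# temporal axis), not this code.

from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeStep:
    """One tick on the shared temporal axis: inputs and outputs coexist."""
    t: int
    video_frame: Optional[str] = None  # stand-in for visual features
    audio_chunk: Optional[str] = None  # stand-in for audio features
    speech_out: Optional[str] = None   # speech token emitted at this tick


class FullDuplexModel:
    """Toy model that perceives and speaks within the same tick."""

    def __init__(self) -> None:
        self.context: list[TimeStep] = []   # running time-aligned history
        self.pending_reply: list[str] = []  # speech tokens being streamed out
        self.reminded = False               # proactive reminder already given?

    def step(self, t: int, video: Optional[str], audio: Optional[str]) -> Optional[str]:
        # 1. Perception and generation share the tick: new inputs are folded
        #    into the context even while a reply is mid-stream.
        step = TimeStep(t=t, video_frame=video, audio_chunk=audio)
        self.context.append(step)

        # 2. Full duplex: incoming user speech can cut off the ongoing reply
        #    (barge-in), something a turn-based pipeline cannot do.
        if audio and "stop" in audio:
            self.pending_reply.clear()

        # 3. Proactive behavior: the model may start speaking without any
        #    explicit request, based on continuous scene understanding.
        if video and "kettle boiling" in video and not self.reminded:
            self.pending_reply = ["the", "kettle", "is", "boiling!"]
            self.reminded = True

        # 4. Emit at most one speech token per tick, keeping the output
        #    stream aligned with the input stream on the same clock.
        step.speech_out = self.pending_reply.pop(0) if self.pending_reply else None
        return step.speech_out


if __name__ == "__main__":
    model = FullDuplexModel()
    # Toy time-aligned input stream: one (video frame, audio chunk) per tick.
    stream = [
        ("empty kitchen", None),
        ("kettle boiling", None),    # triggers a proactive reminder
        ("kettle boiling", None),    # reminder keeps streaming out
        ("kettle boiling", "stop"),  # user barges in; reply is cut off
        ("kettle boiling", None),
    ]
    for t, (video, audio) in enumerate(stream):
        out = model.step(t, video, audio)
        print(f"t={t:02d} video={video!r} audio={audio!r} -> speech={out!r}")
```

The contrast with a turn-based pipeline lies in steps 1 and 4 above: input ingestion and output emission happen within the same tick, so new perception can reshape or cancel a response mid-stream instead of waiting for the current turn to finish.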