MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Abstract

Recent progress in multimodal large language models (MLLMs) has moved AI capabilities from static offline data processing to real-time streaming interaction, yet these models remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, which prevents models from incorporating new inputs and adjusting in time during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in an evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps through real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computational efficiency. Owing to its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices using less than 12 GB of RAM.
