RynnBrain: Open Embodied Foundation Models

Abstract

Despite rapid progress in multimodal foundation models, the embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatiotemporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (RynnBrain-CoP). In extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, our RynnBrain foundation models outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key potentials of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.
