RynnBrain: Open Embodied Foundation Models

Abstract

Despite rapid progress in multimodal foundation models, the embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatiotemporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities within a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises foundation models at three scales (2B, 8B, and a 30B-A3B mixture-of-experts (MoE) model) and four post-trained variants. Three of these variants are tailored to downstream embodied tasks (RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA), and one is tailored to complex spatial reasoning (RynnBrain-CoP). We evaluate on 20 embodied benchmarks and 8 general vision understanding benchmarks, and our RynnBrain foundation models outperform existing embodied foundation models by a significant margin. The post-trained model suite further demonstrates two key strengths of the RynnBrain foundation model: (i) it enables physically grounded reasoning and planning, and (ii) it serves as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.