RynnEC: Bringing MLLMs into Embodied World
August 19, 2025
Authors: Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
cs.AI
Abstract
We introduce RynnEC, a video multimodal large language model designed for
embodied cognition. Built upon a general-purpose vision-language foundation
model, RynnEC incorporates a region encoder and a mask decoder, enabling
flexible region-level video interaction. Despite its compact architecture,
RynnEC achieves state-of-the-art performance in object property understanding,
object segmentation, and spatial reasoning. Conceptually, it offers a
region-centric video paradigm for the brain of embodied agents, providing
fine-grained perception of the physical world and enabling more precise
interactions. To mitigate the scarcity of annotated 3D datasets, we propose an
egocentric-video-based pipeline for generating embodied cognition data.
Furthermore, we introduce RynnEC-Bench, a region-centric benchmark for
evaluating embodied cognitive capabilities. We anticipate that RynnEC will
advance the development of general-purpose cognitive cores for embodied agents
and facilitate generalization across diverse embodied tasks. The code, model
checkpoints, and benchmark are available at:
https://github.com/alibaba-damo-academy/RynnEC
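
To make the region-centric interaction concrete, below is a minimal, hypothetical PyTorch sketch of the two components the abstract names: a region encoder that pools video patch features under a user-supplied binary mask into a single LLM-space "region token", and a mask decoder that scores patches against a query embedding to produce segmentation logits. All class names, dimensions, and tensors here are illustrative assumptions, not the repository's actual API.

import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    """Hypothetical: pools patch features inside a binary region mask
    into a single token in the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
        # patch_feats: (T, N, D) video patch features; region_mask: (T, N) in {0, 1}
        w = region_mask / region_mask.sum().clamp(min=1.0)      # normalized mask weights
        pooled = (patch_feats * w.unsqueeze(-1)).sum(dim=(0, 1))  # (D,) masked average
        return self.proj(pooled)                                 # one LLM-space region token

class MaskDecoder(nn.Module):
    """Hypothetical: scores each patch against a query embedding
    emitted by the LLM, yielding per-patch mask logits."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(llm_dim, vision_dim)

    def forward(self, patch_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(query)                                   # project query to vision space
        return torch.einsum("tnd,d->tn", patch_feats, q)        # (T, N) segmentation logits

# Toy usage with random features standing in for a video backbone's output.
T, N, D_V, D_L = 8, 196, 1024, 4096
feats = torch.randn(T, N, D_V)
mask = (torch.rand(T, N) > 0.9).float()                          # a user-drawn region
region_tok = RegionEncoder(D_V, D_L)(feats, mask)                # token fed to the LLM
seg_logits = MaskDecoder(D_V, D_L)(feats, torch.randn(D_L))      # decoded mask logits
print(region_tok.shape, seg_logits.shape)                        # [4096] and [8, 196]

In this reading, the region token lets the language model refer to a specific object instance in the video, while the decoder path lets it answer with a mask rather than only text, matching the "flexible region-level video interaction" the abstract describes.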