RynnEC: Bringing MLLMs into Embodied World
August 19, 2025
Authors: Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
cs.AI
Abstract
We introduce RynnEC, a video multimodal large language model designed for
embodied cognition. Built upon a general-purpose vision-language foundation
model, RynnEC incorporates a region encoder and a mask decoder, enabling
flexible region-level video interaction. Despite its compact architecture,
RynnEC achieves state-of-the-art performance in object property understanding,
object segmentation, and spatial reasoning. Conceptually, it offers a
region-centric video paradigm for the brain of embodied agents, providing
fine-grained perception of the physical world and enabling more precise
interactions. To mitigate the scarcity of annotated 3D datasets, we propose an
egocentric-video-based pipeline for generating embodied cognition data.
Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for
evaluating embodied cognitive capabilities. We anticipate that RynnEC will
advance the development of general-purpose cognitive cores for embodied agents
and facilitate generalization across diverse embodied tasks. The code, model
checkpoints, and benchmark are available at
https://github.com/alibaba-damo-academy/RynnEC.
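
The abstract names two architectural additions, a region encoder and a mask decoder, without detailing their interfaces. The sketch below shows one common way such components are wired up: mask-pooling patch features into a single region token projected into the LLM's embedding space, and scoring patches against an LLM hidden state to decode a binary mask. All module names, dimensions, and the pooling scheme here are illustrative assumptions, not RynnEC's actual implementation.

```python
import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    """Pools patch features inside a binary region mask into one region token
    and projects it to the LLM's hidden width (a sketch, not RynnEC's design)."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, vis_dim); region_mask: (num_patches,) in {0, 1}
        w = region_mask.float()
        pooled = (patch_feats * w.unsqueeze(-1)).sum(0) / w.sum().clamp(min=1.0)
        return self.proj(pooled)  # (llm_dim,) region token for the LLM input

class MaskDecoder(nn.Module):
    """Scores each patch against a query embedding emitted by the LLM."""
    def __init__(self, llm_dim: int, vis_dim: int):
        super().__init__()
        self.proj = nn.Linear(llm_dim, vis_dim)

    def forward(self, query: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        logits = patch_feats @ self.proj(query)  # (num_patches,) per-patch scores
        return logits.sigmoid() > 0.5            # predicted binary patch mask

# Toy usage with random features standing in for a frozen vision encoder's output.
patch_feats = torch.randn(256, 1024)             # e.g. 16x16 patches, ViT width 1024
region_mask = torch.zeros(256)
region_mask[40:60] = 1                           # user-indicated region of interest
enc, dec = RegionEncoder(1024, 4096), MaskDecoder(4096, 1024)
region_token = enc(patch_feats, region_mask)     # would be spliced into the LLM input
pred_mask = dec(torch.randn(4096), patch_feats)  # query stands in for an LLM hidden state
```

In an actual system the region token would typically be interleaved with text tokens in the LLM's input sequence, and the decoder query would come from the final hidden state of a dedicated segmentation token, which is what makes the region-level interaction described in the abstract bidirectional: regions in, masks out.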