
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

April 8, 2026
Authors: Tencent Robotics X, HY Vision Team, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, Shunyu Yao
cs.AI

Abstract

We introduce HY-Embodied-0.5, a family of foundation models designed specifically for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to strengthen the core capabilities embodied intelligence requires: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted at complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture that enables modality-specific computation; incorporating latent tokens further enhances the models' perceptual representations. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 of these benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
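
The abstract does not detail the MoT layer itself, but the general Mixture-of-Transformers idea can be sketched in a few lines: tokens from all modalities share global self-attention over one interleaved sequence, while each modality is processed by its own feed-forward weights, and learnable latent tokens ride alongside the visual stream. The sketch below is a minimal, hypothetical PyTorch illustration of that pattern; all names, shapes, token counts, and the latent-token placement are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Illustrative Mixture-of-Transformers-style block: global self-attention
    shared across modalities, a separate feed-forward network per modality."""

    def __init__(self, dim: int, num_heads: int, num_modalities: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN per modality, e.g. 0 = text tokens, 1 = vision tokens.
        self.ffns = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_modalities)]
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq) integer labels.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # attention spans the full mixed sequence
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, ffn in enumerate(self.ffns):  # modality-specific computation
            mask = (modality_ids == m).unsqueeze(-1).to(h.dtype)
            out = out + ffn(h) * mask        # dense for clarity; route sparsely in practice
        return x + out

# Usage: learnable latent tokens appended to the visual stream (count is a guess);
# they attend alongside image patches to enrich the perceptual representation.
dim = 256
block = MoTBlock(dim, num_heads=8)
latents = nn.Parameter(torch.zeros(1, 16, dim))
text, vision = torch.randn(1, 10, dim), torch.randn(1, 32, dim)
x = torch.cat([text, vision, latents], dim=1)
ids = torch.cat([torch.zeros(1, 10), torch.ones(1, 32 + 16)], dim=1).long()
y = block(x, ids)  # (1, 58, dim)
```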
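Likewise, the abstract names on-policy distillation without spelling out the objective. In the generic form of this technique, the small student generates its own responses and is then trained to match the large teacher's per-token distribution on those samples, rather than on a fixed teacher-generated corpus. Here is a minimal sketch of one such step, assuming Hugging-Face-style model interfaces (`model(seq).logits`, `model.generate`); the reverse-KL objective and prompt masking are illustrative choices, not the authors' recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, prompt_ids, optimizer, max_new_tokens=128):
    """One on-policy distillation step (illustrative, not the paper's recipe)."""
    # 1) On-policy rollout: the *student* generates its own continuation.
    with torch.no_grad():
        seq = student.generate(prompt_ids, max_new_tokens=max_new_tokens)

    # 2) Score every position under both models (logits at position t
    #    predict token t+1, hence the [:, :-1] shift).
    student_logits = student(seq).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(seq).logits[:, :-1]

    # 3) Reverse KL(student || teacher), averaged over generated tokens only.
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1)  # (batch, seq-1)
    mask = torch.zeros_like(kl)
    mask[:, prompt_ids.shape[1] - 1:] = 1.0             # drop prompt positions
    loss = (kl * mask).sum() / mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training on the student's own samples keeps the supervision on-distribution for the small model, which is why this generally transfers large-model behavior better than distilling on teacher-written text alone.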