
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

April 8, 2026
作者: Tencent Robotics X, HY Vision Team, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, Shunyu Yao
cs.AI

Abstract
We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
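The abstract describes modality-specific computing via a Mixture-of-Transformers (MoT) architecture. The sketch below is an illustrative assumption of how such a block could be organized, not the HY-Embodied-0.5 implementation: the class name, parameter shapes, modality set, and routing scheme are all invented for clarity. The core idea shown is that attention is shared across modalities while each modality's tokens (including latent tokens) are processed by their own feed-forward parameters.

```python
# Minimal sketch of modality-specific computing in a Mixture-of-Transformers
# (MoT) block. Illustrative only: names, shapes, and the routing scheme are
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int,
                 modalities=("text", "vision", "latent")):
        super().__init__()
        # Shared self-attention lets tokens of all modalities interact.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Each modality gets its own feed-forward parameters
        # (the "modality-specific computing" part).
        self.experts = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                             nn.Linear(4 * dim, dim))
            for m in modalities
        })
        self.modalities = modalities

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq) integer index
        # into self.modalities for each token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        # Route each token to the feed-forward expert of its modality.
        for idx, name in enumerate(self.modalities):
            mask = modality_ids == idx
            if mask.any():
                out[mask] = self.experts[name](h[mask])
        return x + out
```

Under this reading, only the parameters of the selected modality expert are activated per token, which is consistent with the abstract's distinction between activated and total parameters.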