HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

Abstract

We introduce HY-Embodied-0.5, a family of foundation models designed specifically for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to strengthen the core capabilities that embodied intelligence requires: spatial and temporal visual perception, together with advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted at complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture that enables modality-specific computation. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
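The abstract names on-policy distillation without giving its objective. As a rough illustration only, the sketch below uses one common formulation: the student samples its own continuation, and a per-token reverse KL divergence, KL(student || teacher), is averaged over the positions the student visited. The choice of reverse KL, the helper names (`reverse_kl`, `on_policy_distill_loss`), and the toy `toy_student` / `toy_teacher` functions are all assumptions made for this example and are not taken from the HY-Embodied-0.5 code.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at a single position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def on_policy_distill_loss(student_fn, teacher_fn, prompt, steps, rng):
    """Average per-token reverse KL along a trajectory sampled from the student.

    student_fn and teacher_fn are hypothetical callables that map a token
    prefix to next-token logits. The trajectory is "on-policy" because every
    next token is drawn from the student's own distribution.
    """
    tokens = list(prompt)
    losses = []
    for _ in range(steps):
        s_logits = student_fn(tokens)
        t_logits = teacher_fn(tokens)
        losses.append(reverse_kl(s_logits, t_logits))
        probs = [math.exp(lp) for lp in log_softmax(s_logits)]
        next_tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_tok)
    return sum(losses) / len(losses), tokens


def toy_teacher(tokens):
    # Hypothetical teacher: strongly prefers token (last + 1) mod 4.
    last = tokens[-1]
    return [3.0 if v == (last + 1) % 4 else 0.0 for v in range(4)]


def toy_student(tokens):
    # Hypothetical student: same preference as the teacher, but weaker.
    last = tokens[-1]
    return [1.0 if v == (last + 1) % 4 else 0.0 for v in range(4)]


loss, sampled = on_policy_distill_loss(
    toy_student, toy_teacher, prompt=[0], steps=5, rng=random.Random(0)
)
print(f"mean reverse KL: {loss:.4f}, sampled tokens: {sampled}")
```

In a real training loop this scalar would be differentiated with respect to the student's parameters. The sketch only computes the quantity, to show that the supervision signal is evaluated on student-generated prefixes rather than on a fixed dataset.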
