Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Abstract

We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. Existing VLAs often struggle with morphological heterogeneity and data scarcity. To address this, we propose a human-centric learning paradigm that treats human interaction traces as a universal "mother tongue" for physical interaction. To support this paradigm, we present UniHand-2.0, the largest embodied pre-training recipe to date, comprising over 35,000 hours of multimodal data across 30 distinct robotic embodiments. Our approach introduces a Unified Action Space that maps heterogeneous robot controls into semantically aligned slots, enabling low-resource robots to bootstrap skills from human data and high-resource platforms. Building on this human-centric foundation, we design a unified sequential modeling and multi-task pre-training paradigm to bridge human demonstrations and robotic execution. Architecturally, Being-H0.5 uses a Mixture-of-Transformers design featuring a novel Mixture-of-Flow (MoF) framework that decouples shared motor primitives from specialized embodiment-specific experts. Finally, to make cross-embodiment policies stable in the real world, we introduce Manifold-Preserving Gating for robustness under sensory shift and Universal Async Chunking, which extends chunked control to embodiments with different latency and control profiles. Empirically, Being-H0.5 achieves state-of-the-art results on simulated benchmarks such as LIBERO (98.9%) and RoboCasa (53.9%), and exhibits strong cross-embodiment capabilities on five robotic platforms.
