Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Abstract

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap persists between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies and achieves zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment uses a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
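The separation between deterministic data collection and model-side exploration can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the 1-D linear dynamics, the linear policy, the names `policy_mean`, `real_env_step`, `fit_world_model`, and `imagine`, and the idea of treating a known actuator gain as the "physics" prior. It only shows the data flow: noise-free actions in the real environment, noisy actions inside the learned model.

```python
import random

ACTION_GAIN = 0.1  # assumed-known actuator gain: the "physics" prior of this toy


def policy_mean(state, gain):
    """Deterministic action of a hypothetical linear policy."""
    return -gain * state


def real_env_step(state, action):
    """Stand-in for the real robot: toy 1-D dynamics, not from the paper."""
    return 0.9 * state + ACTION_GAIN * action


def collect_real_data(gain, steps, state=1.0):
    """Collect transitions in the new environment without action noise."""
    data = []
    for _ in range(steps):
        action = policy_mean(state, gain)  # no exploration noise on hardware
        next_state = real_env_step(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data


def fit_world_model(data):
    """Fit only the unknown decay term by least squares.

    The action term is fixed by the prior, so the fit stays well posed even
    though deterministic actions are perfectly correlated with the state.
    """
    num = sum((ns - ACTION_GAIN * a) * s for s, a, ns in data)
    den = sum(s * s for s, _, _ in data)
    return num / den


def imagine(decay, gain, start_states, sigma, horizon, seed=0):
    """Roll out a noisy policy inside the fitted model, never on the robot."""
    rng = random.Random(seed)
    rollouts = []
    for state in start_states:
        for _ in range(horizon):
            action = policy_mean(state, gain) + rng.gauss(0.0, sigma)
            next_state = decay * state + ACTION_GAIN * action
            rollouts.append((state, action, next_state))
            state = next_state
    return rollouts


if __name__ == "__main__":
    real = collect_real_data(gain=0.5, steps=20)
    decay = fit_world_model(real)
    starts = [s for s, _, _ in real]
    imagined = imagine(decay, 0.5, starts, sigma=0.3, horizon=5)
    print(f"fitted decay: {decay:.3f}, imagined transitions: {len(imagined)}")
```

In an actual finetuning loop, the imagined transitions would feed the policy update, while the real-environment buffer would only be used to refit the model. That wiring is omitted here because the abstract does not specify it.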
