Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

January 29, 2026
Authors: Weidong Huang, Zhehan Li, Hangxin Liu, Biao Hou, Yao Su, Jingwen Zhang
cs.AI

Abstract

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
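
As a rough illustration of the recipe described in the abstract, the sketch below shows (a) an off-policy pretraining loop with a large batch size and a high update-to-data (UTD) ratio, and (b) a finetuning loop in which the new environment is rolled out with the deterministic policy while stochastic exploration happens only inside a learned world model. This is a minimal sketch under assumed interfaces: the names `agent`, `env`, `world_model`, `replay_buffer`, and all hyperparameter values are placeholders for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch, not the authors' implementation. `agent`, `env`,
# `world_model`, and `replay_buffer` are assumed duck-typed interfaces;
# batch size, UTD ratio, and horizon values are illustrative only.

def pretrain_sac(agent, env, replay_buffer,
                 total_env_steps=1_000_000,
                 batch_size=8192,   # "large-batch update"
                 utd_ratio=8):      # gradient updates per environment step
    """Off-policy SAC pretraining with large batches and a high UTD ratio."""
    obs = env.reset()
    for _ in range(total_env_steps):
        # Stochastic actions are acceptable here: data comes from large-scale
        # parallel simulation, where exploration carries no physical risk.
        action = agent.sample_action(obs)
        next_obs, reward, done = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        # High update-to-data ratio: many SAC updates per collected transition.
        for _ in range(utd_ratio):
            agent.update(replay_buffer.sample(batch_size))


def finetune_with_world_model(agent, new_env, world_model, replay_buffer,
                              adaptation_steps=20_000,
                              imagination_horizon=15,
                              imagined_rollouts_per_step=64):
    """Model-based finetuning: deterministic rollouts in the new environment,
    with stochastic exploration confined to the learned world model."""
    obs = new_env.reset()
    for _ in range(adaptation_steps):
        # Real data collection uses the deterministic policy, avoiding risky
        # random exploration in the new or out-of-distribution setting.
        action = agent.deterministic_action(obs)
        next_obs, reward, done = new_env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        obs = new_env.reset() if done else next_obs

        # Refine the (physics-informed) world model on the real transitions.
        world_model.update(replay_buffer)

        # Exploration happens only in imagination: stochastic actions are
        # rolled out inside the world model, and the imagined transitions
        # are used to improve the policy.
        for _ in range(imagined_rollouts_per_step):
            sim_obs = replay_buffer.sample_state()
            for _ in range(imagination_horizon):
                sim_action = agent.sample_action(sim_obs)  # stochastic
                sim_next_obs, sim_reward = world_model.step(sim_obs, sim_action)
                agent.update_from_imagination(
                    sim_obs, sim_action, sim_reward, sim_next_obs)
                sim_obs = sim_next_obs
```

The point the sketch tries to convey is that, during adaptation, randomness enters only through the imagined rollouts: the real system sees deterministic actions, so exploratory coverage is preserved without exposing the robot or the new environment to random exploration.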