
Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

March 17, 2026
作者: Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, Zongqing Lu
cs.AI

Abstract

Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.
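The weighting pipeline the abstract describes — softmax identification posterior over a candidate pool, posterior-to-uniform ratio as the PTR score, then a clipped-and-mixed weight applied through self-normalized regression — can be sketched in NumPy. This is an illustrative reconstruction, not the paper's code: the function name `ptr_weights`, the clip bound, and the mixing coefficient are all assumptions, and the transition-scorer logits are taken as given.

```python
import numpy as np

def ptr_weights(scores, clip_max=5.0, mix_alpha=0.5):
    """Hypothetical sketch of PTR-style sample weighting.

    scores: (B, K) transition-scorer logits per sample; column 0 scores
    the true post-action target, columns 1..K-1 the mismatched candidates
    inserted into the pool.
    """
    B, K = scores.shape
    # Softmax identification posterior over the K candidate target indices.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    posterior = exp / exp.sum(axis=1, keepdims=True)
    # PTR score: posterior mass on the true target relative to uniform 1/K.
    ptr_score = posterior[:, 0] * K
    # Clip the score, then mix with a constant weight for conservatism
    # (clip bound and mixing coefficient are illustrative choices).
    w = mix_alpha * np.clip(ptr_score, 0.0, clip_max) + (1.0 - mix_alpha)
    # Self-normalize so the weights average to 1 over the batch; these
    # would multiply each sample's action-regression loss term.
    return w * B / w.sum()
```

A sample whose observed consequence is easy to attribute (high true-target logit) gets up-weighted relative to one whose consequence is indistinguishable from the mismatched candidates, which is the credit-reallocation behavior the abstract describes; because only the scorer's logits enter, no tractable policy likelihood is needed.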