Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting
March 17, 2026
Authors: Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, Zongqing Lu
cs.AI
Abstract
Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.
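The weighting pipeline described above (softmax identification posterior over a candidate pool, posterior-to-uniform ratio, clipped-and-mixed weight, self-normalized weighted regression) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the candidate layout (true target in column 0), and the exact clip/mix form (`alpha`, `clip_max`) are assumptions, since the abstract does not give the formulas.

```python
import numpy as np

def ptr_weights(scorer_logits, alpha=0.1, clip_max=5.0):
    """Turn transition-scorer logits into PTR sample weights.

    scorer_logits: (B, K) array; column 0 holds each sample's true
    post-action target, columns 1..K-1 hold mismatched candidates.
    alpha / clip_max are illustrative mixing and clipping constants.
    """
    B, K = scorer_logits.shape
    # Softmax identification posterior over the K candidate targets.
    z = scorer_logits - scorer_logits.max(axis=1, keepdims=True)
    posterior = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # PTR score: posterior mass on the true target divided by the
    # uniform baseline 1/K.
    score = posterior[:, 0] * K
    # Clipped-and-mixed weight: cap extreme scores, then mix in a
    # uniform floor so no sample's supervision is fully discarded.
    weights = (1.0 - alpha) * np.clip(score, 0.0, clip_max) + alpha
    # Self-normalize so the mean weight over the batch is 1.
    return weights * B / weights.sum()

def weighted_action_loss(per_sample_loss, weights):
    # Self-normalized weighted regression on the original action
    # objective (e.g. a diffusion or flow-matching denoising loss).
    return float(np.mean(weights * per_sample_loss))
```

A sample whose post-action consequence the scorer identifies confidently (high posterior on the true target) receives a weight above 1, while a sample whose consequence is indistinguishable from the mismatched candidates drifts toward the uniform floor; because the weights act on the regression loss rather than a policy likelihood, the action head's density never needs to be evaluated.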