Reinforcement Learning-Guided Retrieval and Soft Fusion: Robust Multimodal Imitation Learning under Missing Modalities

Abstract

Robotic systems perceive the world through multiple input modalities, including visual camera streams and natural language instructions, and must select appropriate actions based on these signals. However, assuming that every input device is always available is unrealistic: sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement-learning-guided imitation learning method that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations in a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation (PPO) over Breadth-First Search (BFS) candidate sets, ranks candidate demonstrations, and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations in the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors, without any retraining of the system. Experiments on three LIBERO benchmark suites show that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout, while requiring no policy network training. The code is available at https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera
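
To make the retrieve-then-fuse idea concrete, the following is a minimal illustrative sketch of soft attention over a small set of top-ranked library entries. It is not the RL4IL implementation: the function name soft_attention, the top_k parameter, and the use of scaled dot-product similarity as the ranking signal are all assumptions made for illustration. In the method described above the ranking comes from a PPO-trained retrieval policy, and the fusion and imputation heads presumably use learned projections, neither of which is modelled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_attention(query, keys, values, top_k=3):
    """Return the attention-weighted average of the values whose keys
    score highest against the query.

    ASSUMPTION: ranking uses scaled dot-product similarity as a stand-in
    for a learned retrieval policy; there are no learned projections.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    # Keep the indices of the top_k highest-scoring library entries.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in ranked])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, ranked)) for d in range(dim)]


# Hypothetical toy library: three demonstrations with 2-D embeddings.
demo_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
demo_actions = [[1.0], [2.0], [3.0]]
observation = [1.0, 0.0]

# Action fusion: query with the observation, average the retrieved actions.
fused_action = soft_attention(observation, demo_embeddings, demo_actions, top_k=2)
```

The same function can stand in for soft imputation under the same assumptions: the query is the embedding of a modality that is still available (for example, the language instruction), the keys are the donors' embeddings of that same modality, and the values are the donors' embeddings of the missing modality (for example, a camera view). The weighted average then serves as the reconstructed embedding.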