Reinforcement-Learning-Guided Retrieval and Soft Fusion: Robust Multimodal Imitation Learning under Missing Modalities

Abstract

Robotic systems perceive the world through multiple input modalities, including visual camera streams and natural language instructions, and must select appropriate actions based on these signals. However, assuming that all input devices remain permanently available is unrealistic: sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement-learning-guided method for imitation learning that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations in a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks candidate demonstrations, and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors, without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions, while requiring no policy network training. The code can be found at https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera
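The two soft heads described in the abstract can be illustrated with a minimal sketch. This is not the RL4IL implementation: it assumes plain scaled dot-product attention with no learned projections, and it takes the ranked candidates or donors as already given, so the PPO-trained retrieval policy is not modelled. The function names (`soft_attend`, `fuse_actions`, `impute_missing`), the `temperature` parameter, and the array shapes are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def soft_attend(query, keys, values, temperature=1.0):
    """Attend from one query vector over k retrieved items.

    query:  (d,)   embedding of the current observation
    keys:   (k, d) embeddings of the retrieved demonstrations
    values: (k, m) quantities to aggregate (actions or embeddings)
    Returns the attention-weighted sum of values and the weights.
    """
    d = query.shape[-1]
    scores = keys @ query / (np.sqrt(d) * temperature)
    weights = softmax(scores)
    return weights @ values, weights


def fuse_actions(obs_emb, cand_embs, cand_actions):
    """Hypothetical soft fusion: blend the actions of ranked candidates."""
    return soft_attend(obs_emb, cand_embs, cand_actions)


def impute_missing(available_emb, donor_available_embs, donor_missing_embs):
    """Hypothetical soft imputation: rebuild the missing-modality embedding
    from donors, matching on the modality that is still available."""
    return soft_attend(available_emb, donor_available_embs, donor_missing_embs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(size=8)             # observation embedding (assumed d = 8)
    cand = rng.normal(size=(5, 8))       # five ranked candidate embeddings
    actions = rng.normal(size=(5, 7))    # their actions (assumed 7-D)
    action, weights = fuse_actions(obs, cand, actions)
    print(action.shape, float(weights.sum()))
```

Because both heads only re-weight stored demonstrations, this sketch needs no gradient updates at inference time. Whether the actual heads use learned query, key, and value projections is not stated in the abstract, so that part is left out here.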