Contrastive Example-Based Control
July 24, 2023
Authors: Kyle Hatch, Benjamin Eysenbach, Rafael Rafailov, Tianhe Yu, Ruslan Salakhutdinov, Sergey Levine, Chelsea Finn
cs.AI
Abstract
While many real-world problems might benefit from reinforcement learning, these
problems rarely fit into the Markov decision process (MDP) mold: interacting with
the environment is often expensive, and specifying reward functions is challenging.
Motivated by these challenges, prior work has developed data-driven approaches
that learn entirely from samples from the transition dynamics and examples of
high-return states. These methods typically learn a reward function from
high-return states, use that reward function to label the transitions, and then
apply an offline RL algorithm to these transitions. While these methods can
achieve good results on many tasks, they can be complex, often requiring
regularization and temporal difference updates. In this paper, we propose a
method for offline, example-based control that learns an implicit model of
multi-step transitions, rather than a reward function. We show that this
implicit model can represent the Q-values for the example-based control
problem. Across a range of state-based and image-based offline control tasks,
our method outperforms baselines that use learned reward functions; additional
experiments demonstrate improved robustness and scaling with dataset size.
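To make the abstract's central idea concrete, the sketch below illustrates one plausible reading of an "implicit model of multi-step transitions" whose scores can be treated as Q-values for example-based control: a contrastive critic that scores (state, action) pairs against candidate future states, trained with an in-batch classification loss, and queried against user-provided success-example states at decision time. The architecture, the InfoNCE-style objective, and all names here (ContrastiveCritic, contrastive_update, q_values_from_examples) are illustrative assumptions, not the authors' exact method or code.

```python
# Hypothetical sketch of a contrastive critic for example-based control.
# Shapes, layer sizes, and the training objective are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveCritic(nn.Module):
    """Scores how compatible a candidate future state is with a (state, action) pair."""

    def __init__(self, state_dim, action_dim, embed_dim=64):
        super().__init__()
        # One encoder for the (state, action) pair, one for the candidate future state.
        self.sa_encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        self.goal_encoder = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state, action, future_state):
        sa = self.sa_encoder(torch.cat([state, action], dim=-1))   # [B, D]
        g = self.goal_encoder(future_state)                        # [M, D]
        # Inner-product logits: large value ~ "future_state is reachable from (state, action)".
        return torch.einsum("id,jd->ij", sa, g)                    # [B, M]


def contrastive_update(critic, optimizer, state, action, future_state):
    """One InfoNCE-style update: diagonal pairs (s_i, a_i, s+_i) are positives,
    off-diagonal pairs within the batch serve as negatives."""
    logits = critic(state, action, future_state)        # [B, B]
    labels = torch.arange(logits.shape[0])               # positive index for each row
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def q_values_from_examples(critic, state, candidate_actions, success_examples):
    """At decision time, treat the critic's score against user-provided success
    examples as a proxy Q-value, averaged over the examples."""
    num_actions = candidate_actions.shape[0]
    state_rep = state.expand(num_actions, -1)             # state assumed shape [1, state_dim]
    logits = critic(state_rep, candidate_actions, success_examples)  # [num_actions, num_examples]
    return logits.mean(dim=-1)                            # one proxy Q-value per candidate action
```

In this reading, training only needs transition data (future states sampled from the same trajectories act as positives), while the high-return success examples enter only at evaluation time, which mirrors the abstract's claim that no reward function is learned or used to label transitions.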