Contrastive Example-Based Control

Abstract

While many real-world problems might benefit from reinforcement learning, these problems rarely fit into the MDP (Markov decision process) mold: interacting with the environment is often expensive, and specifying reward functions is challenging. Motivated by these challenges, prior work has developed data-driven approaches that learn entirely from samples of the transition dynamics and examples of high-return states. These methods typically learn a reward function from the high-return states, use that reward function to label the transitions, and then apply an offline RL algorithm to the labeled transitions. While these methods can achieve good results on many tasks, they can be complex, often requiring regularization and temporal difference updates. In this paper, we propose a method for offline, example-based control that learns an implicit model of multi-step transitions rather than a reward function. We show that this implicit model can represent the Q-values for the example-based control problem. Across a range of state-based and image-based offline control tasks, our method outperforms baselines that use learned reward functions; additional experiments demonstrate improved robustness and scaling with dataset size.
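The abstract does not spell out the training objective, so the following is only a rough sketch of what "an implicit model of multi-step transitions" could look like. It assumes an InfoNCE-style contrastive loss over precomputed representations. The function names `info_nce_loss` and `example_based_score`, the inner-product critic, and the exponentiated-similarity score are all hypothetical choices made for illustration, not the paper's stated implementation.

```python
import numpy as np


def info_nce_loss(sa_repr, future_repr):
    """Contrastive (InfoNCE-style) loss over a batch.

    Row i of sa_repr (a state-action representation) is treated as the
    positive pair of row i of future_repr (a representation of a state
    reached later in the same trajectory); all other rows are negatives.
    """
    logits = sa_repr @ future_repr.T                      # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))


def example_based_score(sa_repr, success_repr):
    """Hypothetical Q-value proxy (an assumption, not from the abstract):
    mean exponentiated similarity between each state-action representation
    and the representations of the high-return example states."""
    return np.exp(sa_repr @ success_repr.T).mean(axis=1)
```

Under these assumptions, no reward function is learned or used to label transitions: candidate actions are ranked by `example_based_score`, which scores them directly against the example states.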
