Debiased Model-based Representations for Sample-efficient Continuous Control

Abstract

Model-based representations have recently emerged as a promising framework that embeds latent dynamics information into the representations used for downstream off-policy actor-critic learning. The framework implicitly combines the advantages of model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These issues introduce biases into representation and actor-critic learning, leading to inferior performance. To address them, we propose Debiased model-based Representations for Q-learning, dubbed the DR.Q algorithm. In addition to minimizing the deviation between the representation of the current state-action pair and that of the next state, DR.Q explicitly maximizes the mutual information between them, and it samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
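
To make the representation objective described above more concrete, the following is a minimal sketch, not the DR.Q implementation. It assumes that the deviation term is a mean squared error and that mutual information is maximized through an InfoNCE-style contrastive bound, for which the mutual information is at least log(batch size) minus the loss. Neither choice is confirmed by the abstract, and the names `info_nce`, `mse`, `representation_loss`, `temperature`, and `mi_weight` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(pred, target, temperature=0.1):
    """InfoNCE loss (assumed MI estimator): row i of pred pairs with row i of target."""
    pred = [normalize(p) for p in pred]
    target = [normalize(t) for t in target]
    loss = 0.0
    for i, p in enumerate(pred):
        logits = [dot(p, t) / temperature for t in target]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # cross-entropy with the matching row as the label
    return loss / len(pred)


def mse(pred, target):
    """Mean squared deviation between predicted and actual next-state representations."""
    count = sum(len(p) for p in pred)
    total = sum((a - b) ** 2 for p, t in zip(pred, target) for a, b in zip(p, t))
    return total / count


def representation_loss(pred, target, mi_weight=1.0):
    # Minimizing the InfoNCE term raises a lower bound on the mutual information.
    return mse(pred, target) + mi_weight * info_nce(pred, target)


# Toy batch: predictions from (state, action) pairs versus next-state embeddings.
pred = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = representation_loss(pred, pred)
shifted = representation_loss(pred, pred[1:] + pred[:1])  # mismatched pairs
```

In this toy batch the correctly paired representations give a much smaller loss than the mismatched ones, which is the behavior the combined objective is meant to reward.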
