The high sample complexity of reinforcement learning challenges its use in practice. A promising approach is to quickly adapt pre-trained policies to new environments. Existing methods for this policy adaptation problem typically rely on domain randomization and meta-learning, by sampling from some distribution of target environments during pre-training, and thus face difficulty on out-of-distribution target environments. We propose new model-based mechanisms that are able to make online adaptation in unseen target environments, by combining ideas from no-regret online learning and adaptive control. We prove that the approach learns policies in the target environment that can quickly recover trajectories from the source environment, and establish the rate of convergence in general settings. We demonstrate the benefits of our approach for policy adaptation in a diverse set of continuous control tasks, achieving the performance of state-of-the-art methods with much lower sample complexity.

Why adaptation?

It is easy to train a good policy in an environment that we have control over. But what if our ultimate goal is a good policy that works in another environment that we have very limited knowledge of? Most of the time, directly deploying the policy at hand in an unknown environment will result in failure, because reinforcement learning algorithms are not that robust. This is why we need an extra step to adapt the policy to the other environment.

Overall Algorithm:

One way to ensure the performance of the adapted policy is to make it recover the source policy's trajectories. To do this, it suffices to learn the transition dynamics of the unfamiliar environment. We propose a method that uses the Data Aggregation scheme, and we prove the rate of convergence of our algorithm. In practice, we train a neural network that predicts the deviation between the two environments and use the cross-entropy method to find the action that minimizes the deviation.
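As a rough illustration of the action-selection step, the sketch below runs a basic cross-entropy method over a deviation score. It is not the implementation from the paper: the function name `cem_select_action`, its hyperparameters, and the toy `deviation_fn` are all assumptions made for this example. In particular, `deviation_fn` is a hand-written stand-in for the learned deviation network, built from a hypothetical target environment whose actuators are half as strong as the source's.

```python
import numpy as np

def cem_select_action(deviation_fn, action_dim, low=-1.0, high=1.0,
                      n_samples=200, n_elite=20, n_iters=10, seed=0):
    """Cross-entropy method: search for an action with low predicted deviation."""
    rng = np.random.default_rng(seed)
    mean = np.full(action_dim, (low + high) / 2.0)
    std = np.full(action_dim, (high - low) / 2.0)
    for _ in range(n_iters):
        # Sample candidate actions from the current Gaussian and clip to bounds.
        samples = rng.normal(mean, std, size=(n_samples, action_dim))
        actions = np.clip(samples, low, high)
        # Score each candidate by its predicted deviation (lower is better).
        costs = np.array([deviation_fn(a) for a in actions])
        elite = actions[np.argsort(costs)[:n_elite]]
        # Refit the sampling distribution to the lowest-cost candidates.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean

# Toy setup (hypothetical): the source moves the state by a, the target by 0.5 * a.
state = np.array([0.0, 0.0])
source_action = np.array([0.4, -0.2])
source_next_state = state + source_action

def deviation_fn(action):
    # Stand-in for the learned model: squared distance between the
    # target's next state and the source's next state.
    target_next_state = state + 0.5 * action
    return float(np.sum((target_next_state - source_next_state) ** 2))

action = cem_select_action(deviation_fn, action_dim=2)
print(action)
```

In this toy case the search should settle near twice the source action, since that is what compensates for the weakened actuators. In the actual method the deviation would come from the trained network rather than from known dynamics.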

Example Results:

Original policy

Directly deploy the original policy

Adapted policy

We see above that our algorithm achieves successful adaptation. However, since we are adapting in an environment where interaction could be expensive, we need to adapt as efficiently as possible. The learning curves, which compare our method with other state-of-the-art policy adaptation methods, verify its efficiency. Our method can complete the adaptation in as few as 10 episodes in the unknown environment.