Sun Jan 19 2020

How to speed up Deep Reinforcement Learning by telling it what to do

Autonomous learning can save us tedious feature and control engineering, and hence a lot of work. It is able to attain optimality, and to retain it even in changing environments, such as anything in the real world. Isn’t that nice?

But methods such as Deep Reinforcement Learning (DRL) would be even better if they didn’t require humongous amounts of experience before becoming useful. In the words of Andrew Ng: “RL is a type of machine learning whose hunger for data is even greater than supervised learning. [...] There’s more work to be done to translate this to businesses and practice.”

So, for actual feasibility, we need to address the sample efficiency (how fast the algorithm learns). Otherwise, DRL is limited to the most expensive applications in the world, such as backflipping robots.

By advancing algorithms we can partly overcome the need for excessive training, but not completely. A fundamental issue remains: as it starts learning, a tabula-rasa agent does not have the slightest idea of how its goal is formulated. Nor does it have simple notions such as gravity, or the insight that things may break under impact. Its first attempts will always be extremely ignorant, no matter how smart the learning is.

On the other hand, humans are full of ideas, and many of these help us succeed at tasks even on our first try. We can thus realize crucial learning accelerations if we convey our insights to learning control agents.

This blog provides a rough-around-the-edges explainer of Predictive Probabilistic Merging of Policies (PPMP), the first algorithm to leverage directive human feedback (e.g. left/right) for DRL. You may think of it as a toolbox that enables your favourite deep actor-critic algorithm to process human feedback and converge much faster (it only needs to be off-policy, which is typically the case for actor-critic architectures). PPMP is demonstrated in combination with DDPG (Lillicrap et al., 2015), a seminal work for continuous action spaces.

The PPMP idea

So whilst humans have superior initial performance, their final performance is typically below what RL can achieve, because of poorer precision and longer reaction times. It is therefore reasonable to assume that the corrections, whose intended magnitude is not known, should become more subtle as training progresses. In PPMP, the imperfect actions of the agent are combined with the noisy feedback in a probabilistic way, with respect to the abilities of the agent and the trainer for the current state. This idea is derived from Losey & O’Malley (2018), who phrased it best: ‘When learning from corrections ... [the agent] should also know what it does not know, and integrate this uncertainty as it makes decisions’.

So in PPMP, the respective uncertainties (that reflect the abilities of the agent and the trainer) determine the magnitude of the corrections. It means that initial exploration is vigorous, whereas upon convergence of the learner, the corrections become more subtle such that the trainer can refine the policy. Assuming Gaussian distributions, this principle may (for the early learning stage) be depicted as:

where we observe a broad distribution for the still rather pristine policy, and a narrower one for the human feedback. The posterior estimate â of the optimal action is determined as

    â = a + h ∘ ê,

where h denotes the human feedback (a vector with entries -1, 0 or 1) and ê the estimated error,

    ê = c_l + G(c_u - c_l),

which has predefined bounds c_l and c_u on the correction, and G the actual tradeoff as a function of the covariances (something like a Kalman gain):

    G = Σ_a (Σ_a + Σ_h)^(-1)

For the bounding vectors c, the lower bound expresses a region of indifference (or precision) of the human feedback, and it is assumed that the RL algorithm will effectively find the local optimum within it. This is depicted by the truncated green distribution, and it furthermore ensures that corrections always have a significant effect. We may for example set this bound to a fraction of the action space. The upper bound may be used to control how aggressive the corrections can be. With the precision of the human feedback assumed to be a constant (multivariate) Gaussian, the only thing we still need is a way to obtain up-to-date estimates of the agent’s abilities. For that, we use the multihead approach from Osband et al. (2016).
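
To make this concrete, here is a minimal numpy sketch of the merging step. The names (merge_actions, sigma_agent, and so on) are illustrative rather than the paper’s notation, and state-dependence is left out.

    import numpy as np

    def merge_actions(a_policy, h, sigma_agent, sigma_human, c_low, c_high):
        """Blend the agent's action with directional human feedback.

        a_policy:      action proposed by the agent, shape (n,)
        h:             human feedback in {-1, 0, 1} per action dimension, shape (n,)
        sigma_agent:   covariance of the agent's action estimate, shape (n, n)
        sigma_human:   assumed covariance of the human feedback, shape (n, n)
        c_low, c_high: lower/upper bounds on the correction magnitude, shape (n,)
        """
        # Kalman-gain-like tradeoff between the agent's and the human's uncertainty
        G = sigma_agent @ np.linalg.inv(sigma_agent + sigma_human)
        # Estimated correction magnitude, interpolated between its bounds
        e_hat = c_low + G @ (c_high - c_low)
        # Apply the correction in the directions indicated by the human
        return a_policy + h * e_hat

    # Early in training (sigma_agent large) the correction approaches c_high;
    # once the agent becomes confident, it shrinks towards c_low.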

In actor-critic DRL, the critic and the actor are estimated using neural networks, which are a bunch of interconnected neurons. The architecture may vary, but if we consider a simple example where the state vector x has three elements and the action vector a has two, a two-layer neural net could look like this:

This commonplace architecture can be adapted to our need for a covariance. By making a couple of copies of the actor’s output layer (with different initial weights and training samples), we can obtain samples of the action estimate in a straightforward and efficient way. This also allows us to make inferences about the abilities of the agent by looking at the covariance between the action estimates. Observe the following scheme:

Again, under the assumption of things being Gaussian, the distribution over the optimal action estimate may now be computed as

    N(ā, Σ_a),

with ā the mean and Σ_a the covariance over all the heads. That leaves us with the question of which action to consider as the policy. With multimodality and temporal consistency in mind, it is best to use a designated head per episode.
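
As an illustration, such a multihead actor could be set up as follows. This is a PyTorch sketch with made-up layer sizes, not the exact network from the paper.

    import torch
    import torch.nn as nn

    class MultiheadActor(nn.Module):
        """A shared body with several copies of the output layer."""

        def __init__(self, state_dim=3, action_dim=2, hidden=64, n_heads=10):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            # Each head gets its own initial weights (and its own training samples)
            self.heads = nn.ModuleList(
                [nn.Linear(hidden, action_dim) for _ in range(n_heads)])

        def forward(self, state):
            z = self.body(state)
            # One action estimate per head: shape (n_heads, action_dim)
            return torch.stack([torch.tanh(head(z)) for head in self.heads])

    actor = MultiheadActor()
    actions = actor(torch.zeros(3))      # samples of the action estimate
    policy_action = actions[0]           # the designated head for this episode
    sigma_agent = torch.cov(actions.T)   # covariance between the heads' estimates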

Architecture

PPMP consists of five modules, where the fifth one walks and talks:

The ‘merging of policies’ explained above happens in the ‘Selector’ module, which also receives the given feedback. To make the algorithm more feedback efficient, a ‘Predictor’ module (with the same architecture as the critic) is trained on the corrected actions, such that it can provide us with rough directions: just what we need during the earliest stages of learning. Whilst these predicted corrected actions are very useful as a rough source of information, they will never be very accurate. Later, when the actor’s knowledge becomes of better quality, we want to crossfade the influence from the predictor to the actor. Yet exactly how and when this tradeoff is scheduled depends on the problem, the progress, the feedback, and most likely on the state as well. The beauty of using an actor-critic learning scheme is that we may resolve the question of how the actions are best interleaved using the critic: it can act as an arbiter here, as it has value estimates for all state-action pairs.
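
As a sketch of that arbitration (where critic, actor and predictor stand in for the respective networks; the actual scheduling in PPMP is more refined than this):

    def arbitrate(state, critic, actor, predictor):
        """Let the critic choose between the actor's and the predictor's action."""
        a_actor = actor(state)
        a_predictor = predictor(state)
        # The candidate with the higher value estimate is passed on to the Selector
        if critic(state, a_predictor) > critic(state, a_actor):
            return a_predictor
        return a_actor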

Results

PPMP is benchmarked against DDPG (pure DRL) and DCOACH (a deep learning approach that learns from corrective feedback only). As is customary in the domain of RL, we consider standard problems found in the OpenAI gym that require continuous control. The first environment, Mountaincar, requires driving a little cart up a hill to reach the flag. Because its engine is rather weak, the cart needs to be rocked back and forth a bit to gain momentum. The second environment is the Pendulum problem, again underactuated, where a red pole is to be swung to its upright equilibrium and balanced there. Both problems penalise the applied control action in their reward.
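
(For reference, these correspond to the continuous-control gym tasks that were registered as MountainCarContinuous-v0 and Pendulum-v0 at the time of writing.)

    import gym

    for env_id in ("MountainCarContinuous-v0", "Pendulum-v0"):
        env = gym.make(env_id)
        print(env_id, env.observation_space.shape, env.action_space.shape)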

Besides testing with actual human feedback, feedback is synthesised using a converged DDPG policy, such that the algorithms can be compared consistently and fairly without having to worry about the variance in human feedback. Below are the learning curves (top charts), where we desire to obtain maximum return as fast as possible for Mountaincar (left) and Pendulum (right). Just beneath is the amount of feedback as a fraction of the samples; less feedback means less work for the human, which is better.
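
A rough sketch of how such synthetic feedback can be generated; the threshold and the sign-flipping are illustrative choices, not the exact oracle from the paper.

    import numpy as np

    def synthetic_feedback(a_agent, a_oracle, threshold=0.1, error_rate=0.0):
        """Directional feedback (-1, 0 or 1 per dimension) from a converged policy.

        With probability error_rate the sign of the feedback is flipped, which is
        one way to emulate the erroneous feedback used in the robustness tests below.
        """
        diff = np.asarray(a_oracle) - np.asarray(a_agent)
        h = np.where(np.abs(diff) > threshold, np.sign(diff), 0.0)
        flip = np.random.rand(*h.shape) < error_rate
        return np.where(flip, -h, h)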

We see that DDPG (red) fails in Mountaincar. It rarely reaches the flag and therefore there is little reward. DCOACH (green) hardly solves the Pendulum problem. PPMP (we are blue) uses significantly less feedback but converges at least 5x faster and has superior final performance (on par with the oracle in purple). As an ablation study, the orange lines demonstrate PPMP without predictor (PMP). For both environments, the orange curves get more feedback, but perform worse. This demonstrates that the predictor module makes the teacher’s job even easier.

Although performance is cool, eventual applicability also depends on the robustness of the algorithms. Real-world problems and actual human feedback both feature real-world noise, and the above innovations are only meaningful if they can cope with it. With the oracle implementation, we can precisely emulate erroneous feedback to assess the robustness of our algorithms. For error rates up to 30%, we stick to the previous colouring, but different line styles will now indicate the applied error rate:

It is clear that DCOACH (green) cannot handle erroneous feedback: performance quickly drops to zero as the feedback becomes less perfect. Because PPMP also learns from environmental reward, it is able to eventually downplay the misguidance and fully solve the problem.

Everything so far has been based on simulated feedback. Now, what happens when we use actual human feedback, which suffers from errors and delays? Below, we observe the same tendencies: less feedback, faster learning, and greater final performance for both environments:

Last but not least, let us consider a typical use case where the teacher has limited performance and is not able to fully solve the problem itself. We now consider the Lunarlander environment, a cute little game (until you actually try) where a space pod needs to land between some flags (or crashes). We use an oracle that more or less knows how to hover, but has no clue how to safely come to a rest. The environment assigns a large negative reward to crashes, and 100 points for gentle landings. PPMP compares to DDPG as follows:

Note that using DDPG, the problem is not solved. The performance difference between PPMP and its teacher may seem small when we consider the reward points, but PPMP actually found the complete solution, thereby exceeding the performance of the teacher (purple).

Conclusion

Humans have better insight, whilst computers have greater precision and responsiveness. Their learning is therefore complementary, and we can overcome much of DRL’s struggle with sample efficiency by incorporating human feedback.

Directional feedback is particularly effective for this purpose. PPMP takes a probabilistic approach where directions given by the teacher directly affect the action selection of the agent. In addition, the corrections are predicted, such that the need for feedback remains manageable.

As a result, PPMP can be used in cases where DDPG or DCOACH fail. In other cases, learning is accelerated by 5-10x, or performance exceeds that of the trainer. In real-world applications, telling your DRL what to do can thus make the difference between a working robot and a broken one.

Further reading

  • A complete treatment of PPMP, including background on (deep) reinforcement learning, is found in my thesis titled Deep Reinforcement Learning with Feedback-based Exploration.
  • PPMP was presented at the 58th Conference on Decision and Control in Nice. We’ve made a 6-page paper for the proceedings. (Paywall? Arxiv).
  • The codebase of this study is hosted on github.com/janscholten/ppmp.
  • For general ‘insight’ see also the World Models study (Ha & Schmidhuber, 2018).

This blog was written by Jan Scholten, a Xomnia junior data scientist. Are you interested in becoming an expert in machine learning? With our challenging junior development program, you can put your knowledge into practice and learn even more from other data scientists. Check out this vacancy for more details about the program.