We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ϵ-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
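To make the recursion concrete, here is a minimal tabular sketch of an uncertainty-Bellman-style fixed-point iteration. It is only an illustration of the idea in the abstract, not the paper's algorithm: the function name solve_ube, the assumption of a known transition model P, the local uncertainty term, and the exploration bonus at the end are all assumptions introduced here.

<code python>
import numpy as np

def solve_ube(P, pi, local_uncertainty, gamma=0.99, iters=1000):
    """Fixed-point iteration for a tabular uncertainty Bellman equation (illustrative sketch).

    P                 : transition probabilities, shape (S, A, S) -- assumed known here
    pi                : fixed policy, shape (S, A)
    local_uncertainty : per state-action uncertainty nu(s, a), shape (S, A)
    Returns u(s, a); its fixed point is meant to upper-bound the variance of the
    estimated value, mirroring the role the UBE plays in the abstract.
    """
    S, A, _ = P.shape
    u = np.zeros((S, A))
    for _ in range(iters):
        next_u = (pi * u).sum(axis=1)                       # expected uncertainty at the next state under pi
        u = local_uncertainty + gamma ** 2 * (P @ next_u)   # gamma squared: variances compound, not std-devs
    return u

# Exploration could then replace epsilon-greedy by acting greedily on
# Q(s, a) + beta * sqrt(u(s, a)) for some scale beta (also an assumption).
</code>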
  
https://arxiv.org/abs/1805.11593 Observe and Look Further: Achieving Consistent Performance on Atari

A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of γ=0.999 (instead of γ=0.99), extending the effective planning horizon by an order of magnitude; and we ease the exploration problem by using human demonstrations that guide the agent towards rewarding states. When tested on a set of 42 Atari games, our algorithm exceeds the performance of an average human on 40 games using a common set of hyperparameters. Furthermore, it is the first deep RL algorithm to solve the first level of Montezuma's Revenge.
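The abstract does not spell out the operator, so the sketch below is only a plausible reconstruction: it assumes the commonly cited squashing function h(z) = sign(z)(√(|z|+1) − 1) + εz with ε = 10⁻², and the helper names (h, h_inv, transformed_td_target) are made up for illustration; consult the paper for the exact definition.

<code python>
import numpy as np

EPS = 1e-2  # small regularizer; this value is an assumption, not taken from the abstract

def h(z):
    """Squashing function assumed for the transformed Bellman operator."""
    return np.sign(z) * (np.sqrt(np.abs(z) + 1.0) - 1.0) + EPS * z

def h_inv(z):
    """Inverse of h, obtained by solving eps*s**2 + s - (1 + eps + |z|) = 0 for s = sqrt(|x| + 1)."""
    return np.sign(z) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(z) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2 - 1.0
    )

def transformed_td_target(reward, gamma, q_next_max):
    """TD target when Q is learned in h-space: map the bootstrap value back to
    reward-space, discount and add the reward, then squash again."""
    return h(reward + gamma * h_inv(q_next_max))
</code>

Learning Q in the squashed space keeps the targets in a narrow range even when raw returns span several orders of magnitude, which is the stated motivation for handling rewards of varying densities and scales.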