Bellman's Optimality Equation



In this article, we will walk through the concept behind Bellman's Optimality Equation and how it is used in Reinforcement Learning.

V(s) = max_a [ R(s,a) + γ V(s') ]   (Bellman's equation for a deterministic environment)

Bellman’s optimality equation:

V*(s) = Σ_a π(a|s) · Σ_{s'} P(s'|s,a) · [ E(r|s,a,s') + γ V*(s') ]

Bellman’s equation is one of the most important equations in reinforcement learning.
As we already know, reinforcement learning (RL) is a reward-driven approach that enables an intelligent agent to take actions in an environment in order to get the best rewards possible – it seeks to maximize the long-term reward.

Bellman’s equation estimates whether the actions taken really returned the maximum rewards expected. In other words, it hints at the kind of rewards we can expect if we take the best action now, based on our present state, and continue acting that way in future steps.

For a deterministic environment, Bellman’s equation can be read as follows:

  • The max_a function chooses the action that yields the maximum reward.
  • The discount rate γ is a hyper-parameter that tunes the decision process: it determines how valuable future rewards are relative to immediate ones, and so guides the agent on whether to focus on long-term or short-term rewards.
  • The rest of the equation centers on calculating the rewards.

Bellman’s equation is recursive – the value of a state is defined in terms of the value of its successor state, so the update repeats many times before producing the desired output.

Summary for the deterministic equation:
A step is taken, resulting in an action; the action leads to a new state; the new state gives a new reward, and an episode ends. At this point we backtrack – we apply the discount factor γ, use the max_a function to choose the action that gave the maximum reward, and calculate the rewards using the R(s,a) function.
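The backward-looking update described above can be sketched as a small value-iteration loop. The states, actions, rewards, and transitions below are made-up assumptions purely for illustration:

```python
# Value iteration on a toy deterministic environment (all numbers are illustrative).
GAMMA = 0.9  # discount factor

# A tiny 3-state chain: s0 --right--> s1 --right--> s2 (terminal).
next_state = {"s0": {"right": "s1"}, "s1": {"right": "s2"}}
reward     = {"s0": {"right": 1.0},  "s1": {"right": 2.0}}

V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
for _ in range(50):  # sweep until the values converge
    for s in next_state:
        # Bellman backup for a deterministic environment:
        # V(s) = max_a [ R(s,a) + gamma * V(s') ]
        V[s] = max(reward[s][a] + GAMMA * V[next_state[s][a]]
                   for a in next_state[s])

print(V)  # V(s1) = 2.0, and V(s0) = 1.0 + 0.9 * 2.0 = 2.8
```

Note how V(s0) already accounts for the discounted value of everything reachable after it – that is the recursion at work.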

Important terminologies used herein:
Policy π – this tells the agent the best action to take in a given state or situation.

Value function:

V^π(s) = E[ G_t | s_t = s ] – this estimates how suitable it is for an agent to be in a particular state, and how well the agent's actions suit that state or situation. The way an agent acts is directly influenced by the policy it follows; consequently, the value function is defined with respect to a policy.

**Discounted return**: this weights rewards that come sooner more heavily, instead of waiting for the full cumulative future reward to be returned. It scales future rewards by the discount rate γ, which takes values from 0 to 1, and helps the returned rewards converge. Let's review how the returns at successive time steps are related to each other:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ R_{t+4} + ...
G_t = R_{t+1} + γ ( R_{t+2} + γ R_{t+3} + γ² R_{t+4} + ... )
G_t = R_{t+1} + γ G_{t+1}

Hence, we can see that –

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
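The identity G_t = R_{t+1} + γ G_{t+1} can be checked numerically against the direct sum. The reward sequence and discount rate below are made-up values for illustration:

```python
GAMMA = 0.9                 # discount rate (illustrative value)
rewards = [1.0, 2.0, 3.0]   # R_{t+1}, R_{t+2}, R_{t+3} (made-up sequence)

# Direct sum: G_t = sum over k of gamma^k * R_{t+k+1}
g_direct = sum(GAMMA**k * r for k, r in enumerate(rewards))

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, computed back to front
g_recursive = 0.0
for r in reversed(rewards):
    g_recursive = r + GAMMA * g_recursive

print(g_direct, g_recursive)  # both equal 1 + 0.9*2 + 0.81*3 = 5.23
```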

Let’s now look at how to derive Bellman’s Optimality Equation:

Proof of Bellman’s Equation

V^π(s) = E[ G_t | s_t = s ]

where E is the expectation, G_t is the return at time t, and s_t is the state at time t;

π(a|s) = the policy – the probability of taking action a in state s;

G_t = the return of rewards:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ R_{t+4} + ...

and

γ = the discount rate, with 0 ≤ γ ≤ 1.

Now, considering G_t:

G_t = R_{t+1} + γ ( R_{t+2} + γ R_{t+3} + γ² R_{t+4} + ... )
G_t = R_{t+1} + γ G_{t+1}

Hence,

V^π(s) = E[ R_{t+1} + γ G_{t+1} | s_t = s ] – meaning: the expectation of the return, given that the state at time t is s.

Illustrative Diagram

[backup-diagram1: backup diagram showing the transition s → a → s']

Note: a|s means action a, given the state s.
What happens here is this:
The agent starts at a state s, follows a policy π, takes an action a, and goes into a new state s'.
The probability of the agent taking an action under the policy is

π(a|s),

The probability of landing in the next state s' is

P(s'|s,a)

where P is the transition probability – it describes the environment's dynamics.

Likewise, the probability of moving from s to s₅ by taking action a₂ is:

P_{s→s₅}^{a₂} = π(a₂|s) · P(s₅|s,a₂)

Furthermore, the probability of moving from s to any successor state s' is:

Ρ = Σ_{a∈A} π(a|s) · Σ_{s'∈S} P(s'|s,a)
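Since π(a|s) and P(s'|s,a) are probability distributions, this combined sum over all actions and successor states must equal 1. A quick numerical check, with made-up policy and transition probabilities for a single state s:

```python
# Made-up policy and dynamics for one state s, actions a1/a2, successors s1/s2.
policy = {"a1": 0.6, "a2": 0.4}              # pi(a|s)
P = {"a1": {"s1": 0.7, "s2": 0.3},           # P(s'|s,a)
     "a2": {"s1": 0.2, "s2": 0.8}}

# Total probability of reaching any s' from s:
# sum over a of pi(a|s) * sum over s' of P(s'|s,a)
total = sum(policy[a] * sum(P[a].values()) for a in policy)
print(total)  # valid distributions always give 1.0

# Probability of one specific path, s --a2--> s2:
p_s2_a2 = policy["a2"] * P["a2"]["s2"]  # pi(a2|s) * P(s2|s,a2) = 0.4 * 0.8
print(p_s2_a2)
```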

Our value function is:

V^π(s) = Ρ · E[ R_{t+1} + γ G_{t+1} | s_{t+1} = s' ]

Now, substituting for Ρ into the value function:

V^π(s) = Σ_a π(a|s) · Σ_{s'} P(s'|s,a) · E[ R_{t+1} + γ G_{t+1} | s_{t+1} = s' ]

Now, taking the expectation E inside the equation, we get

= E(r|s,a,s') + γ E[ G_{t+1} | s_{t+1} = s' ]

Going further, we know that

V^π(s') = E[ G_{t+1} | s_{t+1} = s' ]

so

E(r|s,a,s') + γ E[ G_{t+1} | s_{t+1} = s' ] = E(r|s,a,s') + γ V^π(s')

Consequently, Bellman’s equation for a policy π can then be written as:

V^π(s) = Σ_a π(a|s) · Σ_{s'} P(s'|s,a) · [ E(r|s,a,s') + γ V^π(s') ]
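This equation can be solved by applying it repeatedly as an update until the values stop changing (iterative policy evaluation). The 2-state MDP below – its policy, dynamics, and rewards – is entirely made up for illustration:

```python
# Iterative policy evaluation on a made-up 2-state MDP.
# The policy, transition probabilities, and rewards are illustrative assumptions.
GAMMA = 0.5

states  = ["s0", "s1"]
actions = ["a0", "a1"]
policy  = {s: {a: 0.5 for a in actions} for s in states}      # pi(a|s)
P       = {(s, a): {s2: 0.5 for s2 in states}
           for s in states for a in actions}                  # P(s'|s,a)
R       = 1.0                                                 # E(r|s,a,s'), constant

V = {s: 0.0 for s in states}
for _ in range(100):  # repeat the Bellman backup until convergence
    # V(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [ r + gamma * V(s') ]
    V = {s: sum(policy[s][a] *
                sum(P[(s, a)][s2] * (R + GAMMA * V[s2]) for s2 in states)
                for a in actions)
         for s in states}

print(V)  # with a constant reward of 1, V(s) converges to 1 / (1 - gamma) = 2.0
```

Because every transition here pays reward 1, the fixed point is the geometric series 1 + γ + γ² + ... = 1/(1−γ), which is a handy sanity check.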

Furthermore, we know that the optimal policy π* is the policy yielding the highest reward, and the optimal value function is

V*(s) = max_π V^π(s)

Finally, this leads to Bellman’s optimality equation –

V*(s) = Σ_a π(a|s) · Σ_{s'} P(s'|s,a) · { E(r|s,a,s') + γ V*(s') }

We could also look at this from the Q-function perspective. Let Q*(s,a) be the optimal Q-function; then

V*(s) = Σ_a π(a|s) · Q*(s,a)

Therefore, Bellman’s equation using the Q-function is –

Q*(s,a) = Σ_{s'} P(s'|s,a) · { E(r|s,a,s') + γ Σ_{a'} π(a'|s') · Q*(s',a') }
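The relationship between the two forms can be sketched in one step: compute Q(s,a) from the model and a known value function, then recover V(s) as the policy-weighted sum of the Q-values. All numbers below (policy, dynamics, rewards, and the assumed V) are made up for illustration:

```python
# One-step Q-values from a known state-value function:
#   Q(s,a) = sum_{s'} P(s'|s,a) * [ E(r|s,a,s') + gamma * V(s') ]
#   V(s)   = sum_a  pi(a|s) * Q(s,a)
# All numbers here are illustrative assumptions.
GAMMA = 0.5
states, actions = ["s0", "s1"], ["a0", "a1"]
policy = {a: 0.5 for a in actions}                    # pi(a|s0)
P = {(a, s2): 0.5 for a in actions for s2 in states}  # P(s'|s0,a)
R = 1.0                                               # E(r|s,a,s'), constant
V = {"s0": 2.0, "s1": 2.0}                            # assumed known value function

Q = {a: sum(P[(a, s2)] * (R + GAMMA * V[s2]) for s2 in states) for a in actions}
V_s0 = sum(policy[a] * Q[a] for a in actions)
print(Q, V_s0)  # Q(s0,a) = 1 + 0.5 * 2 = 2.0 for each action, so V(s0) = 2.0
```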

From the insight rendered thus far, you should now have a solid idea of Bellman's Optimality Equation and how it is used in Reinforcement Learning. Thank you for enjoying this article at OpenGenus.