In class I am learning about value iteration and Markov decision processes. We are working through the UC Berkeley Pac-Man project, so I am trying to write the value iteration code for it. As I understand it, value iteration means that on each iteration you visit every state, and then follow it down to a terminal state to get its value.
I have a feeling I am not right, because when I try that in Python I get a maximum recursion depth exceeded error. So I went back to the pseudo-code, which has a V_k[s] and a V_{k-1}[s'], which I had taken to mean the value of the current state and the value of the next state, but I must be missing something. So what is the significance of the k and the k-1?
My Code:
```python
def val(i, state):
    if mdp.isTerminal(state) or i == 0:
        return 0.0
    actionCost = {}
    for action in mdp.getPossibleActions(state):
        actionCost[action] = 0
        for (nextState, probability) in mdp.getTransitionStatesAndProbs(state, action):
            reward = mdp.getReward(state, action, nextState)
            actionCost[action] += probability * reward + discount * val(i - 1, nextState)
    return actionCost[max(actionCost, key=actionCost.get)]

for i in range(iterations):
    for state in mdp.getStates():
        self.values[state] = val(i, state)
```
Pseudo Code:
```
k ← 0
repeat
    k ← k + 1
    for each state s do
        V_k[s] = max_a Σ_{s'} P(s'|s,a) (R(s,a,s') + γ V_{k-1}[s'])
until ∀s |V_k[s] - V_{k-1}[s]| < θ
```
V_k and V_{k-1} are two successive iterates of the approximation of V. You could rewrite the pseudo-code as:
```
V ← 0
repeat
    V' ← V                       // keep a copy of the previous values
    for each state s do
        V[s] = max_a Σ_{s'} P(s'|s,a) (R(s,a,s') + γ V'[s'])
until ∀s |V[s] - V'[s]| < θ
return V
```
Note that the pseudo-code is not recursive. This is the idea of value iteration / dynamic programming, which makes it efficient:
- Compute the optimal value function for 0 time steps: V_0 = 0
- then compute the optimal value function for 1 time step: V_1 = H(V_0)
- then compute the optimal value function for 2 time steps: V_2 = H(V_1)
- then compute the optimal value function for 3 time steps: V_3 = H(V_2)
- …
- then compute the optimal value function for n time steps: V_n = H(V_{n-1})
with H defined as:
H(V) = max_a ( Pa [ Ra + γ V ] )
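As a concrete toy illustration of iterating V_k = H(V_{k-1}) from V_0 = 0, here is a small NumPy sketch of the operator H in its matrix form; the two-state, two-action MDP (the transition matrices `P` and reward vectors `R`) is made up for this example and is not the question's MDP:

```python
import numpy as np

# One transition matrix P[a] and expected-reward vector R[a] per action a
# (hand-picked numbers for a 2-state, 2-action MDP).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # action 0: stay put
              [[0.0, 1.0], [0.0, 1.0]]])    # action 1: move to state 1
R = np.array([[0.0, 0.0],                   # action 0 earns nothing
              [1.0, 0.0]])                  # action 1 earns 1 from state 0
gamma = 0.9

def H(V):
    # For each action a: Q_a = R_a + gamma * P_a @ V, then maximize over a.
    return np.max(R + gamma * (P @ V), axis=0)

# Iterate V_k = H(V_{k-1}) starting from V_0 = 0 -- no recursion anywhere.
V = np.zeros(2)
for _ in range(100):
    V = H(V)
print(V)   # converges to [1. 0.]: from state 0 the best move pays 1 once
```

Each pass reuses the vector from the previous pass, which is exactly the V_{k-1} in the pseudo-code.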
Your code is recursive (val calls val), which is what triggers the maximum recursion depth error: value iteration is not recursive but iterative. Your val function does this:

```python
actionCost[action] += probability * reward + discount * val(i - 1, nextState)
```

which recursively recomputes values that you had already computed at the previous iteration (k-1). val(i - 1, nextState) is in fact:

```python
self.previous_values[nextState]
```

(assuming you keep a copy of the previous values).
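To make the fix concrete, here is a minimal self-contained sketch of the iterative version. The MDP method names (getStates, isTerminal, getPossibleActions, getTransitionStatesAndProbs, getReward) follow the Berkeley project's interface, but the two-state MDP itself is invented for illustration. Note also that in the Bellman update the transition probability should multiply the whole (reward + discounted value) term, as in the pseudo-code, not just the reward:

```python
import collections

class ToyMDP:
    """Two states: from 'a' you can 'stay' (reward 0) or 'go' to the
    terminal state 'b' (reward 1). Stands in for the project's mdp object."""
    def getStates(self):
        return ['a', 'b']
    def isTerminal(self, state):
        return state == 'b'
    def getPossibleActions(self, state):
        return [] if self.isTerminal(state) else ['stay', 'go']
    def getTransitionStatesAndProbs(self, state, action):
        return [('a', 1.0)] if action == 'stay' else [('b', 1.0)]
    def getReward(self, state, action, nextState):
        return 1.0 if nextState == 'b' else 0.0

def value_iteration(mdp, discount, iterations):
    values = collections.defaultdict(float)      # V_0 = 0 everywhere
    for _ in range(iterations):
        previous = dict(values)                  # freeze V_{k-1} before updating
        for state in mdp.getStates():
            if mdp.isTerminal(state):
                continue                         # terminal states keep value 0
            # Bellman backup: one-step lookahead into the *previous* values,
            # with the probability multiplying the whole bracketed term.
            values[state] = max(
                sum(prob * (mdp.getReward(state, action, nextState)
                            + discount * previous.get(nextState, 0.0))
                    for nextState, prob in
                        mdp.getTransitionStatesAndProbs(state, action))
                for action in mdp.getPossibleActions(state))
    return values

values = value_iteration(ToyMDP(), discount=0.9, iterations=50)
print(values['a'])   # 'go' is optimal from 'a', so this converges to 1.0
```

Each outer pass is one application of H, and `previous` plays the role of V_{k-1}: two flat loops, no recursion, so the recursion depth error disappears.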