I’m attempting to implement PPO to beat CartPole-v2. I can get it to work if I keep things as A2C (that is, no clipped loss and a single epoch), but as soon as I switch to the clipped loss and more than one epoch it stops learning. I’ve been trying to find the issue in my implementation for about a week, but I can’t figure out what’s wrong.
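For reference, the clipped surrogate objective I’m trying to reproduce is the standard PPO one. A minimal sketch of my understanding of it (illustrative names, not my actual code):

```python
import torch

def clipped_surrogate_loss(new_probs, old_probs, advantages, epsilon=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s); the old policy is treated as fixed
    ratios = new_probs / old_probs.detach()
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    # PPO maximizes the elementwise minimum of the two surrogates,
    # so the loss is the negative mean of that minimum
    return -torch.min(surr1, surr2).mean()
```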
Full Code
Here is the function responsible for optimizing:
```python
def finish_episode():
    # Calculating losses and performing backprop
    R = 0
    saved_actions = actor.saved_actions
    returns = []
    epsilon = 0.3
    num_epochs = 1  # When num_epochs is greater than one my network won't learn

    # Discounted returns, computed back-to-front
    for r in actor.rewards[::-1]:
        R = r + 0.99 * R  # Gamma is 0.99
        returns.insert(0, R)

    # Normalize returns; eps is a small constant defined elsewhere to avoid dividing by zero
    returns = torch.tensor(returns, device=device)
    returns = (returns - returns.mean()) / (returns.std() + eps)

    old_probs, state_values, states, actions = zip(*saved_actions)
    old_probs = torch.stack(old_probs).to(device)
    state_values = torch.stack(state_values).to(device)
    states = torch.stack(states).to(device)
    actions = torch.stack(actions).to(device)

    advantages = returns - state_values.squeeze()

    for epoch in range(num_epochs):
        new_probs = actor(states).gather(1, actions.unsqueeze(-1)).squeeze()
        ratios = new_probs / old_probs
        surr1 = ratios * advantages
        surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
        # actor_loss = -torch.min(surr1, surr2).mean()  # When using this (clipped) loss my network won't learn
        actor_loss = -surr1.mean()

        actor_optimizer.zero_grad()
        actor_loss.backward(retain_graph=True)
        actor_optimizer.step()

        # Update the critic only after the last epoch
        if epoch == num_epochs - 1:
            critic_loss = F.smooth_l1_loss(state_values.squeeze(), returns)
            critic_optimizer.zero_grad()
            critic_loss.backward(retain_graph=False)
            critic_optimizer.step()

    del actor.rewards[:]
    del actor.saved_actions[:]
```
I’ve tried different hyperparameters and using GAE instead of full Monte Carlo returns/advantages, but combing through my code I still can’t see what’s wrong.
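The GAE variant I tried was along these lines (a rough sketch with illustrative names, not my exact code):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # values holds V(s_t) for every step plus one bootstrap value for the state
    # after the last step (0 if the episode terminated there)
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages
```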