I am studying ML and was trying to build a reinforcement learning agent for a Gymnasium environment. I already wrote a Q-learning implementation for a very basic and simple problem, and I decided to use the same algorithm on a slightly more complex environment such as CartPole.
I believe the algorithm is working as expected: the agent achieves fairly decent results, and those results improve or worsen depending on the number of episodes, the learning rate and epsilon. However, I noticed that if I run several tests after the agent has finished training, I get a different result on each test. When I test, I remove the epsilon probability of exploring, so I believe the agent should take the best possible action every time and should therefore obtain the same result on every trial; however, this does not happen. Did I misunderstand how the Q-learning algorithm works, or is it supposed to produce slightly different results each time?
This is my code:
import gymnasium as gym
import numpy as np
import random
# Hyperparameters
alpha = 0.05
gamma = 0.90
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.1
episodes = 10000
max_steps = 200
# Initialise environment
env = gym.make("CartPole-v1")
state_space = [20, 20, 50, 50] #cart_position, cart_velocity, pole_angle, pole_angular_velocity
q_table = np.zeros(state_space + [env.action_space.n])
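# For reference: this table has 20 * 20 * 50 * 50 * 2 = 2,000,000 entries, all initialised to zero.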
def discretize_state(state):
    """
    Discretizing the state space means converting the continuous observation values into
    discrete, finite bins. Continuous values can take infinitely many different values and
    would therefore be difficult to use as indices into a Q-table.
    This function takes a state holding the value of each dimension
    [cart position, cart velocity, pole angle, pole angular velocity]
    and returns a discretised tuple of bin indices rounded to the closest integer.
    """
    # Normalisation formula: (state - min) / (max - min). Returns a value between 0 and 1.
    normalised_state = (state - env.observation_space.low) / (env.observation_space.high - env.observation_space.low)
    # Scale the normalised values to the number of bins, then round each dimension to the closest integer bin index.
    discretized = np.round(normalised_state * (np.array(state_space) - 1)).astype(int)
    return tuple(discretized)
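# Illustrative sanity check (not part of the algorithm, safe to remove): the discretized
# reset state should be a 4-tuple of bin indices, each inside the ranges defined by state_space.
sample_obs, _ = env.reset()
print("Example discretized state:", discretize_state(sample_obs))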
# Q-learning loop algorithm
print("Training started:\n-----------------------------------\n")
for episode in range(episodes):
    state = discretize_state(env.reset()[0])
    total_reward = 0
    for step in range(max_steps):
        """
        Decide whether to explore or exploit based on epsilon. With probability epsilon the
        algorithm explores by taking a random possible action. With probability 1 - epsilon
        the algorithm takes the best possible action based on the q-values of the previously
        explored actions.
        As epsilon starts with a value of 1, the first actions will always be random.
        """
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        next_state, reward, done, _, _ = env.step(action)
        next_state = discretize_state(next_state)
        total_reward += reward
        # Q-learning update: Q(s, a) <- Q(s, a) + alpha * [R + gamma * max_a'(Q(s', a')) - Q(s, a)]
        best_next_action = np.argmax(q_table[next_state])
        td_target = reward + gamma * q_table[next_state][best_next_action]  # Temporal Difference target -> reward plus the discounted q-value of the best action in the next state
        td_error = td_target - q_table[state][action]  # Temporal Difference error -> difference between the TD target and the current q-value
        q_table[state][action] = q_table[state][action] + alpha * td_error  # Update the current q-value scaled by the learning rate (alpha)
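        # Worked example with made-up numbers (illustration only): if Q(s, a) = 0.5, reward = 1.0
        # and max_a' Q(s', a') = 0.8, then with gamma = 0.90 and alpha = 0.05:
        #   td_target = 1.0 + 0.90 * 0.8 = 1.72
        #   td_error  = 1.72 - 0.5 = 1.22
        #   new Q(s, a) = 0.5 + 0.05 * 1.22 = 0.561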
        state = next_state
        if done:
            break
    # Reduce epsilon by the decay rate to gradually reduce exploration and favour exploiting previous experience
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
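    # With this decay rate epsilon reaches epsilon_min after roughly log(0.1) / log(0.995) ≈ 460
    # episodes, so the large majority of the 10000 training episodes run with only 10% exploration.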
    print(f"Episode {episode + 1}: Total reward: {total_reward}")
print("Training finished.")
# Test the trained agent
print("Testing:\n----------------------------------\n")
for episode in range(10):
    state = discretize_state(env.reset()[0])
    total_reward = 0
    for step in range(max_steps):
        action = np.argmax(q_table[state])
        next_state, reward, done, _, _ = env.step(action)
        state = discretize_state(next_state)
        total_reward += reward
        if done:
            print(f"Episode {episode + 1} - Total reward: {total_reward}")
            break
env.close()
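For comparison, here is a minimal sketch of the same greedy test but with a fixed reset seed, so that every trial starts from the same initial observation (the eval_env name and the seed value 123 are arbitrary choices; the sketch reuses q_table, discretize_state and max_steps from the script above):
eval_env = gym.make("CartPole-v1")
for trial in range(3):
    obs, _ = eval_env.reset(seed=123)       # same seed -> same initial observation every trial
    state = discretize_state(obs)
    total_reward = 0
    for step in range(max_steps):
        action = np.argmax(q_table[state])  # purely greedy, no exploration
        obs, reward, done, _, _ = eval_env.step(action)
        state = discretize_state(obs)
        total_reward += reward
        if done:
            break
    print(f"Seeded trial {trial + 1} - Total reward: {total_reward}")
eval_env.close()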
After training, an example batch of results from the test loop in my script looks like this:
Episode 1 - Total reward: 38.0
Episode 2 - Total reward: 46.0
Episode 3 - Total reward: 48.0
Episode 4 - Total reward: 62.0
Episode 5 - Total reward: 48.0
Episode 6 - Total reward: 65.0
Episode 7 - Total reward: 44.0
Episode 8 - Total reward: 69.0
Episode 9 - Total reward: 59.0
Episode 10 - Total reward: 16.0