I am currently working on a university project in which I apply DQN to a warehouse storage allocation problem. I finished programming the Markov Decision Process and the full DQN last week, and it runs. The loss values indicate learning progress (they are being minimised), but this does not translate into better rewards: the reward values keep hovering around the same level without any sign of improvement.
A bit of context on the MDP: the MDP consists of a warehouse in which every product has its own location. A dataset containing the demand per day of every product is used as input. The MDP simulates 365 days, and whenever a product is out of stock a choice is offered: do we want to move it to a different location, and if so, to which one (the possible locations define the action space)? Locations are then swapped and the simulation continues. The observation space consists of the demand of all items and their current locations. The reward function is {1 − the normalised function below}, where occurrences(p_i) is the demand of product i, dist(p_i) is the distance of its current location, and maxDist is the maximum possible distance:
retrieved from: https://doi.org/10.1145/3594300.3594314
What I am aiming for is similar to the approach in the referenced article (https://doi.org/10.1145/3594300.3594314).
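Roughly, the reward computation looks like this in code. This is only a sketch: I am assuming here that the normalisation divides the demand-weighted distance by the total demand times maxDist so the value lands in [0, 1]; the exact normalisation is the formula from the paper linked above.

import numpy as np

def compute_reward(demand, dist, max_dist):
    # demand[i] = occurrences(p_i), dist[i] = dist(p_i)
    # Assumed normalisation: divide by the largest possible weighted distance
    weighted_distance = np.sum(demand * dist)
    normaliser = np.sum(demand) * max_dist
    return 1.0 - weighted_distance / normaliser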
To solve the issue, I have tried many things: experimenting with the hyperparameters, adding gradient clipping to escape local optima, shrinking the observation space, and working with a smaller dataset. Right now I am lost and no longer sure where the problem lies. If someone could point me in the right direction, that would be really awesome! Below I have added an extra note and my code.
Note: one thing I noticed: in the `train_inner` function, on the line where target_values is updated, the targets are computed with the Bellman equation from next_state_values, self.gamma, and reward_batch. next_state_values starts at values around 200 and slowly decreases until it stabilises at around 20-25. reward_batch, however, comes from a normalised function, so its values lie between 0 and 1. That is insignificant compared to next_state_values, so the impact of the rewards on target_values is minimal, effectively negligible.
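To make the scale mismatch concrete, a small numerical illustration (the numbers are only in the range I observed, not actual output from my run):

gamma = 0.9
reward = 0.8                 # reward_batch values lie in [0, 1]
next_state_value = 200.0     # typical magnitude of next_state_values early in training
target = next_state_value * gamma + reward
print(target)                # 180.8 -> the reward contributes well under 1% of the target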
DQN code:
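(The snippets below rely on a Transition tuple and a few imports that are not shown; they are assumed to look roughly like the standard PyTorch DQN setup:)

import math
import random
import time
from collections import namedtuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.tensorboard import SummaryWriter

# Assumed definition; the actual one is not included above
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))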
class ExperienceReplay(object):
    def __init__(self, capacity):
        # Construct the experience replay buffer and clear it
        self.capacity = capacity
        self.memory = []
        self.position = 0
    def push(self, *args):
        # While the memory isn't full yet, grow the list with a placeholder entry;
        # once it is full, old transitions are overwritten in circular order
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        # Store the new transition and advance the position pointer
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Randomly draw batch_size transitions from the memory
        return random.sample(self.memory, batch_size)

    def __len__(self):
        # Return the number of stored transitions
        return len(self.memory)
# Deep Q-network implementation
class DQN(nn.Module):
    def __init__(self, in_size, out_size):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(in_size, 64)  # this is 10×128, mat2
        self.layer2 = nn.Linear(64, 32)
        self.layer3 = nn.Linear(32, out_size)
        self.dropout = nn.Dropout(0.7)
    def forward(self, x):
        # x = Variable(torch.from_numpy(x).float().unsqueeze(0)).to(device)
        print('run')
        x = F.relu(self.layer1(x))
        x = self.dropout(F.relu(self.layer2(x)))
        x = F.relu(self.layer3(x))
        return x
class Runner():
    def __init__(self, dqn, loss, lr = 0.01, eps_start = 1, eps_end = 0.1, eps_decay = 10000000,
                 batch_size = 128, target_update = 5000, logs = "sWarehouse/model",
                 gamma = 0.9):
        # Create a writer to track and visualise training progress in TensorBoard
        self.writer = SummaryWriter(logs)
        # Store the log directory
        self.logs = logs
        # Assign the deep Q-network both as learner network and as target network
        self.learner = dqn
        self.target = dqn
        # Loading the learner's state dictionary into the target model would synchronise
        # the parameters (weights and biases) of the two networks
        #self.target.load_state_dict(self.learner.state_dict())
        # Set the target model to evaluation mode, which gives consistent, deterministic
        # outputs when making predictions or evaluating performance
        self.target.eval()
        # Choose the optimiser (Adam) and assign the learning rate to it
        self.optimizer = optim.Adam(self.learner.parameters(), lr = lr)
        # Assign the loss function (MSE) to a class attribute
        self.loss = loss
        # Set up replay memory to store past transitions (state, action, next state, reward)
        self.memory = ExperienceReplay(10000)
        # Store the constructor arguments as attributes so they are reachable everywhere in the class
        self.batch_size = batch_size
        self.eps_start = eps_start
        self.eps_end = eps_end
        self.eps_decay = eps_decay
        self.target_update = target_update
        self.gamma = gamma
        # Reset the step counter
        self.steps = 0
        # Create empty lists to plot later
        self.plots = {"Loss": [], "Reward": [], "Mean Reward": []}
    def select_action(self, state):
        # Update steps, which counts how often an action has been selected
        self.steps = self.steps + 1
        # Draw a random value to decide between exploration and exploitation
        sample = random.random()
        # Get the decayed epsilon threshold
        eps_thresh = self.eps_end + (self.eps_start - self.eps_end) * math.exp(-1 * self.steps / self.eps_decay)
        if sample > eps_thresh:
            with torch.no_grad():
                # Select the greedy action with the maximum predicted Q-value
                action = torch.argmax(self.learner(state)).view(1, 1)
            return action
        else:
            # Return a random action
            return torch.tensor([[random.randrange(env.action_space.n)]], device = device, dtype=torch.long)
    def train_inner(self):
        # Perform one optimisation step on the network
        # Skip training while the memory does not yet hold enough transitions for a batch
        if len(self.memory) < self.batch_size:
            return 0
        # Draw self.batch_size sample transitions from the replay memory
        sample_transitions = self.memory.sample(self.batch_size)
        # Reorganise the batch of transitions into one Transition of batches
        batch = Transition(*zip(*sample_transitions))
        # Concatenate the individual samples into batch tensors
        next_states = torch.cat(batch.next_state)
        state_batch = torch.cat(batch.state)
        action_batch = torch.cat(batch.action)
        reward_batch = torch.cat(batch.reward)
        # Predicted Q-values of the taken actions, computed by the learner network
        pred_values = self.learner(state_batch).gather(1, action_batch)
        # Placeholder tensor filled with zeros
        next_state_values = torch.zeros(self.batch_size, device = device)
        # Maximum Q-values of the next states, predicted by the target network
        next_state_values = self.target(next_states).max(1)[0].detach()
        # Bellman target: Q(s, a) = reward(s, a) + gamma * max_a' Q(s', a')
        target_values = next_state_values * self.gamma + reward_batch
        # Compute the loss between predicted values and target values
        loss = self.loss(pred_values, target_values.unsqueeze(1))
        # Reset the gradients before backpropagation
        self.optimizer.zero_grad()
        # Compute gradients to update the parameter values
        loss.backward()
        for param in self.learner.parameters():
            # Clip each gradient to the range [-1, 1]
            param.grad.data.clamp_(-1, 1)
        # Update the parameters
        self.optimizer.step()
        return loss
    def env_step(self, action):
        state, reward, terminated, truncated, info = env.step(action)
        # Assemble the observation vector. This is a small workaround but it suffices
        observation = state['demand']
        observation = np.append(observation, state['current_demand'])
        observation = np.append(observation, state['storage'])
        return torch.FloatTensor([observation]).to(device), torch.FloatTensor([reward]).to(device), terminated, truncated, info
    # Here, episodes = epochs
    def train(self, episodes=10000, smooth=10):
        # Time elapsed since the start
        elapsed_time = time.time() - start_time
        steps = 0
        # Empty list for the graphs
        average_reward = []
        # Number of steps within an episode/epoch
        replacements = 100000
        # Used to loop through all instances of the dataset
        instance = 0
        for episode in range(episodes):
            individual_reward = []
            individual_loss = []
            # Reset the per-episode variables
            c_loss = 0
            c_samples = 0
            rewards = 0
            self.loss_value_old = 0
            # Select the data file to use
            temp_file_name = "inst0001.dat"
            # Read and convert the data from the selected file
            initialise_environment(file_name = temp_file_name)
            # Initialise the environment and get its initial state
            state = env.reset()
            state = Variable(torch.from_numpy(state).float().unsqueeze(0)).to(device)
            # Loop through the predetermined number of steps
            for i in range(replacements):
                # Choose an action for the current state
                action = self.select_action(state)
                # Retrieve the new state, reward, and terminated/truncated flags from the environment
                next_state, reward, terminated, truncated, _ = self.env_step(action.item())
                # End the episode if the environment terminates or truncates
                if terminated == True or truncated == True:
                    next_state = None
                    # Continue with the next episode/epoch
                    break
                # Push the new transition into memory
                self.memory.push(state, action, next_state, reward)
                # Set the new state as the current state
                state = next_state
                # Compute the loss value
                loss = self.train_inner()
                # Add the current reward to the running total
                rewards += reward.detach().item()
                print(steps)
                steps += 1
                c_samples += self.batch_size
                if loss == 0:
                    c_loss += loss
                else:
                    if device == "cpu":
                        c_loss += loss.detach().numpy()
                    else:
                        loss_cpu = loss.cpu()
                        c_loss += loss_cpu.detach().numpy()
                individual_reward.append(reward.detach().item())
                if loss == 0:
                    individual_loss.append(loss)
                else:
                    if device == "cpu":
                        individual_loss.append(loss.detach().numpy())
                    else:
                        loss_cpu = loss.cpu()
                        individual_loss.append(loss_cpu.detach().numpy())
                # Periodically synchronise the target network parameters with the learner network
                if i % self.target_update == 0:
                    self.target.load_state_dict(self.learner.state_dict())
                # Update the elapsed time
                elapsed_time = time.time() - start_time
                # If the elapsed time exceeds the maximum runtime, exit the simulation
                if elapsed_time >= max_runtime:
                    print(elapsed_time)
                    break
            average_reward.append(rewards/replacements)
            if elapsed_time >= max_runtime:
                print(elapsed_time)
                break
        env.close()
    def run(self):
        # Import the test data
        temp_file_name = "inst0001.dat"
        initialise_environment(file_name = temp_file_name)
        rewards = 0
        state = env.reset()
        state = Variable(torch.from_numpy(state).float().unsqueeze(0)).to(device)
        for t in range(50):
            action = self.select_action(state)
            next_state, reward, terminated, truncated, _ = self.env_step(action.data[0].item())
            rewards += reward
            if terminated == True or truncated == True:
                break
def main():
    # Choose to use either CPU or GPU
    device_name = device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("[Device]\tDevice selected: ", device_name)
    # Import the data file to start the simulation with
    initialise_environment(file_name = "inst0001.dat")
    # Create the deep Q-network with the right layer sizes
    dqn = DQN(in_size = env.observation_space.shape[0], out_size = env.action_space.n).to(device)
    # If we're loading a model, load that deep Q-network
    dqn.load_state_dict(torch.load("XXX.pt"))
    # Use MSE as the loss function
    loss = nn.MSELoss()
    # Initialise the RL algorithm
    runner = Runner(dqn, loss, lr = args.lr, gamma = args.gamma, logs = "warehouse/%s" %time.time())
    if "train" in args.runtype:
        # Start the training algorithm
        print("[Train]\tTraining Beginning ...")
        runner.train(args.episodes)
if __name__ == '__main__':
    main()