DQN showing loss improvements but lacking reward improvements

I am currently working on a university project in which I apply DQN to solve a warehouse storage allocation problem. I finished programming the Markov decision process (MDP) and the full DQN last week, and the code runs. The loss values show an improvement (that is, the loss is being minimised). However, this improvement does not carry over to the rewards: the reward values keep hovering around the same level, without showing any sign of improvement.

A bit of context on the MDP: it consists of a warehouse in which every product has its own location. A dataset containing the daily demand of every product is used as input. The MDP simulates 365 days, and whenever a product is out of stock a choice is offered: do we want to move it to a different location and, if so, to which one (the possible locations form the action space)? The locations are then swapped and the MDP continues. The observation space is the demand of all items together with their current locations. The reward function is {1 − the normalised function from the paper below}, where occurrences(pi) is the demand of product i, dist(pi) is the distance of its current location, and maxDist is the maximum possible distance.

retrieved from: https://doi.org/10.1145/3594300.3594314

What I am aiming to do is similar to the approach described in that article (https://doi.org/10.1145/3594300.3594314).
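To make the reward concrete, here is a small sketch of the general shape of the computation (simplified; demand, dist and max_dist are illustrative names standing in for occurrences(pi), dist(pi) and maxDist, and the exact normalisation follows the formula in the paper):

import numpy as np

def compute_reward(demand, dist, max_dist):
    # demand[i] ~ occurrences(p_i), dist[i] ~ dist(p_i)
    # Normalised, demand-weighted travel cost, which lies in [0, 1]
    normalised_cost = np.sum(demand * dist) / (np.sum(demand) * max_dist)
    # Reward = 1 - normalised cost, so it also lies in [0, 1]
    return 1.0 - normalised_cost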

To solve the issue, I have tried many things: experimenting with the hyperparameters, adding gradient clipping to escape local optima, shrinking the observation space, and working with a smaller dataset. Right now I am lost and no longer sure where the problem might lie. If someone could point me in the right direction, that would be really awesome! Below I have added an extra note and my code.
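For reference, the gradient clipping I tried is per-element value clipping (the clamp_ loop in train_inner below); as far as I know it is equivalent to the built-in call in this minimal sketch, with norm clipping as an alternative I have not tried:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

# Equivalent to the clamp_(-1, 1) loop in train_inner below
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
# Alternative (not tried): clip by global norm instead of per element
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)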

Note: one thing I noticed: in the train_inner function, in the line where target_values is updated, the targets are computed with the Bellman equation using next_state_values, self.gamma and reward_batch as input. next_state_values starts at values around 200 and slowly decreases until it reaches values of around 20-25, where it stabilises. reward_batch, however, comes from a normalised function, so its values lie between 0 and 1. That is insignificant compared to next_state_values, so the impact of the rewards on target_values is minimal, arguably negligible.
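To make that scale issue concrete, here is the target computation with illustrative numbers (not exact values from my runs, just the orders of magnitude I observe):

gamma = 0.9
reward = 0.8             # normalised reward, always in [0, 1]
next_state_value = 25.0  # typical bootstrapped value once it has stabilised

target = next_state_value * gamma + reward
print(target)            # 23.3
print(reward / target)   # ~0.034, i.e. the reward contributes only a few percent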

DQN code:

class ExperienceReplay(object):
    def __init__(self, capacity):
        # Construct the experience replay buffer and clear it
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        # If memory isn't full yet, grow the list by one empty slot
        if len(self.memory) < self.capacity:
            self.memory.append(None)

        # Store the new transition (overwriting the oldest one once full)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Randomly obtain a batch_size number of samples from the memory
        return random.sample(self.memory, batch_size)

    def __len__(self):
        # Returns length of memory list
        return len(self.memory)

# Deep Q-network implementation
class DQN(nn.Module):
    def __init__(self, in_size, out_size):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(in_size, 64) # this is 10×128, mat2
        self.layer2 = nn.Linear(64, 32)
        self.layer3 = nn.Linear(32, out_size)
        self.dropout = nn.Dropout(0.7)

    def forward(self, x):
        # x = Variable(torch.from_numpy(x).float().unsqueeze(0)).to(device)
        print('run')
        x = F.relu(self.layer1(x))
        x = self.dropout(F.relu(self.layer2(x)))
        x = F.relu(self.layer3(x))
        return x

class Runner():
    def __init__(self, dqn, loss, lr = 0.01, eps_start = 1, eps_end = 0.1, eps_decay = 10000000,
                 batch_size = 128, target_update = 5000, logs = "sWarehouse/model",
                 gamma = 0.9):

        # Create class to write data for visualisation and tracking of training progress in Tensorboard
        self.writer = SummaryWriter(logs)
        # Assign logs variable values from OpenAI gym model created
        self.logs = logs

        # Assign deep Q network both as learner network and as target network
        self.learner = dqn
        self.target = dqn

        # Loads state dictionary from learner model and saves it to target model as state dictionary
        # Synchronises parameters of learning and target model
        # (typically contains learnable parameters, such as weights and biases, to restore model's state)
        #self.target.load_state_dict(self.learner.state_dict())

        # Sets target model in evaluation mode, ready for evaluation tasks
        # Ensures consistent and deterministic outputs when making predictions or evaluating performance
        self.target.eval()

        # Choose optimiser and assign learning rate to it (Adam in this case)
        self.optimizer = optim.Adam(self.learner.parameters(), lr = lr)

        # Assign loss function to class parameter (MSE in this case)
        self.loss = loss

        # Set up replay memory, to store past experiences (transitions) in (consisting e.g. of state, action, reward, next state)
        self.memory = ExperienceReplay(10000)

        # Assign input values of this class to class specific variables, to make them reachable everywhere within the class
        self.batch_size = batch_size
        self.eps_start = eps_start
        self.eps_end = eps_end
        self.eps_decay = eps_decay
        self.target_update = target_update
        self.gamma = gamma

        # Reset steps counter
        self.steps = 0

        # Create empty lists to plot later
        self.plots = {"Loss": [], "Reward": [], "Mean Reward": []}

    def select_action(self, state):
        # Update steps, which measures how often an action has been selected
        self.steps = self.steps + 1
        # Select a random value to determine exploration/exploitation
        sample = random.random()

        # Get a decayed epsilon threshold
        eps_thresh = self.eps_end + (self.eps_start - self.eps_end) * math.exp(-1 * self.steps / self.eps_decay)

        if sample > eps_thresh:
            with torch.no_grad():
                # Select the optimal action based on the maximum expected return
                action = torch.argmax(self.learner(state)).view(1, 1)
                return action
        else:
            # Return random action
            return torch.tensor([[random.randrange(env.action_space.n)]], device = device, dtype=torch.long)

    def train_inner(self):
        # Perform optimisation on network
        # Skip inner training if there is not enough memory available to perform training
        if len(self.memory) < self.batch_size:
            return 0

        # Returns a self.batch_size number of sample transitions
        sample_transitions = self.memory.sample(self.batch_size)

        # Organises a batch of transitions into batch variable
        batch = Transition(*zip(*sample_transitions))

        # Filter out None objects from the batch and turn them into a tensor
        # Creates tensor that contains only non-None elements from batch.next_state
        next_states = torch.cat(batch.next_state)

        # Create more tensors using all other inputs
        state_batch = torch.cat(batch.state)
        action_batch = torch.cat(batch.action)
        reward_batch = torch.cat(batch.reward)

        # Provides predictions of outputs, based upon the inputs state_batch and action_batch
        # This is done using the learner deep Q network
        pred_values = self.learner(state_batch).gather(1, action_batch)

        # Creates tensor with zeros as values
        next_state_values = torch.zeros(self.batch_size, device = device)

        # Update next_state_values with the maximum values predicted by the target deep Q network
        # Using next_states as input
        next_state_values = self.target(next_states).max(1)[0].detach()

        # Q(s, a) = reward(s, a) + Q(s_t+1, a_t+1) * gamma
        # Calculates target_values for current state-action pairs in batch
        target_values = next_state_values * self.gamma + reward_batch

        # Compute loss, which uses predicted values and target values as input
        loss = self.loss(pred_values, target_values.unsqueeze(1))

        # Reset gradients, so they are ready to use for backward propagation
        self.optimizer.zero_grad()

        # Compute gradients to update parameter values
        loss.backward()

        for param in self.learner.parameters():
            # Ensure gradient clipping for each gradient
            # Gradients will not be lower than -1 and not higher than 1
            param.grad.data.clamp_(-1, 1)

        # Update parameters
        self.optimizer.step()

        return loss

    def env_step(self, action):
        state, reward, terminated, truncated, info = env.step(action)

        # Put together observation. This is a small workaround but it suffices
        observation = state['demand']
        observation = np.append(observation, state['current_demand'])
        observation = np.append(observation, state['storage'])

        return torch.FloatTensor([observation]).to(device), torch.FloatTensor([reward]).to(device), terminated, truncated, info

    # For here, episodes = epochs
    def train(self, episodes=10000, smooth=10):
        # Update time elapsed since start
        elapsed_time = time.time() - start_time
        steps = 0
        # Create empty lists for graphs
        average_reward = []

        # Define number of steps within episode/epoch
        replacements = 100000

        # Used to loop through all instances of datasets
        instance = 0

        for episode in range(episodes):

            individual_reward = []
            individual_loss = []

            # Reset variable values
            c_loss = 0
            c_samples = 0
            rewards = 0
            self.loss_value_old = 0

            # Select data file to use
            temp_file_name = "inst0001.dat"

            # Obtain and convert data from selected file
            initialise_environment(file_name = temp_file_name)

            # Initialise environment and get its initial state
            state = env.reset()
            state = Variable(torch.from_numpy(state).float().unsqueeze(0)).to(device)

            for i in range(replacements):
                # Loop through predetermined number of steps
                # Choose action for current state
                action = self.select_action(state)

                # Retrieve new state, reward, and terminated/truncated values from environment after providing action input
                next_state, reward, terminated, truncated, _ = self.env_step(action.item())

                # Exit simulation if environment says to terminate or truncate
                if terminated == True or truncated == True:
                    next_state = None
                    # Continue to next episode/epoch
                    break

                # Push new transition into memory
                self.memory.push(state, action, next_state, reward)

                # Set new state as current state
                state = next_state

                # Compute loss value
                loss = self.train_inner()

                # Add current reward to total rewards variable
                rewards += reward.detach().item()

                print(steps)
                steps += 1

                c_samples += self.batch_size

                if loss == 0:
                    c_loss += loss #.detach().numpy()
                else:
                    if device == "cpu":
                        c_loss += loss.detach().numpy()
                    else:
                        loss_cpu = loss.cpu()
                        c_loss += loss_cpu.detach().numpy()

                individual_reward.append(reward.detach().item())
                if loss == 0:
                    individual_loss.append(loss)
                else:
                    if device == "cpu":
                        individual_loss.append(loss.detach().numpy())
                    else:
                        loss_cpu = loss.cpu()
                        individual_loss.append(loss_cpu.detach().numpy())

                # Synchronise target network parameters periodically with parameters of learning network
                if i % self.target_update == 0:
                    self.target.load_state_dict(self.learner.state_dict())

                # Update elapsed time
                elapsed_time = time.time() - start_time

                # If elapsed time is longer than max run time, exit simulation
                if elapsed_time >= max_runtime:
                    print(elapsed_time)
                    break

            average_reward.append(rewards/replacements)

            if elapsed_time >= max_runtime:
                print(elapsed_time)
                break

        env.close()


    def run(self):
        # Import test data
        temp_file_name = "inst0001.dat"
        initialise_environment(file_name = temp_file_name)

        rewards = 0
        state = env.reset()
        state = Variable(torch.from_numpy(state).float().unsqueeze(0)).to(device)

        for time in range(50):
            action = self.select_action(state)
            next_state, reward, terminated, truncated, _ = self.env_step(action.data[0].item())
            rewards += reward

            if terminated == True or truncated == True:
                break

def main():
    # Choose to use either CPU or GPU
    device_name = device = torch.device("cuda" if torch.cuda.is_available() else "cpu") #device #"cuda: %s" % (args.device) if torch.cuda.is_available() else "cpu"
    print("[Device]\tDevice selected: ", device_name)

    # Import data file to start simulation with
    initialise_environment(file_name = "inst0001.dat")

    # Create deep Q-network using right layer size input
    dqn = DQN(in_size = env.observation_space.shape[0], out_size = env.action_space.n).to(device)

    # If we're loading a model, load that deep Q-network
    dqn.load_state_dict(torch.load("XXX.pt"))

    # Use MSE loss function as loss function
    loss = nn.MSELoss()

    # Initialise RL algorithm
    runner = Runner(dqn, loss, lr = args.lr, gamma = args.gamma, logs = "warehouse/%s" % time.time())

    if "train" in args.runtype:
        # Start training algorithm
        print("[Train]\tTraining Beginning ...")
        runner.train(args.episodes)

if __name__ == '__main__':
    main()
