I am currently working on a university project in which I apply DQN to a warehouse storage allocation problem. I finished programming the Markov Decision Process and the full DQN last week, and it runs. The loss values indicate learning progress (they are being minimised), but this does not translate into better rewards: the reward values keep hovering around the same level without any sign of improvement.
A bit of context on the MDP: the MDP consists of a warehouse in which every product has its own location. A dataset containing the demand per day of every product is used as input. The MDP simulates 365 days, and whenever a product is out of stock a choice is offered: do we want to move it to a different location, and if so, to which one (the possible locations define the action space)? Locations are then swapped and the simulation continues. The observation space consists of the demand of all items and their current locations. The reward function is {1 − the normalised function below}, where occurrences(p_i) is the demand of product i, dist(p_i) is the distance of its current location, and maxDist is the maximum possible distance:
retrieved from: https://doi.org/10.1145/3594300.3594314
What I am aiming for is similar to the approach in the referenced article (https://doi.org/10.1145/3594300.3594314).
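Roughly, the reward computation looks like this in code. This is only a sketch: I am assuming here that the normalisation divides the demand-weighted distance by the total demand times maxDist so the value lands in [0, 1]; the exact normalisation is the formula from the paper linked above.

import numpy as np

def compute_reward(demand, dist, max_dist):
    # demand[i] = occurrences(p_i), dist[i] = dist(p_i)
    # Assumed normalisation: divide by the largest possible weighted distance
    weighted_distance = np.sum(demand * dist)
    normaliser = np.sum(demand) * max_dist
    return 1.0 - weighted_distance / normaliser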
To solve the issue, I have tried many things: experimenting with the hyperparameters, adding gradient clipping to escape local optima, shrinking the observation space, and working with a smaller dataset. Right now I am lost and no longer sure where the problem lies. If someone could point me in the right direction, that would be really awesome! Below I have added an extra note and my code.
Note: one thing I noticed: in the `train_inner` function, on the line where target_values is updated, the targets are computed with the Bellman equation from next_state_values, self.gamma, and reward_batch. next_state_values starts at values around 200 and slowly decreases until it stabilises at around 20-25. reward_batch, however, comes from a normalised function, so its values lie between 0 and 1. That is insignificant compared to next_state_values, so the impact of the rewards on target_values is minimal, effectively negligible.
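To make the scale mismatch concrete, a small numerical illustration (the numbers are only in the range I observed, not actual output from my run):

gamma = 0.9
reward = 0.8                 # reward_batch values lie in [0, 1]
next_state_value = 200.0     # typical magnitude of next_state_values early in training
target = next_state_value * gamma + reward
print(target)                # 180.8 -> the reward contributes well under 1% of the target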
DQN code:
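(The snippets below rely on a Transition tuple and a few imports that are not shown; they are assumed to look roughly like the standard PyTorch DQN setup:)

import math
import random
import time
from collections import namedtuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.tensorboard import SummaryWriter

# Assumed definition; the actual one is not included above
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))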
class ExperienceReplay(object):
    def __init__(self, capacity):
        # Construct the experience replay buffer and clear it
        self.capacity = capacity
        self.memory = []
        self.position = 0
    def push(self, *args):
        # While the memory isn't full yet, grow the list with a placeholder entry;
        # once it is full, old transitions are overwritten in circular order
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        # Store the new transition and advance the position pointer
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Randomly draw batch_size transitions from the memory
        return random.sample(self.memory, batch_size)

    def __len__(self):
        # Return the number of stored transitions
        return len(self.memory)
# Deep Q-network implementation
class DQN(nn.Module):
    def __init__(self, in_size, out_size):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(in_size, 64)  # this is 10×128, mat2
        self.layer2 = nn.Linear(64, 32)
        self.layer3 = nn.Linear(32, out_size)
        self.dropout = nn.Dropout(0.7)
    def forward(self, x):
        # x = Variable(torch.from_numpy(x).float().unsqueeze(0)).to(device)
        print('run')
        x = F.relu(self.layer1(x))
        x = self.dropout(F.relu(self.layer2(x)))
        x = F.relu(self.layer3(x))
        return x
class Runner():
    def __init__(self, dqn, loss, lr = 0.01, eps_start = 1, eps_end = 0.1, eps_decay = 10000000,
                 batch_size = 128, target_update = 5000, logs = "sWarehouse/model",
                 gamma = 0.9):
        # Create a writer to track and visualise training progress in TensorBoard
        self.writer = SummaryWriter(logs)
        # Store the log directory
        self.logs = logs
        # Assign the deep Q-network both as learner network and as target network
        self.learner = dqn
        self.target = dqn
        # Loading the learner's state dictionary into the target model would synchronise
        # the parameters (weights and biases) of the two networks
        #self.target.load_state_dict(self.learner.state_dict())
        # Set the target model to evaluation mode, which gives consistent, deterministic
        # outputs when making predictions or evaluating performance
        self.target.eval()
        # Choose the optimiser (Adam) and assign the learning rate to it
        self.optimizer = optim.Adam(self.learner.parameters(), lr = lr)
        # Assign the loss function (MSE) to a class attribute
        self.loss = loss
        # Set up replay memory to store past transitions (state, action, next state, reward)
        self.memory = ExperienceReplay(10000)
        # Store the constructor arguments as attributes so they are reachable everywhere in the class
        self.batch_size = batch_size
        self.eps_start = eps_start
        self.eps_end = eps_end
        self.eps_decay = eps_decay
        self.target_update = target_update
        self.gamma = gamma
        # Reset the step counter
        self.steps = 0
        # Create empty lists to plot later
        self.plots = {"Loss": [], "Reward": [], "Mean Reward": []}
    def select_action(self, state):
        # Update steps, which counts how often an action has been selected
        self.steps = self.steps + 1
        # Draw a random value to decide between exploration and exploitation
        sample = random.random()
        # Get the decayed epsilon threshold
        eps_thresh = self.eps_end + (self.eps_start - self.eps_end) * math.exp(-1 * self.steps / self.eps_decay)
        if sample > eps_thresh:
            with torch.no_grad():
                # Select the greedy action with the maximum predicted Q-value
                action = torch.argmax(self.learner(state)).view(1, 1)
            return action
        else:
            # Return a random action
            return torch.tensor([[random.randrange(env.action_space.n)]], device = device, dtype=torch.long)
    def train_inner(self):
        # Perform one optimisation step on the network
        # Skip training while the memory does not yet hold enough transitions for a batch
        if len(self.memory) < self.batch_size:
            return 0
        # Draw self.batch_size sample transitions from the replay memory
        sample_transitions = self.memory.sample(self.batch_size)
        # Reorganise the batch of transitions into one Transition of batches
        batch = Transition(*zip(*sample_transitions))
        # Concatenate the individual samples into batch tensors
        next_states = torch.cat(batch.next_state)
        state_batch = torch.cat(batch.state)
        action_batch = torch.cat(batch.action)
        reward_batch = torch.cat(batch.reward)
        # Predicted Q-values of the taken actions, computed by the learner network
        pred_values = self.learner(state_batch).gather(1, action_batch)
        # Placeholder tensor filled with zeros
        next_state_values = torch.zeros(self.batch_size, device = device)
        # Maximum Q-values of the next states, predicted by the target network
        next_state_values = self.target(next_states).max(1)[0].detach()
        # Bellman target: Q(s, a) = reward(s, a) + gamma * max_a' Q(s', a')
        target_values = next_state_values * self.gamma + reward_batch
        # Compute the loss between predicted values and target values
        loss = self.loss(pred_values, target_values.unsqueeze(1))
        # Reset the gradients before backpropagation
        self.optimizer.zero_grad()
        # Compute gradients to update the parameter values
        loss.backward()
        for param in self.learner.parameters():
            # Clip each gradient to the range [-1, 1]
            param.grad.data.clamp_(-1, 1)
        # Update the parameters
        self.optimizer.step()
        return loss
    def env_step(self, action):
        state, reward, terminated, truncated, info = env.step(action)
        # Assemble the observation vector. This is a small workaround but it suffices
        observation = state['demand']
        observation = np.append(observation, state['current_demand'])
        observation = np.append(observation, state['storage'])
        return torch.FloatTensor([observation]).to(device), torch.FloatTensor([reward]).to(device), terminated, truncated, info
    # Here, episodes = epochs
    def train(self, episodes=10000, smooth=10):
        # Time elapsed since the start
        elapsed_time = time.time() - start_time
        steps = 0
        # Empty list for the graphs
        average_reward = []
        # Number of steps within an episode/epoch
        replacements = 100000
        # Used to loop through all instances of the dataset
        instance = 0
        for episode in range(episodes):
            individual_reward = []
            individual_loss = []
            # Reset the per-episode variables
            c_loss = 0
            c_samples = 0
            rewards = 0
            self.loss_value_old = 0
            # Select the data file to use
            temp_file_name = "inst0001.dat"
            # Read and convert the data from the selected file
            initialise_environment(file_name = temp_file_name)
            # Initialise the environment and get its initial state
            state = env.reset()
            state = Variable(torch.from_numpy(state).float().unsqueeze(0)).to(device)
            # Loop through the predetermined number of steps
            for i in range(replacements):
                # Choose an action for the current state
                action = self.select_action(state)
                # Retrieve the new state, reward, and terminated/truncated flags from the environment
                next_state, reward, terminated, truncated, _ = self.env_step(action.item())
                # End the episode if the environment terminates or truncates
                if terminated == True or truncated == True:
                    next_state = None
                    # Continue with the next episode/epoch
                    break
                # Push the new transition into memory
                self.memory.push(state, action, next_state, reward)
                # Set the new state as the current state
                state = next_state
                # Compute the loss value
                loss = self.train_inner()
                # Add the current reward to the running total
                rewards += reward.detach().item()
                print(steps)
                steps += 1
                c_samples += self.batch_size
                if loss == 0:
                    c_loss += loss
                else:
                    if device == "cpu":
                        c_loss += loss.detach().numpy()
                    else:
                        loss_cpu = loss.cpu()
                        c_loss += loss_cpu.detach().numpy()
                individual_reward.append(reward.detach().item())
                if loss == 0:
                    individual_loss.append(loss)
                else:
                    if device == "cpu":
                        individual_loss.append(loss.detach().numpy())
                    else:
                        loss_cpu = loss.cpu()
                        individual_loss.append(loss_cpu.detach().numpy())
                # Periodically synchronise the target network parameters with the learner network
                if i % self.target_update == 0:
                    self.target.load_state_dict(self.learner.state_dict())
                # Update the elapsed time
                elapsed_time = time.time() - start_time
                # If the elapsed time exceeds the maximum runtime, exit the simulation
                if elapsed_time >= max_runtime:
                    print(elapsed_time)
                    break
            average_reward.append(rewards/replacements)
            if elapsed_time >= max_runtime:
                print(elapsed_time)
                break
        env.close()
    def run(self):
        # Import the test data
        temp_file_name = "inst0001.dat"
        initialise_environment(file_name = temp_file_name)
        rewards = 0
        state = env.reset()
        state = Variable(torch.from_numpy(state).float().unsqueeze(0)).to(device)
        for t in range(50):
            action = self.select_action(state)
            next_state, reward, terminated, truncated, _ = self.env_step(action.data[0].item())
            rewards += reward
            if terminated == True or truncated == True:
                break
def main():
    # Choose to use either CPU or GPU
    device_name = device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("[Device]\tDevice selected: ", device_name)
    # Import the data file to start the simulation with
    initialise_environment(file_name = "inst0001.dat")
    # Create the deep Q-network with the right layer sizes
    dqn = DQN(in_size = env.observation_space.shape[0], out_size = env.action_space.n).to(device)
    # If we're loading a model, load that deep Q-network
    dqn.load_state_dict(torch.load("XXX.pt"))
    # Use MSE as the loss function
    loss = nn.MSELoss()
    # Initialise the RL algorithm
    runner = Runner(dqn, loss, lr = args.lr, gamma = args.gamma, logs = "warehouse/%s" %time.time())
    if "train" in args.runtype:
        # Start the training algorithm
        print("[Train]\tTraining Beginning ...")
        runner.train(args.episodes)
if __name__ == '__main__':
    main()