I’m using Ray (version 1.8.0) to train an agent. The agent controls a unit in a simulation, and the simulation can end in one of three different ways: “UnitADestroyed”, “UnitBDestroyed” or “Timeout”. My aim is to maximize the probability of the outcome “UnitADestroyed”, so I give rewards accordingly. I also log the outcome of each simulation.
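Roughly speaking, the reward is tied to the terminal outcome of the simulation, something along these lines (a simplified sketch; the exact values and any shaping terms are not important for the question):

def outcome_reward(outcome):
    # Simplified, illustrative reward scheme; my actual values may differ.
    if outcome == "UnitADestroyed":
        return 1.0   # the outcome I want to maximize
    if outcome == "UnitBDestroyed":
        return -1.0
    return 0.0       # "Timeout"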
My code for training the agent looks like this:
config = {
    "env": "MyEnv-v0",
    "num_workers": 40,
    "num_cpus_per_worker": 1.0,
    "train_batch_size": 32000,
    "num_sgd_iter": 2,
    "framework": "torch",
    "log_level": "INFO",
    "callbacks": EpisodeEndCallbacks,  # This basically just logs the outcome to stdout
}
trainer = ppo.PPOTrainer(config=config, logger_creator=lambda config: ray.tune.logger.UnifiedLogger(config, logdir))
print("Starting training...")
for i in range(numIterations):
    result = trainer.train()
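For reference, EpisodeEndCallbacks does roughly the following (a sketch, not my exact code; it assumes the environment reports the outcome through the info dict under a hypothetical "outcome" key):

from ray.rllib.agents.callbacks import DefaultCallbacks

class EpisodeEndCallbacks(DefaultCallbacks):
    def on_episode_end(self, *, worker, base_env, policies, episode, env_index, **kwargs):
        # Hypothetical "outcome" key; adapt to however MyEnv reports the result.
        outcome = episode.last_info_for().get("outcome")
        print(f"Episode {episode.episode_id} ended with outcome: {outcome}")
        # Record each outcome as a 0/1 custom metric so TensorBoard shows its
        # running average per training iteration.
        for name in ("UnitADestroyed", "UnitBDestroyed", "Timeout"):
            episode.custom_metrics[name] = 1.0 if outcome == name else 0.0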
This seems to work, and in TensorBoard I can see how the averages of my custom metrics develop over training. These are the per-iteration averages of “UnitADestroyed”, “UnitBDestroyed” and “Timeout” from one of my runs:
[TensorBoard plots of the three custom metric averages]
So as you can see, “UnitADestroyed” increases from almost nothing in the beginning (which I think is realistic with a completely random agent) to about 25-30%. This is all very nice. I have also verified this by looking at a few log files from the simulations (each simulation creates its own log file).
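As a side note, the same averages can also be read directly from the result dict returned by trainer.train(), without going through TensorBoard; RLlib summarizes each custom metric into _mean / _min / _max entries. A minimal sketch, assuming the metric names from the callback sketch above:

for i in range(numIterations):
    result = trainer.train()
    metrics = result.get("custom_metrics", {})
    # e.g. metrics["UnitADestroyed_mean"] is the fraction of episodes in this
    # training iteration that ended with "UnitADestroyed".
    print(
        f"iter {i}: "
        f"A={metrics.get('UnitADestroyed_mean')}, "
        f"B={metrics.get('UnitBDestroyed_mean')}, "
        f"timeout={metrics.get('Timeout_mean')}"
    )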
But the problem begins when I try to test the agent after training. I do this:
for seed in range(100):
    env = MyEnv()
    env.MySeed = seed + 1001  # during training, MySeed is not set, and will therefore get a random value
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = trainer.compute_action(state, explore=False)
        state, reward, done, _ = env.step(action)
        total_reward += reward
This runs the simulation 100 times, each time with a fixed seed (1001 to 1100 in the snippet above) for reproducibility; during training the seeds are random. I had expected to see the “UnitADestroyed” outcome 25-30 times, but I didn’t: it only occurred 3 times on average (I have of course repeated this whole process many times now), “UnitBDestroyed” occurred anywhere between 0 and 40 times, and the rest were “Timeout”.
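For completeness, the tallying of outcomes in the test loop looks roughly like this (again assuming a hypothetical "outcome" key in the info dict; adapt to however MyEnv reports the result):

from collections import Counter

outcome_counts = Counter()
for seed in range(100):
    env = MyEnv()
    env.MySeed = seed + 1001
    state = env.reset()
    done = False
    info = {}
    while not done:
        action = trainer.compute_action(state, explore=False)
        state, reward, done, info = env.step(action)
    # Count the final outcome of each episode.
    outcome_counts[info.get("outcome", "Unknown")] += 1

print(outcome_counts)  # expected ~25-30 "UnitADestroyed", but I only see ~3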
I have also tried saving the weights to files and restoring them in a whole new process, with exactly the same results. And I have tried using explore=True, but that gave even worse results (“Timeout” every time).
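For reference, here is a minimal sketch of restoring a trained agent in a new process via RLlib’s checkpoint API (trainer.save() / trainer.restore()); my own code writes and reads the weight files directly, but the idea is the same:

# After training: write a checkpoint to disk (the directory is just an example).
checkpoint_path = trainer.save("/tmp/myenv_checkpoint")

# In a completely new process: rebuild the trainer with the same config and restore.
new_trainer = ppo.PPOTrainer(config=config)
new_trainer.restore(checkpoint_path)
action = new_trainer.compute_action(state, explore=False)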
So what have I done wrong? It looks very promising during training, so something is clearly being learned, but I am not able to reproduce that success afterwards. I suspect I am doing something wrong when I extract the trained agent, since the results after training seem so random.