I have a custom environment with a continuous state space and a MultiBinary(3) action space. Given the state, the agent has to choose, via the binary action, whether or not to apply a 3×1 control coming from another algorithm. The dynamics are then propagated to get the next state, and the reward is evaluated based on the state (via a scalar potential field), on the number of ones in the action (as a cost), and on success/constraint violation.
The state space is normalized and the rewards are kept between -10 and 10.
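For context, here is a stripped-down sketch of the environment structure (the external controller, dynamics, potential field and constraint check below are just placeholders, and I'm assuming per-channel gating of the control by the binary action; this is not my actual model, and it uses the Gymnasium API expected by recent SB3 versions):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class SwitchControlEnv(gym.Env):
    """Continuous (normalized) state, MultiBinary(3) action that gates an
    externally computed 3x1 control. The dynamics, potential field and
    constraint check are placeholders, not the real model."""

    def __init__(self, max_steps=200):
        super().__init__()
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32)
        self.action_space = spaces.MultiBinary(3)
        self.max_steps = max_steps
        self.state = None
        self._t = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        self.state = self.np_random.uniform(-1.0, 1.0, size=6).astype(np.float32)
        return self.state, {}

    def step(self, action):
        self._t += 1
        u = self._external_control(self.state) * action        # gate each control channel
        self.state = self._propagate(self.state, u)
        reward = -self._potential(self.state)                   # potential-field term
        reward -= 0.1 * float(np.sum(action))                   # cost per activated channel
        success = self._success(self.state)
        violated = self._constraint_violated(self.state)
        if success:
            reward += 10.0
        elif violated:
            reward -= 10.0
        reward = float(np.clip(reward, -10.0, 10.0))            # keep rewards in [-10, 10]
        terminated = success or violated
        truncated = self._t >= self.max_steps
        return self.state, reward, terminated, truncated, {}

    # --- placeholders standing in for the real components ----------------
    def _external_control(self, s):
        return -0.5 * s[:3]                                     # stand-in external controller

    def _propagate(self, s, u):
        nxt = s.copy()
        nxt[:3] = np.clip(nxt[:3] + 0.1 * u, -1.0, 1.0)
        return nxt.astype(np.float32)

    def _potential(self, s):
        return float(np.linalg.norm(s[:3]))                     # stand-in scalar potential field

    def _success(self, s):
        return float(np.linalg.norm(s[:3])) < 0.05

    def _constraint_violated(self, s):
        return False                                            # real constraints omitted
```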
After training with both PPO and A2C from Stable Baselines3 (using various sets of hyperparameters known to work in papers on other problems, and training for 300k to 1M steps), during testing the actor always takes the same action when predict is run deterministically, because the action probabilities barely change across states (they change by something like 1e-4). Moreover, when running an episode and feeding random actions to step() at each iteration, or sampling actions stochastically from the policy, the reward is better than with the trained deterministic policy.
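This is roughly how I inspect the per-channel Bernoulli probabilities and compare deterministic, stochastic and random rollouts (training is shortened here, and `rollout_return` is just a helper for illustration; it assumes the `SwitchControlEnv` sketch above):

```python
import torch as th
from stable_baselines3 import PPO

env = SwitchControlEnv()
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=50_000)            # shortened here; I trained for 300k-1M steps

# Per-channel Bernoulli probabilities of the trained policy at a few states
obs, _ = env.reset()
for _ in range(5):
    obs_tensor, _ = model.policy.obs_to_tensor(obs)
    with th.no_grad():
        probs = model.policy.get_distribution(obs_tensor).distribution.probs
    print("action probabilities:", probs.cpu().numpy())   # barely changes (~1e-4) across states
    action, _ = model.predict(obs, deterministic=True)
    obs, _, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()


def rollout_return(env, act_fn, n_episodes=20):
    """Average undiscounted return for a given action-selection function."""
    total = 0.0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, r, terminated, truncated, _ = env.step(act_fn(obs))
            total += r
            done = terminated or truncated
    return total / n_episodes


print("random:       ", rollout_return(env, lambda o: env.action_space.sample()))
print("stochastic:   ", rollout_return(env, lambda o: model.predict(o, deterministic=False)[0]))
print("deterministic:", rollout_return(env, lambda o: model.predict(o, deterministic=True)[0]))
```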
I’m new to RL and I wanted to know whether there is an issue with the reward shaping, or what else could be the cause (maybe the hyperparameters? Maybe adding more useful quantities to the observation?). The main thing I don’t understand is why the actor converges to the same action probabilities in every state.
Thanks in advance for the help.
I tried different algorithms (PPO, A2C), sets of hyperparameters, and reward shapings, but nothing fixed the problem with the action probabilities. Even when running .learn() for the minimum possible number of timesteps, the same problem appears.
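The training runs look roughly like this (the hyperparameter values below are only illustrative, not the exact sets from the papers):

```python
from stable_baselines3 import PPO, A2C

env = SwitchControlEnv()

# Hyperparameter sets borrowed from papers on other problems; the values here
# are only placeholders, not the exact ones I used.
configs = [
    (PPO, dict(learning_rate=3e-4, n_steps=2048, batch_size=64, ent_coef=0.01)),
    (A2C, dict(learning_rate=7e-4, n_steps=5, ent_coef=0.01)),
]

for algo, kwargs in configs:
    model = algo("MlpPolicy", env, verbose=1, **kwargs)
    model.learn(total_timesteps=300_000)       # also tried up to 1M, and very short runs
```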