My team manages around 5000 datacenter and cloud servers. On them, HP OMi (a monitoring tool like Dynatrace or Datadog) is configured with a few parameters that detect system health and raise a service ticket when one of them is affected. The parameters are:
- High CPU utilisation.
- Drive space on servers.
- Server reachability (a ticket is raised when a server runs into an error and goes into a deallocated state).
- Service monitoring (a ticket is generated when a service comes to an unexpected stop).
I need to come up with a solution to reduce the count of tickets here, as the number of tickets is getting very high. Has anyone done some work on this? Any and all ideas are appreciated!
Note: Turbonomic is already onboarded.
As a first step, I'm looking to get data on the servers raising the highest volume of tickets and maybe resize them.
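To illustrate the kind of data I mean, here is a minimal Python sketch that ranks servers by ticket count per alert type. It assumes the tickets can be exported to a CSV with `server` and `category` columns; those column names, the function name `top_ticket_sources`, and the sample rows are all made up for illustration and are not taken from HP OMi or Turbonomic.

```python
import csv
import io
from collections import Counter


def top_ticket_sources(csv_file, n=10):
    """Return the n (server, category) pairs that raised the most tickets.

    csv_file is any open text file with 'server' and 'category' columns
    (hypothetical column names, adjust to the real export).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row["server"], row["category"])] += 1
    return counts.most_common(n)


# Made-up sample rows standing in for a real ticket export.
sample = io.StringIO(
    "server,category\n"
    "srv-a,cpu\n"
    "srv-a,cpu\n"
    "srv-a,cpu\n"
    "srv-b,disk\n"
    "srv-b,disk\n"
    "srv-c,service\n"
)

for (server, category), count in top_ticket_sources(sample, n=3):
    print(server, category, count)
```

Grouping by both server and alert type should show whether a noisy server is a resizing candidate (mostly CPU tickets) or needs a different fix (disk cleanup, a flapping service).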