NPD recognises and reports some built-in NodeConditions that can only be cleared by restarting the node or restarting NPD.
Some of our taskhung conditions are not serious enough to justify restarting the entire node, and having to restart NPD on that node is annoying.
I have tried creating a temporary Event for taskhung by way of a custom-plugin which overrides the built-in NodeCondition, so that I can handle it in a nicer way via our monitoring platform.
This custom-plugin uses the following string matches:
ALERT ON: 'task .* blocked for more than 120 seconds.'
CLEAR: echo 'task .* has resumed normal operation.'
Is what I am trying to achieve possible?
Here is the ConfigMap excerpt for the custom plugin I am trying to create to override the built-in NodeCondition for TaskHung/KernelDeadlock:
data:
  custom-plugin-monitor.json: |
    {
      "plugin": "custom",
      "logPath": "/dev/kmsg",
      "lookback": "5m",
      "bufferSize": 10,
      "source": "kernel-monitor",
      "conditions": [
        {
          "type": "KernelDeadlock",
          "reason": "TaskHung",
          "message": "Task hung detected"
        }
      ],
      "rules": [
        {
          "type": "temporary",
          "condition": {
            "type": "KernelDeadlock",
            "status": "true",
            "reason": "TaskHung",
            "message": "Task hung detected"
          },
          "pattern": "task .* blocked for more than .* seconds."
        },
        {
          "type": "temporary",
          "condition": {
            "type": "KernelDeadlock",
            "status": "false",
            "reason": "TaskHungCleared",
            "message": "Task hung cleared"
          },
          "pattern": "task .* has resumed normal operation."
        }
      ]
    }
  other-configurations.json: |
    {
      // Your other NPD configurations here
    }
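
To make the intended behaviour concrete, here is a minimal Python sketch of how I expect the two "pattern" strings from the rules above to classify log lines. It says nothing about whether NPD accepts this config. It assumes the patterns behave the same under Python's re module as under NPD's own regex matching, which I have not verified. The classify helper and the sample log lines are made up for illustration.

```python
import re

# Patterns copied from the two "rules" entries in the ConfigMap excerpt.
ALERT_PATTERN = r"task .* blocked for more than .* seconds."
CLEAR_PATTERN = r"task .* has resumed normal operation."


def classify(line: str) -> str:
    """Return "alert", "clear", or "ignore" for a single log line.

    Hypothetical helper: uses an unanchored search, which is an
    assumption about how the patterns are applied to each line.
    """
    if re.search(ALERT_PATTERN, line):
        return "alert"
    if re.search(CLEAR_PATTERN, line):
        return "clear"
    return "ignore"


# Illustrative sample lines only; not captured from a real node.
SAMPLE_LINES = [
    "INFO: task kworker/0:1:1234 blocked for more than 120 seconds.",
    "task kworker/0:1:1234 has resumed normal operation.",
    "usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(f"{classify(line):6} | {line}")
```

The first sample line should trigger the alert rule, the second the clear rule, and the third neither. Note that the trailing "." in both patterns is unescaped, so it matches any character rather than only a literal full stop.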