Runbook: Alert: CLU <<k8s cluster>> reported NODE <<cluster node>> is cordoned for more than 1h
No outages have been directly related, but the cluster health score is impacted. Some alerts have remained open for about two months.
Automation Plan:
Phase-I: Detect the reason for the cordon (e.g., an active outage or active troubleshooting, by checking open SSR/CRQ/WO tickets)
Phase-II: Check the Docker and kubelet services, collect /var logs for errors, and compare the status of pods on the cordoned node against pods on other nodes
Phase-III: Execute the remediation action - uncordon the node and monitor it for the next hour (a minimal detection/remediation sketch follows below)
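
A minimal sketch of how these phases could be scripted, assuming kubectl is already configured against the target cluster. All function names are illustrative, and the SSR/CRQ/WO lookup from Phase-I is left as a manual gate because that system's API is not described in this runbook.

    # Sketch only: detect cordoned nodes, gather context, then (manually gated)
    # uncordon and monitor. Assumes kubectl access to the affected cluster.
    import json
    import subprocess
    import time

    def kubectl(*args):
        """Run a kubectl command and return its stdout."""
        return subprocess.run(
            ["kubectl", *args], check=True, capture_output=True, text=True
        ).stdout

    def cordoned_nodes():
        """Return nodes marked unschedulable, i.e. cordoned."""
        items = json.loads(kubectl("get", "nodes", "-o", "json"))["items"]
        return [n["metadata"]["name"] for n in items
                if n["spec"].get("unschedulable")]

    def gather_context(node):
        """Phase-I/II: collect evidence before touching the node."""
        # Recent events often record who or what cordoned the node.
        print(kubectl("get", "events", "--all-namespaces",
                      "--field-selector", f"involvedObject.name={node}",
                      "--sort-by=.lastTimestamp"))
        # Node conditions surface kubelet/container-runtime problems
        # (Ready, DiskPressure, PIDPressure, ...) and list the node's pods.
        print(kubectl("describe", "node", node))

    def uncordon_and_monitor(node, minutes=60):
        """Phase-III: uncordon, then watch the Ready condition for an hour."""
        kubectl("uncordon", node)
        deadline = time.time() + minutes * 60
        while time.time() < deadline:
            state = json.loads(kubectl("get", "node", node, "-o", "json"))
            ready = next(c["status"] for c in state["status"]["conditions"]
                         if c["type"] == "Ready")
            if ready != "True" or state["spec"].get("unschedulable"):
                raise RuntimeError(f"{node} regressed after uncordon; escalate")
            time.sleep(300)  # re-check every 5 minutes

    if __name__ == "__main__":
        for node in cordoned_nodes():
            gather_context(node)
            # Enable only after Phase-I confirms no open SSR/CRQ/WO:
            # uncordon_and_monitor(node)

Keeping the uncordon call commented out preserves the manual approval gate until the Phase-I ticket checks are automated.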
Impact: Risk of a total outage when too many nodes are cordoned (example: Mastercard), and risk of extreme delay in determining who cordoned the node and why.
Risk of recurrence: Anytime. These events occur about 3 times per day on average, and it takes up to 2 months to close them.
280 events per month
210 hours per month
Matta, Isaac to update SREMON as suggested by Manoj Patil
04 Feb 2025: SREMON-4291 resolved with the comment: "These look like operational decisions - why the node is cordoned, and the timeline of the cordon."
I want it... - Soon
Visible Customer Benefit - Some
Internal Benefit - Small
Financial Benefit - No
Efforts - Moderate, Multiple Teams
Impact Urgency - Low
Disruption Risk by Not Doing This - Low
Aligned to Strategic Goals - Yes
Hard Need By Date? - NA
Executive Escalation? - NA
Force Highest Priority - NA
We have not started automation efforts for this activity.
@Jason Ferens The top customer team here wants to know why those nodes are cordoned and who did it. This is not an automation request, hence please close it.
@Jason Ferens Please refer to the IP ticket below for this information; you can raise a ticket with the cloud platform team if you need more details.
https://jira.bmc.com/browse/IP-8389