Runbook: Alert: CLU <<k8s cluster>> reported NODE <<cluster node>> is cordoned for more than 1h
No outages have been directly related, but the cluster health score is impacted. Some alerts have remained open for about two months.
Automation Plan:
Phase-I: Detect the reason for the cordon (e.g., an active outage or active troubleshooting, by checking open SSR/CRQ/WO tickets)
Phase-II: Check the Docker and kubelet services, collect /var logs for errors, and compare the status of pods on the cordoned node against pods on other nodes
Phase-III: Execute the remediation action - uncordon the node and monitor it for the next hour (a minimal detection/remediation sketch follows below)
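
A minimal sketch of how these phases could be scripted, assuming kubectl is already configured against the target cluster. All function names are illustrative, and the SSR/CRQ/WO lookup from Phase-I is left as a manual gate because that system's API is not described in this runbook.

    # Sketch only: detect cordoned nodes, gather context, then (manually gated)
    # uncordon and monitor. Assumes kubectl access to the affected cluster.
    import json
    import subprocess
    import time

    def kubectl(*args):
        """Run a kubectl command and return its stdout."""
        return subprocess.run(
            ["kubectl", *args], check=True, capture_output=True, text=True
        ).stdout

    def cordoned_nodes():
        """Return nodes marked unschedulable, i.e. cordoned."""
        items = json.loads(kubectl("get", "nodes", "-o", "json"))["items"]
        return [n["metadata"]["name"] for n in items
                if n["spec"].get("unschedulable")]

    def gather_context(node):
        """Phase-I/II: collect evidence before touching the node."""
        # Recent events often record who or what cordoned the node.
        print(kubectl("get", "events", "--all-namespaces",
                      "--field-selector", f"involvedObject.name={node}",
                      "--sort-by=.lastTimestamp"))
        # Node conditions surface kubelet/container-runtime problems
        # (Ready, DiskPressure, PIDPressure, ...) and list the node's pods.
        print(kubectl("describe", "node", node))

    def uncordon_and_monitor(node, minutes=60):
        """Phase-III: uncordon, then watch the Ready condition for an hour."""
        kubectl("uncordon", node)
        deadline = time.time() + minutes * 60
        while time.time() < deadline:
            state = json.loads(kubectl("get", "node", node, "-o", "json"))
            ready = next(c["status"] for c in state["status"]["conditions"]
                         if c["type"] == "Ready")
            if ready != "True" or state["spec"].get("unschedulable"):
                raise RuntimeError(f"{node} regressed after uncordon; escalate")
            time.sleep(300)  # re-check every 5 minutes

    if __name__ == "__main__":
        for node in cordoned_nodes():
            gather_context(node)
            # Enable only after Phase-I confirms no open SSR/CRQ/WO:
            # uncordon_and_monitor(node)

Keeping the uncordon call commented out preserves the manual approval gate until the Phase-I ticket checks are automated.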
Impact: Risk of a total outage when too many nodes are cordoned (example: Mastercard), and risk of extreme delay in determining who cordoned the node and why.
Risk of recurrence: Anytime. These events occur about 3 times per day on average, and it takes up to 2 months to close them.
280 events per month
210 hours per month
Matta, Isaac to update SREMON as suggested by Manoj Patil
04 Feb 2025: SREMON-4291 resolved with the comment: "These look like operational decisions - why the node is cordoned, and the timeline of the cordon."
I want it... - Soon
Visible Customer Benefit - Some
Internal Benefit - Small
Financial Benefit - No
Efforts - Moderate, Multiple Teams
Impact Urgency - Low
Disruption Risk by Not Doing This - Low
Aligned to Strategic Goals - Yes
Hard Need By Date? - NA
Executive Escalation? - NA
Force Highest Priority - NA
We have not started automation efforts for this activity.
@Jason Ferens The top customer team here wants to know why those nodes are cordoned and who did it. This is not an automation request, hence please close it.
@Jason Ferens Please refer to the IP ticket below for this information; you can raise a ticket with the cloud platform team if you need more details.
https://jira.bmc.com/browse/IP-8389