BMC Helix SaaSOps Ideas Portal
Status Planned
Created by Guest
Created on Feb 17, 2025

Cordoned NODE Automation response

  • Runbook: Alert: CLU <<k8s cluster>> reported NODE <<cluster node>> is cordoned for more than 1h

  • No outages are directly related, but the health score is impacted. Some alerts have stayed open for about two months.

  • Automation Plan:

    • Phase-I: Detect the reason for the cordon (e.g., an active outage or active troubleshooting) by checking open SSR/CRQ/WO tickets.

    • Phase-II: Check the Docker and kubelet services, collect the var logs and scan them for errors, and compare the status of pods on the cordoned node with that of other pods.

    • Phase-III: Execute the remediation action: uncordon the node and monitor it for the next hour.
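
The three phases above could be sketched roughly as follows. This is a minimal illustration, not an agreed design: it assumes the node list comes from `kubectl get nodes -o json` (where a cordoned node has `spec.unschedulable` set to true), and the `NodeFindings` fields, the `plan_action` function, and the action names are all hypothetical placeholders for whatever ticket lookup and health checks the automation ends up using.

```python
from dataclasses import dataclass


def cordoned_nodes(node_list: dict) -> list[str]:
    """Return names of cordoned nodes from parsed `kubectl get nodes -o json` output.

    A cordoned node is marked with spec.unschedulable == true.
    """
    return [
        item["metadata"]["name"]
        for item in node_list.get("items", [])
        if item.get("spec", {}).get("unschedulable", False)
    ]


@dataclass
class NodeFindings:
    """Hypothetical summary of the Phase-I and Phase-II checks for one node."""
    has_open_ticket: bool    # Phase-I: an open SSR/CRQ/WO references this node
    services_healthy: bool   # Phase-II: Docker and kubelet service checks passed
    log_errors_found: bool   # Phase-II: errors were found in the collected logs


def plan_action(findings: NodeFindings) -> str:
    """Map the findings to one of three illustrative actions."""
    if findings.has_open_ticket:
        # Someone is actively working on the node; do not touch it.
        return "leave-cordoned"
    if not findings.services_healthy or findings.log_errors_found:
        # The node looks unhealthy; uncordoning could schedule pods onto a bad node.
        return "escalate"
    # Phase-III: safe to run `kubectl uncordon <node>` and watch for the next hour.
    return "uncordon-and-monitor"
```

Keeping the decision in a pure function separates it from the side effects (ticket queries, service checks, the actual `kubectl uncordon` call), so the policy can be reviewed and tested before anything is run against a cluster.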


Impact: Risk of a total outage when too many nodes are cordoned (example: Mastercard), and risk of extreme delay in finding out who did what to cause a node to be cordoned.

Risk of recurrence: Anytime. These events happen about 3 times per day on average and take up to 2 months to close.

280 events per month

210 hours per month


Matta, Isaac to update SREMON as suggested by Manoj Patil

04 Feb 2025: SREMON-4291 was resolved with the comment: "These look like operational decisions: why the node is cordoned and the timeline of the cordon."

  • Manoj Patil
    Feb 28, 2025
    1. Visible Customer Benefit - Some

    2. Internal Benefit - Small

    3. Financial Benefit - No

    4. Efforts - Moderate, Multiple Teams

    5. Impact Urgency - Low

    6. Disruption Risk by Not Doing This - Low

    7. Aligned to Strategic Goals - Yes

    8. Hard Need By Date? - NA

    9. Executive Escalation? - NA

    10. Force Highest Priority - NA

    We have not started automation efforts for this activity.

  • Manoj Patil
    Feb 26, 2025

    @Jason Ferens Here, the Top customer team wants to know why those nodes are cordoned and who did it. This is not an automation request, so please close it.

    @Jason Ferens Please refer to the IP ticket below for this information; you can raise a ticket with the cloud platform team if you need more.

    https://jira.bmc.com/browse/IP-8389