Quick Definition
Non-Maskable Interrupt (NMI) is a hardware-level interrupt that cannot be ignored by the CPU and signals critical conditions like hardware faults or watchdog timeouts. Analogy: an emergency alarm that always breaks through whatever you’re doing. Formal: a high-priority, non-discardable interrupt vector delivered to the processor for immediate handling.
What is NMI?
What it is / what it is NOT
NMI is a hardware-initiated interrupt line intended for critical, high-severity events that require immediate attention from the CPU and operating system. It is NOT a regular software interrupt, nor is it generally used for routine telemetry or graceful shutdowns.
Key properties and constraints
- Always prioritized above maskable interrupts.
- Delivered by hardware sources (chipset, watchdog timers, PCI devices, thermal units).
- Handler execution is constrained by the CPU and OS context; reentrancy and safety are concerns.
- Can indicate unrecoverable conditions (e.g., memory corruption) or survivable diagnostics (e.g., watchdog timeout).
- Behavior can vary by platform, hypervisor, and BIOS/UEFI firmware.
Where it fits in modern cloud/SRE workflows
- Troubleshooting root-cause hardware incidents on hosts (bare metal or VMs with passthrough).
- Detecting stuck CPUs and kernel deadlocks via NMI watchdogs.
- Integrating with incident response for host-level severity incidents and correlated observability.
- Postmortems and capacity planning when hardware error rates rise.
A text-only “diagram description” readers can visualize
- Physical hardware sources (watchdog chip, thermal sensor, PCI device) send NMI signal to CPU(s).
- CPU interrupts current context and vectors to the NMI handler in the kernel/firmware.
- NMI handler attempts minimal diagnostics, writes to special logs, and may trigger platform recovery or reboot.
- Observability agents collect NMI logs and map to infrastructure monitoring and incident pipelines.
NMI in one sentence
NMI is a hardware-triggered, non-maskable interrupt delivered to the CPU to signal critical conditions that require immediate OS/firmware attention.
NMI vs related terms
| ID | Term | How it differs from NMI | Common confusion |
|---|---|---|---|
| T1 | IRQ | IRQs are maskable and used for normal device interrupts | Confused as same urgency |
| T2 | SMI | SMI runs in firmware/SMM, not OS context | Both are low-level interrupts |
| T3 | Watchdog timer | Watchdog can generate NMI as reaction | Not all watchdog events are NMIs |
| T4 | Machine Check | Machine checks are CPU error reports, sometimes NMI-triggered | Often conflated with NMI events |
| T5 | Kernel panic | Panic is OS reaction; NMI may cause panic | People think panic equals NMI |
| T6 | Exception | Exceptions are synchronous faults raised by instruction execution | NMIs are asynchronous and not tied to the current instruction |
| T7 | Hypervisor trap | Hypervisor traps are virtualization events | Might mask or reroute NMIs |
| T8 | ACPI event | ACPI signals power/thermal; can cause NMI on some boards | Not typically an NMI source |
Row Details (only if any cell says “See details below”)
None.
Why does NMI matter?
- Business impact (revenue, trust, risk)
  - Host-level failures can cause multi-tenant VM interruptions or data-plane outages. Uninvestigated NMIs can hide systemic hardware issues, leading to repeated downtime, SLA breaches, and customer churn.
- Engineering impact (incident reduction, velocity)
  - Early detection of stuck CPUs or hardware errors via NMI reduces mean time to detect (MTTD) and helps avoid cascading faults. Clear NMI telemetry improves troubleshooting velocity and reduces firefighting toil.
- SRE framing (SLIs/SLOs/error budgets/toil/on-call)
  - NMIs map to host-health SLIs (e.g., host-healthy rate). Elevated NMI rates should consume error budget for infrastructure SLOs. On-call playbooks must include NMI triage to prevent noisy paging and to escalate only genuine host-level failures.
Realistic “what breaks in production” examples
1. Kernel deadlock on a host causing all containers to freeze; watchdog NMI triggers kernel handler and may force reboot.
2. Repeated Machine Check Exceptions signaled via NMI indicating failing DIMM, causing VM crashes and storage errors.
3. Thermal sensor NMI on a blade begins throttling or immediate shutdown, impacting throughput and auto-scaling decisions.
4. Faulty NIC triggers PCI-generated NMI, breaking network traffic for affected hosts in a cluster.
5. Firmware bug causes spurious NMIs across a hardware fleet, generating alert storms and masking true incidents.
Where is NMI used?
| ID | Layer/Area | How NMI appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge / Physical host | Hardware watchdog or thermal NMI | NMI logs, kernel oops, dmesg | syslog, journalctl |
| L2 | Network / NIC | PCI error NMI | NIC reset traces, driver logs | ethtool, driver logs |
| L3 | Hypervisor | NMI forwarded or trapped | Hypervisor logs, VM crash reports | QEMU/KVM logs, Xen logs |
| L4 | Kubernetes node | Node freezes, kubelet not responding | Node heartbeat, node-exporter metrics | Prometheus, node-exporter |
| L5 | Serverless / PaaS | Platform-level host failures | Platform incident events | Cloud provider status, platform logs |
| L6 | Observability | Alert correlator for host NMIs | Event stores, incident graphs | PagerDuty, Opsgenie |
| L7 | BIOS/UEFI firmware | SMI/NMI interplay visible during boot | Firmware logs, platform error logs | ipmitool, ACPI logs |
Row Details
- L1: Hardware NMIs often originate from embedded controllers and require vendor firmware logs and IPMI/OEM logs for full context.
- L3: Hypervisors may choose to swallow, forward, or emulate NMIs; behavior depends on platform and configuration.
- L5: Managed PaaS abstracts NMIs; operators usually see only higher-level platform health events and must rely on provider telemetry.
When should you use NMI?
- When it’s necessary
  - Use NMIs for detecting critical, unrecoverable hardware errors, stuck CPUs, and platform watchdog conditions where immediate attention is required.
- When it’s optional
  - Use an NMI watchdog in addition to regular telemetry if host-level hangs are rare but costly; enable selectively on high-value or high-risk instances.
- When NOT to use / overuse it
  - Do NOT use NMI as a general-purpose monitoring signal or for frequent, low-priority events. Overuse leads to noisy alerting and may mask true emergencies.
- Decision checklist
  - If a host can hang and cause multi-tenant impact AND other diagnostics are insufficient -> enable NMI/watchdog.
  - If application-level retries or circuit breakers can recover without host intervention -> prefer application-level SLIs instead.
  - If you run fully managed compute where NMI is opaque -> rely on provider platform alerts, not local NMI tinkering.
- Maturity ladder: Beginner -> Intermediate -> Advanced
- Beginner: Enable basic NMI/watchdog; collect kernel oops and report to central logging.
- Intermediate: Correlate NMI events with metrics and alerts; automate host cordon and replacement.
- Advanced: Fleet-wide NMI analytics, predictive failure models, automated firmware rollbacks, and integration with capacity and incident pipelines.
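The decision checklist above can be encoded as a small helper. This is an illustrative sketch; the input flags and return strings are invented names, not a standard API:

```python
# Sketch: the NMI-watchdog decision checklist as a helper function.
# All parameter names and return values are illustrative.
def recommend_nmi_watchdog(can_hang_multi_tenant: bool,
                           other_diagnostics_sufficient: bool,
                           app_level_recovery: bool,
                           fully_managed_compute: bool) -> str:
    if fully_managed_compute:
        # NMI is opaque on managed compute; provider alerts are the signal.
        return "rely-on-provider-alerts"
    if app_level_recovery:
        # Retries/circuit breakers recover without host intervention.
        return "prefer-app-level-slis"
    if can_hang_multi_tenant and not other_diagnostics_sufficient:
        return "enable-nmi-watchdog"
    return "no-change"

print(recommend_nmi_watchdog(True, False, False, False))  # enable-nmi-watchdog
```

The managed-compute check runs first because host access gates every other option.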
How does NMI work?
Components and workflow
1. Hardware source (watchdog, thermal controller, PCI device) detects a critical condition.
2. Hardware asserts the NMI line to CPU.
3. The CPU bypasses normal interrupt masking and vectors to the NMI handler (in kernel or firmware).
4. Handler performs minimal diagnostics (stack trace, CPU registers) and writes to protected log regions.
5. Depending on severity, handler may attempt recovery, notify management controllers, or initiate a controlled reboot.
6. Logged data is collected by host agents and sent to observability/incident pipelines.
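As a concrete illustration of step 6, a host agent might read per-CPU NMI delivery counts from /proc/interrupts on Linux. This is a minimal sketch; the `NMI:` row layout shown is typical of x86 systems and can differ by architecture and kernel version:

```python
# Sketch: read per-CPU NMI counts from the "NMI:" row of
# /proc/interrupts (Linux). Layout assumed from typical x86 output.
def parse_nmi_counts(text: str) -> list[int]:
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for f in fields[1:]:
                if f.isdigit():
                    counts.append(int(f))
                else:
                    break  # reached the trailing description text
            return counts
    return []

sample = """           CPU0       CPU1
  0:         33          0   IO-APIC   2-edge      timer
NMI:          5          7   Non-maskable interrupts
LOC:    1234567    2345678   Local timer interrupts
"""
print(parse_nmi_counts(sample))  # per-CPU NMI counts
```

In production the agent would read the real /proc/interrupts and ship deltas, not absolutes, since the counters are cumulative since boot.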
Data flow and lifecycle
- Detection -> NMI delivered -> Handler executes -> Diagnostic capture -> Persist to ring buffer/firmware logs -> Collection agent scrapes logs -> Ingest into central observability -> Correlation and alerting -> Triage and remediation.
Edge cases and failure modes
- NMI handler deadlocks or corrupts memory causing cascade.
- NMIs occur during low-level firmware initialization and produce incomplete diagnostics.
- Hypervisor masks or reroutes NMIs causing guest-visible symptoms without host-level signals.
- Multiple NMIs in close succession trigger log overflow and partial data capture.
Typical architecture patterns for NMI
- Local NMI watchdog: Enable kernel NMI watchdog on hosts; used for VMs and bare metal to detect stuck CPUs. Use when debugging sporadic host hangs.
- Hardware watchdog + management controller: Dedicated BMC triggers NMI and then power-cycles host if OS is unresponsive. Use for resilient fleets with out-of-band management.
- Hypervisor-forwarded NMI: Hypervisor forwards host NMIs to guests for debugging; used in nested virtualization or when guest needs hardware-level visibility.
- Fleet telemetry aggregation: Agents collect NMI events and ship to centralized observability with correlation to kernel oops and machine-check logs. Use for fleet-wide failure analysis.
- Managed cloud abstraction: Rely on provider-supplied host health events and remediation; use when you cannot access host firmware or BMC.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Missing NMI logs | No diagnostics after NMI | Log overflow or handler crash | Preserve ring buffer and flush early | Empty dmesg after event |
| F2 | Spurious NMIs | Frequent transient NMIs | Faulty hardware or firmware bug | Firmware update or isolate hardware | Repeated NMI timestamps |
| F3 | NMI handler deadlock | Host becomes unresponsive | Handler reentrancy or corruption | Minimal handler steps; safe fallback | No new metrics post-NMI |
| F4 | Hypervisor masking | Guest shows symptoms but no host NMI | Hypervisor swallowed NMI | Reconfigure hypervisor forwarding | Host vs guest discrepancy |
| F5 | Partial diagnostics | Incomplete stack traces | Interrupt at early boot or corrupted memory | Out-of-band logs via BMC | Truncated logs in kernel |
| F6 | Alert storms | Many pages for same root cause | Lack of dedupe/correlation | Deduplication and grouping | High alert frequency |
Row Details (only if needed)
- F1: Ensure persistent storage of ring buffer and out-of-band log copies; use crashkernel reservation where supported.
- F2: Collect hardware vendor logs and correlate with firmware releases; reproduce with hardware swap.
- F3: Harden NMI handler to write minimal state atomically and call management controller early.
- F4: Check hypervisor settings for NMI passthrough; test with host-level diagnostics.
- F5: Use IPMI or BMC to fetch system event logs that persist across reboots.
- F6: Implement event correlation rules in alerting system and use rate-limits for paging.
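For F1, a quick way to catch a missing crashkernel reservation is to check the kernel command line. A minimal sketch; in production you would read `/proc/cmdline` on the host:

```python
# Sketch: verify a crashkernel reservation is present on the kernel
# command line (mitigation F1), so kdump can capture NMI/panic state.
def has_crashkernel(cmdline: str) -> bool:
    return any(tok.startswith("crashkernel=") for tok in cmdline.split())

# Illustrative command lines, not from a real host:
assert has_crashkernel("BOOT_IMAGE=/vmlinuz root=/dev/sda1 crashkernel=512M ro")
assert not has_crashkernel("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro")
```

A fleet-wide audit of this check makes a good pre-production gate.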
Key Concepts, Keywords & Terminology for NMI
Term — 1–2 line definition — why it matters — common pitfall
- Non-Maskable Interrupt — Hardware interrupt that cannot be masked — Signals critical conditions — Mistaking for normal IRQ
- NMI watchdog — Kernel feature to detect stuck CPUs — Helps detect hangs — False positives on heavy GC workloads
- Machine Check Exception (MCE) — CPU-detected hardware error — Points to memory/CPU faults — Ignoring vendor MCE logs
- SMI — System Management Interrupt running in firmware — Can mask OS handling — Confused with NMI behavior
- BMC — Baseboard Management Controller — Out-of-band logs and power control — Not all providers expose BMC to tenants
- IPMI — Interface for managing BMC — Useful for fetching hardware logs — Security risk if not secured
- dmesg — Kernel ring buffer output — Primary local diagnostic source — Overwrites quickly without persistence
- oops — Kernel stack trace after fault — Essential for debugging — Misinterpreting stack levels
- crashkernel — Kernel reservation for kdump — Enables capture during panic — Not always configured by default
- kdump — Kernel crash dump mechanism — Stores memory images for postmortem — Large disk needs and complexity
- PCIe AER — PCI Express Advanced Error Reporting — Can cause NMIs for device errors — Misconfigured drivers suppress reports
- Thermal trip — Overheat event causing emergency interrupt — Prevents hardware damage — Lack of telemetry for gradual heating
- Watchdog timer — Timer that triggers recovery if system appears stuck — Can generate NMI — Overzealous timeouts lead to reboots
- CPU hang — CPU stops making progress — Detectable by NMI watchdog — Hard to reproduce in dev environments
- Firmware / BIOS — Low-level platform code — Controls NMI routing — Vendor-specific behavior varies
- Hypervisor passthrough — Forwarding hardware interrupts to guests — Enables guest visibility — Can hide host-level problems
- Nested virtualization — VM inside VM — NMIs may be emulated — Complexity in debug trails
- Kernel panic — OS-level unrecoverable state — May be triggered by NMI actions — Panics should be captured via kdump
- Ring buffer — Circular log buffer in kernel — Holds recent messages — Overwrites with bursty logs
- ECC memory — Error-correcting RAM — Reduces MCEs — Not all platforms use ECC
- DIMM failure — Memory module hardware fault — Often appears as MCEs — Replacement required, not software fix
- Watchdog reset — Forced reboot from watchdog — Helps recover but may lose in-flight work — Need graceful drain before hard reset
- OOB management — Out-of-band control via BMC/IPMI — Critical when OS is unresponsive — Limited in fully managed clouds
- Firmware log — Persistent log stored by firmware/BMC — Survives reboots — Requires vendor tools to interpret
- Event correlation — Linking NMI with other telemetry — Essential for accurate triage — Lack of correlation causes noisy paging
- Panic kernel dump — Capture on panic for postmortem — Enables deep analysis — Large storage and retrieval complexity
- Telemetry agent — Host agent that ships logs/metrics — Collects NMI indicators — Can be unavailable during host hang
- Kernel oops decode — Interpreting oops stack and registers — Critical for root cause — Often requires symbolized kernels
- Symbolized stack — Human-readable stack using kernel symbols — Needed for interpretation — Requires matching build IDs
- Firmware rollback — Reverting to older firmware to mitigate regressions — Useful for spurious NMIs — Needs controlled rollout
- Safe-mode boot — Boot mode with minimal drivers — Helps reproduce early-boot NMIs — Not always feasible in production
- Panic_on_oops — Kernel setting to panic on oops — Ensures kdump capture — Can reduce system availability if misused
- Watchdog threshold — Timeout value for watchdog — Balances detection vs false positives — Too short causes unnecessary reboots
- Node cordon — Mark node unschedulable in orchestration — Prevents placing workloads on unstable node — Needs automation to avoid human delay
- Auto-replace — Automated decommission and replacement after NMI — Reduces mean time to repair — Risk of replacing healthy nodes if false positive
- JVM safepoint — Java pause indicating thread stops — Can look like CPU hang — JVM-induced pauses can trigger NMI misdiagnosis
- Interrupt storm — Flood of interrupts from device — Can starve CPU and cause watchdogs to fire — Mitigate with driver fixes
- Vendor support ticketing — Escalation for hardware NMIs — Required for warranty parts — Slow if undiagnosed internally
- Fleet analytics — Aggregate NMI events across hosts — Useful for root-cause across fleet — Requires centralized ingestion and retention
- Recovery policy — What to do when NMI occurs (reboot, replace, escalate) — Ensures consistent response — Poor policy causes inconsistent handling
How to Measure NMI (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | NMI count per host | Frequency of NMIs on a host | Count NMI events in logs per host per day | 0–0.01/day | Sparse events may indicate underreporting |
| M2 | Hosts with NMI rate | Fraction of fleet with NMIs | Unique hosts with NMI / total hosts | <0.5% weekly | Bursts skew weekly metrics |
| M3 | Time-to-detect NMI | Delay from NMI to central ingest | Timestamp compare log vs ingest | <60s | Network/agent outage increases delay |
| M4 | NMI-induced reboots | Reboots caused by NMI | Correlate reboot reason in BMC/logs | <0.1% monthly | Mislabelled reboot reasons |
| M5 | Mean time to remediation (MTTR) | Time to replace or fix host after NMI | From alert to resolved incident | <2 hours | Human-led processes vary |
| M6 | NMI correlation rate | % NMIs with root-cause identified | Count resolved root cause / total NMIs | >80% | Complex hardware issues may be unresolved |
| M7 | Alert noise ratio | Ratio of NMI alerts that were actionable | Actionable alerts / total alerts | >70% actionable | Poor dedupe inflates noise |
| M8 | NMI stack capture rate | % of NMIs with captured stack traces | Successful stack dumps / NMIs | >90% | Early boot NMIs may miss captures |
| M9 | Recurrent NMI hosts | Hosts with >1 NMI in window | Hosts with repeated NMIs in 30 days | 0 for single-tenant hosts | Intermittent hardware faults cause repeats |
| M10 | Cost per NMI incident | Operational cost to remediate | Sum of staff hours + host replacement | Varies / depends | Hard to attribute costs precisely |
Row Details (only if needed)
- M1: Ensure agent reliably writes NMI marker and preserves event timestamps.
- M3: Use agent heartbeat to validate ingest pipeline latency.
- M4: Use BMC and cloud provider metadata to identify reboot cause.
- M8: Configure crashkernel and kdump; validate in staging.
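M1 and M2 can be computed from collected NMI events. A minimal sketch, assuming events arrive as (host, timestamp) pairs; the event shape is illustrative, not a standard format:

```python
# Sketch: compute M1 (NMI count per host) and M2 (fraction of fleet
# with any NMI) from a list of (host, timestamp) NMI events.
from collections import Counter

def nmi_slis(events, fleet_size):
    per_host = Counter(host for host, _ts in events)      # M1 per host
    affected_fraction = len(per_host) / fleet_size        # M2
    return per_host, affected_fraction

# Illustrative events: host-a fired twice, host-b once.
events = [("host-a", 1), ("host-a", 2), ("host-b", 3)]
per_host, frac = nmi_slis(events, fleet_size=200)
print(per_host["host-a"], frac)  # 2 0.01
```

Windowing (per day for M1, per week for M2) would be applied by filtering the timestamps before counting.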
Best tools to measure NMI
Tool — Prometheus + exporters
- What it measures for NMI: Metrics like host NMI counts, node-exporter custom metrics, uptime and reboots.
- Best-fit environment: Kubernetes and bare-metal with Prometheus stacks.
- Setup outline:
- Expose custom metric from host agent on NMI events.
- Scrape via Prometheus node-exporter or custom exporter.
- Record rules to compute rates and SLOs.
- Use alerts to feed PagerDuty.
- Strengths:
- Flexible query language (PromQL).
- Good for SRE-run monitoring.
- Limitations:
- Requires reliable agent during host hangs.
- Not a full log solution.
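The "expose custom metric" step in the setup outline could render the counter in the Prometheus text exposition format. A dependency-free sketch; the metric name `nmi_event_total` is an assumed convention from the outline, and a real agent would typically use the official prometheus_client library instead:

```python
# Sketch: render an NMI event counter in the Prometheus text
# exposition format, with no external dependencies.
def render_metric(name: str, labels: dict, value: float) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_metric("nmi_event_total", {"host": "host-a"}, 3)
print(line)  # nmi_event_total{host="host-a"} 3
```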
Tool — Fluentd/Fluent Bit + Central Logs
- What it measures for NMI: Collects dmesg, kernel oops, and BMC logs.
- Best-fit environment: Fleet with centralized log storage.
- Setup outline:
- Configure to tail kernel logs and IPMI outputs.
- Tag and enrich events with host metadata.
- Send to central store with retention policy.
- Strengths:
- Rich context for postmortem.
- Pipeline for long-term analysis.
- Limitations:
- May miss logs if host fully unresponsive.
Tool — BMC/IPMI tooling
- What it measures for NMI: Persistent firmware logs and system event logs.
- Best-fit environment: Bare metal with out-of-band management.
- Setup outline:
- Inventory BMC credentials and secure access.
- Poll SEL (System Event Log) at interval.
- Correlate with host logs on events.
- Strengths:
- Survives OS reboots.
- Authoritative hardware source.
- Limitations:
- Access restricted in managed clouds.
- Security risks if exposed.
Tool — Cloud provider host health events
- What it measures for NMI: Provider-detected host hardware faults and maintenance events.
- Best-fit environment: Managed VMs and managed Kubernetes.
- Setup outline:
- Subscribe to provider health events.
- Map provider event codes to internal incident categories.
- Automate replacement where provider signals host degradation.
- Strengths:
- No host agent dependency.
- Limitations:
- Abstracted details; limited diagnostic depth.
Tool — Fleet analytics / SIEM
- What it measures for NMI: Correlation across hosts, trend detection.
- Best-fit environment: Large fleets with many hosts.
- Setup outline:
- Ingest NMI events and enrich with hardware metadata.
- Run clustering and anomaly detection.
- Surface recurrent failure families for remediation.
- Strengths:
- Detects systemic issues early.
- Limitations:
- Requires sustained investment in data pipelines.
Recommended dashboards & alerts for NMI
- Executive dashboard
  - Panels: Fleet NMI rate (trend), Hosts affected (count), Open NMI incidents, Business impact summary.
  - Why: High-level view for stakeholders to see hardware health and operational risk.
- On-call dashboard
  - Panels: Live NMI event stream, Hosts with recent NMIs, Reboot reasons, Recent kdump captures, Correlated application outages.
  - Why: Focused triage surface for responders.
- Debug dashboard
  - Panels: Per-host dmesg view, MCE logs, CPU usage pre-NMI, Interrupt counts, Network and disk metrics, BMC SEL entries.
  - Why: Deep diagnostic view for engineers performing root-cause analysis.
Alerting guidance:
- What should page vs ticket
  - Page: Single or recurrent NMI on a production controller or critical host impacting availability.
  - Ticket: Isolated NMI on a non-critical dev host, or an informational BMC event with no service impact.
- Burn-rate guidance
  - If NMI-driven host failures consume >25% of the infrastructure error budget in a week, escalate to a reliability strike team.
- Noise reduction tactics (dedupe, grouping, suppression)
  - Deduplicate by root-cause fingerprint and host group.
  - Group alerts by cluster and time window (e.g., suppress repeat pages for the same host within 10 minutes).
  - Use suppression windows during platform maintenance.
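The dedupe and suppression tactics above can be sketched as a fingerprint-plus-window check; this is illustrative logic, not a PagerDuty or Opsgenie API:

```python
# Sketch: suppress repeat pages for the same root-cause fingerprint
# within a window (the 10-minute repeat-page rule above = 600 s).
def should_page(alert, recent, window_s=600):
    """alert: (fingerprint, ts). recent: dict fingerprint -> last paged ts."""
    fp, ts = alert
    last = recent.get(fp)
    if last is not None and ts - last < window_s:
        return False  # suppressed duplicate
    recent[fp] = ts
    return True

recent = {}
print(should_page(("host-a:watchdog", 0), recent))    # True: first page
print(should_page(("host-a:watchdog", 300), recent))  # False: within window
print(should_page(("host-a:watchdog", 900), recent))  # True: window elapsed
```

The fingerprint would typically combine host group, NMI source (watchdog, MCE, PCIe AER), and firmware version.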
Implementation Guide (Step-by-step)
1) Prerequisites
– Inventory of hardware and virtualization mappings.
– Access to BMC/IPMI or provider host health events.
– Central logging and metrics pipeline.
– Defined recovery policy (cordon/replace vs repair).
2) Instrumentation plan
– Enable kernel NMI watchdog where appropriate.
– Configure crashkernel and kdump for dumps.
– Add host-agent hooks to mark NMI events as metrics and log them.
3) Data collection
– Collect dmesg, kernel oops, MCE logs, and BMC SEL entries.
– Persist to central logs with retention and indexing.
4) SLO design
– Define host-health SLOs incorporating NMI frequency and recovery.
– Set error budgets for host-induced incidents.
5) Dashboards
– Build executive, on-call, and debug dashboards.
– Provide drilldowns from fleet to single-host views.
6) Alerts & routing
– Set thresholds for paging vs ticketing.
– Route to hardware on-call, cloud ops, and relevant service owners.
7) Runbooks & automation
– Document step-by-step triage for NMI events.
– Automate cordon, drain, and replacement actions where safe.
8) Validation (load/chaos/game days)
– Run controlled tests that simulate stuck CPUs and verify detection & remediation.
– Include firmware-update scenarios and BMC failure modes.
9) Continuous improvement
– Monthly review of NMI trends, firmware rollouts, and vendor replacements.
– Implement RCA feedback into host provisioning workflows.
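The host-agent hook from step 2 can be sketched as a scan of kernel log lines for NMI markers. The marker strings below are typical of Linux dmesg output but vary by kernel version, so treat them as assumptions to validate against your fleet:

```python
# Sketch: turn kernel log lines containing NMI markers into
# structured events for the metrics/log pipeline (step 2/3).
NMI_MARKERS = ("NMI:", "Uhhuh. NMI received", "nmi_watchdog")

def extract_nmi_events(log_lines, host):
    return [{"host": host, "line": ln}
            for ln in log_lines
            if any(m in ln for m in NMI_MARKERS)]

logs = [
    "[12.3] usb 1-1: new high-speed USB device",
    "[99.1] Uhhuh. NMI received for unknown reason 2d on CPU 0.",
]
events = extract_nmi_events(logs, "host-a")
print(len(events))  # 1
```

Each extracted event would then increment the exported counter and be shipped with host metadata attached.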
Checklists:
- Pre-production checklist
  - Verify crashkernel/kdump enabled.
  - Test agent log shipping under a simulated hang.
  - Validate BMC access and SEL retrieval.
  - Define recovery policy for hosts in the test environment.
- Production readiness checklist
  - Baseline host NMI metrics and thresholds.
  - Alerting rules with dedupe and grouping.
  - Runbook available and indexed in the incident portal.
  - Automation for safe cordon/replacement validated.
- Incident checklist specific to NMI
  - Capture timestamp and host identifiers.
  - Retrieve BMC SEL and kernel logs.
  - Correlate with other telemetry (network, storage).
  - Decide repair vs replace based on recurrence and vendor guidance.
  - Record mitigation steps and update fleet actions.
Use Cases of NMI
- Bare-metal compute stability
– Context: High-density hosting on bare metal.
– Problem: Occasional host hangs impacting tenant VMs.
– Why NMI helps: Detects CPU stalls and forces diagnostic capture.
– What to measure: NMI count, reboots, kdump capture rate.
– Typical tools: BMC, Prometheus, Fluentd.
- Kubernetes node reliability
– Context: Node-level kernel issues causing pods to freeze.
– Problem: kubelet not responding, disrupts stateful services.
– Why NMI helps: Detects node-level hangs to trigger cordon/replace.
– What to measure: Node NMI events, node-exporter health.
– Typical tools: Prometheus, kube-controller-manager automation.
- High-performance computing clusters
– Context: Long-running compute workloads sensitive to host faults.
– Problem: Silent CPU faults leading to silent corruption.
– Why NMI helps: Machine Check Exceptions via NMI reveal hardware corruption.
– What to measure: MCE rates, NMI per job.
– Typical tools: MCE logs, fleet analytics.
- Network device faults on servers
– Context: NIC causing packet loss intermittently.
– Problem: Device driver issues trigger NMIs for serious PCI errors.
– Why NMI helps: Immediate diagnostics and possible device isolation.
– What to measure: PCIe AER incidents, NIC resets.
– Typical tools: ethtool, kernel logs.
- Firmware regression detection
– Context: Fleet firmware upgrades performed regularly.
– Problem: New firmware triggers spurious NMIs on a class of boards.
– Why NMI helps: Early detection to rollback firmware.
– What to measure: NMI spike per firmware version.
– Typical tools: Fleet analytics, firmware inventory.
- Managed cloud incident correlation
– Context: Provider host degradation impacts VMs.
– Problem: Provider signals host health but tenant lacks low-level logs.
– Why NMI helps: When available via host health events, informs remediation.
– What to measure: Provider host events vs tenant NMI symptoms.
– Typical tools: Provider event streams, internal incident systems.
- Storage controller failures
– Context: Local disks attached to hosts experiencing errors.
– Problem: Storage IO lockups affecting VMs.
– Why NMI helps: PCIe/device NMIs indicate severe controller faults.
– What to measure: Device NMIs, IO latency before event.
– Typical tools: Smartctl, device driver logs.
- Safety-critical applications (telecom/finance)
– Context: Applications requiring durable correctness.
– Problem: Silent hardware errors can corrupt transactions.
– Why NMI helps: Ensures hardware error visibility and forced remediation.
– What to measure: ECC and MCE logs correlated with NMIs.
– Typical tools: ECC telemetry, MCE parsers.
- On-prem colo with limited redundancy
– Context: Single-host services without cloud redundancy.
– Problem: Host hangs cause prolonged service outages.
– Why NMI helps: Automate detection and BMC-initiated recovery to reduce downtime.
– What to measure: Time-to-reboot and service availability post-NMI.
– Typical tools: IPMI, automated runbooks.
- Regression testing for OS updates
– Context: Rolling kernel updates can affect NMI handling.
– Problem: Kernel changes cause different NMI behavior.
– Why NMI helps: Detect regressions in test labs before rollout.
– What to measure: NMI occurrence on canary hosts.
– Typical tools: CI labs, kdump validations.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes node hang due to kernel deadlock (Kubernetes scenario)
Context: Production Kubernetes cluster with stateful services experiencing pod stalls.
Goal: Detect node hangs quickly and automate replacement with minimal app impact.
Why NMI matters here: NMI watchdog can detect kernel stalls that kubelet and liveness probes cannot.
Architecture / workflow: Host-level NMI watchdog -> Kernel handler captures oops and marks node as unhealthy -> Host agent sends event to central logs -> Controller automation cordons node and drains workloads -> Replace host.
Step-by-step implementation:
- Enable kernel NMI watchdog on node images.
- Configure crashkernel and kdump to store dumps to network-backed storage.
- Add agent metric for “nmi_event_total” and ship to Prometheus.
- Create an alert rule: any increase in nmi_event_total on a prod node -> page the hardware team.
- Automate cordon+drain when NMI occurs via operator with safety checks.
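One way to evaluate the alert rule from the steps above is to compare successive scrapes of `nmi_event_total` per node; a minimal sketch, where the sample shape is an assumption rather than a Prometheus API:

```python
# Sketch: page when the nmi_event_total counter increases on any
# prod node between two scrapes.
def nodes_to_page(prev, curr):
    """prev/curr: dict node -> nmi_event_total counter value."""
    return sorted(n for n, v in curr.items() if v > prev.get(n, 0))

prev = {"node-1": 0, "node-2": 0}
curr = {"node-1": 0, "node-2": 1}
print(nodes_to_page(prev, curr))  # ['node-2']
```

Comparing deltas rather than testing `> 0` avoids re-paging on a counter that stays nonzero after the first event.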
What to measure: NMI events, time node cordoned, pod eviction success rate.
Tools to use and why: Prometheus for metrics, Fluent Bit for logs, kube-controller-manager for automation.
Common pitfalls: Watchdog false positives during CPU-intensive pods; tune thresholds to the workload.
Validation: Run simulated CPU hang tests in canary; verify cordon and replacement.
Outcome: Faster detection and reduced downtime for affected services.
Scenario #2 — Serverless function host hardware fault (Serverless/managed-PaaS scenario)
Context: Managed serverless provider where tenant functions occasionally cold-fail without clear app errors.
Goal: Correlate host-level faults with function failures and reduce function error rates.
Why NMI matters here: Underlying host hardware faults manifest as function invocation failures; NMIs indicate host-level issues needing provider action.
Architecture / workflow: Provider collects host events -> NMIs associated with host IDs -> Functions on same host flagged and invocations redirected -> Provider performs host replacement.
Step-by-step implementation:
- Ensure provider exposes host health events or internal BMC logs.
- Correlate function invocation failures with host event timestamps.
- Blacklist affected hosts in routing pool until resolved.
What to measure: Host NMI events, function error rate per host, blacklisted-host count.
Tools to use and why: Provider internal telemetry, central event bus.
Common pitfalls: Tenant cannot access host-level logs in public clouds; rely on provider channels.
Validation: Inject a simulated kernel hang in a non-prod region to verify routing changes.
Outcome: Reduced invocation failures and faster provider remediation.
Scenario #3 — Postmortem after recurrent NMIs (Incident-response/postmortem scenario)
Context: Fleet shows recurring NMIs on a set of blades after firmware rollout.
Goal: Root-cause analysis and rollback plan.
Why NMI matters here: Pattern of NMIs indicates regression introduced by firmware.
Architecture / workflow: Collect NMI events, firmware version mapping, BMC SEL, and service impact metrics -> RCA -> Firmware rollback plan and automated replacement.
Step-by-step implementation:
- Aggregate NMIs across hosts and group by firmware version.
- Validate with vendor logs and affected hardware SKU.
- Rollback firmware on canary subset, observe reduction in NMIs, then do controlled rollback.
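The aggregation step above can be sketched as a simple group-by over an inventory mapping; the host-to-firmware mapping and hostnames are illustrative:

```python
# Sketch: group fleet NMI events by firmware version to spot a
# regression (Scenario #3).
from collections import Counter

def nmi_by_firmware(events, inventory):
    """events: list of host ids that fired an NMI.
    inventory: host id -> firmware version."""
    return Counter(inventory[h] for h in events if h in inventory)

inventory = {"h1": "fw-2.1", "h2": "fw-2.1", "h3": "fw-2.0"}
events = ["h1", "h2", "h1", "h3"]
print(nmi_by_firmware(events, inventory))  # fw-2.1 dominates
```

A skew toward one firmware version, normalized by how many hosts run it, is the signal that justifies the canary rollback.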
What to measure: NMI rate pre/post rollback, service error budget consumption.
Tools to use and why: Fleet analytics, vendor firmware tooling.
Common pitfalls: Incomplete metadata linking firmware to host; ensure accurate inventory.
Validation: Canary rollback and monitoring for recurrence.
Outcome: Restored fleet stability and updated firmware rollout policy.
Scenario #4 — Cost vs performance tradeoff: aggressive watchdogs (Cost/performance trade-off scenario)
Context: High-performance trading application prioritizing latency; ops must balance detection with performance.
Goal: Tune watchdog sensitivity to avoid performance regressions while detecting real hangs.
Why NMI matters here: Aggressive watchdogs may cause reboots during short GC or high-CPU bursts; too lax and hangs go undetected.
Architecture / workflow: Parameterize watchdog timeout per instance class; observe impact on latency and NMI rate.
Step-by-step implementation:
- Baseline latency and CPU profiles.
- Enable NMI watchdog with conservative timeout on canary nodes.
- Adjust timeouts and correlate NMI rate vs latency impact.
- Choose per-instance class policy (e.g., lower sensitivity on ultra-low-latency instances).
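Step 3's correlation can start from observed pause durations: estimate how often a candidate timeout would misclassify a benign pause (GC, CPU burst) as a hang. A minimal sketch with illustrative numbers:

```python
# Sketch: estimate watchdog false-positive rate for a candidate
# timeout from observed benign pause durations; any pause at or
# above the timeout would be misread as a hang.
def false_positive_rate(pause_durations_s, timeout_s):
    if not pause_durations_s:
        return 0.0
    fps = sum(1 for p in pause_durations_s if p >= timeout_s)
    return fps / len(pause_durations_s)

pauses = [0.2, 0.5, 1.1, 3.0, 9.5]  # illustrative worst-case pauses
print(false_positive_rate(pauses, timeout_s=5.0))  # 0.2
```

Sweeping `timeout_s` over this data gives the detection-vs-false-reboot curve used to pick a per-instance-class policy.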
What to measure: Latency SLOs, NMI events, false reboot rate.
Tools to use and why: Prometheus, performance test harness.
Common pitfalls: Over-generalizing timeouts across different workloads; tune per class.
Validation: Load tests and targeted chaos testing.
Outcome: Balanced detection while preserving latency goals.
Scenario #5 — Nested virtualization debug (Kubernetes/hypervisor hybrid)
Context: VM running Kubernetes, guest experiences fatal hangs; unclear if guest or host triggered.
Goal: Determine whether NMIs originate in host or guest and fix accordingly.
Why NMI matters here: NMIs may be swallowed or emulated by the hypervisor, leading to ambiguity about their origin.
Architecture / workflow: Instrument both host and guest with NMI counters, enable hypervisor passthrough settings, compare logs.
Step-by-step implementation:
- Enable guest-visible NMI metrics.
- Correlate guest NMI events with host BMC SEL entries.
- If host-origin, escalate to hardware replacement; if guest-origin, patch guest kernel.
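The correlation step can be sketched as a simple time-window match between guest NMI timestamps and host SEL entries. The timestamps and window are illustrative; real logs would be parsed from guest dmesg and the BMC SEL export:

```python
from datetime import datetime, timedelta

def nmi_match_rate(guest_nmis, host_sel, window_s=5.0):
    """Fraction of guest NMI timestamps that have a host SEL entry
    within window_s seconds; a high rate suggests host-origin NMIs."""
    if not guest_nmis:
        return 0.0
    matched = sum(
        1 for g in guest_nmis
        if any(abs((g - h).total_seconds()) <= window_s for h in host_sel)
    )
    return matched / len(guest_nmis)

# Illustrative timestamps (clock sync between host and BMC is assumed).
t0 = datetime(2026, 1, 15, 3, 0, 0)
guest = [t0, t0 + timedelta(minutes=10)]
host = [t0 + timedelta(seconds=2)]  # only the first guest event matches
rate = nmi_match_rate(guest, host)
```

Note that this only works if host, guest, and BMC clocks are synchronized (see the NTP/PTP pitfall below).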
What to measure: Host vs guest NMI event correlation rate.
Tools to use and why: QEMU/KVM logs, host BMC logs, guest kernel logs.
Common pitfalls: Hypervisors mask NMIs by default; forwarding must be explicitly enabled for accurate diagnosis.
Validation: Reproduce on nested testbed.
Outcome: Correct ownership and remediation path identified.
Common Mistakes, Anti-patterns, and Troubleshooting
Each entry below follows the pattern Symptom -> Root cause -> Fix.
- Symptom: No NMI logs after host crash -> Root cause: crashkernel not configured -> Fix: Reserve crashkernel and enable kdump.
- Symptom: Frequent NMI alerts -> Root cause: firmware regression -> Fix: Rollback firmware and coordinate vendor patch.
- Symptom: NMI event but no service impact -> Root cause: Spurious device error -> Fix: Isolate device and monitor; vendor test.
- Symptom: Guest shows hang, no host NMI -> Root cause: Hypervisor masking -> Fix: Enable NMI passthrough or increase host diagnostics.
- Symptom: Alert storm during maintenance -> Root cause: No suppression rules -> Fix: Implement maintenance window suppression and dedupe.
- Symptom: Missing kernel stacks -> Root cause: Handler did too much and crashed -> Fix: Simplify handler and ensure atomic minimal logging.
- Symptom: False positives in high-CPU workloads -> Root cause: Watchdog timeout too short -> Fix: Tune thresholds per workload.
- Symptom: Unable to access BMC -> Root cause: Network/cred issues -> Fix: Secure and document BMC access and rotate creds.
- Symptom: Recurrent host replacement without fix -> Root cause: Not capturing full diagnostics -> Fix: Ensure persistent firmware logs and kdump.
- Symptom: Slow detection of NMI -> Root cause: Agent ingest lag -> Fix: Prioritize NMI pipeline and heartbeat checks.
- Symptom: Lost correlation between NMI and app outages -> Root cause: Poor timestamp sync -> Fix: Ensure NTP/PTP across hosts and BMCs.
- Symptom: High MTTR -> Root cause: No automation for cordon/replace -> Fix: Implement safe automation runbooks.
- Symptom: Overloaded alerting system -> Root cause: Every NMI pages SRE -> Fix: Tier alerts and route to hardware team first.
- Symptom: Partial SEL entries -> Root cause: BMC buffer overflow -> Fix: Increase SEL polling frequency.
- Symptom: Confusing SMI vs NMI behavior -> Root cause: Lack of documentation -> Fix: Create vendor-specific runbooks clarifying differences.
- Symptom: Incomplete kdump captures -> Root cause: Insufficient reserved memory -> Fix: Increase crashkernel reservation.
- Symptom: Missing symbolized stacks -> Root cause: Kernel versions mismatched -> Fix: Keep kernel build and symbol store synchronized.
- Symptom: Noise during firmware maintenance -> Root cause: Not excluding canary hosts -> Fix: Use canary cohort and staged rollout.
- Symptom: Data loss after watchdog reboot -> Root cause: Hard reset without graceful drain -> Fix: Graceful drain automation before reboot.
- Symptom: Observability agent fails during hang -> Root cause: Agent runs in same kernel context -> Fix: Use out-of-band collection or pre-persisted logs.
- Symptom: Misattributed billing events -> Root cause: Reboots lead to VM migration counts -> Fix: Tag incidents and correlate with billing pipeline.
- Symptom: Excessive human toil -> Root cause: Manual runbook steps -> Fix: Automate routine remediation tasks.
- Symptom: Vendor denies hardware fault -> Root cause: Insufficient diagnostic evidence -> Fix: Collect MCE, SEL, and crash dumps before replacement.
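The first mistake above (no NMI logs because crashkernel was never configured) is easy to audit programmatically. A sketch that parses a kernel command line for the reservation; on a live host you would read the string from /proc/cmdline:

```python
import re

def crashkernel_reserved(cmdline: str):
    """Return the crashkernel= value from a kernel command line, or None.
    A missing value means kdump cannot capture a dump after a fatal NMI."""
    m = re.search(r"\bcrashkernel=(\S+)", cmdline)
    return m.group(1) if m else None

# Illustrative command line; real hosts expose this via /proc/cmdline.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro crashkernel=512M quiet"
reserved = crashkernel_reserved(sample)
```

Running this check fleet-wide (and alerting on hosts returning None) closes the gap before the next crash, rather than after.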
Observability pitfalls:
- PITFALL: Relying solely on dmesg -> CAUSE: dmesg ephemeral -> FIX: Persist to central logs and capture frequently.
- PITFALL: Not syncing clocks -> CAUSE: Misaligned timestamps -> FIX: NTP/PTP across host and BMC.
- PITFALL: Missing correlation with higher-level metrics -> CAUSE: Siloed telemetry -> FIX: Integrate event bus and correlate traces/metrics/logs.
- PITFALL: Agent unavailable during hang -> CAUSE: Agent process requires kernel context -> FIX: Out-of-band logging via BMC.
- PITFALL: Indexing every raw dump -> CAUSE: Storage blowout -> FIX: Prioritize metadata and sample full dumps.
Best Practices & Operating Model
- Ownership and on-call
  - The hardware/BMC team owns NMI handling policies; service owners own application impact. Create a separate hardware on-call rotation for urgent host-level NMIs.
- Runbooks vs playbooks
  - Runbooks: step-by-step remediation for a specific host NMI.
  - Playbooks: higher-level escalation flows involving multiple teams and vendor engagement.
- Safe deployments (canary/rollback)
  - Firmware/kernel changes must roll through small canaries with NMI and kdump monitoring. Implement automated rollback triggers on NMI spikes.
- Toil reduction and automation
  - Automate cordon/drain and host replacement. Use auto-replace policies for hosts with repeat NMIs, and escalate to the vendor only after automated triage completes.
- Security basics
  - Secure BMC/IPMI access with credential rotation and network isolation. Limit who can read SEL logs. Audit all actions against hardware.
- Weekly/monthly routines
  - Weekly: review NMI events for spikes and outstanding kdumps.
  - Monthly: analyze fleet trends and firmware version prevalence, and identify hotspots.
- What to review in NMI-related postmortems
  - Confirm diagnostic artifacts were preserved, validate that remediation executed, check automation coverage, and revise the firmware rollout policy if needed.
Tooling & Integration Map for NMI
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Metrics | Collects NMI counters from hosts | Prometheus, exporters | Use custom metric for NMIs |
| I2 | Logs | Aggregates dmesg and kernel oops | Fluent Bit, ELK | Ensure retention for postmortems |
| I3 | BMC tooling | Fetches SEL and firmware logs | ipmitool, vendor tools | Out-of-band source for diagnostics |
| I4 | Crash capture | Captures kdump images | kdump, crash utilities | Requires reserved memory |
| I5 | Fleet analytics | Correlates events across hosts | SIEM, data warehouse | Good for systemic trends |
| I6 | Alerting | Pages and routes incidents | PagerDuty, Opsgenie | Deduplication important |
| I7 | Orchestration | Automates cordon/drain/replace | Kubernetes controllers, Terraform | Critical for low MTTR |
| I8 | Hypervisor | Manages interrupt forwarding | KVM/QEMU, Xen | Hypervisor settings affect NMI visibility |
| I9 | Firmware mgmt | Deploys and rolls back firmware | Vendor update tools | Track firmware per host |
| I10 | Provider health | Receives provider host events | Cloud event bus | Abstracted details; combine with tenant logs |
Row details
- I3: Ensure BMC credentials management integrates with secrets store and audit logs.
- I4: Validate crashkernel sizing in CI images to ensure kdump success.
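For I1, a minimal sketch of how a custom exporter might derive an NMI counter on Linux by parsing the NMI row of /proc/interrupts (the sample text below is an illustrative excerpt, not real host output):

```python
def nmi_total(interrupts_text: str) -> int:
    """Sum per-CPU NMI counts from the NMI: row of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Per-CPU columns are integers; trailing description is not.
            return sum(int(f) for f in fields[1:] if f.isdigit())
    return 0

# Illustrative /proc/interrupts excerpt for a two-CPU host.
sample = """           CPU0       CPU1
  0:         33          0   IO-APIC   2-edge      timer
NMI:         12          7   Non-maskable interrupts
"""
total = nmi_total(sample)
```

An exporter would read the real file each scrape and publish the sum as a monotonically increasing counter, letting Prometheus compute NMI rates per host.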
Frequently Asked Questions (FAQs)
What exactly triggers an NMI?
Hardware sources like watchdog timers, certain PCI errors, thermal events, and CPU machine checks can trigger an NMI.
Can NMIs be disabled?
Varies / depends. Some platforms and hypervisors allow configuration; disabling removes a critical safety mechanism and is generally discouraged.
Are NMIs visible inside VMs?
Sometimes. Hypervisor decisions determine whether NMIs are forwarded to guests, emulated, or swallowed.
How do NMIs differ from SMIs?
SMIs execute in firmware SMM and are generally invisible to the OS; NMIs are delivered to the CPU and handled by the OS/kernel.
How do I capture useful data when an NMI happens?
Enable crashkernel/kdump, persist kernel ring buffer to a central log, and collect BMC SEL entries.
Can NMIs be the cause of data corruption?
Not directly; NMIs are signaling mechanisms. However, they often indicate hardware faults (such as MCEs) that can lead to data corruption if left undetected.
How to avoid false positives from watchdog NMIs?
Tune watchdog thresholds per workload and test under realistic load profiles.
Do cloud providers surface NMIs to tenants?
Varies / depends. Public cloud providers typically surface aggregate host health events rather than raw NMIs.
How to troubleshoot repeated NMIs on a host?
Collect kdump and SEL, compare firmware and hardware SKU, isolate with hardware swap, and engage vendor support.
Is there a standard SLI for NMIs?
No universal standard exists; a common SLI is the number of hosts experiencing NMIs per period (or NMI rate per host). Define it based on your fleet's risk tolerance.
Should NMIs page engineers?
Only if the event impacts production availability or is a repeated/recurrent failure; otherwise log and ticket.
How to test NMI handling?
Simulate hung CPUs in a controlled environment or use vendor-provided diagnostic tools.
Will NMIs always produce stack traces?
Not always. Early-boot NMIs or severe memory corruption can prevent full stack captures.
How to secure BMC/IPMI access?
Network isolation, credential rotation, strong auth, and restricted access roles.
What to do if NMIs increase after a firmware update?
Halt rollout, roll back canary firmware, gather diagnostics, and escalate to vendor.
Can software cause an NMI?
Rarely directly (kernels can send NMI IPIs between CPUs, e.g., for profiling or backtrace collection), but more commonly software behavior, such as disabling interrupts for too long, causes watchdog timeouts that trigger NMIs.
How long should we retain NMI-related logs?
Sufficient to perform RCA and trend analysis; typically months for fleet analytics and at least 90 days for incident traces.
Do containers see NMIs?
Containers rely on kernel/host; containers do not directly receive NMIs but will suffer host-level impacts.
Conclusion
Non-Maskable Interrupts are a critical, low-level signal for hardware and platform reliability. For 2026 cloud-native operations, treating NMIs as part of your observability and incident response fabric is essential—especially for bare-metal fleets, high-availability services, and safety-critical workloads. Proper instrumentation, automation, and vendor collaboration turn NMIs from opaque emergencies into actionable diagnostics.
Next 7 days plan:
- Day 1: Inventory hosts and confirm crashkernel/kdump settings on a canary cohort.
- Day 2: Enable minimal NMI telemetry and ship to central logs.
- Day 3: Create Prometheus metrics and dashboards for NMI events.
- Day 4: Draft runbook for NMI triage and test on non-prod hosts.
- Day 5–7: Run a controlled NMI simulation in staging and validate automation for cordon/replace.
Appendix — NMI Keyword Cluster (SEO)
- Primary keywords
- Non-Maskable Interrupt NMI
- NMI watchdog
- kernel NMI
- NMI dashboard
- NMI monitoring
- Secondary keywords
- machine check exception MCE
- kernel oops NMI
- crashkernel kdump NMI
- BMC SEL NMI
- IPMI NMI logs
- firmware NMI regression
- NMI alerting
- hypervisor NMI passthrough
- NMI troubleshooting
- NMI runbook
- Long-tail questions
- What causes a non-maskable interrupt in servers
- How to capture kdump after an NMI
- How to correlate NMI with application outages
- How to configure NMI watchdog in Linux
- Why am I getting frequent NMIs after firmware update
- How do cloud providers expose host NMIs
- How to test NMI detection and recovery
- How to secure BMC when collecting NMI logs
- What is the difference between SMI and NMI
- Do VMs receive NMIs from the host
- How to reduce NMI alert noise
- What to include in an NMI incident postmortem
- How to configure Prometheus for NMI events
- How to automate replacement on NMI detection
- How to interpret machine check exception logs
- Related terminology
- kernel panic
- IRQ vs NMI
- System Event Log SEL
- out-of-band management
- crash dump
- symbolized stack
- ECC memory
- DIMM failure
- PCIe AER
- watchdog timer
- BIOS/UEFI
- SMM
- kexec
- node cordon
- auto-replace policy
- fleet analytics
- incident burn rate
- observability pipeline
- deduplication rules
- canary firmware rollout
- proactive replacement
- panic_on_oops
- kernel symbol store
- NTP/PTP sync
- event correlation
- out-of-band logs
- vendor escalation
- hardware on-call
- platform health event