Quick Definition
Non-Maskable Interrupt (NMI) is a hardware-level interrupt that cannot be ignored by the CPU and signals critical conditions like hardware faults or watchdog timeouts. Analogy: an emergency alarm that always breaks through whatever you’re doing. Formal: a high-priority, non-discardable interrupt vector delivered to the processor for immediate handling.
What is NMI?
What it is / what it is NOT
NMI is a hardware-initiated interrupt line intended for critical, high-severity events that require immediate attention from the CPU and operating system. It is NOT a regular software interrupt, nor is it generally used for routine telemetry or graceful shutdowns.
Key properties and constraints
- Always prioritized above maskable interrupts.
- Delivered by hardware sources (chipset, watchdog timers, PCI devices, thermal units).
- Handler execution is constrained by the CPU and OS context; reentrancy and safety are concerns.
- Can indicate unrecoverable conditions (e.g., memory corruption) or survivable diagnostics (e.g., watchdog timeout).
- Behavior can vary by platform, hypervisor, and BIOS/UEFI firmware.
Where it fits in modern cloud/SRE workflows
- Troubleshooting root-cause hardware incidents on hosts (bare metal or VMs with passthrough).
- Detecting stuck CPUs and kernel deadlocks via NMI watchdogs.
- Integrating with incident response for host-level severity incidents and correlated observability.
- Postmortems and capacity planning when hardware error rates rise.
A text-only “diagram description” readers can visualize
- Physical hardware sources (watchdog chip, thermal sensor, PCI device) send NMI signal to CPU(s).
- CPU interrupts current context and vectors to the NMI handler in the kernel/firmware.
- NMI handler attempts minimal diagnostics, writes to special logs, and may trigger platform recovery or reboot.
- Observability agents collect NMI logs and map to infrastructure monitoring and incident pipelines.
NMI in one sentence
NMI is a hardware-triggered, non-maskable interrupt delivered to the CPU to signal critical conditions that require immediate OS/firmware attention.
NMI vs related terms
| ID | Term | How it differs from NMI | Common confusion |
|---|---|---|---|
| T1 | IRQ | IRQs are maskable and used for normal device interrupts | Confused as same urgency |
| T2 | SMI | SMI runs in firmware/SMM, not OS context | Both are low-level interrupts |
| T3 | Watchdog timer | Watchdog can generate NMI as reaction | Not all watchdog events are NMIs |
| T4 | Machine Check | Machine checks are CPU error reports, sometimes NMI-triggered | Often conflated with NMI events |
| T5 | Kernel panic | Panic is OS reaction; NMI may cause panic | People think panic equals NMI |
| T6 | Exception | Exceptions are synchronous faults raised by instruction execution | NMIs are asynchronous and not tied to the current instruction |
| T7 | Hypervisor trap | Hypervisor traps are virtualization events | Might mask or reroute NMIs |
| T8 | ACPI event | ACPI signals power/thermal; can cause NMI on some boards | Not typically an NMI source |
Row Details (only if any cell says “See details below”)
None.
Why does NMI matter?
- Business impact (revenue, trust, risk)
  - Host-level failures can cause multi-tenant VM interruptions or data-plane outages. Uninvestigated NMIs can hide systemic hardware issues, leading to repeated downtime, SLA breaches, and customer churn.
- Engineering impact (incident reduction, velocity)
  - Early detection of stuck CPUs or hardware errors via NMI reduces mean time to detect (MTTD) and helps avoid cascading faults. Clear NMI telemetry improves troubleshooting velocity and reduces firefighting toil.
- SRE framing (SLIs/SLOs/error budgets/toil/on-call)
  - NMIs map to host-health SLIs (e.g., host-healthy rate). Elevated NMI rates should consume error budget for infrastructure SLOs. On-call playbooks must include NMI triage to prevent noisy paging and to escalate only genuine host-level failures.
Realistic “what breaks in production” examples
1. Kernel deadlock on a host causing all containers to freeze; watchdog NMI triggers kernel handler and may force reboot.
2. Repeated Machine Check Exceptions signaled via NMI indicating failing DIMM, causing VM crashes and storage errors.
3. Thermal sensor NMI on a blade begins throttling or immediate shutdown, impacting throughput and auto-scaling decisions.
4. Faulty NIC triggers PCI-generated NMI, breaking network traffic for affected hosts in a cluster.
5. Firmware bug causes spurious NMIs across a hardware fleet, generating alert storms and masking true incidents.
Where is NMI used?
| ID | Layer/Area | How NMI appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge / Physical host | Hardware watchdog or thermal NMI | NMI logs, kernel oops, dmesg | syslog, journalctl |
| L2 | Network / NIC | PCI error NMI | NIC reset traces, driver logs | ethtool, driver logs |
| L3 | Hypervisor | NMI forwarded or trapped | Hypervisor logs, VM crash reports | QEMU/KVM logs, Xen logs |
| L4 | Kubernetes node | Node freezes, kubelet not responding | Node heartbeat, node-exporter metrics | Prometheus, node-exporter |
| L5 | Serverless / PaaS | Platform-level host failures | Platform incident events | Cloud provider status, platform logs |
| L6 | Observability | Alert correlator for host NMIs | Event stores, incident graphs | PagerDuty, Opsgenie |
| L7 | BIOS/UEFI firmware | SMI/NMI interplay visible during boot | Firmware logs, platform error logs | ipmitool, ACPI logs |
Row Details
- L1: Hardware NMIs often originate from embedded controllers and require vendor firmware logs and IPMI/OEM logs for full context.
- L3: Hypervisors may choose to swallow, forward, or emulate NMIs; behavior depends on platform and configuration.
- L5: Managed PaaS abstracts NMIs; operators usually see only higher-level platform health events and must rely on provider telemetry.
When should you use NMI?
- When it’s necessary
  - Use NMIs for detecting critical, unrecoverable hardware errors, stuck CPUs, and platform watchdog conditions where immediate attention is required.
- When it’s optional
  - Use an NMI watchdog in addition to regular telemetry if host-level hangs are rare but costly; enable selectively on high-value or high-risk instances.
- When NOT to use / overuse it
  - Do NOT use NMI as a general-purpose monitoring signal or for frequent, low-priority events. Overuse leads to noisy alerting and may mask true emergencies.
- Decision checklist
  - If a host can hang and cause multi-tenant impact AND other diagnostics are insufficient -> enable NMI/watchdog.
  - If application-level retries or circuit breakers can recover without host intervention -> prefer application-level SLIs instead.
  - If you run fully managed compute where NMI is opaque -> rely on provider platform alerts, not local NMI tinkering.
- Maturity ladder: Beginner -> Intermediate -> Advanced
- Beginner: Enable basic NMI/watchdog; collect kernel oops and report to central logging.
- Intermediate: Correlate NMI events with metrics and alerts; automate host cordon and replacement.
- Advanced: Fleet-wide NMI analytics, predictive failure models, automated firmware rollbacks, and integration with capacity and incident pipelines.
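The decision checklist above can be encoded as a small helper. This is an illustrative sketch; the input flags and return strings are invented names, not a standard API:

```python
# Sketch: the NMI-watchdog decision checklist as a helper function.
# All parameter names and return values are illustrative.
def recommend_nmi_watchdog(can_hang_multi_tenant: bool,
                           other_diagnostics_sufficient: bool,
                           app_level_recovery: bool,
                           fully_managed_compute: bool) -> str:
    if fully_managed_compute:
        # NMI is opaque on managed compute; provider alerts are the signal.
        return "rely-on-provider-alerts"
    if app_level_recovery:
        # Retries/circuit breakers recover without host intervention.
        return "prefer-app-level-slis"
    if can_hang_multi_tenant and not other_diagnostics_sufficient:
        return "enable-nmi-watchdog"
    return "no-change"

print(recommend_nmi_watchdog(True, False, False, False))  # enable-nmi-watchdog
```

The managed-compute check runs first because host access gates every other option.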
How does NMI work?
Components and workflow
1. Hardware source (watchdog, thermal controller, PCI device) detects a critical condition.
2. Hardware asserts the NMI line to CPU.
3. The CPU bypasses normal interrupt masking and vectors to the NMI handler (in kernel or firmware).
4. Handler performs minimal diagnostics (stack trace, CPU registers) and writes to protected log regions.
5. Depending on severity, handler may attempt recovery, notify management controllers, or initiate a controlled reboot.
6. Logged data is collected by host agents and sent to observability/incident pipelines.
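As a concrete illustration of step 6, a host agent might read per-CPU NMI delivery counts from /proc/interrupts on Linux. This is a minimal sketch; the `NMI:` row layout shown is typical of x86 systems and can differ by architecture and kernel version:

```python
# Sketch: read per-CPU NMI counts from the "NMI:" row of
# /proc/interrupts (Linux). Layout assumed from typical x86 output.
def parse_nmi_counts(text: str) -> list[int]:
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for f in fields[1:]:
                if f.isdigit():
                    counts.append(int(f))
                else:
                    break  # reached the trailing description text
            return counts
    return []

sample = """           CPU0       CPU1
  0:         33          0   IO-APIC   2-edge      timer
NMI:          5          7   Non-maskable interrupts
LOC:    1234567    2345678   Local timer interrupts
"""
print(parse_nmi_counts(sample))  # per-CPU NMI counts
```

In production the agent would read the real /proc/interrupts and ship deltas, not absolutes, since the counters are cumulative since boot.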
Data flow and lifecycle
- Detection -> NMI delivered -> Handler executes -> Diagnostic capture -> Persist to ring buffer/firmware logs -> Collection agent scrapes logs -> Ingest into central observability -> Correlation and alerting -> Triage and remediation.
Edge cases and failure modes
- NMI handler deadlocks or corrupts memory causing cascade.
- NMIs occur during low-level firmware initialization and produce incomplete diagnostics.
- Hypervisor masks or reroutes NMIs causing guest-visible symptoms without host-level signals.
- Multiple NMIs in close succession trigger log overflow and partial data capture.
Typical architecture patterns for NMI
- Local NMI watchdog: Enable kernel NMI watchdog on hosts; used for VMs and bare metal to detect stuck CPUs. Use when debugging sporadic host hangs.
- Hardware watchdog + management controller: Dedicated BMC triggers NMI and then power-cycles host if OS is unresponsive. Use for resilient fleets with out-of-band management.
- Hypervisor-forwarded NMI: Hypervisor forwards host NMIs to guests for debugging; used in nested virtualization or when guest needs hardware-level visibility.
- Fleet telemetry aggregation: Agents collect NMI events and ship to centralized observability with correlation to kernel oops and machine-check logs. Use for fleet-wide failure analysis.
- Managed cloud abstraction: Rely on provider-supplied host health events and remediation; use when you cannot access host firmware or BMC.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Missing NMI logs | No diagnostics after NMI | Log overflow or handler crash | Preserve ring buffer and flush early | Empty dmesg after event |
| F2 | Spurious NMIs | Frequent transient NMIs | Faulty hardware or firmware bug | Firmware update or isolate hardware | Repeated NMI timestamps |
| F3 | NMI handler deadlock | Host becomes unresponsive | Handler reentrancy or corruption | Minimal handler steps; safe fallback | No new metrics post-NMI |
| F4 | Hypervisor masking | Guest shows symptoms but no host NMI | Hypervisor swallowed NMI | Reconfigure hypervisor forwarding | Host vs guest discrepancy |
| F5 | Partial diagnostics | Incomplete stack traces | Interrupt at early boot or corrupted memory | Out-of-band logs via BMC | Truncated logs in kernel |
| F6 | Alert storms | Many pages for same root cause | Lack of dedupe/correlation | Deduplication and grouping | High alert frequency |
Row Details (only if needed)
- F1: Ensure persistent storage of ring buffer and out-of-band log copies; use crashkernel reservation where supported.
- F2: Collect hardware vendor logs and correlate with firmware releases; reproduce with hardware swap.
- F3: Harden NMI handler to write minimal state atomically and call management controller early.
- F4: Check hypervisor settings for NMI passthrough; test with host-level diagnostics.
- F5: Use IPMI or BMC to fetch system event logs that persist across reboots.
- F6: Implement event correlation rules in alerting system and use rate-limits for paging.
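For F1, a quick way to catch a missing crashkernel reservation is to check the kernel command line. A minimal sketch; in production you would read `/proc/cmdline` on the host:

```python
# Sketch: verify a crashkernel reservation is present on the kernel
# command line (mitigation F1), so kdump can capture NMI/panic state.
def has_crashkernel(cmdline: str) -> bool:
    return any(tok.startswith("crashkernel=") for tok in cmdline.split())

# Illustrative command lines, not from a real host:
assert has_crashkernel("BOOT_IMAGE=/vmlinuz root=/dev/sda1 crashkernel=512M ro")
assert not has_crashkernel("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro")
```

A fleet-wide audit of this check makes a good pre-production gate.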
Key Concepts, Keywords & Terminology for NMI
Term — 1–2 line definition — why it matters — common pitfall
- Non-Maskable Interrupt — Hardware interrupt that cannot be masked — Signals critical conditions — Mistaking for normal IRQ
- NMI watchdog — Kernel feature to detect stuck CPUs — Helps detect hangs — False positives on heavy GC workloads
- Machine Check Exception (MCE) — CPU-detected hardware error — Points to memory/CPU faults — Ignoring vendor MCE logs
- SMI — System Management Interrupt running in firmware — Can mask OS handling — Confused with NMI behavior
- BMC — Baseboard Management Controller — Out-of-band logs and power control — Not all providers expose BMC to tenants
- IPMI — Interface for managing BMC — Useful for fetching hardware logs — Security risk if not secured
- dmesg — Kernel ring buffer output — Primary local diagnostic source — Overwrites quickly without persistence
- oops — Kernel stack trace after fault — Essential for debugging — Misinterpreting stack levels
- crashkernel — Kernel reservation for kdump — Enables capture during panic — Not always configured by default
- kdump — Kernel crash dump mechanism — Stores memory images for postmortem — Large disk needs and complexity
- PCIe AER — PCI Express Advanced Error Reporting — Can cause NMIs for device errors — Misconfigured drivers suppress reports
- Thermal trip — Overheat event causing emergency interrupt — Prevents hardware damage — Lack of telemetry for gradual heating
- Watchdog timer — Timer that triggers recovery if system appears stuck — Can generate NMI — Overzealous timeouts lead to reboots
- CPU hang — CPU stops making progress — Detectable by NMI watchdog — Hard to reproduce in dev environments
- Firmware / BIOS — Low-level platform code — Controls NMI routing — Vendor-specific behavior varies
- Hypervisor passthrough — Forwarding hardware interrupts to guests — Enables guest visibility — Can hide host-level problems
- Nested virtualization — VM inside VM — NMIs may be emulated — Complexity in debug trails
- Kernel panic — OS-level unrecoverable state — May be triggered by NMI actions — Panics should be captured via kdump
- Ring buffer — Circular log buffer in kernel — Holds recent messages — Overwrites with bursty logs
- ECC memory — Error-correcting RAM — Reduces MCEs — Not all platforms use ECC
- DIMM failure — Memory module hardware fault — Often appears as MCEs — Replacement required, not software fix
- Watchdog reset — Forced reboot from watchdog — Helps recover but may lose in-flight work — Need graceful drain before hard reset
- OOB management — Out-of-band control via BMC/IPMI — Critical when OS is unresponsive — Limited in fully managed clouds
- Firmware log — Persistent log stored by firmware/BMC — Survives reboots — Requires vendor tools to interpret
- Event correlation — Linking NMI with other telemetry — Essential for accurate triage — Lack of correlation causes noisy paging
- Panic kernel dump — Capture on panic for postmortem — Enables deep analysis — Large storage and retrieval complexity
- Telemetry agent — Host agent that ships logs/metrics — Collects NMI indicators — Can be unavailable during host hang
- Kernel oops decode — Interpreting oops stack and registers — Critical for root cause — Often requires symbolized kernels
- Symbolized stack — Human-readable stack using kernel symbols — Needed for interpretation — Requires matching build IDs
- Firmware rollback — Reverting to older firmware to mitigate regressions — Useful for spurious NMIs — Needs controlled rollout
- Safe-mode boot — Boot mode with minimal drivers — Helps reproduce early-boot NMIs — Not always feasible in production
- Panic_on_oops — Kernel setting to panic on oops — Ensures kdump capture — Can reduce system availability if misused
- Watchdog threshold — Timeout value for watchdog — Balances detection vs false positives — Too short causes unnecessary reboots
- Node cordon — Mark node unschedulable in orchestration — Prevents placing workloads on unstable node — Needs automation to avoid human delay
- Auto-replace — Automated decommission and replacement after NMI — Reduces mean time to repair — Risk of replacing healthy nodes if false positive
- JVM safepoint — Java pause indicating thread stops — Can look like CPU hang — JVM-induced pauses can trigger NMI misdiagnosis
- Interrupt storm — Flood of interrupts from device — Can starve CPU and cause watchdogs to fire — Mitigate with driver fixes
- Vendor support ticketing — Escalation for hardware NMIs — Required for warranty parts — Slow if undiagnosed internally
- Fleet analytics — Aggregate NMI events across hosts — Useful for root-cause across fleet — Requires centralized ingestion and retention
- Recovery policy — What to do when NMI occurs (reboot, replace, escalate) — Ensures consistent response — Poor policy causes inconsistent handling
How to Measure NMI (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | NMI count per host | Frequency of NMIs on a host | Count NMI events in logs per host per day | 0–0.01/day | Sparse events may indicate underreporting |
| M2 | Hosts with NMI rate | Fraction of fleet with NMIs | Unique hosts with NMI / total hosts | <0.5% weekly | Bursts skew weekly metrics |
| M3 | Time-to-detect NMI | Delay from NMI to central ingest | Timestamp compare log vs ingest | <60s | Network/agent outage increases delay |
| M4 | NMI-induced reboots | Reboots caused by NMI | Correlate reboot reason in BMC/logs | <0.1% monthly | Mislabelled reboot reasons |
| M5 | Mean time to remediation (MTTR) | Time to replace or fix host after NMI | From alert to resolved incident | <2 hours | Human-led processes vary |
| M6 | NMI correlation rate | % NMIs with root-cause identified | Count resolved root cause / total NMIs | >80% | Complex hardware issues may be unresolved |
| M7 | Alert noise ratio | Ratio of NMI alerts that were actionable | Actionable alerts / total alerts | >70% actionable | Poor dedupe inflates noise |
| M8 | NMI stack capture rate | % of NMIs with captured stack traces | Successful stack dumps / NMIs | >90% | Early boot NMIs may miss captures |
| M9 | Recurrent NMI hosts | Hosts with >1 NMI in window | Hosts with repeated NMIs in 30 days | 0 for single-tenant hosts | Intermittent hardware faults cause repeats |
| M10 | Cost per NMI incident | Operational cost to remediate | Sum of staff hours + host replacement | Varies / depends | Hard to attribute costs precisely |
Row Details (only if needed)
- M1: Ensure agent reliably writes NMI marker and preserves event timestamps.
- M3: Use agent heartbeat to validate ingest pipeline latency.
- M4: Use BMC and cloud provider metadata to identify reboot cause.
- M8: Configure crashkernel and kdump; validate in staging.
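M1 and M2 can be computed from collected NMI events. A minimal sketch, assuming events arrive as (host, timestamp) pairs; the event shape is illustrative, not a standard format:

```python
# Sketch: compute M1 (NMI count per host) and M2 (fraction of fleet
# with any NMI) from a list of (host, timestamp) NMI events.
from collections import Counter

def nmi_slis(events, fleet_size):
    per_host = Counter(host for host, _ts in events)      # M1 per host
    affected_fraction = len(per_host) / fleet_size        # M2
    return per_host, affected_fraction

# Illustrative events: host-a fired twice, host-b once.
events = [("host-a", 1), ("host-a", 2), ("host-b", 3)]
per_host, frac = nmi_slis(events, fleet_size=200)
print(per_host["host-a"], frac)  # 2 0.01
```

Windowing (per day for M1, per week for M2) would be applied by filtering the timestamps before counting.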
Best tools to measure NMI
Tool — Prometheus + exporters
- What it measures for NMI: Metrics like host NMI counts, node-exporter custom metrics, uptime and reboots.
- Best-fit environment: Kubernetes and bare-metal with Prometheus stacks.
- Setup outline:
- Expose custom metric from host agent on NMI events.
- Scrape via Prometheus node-exporter or custom exporter.
- Record rules to compute rates and SLOs.
- Use alerts to feed PagerDuty.
- Strengths:
- Flexible query language (PromQL).
- Good for SRE-run monitoring.
- Limitations:
- Requires reliable agent during host hangs.
- Not a full log solution.
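The "expose custom metric" step in the setup outline could render the counter in the Prometheus text exposition format. A dependency-free sketch; the metric name `nmi_event_total` is an assumed convention from the outline, and a real agent would typically use the official prometheus_client library instead:

```python
# Sketch: render an NMI event counter in the Prometheus text
# exposition format, with no external dependencies.
def render_metric(name: str, labels: dict, value: float) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_metric("nmi_event_total", {"host": "host-a"}, 3)
print(line)  # nmi_event_total{host="host-a"} 3
```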
Tool — Fluentd/Fluent Bit + Central Logs
- What it measures for NMI: Collects dmesg, kernel oops, and BMC logs.
- Best-fit environment: Fleet with centralized log storage.
- Setup outline:
- Configure to tail kernel logs and IPMI outputs.
- Tag and enrich events with host metadata.
- Send to central store with retention policy.
- Strengths:
- Rich context for postmortem.
- Pipeline for long-term analysis.
- Limitations:
- May miss logs if host fully unresponsive.
Tool — BMC/IPMI tooling
- What it measures for NMI: Persistent firmware logs and system event logs.
- Best-fit environment: Bare metal with out-of-band management.
- Setup outline:
- Inventory BMC credentials and secure access.
- Poll SEL (System Event Log) at interval.
- Correlate with host logs on events.
- Strengths:
- Survives OS reboots.
- Authoritative hardware source.
- Limitations:
- Access restricted in managed clouds.
- Security risks if exposed.
Tool — Cloud provider host health events
- What it measures for NMI: Provider-detected host hardware faults and maintenance events.
- Best-fit environment: Managed VMs and managed Kubernetes.
- Setup outline:
- Subscribe to provider health events.
- Map provider event codes to internal incident categories.
- Automate replacement where provider signals host degradation.
- Strengths:
- No host agent dependency.
- Limitations:
- Abstracted details; limited diagnostic depth.
Tool — Fleet analytics / SIEM
- What it measures for NMI: Correlation across hosts, trend detection.
- Best-fit environment: Large fleets with many hosts.
- Setup outline:
- Ingest NMI events and enrich with hardware metadata.
- Run clustering and anomaly detection.
- Surface recurrent failure families for remediation.
- Strengths:
- Detects systemic issues early.
- Limitations:
- Requires sustained investment in data pipelines.
Recommended dashboards & alerts for NMI
- Executive dashboard
  - Panels: Fleet NMI rate (trend), Hosts affected (count), Open NMI incidents, Business impact summary.
  - Why: High-level view for stakeholders to see hardware health and operational risk.
- On-call dashboard
  - Panels: Live NMI event stream, Hosts with recent NMIs, Reboot reasons, Recent kdump captures, Correlated application outages.
  - Why: Focused triage surface for responders.
- Debug dashboard
  - Panels: Per-host dmesg view, MCE logs, CPU usage pre-NMI, Interrupt counts, Network and disk metrics, BMC SEL entries.
  - Why: Deep diagnostic view for engineers performing root-cause analysis.
Alerting guidance:
- What should page vs ticket
  - Page: Single or recurrent NMI on a production controller or critical host impacting availability.
  - Ticket: Isolated NMI on a non-critical dev host, or an informational BMC event with no service impact.
- Burn-rate guidance
  - If NMI-driven host failures consume >25% of the infrastructure error budget in a week, escalate to a reliability strike team.
- Noise reduction tactics (dedupe, grouping, suppression)
  - Deduplicate by root-cause fingerprint and host group.
  - Group alerts by cluster and time window (e.g., suppress repeat pages for the same host within 10 minutes).
  - Use suppression windows during platform maintenance.
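The dedupe and suppression tactics above can be sketched as a fingerprint-plus-window check; this is illustrative logic, not a PagerDuty or Opsgenie API:

```python
# Sketch: suppress repeat pages for the same root-cause fingerprint
# within a window (the 10-minute repeat-page rule above = 600 s).
def should_page(alert, recent, window_s=600):
    """alert: (fingerprint, ts). recent: dict fingerprint -> last paged ts."""
    fp, ts = alert
    last = recent.get(fp)
    if last is not None and ts - last < window_s:
        return False  # suppressed duplicate
    recent[fp] = ts
    return True

recent = {}
print(should_page(("host-a:watchdog", 0), recent))    # True: first page
print(should_page(("host-a:watchdog", 300), recent))  # False: within window
print(should_page(("host-a:watchdog", 900), recent))  # True: window elapsed
```

The fingerprint would typically combine host group, NMI source (watchdog, MCE, PCIe AER), and firmware version.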
Implementation Guide (Step-by-step)
1) Prerequisites
– Inventory of hardware and virtualization mappings.
– Access to BMC/IPMI or provider host health events.
– Central logging and metrics pipeline.
– Defined recovery policy (cordon/replace vs repair).
2) Instrumentation plan
– Enable kernel NMI watchdog where appropriate.
– Configure crashkernel and kdump for dumps.
– Add host-agent hooks to mark NMI events as metrics and log them.
3) Data collection
– Collect dmesg, kernel oops, MCE logs, and BMC SEL entries.
– Persist to central logs with retention and indexing.
4) SLO design
– Define host-health SLOs incorporating NMI frequency and recovery.
– Set error budgets for host-induced incidents.
5) Dashboards
– Build executive, on-call, and debug dashboards.
– Provide drilldowns from fleet to single-host views.
6) Alerts & routing
– Set thresholds for paging vs ticketing.
– Route to hardware on-call, cloud ops, and relevant service owners.
7) Runbooks & automation
– Document step-by-step triage for NMI events.
– Automate cordon, drain, and replacement actions where safe.
8) Validation (load/chaos/game days)
– Run controlled tests that simulate stuck CPUs and verify detection & remediation.
– Include firmware-update scenarios and BMC failure modes.
9) Continuous improvement
– Monthly review of NMI trends, firmware rollouts, and vendor replacements.
– Implement RCA feedback into host provisioning workflows.
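The host-agent hook from step 2 can be sketched as a scan of kernel log lines for NMI markers. The marker strings below are typical of Linux dmesg output but vary by kernel version, so treat them as assumptions to validate against your fleet:

```python
# Sketch: turn kernel log lines containing NMI markers into
# structured events for the metrics/log pipeline (step 2/3).
NMI_MARKERS = ("NMI:", "Uhhuh. NMI received", "nmi_watchdog")

def extract_nmi_events(log_lines, host):
    return [{"host": host, "line": ln}
            for ln in log_lines
            if any(m in ln for m in NMI_MARKERS)]

logs = [
    "[12.3] usb 1-1: new high-speed USB device",
    "[99.1] Uhhuh. NMI received for unknown reason 2d on CPU 0.",
]
events = extract_nmi_events(logs, "host-a")
print(len(events))  # 1
```

Each extracted event would then increment the exported counter and be shipped with host metadata attached.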
Checklists:
- Pre-production checklist
  - Verify crashkernel/kdump enabled.
  - Test agent log shipping under a simulated hang.
  - Validate BMC access and SEL retrieval.
  - Define recovery policy for hosts in the test environment.
- Production readiness checklist
  - Baseline host NMI metrics and thresholds.
  - Alerting rules with dedupe and grouping.
  - Runbook available and indexed in the incident portal.
  - Automation for safe cordon/replacement validated.
- Incident checklist specific to NMI
  - Capture timestamp and host identifiers.
  - Retrieve BMC SEL and kernel logs.
  - Correlate with other telemetry (network, storage).
  - Decide repair vs replace based on recurrence and vendor guidance.
  - Record mitigation steps and update fleet actions.
Use Cases of NMI
- Bare-metal compute stability
– Context: High-density hosting on bare metal.
– Problem: Occasional host hangs impacting tenant VMs.
– Why NMI helps: Detects CPU stalls and forces diagnostic capture.
– What to measure: NMI count, reboots, kdump capture rate.
– Typical tools: BMC, Prometheus, Fluentd.
- Kubernetes node reliability
– Context: Node-level kernel issues causing pods to freeze.
– Problem: kubelet not responding, disrupts stateful services.
– Why NMI helps: Detects node-level hangs to trigger cordon/replace.
– What to measure: Node NMI events, node-exporter health.
– Typical tools: Prometheus, kube-controller-manager automation.
- High-performance computing clusters
– Context: Long-running compute workloads sensitive to host faults.
– Problem: Silent CPU faults leading to silent corruption.
– Why NMI helps: Machine Check Exceptions via NMI reveal hardware corruption.
– What to measure: MCE rates, NMI per job.
– Typical tools: MCE logs, fleet analytics.
- Network device faults on servers
– Context: NIC causing packet loss intermittently.
– Problem: Device driver issues trigger NMIs for serious PCI errors.
– Why NMI helps: Immediate diagnostics and possible device isolation.
– What to measure: PCIe AER incidents, NIC resets.
– Typical tools: ethtool, kernel logs.
- Firmware regression detection
– Context: Fleet firmware upgrades performed regularly.
– Problem: New firmware triggers spurious NMIs on a class of boards.
– Why NMI helps: Early detection to rollback firmware.
– What to measure: NMI spike per firmware version.
– Typical tools: Fleet analytics, firmware inventory.
- Managed cloud incident correlation
– Context: Provider host degradation impacts VMs.
– Problem: Provider signals host health but tenant lacks low-level logs.
– Why NMI helps: When available via host health events, informs remediation.
– What to measure: Provider host events vs tenant NMI symptoms.
– Typical tools: Provider event streams, internal incident systems.
- Storage controller failures
– Context: Local disks attached to hosts experiencing errors.
– Problem: Storage IO lockups affecting VMs.
– Why NMI helps: PCIe/device NMIs indicate severe controller faults.
– What to measure: Device NMIs, IO latency before event.
– Typical tools: Smartctl, device driver logs.
- Safety-critical applications (telecom/finance)
– Context: Applications requiring durable correctness.
– Problem: Silent hardware errors can corrupt transactions.
– Why NMI helps: Ensures hardware error visibility and forced remediation.
– What to measure: ECC and MCE logs correlated with NMIs.
– Typical tools: ECC telemetry, MCE parsers.
- On-prem colo with limited redundancy
– Context: Single-host services without cloud redundancy.
– Problem: Host hangs cause prolonged service outages.
– Why NMI helps: Automate detection and BMC-initiated recovery to reduce downtime.
– What to measure: Time-to-reboot and service availability post-NMI.
– Typical tools: IPMI, automated runbooks.
- Regression testing for OS updates
– Context: Rolling kernel updates can affect NMI handling.
– Problem: Kernel changes cause different NMI behavior.
– Why NMI helps: Detect regressions in test labs before rollout.
– What to measure: NMI occurrence on canary hosts.
– Typical tools: CI labs, kdump validations.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes node hang due to kernel deadlock (Kubernetes scenario)
Context: Production Kubernetes cluster with stateful services experiencing pod stalls.
Goal: Detect node hangs quickly and automate replacement with minimal app impact.
Why NMI matters here: NMI watchdog can detect kernel stalls that kubelet and liveness probes cannot.
Architecture / workflow: Host-level NMI watchdog -> Kernel handler captures oops and marks node as unhealthy -> Host agent sends event to central logs -> Controller automation cordons node and drains workloads -> Replace host.
Step-by-step implementation:
- Enable kernel NMI watchdog on node images.
- Configure crashkernel and kdump to store dumps to network-backed storage.
- Add agent metric for “nmi_event_total” and ship to Prometheus.
- Create an alert rule: any increase in nmi_event_total on a prod node -> page the hardware team.
- Automate cordon+drain when NMI occurs via operator with safety checks.
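One way to evaluate the alert rule from the steps above is to compare successive scrapes of `nmi_event_total` per node; a minimal sketch, where the sample shape is an assumption rather than a Prometheus API:

```python
# Sketch: page when the nmi_event_total counter increases on any
# prod node between two scrapes.
def nodes_to_page(prev, curr):
    """prev/curr: dict node -> nmi_event_total counter value."""
    return sorted(n for n, v in curr.items() if v > prev.get(n, 0))

prev = {"node-1": 0, "node-2": 0}
curr = {"node-1": 0, "node-2": 1}
print(nodes_to_page(prev, curr))  # ['node-2']
```

Comparing deltas rather than testing `> 0` avoids re-paging on a counter that stays nonzero after the first event.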
What to measure: NMI events, time node cordoned, pod eviction success rate.
Tools to use and why: Prometheus for metrics, Fluent Bit for logs, kube-controller-manager for automation.
Common pitfalls: Watchdog false positives during CPU-intensive pods; tune thresholds to the workload.
Validation: Run simulated CPU hang tests in canary; verify cordon and replacement.
Outcome: Faster detection and reduced downtime for affected services.
Scenario #2 — Serverless function host hardware fault (Serverless/managed-PaaS scenario)
Context: Managed serverless provider where tenant functions occasionally cold-fail without clear app errors.
Goal: Correlate host-level faults with function failures and reduce function error rates.
Why NMI matters here: Underlying host hardware faults manifest as function invocation failures; NMIs indicate host-level issues needing provider action.
Architecture / workflow: Provider collects host events -> NMIs associated with host IDs -> Functions on same host flagged and invocations redirected -> Provider performs host replacement.
Step-by-step implementation:
- Ensure provider exposes host health events or internal BMC logs.
- Correlate function invocation failures with host event timestamps.
- Blacklist affected hosts in routing pool until resolved.
What to measure: Host NMI events, function error rate per host, blacklisted-host count.
Tools to use and why: Provider internal telemetry, central event bus.
Common pitfalls: Tenant cannot access host-level logs in public clouds; rely on provider channels.
Validation: Inject a simulated kernel hang in a non-prod region to verify routing changes.
Outcome: Reduced invocation failures and faster provider remediation.
Scenario #3 — Postmortem after recurrent NMIs (Incident-response/postmortem scenario)
Context: Fleet shows recurring NMIs on a set of blades after firmware rollout.
Goal: Root-cause analysis and rollback plan.
Why NMI matters here: Pattern of NMIs indicates regression introduced by firmware.
Architecture / workflow: Collect NMI events, firmware version mapping, BMC SEL, and service impact metrics -> RCA -> Firmware rollback plan and automated replacement.
Step-by-step implementation:
- Aggregate NMIs across hosts and group by firmware version.
- Validate with vendor logs and affected hardware SKU.
- Rollback firmware on canary subset, observe reduction in NMIs, then do controlled rollback.
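The aggregation step above can be sketched as a simple group-by over an inventory mapping; the host-to-firmware mapping and hostnames are illustrative:

```python
# Sketch: group fleet NMI events by firmware version to spot a
# regression (Scenario #3).
from collections import Counter

def nmi_by_firmware(events, inventory):
    """events: list of host ids that fired an NMI.
    inventory: host id -> firmware version."""
    return Counter(inventory[h] for h in events if h in inventory)

inventory = {"h1": "fw-2.1", "h2": "fw-2.1", "h3": "fw-2.0"}
events = ["h1", "h2", "h1", "h3"]
print(nmi_by_firmware(events, inventory))  # fw-2.1 dominates
```

A skew toward one firmware version, normalized by how many hosts run it, is the signal that justifies the canary rollback.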
What to measure: NMI rate pre/post rollback, service error budget consumption.
Tools to use and why: Fleet analytics, vendor firmware tooling.
Common pitfalls: Incomplete metadata linking firmware to host; ensure accurate inventory.
Validation: Canary rollback and monitoring for recurrence.
Outcome: Restored fleet stability and updated firmware rollout policy.
Scenario #4 — Cost vs performance tradeoff: aggressive watchdogs (Cost/performance trade-off scenario)
Context: High-performance trading application prioritizing latency; ops must balance detection with performance.
Goal: Tune watchdog sensitivity to avoid performance regressions while detecting real hangs.
Why NMI matters here: Aggressive watchdogs may cause reboots during short GC or high-CPU bursts; too lax and hangs go undetected.
Architecture / workflow: Parameterize watchdog timeout per instance class; observe impact on latency and NMI rate.
Step-by-step implementation:
- Baseline latency and CPU profiles.
- Enable NMI watchdog with conservative timeout on canary nodes.
- Adjust timeouts and correlate NMI rate vs latency impact.
- Choose per-instance class policy (e.g., lower sensitivity on ultra-low-latency instances).
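Step 3's correlation can start from observed pause durations: estimate how often a candidate timeout would misclassify a benign pause (GC, CPU burst) as a hang. A minimal sketch with illustrative numbers:

```python
# Sketch: estimate watchdog false-positive rate for a candidate
# timeout from observed benign pause durations; any pause at or
# above the timeout would be misread as a hang.
def false_positive_rate(pause_durations_s, timeout_s):
    if not pause_durations_s:
        return 0.0
    fps = sum(1 for p in pause_durations_s if p >= timeout_s)
    return fps / len(pause_durations_s)

pauses = [0.2, 0.5, 1.1, 3.0, 9.5]  # illustrative worst-case pauses
print(false_positive_rate(pauses, timeout_s=5.0))  # 0.2
```

Sweeping `timeout_s` over this data gives the detection-vs-false-reboot curve used to pick a per-instance-class policy.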
What to measure: Latency SLOs, NMI events, false reboot rate.
Tools to use and why: Prometheus, performance test harness.
Common pitfalls: Over-generalizing timeouts across different workloads; tune per class.
Validation: Load tests and targeted chaos testing.
Outcome: Balanced detection while preserving latency goals.
Scenario #5 — Nested virtualization debug (Kubernetes/hypervisor hybrid)
Context: VM running Kubernetes, guest experiences fatal hangs; unclear if guest or host triggered.
Goal: Determine whether NMIs originate in host or guest and fix accordingly.
Why NMI matters here: NMIs may be swallowed or emulated by the hypervisor, leading to ambiguity about their origin.
Architecture / workflow: Instrument both host and guest with NMI counters, enable hypervisor passthrough settings, compare logs.
Step-by-step implementation:
- Enable guest-visible NMI metrics.
- Correlate guest NMI events with host BMC SEL entries.
- If host-origin, escalate to hardware replacement; if guest-origin, patch guest kernel.
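The correlation step can be sketched as a simple time-window match between guest NMI timestamps and host SEL entries. The timestamps and window are illustrative; real logs would be parsed from guest dmesg and the BMC SEL export:

```python
from datetime import datetime, timedelta

def nmi_match_rate(guest_nmis, host_sel, window_s=5.0):
    """Fraction of guest NMI timestamps that have a host SEL entry
    within window_s seconds; a high rate suggests host-origin NMIs."""
    if not guest_nmis:
        return 0.0
    matched = sum(
        1 for g in guest_nmis
        if any(abs((g - h).total_seconds()) <= window_s for h in host_sel)
    )
    return matched / len(guest_nmis)

# Illustrative timestamps (clock sync between host and BMC is assumed).
t0 = datetime(2026, 1, 15, 3, 0, 0)
guest = [t0, t0 + timedelta(minutes=10)]
host = [t0 + timedelta(seconds=2)]  # only the first guest event matches
rate = nmi_match_rate(guest, host)
```

Note that this only works if host, guest, and BMC clocks are synchronized (see the NTP/PTP pitfall below).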
What to measure: Host vs guest NMI event correlation rate.
Tools to use and why: QEMU/KVM logs, host BMC logs, guest kernel logs.
Common pitfalls: Hypervisors mask NMIs by default; forwarding must be explicitly enabled for accurate diagnosis.
Validation: Reproduce on nested testbed.
Outcome: Correct ownership and remediation path identified.
Common Mistakes, Anti-patterns, and Troubleshooting
Each entry below follows the pattern Symptom -> Root cause -> Fix.
- Symptom: No NMI logs after host crash -> Root cause: crashkernel not configured -> Fix: Reserve crashkernel and enable kdump.
- Symptom: Frequent NMI alerts -> Root cause: firmware regression -> Fix: Rollback firmware and coordinate vendor patch.
- Symptom: NMI event but no service impact -> Root cause: Spurious device error -> Fix: Isolate device and monitor; vendor test.
- Symptom: Guest shows hang, no host NMI -> Root cause: Hypervisor masking -> Fix: Enable NMI passthrough or increase host diagnostics.
- Symptom: Alert storm during maintenance -> Root cause: No suppression rules -> Fix: Implement maintenance window suppression and dedupe.
- Symptom: Missing kernel stacks -> Root cause: Handler did too much and crashed -> Fix: Simplify handler and ensure atomic minimal logging.
- Symptom: False positives in high-CPU workloads -> Root cause: Watchdog timeout too short -> Fix: Tune thresholds per workload.
- Symptom: Unable to access BMC -> Root cause: Network/cred issues -> Fix: Secure and document BMC access and rotate creds.
- Symptom: Recurrent host replacement without fix -> Root cause: Not capturing full diagnostics -> Fix: Ensure persistent firmware logs and kdump.
- Symptom: Slow detection of NMI -> Root cause: Agent ingest lag -> Fix: Prioritize NMI pipeline and heartbeat checks.
- Symptom: Lost correlation between NMI and app outages -> Root cause: Poor timestamp sync -> Fix: Ensure NTP/PTP across hosts and BMCs.
- Symptom: High MTTR -> Root cause: No automation for cordon/replace -> Fix: Implement safe automation runbooks.
- Symptom: Overloaded alerting system -> Root cause: Every NMI pages SRE -> Fix: Tier alerts and route to hardware team first.
- Symptom: Partial SEL entries -> Root cause: BMC buffer overflow -> Fix: Increase SEL polling frequency.
- Symptom: Confusing SMI vs NMI behavior -> Root cause: Lack of documentation -> Fix: Create vendor-specific runbooks clarifying differences.
- Symptom: Incomplete kdump captures -> Root cause: Insufficient reserved memory -> Fix: Increase crashkernel reservation.
- Symptom: Missing symbolized stacks -> Root cause: Kernel versions mismatched -> Fix: Keep kernel build and symbol store synchronized.
- Symptom: Noise during firmware maintenance -> Root cause: Not excluding canary hosts -> Fix: Use canary cohort and staged rollout.
- Symptom: Data loss after watchdog reboot -> Root cause: Hard reset without graceful drain -> Fix: Graceful drain automation before reboot.
- Symptom: Observability agent fails during hang -> Root cause: Agent runs in same kernel context -> Fix: Use out-of-band collection or pre-persisted logs.
- Symptom: Misattributed billing events -> Root cause: Reboots lead to VM migration counts -> Fix: Tag incidents and correlate with billing pipeline.
- Symptom: Excessive human toil -> Root cause: Manual runbook steps -> Fix: Automate routine remediation tasks.
- Symptom: Vendor denies hardware fault -> Root cause: Insufficient diagnostic evidence -> Fix: Collect MCE, SEL, and crash dumps before replacement.
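The first mistake above (no NMI logs because crashkernel was never configured) is easy to audit programmatically. A sketch that parses a kernel command line for the reservation; on a live host you would read the string from /proc/cmdline:

```python
import re

def crashkernel_reserved(cmdline: str):
    """Return the crashkernel= value from a kernel command line, or None.
    A missing value means kdump cannot capture a dump after a fatal NMI."""
    m = re.search(r"\bcrashkernel=(\S+)", cmdline)
    return m.group(1) if m else None

# Illustrative command line; real hosts expose this via /proc/cmdline.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro crashkernel=512M quiet"
reserved = crashkernel_reserved(sample)
```

Running this check fleet-wide (and alerting on hosts returning None) closes the gap before the next crash, rather than after.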
Observability pitfalls:
- PITFALL: Relying solely on dmesg -> CAUSE: dmesg ephemeral -> FIX: Persist to central logs and capture frequently.
- PITFALL: Not syncing clocks -> CAUSE: Misaligned timestamps -> FIX: NTP/PTP across host and BMC.
- PITFALL: Missing correlation with higher-level metrics -> CAUSE: Siloed telemetry -> FIX: Integrate event bus and correlate traces/metrics/logs.
- PITFALL: Agent unavailable during hang -> CAUSE: Agent process requires kernel context -> FIX: Out-of-band logging via BMC.
- PITFALL: Indexing every raw dump -> CAUSE: Storage blowout -> FIX: Prioritize metadata and sample full dumps.
Best Practices & Operating Model
- Ownership and on-call
  - The hardware/BMC team owns NMI handling policies; service owners own application impact. Create a separate hardware on-call rotation for urgent host-level NMIs.
- Runbooks vs playbooks
  - Runbooks: step-by-step remediation for a specific host NMI.
  - Playbooks: higher-level escalation flows involving multiple teams and vendor engagement.
- Safe deployments (canary/rollback)
  - Firmware/kernel changes must roll through small canaries with NMI and kdump monitoring. Implement automated rollback triggers on NMI spikes.
- Toil reduction and automation
  - Automate cordon/drain and host replacement. Use auto-replace policies for hosts with repeat NMIs, and escalate to the vendor only after automated triage completes.
- Security basics
  - Secure BMC/IPMI access with credential rotation and network isolation. Limit who can read SEL logs. Audit all actions against hardware.
- Weekly/monthly routines
  - Weekly: review NMI events for spikes and outstanding kdumps.
  - Monthly: analyze fleet trends and firmware version prevalence, and identify hotspots.
- What to review in NMI-related postmortems
  - Confirm diagnostic artifacts were preserved, validate that remediation executed, check automation coverage, and revise the firmware rollout policy if needed.
Tooling & Integration Map for NMI
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Metrics | Collects NMI counters from hosts | Prometheus, exporters | Use custom metric for NMIs |
| I2 | Logs | Aggregates dmesg and kernel oops | Fluent Bit, ELK | Ensure retention for postmortems |
| I3 | BMC tooling | Fetches SEL and firmware logs | ipmitool, vendor tools | Out-of-band source for diagnostics |
| I4 | Crash capture | Captures kdump images | kdump, crash utilities | Requires reserved memory |
| I5 | Fleet analytics | Correlates events across hosts | SIEM, data warehouse | Good for systemic trends |
| I6 | Alerting | Pages and routes incidents | PagerDuty, Opsgenie | Deduplication important |
| I7 | Orchestration | Automates cordon/drain/replace | Kubernetes controllers, Terraform | Critical for low MTTR |
| I8 | Hypervisor | Manages interrupt forwarding | KVM/QEMU, Xen | Hypervisor settings affect NMI visibility |
| I9 | Firmware mgmt | Deploys and rolls back firmware | Vendor update tools | Track firmware per host |
| I10 | Provider health | Receives provider host events | Cloud event bus | Abstracted details; combine with tenant logs |
Row details
- I3: Ensure BMC credentials management integrates with secrets store and audit logs.
- I4: Validate crashkernel sizing in CI images to ensure kdump success.
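For I1, a minimal sketch of how a custom exporter might derive an NMI counter on Linux by parsing the NMI row of /proc/interrupts (the sample text below is an illustrative excerpt, not real host output):

```python
def nmi_total(interrupts_text: str) -> int:
    """Sum per-CPU NMI counts from the NMI: row of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Per-CPU columns are integers; trailing description is not.
            return sum(int(f) for f in fields[1:] if f.isdigit())
    return 0

# Illustrative /proc/interrupts excerpt for a two-CPU host.
sample = """           CPU0       CPU1
  0:         33          0   IO-APIC   2-edge      timer
NMI:         12          7   Non-maskable interrupts
"""
total = nmi_total(sample)
```

An exporter would read the real file each scrape and publish the sum as a monotonically increasing counter, letting Prometheus compute NMI rates per host.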
Frequently Asked Questions (FAQs)
What exactly triggers an NMI?
Hardware sources like watchdog timers, certain PCI errors, thermal events, and CPU machine checks can trigger an NMI.
Can NMIs be disabled?
Varies / depends. Some platforms and hypervisors allow configuration; disabling removes a critical safety mechanism and is generally discouraged.
Are NMIs visible inside VMs?
Sometimes. Hypervisor decisions determine whether NMIs are forwarded to guests, emulated, or swallowed.
How do NMIs differ from SMIs?
SMIs execute in firmware SMM and are generally invisible to the OS; NMIs are delivered to the CPU and handled by the OS/kernel.
How do I capture useful data when an NMI happens?
Enable crashkernel/kdump, persist kernel ring buffer to a central log, and collect BMC SEL entries.
Can NMIs be the cause of data corruption?
Not directly; NMIs are signaling mechanisms. However, they often indicate hardware faults (such as MCEs) that can lead to data corruption if left undetected.
How to avoid false positives from watchdog NMIs?
Tune watchdog thresholds per workload and test under realistic load profiles.
Do cloud providers surface NMIs to tenants?
Varies / depends. Public cloud providers typically surface aggregate host health events rather than raw NMIs.
How to troubleshoot repeated NMIs on a host?
Collect kdump and SEL, compare firmware and hardware SKU, isolate with hardware swap, and engage vendor support.
Is there a standard SLI for NMIs?
No universal standard exists; a common SLI is the number of hosts experiencing NMIs per period (or NMI rate per host). Define it based on your fleet's risk tolerance.
Should NMIs page engineers?
Only if the event impacts production availability or is a repeated/recurrent failure; otherwise log and ticket.
How to test NMI handling?
Simulate hung CPUs in a controlled environment or use vendor-provided diagnostic tools.
Will NMIs always produce stack traces?
Not always. Early-boot NMIs or severe memory corruption can prevent full stack captures.
How to secure BMC/IPMI access?
Network isolation, credential rotation, strong auth, and restricted access roles.
What to do if NMIs increase after a firmware update?
Halt rollout, roll back canary firmware, gather diagnostics, and escalate to vendor.
Can software cause an NMI?
Rarely directly (kernels can send NMI IPIs between CPUs, e.g., for profiling or backtrace collection), but more commonly software behavior, such as disabling interrupts for too long, causes watchdog timeouts that trigger NMIs.
How long should we retain NMI-related logs?
Sufficient to perform RCA and trend analysis; typically months for fleet analytics and at least 90 days for incident traces.
Do containers see NMIs?
Containers rely on kernel/host; containers do not directly receive NMIs but will suffer host-level impacts.
Conclusion
Non-Maskable Interrupts are a critical, low-level signal for hardware and platform reliability. For 2026 cloud-native operations, treating NMIs as part of your observability and incident response fabric is essential—especially for bare-metal fleets, high-availability services, and safety-critical workloads. Proper instrumentation, automation, and vendor collaboration turn NMIs from opaque emergencies into actionable diagnostics.
Next 7 days plan:
- Day 1: Inventory hosts and confirm crashkernel/kdump settings on a canary cohort.
- Day 2: Enable minimal NMI telemetry and ship to central logs.
- Day 3: Create Prometheus metrics and dashboards for NMI events.
- Day 4: Draft runbook for NMI triage and test on non-prod hosts.
- Day 5–7: Run a controlled NMI simulation in staging and validate automation for cordon/replace.
Appendix — NMI Keyword Cluster (SEO)
- Primary keywords
- Non-Maskable Interrupt NMI
- NMI watchdog
- kernel NMI
- NMI dashboard
- NMI monitoring
- Secondary keywords
- machine check exception MCE
- kernel oops NMI
- crashkernel kdump NMI
- BMC SEL NMI
- IPMI NMI logs
- firmware NMI regression
- NMI alerting
- hypervisor NMI passthrough
- NMI troubleshooting
- NMI runbook
- Long-tail questions
- What causes a non-maskable interrupt in servers
- How to capture kdump after an NMI
- How to correlate NMI with application outages
- How to configure NMI watchdog in Linux
- Why am I getting frequent NMIs after firmware update
- How do cloud providers expose host NMIs
- How to test NMI detection and recovery
- How to secure BMC when collecting NMI logs
- What is the difference between SMI and NMI
- Do VMs receive NMIs from the host
- How to reduce NMI alert noise
- What to include in an NMI incident postmortem
- How to configure Prometheus for NMI events
- How to automate replacement on NMI detection
- How to interpret machine check exception logs
- Related terminology
- kernel panic
- IRQ vs NMI
- System Event Log SEL
- out-of-band management
- crash dump
- symbolized stack
- ECC memory
- DIMM failure
- PCIe AER
- watchdog timer
- BIOS/UEFI
- SMM
- kexec
- node cordon
- auto-replace policy
- fleet analytics
- incident burn rate
- observability pipeline
- deduplication rules
- canary firmware rollout
- proactive replacement
- panic_on_oops
- kernel symbol store
- NTP/PTP sync
- event correlation
- out-of-band logs
- vendor escalation
- hardware on-call
- platform health event