{"id":2435,"date":"2026-02-17T08:08:48","date_gmt":"2026-02-17T08:08:48","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/nmi\/"},"modified":"2026-02-17T15:32:08","modified_gmt":"2026-02-17T15:32:08","slug":"nmi","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/nmi\/","title":{"rendered":"What is NMI? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)"},"content":{"rendered":"\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Quick Definition<\/h2>\n\n\n\n<p>A Non-Maskable Interrupt (NMI) is a hardware-level interrupt that software cannot mask through the normal interrupt-disable mechanism; it signals critical conditions such as hardware faults or watchdog timeouts. Analogy: an emergency alarm that always breaks through whatever you\u2019re doing. Formal: a high-priority interrupt, exempt from ordinary interrupt masking, delivered to the processor for immediate handling.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">What is NMI?<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\n<p>What it is \/ what it is NOT<br\/>\n  An NMI is a hardware-initiated interrupt intended for critical, high-severity events that require immediate attention from the CPU and operating system. It is NOT a regular software interrupt, nor is it generally used for routine telemetry or graceful shutdowns.<\/p>\n<\/li>\n<li>\n<p>Key properties and constraints  <\/p>\n<\/li>\n<li>Always prioritized above maskable interrupts.  <\/li>\n<li>Delivered by hardware sources (chipset, watchdog timers, PCI devices, thermal units).  <\/li>\n<li>Handler execution is constrained by the CPU and OS context; reentrancy and safety are concerns.  <\/li>\n<li>Can indicate unrecoverable conditions (e.g., memory corruption) or survivable diagnostics (e.g., watchdog timeout).  
<\/li>\n<li>\n<p>Behavior can vary by platform, hypervisor, and BIOS\/UEFI firmware.<\/p>\n<\/li>\n<li>\n<p>Where it fits in modern cloud\/SRE workflows  <\/p>\n<\/li>\n<li>Troubleshooting root-cause hardware incidents on hosts (bare metal or VMs with passthrough).  <\/li>\n<li>Detecting stuck CPUs and kernel deadlocks via NMI watchdogs.  <\/li>\n<li>Integrating with incident response for high-severity host-level incidents and correlated observability.  <\/li>\n<li>\n<p>Postmortems and capacity planning when hardware error rates rise.<\/p>\n<\/li>\n<li>\n<p>Text-only flow description  <\/p>\n<\/li>\n<li>Physical hardware sources (watchdog chip, thermal sensor, PCI device) send an NMI signal to the CPU(s).  <\/li>\n<li>The CPU interrupts the current context and vectors to the NMI handler in the kernel\/firmware.  <\/li>\n<li>The NMI handler attempts minimal diagnostics, writes to special logs, and may trigger platform recovery or a reboot.  <\/li>\n<li>Observability agents collect NMI logs and map them to infrastructure monitoring and incident pipelines.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">NMI in one sentence<\/h3>\n\n\n\n<p>NMI is a hardware-triggered, non-maskable interrupt delivered to the CPU to signal critical conditions that require immediate OS\/firmware attention.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">NMI vs related terms<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Term<\/th>\n<th>How it differs from NMI<\/th>\n<th>Common confusion<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>T1<\/td>\n<td>IRQ<\/td>\n<td>IRQs are maskable and used for normal device interrupts<\/td>\n<td>Assumed to carry the same urgency<\/td>\n<\/tr>\n<tr>\n<td>T2<\/td>\n<td>SMI<\/td>\n<td>SMI runs in firmware\/SMM, not OS context<\/td>\n<td>Both are low-level interrupts<\/td>\n<\/tr>\n<tr>\n<td>T3<\/td>\n<td>Watchdog timer<\/td>\n<td>A watchdog can raise an NMI when it expires<\/td>\n<td>Not all watchdog events are 
NMIs<\/td>\n<\/tr>\n<tr>\n<td>T4<\/td>\n<td>Machine Check<\/td>\n<td>Machine checks are CPU-reported hardware errors; on x86 they use a dedicated exception, though some platforms signal related errors via NMI<\/td>\n<td>Often conflated with NMI events<\/td>\n<\/tr>\n<tr>\n<td>T5<\/td>\n<td>Kernel panic<\/td>\n<td>A panic is the OS reaction to a fatal error; an NMI may lead to one<\/td>\n<td>People think panic equals NMI<\/td>\n<\/tr>\n<tr>\n<td>T6<\/td>\n<td>Exception<\/td>\n<td>Exceptions are CPU faults from instruction execution<\/td>\n<td>Some exceptions are handled differently<\/td>\n<\/tr>\n<tr>\n<td>T7<\/td>\n<td>Hypervisor trap<\/td>\n<td>Hypervisor traps are virtualization events<\/td>\n<td>Might mask or reroute NMIs<\/td>\n<\/tr>\n<tr>\n<td>T8<\/td>\n<td>ACPI event<\/td>\n<td>ACPI signals power\/thermal; can cause NMI on some boards<\/td>\n<td>Not typically an NMI source<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Why does NMI matter?<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Business impact (revenue, trust, risk)  <\/li>\n<li>\n<p>Host-level failures can cause multi-tenant VM interruptions or data plane outages. Uninvestigated NMIs can hide systemic hardware issues leading to repeated downtime, SLA breaches, and customer churn.<\/p>\n<\/li>\n<li>\n<p>Engineering impact (incident reduction, velocity)  <\/p>\n<\/li>\n<li>\n<p>Early detection of stuck CPUs or hardware errors via NMI reduces mean time to detect (MTTD) and helps avoid cascading faults. Clear NMI telemetry improves troubleshooting velocity and reduces firefighting toil.<\/p>\n<\/li>\n<li>\n<p>SRE framing (SLIs\/SLOs\/error budgets\/toil\/on-call)  <\/p>\n<\/li>\n<li>\n<p>NMIs map to host-health SLIs (e.g., host-healthy-rate). Elevated NMI rates should consume error budget for infrastructure SLOs. 
On-call playbooks must include NMI triage to prevent noisy paging and to escalate only genuine host-level failures.<\/p>\n<\/li>\n<li>\n<p>Realistic \u201cwhat breaks in production\u201d examples<br\/>\n  1. A kernel deadlock on a host freezes all containers; the watchdog NMI invokes the kernel handler and may force a reboot.<br\/>\n  2. Repeated Machine Check Exceptions signaled via NMI indicate a failing DIMM, causing VM crashes and storage errors.<br\/>\n  3. A thermal sensor NMI on a blade triggers throttling or an immediate shutdown, impacting throughput and auto-scaling decisions.<br\/>\n  4. A faulty NIC triggers a PCI-generated NMI, breaking network traffic for affected hosts in a cluster.<br\/>\n  5. A firmware bug causes spurious NMIs across a hardware fleet, generating alert storms and masking true incidents.<\/p>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Where is NMI used?<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Layer\/Area<\/th>\n<th>How NMI appears<\/th>\n<th>Typical telemetry<\/th>\n<th>Common tools<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>L1<\/td>\n<td>Edge \/ Physical host<\/td>\n<td>Hardware watchdog or thermal NMI<\/td>\n<td>NMI logs, kernel oops, dmesg<\/td>\n<td>syslog, journalctl<\/td>\n<\/tr>\n<tr>\n<td>L2<\/td>\n<td>Network \/ NIC<\/td>\n<td>PCI error NMI<\/td>\n<td>NIC reset traces, driver logs<\/td>\n<td>ethtool, driver logs<\/td>\n<\/tr>\n<tr>\n<td>L3<\/td>\n<td>Hypervisor<\/td>\n<td>NMI forwarded or trapped<\/td>\n<td>Hypervisor logs, VM crash reports<\/td>\n<td>QEMU\/KVM logs, Xen logs<\/td>\n<\/tr>\n<tr>\n<td>L4<\/td>\n<td>Kubernetes node<\/td>\n<td>Node freezes, kubelet not responding<\/td>\n<td>Node heartbeat, node-exporter metrics<\/td>\n<td>Prometheus, node-exporter<\/td>\n<\/tr>\n<tr>\n<td>L5<\/td>\n<td>Serverless \/ PaaS<\/td>\n<td>Platform-level host failures<\/td>\n<td>Platform incident events<\/td>\n<td>Cloud 
provider status, platform logs<\/td>\n<\/tr>\n<tr>\n<td>L6<\/td>\n<td>Observability<\/td>\n<td>Alert correlator for host NMIs<\/td>\n<td>Event stores, incident graphs<\/td>\n<td>PagerDuty, Opsgenie<\/td>\n<\/tr>\n<tr>\n<td>L7<\/td>\n<td>BIOS\/UEFI firmware<\/td>\n<td>SMI\/NMI interplay visible during boot<\/td>\n<td>Firmware logs, platform error logs<\/td>\n<td>ipmitool, ACPI logs<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>L1: Hardware NMIs often originate from embedded controllers and require vendor firmware logs and IPMI\/OEM logs for full context.<\/li>\n<li>L3: Hypervisors may choose to swallow, forward, or emulate NMIs; behavior depends on platform and configuration.<\/li>\n<li>L5: Managed PaaS abstracts NMIs; operators usually see only higher-level platform health events and must rely on provider telemetry.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">When should you use NMI?<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>When it\u2019s necessary  <\/li>\n<li>\n<p>Use NMIs for detecting critical, unrecoverable hardware errors, stuck CPUs, and platform watchdog conditions where immediate attention is required.<\/p>\n<\/li>\n<li>\n<p>When it\u2019s optional  <\/p>\n<\/li>\n<li>\n<p>Use an NMI watchdog in addition to regular telemetry if host-level hangs are rare but costly; enable selectively in high-value or high-risk instances.<\/p>\n<\/li>\n<li>\n<p>When NOT to use \/ overuse it  <\/p>\n<\/li>\n<li>\n<p>Do NOT use NMI as a general-purpose monitoring signal or for frequent, low-priority events. 
Overuse leads to noisy alerting and may mask true emergencies.<\/p>\n<\/li>\n<li>\n<p>Decision checklist  <\/p>\n<\/li>\n<li>If a host can hang and cause multi-tenant impact AND other diagnostics are insufficient -&gt; enable NMI\/watchdog.  <\/li>\n<li>If application-level retries or circuit breakers can recover without host intervention -&gt; prefer application-level SLIs instead.  <\/li>\n<li>\n<p>If you run fully managed compute where NMI is opaque -&gt; rely on provider platform alerts, not local NMI tinkering.<\/p>\n<\/li>\n<li>\n<p>Maturity ladder: Beginner -&gt; Intermediate -&gt; Advanced  <\/p>\n<\/li>\n<li>Beginner: Enable basic NMI\/watchdog; collect kernel oops output and ship it to central logging.  <\/li>\n<li>Intermediate: Correlate NMI events with metrics and alerts; automate host cordon and replacement.  <\/li>\n<li>Advanced: Fleet-wide NMI analytics, predictive failure models, automated firmware rollbacks, and integration with capacity and incident pipelines.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How does NMI work?<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\n<p>Components and workflow<br\/>\n  1. A hardware source (watchdog, thermal controller, PCI device) detects a critical condition.<br\/>\n  2. The hardware asserts the NMI line to the CPU.<br\/>\n  3. The CPU accepts the interrupt even when maskable interrupts are disabled and vectors to the NMI handler (in kernel or firmware).<br\/>\n  4. The handler performs minimal diagnostics (stack trace, CPU registers) and writes to protected log regions.<br\/>\n  5. Depending on severity, the handler may attempt recovery, notify management controllers, or initiate a controlled reboot.<br\/>\n  6. 
Logged data is collected by host agents and sent to observability\/incident pipelines.<\/p>\n<\/li>\n<li>\n<p>Data flow and lifecycle  <\/p>\n<\/li>\n<li>\n<p>Detection -&gt; NMI delivered -&gt; Handler executes -&gt; Diagnostic capture -&gt; Persist to ring buffer\/firmware logs -&gt; Collection agent scrapes logs -&gt; Ingest into central observability -&gt; Correlation and alerting -&gt; Triage and remediation.<\/p>\n<\/li>\n<li>\n<p>Edge cases and failure modes  <\/p>\n<\/li>\n<li>NMI handler deadlocks or corrupts memory causing cascade.  <\/li>\n<li>NMIs occur during low-level firmware initialization and produce incomplete diagnostics.  <\/li>\n<li>Hypervisor masks or reroutes NMIs causing guest-visible symptoms without host-level signals.  <\/li>\n<li>Multiple NMIs in close succession trigger log overflow and partial data capture.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical architecture patterns for NMI<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Local NMI watchdog: Enable kernel NMI watchdog on hosts; used for VMs and bare metal to detect stuck CPUs. Use when debugging sporadic host hangs.<\/li>\n<li>Hardware watchdog + management controller: Dedicated BMC triggers NMI and then power-cycles host if OS is unresponsive. Use for resilient fleets with out-of-band management.<\/li>\n<li>Hypervisor-forwarded NMI: Hypervisor forwards host NMIs to guests for debugging; used in nested virtualization or when guest needs hardware-level visibility.<\/li>\n<li>Fleet telemetry aggregation: Agents collect NMI events and ship to centralized observability with correlation to kernel oops and machine-check logs. 
Use for fleet-wide failure analysis.<\/li>\n<li>Managed cloud abstraction: Rely on provider-supplied host health events and remediation; use when you cannot access host firmware or BMC.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Failure modes &amp; mitigation<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Failure mode<\/th>\n<th>Symptom<\/th>\n<th>Likely cause<\/th>\n<th>Mitigation<\/th>\n<th>Observability signal<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>F1<\/td>\n<td>Missing NMI logs<\/td>\n<td>No diagnostics after NMI<\/td>\n<td>Log overflow or handler crash<\/td>\n<td>Preserve ring buffer and flush early<\/td>\n<td>Empty dmesg after event<\/td>\n<\/tr>\n<tr>\n<td>F2<\/td>\n<td>Spurious NMIs<\/td>\n<td>Frequent transient NMIs<\/td>\n<td>Faulty hardware or firmware bug<\/td>\n<td>Firmware update or isolate hardware<\/td>\n<td>Repeated NMI timestamps<\/td>\n<\/tr>\n<tr>\n<td>F3<\/td>\n<td>NMI handler deadlock<\/td>\n<td>Host becomes unresponsive<\/td>\n<td>Handler reentrancy or corruption<\/td>\n<td>Minimal handler steps; safe fallback<\/td>\n<td>No new metrics post-NMI<\/td>\n<\/tr>\n<tr>\n<td>F4<\/td>\n<td>Hypervisor masking<\/td>\n<td>Guest shows symptoms but no host NMI<\/td>\n<td>Hypervisor swallowed NMI<\/td>\n<td>Reconfigure hypervisor forwarding<\/td>\n<td>Host vs guest discrepancy<\/td>\n<\/tr>\n<tr>\n<td>F5<\/td>\n<td>Partial diagnostics<\/td>\n<td>Incomplete stack traces<\/td>\n<td>Interrupt at early boot or corrupted memory<\/td>\n<td>Out-of-band logs via BMC<\/td>\n<td>Truncated logs in kernel<\/td>\n<\/tr>\n<tr>\n<td>F6<\/td>\n<td>Alert storms<\/td>\n<td>Many pages for same root cause<\/td>\n<td>Lack of dedupe\/correlation<\/td>\n<td>Deduplication and grouping<\/td>\n<td>High alert frequency<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>F1: Ensure persistent storage of 
ring buffer and out-of-band log copies; use crashkernel reservation where supported.<\/li>\n<li>F2: Collect hardware vendor logs and correlate with firmware releases; reproduce with hardware swap.<\/li>\n<li>F3: Harden NMI handler to write minimal state atomically and call management controller early.<\/li>\n<li>F4: Check hypervisor settings for NMI passthrough; test with host-level diagnostics.<\/li>\n<li>F5: Use IPMI or BMC to fetch system event logs that persist across reboots.<\/li>\n<li>F6: Implement event correlation rules in alerting system and use rate-limits for paging.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Key Concepts, Keywords &amp; Terminology for NMI<\/h2>\n\n\n\n<p>Term \u2014 1\u20132 line definition \u2014 why it matters \u2014 common pitfall<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Non-Maskable Interrupt \u2014 Hardware interrupt that cannot be masked \u2014 Signals critical conditions \u2014 Mistaking for normal IRQ  <\/li>\n<li>NMI watchdog \u2014 Kernel feature to detect stuck CPUs \u2014 Helps detect hangs \u2014 False positives on heavy GC workloads  <\/li>\n<li>Machine Check Exception (MCE) \u2014 CPU-detected hardware error \u2014 Points to memory\/CPU faults \u2014 Ignoring vendor MCE logs  <\/li>\n<li>SMI \u2014 System Management Interrupt running in firmware \u2014 Can mask OS handling \u2014 Confused with NMI behavior  <\/li>\n<li>BMC \u2014 Baseboard Management Controller \u2014 Out-of-band logs and power control \u2014 Not all providers expose BMC to tenants  <\/li>\n<li>IPMI \u2014 Interface for managing BMC \u2014 Useful for fetching hardware logs \u2014 Security risk if not secured  <\/li>\n<li>dmesg \u2014 Kernel ring buffer output \u2014 Primary local diagnostic source \u2014 Overwrites quickly without persistence  <\/li>\n<li>oops \u2014 Kernel stack trace after fault \u2014 Essential for debugging \u2014 Misinterpreting stack levels  <\/li>\n<li>crashkernel \u2014 
Kernel reservation for kdump \u2014 Enables capture during panic \u2014 Not always configured by default  <\/li>\n<li>kdump \u2014 Kernel crash dump mechanism \u2014 Stores memory images for postmortem \u2014 Large disk needs and complexity  <\/li>\n<li>PCIe AER \u2014 PCI Express Advanced Error Reporting \u2014 Can cause NMIs for device errors \u2014 Misconfigured drivers suppress reports  <\/li>\n<li>Thermal trip \u2014 Overheat event causing emergency interrupt \u2014 Prevents hardware damage \u2014 Lack of telemetry for gradual heating  <\/li>\n<li>Watchdog timer \u2014 Timer that triggers recovery if system appears stuck \u2014 Can generate NMI \u2014 Overzealous timeouts lead to reboots  <\/li>\n<li>CPU hang \u2014 CPU stops making progress \u2014 Detectable by NMI watchdog \u2014 Hard to reproduce in dev environments  <\/li>\n<li>Firmware \/ BIOS \u2014 Low-level platform code \u2014 Controls NMI routing \u2014 Vendor-specific behavior varies  <\/li>\n<li>Hypervisor passthrough \u2014 Forwarding hardware interrupts to guests \u2014 Enables guest visibility \u2014 Can hide host-level problems  <\/li>\n<li>Nested virtualization \u2014 VM inside VM \u2014 NMIs may be emulated \u2014 Complexity in debug trails  <\/li>\n<li>Kernel panic \u2014 OS-level unrecoverable state \u2014 May be triggered by NMI actions \u2014 Panics should be captured via kdump  <\/li>\n<li>Ring buffer \u2014 Circular log buffer in kernel \u2014 Holds recent messages \u2014 Overwrites with bursty logs  <\/li>\n<li>ECC memory \u2014 Error-correcting RAM \u2014 Reduces MCEs \u2014 Not all platforms use ECC  <\/li>\n<li>DIMM failure \u2014 Memory module hardware fault \u2014 Often appears as MCEs \u2014 Replacement required, not software fix  <\/li>\n<li>Watchdog reset \u2014 Forced reboot from watchdog \u2014 Helps recover but may lose in-flight work \u2014 Need graceful drain before hard reset  <\/li>\n<li>OOB management \u2014 Out-of-band control via BMC\/IPMI \u2014 Critical when OS is 
unresponsive \u2014 Limited in fully managed clouds  <\/li>\n<li>Firmware log \u2014 Persistent log stored by firmware\/BMC \u2014 Survives reboots \u2014 Requires vendor tools to interpret  <\/li>\n<li>Event correlation \u2014 Linking NMI with other telemetry \u2014 Essential for accurate triage \u2014 Lack of correlation causes noisy paging  <\/li>\n<li>Panic kernel dump \u2014 Capture on panic for postmortem \u2014 Enables deep analysis \u2014 Large storage and retrieval complexity  <\/li>\n<li>Telemetry agent \u2014 Host agent that ships logs\/metrics \u2014 Collects NMI indicators \u2014 Can be unavailable during host hang  <\/li>\n<li>Kernel oops decode \u2014 Interpreting oops stack and registers \u2014 Critical for root cause \u2014 Often requires symbolized kernels  <\/li>\n<li>Symbolized stack \u2014 Human-readable stack using kernel symbols \u2014 Needed for interpretation \u2014 Requires matching build IDs  <\/li>\n<li>Firmware rollback \u2014 Reverting to older firmware to mitigate regressions \u2014 Useful for spurious NMIs \u2014 Needs controlled rollout  <\/li>\n<li>Safe-mode boot \u2014 Boot mode with minimal drivers \u2014 Helps reproduce early-boot NMIs \u2014 Not always feasible in production  <\/li>\n<li>Panic_on_oops \u2014 Kernel setting to panic on oops \u2014 Ensures kdump capture \u2014 Can reduce system availability if misused  <\/li>\n<li>Watchdog threshold \u2014 Timeout value for watchdog \u2014 Balances detection vs false positives \u2014 Too short causes unnecessary reboots  <\/li>\n<li>Node cordon \u2014 Mark node unschedulable in orchestration \u2014 Prevents placing workloads on unstable node \u2014 Needs automation to avoid human delay  <\/li>\n<li>Auto-replace \u2014 Automated decommission and replacement after NMI \u2014 Reduces mean time to repair \u2014 Risk of replacing healthy nodes if false positive  <\/li>\n<li>JVM safepoint \u2014 Java pause indicating thread stops \u2014 Can look like CPU hang \u2014 JVM-induced pauses 
can trigger NMI misdiagnosis  <\/li>\n<li>Interrupt storm \u2014 Flood of interrupts from device \u2014 Can starve CPU and cause watchdogs to fire \u2014 Mitigate with driver fixes  <\/li>\n<li>Vendor support ticketing \u2014 Escalation for hardware NMIs \u2014 Required for warranty parts \u2014 Slow if undiagnosed internally  <\/li>\n<li>Fleet analytics \u2014 Aggregate NMI events across hosts \u2014 Useful for root-cause across fleet \u2014 Requires centralized ingestion and retention  <\/li>\n<li>Recovery policy \u2014 What to do when NMI occurs (reboot, replace, escalate) \u2014 Ensures consistent response \u2014 Poor policy causes inconsistent handling<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How to Measure NMI (Metrics, SLIs, SLOs)<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Metric\/SLI<\/th>\n<th>What it tells you<\/th>\n<th>How to measure<\/th>\n<th>Starting target<\/th>\n<th>Gotchas<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>M1<\/td>\n<td>NMI count per host<\/td>\n<td>Frequency of NMIs on a host<\/td>\n<td>Count NMI events in logs per host per day<\/td>\n<td>0\u20130.01\/day<\/td>\n<td>Sparse events may indicate underreporting<\/td>\n<\/tr>\n<tr>\n<td>M2<\/td>\n<td>Rate of hosts with NMIs<\/td>\n<td>Fraction of fleet with NMIs<\/td>\n<td>Unique hosts with NMI \/ total hosts<\/td>\n<td>&lt;0.5% weekly<\/td>\n<td>Bursts skew weekly metrics<\/td>\n<\/tr>\n<tr>\n<td>M3<\/td>\n<td>Time-to-detect NMI<\/td>\n<td>Delay from NMI to central ingest<\/td>\n<td>Compare the event timestamp in the log with the ingest timestamp<\/td>\n<td>&lt;60s<\/td>\n<td>Network\/agent outage increases delay<\/td>\n<\/tr>\n<tr>\n<td>M4<\/td>\n<td>NMI-induced reboots<\/td>\n<td>Reboots caused by NMI<\/td>\n<td>Correlate reboot reason in BMC\/logs<\/td>\n<td>&lt;0.1% monthly<\/td>\n<td>Mislabelled reboot reasons<\/td>\n<\/tr>\n<tr>\n<td>M5<\/td>\n<td>Mean time to remediation (MTTR)<\/td>\n<td>Time to replace 
or fix host after NMI<\/td>\n<td>From alert to resolved incident<\/td>\n<td>&lt;2 hours<\/td>\n<td>Human-led processes vary<\/td>\n<\/tr>\n<tr>\n<td>M6<\/td>\n<td>NMI correlation rate<\/td>\n<td>% of NMIs with an identified root cause<\/td>\n<td>NMIs with an identified root cause \/ total NMIs<\/td>\n<td>&gt;80%<\/td>\n<td>Complex hardware issues may be unresolved<\/td>\n<\/tr>\n<tr>\n<td>M7<\/td>\n<td>Actionable alert ratio<\/td>\n<td>Share of NMI alerts that were actionable<\/td>\n<td>Actionable alerts \/ total alerts<\/td>\n<td>&gt;70% actionable<\/td>\n<td>Poor dedupe inflates noise<\/td>\n<\/tr>\n<tr>\n<td>M8<\/td>\n<td>NMI stack capture rate<\/td>\n<td>% of NMIs with captured stack traces<\/td>\n<td>Successful stack dumps \/ NMIs<\/td>\n<td>&gt;90%<\/td>\n<td>Early boot NMIs may miss captures<\/td>\n<\/tr>\n<tr>\n<td>M9<\/td>\n<td>Recurrent NMI hosts<\/td>\n<td>Hosts with &gt;1 NMI in window<\/td>\n<td>Hosts with repeated NMIs in 30 days<\/td>\n<td>0 for single-tenant hosts<\/td>\n<td>Intermittent hardware faults cause repeats<\/td>\n<\/tr>\n<tr>\n<td>M10<\/td>\n<td>Cost per NMI incident<\/td>\n<td>Operational cost to remediate<\/td>\n<td>Sum of staff hours + host replacement<\/td>\n<td>Varies by environment<\/td>\n<td>Hard to attribute costs precisely<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>M1: Ensure the agent reliably writes an NMI marker and preserves event timestamps.<\/li>\n<li>M3: Use agent heartbeat to validate ingest pipeline latency.<\/li>\n<li>M4: Use BMC and cloud provider metadata to identify reboot cause.<\/li>\n<li>M8: Configure crashkernel and kdump; validate in staging.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Best tools to measure NMI<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Prometheus + exporters<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for NMI: Metrics like host NMI counts, node-exporter custom metrics, 
uptime and reboots.<\/li>\n<li>Best-fit environment: Kubernetes and bare-metal with Prometheus stacks.<\/li>\n<li>Setup outline:<\/li>\n<li>Expose custom metric from host agent on NMI events.<\/li>\n<li>Scrape via Prometheus node-exporter or custom exporter.<\/li>\n<li>Record rules to compute rates and SLOs.<\/li>\n<li>Use alerts to feed PagerDuty.<\/li>\n<li>Strengths:<\/li>\n<li>Flexible query language (PromQL).<\/li>\n<li>Good for SRE-run monitoring.<\/li>\n<li>Limitations:<\/li>\n<li>Requires reliable agent during host hangs.<\/li>\n<li>Not a full log solution.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Fluentd\/Fluent Bit + Central Logs<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for NMI: Collects dmesg, kernel oops, and BMC logs.<\/li>\n<li>Best-fit environment: Fleet with centralized log storage.<\/li>\n<li>Setup outline:<\/li>\n<li>Configure to tail kernel logs and IPMI outputs.<\/li>\n<li>Tag and enrich events with host metadata.<\/li>\n<li>Send to central store with retention policy.<\/li>\n<li>Strengths:<\/li>\n<li>Rich context for postmortem.<\/li>\n<li>Pipeline for long-term analysis.<\/li>\n<li>Limitations:<\/li>\n<li>May miss logs if host fully unresponsive.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 BMC\/IPMI tooling<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for NMI: Persistent firmware logs and system event logs.<\/li>\n<li>Best-fit environment: Bare metal with out-of-band management.<\/li>\n<li>Setup outline:<\/li>\n<li>Inventory BMC credentials and secure access.<\/li>\n<li>Poll SEL (System Event Log) at interval.<\/li>\n<li>Correlate with host logs on events.<\/li>\n<li>Strengths:<\/li>\n<li>Survives OS reboots.<\/li>\n<li>Authoritative hardware source.<\/li>\n<li>Limitations:<\/li>\n<li>Access restricted in managed clouds.<\/li>\n<li>Security risks if exposed.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Cloud provider host health 
events<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for NMI: Provider-detected host hardware faults and maintenance events.<\/li>\n<li>Best-fit environment: Managed VMs and managed Kubernetes.<\/li>\n<li>Setup outline:<\/li>\n<li>Subscribe to provider health events.<\/li>\n<li>Map provider event codes to internal incident categories.<\/li>\n<li>Automate replacement where provider signals host degradation.<\/li>\n<li>Strengths:<\/li>\n<li>No host agent dependency.<\/li>\n<li>Limitations:<\/li>\n<li>Abstracted details; limited diagnostic depth.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Fleet analytics \/ SIEM<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for NMI: Correlation across hosts, trend detection.<\/li>\n<li>Best-fit environment: Large fleets with many hosts.<\/li>\n<li>Setup outline:<\/li>\n<li>Ingest NMI events and enrich with hardware metadata.<\/li>\n<li>Run clustering and anomaly detection.<\/li>\n<li>Surface recurrent failure families for remediation.<\/li>\n<li>Strengths:<\/li>\n<li>Detects systemic issues early.<\/li>\n<li>Limitations:<\/li>\n<li>Requires sustained investment in data pipelines.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Recommended dashboards &amp; alerts for NMI<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Executive dashboard  <\/li>\n<li>Panels: Fleet NMI rate (trend), Hosts affected (count), Open NMI incidents, Business impact summary.  <\/li>\n<li>\n<p>Why: High-level view for stakeholders to see hardware health and operational risk.<\/p>\n<\/li>\n<li>\n<p>On-call dashboard  <\/p>\n<\/li>\n<li>Panels: Live NMI event stream, Hosts with recent NMIs, Reboot reasons, Recent kdump captures, Correlated application outages.  <\/li>\n<li>\n<p>Why: Focused triage surface for responders.<\/p>\n<\/li>\n<li>\n<p>Debug dashboard  <\/p>\n<\/li>\n<li>Panels: Per-host dmesg view, MCE logs, CPU usage pre-NMI, Interrupt counts, Network and disk metrics, BMC SEL entries.  
<\/li>\n<li>Why: Deep diagnostic view for engineers performing root-cause analysis.<\/li>\n<\/ul>\n\n\n\n<p>Alerting guidance:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What should page vs ticket  <\/li>\n<li>Page: Single or recurrent NMI on production controller\/critical host impacting availability.  <\/li>\n<li>\n<p>Ticket: Isolated NMI on non-critical dev host or informational BMC event with no service impact.<\/p>\n<\/li>\n<li>\n<p>Burn-rate guidance (if applicable)  <\/p>\n<\/li>\n<li>\n<p>If NMI-driven host failures consume &gt;25% of infrastructure error budget in a week, escalate to a reliability strike team.<\/p>\n<\/li>\n<li>\n<p>Noise reduction tactics (dedupe, grouping, suppression)  <\/p>\n<\/li>\n<li>Deduplicate by root cause fingerprint and host group.  <\/li>\n<li>Group alerts by cluster and time window (e.g., suppress repeat pages for same host within 10 minutes).  <\/li>\n<li>Use suppression windows during platform maintenance.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation Guide (Step-by-step)<\/h2>\n\n\n\n<p>1) Prerequisites<br\/>\n   &#8211; Inventory of hardware and virtualization mappings.<br\/>\n   &#8211; Access to BMC\/IPMI or provider host health events.<br\/>\n   &#8211; Central logging and metrics pipeline.<br\/>\n   &#8211; Defined recovery policy (cordon\/replace vs repair).<br\/>\n2) Instrumentation plan<br\/>\n   &#8211; Enable kernel NMI watchdog where appropriate.<br\/>\n   &#8211; Configure crashkernel and kdump for dumps.<br\/>\n   &#8211; Add host-agent hooks to mark NMI events as metrics and log them.<br\/>\n3) Data collection<br\/>\n   &#8211; Collect dmesg, kernel oops, MCE logs, and BMC SEL entries.<br\/>\n   &#8211; Persist to central logs with retention and indexing.<br\/>\n4) SLO design<br\/>\n   &#8211; Define host-health SLOs incorporating NMI frequency and recovery.<br\/>\n   &#8211; Set error budgets for host-induced incidents.<br\/>\n5) 
Dashboards<br\/>\n   &#8211; Build executive, on-call, and debug dashboards.<br\/>\n   &#8211; Provide drilldowns from fleet to single-host views.<br\/>\n6) Alerts &amp; routing<br\/>\n   &#8211; Set thresholds for paging vs ticketing.<br\/>\n   &#8211; Route to hardware on-call, cloud ops, and relevant service owners.<br\/>\n7) Runbooks &amp; automation<br\/>\n   &#8211; Document step-by-step triage for NMI events.<br\/>\n   &#8211; Automate cordon, drain, and replacement actions where safe.<br\/>\n8) Validation (load\/chaos\/game days)<br\/>\n   &#8211; Run controlled tests that simulate stuck CPUs and verify detection &amp; remediation.<br\/>\n   &#8211; Include firmware-update scenarios and BMC failure modes.<br\/>\n9) Continuous improvement<br\/>\n   &#8211; Review NMI trends, firmware rollouts, and vendor replacements monthly.<br\/>\n   &#8211; Feed RCA findings back into host provisioning workflows.<\/p>\n\n\n\n<p>Checklists:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Pre-production checklist  <\/li>\n<li>Verify crashkernel\/kdump is enabled.  <\/li>\n<li>Test agent log shipping under simulated hang.  <\/li>\n<li>Validate BMC access and SEL retrieval.  <\/li>\n<li>\n<p>Define recovery policy for hosts in the test environment.<\/p>\n<\/li>\n<li>\n<p>Production readiness checklist  <\/p>\n<\/li>\n<li>Baseline host NMI metrics and thresholds.  <\/li>\n<li>Alerting rules with dedupe and grouping.  <\/li>\n<li>Runbook available and indexed in incident portal.  <\/li>\n<li>\n<p>Automation for safe cordon\/replacement validated.<\/p>\n<\/li>\n<li>\n<p>Incident checklist specific to NMI  <\/p>\n<\/li>\n<li>Capture timestamp and host identifiers.  <\/li>\n<li>Retrieve BMC SEL and kernel logs.  <\/li>\n<li>Correlate with other telemetry (network, storage).  <\/li>\n<li>Decide repair vs replace based on recurrence and vendor guidance.  
<\/li>\n<li>Record mitigation steps and update fleet actions.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Use Cases of NMI<\/h2>\n\n\n\n<p>The following use cases show where NMI signals are most useful:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p>Bare-metal compute stability<br\/>\n   &#8211; Context: High-density hosting on bare metal.<br\/>\n   &#8211; Problem: Occasional host hangs impacting tenant VMs.<br\/>\n   &#8211; Why NMI helps: Detects CPU stalls and forces diagnostic capture.<br\/>\n   &#8211; What to measure: NMI count, reboots, kdump capture rate.<br\/>\n   &#8211; Typical tools: BMC, Prometheus, Fluentd.<\/p>\n<\/li>\n<li>\n<p>Kubernetes node reliability<br\/>\n   &#8211; Context: Node-level kernel issues causing pods to freeze.<br\/>\n   &#8211; Problem: kubelet stops responding, disrupting stateful services.<br\/>\n   &#8211; Why NMI helps: Detects node-level hangs to trigger cordon\/replace.<br\/>\n   &#8211; What to measure: Node NMI events, node-exporter health.<br\/>\n   &#8211; Typical tools: Prometheus, a node-remediation controller or operator.<\/p>\n<\/li>\n<li>\n<p>High-performance computing clusters<br\/>\n   &#8211; Context: Long-running compute workloads sensitive to host faults.<br\/>\n   &#8211; Problem: Undetected CPU or memory faults leading to silent data corruption.<br\/>\n   &#8211; Why NMI helps: NMI-signalled hardware errors, reviewed alongside Machine Check Exceptions, surface corruption that would otherwise stay silent.<br\/>\n   &#8211; What to measure: MCE rates, NMI per job.<br\/>\n   &#8211; Typical tools: MCE logs, fleet analytics.<\/p>\n<\/li>\n<li>\n<p>Network device faults on servers<br\/>\n   &#8211; Context: NIC causing packet loss intermittently.<br\/>\n   &#8211; Problem: Device or driver faults surface as serious PCI errors that raise NMIs.<br\/>\n   &#8211; Why NMI helps: Immediate diagnostics and possible device isolation.<br\/>\n   &#8211; What to measure: PCIe AER incidents, NIC resets.<br\/>\n   &#8211; Typical tools: ethtool, kernel logs.<\/p>\n<\/li>\n<li>\n<p>Firmware regression 
detection<br\/>\n   &#8211; Context: Fleet firmware upgrades performed regularly.<br\/>\n   &#8211; Problem: New firmware triggers spurious NMIs on a class of boards.<br\/>\n   &#8211; Why NMI helps: Early detection to rollback firmware.<br\/>\n   &#8211; What to measure: NMI spike per firmware version.<br\/>\n   &#8211; Typical tools: Fleet analytics, firmware inventory.<\/p>\n<\/li>\n<li>\n<p>Managed cloud incident correlation<br\/>\n   &#8211; Context: Provider host degradation impacts VMs.<br\/>\n   &#8211; Problem: Provider signals host health but tenant lacks low-level logs.<br\/>\n   &#8211; Why NMI helps: When available via host health events, informs remediation.<br\/>\n   &#8211; What to measure: Provider host events vs tenant NMI symptoms.<br\/>\n   &#8211; Typical tools: Provider event streams, internal incident systems.<\/p>\n<\/li>\n<li>\n<p>Storage controller failures<br\/>\n   &#8211; Context: Local disks attached to hosts experiencing errors.<br\/>\n   &#8211; Problem: Storage IO lockups affecting VMs.<br\/>\n   &#8211; Why NMI helps: PCIe\/device NMIs indicate severe controller faults.<br\/>\n   &#8211; What to measure: Device NMIs, IO latency before event.<br\/>\n   &#8211; Typical tools: Smartctl, device driver logs.<\/p>\n<\/li>\n<li>\n<p>Safety-critical applications (telecom\/finance)<br\/>\n   &#8211; Context: Applications requiring durable correctness.<br\/>\n   &#8211; Problem: Silent hardware errors can corrupt transactions.<br\/>\n   &#8211; Why NMI helps: Ensures hardware error visibility and forced remediation.<br\/>\n   &#8211; What to measure: ECC and MCE logs correlated with NMIs.<br\/>\n   &#8211; Typical tools: ECC telemetry, MCE parsers.<\/p>\n<\/li>\n<li>\n<p>On-prem colo with limited redundancy<br\/>\n   &#8211; Context: Single-host services without cloud redundancy.<br\/>\n   &#8211; Problem: Host hangs cause prolonged service outages.<br\/>\n   &#8211; Why NMI helps: Automate detection and BMC-initiated recovery to reduce 
downtime.<br\/>\n   &#8211; What to measure: Time-to-reboot and service availability post-NMI.<br\/>\n   &#8211; Typical tools: IPMI, automated runbooks.<\/p>\n<\/li>\n<li>\n<p>Regression testing for OS updates  <\/p>\n<ul>\n<li>Context: Rolling kernel updates can affect NMI handling.  <\/li>\n<li>Problem: Kernel changes cause different NMI behavior.  <\/li>\n<li>Why NMI helps: Detect regressions in test labs before rollout.  <\/li>\n<li>What to measure: NMI occurrence in canary hosts.  <\/li>\n<li>Typical tools: CI labs, kdump validations.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Scenario Examples (Realistic, End-to-End)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #1 \u2014 Kubernetes node hang due to kernel deadlock (Kubernetes scenario)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Production Kubernetes cluster with stateful services experiencing pod stalls.<br\/>\n<strong>Goal:<\/strong> Detect node hangs quickly and automate replacement with minimal app impact.<br\/>\n<strong>Why NMI matters here:<\/strong> The NMI watchdog can detect kernel stalls that kubelet and liveness probes cannot report, because they depend on the stalled kernel themselves.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Host-level NMI watchdog -&gt; Kernel handler captures an oops or backtrace -&gt; Host agent (or out-of-band collector) sends the event to central logs and the node is flagged unhealthy -&gt; Controller automation cordons node and drains workloads -&gt; Replace host.<br\/>\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Enable kernel NMI watchdog on node images.  <\/li>\n<li>Configure crashkernel and kdump to store dumps to network-backed storage.  <\/li>\n<li>Add agent metric for &#8220;nmi_event_total&#8221; and ship to Prometheus.  <\/li>\n<li>Create alert rule: any increase in nmi_event_total on a prod node (not simply a value &gt; 0, which stays true once the counter has incremented) -&gt; page the hardware team.  
<\/li>\n<li>Automate cordon+drain when NMI occurs via operator with safety checks.<br\/>\n<strong>What to measure:<\/strong> NMI events, time from NMI to node cordon, pod eviction success rate.<br\/>\n<strong>Tools to use and why:<\/strong> Prometheus for metrics, Fluent Bit for logs, a cordon\/drain operator for automation.<br\/>\n<strong>Common pitfalls:<\/strong> Watchdog false positives under heavy load; tune the watchdog threshold per node class before enabling paging.<br\/>\n<strong>Validation:<\/strong> Run simulated CPU hang tests in canary; verify cordon and replacement.<br\/>\n<strong>Outcome:<\/strong> Faster detection and reduced downtime for affected services.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #2 \u2014 Serverless function host hardware fault (Serverless\/managed-PaaS scenario)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Managed serverless provider where tenant functions occasionally cold-fail without clear app errors.<br\/>\n<strong>Goal:<\/strong> Correlate host-level faults with function failures and reduce function error rates.<br\/>\n<strong>Why NMI matters here:<\/strong> Underlying host hardware faults manifest as function invocation failures; NMIs indicate host-level issues needing provider action.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Provider collects host events -&gt; NMIs associated with host IDs -&gt; Functions on same host flagged and invocations redirected -&gt; Provider performs host replacement.<br\/>\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Ensure provider exposes host health events or internal BMC logs.  <\/li>\n<li>Correlate function invocation failures with host event timestamps.  
<\/li>\n<li>Blacklist affected hosts in routing pool until resolved.<br\/>\n<strong>What to measure:<\/strong> Host NMI events, function error rate per host, blacklisted-host count.<br\/>\n<strong>Tools to use and why:<\/strong> Provider internal telemetry, central event bus.<br\/>\n<strong>Common pitfalls:<\/strong> Tenant cannot access host-level logs in public clouds; rely on provider channels.<br\/>\n<strong>Validation:<\/strong> Inject a simulated kernel hang in a non-prod region to verify routing changes.<br\/>\n<strong>Outcome:<\/strong> Reduced invocation failures and faster provider remediation.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #3 \u2014 Postmortem after recurrent NMIs (Incident-response\/postmortem scenario)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Fleet shows recurring NMIs on a set of blades after firmware rollout.<br\/>\n<strong>Goal:<\/strong> Root-cause analysis and rollback plan.<br\/>\n<strong>Why NMI matters here:<\/strong> Pattern of NMIs indicates regression introduced by firmware.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Collect NMI events, firmware version mapping, BMC SEL, and service impact metrics -&gt; RCA -&gt; Firmware rollback plan and automated replacement.<br\/>\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Aggregate NMIs across hosts and group by firmware version.  <\/li>\n<li>Validate with vendor logs and affected hardware SKU.  
<\/li>\n<li>Rollback firmware on canary subset, observe reduction in NMIs, then do controlled rollback.<br\/>\n<strong>What to measure:<\/strong> NMI rate pre\/post rollback, service error budget consumption.<br\/>\n<strong>Tools to use and why:<\/strong> Fleet analytics, vendor firmware tooling.<br\/>\n<strong>Common pitfalls:<\/strong> Incomplete metadata linking firmware to host; ensure accurate inventory.<br\/>\n<strong>Validation:<\/strong> Canary rollback and monitoring for recurrence.<br\/>\n<strong>Outcome:<\/strong> Restored fleet stability and updated firmware rollout policy.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #4 \u2014 Cost vs performance tradeoff: aggressive watchdogs (Cost\/performance trade-off scenario)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> High-performance trading application prioritizing latency; ops must balance detection with performance.<br\/>\n<strong>Goal:<\/strong> Tune watchdog sensitivity to avoid performance regressions while detecting real hangs.<br\/>\n<strong>Why NMI matters here:<\/strong> Aggressive watchdogs may cause reboots during short GC or high-CPU bursts; too lax and hangs go undetected.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Parameterize watchdog timeout per instance class; observe impact on latency and NMI rate.<br\/>\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Baseline latency and CPU profiles.  <\/li>\n<li>Enable NMI watchdog with conservative timeout on canary nodes.  <\/li>\n<li>Adjust timeouts and correlate NMI rate vs latency impact.  
<\/li>\n<li>Choose per-instance class policy (e.g., lower sensitivity on ultra-low-latency instances).<br\/>\n<strong>What to measure:<\/strong> Latency SLOs, NMI events, false reboot rate.<br\/>\n<strong>Tools to use and why:<\/strong> Prometheus, performance test harness.<br\/>\n<strong>Common pitfalls:<\/strong> Over-generalizing timeouts across different workloads; tune per class.<br\/>\n<strong>Validation:<\/strong> Load tests and targeted chaos testing.<br\/>\n<strong>Outcome:<\/strong> Balanced detection while preserving latency goals.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #5 \u2014 Nested virtualization debug (Kubernetes\/hypervisor hybrid)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> VM running Kubernetes, guest experiences fatal hangs; unclear if guest or host triggered.<br\/>\n<strong>Goal:<\/strong> Determine whether NMIs originate in host or guest and fix accordingly.<br\/>\n<strong>Why NMI matters here:<\/strong> NMIs may be swallowed or emulated by hypervisor leading to ambiguity.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Instrument both host and guest with NMI counters, enable hypervisor passthrough settings, compare logs.<br\/>\n<strong>Step-by-step implementation:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Enable guest-visible NMI metrics.  <\/li>\n<li>Correlate guest NMI events with host BMC SEL entries.  
<\/li>\n<li>If host-origin, escalate to hardware replacement; if guest-origin, patch guest kernel.<br\/>\n<strong>What to measure:<\/strong> Host vs guest NMI event correlation rate.<br\/>\n<strong>Tools to use and why:<\/strong> QEMU\/KVM logs, host BMC logs, guest kernel logs.<br\/>\n<strong>Common pitfalls:<\/strong> Hypervisors may not forward NMIs to guests by default; enable forwarding explicitly for accurate diagnosis.<br\/>\n<strong>Validation:<\/strong> Reproduce on nested testbed.<br\/>\n<strong>Outcome:<\/strong> Correct ownership and remediation path identified.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Common Mistakes, Anti-patterns, and Troubleshooting<\/h2>\n\n\n\n<p>Each entry follows the pattern:\nSymptom -&gt; Root cause -&gt; Fix<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Symptom: No NMI logs after host crash -&gt; Root cause: crashkernel not configured -&gt; Fix: Reserve crashkernel and enable kdump.  <\/li>\n<li>Symptom: Frequent NMI alerts -&gt; Root cause: firmware regression -&gt; Fix: Roll back firmware and coordinate a vendor patch.  <\/li>\n<li>Symptom: NMI event but no service impact -&gt; Root cause: Spurious device error -&gt; Fix: Isolate the device, monitor it, and request vendor testing.  <\/li>\n<li>Symptom: Guest shows hang, no host NMI -&gt; Root cause: Hypervisor masking -&gt; Fix: Enable NMI passthrough or increase host diagnostics.  <\/li>\n<li>Symptom: Alert storm during maintenance -&gt; Root cause: No suppression rules -&gt; Fix: Implement maintenance window suppression and dedupe.  <\/li>\n<li>Symptom: Missing kernel stacks -&gt; Root cause: Handler did too much and crashed -&gt; Fix: Simplify handler and ensure atomic minimal logging.  <\/li>\n<li>Symptom: False positives in high-CPU workloads -&gt; Root cause: Watchdog timeout too short -&gt; Fix: Tune thresholds per workload.  
<\/li>\n<li>Symptom: Unable to access BMC -&gt; Root cause: Network\/cred issues -&gt; Fix: Secure and document BMC access and rotate creds.  <\/li>\n<li>Symptom: Recurrent host replacement without fix -&gt; Root cause: Not capturing full diagnostics -&gt; Fix: Ensure persistent firmware logs and kdump.  <\/li>\n<li>Symptom: Slow detection of NMI -&gt; Root cause: Agent ingest lag -&gt; Fix: Prioritize NMI pipeline and heartbeat checks.  <\/li>\n<li>Symptom: Lost correlation between NMI and app outages -&gt; Root cause: Poor timestamp sync -&gt; Fix: Ensure NTP\/PTP across hosts and BMCs.  <\/li>\n<li>Symptom: High MTTR -&gt; Root cause: No automation for cordon\/replace -&gt; Fix: Implement safe automation runbooks.  <\/li>\n<li>Symptom: Overloaded alerting system -&gt; Root cause: Every NMI pages SRE -&gt; Fix: Tier alerts and route to hardware team first.  <\/li>\n<li>Symptom: Partial SEL entries -&gt; Root cause: BMC buffer overflow -&gt; Fix: Increase SEL polling frequency.  <\/li>\n<li>Symptom: Confusing SMI vs NMI behavior -&gt; Root cause: Lack of documentation -&gt; Fix: Create vendor-specific runbooks clarifying differences.  <\/li>\n<li>Symptom: Incomplete kdump captures -&gt; Root cause: Insufficient reserved memory -&gt; Fix: Increase crashkernel reservation.  <\/li>\n<li>Symptom: Missing symbolized stacks -&gt; Root cause: Kernel versions mismatched -&gt; Fix: Keep kernel build and symbol store synchronized.  <\/li>\n<li>Symptom: Noise during firmware maintenance -&gt; Root cause: Not excluding canary hosts -&gt; Fix: Use canary cohort and staged rollout.  <\/li>\n<li>Symptom: Data loss after watchdog reboot -&gt; Root cause: Hard reset without graceful drain -&gt; Fix: Graceful drain automation before reboot.  <\/li>\n<li>Symptom: Observability agent fails during hang -&gt; Root cause: Agent runs in same kernel context -&gt; Fix: Use out-of-band collection or pre-persisted logs.  
<\/li>\n<li>Symptom: Misattributed billing events -&gt; Root cause: Reboots are counted as VM migrations -&gt; Fix: Tag incidents and correlate with billing pipeline.  <\/li>\n<li>Symptom: Excessive human toil -&gt; Root cause: Manual runbook steps -&gt; Fix: Automate routine remediation tasks.  <\/li>\n<li>Symptom: Vendor denies hardware fault -&gt; Root cause: Insufficient diagnostic evidence -&gt; Fix: Collect MCE, SEL, and crash dumps before replacement.<\/li>\n<\/ol>\n\n\n\n<p>Observability pitfalls:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>PITFALL: Relying solely on dmesg -&gt; CAUSE: The kernel ring buffer is volatile and lost on reboot -&gt; FIX: Persist to central logs and capture frequently.  <\/li>\n<li>PITFALL: Not syncing clocks -&gt; CAUSE: Misaligned timestamps -&gt; FIX: NTP\/PTP across host and BMC.  <\/li>\n<li>PITFALL: Missing correlation with higher-level metrics -&gt; CAUSE: Siloed telemetry -&gt; FIX: Integrate event bus and correlate traces\/metrics\/logs.  <\/li>\n<li>PITFALL: Agent unavailable during hang -&gt; CAUSE: The agent depends on the hung kernel to run -&gt; FIX: Out-of-band logging via BMC.  <\/li>\n<li>PITFALL: Storage blowout from raw dumps -&gt; CAUSE: Indexing every full dump -&gt; FIX: Prioritize metadata and sample full dumps.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices &amp; Operating Model<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ownership and on-call  <\/li>\n<li>\n<p>Hardware\/BMC team owns NMI handling policies; service owners own application impact. Create a separate hardware on-call rotation for urgent host-level NMIs.<\/p>\n<\/li>\n<li>\n<p>Runbooks vs playbooks  <\/p>\n<\/li>\n<li>Runbooks: Step-by-step remediation for a specific host NMI.  
<\/li>\n<li>\n<p>Playbooks: Higher-level escalation flows involving multiple teams and vendor engagement.<\/p>\n<\/li>\n<li>\n<p>Safe deployments (canary\/rollback)  <\/p>\n<\/li>\n<li>\n<p>Firmware\/kernel changes must roll through small canaries with NMI and kdump monitoring. Implement automated rollback triggers on NMI spikes.<\/p>\n<\/li>\n<li>\n<p>Toil reduction and automation  <\/p>\n<\/li>\n<li>\n<p>Automate cordon\/drain and host replacement. Use auto-replace policies for hosts with repeat NMIs and escalate to vendor only after automated triage.<\/p>\n<\/li>\n<li>\n<p>Security basics  <\/p>\n<\/li>\n<li>Secure BMC\/IPMI access with credential rotation and network isolation. Limit who can read SEL logs. Audit all actions against hardware.<\/li>\n<\/ul>\n\n\n\n<p>Recurring reviews:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly\/monthly routines  <\/li>\n<li>Weekly: Review NMI events for spikes and outstanding kdumps.  <\/li>\n<li>Monthly: Analyze fleet trends and firmware version prevalence, and identify hotspots.  
<\/li>\n<li>What to review in postmortems related to NMI  <\/li>\n<li>Confirm diagnostic artifacts preserved, validate remediation execution, check for automation coverage, and revise firmware rollout policy if needed.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Tooling &amp; Integration Map for NMI<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Category<\/th>\n<th>What it does<\/th>\n<th>Key integrations<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>I1<\/td>\n<td>Metrics<\/td>\n<td>Collects NMI counters from hosts<\/td>\n<td>Prometheus, exporters<\/td>\n<td>Use custom metric for NMIs<\/td>\n<\/tr>\n<tr>\n<td>I2<\/td>\n<td>Logs<\/td>\n<td>Aggregates dmesg and kernel oops<\/td>\n<td>Fluent Bit, ELK<\/td>\n<td>Ensure retention for postmortems<\/td>\n<\/tr>\n<tr>\n<td>I3<\/td>\n<td>BMC tooling<\/td>\n<td>Fetches SEL and firmware logs<\/td>\n<td>ipmitool, vendor tools<\/td>\n<td>Out-of-band source for diagnostics<\/td>\n<\/tr>\n<tr>\n<td>I4<\/td>\n<td>Crash capture<\/td>\n<td>Captures kdump images<\/td>\n<td>kdump, crash utilities<\/td>\n<td>Requires reserved memory<\/td>\n<\/tr>\n<tr>\n<td>I5<\/td>\n<td>Fleet analytics<\/td>\n<td>Correlates events across hosts<\/td>\n<td>SIEM, data warehouse<\/td>\n<td>Good for systemic trends<\/td>\n<\/tr>\n<tr>\n<td>I6<\/td>\n<td>Alerting<\/td>\n<td>Pages and routes incidents<\/td>\n<td>PagerDuty, Opsgenie<\/td>\n<td>Deduplication important<\/td>\n<\/tr>\n<tr>\n<td>I7<\/td>\n<td>Orchestration<\/td>\n<td>Automates cordon\/drain\/replace<\/td>\n<td>Kubernetes controllers, Terraform<\/td>\n<td>Critical for low MTTR<\/td>\n<\/tr>\n<tr>\n<td>I8<\/td>\n<td>Hypervisor<\/td>\n<td>Manages interrupt forwarding<\/td>\n<td>KVM\/QEMU, Xen<\/td>\n<td>Hypervisor settings affect NMI visibility<\/td>\n<\/tr>\n<tr>\n<td>I9<\/td>\n<td>Firmware mgmt<\/td>\n<td>Deploys and rolls back firmware<\/td>\n<td>Vendor update 
tools<\/td>\n<td>Track firmware per host<\/td>\n<\/tr>\n<tr>\n<td>I10<\/td>\n<td>Provider health<\/td>\n<td>Receives provider host events<\/td>\n<td>Cloud event bus<\/td>\n<td>Abstracted details; combine with tenant logs<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>I3: Ensure BMC credentials management integrates with secrets store and audit logs.<\/li>\n<li>I4: Validate crashkernel sizing in CI images to ensure kdump success.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions (FAQs)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What exactly triggers an NMI?<\/h3>\n\n\n\n<p>Hardware sources like watchdog timers, certain PCI errors, thermal events, and platform-reported hardware errors can trigger an NMI. On x86, CPU machine checks use their own exception vector rather than the NMI vector, but they are usually investigated alongside NMIs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can NMIs be disabled?<\/h3>\n\n\n\n<p>It depends on the platform. Some platforms and hypervisors allow configuration; disabling removes a critical safety mechanism and is generally discouraged.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Are NMIs visible inside VMs?<\/h3>\n\n\n\n<p>Sometimes. 
Hypervisor decisions determine whether NMIs are forwarded to guests, emulated, or swallowed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do NMIs differ from SMIs?<\/h3>\n\n\n\n<p>SMIs execute in firmware SMM and are generally invisible to the OS; NMIs are delivered to the CPU and handled by the OS\/kernel.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I capture useful data when an NMI happens?<\/h3>\n\n\n\n<p>Enable crashkernel\/kdump, persist kernel ring buffer to a central log, and collect BMC SEL entries.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can NMIs be the cause of data corruption?<\/h3>\n\n\n\n<p>NMIs often indicate hardware faults (like MCEs) that can lead to data corruption if not detected; NMIs themselves are signaling mechanisms.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to avoid false positives from watchdog NMIs?<\/h3>\n\n\n\n<p>Tune watchdog thresholds per workload and test under realistic load profiles.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Do cloud providers surface NMIs to tenants?<\/h3>\n\n\n\n<p>Varies \/ depends. Public cloud providers typically surface aggregate host health events rather than raw NMIs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to troubleshoot repeated NMIs on a host?<\/h3>\n\n\n\n<p>Collect kdump and SEL, compare firmware and hardware SKU, isolate with hardware swap, and engage vendor support.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is there a standard SLI for NMIs?<\/h3>\n\n\n\n<p>No universal standard; common SLI is hosts with NMI rate per period. 
Define based on fleet risk tolerance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Should NMIs page engineers?<\/h3>\n\n\n\n<p>Only if the event impacts production availability or is a repeated\/recurrent failure; otherwise log and ticket.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to test NMI handling?<\/h3>\n\n\n\n<p>Simulate hung CPUs in a controlled environment, send a diagnostic interrupt through the BMC (for example, ipmitool chassis power diag), or use vendor-provided diagnostic tools.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Will NMIs always produce stack traces?<\/h3>\n\n\n\n<p>Not always. Early-boot NMIs or severe memory corruption can prevent full stack captures.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to secure BMC\/IPMI access?<\/h3>\n\n\n\n<p>Network isolation, credential rotation, strong auth, and restricted access roles.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What to do if NMIs increase after a firmware update?<\/h3>\n\n\n\n<p>Halt rollout, roll back canary firmware, gather diagnostics, and escalate to vendor.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can software cause an NMI?<\/h3>\n\n\n\n<p>Yes, on some platforms. On x86, for example, the kernel can send an NMI to another CPU as an inter-processor interrupt through the local APIC, and performance-counter overflow can be configured to deliver NMIs. More often the link is indirect: software behavior (e.g., running too long with interrupts disabled) leads to watchdog timeouts that trigger NMIs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How long should we retain NMI-related logs?<\/h3>\n\n\n\n<p>Sufficient to perform RCA and trend analysis; typically months for fleet analytics and at least 90 days for incident traces.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Do containers see NMIs?<\/h3>\n\n\n\n<p>Containers share the host kernel; they do not directly receive NMIs but will suffer host-level impacts.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Non-Maskable Interrupts are a critical, low-level signal for hardware and platform reliability. 
For 2026 cloud-native operations, treating NMIs as part of your observability and incident response fabric is essential, especially for bare-metal fleets, high-availability services, and safety-critical workloads. Proper instrumentation, automation, and vendor collaboration turn NMIs from opaque emergencies into actionable diagnostics.<\/p>\n\n\n\n<p>Next 7 days plan:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Day 1: Inventory hosts and confirm crashkernel\/kdump settings on a canary cohort.  <\/li>\n<li>Day 2: Enable minimal NMI telemetry and ship to central logs.  <\/li>\n<li>Day 3: Create Prometheus metrics and dashboards for NMI events.  <\/li>\n<li>Day 4: Draft runbook for NMI triage and test on non-prod hosts.  <\/li>\n<li>Day 5\u20137: Run a controlled NMI simulation in staging and validate automation for cordon\/replace.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix \u2014 NMI Keyword Cluster (SEO)<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Primary keywords<\/li>\n<li>Non-Maskable Interrupt NMI<\/li>\n<li>NMI watchdog<\/li>\n<li>kernel NMI<\/li>\n<li>NMI dashboard<\/li>\n<li>\n<p>NMI monitoring<\/p>\n<\/li>\n<li>\n<p>Secondary keywords<\/p>\n<\/li>\n<li>machine check exception MCE<\/li>\n<li>kernel oops NMI<\/li>\n<li>crashkernel kdump NMI<\/li>\n<li>BMC SEL NMI<\/li>\n<li>IPMI NMI logs<\/li>\n<li>firmware NMI regression<\/li>\n<li>NMI alerting<\/li>\n<li>hypervisor NMI passthrough<\/li>\n<li>NMI troubleshooting<\/li>\n<li>\n<p>NMI runbook<\/p>\n<\/li>\n<li>\n<p>Long-tail questions<\/p>\n<\/li>\n<li>What causes a non-maskable interrupt in servers<\/li>\n<li>How to capture kdump after an NMI<\/li>\n<li>How to correlate NMI with application outages<\/li>\n<li>How to configure NMI watchdog in Linux<\/li>\n<li>Why am I getting frequent NMIs after firmware update<\/li>\n<li>How do cloud providers expose host NMIs<\/li>\n<li>How to test NMI detection and recovery<\/li>\n<li>How to 
secure BMC when collecting NMI logs<\/li>\n<li>What is the difference between SMI and NMI<\/li>\n<li>Do VMs receive NMIs from the host<\/li>\n<li>How to reduce NMI alert noise<\/li>\n<li>What to include in an NMI incident postmortem<\/li>\n<li>How to configure Prometheus for NMI events<\/li>\n<li>How to automate replacement on NMI detection<\/li>\n<li>\n<p>How to interpret machine check exception logs<\/p>\n<\/li>\n<li>\n<p>Related terminology<\/p>\n<\/li>\n<li>kernel panic<\/li>\n<li>IRQ vs NMI<\/li>\n<li>System Event Log SEL<\/li>\n<li>out-of-band management<\/li>\n<li>crash dump<\/li>\n<li>symbolized stack<\/li>\n<li>ECC memory<\/li>\n<li>DIMM failure<\/li>\n<li>PCIe AER<\/li>\n<li>watchdog timer<\/li>\n<li>BIOS\/UEFI<\/li>\n<li>SMM<\/li>\n<li>kexec<\/li>\n<li>node cordon<\/li>\n<li>auto-replace policy<\/li>\n<li>fleet analytics<\/li>\n<li>incident burn rate<\/li>\n<li>observability pipeline<\/li>\n<li>deduplication rules<\/li>\n<li>canary firmware rollout<\/li>\n<li>proactive replacement<\/li>\n<li>panic_on_oops<\/li>\n<li>kernel symbol store<\/li>\n<li>NTP\/PTP sync<\/li>\n<li>event correlation<\/li>\n<li>out-of-band logs<\/li>\n<li>vendor escalation<\/li>\n<li>hardware on-call<\/li>\n<li>platform health 
event<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&#8212;<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[375],"tags":[],"class_list":["post-2435","post","type-post","status-publish","format-standard","hentry","category-what-is-series"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/2435","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=2435"}],"version-history":[{"count":1,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/2435\/revisions"}],"predecessor-version":[{"id":3045,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/2435\/revisions\/3045"}],"wp:attachment":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=2435"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=2435"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=2435"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}