{"id":3623,"date":"2026-02-17T17:59:29","date_gmt":"2026-02-17T17:59:29","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/apache-druid\/"},"modified":"2026-02-17T17:59:29","modified_gmt":"2026-02-17T17:59:29","slug":"apache-druid","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/apache-druid\/","title":{"rendered":"What is Apache Druid? Meaning, Architecture, Examples, Use Cases, and How to Measure It (2026 Guide)"},"content":{"rendered":"\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Quick Definition (30\u201360 words)<\/h2>\n\n\n\n<p>Apache Druid is a high-performance, column-oriented, distributed data store optimized for real-time analytics and interactive OLAP queries. Analogy: Druid is like a purpose-built search engine for time-series and event analytics. Formal: A distributed, massively-parallel real-time OLAP data store with ingestion, indexing, and real-time query capabilities.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">What is Apache Druid?<\/h2>\n\n\n\n<p>Apache Druid is an open-source, columnar analytics datastore designed for sub-second queries on streaming and historical event data. It is NOT a transactional database, a full-featured data warehouse, or a general-purpose key-value store. 
Druid focuses on fast aggregations, flexible time-based partitioning, and hybrid real-time + historical query models.<\/p>\n\n\n\n<p>Key properties and constraints:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Column-oriented storage optimized for aggregations and scans.<\/li>\n<li>Time-centric data model with native support for time windows and rollups.<\/li>\n<li>Supports real-time ingestion from streams and batch ingestion.<\/li>\n<li>Horizontal scalability with distinct node types for ingestion, querying, and storage.<\/li>\n<li>Strong read-path optimization; write path is append-heavy and optimized for segment creation.<\/li>\n<li>Not ideal for high-cardinality joins across huge dimensions without careful design.<\/li>\n<li>Resource isolation between real-time ingestion and queries requires ops attention.<\/li>\n<\/ul>\n\n\n\n<p>Where it fits in modern cloud\/SRE workflows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Analytics serving layer for dashboards, anomaly detection, and ad-hoc exploration.<\/li>\n<li>Works as a serving backend for real-time features in ML\/AI pipelines.<\/li>\n<li>Fits into event-driven architectures: ingest from Kafka, cloud streams, or object storage.<\/li>\n<li>Deployed on Kubernetes or managed VMs; typical ops tasks include autoscaling, segment lifecycle, and compaction.<\/li>\n<li>Must be integrated with observability (metrics, logs, traces) and automated deployment\/rollbacks.<\/li>\n<\/ul>\n\n\n\n<p>Diagram description (text-only, visualize):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data producers -&gt; stream\/batch (Kafka\/S3) -&gt; Ingestor nodes (Kafka supervisors, indexing tasks) -&gt; Deep storage (S3\/GCS\/HDFS) persists segments -&gt; Historical nodes serve immutable segments from local cache -&gt; Broker nodes accept SQL\/JSON queries and route to Historical\/Realtime -&gt; Router\/Coordinator manages metadata and segment distribution -&gt; Clients use dashboards or APIs.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Apache Druid in one sentence<\/h3>\n\n\n\n<p>Apache Druid is a distributed, real-time analytics datastore optimized for sub-second aggregation and filtering queries on time-based event data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Apache Druid vs related terms (TABLE REQUIRED)<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Term<\/th>\n<th>How it differs from Apache Druid<\/th>\n<th>Common confusion<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>T1<\/td>\n<td>Data Warehouse<\/td>\n<td>Focuses on batch OLAP and heavy joins; Druid focuses on low-latency querying<\/td>\n<td>Druid is not a full replacement for a DW<\/td>\n<\/tr>\n<tr>\n<td>T2<\/td>\n<td>OLAP Cube<\/td>\n<td>Cubes pre-aggregate multidimensional data; Druid offers flexible rollups and fast slices<\/td>\n<td>People conflate cube precalc with Druid rollups<\/td>\n<\/tr>\n<tr>\n<td>T3<\/td>\n<td>Time-Series DB<\/td>\n<td>TSDBs optimize for high-resolution series; Druid optimizes aggregations across events<\/td>\n<td>Druid is used for time-series analytics not raw metric storage<\/td>\n<\/tr>\n<tr>\n<td>T4<\/td>\n<td>Kafka<\/td>\n<td>Kafka is a streaming platform; Druid ingests from Kafka for analytics<\/td>\n<td>Some assume Druid stores raw streams indefinitely<\/td>\n<\/tr>\n<tr>\n<td>T5<\/td>\n<td>Elasticsearch<\/td>\n<td>ES indexes documents for search; Druid indexes columns for fast analytics<\/td>\n<td>Both support queries but different query shapes<\/td>\n<\/tr>\n<tr>\n<td>T6<\/td>\n<td>ClickHouse<\/td>\n<td>ClickHouse is another columnar DB; Druid emphasizes streaming ingestion and segment lifecycle<\/td>\n<td>Choice depends on features and ops model<\/td>\n<\/tr>\n<tr>\n<td>T7<\/td>\n<td>Trino<\/td>\n<td>Trino is a query federator; Druid is a data store that can be queried by Trino<\/td>\n<td>Confusion around federated vs native execution<\/td>\n<\/tr>\n<tr>\n<td>T8<\/td>\n<td>Cloud Data Warehouse<\/td>\n<td>Managed services focus 
on SQL and storage; Druid focuses on realtime serving<\/td>\n<td>Misunderstandings on cost profile and operational effort<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details (only if any cell says \u201cSee details below\u201d)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>None<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Why does Apache Druid matter?<\/h2>\n\n\n\n<p>Business impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Revenue: Enables real-time dashboards and personalization features that can increase conversion by reacting to events in seconds.<\/li>\n<li>Trust: Provides consistent and fast insights for monitoring SLAs, fraud detection, and user-facing analytics.<\/li>\n<li>Risk: Incorrect configuration or missing alerting can cause stale data or query outages affecting decisions.<\/li>\n<\/ul>\n\n\n\n<p>Engineering impact:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Incident reduction: Proper SLOs and observability reduce time-to-detect and time-to-recover for query-serving incidents.<\/li>\n<li>Velocity: Fast ad-hoc queries empower data teams to iterate quickly on experiments and product metrics.<\/li>\n<\/ul>\n\n\n\n<p>SRE framing:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLIs: query latency (p50\/p95\/p99), query success rate, ingestion lag, segment availability.<\/li>\n<li>SLOs: e.g., 99% of queries return within 1s at p95; ingestion lag &lt; 30s for real-time tiers.<\/li>\n<li>Error budgets: Used to allow controlled experimentation on compaction or node upgrades.<\/li>\n<li>Toil: Segment compaction, backfills, and capacity planning; automation reduces toil.<\/li>\n<\/ul>\n\n\n\n<p>Realistic \u201cwhat breaks in production\u201d examples:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Segment churn spikes causing high disk I\/O and GC due to aggressive compaction config.<\/li>\n<li>Broker node overloaded by unbounded heavy queries causing query 
timeouts for dashboards.<\/li>\n<li>A deep storage outage prevents recovery of historical segments after node failures.<\/li>\n<li>Kafka consumer lag grows due to throttled indexing tasks, causing stale real-time data.<\/li>\n<li>Incorrect rollup settings lead to data re-aggregation differences and broken SLAs.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Where is Apache Druid used?<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Layer\/Area<\/th>\n<th>How Apache Druid appears<\/th>\n<th>Typical telemetry<\/th>\n<th>Common tools<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>L1<\/td>\n<td>Edge<\/td>\n<td>Rarely direct; pre-aggregated events forwarded to Druid<\/td>\n<td>Event counts, loss metrics<\/td>\n<td>Load balancers, CDN logs<\/td>\n<\/tr>\n<tr>\n<td>L2<\/td>\n<td>Network<\/td>\n<td>Receives proxied API traffic; brokers serve queries<\/td>\n<td>Request latency, error rates<\/td>\n<td>Nginx, Envoy<\/td>\n<\/tr>\n<tr>\n<td>L3<\/td>\n<td>Service<\/td>\n<td>Backend services push events to streams feeding Druid<\/td>\n<td>Ingestion lag, task health<\/td>\n<td>Kafka, PubSub<\/td>\n<\/tr>\n<tr>\n<td>L4<\/td>\n<td>Application<\/td>\n<td>Dashboards and analytics query Druid<\/td>\n<td>Query latencies, success rates<\/td>\n<td>Grafana, Superset<\/td>\n<\/tr>\n<tr>\n<td>L5<\/td>\n<td>Data<\/td>\n<td>Druid stores indexed segments and metadata<\/td>\n<td>Segment retention, compaction stats<\/td>\n<td>S3\/GCS, HDFS<\/td>\n<\/tr>\n<tr>\n<td>L6<\/td>\n<td>IaaS\/PaaS<\/td>\n<td>Druid nodes on VMs or managed instances<\/td>\n<td>Node health, disk IO<\/td>\n<td>Terraform, cloud provider monitoring<\/td>\n<\/tr>\n<tr>\n<td>L7<\/td>\n<td>Kubernetes<\/td>\n<td>Stateful pods for brokers\/historicals\/realtime<\/td>\n<td>Pod restarts, CPU\/memory<\/td>\n<td>K8s, Helm<\/td>\n<\/tr>\n<tr>\n<td>L8<\/td>\n<td>Serverless<\/td>\n<td>Managed ingestion with serverless functions 
feeding streams<\/td>\n<td>Invocation errors, throughput<\/td>\n<td>Lambda, Cloud Run<\/td>\n<\/tr>\n<tr>\n<td>L9<\/td>\n<td>CI\/CD<\/td>\n<td>Deploys Druid configs and tasks<\/td>\n<td>Deployment success, rollout time<\/td>\n<td>GitOps, ArgoCD<\/td>\n<\/tr>\n<tr>\n<td>L10<\/td>\n<td>Observability<\/td>\n<td>Telemetry aggregation and alerts<\/td>\n<td>Dashboards, traces<\/td>\n<td>Prometheus, OpenTelemetry<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details (only if needed)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>None<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">When should you use Apache Druid?<\/h2>\n\n\n\n<p>When it\u2019s necessary:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You need sub-second aggregation and filtering on high-volume event data.<\/li>\n<li>Real-time or near-real-time ingestion from streams is required.<\/li>\n<li>Dashboards need interactive drill-downs with many concurrent users.<\/li>\n<li>Use cases that require time-based rollups and retention.<\/li>\n<\/ul>\n\n\n\n<p>When it\u2019s optional:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If query latency requirements are seconds, not sub-second.<\/li>\n<li>If batch analytics and complicated multi-table joins dominate; a traditional DW may suffice.<\/li>\n<li>For low-volume analytics where cost and ops overhead outweigh benefits.<\/li>\n<\/ul>\n\n\n\n<p>When NOT to use \/ overuse it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Transactional workloads with frequent updates\/deletes across rows.<\/li>\n<li>High-cardinality point queries better served by a key-value store.<\/li>\n<li>Complex multi-source joins where federated engines or warehouses are better.<\/li>\n<\/ul>\n\n\n\n<p>Decision checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If you need sub-second aggregations and streaming ingestion -&gt; Use Druid.<\/li>\n<li>If you need full ANSI SQL for complex joins and low ops 
-&gt; Consider cloud DW or Trino.<\/li>\n<li>If schema frequently changes with wide joins -&gt; Evaluate alternatives.<\/li>\n<\/ul>\n\n\n\n<p>Maturity ladder:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Beginner: Small cluster, ingest from batch files, basic dashboards, simple retention.<\/li>\n<li>Intermediate: Kafka ingestion, compaction tuning, query caching, autoscaling.<\/li>\n<li>Advanced: Multi-tenant, cross-region replication, real-time feature serving, automated recovery and chaos-tested SLOs.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How does Apache Druid work?<\/h2>\n\n\n\n<p>Components and workflow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Client: Issues SQL or native queries to Brokers\/Routers.<\/li>\n<li>Router\/Broker: Receives queries, routes to Historical or Real-time nodes, merges results.<\/li>\n<li>Historical nodes: Serve immutable, on-disk segments cached locally for fast reads.<\/li>\n<li>Real-time (MiddleManager\/Peon\/Indexer): Ingest streaming events, create in-memory segments, hand off to Historical.<\/li>\n<li>Coordinator: Manages segment distribution and replication.<\/li>\n<li>Overlord: Manages ingestion tasks and supervisors.<\/li>\n<li>Deep storage: Object store or HDFS holding segment files as canonical storage.<\/li>\n<li>Metadata store: Relational DB storing segment metadata, task state, and configuration.<\/li>\n<\/ul>\n\n\n\n<p>Data flow and lifecycle:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Events arrive via stream or batch.<\/li>\n<li>Indexing tasks parse, transform, and optionally rollup data into segments.<\/li>\n<li>Segments are pushed to deep storage and registered in metadata store.<\/li>\n<li>Coordinator assigns segments to Historical nodes according to replication rules.<\/li>\n<li>Brokers route queries and aggregate results from relevant nodes.<\/li>\n<li>Compaction tasks optimize segment sizes and perform repartitioning when 
configured.<\/li>\n<\/ol>\n\n\n\n<p>Edge cases and failure modes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Partial handoff, where some segments are not fully persisted, causing query inconsistencies.<\/li>\n<li>The Coordinator and Overlord become single points of decision logic; metadata DB outages cause control-plane issues.<\/li>\n<li>Disk pressure on Historical nodes causing evictions and increased network I\/O.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical architecture patterns for Apache Druid<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Basic standalone: Single cluster handling ingestion and queries for dev or low-volume use.<\/li>\n<li>Streaming-focused: Realtime ingestion with Kafka supervisors, dedicated MiddleManagers, and scale-out Historical nodes.<\/li>\n<li>BI-serving cluster: Focus on high query concurrency, many small Historical nodes with heavy caching.<\/li>\n<li>Multi-tenant logical separation: Separate data sources and resource pools per team, using Coordinator rules.<\/li>\n<li>Hybrid cloud: Druid on Kubernetes with deep storage in cloud object store and IAM-managed access.<\/li>\n<li>Edge + central analytics: Edge services forward summarized events to Druid for central dashboards.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Failure modes &amp; mitigation<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Failure mode<\/th>\n<th>Symptom<\/th>\n<th>Likely cause<\/th>\n<th>Mitigation<\/th>\n<th>Observability signal<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>F1<\/td>\n<td>Query timeouts<\/td>\n<td>Slow or failed queries<\/td>\n<td>Overloaded brokers or heavy queries<\/td>\n<td>Rate-limit queries and add capacity<\/td>\n<td>High query latency metrics<\/td>\n<\/tr>\n<tr>\n<td>F2<\/td>\n<td>Ingestion lag<\/td>\n<td>Growing lag in stream offsets<\/td>\n<td>Indexing tasks slow or crashed<\/td>\n<td>Autoscale indexers and tune parsing<\/td>\n<td>Kafka consumer 
lag<\/td>\n<\/tr>\n<tr>\n<td>F3<\/td>\n<td>Segment eviction<\/td>\n<td>Missing historical segments<\/td>\n<td>Disk pressure or misconfigured retention<\/td>\n<td>Increase disk or adjust retention<\/td>\n<td>Segment availability ratio<\/td>\n<\/tr>\n<tr>\n<td>F4<\/td>\n<td>Metadata DB down<\/td>\n<td>Coordinator unable to assign segments<\/td>\n<td>Metadata store outage<\/td>\n<td>HA for metadata DB and backups<\/td>\n<td>Coordinator errors<\/td>\n<\/tr>\n<tr>\n<td>F5<\/td>\n<td>Deep storage failure<\/td>\n<td>Unable to load historical segments<\/td>\n<td>Object store permissions or outage<\/td>\n<td>Validate IAM and retries<\/td>\n<td>Segment push\/pull failures<\/td>\n<\/tr>\n<tr>\n<td>F6<\/td>\n<td>Compaction overload<\/td>\n<td>High CPU and IO during compaction<\/td>\n<td>Aggressive compaction schedule<\/td>\n<td>Stagger compaction tasks<\/td>\n<td>Compaction task latency<\/td>\n<\/tr>\n<tr>\n<td>F7<\/td>\n<td>Memory OOM<\/td>\n<td>JVM OOM on nodes<\/td>\n<td>Incorrect JVM heap settings<\/td>\n<td>Tune JVM heap and GC flags<\/td>\n<td>Heap usage and GC pauses<\/td>\n<\/tr>\n<tr>\n<td>F8<\/td>\n<td>Unbounded queries<\/td>\n<td>System slowdown from ad-hoc queries<\/td>\n<td>Lack of query limits<\/td>\n<td>Implement query timeouts and cost limits<\/td>\n<td>High concurrent queries metric<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details (only if needed)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>None<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Key Concepts, Keywords &amp; Terminology for Apache Druid<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Segment \u2014 A unit of immutable storage for a time-chunk of data \u2014 Enables efficient scan and parallelism \u2014 Pitfall: too many small segments.<\/li>\n<li>Historical node \u2014 Serves immutable segments from local disk \u2014 Primary read-serving node \u2014 Pitfall: insufficient disk 
caching.<\/li>\n<li>Broker \u2014 Routes and merges query results \u2014 Central to query fanout \u2014 Pitfall: overloaded brokers cause query timeouts.<\/li>\n<li>Overlord \u2014 Manages indexing tasks \u2014 Controls ingestion lifecycles \u2014 Pitfall: single point if not HA.<\/li>\n<li>Coordinator \u2014 Manages segments and replication \u2014 Ensures data placement \u2014 Pitfall: misconfig can cause imbalance.<\/li>\n<li>Deep storage \u2014 Object store for segments \u2014 Durable segment storage \u2014 Pitfall: permissions or latency issues.<\/li>\n<li>Metadata store \u2014 Relational DB with cluster metadata \u2014 Source of truth for state \u2014 Pitfall: not highly available.<\/li>\n<li>MiddleManager \u2014 Runs indexing tasks (on-prem\/K8s) \u2014 Handles ingestion parallelism \u2014 Pitfall: insufficient resources.<\/li>\n<li>Peon \u2014 Worker executing indexing subtasks \u2014 Part of real-time ingestion \u2014 Pitfall: task failures need retries.<\/li>\n<li>Real-time node \u2014 Handles immediate queries on in-memory segments \u2014 Provides low-latency ingestion \u2014 Pitfall: memory pressure.<\/li>\n<li>Indexing task \u2014 Job to convert raw data to Druid segments \u2014 Configurable transforms \u2014 Pitfall: long-running failures.<\/li>\n<li>Supervisor \u2014 Manages streaming ingestion tasks (Kafka) \u2014 Restarts tasks on failure \u2014 Pitfall: misconfigured offset reset.<\/li>\n<li>Rollup \u2014 Aggregation during ingestion to reduce cardinality \u2014 Saves storage and speeds queries \u2014 Pitfall: data loss of granularity.<\/li>\n<li>Granularity \u2014 Time bucketing for segments \u2014 Affects segment size and query speed \u2014 Pitfall: too coarse granularity hides details.<\/li>\n<li>Partitioning \u2014 How data is split within segments \u2014 Influences query parallelism \u2014 Pitfall: high skew on partitions.<\/li>\n<li>Compaction \u2014 Rewrites segments for optimization \u2014 Reduces file count and improves scan \u2014 
Pitfall: compaction can impact IO.<\/li>\n<li>Query router \u2014 Routes queries to brokers or nodes \u2014 Load distributes queries \u2014 Pitfall: single router misconfig.<\/li>\n<li>Native queries \u2014 Druid\u2019s JSON query format \u2014 Allows complex aggregations \u2014 Pitfall: not portable SQL.<\/li>\n<li>SQL in Druid \u2014 ANSI-like SQL interface \u2014 Easier for analysts \u2014 Pitfall: some features are not fully ANSI compliant.<\/li>\n<li>Columnar store \u2014 Columns stored contiguously \u2014 Efficient for aggregates \u2014 Pitfall: poor at point updates.<\/li>\n<li>Bitmap index \u2014 Fast filter index for dimensions \u2014 Speeds boolean filters \u2014 Pitfall: large bitmaps for cardinal dims.<\/li>\n<li>Inverted index \u2014 Alternative indexing for text fields \u2014 Helps filter performance \u2014 Pitfall: memory overhead.<\/li>\n<li>Vectorized processing \u2014 Batch processing within nodes \u2014 Improves CPU efficiency \u2014 Pitfall: requires JIT-friendly data shapes.<\/li>\n<li>JVM tuning \u2014 Required for Druid Java processes \u2014 Affects GC and throughput \u2014 Pitfall: wrong flags cause GC storms.<\/li>\n<li>Segment cache \u2014 Local disk or memory cache for segments \u2014 Reduces network fetches \u2014 Pitfall: cold cache on restart.<\/li>\n<li>Time chunk \u2014 Time window used for segmentization \u2014 Controls query pruning \u2014 Pitfall: misaligned chunks increase scan.<\/li>\n<li>Retention policy \u2014 Rules to drop old segments \u2014 Controls storage costs \u2014 Pitfall: accidental data loss if mis-set.<\/li>\n<li>Replica factor \u2014 Number of copies of segments \u2014 Balances availability \u2014 Pitfall: high replication increases storage cost.<\/li>\n<li>Service discovery \u2014 Finds Druid nodes in cluster \u2014 Required for routing \u2014 Pitfall: DNS TTL issues cause stale routing.<\/li>\n<li>TLS\/Encryption \u2014 Needed for secure data in transit \u2014 Security requirement \u2014 Pitfall: certificate 
rotation complexity.<\/li>\n<li>ACLs \u2014 Access control for Druid APIs \u2014 Protects data and admin APIs \u2014 Pitfall: misconfig can break dashboards.<\/li>\n<li>Auto-scaling \u2014 Scale nodes based on load \u2014 Cost efficient \u2014 Pitfall: lag in scaling can cause overload.<\/li>\n<li>Backfill \u2014 Reingesting historical data \u2014 Needed after schema changes \u2014 Pitfall: duplicate data if not deduped.<\/li>\n<li>Idempotence \u2014 Safe retries in ingestion tasks \u2014 Prevents duplicate segments \u2014 Pitfall: writes may be repeated without checks.<\/li>\n<li>Split\/merge \u2014 Segment operations for compaction \u2014 Optimize performance \u2014 Pitfall: causes transient imbalance.<\/li>\n<li>Materialized view \u2014 Precomputed aggregates in Druid \u2014 Speeds queries \u2014 Pitfall: extra storage and pipeline complexity.<\/li>\n<li>Query context \u2014 Parameters controlling query execution \u2014 Tuning handle for latency vs completeness \u2014 Pitfall: inconsistent contexts cause varying results.<\/li>\n<li>Ingestion spec \u2014 JSON\/YAML describing ingestion job \u2014 Defines transforms and tuning \u2014 Pitfall: complexity results in errors.<\/li>\n<li>Time zone handling \u2014 Important for correct bucketing \u2014 Affects query results \u2014 Pitfall: mismatched client and ingestion TZ.<\/li>\n<li>Cost-based optimization \u2014 Query planner considerations \u2014 Impacts distributed query plans \u2014 Pitfall: inaccurate stats lead to poor plans.<\/li>\n<li>Security manager \u2014 Optional plugin for authz\/authn \u2014 Protects APIs \u2014 Pitfall: wrong roles block automation.<\/li>\n<li>Segment lineage \u2014 Mapping of segments to source data \u2014 Useful for debugging \u2014 Pitfall: lost lineage during backfills.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">How to Measure Apache Druid (Metrics, SLIs, SLOs)<\/h2>\n\n\n\n<figure 
class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Metric\/SLI<\/th>\n<th>What it tells you<\/th>\n<th>How to measure<\/th>\n<th>Starting target<\/th>\n<th>Gotchas<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>M1<\/td>\n<td>Query latency p95<\/td>\n<td>User experience for dashboards<\/td>\n<td>Measure end-to-end query time at broker<\/td>\n<td>p95 &lt; 1s<\/td>\n<td>Long-tail outliers at p99<\/td>\n<\/tr>\n<tr>\n<td>M2<\/td>\n<td>Query success rate<\/td>\n<td>Reliability of query serving<\/td>\n<td>Successful queries \/ total<\/td>\n<td>99.9%<\/td>\n<td>Client-side retries mask failures<\/td>\n<\/tr>\n<tr>\n<td>M3<\/td>\n<td>Ingestion lag<\/td>\n<td>Freshness of real-time data<\/td>\n<td>Time between event and availability<\/td>\n<td>&lt; 30s for real-time<\/td>\n<td>Spike during compaction<\/td>\n<\/tr>\n<tr>\n<td>M4<\/td>\n<td>Kafka consumer lag<\/td>\n<td>Backpressure in stream pipeline<\/td>\n<td>Consumer offset difference<\/td>\n<td>Near zero<\/td>\n<td>Rebalances cause spikes<\/td>\n<\/tr>\n<tr>\n<td>M5<\/td>\n<td>Segment availability<\/td>\n<td>Data availability for queries<\/td>\n<td>Ratio of assigned segments<\/td>\n<td>100% for SLA<\/td>\n<td>Slow reassigns after failover<\/td>\n<\/tr>\n<tr>\n<td>M6<\/td>\n<td>JVM GC pause time<\/td>\n<td>Node responsiveness<\/td>\n<td>GC pause durations per node<\/td>\n<td>p95 &lt; 500ms<\/td>\n<td>Long pauses during compaction<\/td>\n<\/tr>\n<tr>\n<td>M7<\/td>\n<td>Disk IO utilization<\/td>\n<td>Read\/write pressure<\/td>\n<td>Disk busy percent on historicals<\/td>\n<td>&lt; 70% sustained<\/td>\n<td>Spikes from compaction<\/td>\n<\/tr>\n<tr>\n<td>M8<\/td>\n<td>Network egress\/ingress<\/td>\n<td>Data transfer for queries<\/td>\n<td>Bytes\/sec during queries<\/td>\n<td>Baseline vs spikes<\/td>\n<td>Cross-region reads costly<\/td>\n<\/tr>\n<tr>\n<td>M9<\/td>\n<td>Compaction task success<\/td>\n<td>Background maintenance health<\/td>\n<td>Success rate of tasks<\/td>\n<td>100%<\/td>\n<td>Failures leave many 
small segments<\/td>\n<\/tr>\n<tr>\n<td>M10<\/td>\n<td>Heap usage<\/td>\n<td>Memory health<\/td>\n<td>Used\/allocated JVM heap<\/td>\n<td>&lt; 80%<\/td>\n<td>Memory leaks grow over time<\/td>\n<\/tr>\n<tr>\n<td>M11<\/td>\n<td>Broker CPU load<\/td>\n<td>Query routing health<\/td>\n<td>Broker CPU avg<\/td>\n<td>&lt; 70%<\/td>\n<td>Heavy merge queries increase CPU<\/td>\n<\/tr>\n<tr>\n<td>M12<\/td>\n<td>Metadata DB latency<\/td>\n<td>Control plane responsiveness<\/td>\n<td>Query latency to metadata DB<\/td>\n<td>p95 &lt; 200ms<\/td>\n<td>Locks during backups affect ops<\/td>\n<\/tr>\n<tr>\n<td>M13<\/td>\n<td>Segment push latency<\/td>\n<td>Ingestion durability<\/td>\n<td>Time to push segment to deep storage<\/td>\n<td>&lt; 30s<\/td>\n<td>Deep storage throttling<\/td>\n<\/tr>\n<tr>\n<td>M14<\/td>\n<td>Query queue length<\/td>\n<td>Query overload risk<\/td>\n<td>Pending queries at broker<\/td>\n<td>&lt; 100<\/td>\n<td>Unbounded queries cause queueing<\/td>\n<\/tr>\n<tr>\n<td>M15<\/td>\n<td>Error budget burn rate<\/td>\n<td>SLO consumption speed<\/td>\n<td>SLO violation rate over time<\/td>\n<td>Define per service<\/td>\n<td>Sudden spikes require rapid action<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Row Details (only if needed)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>None<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Best tools to measure Apache Druid<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Prometheus + Grafana<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Apache Druid: JVM metrics, Druid exporter metrics, query latency, ingestion lag.<\/li>\n<li>Best-fit environment: Kubernetes, VMs, cloud.<\/li>\n<li>Setup outline:<\/li>\n<li>Deploy Prometheus scrape configs for Druid metrics endpoints.<\/li>\n<li>Install Druid metrics emitter or JMX exporter.<\/li>\n<li>Create Grafana dashboards using Druid metric names.<\/li>\n<li>Configure alertmanager with routing rules.<\/li>\n<li>Add recording rules for aggregated SLIs.<\/li>\n<li>Strengths:<\/li>\n<li>Widely adopted and flexible.<\/li>\n<li>Good for custom alerting and dashboards.<\/li>\n<li>Limitations:<\/li>\n<li>Storage cost and cardinality explosion on high-label metrics.<\/li>\n<li>Requires maintenance of exporters and dashboards.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 OpenTelemetry + Tempo<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Apache Druid: Traces for indexing tasks and query request flows.<\/li>\n<li>Best-fit environment: Service-mesh or distributed tracing-aware deployments.<\/li>\n<li>Setup outline:<\/li>\n<li>Instrument HTTP clients and indexing tasks with OTLP.<\/li>\n<li>Configure collector to export to Tempo.<\/li>\n<li>Correlate traces with logs and metrics.<\/li>\n<li>Strengths:<\/li>\n<li>End-to-end request tracing.<\/li>\n<li>Helps debug slow queries and task latencies.<\/li>\n<li>Limitations:<\/li>\n<li>Requires instrumentation effort.<\/li>\n<li>High-volume traces can be costly.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 ELK Stack (Elasticsearch, Logstash, Kibana)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Apache Druid: Logs from Druid processes and task logs.<\/li>\n<li>Best-fit environment: On-prem and cloud.<\/li>\n<li>Setup outline:<\/li>\n<li>Ship logs via Filebeat or 
Fluentd.<\/li>\n<li>Parse Druid task logs and expose task IDs.<\/li>\n<li>Build Kibana dashboards for error patterns.<\/li>\n<li>Strengths:<\/li>\n<li>Powerful log search and correlation.<\/li>\n<li>Useful for postmortems.<\/li>\n<li>Limitations:<\/li>\n<li>Log retention costs and scaling concerns.<\/li>\n<li>Complex parsing for nested task logs.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Cloud Provider Monitoring (CloudWatch\/GCP Monitoring\/Azure Monitor)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Apache Druid: VM metrics, autoscaling events, storage metrics.<\/li>\n<li>Best-fit environment: Managed cloud deployments.<\/li>\n<li>Setup outline:<\/li>\n<li>Enable host metrics and object store metrics.<\/li>\n<li>Integrate Druid metrics via custom metrics API.<\/li>\n<li>Configure alerts based on SLO thresholds.<\/li>\n<li>Strengths:<\/li>\n<li>Tight integration with cloud infra.<\/li>\n<li>Simplifies alerting for infra events.<\/li>\n<li>Limitations:<\/li>\n<li>Limited Druid-specific dashboards out-of-the-box.<\/li>\n<li>Vendor lock-in concerns.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Tool \u2014 Chaos Engineering Tools (Litmus, Chaos Mesh)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What it measures for Apache Druid: Resilience under failure scenarios.<\/li>\n<li>Best-fit environment: Kubernetes.<\/li>\n<li>Setup outline:<\/li>\n<li>Define faults like node kill, network partition.<\/li>\n<li>Run game days and validate SLOs.<\/li>\n<li>Automate recovery checks.<\/li>\n<li>Strengths:<\/li>\n<li>Validates operational readiness.<\/li>\n<li>Surfaces hidden single points of failure.<\/li>\n<li>Limitations:<\/li>\n<li>Requires careful planning to avoid production damage.<\/li>\n<li>Needs strong rollback\/runbook automation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Recommended dashboards &amp; alerts for Apache Druid<\/h3>\n\n\n\n<p>Executive dashboard:<\/p>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Panels: Overall query Latency (p50\/p95\/p99), Query success rate, Ingestion lag, Segment availability, Cost estimate.<\/li>\n<li>Why: High-level health for business stakeholders; quick signal of data freshness and query SLA.<\/li>\n<\/ul>\n\n\n\n<p>On-call dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels: Live query queue length, Broker CPU\/memory, JVM GC pauses, Real-time task health, Recent errors, Kafka lag.<\/li>\n<li>Why: Provides actionable view for SREs to triage page-level incidents.<\/li>\n<\/ul>\n\n\n\n<p>Debug dashboard:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Panels: Per-data-source segment counts, Segment sizes, Compaction task status, Deep storage push\/pull latencies, Traces of slow queries.<\/li>\n<li>Why: Helps engineers inspect segment layout and root cause performance issues.<\/li>\n<\/ul>\n\n\n\n<p>Alerting guidance:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Page vs ticket:<\/li>\n<li>Page for: query success rate drops below SLO, ingestion lag exceeds critical threshold, deep storage failures.<\/li>\n<li>Ticket for: compaction failures, non-urgent metric regressions.<\/li>\n<li>Burn-rate guidance:<\/li>\n<li>Use burn-rate windows (e.g., 1h, 6h) relative to error budget; page if burn rate exceeds 5x expected.<\/li>\n<li>Noise reduction tactics:<\/li>\n<li>Dedupe alerts by grouping by service and data source.<\/li>\n<li>Suppress transient spikes using anomaly detection on historical baselines.<\/li>\n<li>Use alert inhibition for planned maintenance windows.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation Guide (Step-by-step)<\/h2>\n\n\n\n<p>1) Prerequisites\n&#8211; Define data sources, retention, and schema.\n&#8211; Choose deep storage and metadata DB with HA.\n&#8211; Decide on deployment pattern (Kubernetes vs VMs).\n&#8211; Establish IAM and network policies.<\/p>\n\n\n\n<p>2) Instrumentation plan\n&#8211; Export Druid JVM 
and application metrics.\n&#8211; Add tracing for ingestion tasks and HTTP request flows.\n&#8211; Centralize logs with structured fields for task IDs and segment IDs.<\/p>\n\n\n\n<p>3) Data collection\n&#8211; Configure ingestion specs for batch or Kafka supervisors.\n&#8211; Define rollup, granularity, and partitioning.\n&#8211; Set compaction and retention policies.<\/p>\n\n\n\n<p>4) SLO design\n&#8211; Identify SLIs: ingestion lag, query latency, query success.\n&#8211; Define SLOs with error budgets and burn-rate rules.<\/p>\n\n\n\n<p>5) Dashboards\n&#8211; Build executive, on-call, and debug dashboards.\n&#8211; Add drill-down panels to correlate metrics, traces, logs.<\/p>\n\n\n\n<p>6) Alerts &amp; routing\n&#8211; Create alerts for SLO breaches and infrastructure issues.\n&#8211; Route pages to on-call SREs and tickets to data teams.<\/p>\n\n\n\n<p>7) Runbooks &amp; automation\n&#8211; Document runbooks for common failures: broker overload, deep storage issues, ingestion lag.\n&#8211; Automate common fixes: restart indexing tasks, scale nodes, heal corrupted segments.<\/p>\n\n\n\n<p>8) Validation (load\/chaos\/game days)\n&#8211; Run synthetic query loads to validate latency SLOs.\n&#8211; Perform chaos tests on coordinator, deep storage, and brokers.\n&#8211; Run backfill and compaction exercises in staging.<\/p>\n\n\n\n<p>9) Continuous improvement\n&#8211; Review postmortems and tweak retention\/compaction.\n&#8211; Automate scaling policies and maintenance scheduling.<\/p>\n\n\n\n<p>Checklists:<\/p>\n\n\n\n<p>Pre-production checklist<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Confirm deep storage access and permissions.<\/li>\n<li>Validate metadata DB HA and backups.<\/li>\n<li>Run end-to-end ingestion and query tests.<\/li>\n<li>Baseline metrics and create dashboards.<\/li>\n<li>Define retention and data protection rules.<\/li>\n<\/ul>\n\n\n\n<p>Production readiness checklist<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Autoscaling rules in 
place.<\/li>\n<li>SLOs and alerting configured.<\/li>\n<li>Runbooks tested and accessible.<\/li>\n<li>Backups for metadata store scheduled.<\/li>\n<li>IAM roles and TLS certs deployed.<\/li>\n<\/ul>\n\n\n\n<p>Incident checklist specific to Apache Druid<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Verify metadata DB and deep storage accessibility.<\/li>\n<li>Check broker and coordinator health.<\/li>\n<li>Inspect ingestion task statuses and Kafka lags.<\/li>\n<li>Review disk usage and segment availability.<\/li>\n<li>Execute relevant runbook procedure and escalate if unresolved.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Use Cases of Apache Druid<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p>Real-time analytics for product metrics\n&#8211; Context: Product teams need live funnels and conversion rates.\n&#8211; Problem: Batch latencies cause stale dashboards.\n&#8211; Why Druid helps: Sub-second queries and streaming ingestion.\n&#8211; What to measure: Query latency, ingestion lag, event throughput.\n&#8211; Typical tools: Kafka, Grafana, Superset.<\/p>\n<\/li>\n<li>\n<p>Fraud detection and anomaly detection\n&#8211; Context: Detect anomalous user behavior.\n&#8211; Problem: Need fast aggregations over recent events.\n&#8211; Why Druid helps: Fast aggregations and low-latency time windows.\n&#8211; What to measure: Alerting latency, false positive rate.\n&#8211; Typical tools: Kafka, Python ML, Grafana.<\/p>\n<\/li>\n<li>\n<p>Ad hoc exploration for analytics teams\n&#8211; Context: Data analysts run interactive queries.\n&#8211; Problem: Data warehouse queries too slow for exploration.\n&#8211; Why Druid helps: Sub-second drill-downs and SQL interface.\n&#8211; What to measure: Query concurrency, p95 latency.\n&#8211; Typical tools: Superset, Looker, SQL clients.<\/p>\n<\/li>\n<li>\n<p>Observability backend for telemetry\n&#8211; Context: Logs and metrics 
analytics requiring aggregation.\n&#8211; Problem: High ingestion rates and need for retention policies.\n&#8211; Why Druid helps: Columnar store and segment pruning by time.\n&#8211; What to measure: Query performance on telemetry slices.\n&#8211; Typical tools: Kafka, Prometheus exporters, Grafana.<\/p>\n<\/li>\n<li>\n<p>Feature store serving layer for ML\n&#8211; Context: Low-latency aggregation of event features for models.\n&#8211; Problem: Need consistent historical and streaming features.\n&#8211; Why Druid helps: Consistent ingestion and query model.\n&#8211; What to measure: Feature freshness, latency.\n&#8211; Typical tools: Airflow, Kafka, model servers.<\/p>\n<\/li>\n<li>\n<p>Clickstream analytics\n&#8211; Context: Real-time user behavior tracking.\n&#8211; Problem: High-volume events need fast rollups.\n&#8211; Why Druid helps: Efficient rollup and bitmap indexes.\n&#8211; What to measure: Sessions per minute, conversion metrics.\n&#8211; Typical tools: Kafka, web trackers, Superset.<\/p>\n<\/li>\n<li>\n<p>Marketing analytics attribution\n&#8211; Context: Attribution windows require complex time bucketing.\n&#8211; Problem: Batch windows produce stale insights.\n&#8211; Why Druid helps: Time-centric bucketing and sub-second queries.\n&#8211; What to measure: Attribution latency, query churn.\n&#8211; Typical tools: CDP, ingestion pipeline, BI tools.<\/p>\n<\/li>\n<li>\n<p>IoT telemetry analytics\n&#8211; Context: Many devices sending events continuously.\n&#8211; Problem: High ingest and query volume across many time-series.\n&#8211; Why Druid helps: Scales horizontally and supports rollups.\n&#8211; What to measure: Ingestion throughput, segment sizes.\n&#8211; Typical tools: MQTT-&gt;Kafka, object storage, Grafana.<\/p>\n<\/li>\n<li>\n<p>Retail inventory analytics\n&#8211; Context: Near-real-time inventory dashboards.\n&#8211; Problem: Need frequent aggregations over SKUs and stores.\n&#8211; Why Druid helps: Fast group-by aggregations.\n&#8211; What 
to measure: Query freshness, segment replication.\n&#8211; Typical tools: ETL pipelines, Kafka, dashboards.<\/p>\n<\/li>\n<li>\n<p>A\/B testing and experimentation analytics\n&#8211; Context: Rapid analysis of experiments.\n&#8211; Problem: Need near-real-time aggregation to decide rollouts.\n&#8211; Why Druid helps: Low-latency aggregation and rollups.\n&#8211; What to measure: Experiment metric latency and query error.\n&#8211; Typical tools: Experiment platform, event stream, Druid.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Scenario Examples (Realistic, End-to-End)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #1 \u2014 Kubernetes deployment for product analytics<\/h3>\n\n\n\n<p><strong>Context:<\/strong> SaaS product needs interactive dashboards backed by streaming events.<br\/>\n<strong>Goal:<\/strong> Sub-second dashboard queries and near-real-time ingestion.<br\/>\n<strong>Why Apache Druid matters here:<\/strong> Supports streaming ingestion, horizontal scaling, and Kubernetes deployment patterns.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Events -&gt; Kafka -&gt; Druid Kafka supervisor on Kubernetes -&gt; MiddleManagers in K8s -&gt; Deep storage on S3 -&gt; Historical pods serve queries -&gt; Broker service fronted by ingress -&gt; Grafana\/Superset.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Provision S3 bucket and RDS for metadata DB.<\/li>\n<li>Deploy Druid helm charts with dedicated node pools.<\/li>\n<li>Configure Kafka supervisor ingestion specs.<\/li>\n<li>Set compaction and retention policies.<\/li>\n<li>Create Grafana dashboards and alerts.\n<strong>What to measure:<\/strong> Query p95, ingestion lag, Kafka consumer lag, pod restarts.<br\/>\n<strong>Tools to use and why:<\/strong> Kubernetes for orchestration, Prometheus\/Grafana for metrics, Kafka for streaming.<br\/>\n<strong>Common pitfalls:<\/strong> Pod resource 
limits too low causing OOMs; deep storage permissions misconfigured.<br\/>\n<strong>Validation:<\/strong> Run synthetic query load and Kafka produce\/consume latency tests.<br\/>\n<strong>Outcome:<\/strong> Interactive dashboards with sub-second response for common aggregates.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #2 \u2014 Serverless ingestion into managed Druid (PaaS)<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Small analytics team wants managed Druid with serverless ingestion functions.<br\/>\n<strong>Goal:<\/strong> Reduce ops overhead while keeping near-real-time analytics.<br\/>\n<strong>Why Apache Druid matters here:<\/strong> Even managed Druid benefits from streaming ingest for freshness.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Events -&gt; Serverless functions -&gt; Publish to cloud Pub\/Sub -&gt; Druid managed ingestion service -&gt; Deep storage managed by provider -&gt; BI tools.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Configure pub\/sub topics and IAM.<\/li>\n<li>Implement serverless functions to validate and publish events.<\/li>\n<li>Register Druid ingestion endpoint and supervisor.<\/li>\n<li>Configure SLOs and alerts via provider monitoring.<br\/>\n<strong>What to measure:<\/strong> Function errors, ingestion lag, query latency.<br\/>\n<strong>Tools to use and why:<\/strong> Managed Druid or hosted offering to reduce cluster ops; cloud monitoring for metrics.<br\/>\n<strong>Common pitfalls:<\/strong> Cold-starts in serverless causing burst latency; assumption of unlimited ingestion rate.<br\/>\n<strong>Validation:<\/strong> Spike loads with serverless warmers and measure end-to-end latency.<br\/>\n<strong>Outcome:<\/strong> Managed analytics with low operational burden and acceptable freshness.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #3 \u2014 Incident response \/ postmortem when queries fail<\/h3>\n\n\n\n<p><strong>Context:<\/strong> 
Production dashboards show errors and timeouts.<br\/>\n<strong>Goal:<\/strong> Identify the root cause and restore SLOs.<br\/>\n<strong>Why Apache Druid matters here:<\/strong> Central analytics failure impacts multiple teams.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Brokers -&gt; Historical nodes -&gt; Deep storage; indexing tasks running.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Triage: check broker logs, query queue length, and error metrics.<\/li>\n<li>Validate metadata DB and deep storage connectivity.<\/li>\n<li>Inspect ingestion tasks and Kafka lag.<\/li>\n<li>If broker overloaded, add broker replicas or throttle heavy queries.<\/li>\n<li>If deep storage issues, failover or fix IAM and re-push segments.\n<strong>What to measure:<\/strong> Query error rate, broker CPU, metadata DB latency.<br\/>\n<strong>Tools to use and why:<\/strong> Prometheus, Grafana, logs in ELK for correlation.<br\/>\n<strong>Common pitfalls:<\/strong> Ignoring compaction spikes and GC pauses during the incident.<br\/>\n<strong>Validation:<\/strong> After fixes, run queries and confirm p95 and success rate back to SLO.<br\/>\n<strong>Outcome:<\/strong> Restored dashboard functionality and postmortem with action items.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario #4 \u2014 Cost vs performance tuning for high-cardinality data<\/h3>\n\n\n\n<p><strong>Context:<\/strong> Analytics on user-level data with high cardinality causing storage and compute cost.<br\/>\n<strong>Goal:<\/strong> Reduce cost while keeping acceptable query latency.<br\/>\n<strong>Why Apache Druid matters here:<\/strong> Offers rollups and bitmap indexes but needs tuning for cardinality.<br\/>\n<strong>Architecture \/ workflow:<\/strong> Raw events -&gt; rollup strategy -&gt; segment partitioning -&gt; historical serving.<br\/>\n<strong>Step-by-step implementation:<\/strong> <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Identify critical vs optional 
dimensions.<\/li>\n<li>Apply rollup and groupBy cardinality caps.<\/li>\n<li>Use pre-aggregation for common queries and materialized views.<\/li>\n<li>Adjust replica factor and compaction settings.\n<strong>What to measure:<\/strong> Storage usage, query p95, number of segments.<br\/>\n<strong>Tools to use and why:<\/strong> Cost dashboards from cloud provider, Druid metric dashboards.<br\/>\n<strong>Common pitfalls:<\/strong> Over-rollup causing loss of needed granularity.<br\/>\n<strong>Validation:<\/strong> Run representative queries and compare performance and cost.<br\/>\n<strong>Outcome:<\/strong> Balanced config that meets cost targets while serving required queries.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Common Mistakes, Anti-patterns, and Troubleshooting<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Symptom: Many small segments. Root cause: Too fine-grained ingestion granularity. Fix: Increase time chunks and enable compaction.  <\/li>\n<li>Symptom: Query timeouts. Root cause: Heavy unbounded SQL queries. Fix: Implement query timeouts and cost limits.  <\/li>\n<li>Symptom: High JVM GC pauses. Root cause: Improper heap sizing. Fix: Tune JVM and enable G1 or ZGC if supported.  <\/li>\n<li>Symptom: Slow segment handoff. Root cause: Deep storage network or permission issue. Fix: Validate IAM and network; increase timeouts.  <\/li>\n<li>Symptom: Broker overload. Root cause: Too many concurrent merge-heavy queries. Fix: Add broker replicas and configure query limits.  <\/li>\n<li>Symptom: Data staleness. Root cause: Indexing tasks failing silently. Fix: Add task-level alerts and retries.  <\/li>\n<li>Symptom: Metadata DB latency. Root cause: Single instance DB under load. Fix: HA DB and read replicas.  <\/li>\n<li>Symptom: Segment eviction thrash. Root cause: Disk space mismanagement. Fix: Increase disk, tune retention, or adjust cache settings.  
<\/li>\n<li>Symptom: Unexpected aggregates. Root cause: Incorrect rollup config. Fix: Re-ingest with corrected spec or maintain raw data copy.  <\/li>\n<li>Symptom: High cost on cloud egress. Root cause: Cross-region reads. Fix: Co-locate Druid and deep storage or replicate locally.  <\/li>\n<li>Symptom: Long compaction times. Root cause: Too many compaction tasks concurrently. Fix: Stagger compaction and limit concurrency.  <\/li>\n<li>Symptom: Authentication failures. Root cause: TLS or ACL misconfig. Fix: Rotate certs and audit ACLs.  <\/li>\n<li>Symptom: Kafka lag spikes. Root cause: Indexing task CPU starvation. Fix: Autoscale indexers or improve resource requests.  <\/li>\n<li>Symptom: Slow startup of historical nodes. Root cause: Cold segment cache and many segment loads. Fix: Pre-warm cache or stagger restarts.  <\/li>\n<li>Symptom: Corrupted segments. Root cause: Incomplete segment push. Fix: Re-push segments from deep storage or reindex.  <\/li>\n<li>Symptom: Missing metrics. Root cause: Metric emitter not configured. Fix: Enable and validate metric endpoints.  <\/li>\n<li>Symptom: Observability blind spots. Root cause: Missing trace or log correlation IDs. Fix: Instrument tasks with consistent IDs.  <\/li>\n<li>Symptom: Query recompilation overhead. Root cause: High dynamic SQL variance. Fix: Use prepared queries or caching.  <\/li>\n<li>Symptom: Ineffective alerts. Root cause: Static thresholds not tuned to workload. Fix: Use adaptive or burn-rate alerts and baselines.  <\/li>\n<li>Symptom: Repeated manual toil. Root cause: No automation for common fixes. Fix: Script and automate restarts, scaling, compaction scheduling.<\/li>\n<\/ol>\n\n\n\n<p>Observability-specific pitfalls:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Symptom: Metrics cardinality explosion. Root cause: High label cardinality. Fix: Reduce labels and use relabeling.  <\/li>\n<li>Symptom: Missing trace context. Root cause: Incomplete instrumentation. 
Fix: Propagate trace IDs through ingestion tasks.  <\/li>\n<li>Symptom: Logs not correlated to metrics. Root cause: No task IDs in logs. Fix: Add structured fields with task and segment IDs.  <\/li>\n<li>Symptom: Alert storms on rolling deploys. Root cause: Lack of alert suppression. Fix: Suppress alerts during deploy windows and use deployment hooks.  <\/li>\n<li>Symptom: Inaccurate SLI measurement. Root cause: Wrong scrape intervals. Fix: Align metric scrape frequency to SLO needs.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices &amp; Operating Model<\/h2>\n\n\n\n<p>Ownership and on-call:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data platform team owns cluster operations and runbooks.<\/li>\n<li>Product teams own ingestion specs and schema evolution.<\/li>\n<li>On-call rotations include both SREs and data engineers for fast escalations.<\/li>\n<\/ul>\n\n\n\n<p>Runbooks vs playbooks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Runbook: Step-by-step technical actions to resolve known issues.<\/li>\n<li>Playbook: Higher-level decision guidance for escalations and communications.<\/li>\n<\/ul>\n\n\n\n<p>Safe deployments:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use canary or blue\/green deploys for brokers and coordinators.<\/li>\n<li>Roll back fast with prebuilt scripts and automated health checks.<\/li>\n<\/ul>\n\n\n\n<p>Toil reduction and automation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Automate compaction scheduling, segment rebalancing, and indexer restarts.<\/li>\n<li>Implement autoscaling for middle managers and historical nodes.<\/li>\n<\/ul>\n\n\n\n<p>Security basics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable TLS for internal and external traffic.<\/li>\n<li>Use role-based ACLs and restrict admin APIs.<\/li>\n<li>Encrypt sensitive data at rest in deep storage if required.<\/li>\n<\/ul>\n\n\n\n<p>Weekly\/monthly routines:<\/p>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Weekly: Review SLOs, check failed ingestion tasks.<\/li>\n<li>Monthly: Compaction health, segment replication ratio, metadata DB backups.<\/li>\n<li>Quarterly: Chaos tests and validation of disaster recovery procedures.<\/li>\n<\/ul>\n\n\n\n<p>Postmortem reviews:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Review root cause, detection and mitigation time, and follow-up actions.<\/li>\n<li>Validate if SLOs and alerts need adjustments.<\/li>\n<li>Add learnings to runbooks and automate recurring fixes.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Tooling &amp; Integration Map for Apache Druid<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Category<\/th>\n<th>What it does<\/th>\n<th>Key integrations<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>I1<\/td>\n<td>Stream Ingest<\/td>\n<td>Event streaming source<\/td>\n<td>Kafka, PubSub, Kinesis<\/td>\n<td>Core realtime ingestion sources<\/td>\n<\/tr>\n<tr>\n<td>I2<\/td>\n<td>Deep Storage<\/td>\n<td>Durable segment store<\/td>\n<td>S3, GCS, HDFS<\/td>\n<td>Required for recovery<\/td>\n<\/tr>\n<tr>\n<td>I3<\/td>\n<td>Metadata DB<\/td>\n<td>Stores cluster metadata<\/td>\n<td>MySQL, Postgres<\/td>\n<td>Needs HA<\/td>\n<\/tr>\n<tr>\n<td>I4<\/td>\n<td>Orchestration<\/td>\n<td>Cluster deployment<\/td>\n<td>Kubernetes, Terraform<\/td>\n<td>K8s common for cloud-native<\/td>\n<\/tr>\n<tr>\n<td>I5<\/td>\n<td>Metrics<\/td>\n<td>Monitoring and alerting<\/td>\n<td>Prometheus, Cloud Monitoring<\/td>\n<td>Exposes Druid metrics<\/td>\n<\/tr>\n<tr>\n<td>I6<\/td>\n<td>Dashboards<\/td>\n<td>Visualization<\/td>\n<td>Grafana, Superset<\/td>\n<td>Common BI frontends<\/td>\n<\/tr>\n<tr>\n<td>I7<\/td>\n<td>Logging<\/td>\n<td>Centralized logs<\/td>\n<td>ELK, Fluentd<\/td>\n<td>Task logs and process logs<\/td>\n<\/tr>\n<tr>\n<td>I8<\/td>\n<td>Tracing<\/td>\n<td>Request 
tracing<\/td>\n<td>OpenTelemetry, Jaeger<\/td>\n<td>Debug slow queries<\/td>\n<\/tr>\n<tr>\n<td>I9<\/td>\n<td>Chaos<\/td>\n<td>Resilience testing<\/td>\n<td>Chaos Mesh, Litmus<\/td>\n<td>Validates runbooks<\/td>\n<\/tr>\n<tr>\n<td>I10<\/td>\n<td>CI\/CD<\/td>\n<td>Config and deployment<\/td>\n<td>ArgoCD, Jenkins<\/td>\n<td>GitOps for configs<\/td>\n<\/tr>\n<tr>\n<td>I11<\/td>\n<td>Security<\/td>\n<td>AuthN\/AuthZ<\/td>\n<td>LDAP, OAuth2<\/td>\n<td>Protect APIs<\/td>\n<\/tr>\n<tr>\n<td>I12<\/td>\n<td>Cost mgmt<\/td>\n<td>Cloud cost visibility<\/td>\n<td>Cloud billing tools<\/td>\n<td>Important for segment replication<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions (FAQs)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What kinds of queries are fastest in Druid?<\/h3>\n\n\n\n<p>Aggregations and group-bys over time windows and filtered slices are fastest, especially with rollups and bitmap indexes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can Druid replace my data warehouse?<\/h3>\n\n\n\n<p>Not always. 
Druid is optimized for interactive analytics and streaming ingestion, but it lacks full data warehouse features such as complex multi-table joins and general-purpose storage of large relational datasets.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How does Druid handle schema changes?<\/h3>\n\n\n\n<p>Schema changes require reindexing or schema-evolution patterns; immediate structural changes may need backfills.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is Druid suitable for high-cardinality dimensions?<\/h3>\n\n\n\n<p>It can handle high cardinality with tuning, but costs increase; consider rollups or alternative designs for very high cardinality.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Does Druid support SQL?<\/h3>\n\n\n\n<p>Yes, Druid exposes an ANSI-like SQL interface, though some behaviors differ from a traditional RDBMS.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do I secure a Druid cluster?<\/h3>\n\n\n\n<p>Use TLS, ACLs, and secure deep storage; control admin APIs and rotate credentials.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How do you size a Druid cluster?<\/h3>\n\n\n\n<p>Sizing depends on ingestion rate, query concurrency, and retention; perform load tests and iterate on node types.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What deep storage should I pick?<\/h3>\n\n\n\n<p>S3 or GCS are common for cloud; HDFS for on-prem. 
It must be durable above all; Historical nodes serve queries from locally cached segments, so deep storage latency mainly affects segment loads and handoff rather than query serving.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to reduce query costs?<\/h3>\n\n\n\n<p>Use rollups, materialized views, caching, and pre-aggregation to reduce scan sizes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can Druid run on Kubernetes?<\/h3>\n\n\n\n<p>Yes; many deployments use Kubernetes with StatefulSets and persistent volumes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to handle backups?<\/h3>\n\n\n\n<p>Back up the metadata DB, and ensure deep storage is durable; maintain segment replication if needed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What causes ingestion lag?<\/h3>\n\n\n\n<p>Indexing task slowness, resource starvation, or deep storage slowdowns are common causes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How are segments compacted?<\/h3>\n\n\n\n<p>Compaction jobs rewrite segments to optimize size and reduce file count; schedule them to limit disruption.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What are common observability signals?<\/h3>\n\n\n\n<p>Query latencies, ingestion lag, Kafka consumer lag, JVM metrics, and segment availability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to test Druid at scale?<\/h3>\n\n\n\n<p>Use synthetic load generators for ingestion and query workloads; run chaos tests for resilience.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is Druid good for ad-hoc SQL exploration?<\/h3>\n\n\n\n<p>Yes; with good design you can get near-interactive SQL experiences for analysts.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to migrate from another analytics store?<\/h3>\n\n\n\n<p>Plan re-ingestion strategies, compare query semantics, and perform parallel testing.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What are best practices for production upgrades?<\/h3>\n\n\n\n<p>Canary deploy new versions of brokers and historical nodes, verify metrics, and have rollback plans.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 
class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Apache Druid is a purpose-built analytics store for fast, time-centric aggregation on streaming and historical event data. It fits modern cloud-native stacks, supports real-time ML feature serving, and requires disciplined ops for scaling, observability, and security.<\/p>\n\n\n\n<p>Next 7 days plan:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Day 1: Define key SLIs and set up Prometheus scrapes for Druid metrics.<\/li>\n<li>Day 2: Deploy a small Druid cluster in staging with deep storage and metadata DB.<\/li>\n<li>Day 3: Implement a simple Kafka supervisor ingestion and validate data availability.<\/li>\n<li>Day 4: Create executive and on-call dashboards with baseline panels.<\/li>\n<li>Day 5: Configure SLOs, alerts, and run a synthetic query load.<\/li>\n<li>Day 6: Prepare runbooks for top 3 incidents and automate one common remediation.<\/li>\n<li>Day 7: Run a short chaos test (node restart) and perform a postmortem.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix \u2014 Apache Druid Keyword Cluster (SEO)<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Primary keywords<\/li>\n<li>Apache Druid<\/li>\n<li>Druid analytics<\/li>\n<li>Druid real-time database<\/li>\n<li>Druid architecture<\/li>\n<li>\n<p>Druid tutorial<\/p>\n<\/li>\n<li>\n<p>Secondary keywords<\/p>\n<\/li>\n<li>Druid ingestion<\/li>\n<li>Druid segments<\/li>\n<li>Druid broker<\/li>\n<li>Druid historical node<\/li>\n<li>Druid coordinator<\/li>\n<li>Druid Overlord<\/li>\n<li>Druid metadata store<\/li>\n<li>Druid deep storage<\/li>\n<li>Druid compaction<\/li>\n<li>\n<p>Druid rollup<\/p>\n<\/li>\n<li>\n<p>Long-tail questions<\/p>\n<\/li>\n<li>how to tune Apache Druid for latency<\/li>\n<li>best practices for Druid compaction<\/li>\n<li>Druid vs ClickHouse for analytics<\/li>\n<li>setting up Druid on Kubernetes<\/li>\n<li>Druid ingestion from Kafka<\/li>\n<li>measuring Druid SLOs and 
SLIs<\/li>\n<li>Druid segmentation and partitioning guide<\/li>\n<li>Druid query optimization tips<\/li>\n<li>how to secure Apache Druid cluster<\/li>\n<li>\n<p>Druid runbook examples for incidents<\/p>\n<\/li>\n<li>\n<p>Related terminology<\/p>\n<\/li>\n<li>segment lifecycle<\/li>\n<li>time chunking<\/li>\n<li>rollup strategy<\/li>\n<li>bitmap indexing<\/li>\n<li>vectorized query engine<\/li>\n<li>broker merge<\/li>\n<li>indexing task<\/li>\n<li>supervisor<\/li>\n<li>middle manager<\/li>\n<li>peon worker<\/li>\n<li>deep storage replication<\/li>\n<li>metadata DB backups<\/li>\n<li>query context parameters<\/li>\n<li>JVM tuning for Druid<\/li>\n<li>Druid SQL support<\/li>\n<li>real-time ingestion pipeline<\/li>\n<li>historical node caching<\/li>\n<li>segment availability metrics<\/li>\n<li>ingestion lag metric<\/li>\n<li>\n<p>query success rate metric<\/p>\n<\/li>\n<li>\n<p>Deployment terms<\/p>\n<\/li>\n<li>Druid Helm chart<\/li>\n<li>Kubernetes StatefulSet Druid<\/li>\n<li>Druid on AWS<\/li>\n<li>Druid on GCP<\/li>\n<li>Druid operator<\/li>\n<li>\n<p>Druid cluster sizing<\/p>\n<\/li>\n<li>\n<p>Observability phrases<\/p>\n<\/li>\n<li>Druid Prometheus exporter<\/li>\n<li>Druid Grafana dashboard<\/li>\n<li>Druid OpenTelemetry<\/li>\n<li>Druid logging best practices<\/li>\n<li>\n<p>Druid tracing for ingestion<\/p>\n<\/li>\n<li>\n<p>Security and governance<\/p>\n<\/li>\n<li>Druid ACL configuration<\/li>\n<li>Druid TLS setup<\/li>\n<li>Druid authentication<\/li>\n<li>Druid authorization<\/li>\n<li>\n<p>Druid data retention policies<\/p>\n<\/li>\n<li>\n<p>Scaling and performance<\/p>\n<\/li>\n<li>Druid autoscaling strategies<\/li>\n<li>Druid compaction tuning<\/li>\n<li>optimizing Druid queries<\/li>\n<li>Druid segment optimization<\/li>\n<li>\n<p>reducing Druid storage cost<\/p>\n<\/li>\n<li>\n<p>Integration phrases<\/p>\n<\/li>\n<li>Druid Kafka supervisor integration<\/li>\n<li>Druid S3 deep storage<\/li>\n<li>Druid with Superset<\/li>\n<li>Druid with 
Grafana<\/li>\n<li>\n<p>Druid with Trino<\/p>\n<\/li>\n<li>\n<p>SRE and process<\/p>\n<\/li>\n<li>Druid SLO example<\/li>\n<li>Druid incident runbook<\/li>\n<li>Druid postmortem checklist<\/li>\n<li>Druid chaos testing<\/li>\n<li>\n<p>Druid monitoring playbook<\/p>\n<\/li>\n<li>\n<p>Cost and ops<\/p>\n<\/li>\n<li>Druid cost optimization<\/li>\n<li>Druid storage cost per TB<\/li>\n<li>Druid run cost estimate<\/li>\n<li>\n<p>Druid resource planning<\/p>\n<\/li>\n<li>\n<p>Data modeling<\/p>\n<\/li>\n<li>Druid dimension design<\/li>\n<li>Druid metric types<\/li>\n<li>Druid time granularity<\/li>\n<li>\n<p>Druid rollup tradeoffs<\/p>\n<\/li>\n<li>\n<p>Migration and interoperability<\/p>\n<\/li>\n<li>migrate to Druid<\/li>\n<li>Druid versus data warehouse<\/li>\n<li>Druid with existing BI tools<\/li>\n<li>\n<p>Druid compatibility with SQL engines<\/p>\n<\/li>\n<li>\n<p>Misc<\/p>\n<\/li>\n<li>Druid community updates 2026<\/li>\n<li>Druid cloud-native patterns<\/li>\n<li>Druid for ML feature serving<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&#8212;<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-3623","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/3623","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=3623"}],"version-history":[{"count":0,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/3623\/revisions"}],"wp:attachme
nt":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=3623"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=3623"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=3623"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}