Here’s a comprehensive glossary of the key data platform, engineering, and analytics terms we discussed, covering both your earlier questions and the expanded list. Each keyword includes a simple explanation, giving you a full “cheat sheet” of modern data terminology.
## Complete Data Glossary & Terminology
| Keyword | Meaning / Description |
|---|---|
| Raw Data Sources | The origin of data (APIs, apps, logs, sensors, files, databases, etc.) |
| Data Lake | A storage system for raw data in any form: structured, semi-structured, or unstructured (CSV, JSON, images, logs, etc.) |
| Data Lakehouse | Combines a data lake’s flexibility with data warehouse reliability (e.g., Databricks Lakehouse) |
| Data Warehouse | Centralized, structured storage for cleaned and processed data, optimized for analytics and reporting |
| Data Mart | A subset of a data warehouse, focused on a specific business unit or subject area |
| Delta Lake | An open-source storage layer (created by Databricks, now a Linux Foundation project) that brings ACID transactions to data lakes |
| ETL (Extract, Transform, Load) | Process of extracting data from sources, transforming it, and loading it into a warehouse or lake |
| ELT (Extract, Load, Transform) | Similar to ETL, but data is loaded first, then transformed inside the data warehouse/lakehouse |
| Data Pipeline | A workflow or series of steps that move and process data automatically |
| Data Engineering | The discipline of building systems/pipelines to collect, clean, transform, and store data for further use |
| Data Analytics | The practice of analyzing and visualizing data to find insights and support business decisions |
| Business Intelligence (BI) | Tools and processes for building dashboards, visualizations, and reports from data (e.g., Tableau, Power BI) |
| Data Science | The field of applying statistics and machine learning to data for predictions, classification, and modeling |
| Machine Learning (ML) | Techniques and models that enable computers to learn from data and make predictions or decisions |
| Streaming Data | Data that is generated and processed continuously in real time, e.g., IoT sensors, logs, transactions |
| Batch Processing | Processing data in discrete chunks or batches, often on a schedule |
| Structured Streaming | Apache Spark’s method for processing streaming data with the same API as batch data |
| Auto Loader | Databricks’ tool to automatically ingest new files as they land in a data lake |
| Delta Live Tables (DLT) | Databricks’ framework for declarative, automated data pipeline development and management |
| Orchestration | Managing and scheduling sequences of data processing tasks or pipelines (e.g., Airflow, Databricks Workflows) |
| Data Platform | An environment that combines data storage, processing, analytics, ML, and governance (e.g., Databricks, Snowflake) |
| Data Analytics Platform | Synonym for “data platform”—emphasizes analytics and ML as part of the solution |
| Data Governance | Policies and controls for data access, privacy, quality, security, cataloging, and compliance |
| Data Quality | Ensuring data is accurate, complete, consistent, and reliable |
| Data Catalog | An organized inventory of all data assets (tables, files, fields), enabling data discovery and management |
| Metadata | “Data about data,” such as schema, owner, creation date, usage, etc. |
| Schema | The structure/definition of a dataset (field names, types, constraints) |
| Data Lineage | The record of where data comes from, how it moves, and how it’s transformed over time |
| Data Stewardship | Responsibility for managing and overseeing the proper use of data assets |
| Data Masking | Obscuring or hiding sensitive data values to protect privacy |
| Data Encryption | Securing data (at rest/in transit) by encoding it so only authorized parties can read it |
| Role-Based Access Control (RBAC) | Security system that restricts data access based on user roles/permissions |
| Table Access Control | Permissions system for controlling who can read/write specific tables/views |
| Unity Catalog | Databricks’ unified data governance solution for managing access, auditing, and cataloging |
| Data Sharing | The ability to share datasets securely between teams, organizations, or platforms (e.g., Delta Sharing) |
| Data Mesh | A decentralized data architecture that assigns ownership and responsibility to domain teams |
| Data Fabric | An architecture and set of data services providing integrated, consistent data management across the enterprise |
| DataOps | Application of DevOps practices (automation, CI/CD, monitoring) to data engineering and pipelines |
| MLOps | DevOps for machine learning: automating deployment, monitoring, and lifecycle management of models |
| Master Data Management (MDM) | Ensuring a consistent, accurate “single source of truth” for core business data |
| Semantic Layer | An abstracted data model providing consistent business logic and definitions to analytics users |
| Data Steward | A person responsible for the quality and governance of specific data assets |
| Data Orchestration | See Orchestration above |
| Data Integration | Combining data from different sources and formats into a unified view |
| Data Transformation | Changing, cleaning, or converting data from one format to another |
| Data Ingestion | The process of importing or bringing new data into your platform/lake/warehouse |
| Data Cleansing | Identifying and correcting errors, duplicates, or inconsistencies in data |
| Data Modeling | Defining and structuring how data is stored, connected, and related |
| Data Partitioning | Dividing large datasets into segments (partitions) for faster processing and querying |
| Z-Ordering | A technique (in Delta Lake) to co-locate related data in storage for performance optimization |
| Time Travel | The ability (in Delta Lake) to query or restore data as it existed at a previous point in time |
| Versioning | Keeping track of changes to data or tables so you can audit or roll back |
| Compaction | Merging many small files into fewer, larger files for performance (common in Delta Lake, Hudi, Iceberg) |
| Partition Pruning | Query optimization by skipping partitions of data that don’t match filter criteria |
| Data Skipping | Optimization by skipping irrelevant data blocks during query |
| Broadcast Join | A Spark optimization where a small table is sent to all nodes for faster joins |
| Checkpointing | Periodically saving the state of a stream or computation for fault tolerance |
| Pipeline DAG | Directed Acyclic Graph; visualizes the dependencies and order of tasks in a data pipeline |
| Job Cluster | A cluster in Databricks that’s spun up for a specific job and then terminated |
| All-Purpose Cluster | A cluster for interactive work, like notebooks and ad-hoc queries |
| Serverless SQL Warehouse | Databricks’ serverless endpoint for SQL analytics, scalable and managed |
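To make the ETL and ELT entries concrete, here is a minimal, hypothetical sketch in plain Python: the same extract, transform, and load steps, just ordered differently. All names (`extract`, `raw_source`, the in-memory "warehouse" lists) are illustrative, not from any real library; production pipelines would use tools like Spark or dbt.

```python
# Minimal ETL vs. ELT sketch using plain Python structures.
# All names are illustrative; real pipelines use Spark, dbt, etc.

raw_source = [
    {"id": "1", "amount": " 100 "},
    {"id": "2", "amount": "250"},
]

def extract(source):
    """Pull rows from the source system as-is."""
    return list(source)

def transform(rows):
    """Clean and type-convert rows (trim strings, cast numbers)."""
    return [{"id": int(r["id"]), "amount": float(r["amount"].strip())}
            for r in rows]

def load(rows, target):
    """Write rows into the target store."""
    target.extend(rows)

# ETL: transform *before* loading into the warehouse.
etl_warehouse = []
load(transform(extract(raw_source)), etl_warehouse)

# ELT: load the raw rows first, then transform inside the target.
elt_warehouse = []
load(extract(raw_source), elt_warehouse)
elt_warehouse[:] = transform(elt_warehouse)
```

Both orderings end with the same cleaned rows; the practical difference is *where* the transformation runs (an external tool for ETL, the warehouse/lakehouse engine itself for ELT).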
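The Data Masking entry can also be illustrated with a toy function that hides all but the last few characters of a sensitive value. This is only a sketch: real platforms apply masking through governance policies (for example, column masks in Unity Catalog), not ad-hoc code, and the `mask` function here is made up for illustration.

```python
# Toy data-masking sketch: hide all but the last `visible`
# characters of a sensitive string.

def mask(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]

print(mask("4111111111111111"))  # a card-number-like string
```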
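Data Partitioning and Partition Pruning go together, and a small in-memory sketch shows the idea: bucket rows by a partition key, then answer a filtered query by reading only the matching bucket. The dictionary-of-lists layout and the `total_amount` helper are assumptions for illustration; engines like Spark implement this with partitioned file layouts and query planning.

```python
# Sketch of data partitioning and partition pruning: split a dataset
# by a key, then answer a filtered query from one partition only.
from collections import defaultdict

rows = [
    {"country": "US", "amount": 10},
    {"country": "DE", "amount": 20},
    {"country": "US", "amount": 30},
]

# Partitioning: split the dataset by a key (here, `country`).
partitions = defaultdict(list)
for row in rows:
    partitions[row["country"]].append(row)

# Partition pruning: a query filtered on the partition key touches
# only the matching partition instead of scanning every row.
def total_amount(country):
    return sum(r["amount"] for r in partitions.get(country, []))

print(total_amount("US"))  # reads only the "US" partition
```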
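Role-Based Access Control can likewise be reduced to a few lines: permissions attach to roles, and a user's access is checked through the roles they hold. The role and permission names below are invented for the example; real systems (Unity Catalog, cloud IAM) express the same idea through grants and policies.

```python
# Minimal RBAC sketch: permissions are attached to roles, and a
# user's access is resolved through their role memberships.
# Role and permission names are made up for illustration.

ROLE_PERMISSIONS = {
    "analyst": {"read_table"},
    "engineer": {"read_table", "write_table"},
}

USER_ROLES = {
    "alice": {"engineer"},
    "bob": {"analyst"},
}

def can(user: str, permission: str) -> bool:
    """True if any of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )

print(can("bob", "write_table"))  # analysts can only read
```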
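Finally, Time Travel and Versioning can be pictured as a table that keeps every past state queryable. The snapshot-per-write class below is a deliberately naive sketch; Delta Lake achieves the same effect far more efficiently with a transaction log rather than full copies.

```python
# Toy versioning / "time travel" sketch: each write records a new
# version, and older versions remain queryable. Delta Lake uses a
# transaction log, not full snapshots; this only shows the concept.

class VersionedTable:
    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def write(self, rows):
        """Record a new version holding the full new table state."""
        self._versions.append(list(rows))

    def read(self, version=None):
        """Read the latest version, or an earlier one (time travel)."""
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version]

t = VersionedTable()
t.write([{"id": 1}])
t.write([{"id": 1}, {"id": 2}])
print(t.read(version=1))  # the table as it looked after the first write
```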