Here’s a comprehensive glossary of the key data platform, engineering, and analytics terms we discussed, covering both your earlier questions and the expanded list. Each keyword has a short, plain-language explanation, giving you a full “cheat sheet” of modern data terminology; a few of the hands-on terms are also illustrated with small code sketches after the table.
Complete Data Glossary & Terminology
Keyword | Meaning / Description |
---|---|
Raw Data Sources | The origin of data (APIs, apps, logs, sensors, files, databases, etc.) |
Data Lake | A storage system for raw or semi-structured data, supporting all formats (CSV, JSON, images, logs, etc.) |
Data Lakehouse | Combines a data lake’s flexibility with data warehouse reliability (e.g., Databricks Lakehouse) |
Data Warehouse | Centralized, structured storage for cleaned and processed data, optimized for analytics and reporting |
Data Mart | A subset of a data warehouse, focused on a specific business unit or subject area |
Delta Lake | An open-source storage layer, originally created by Databricks, that brings ACID transactions to data lakes |
ETL (Extract, Transform, Load) | Process of extracting data from sources, transforming it, and loading it into a warehouse or lake (see the first sketch after the table) |
ELT (Extract, Load, Transform) | Like ETL, but data is loaded first and then transformed inside the data warehouse/lakehouse (same sketch below) |
Data Pipeline | A workflow or series of steps that move and process data automatically |
Data Engineering | The discipline of building systems/pipelines to collect, clean, transform, and store data for further use |
Data Analytics | The practice of analyzing and visualizing data to find insights and support business decisions |
Business Intelligence (BI) | Tools and processes for building dashboards, visualizations, and reports from data (e.g., Tableau, Power BI) |
Data Science | The field of applying statistics and machine learning to data for predictions, classification, and modeling |
Machine Learning (ML) | Techniques and models that enable computers to learn from data and make predictions or decisions |
Streaming Data | Data that is generated and processed in real time as a continuous flow, e.g., IoT, logs, transactions |
Batch Processing | Processing data in discrete chunks or batches, often on a schedule |
Structured Streaming | Apache Spark’s method for processing streaming data with the same API as batch data (see sketch below) |
Auto Loader | Databricks’ tool to incrementally ingest new files as they land in a data lake (see sketch below) |
Delta Live Tables (DLT) | Databricks’ framework for declarative, automated data pipeline development and management (see sketch below) |
Orchestration | Managing and scheduling sequences of data processing tasks or pipelines (e.g., Airflow, Databricks Workflows; see the Airflow sketch below) |
Data Platform | An environment that combines data storage, processing, analytics, ML, and governance (e.g., Databricks, Snowflake) |
Data Analytics Platform | Synonym for “data platform,” emphasizing analytics and ML as part of the solution |
Data Governance | Policies and controls for data access, privacy, quality, security, cataloging, and compliance |
Data Quality | Ensuring data is accurate, complete, consistent, and reliable |
Data Catalog | An organized inventory of all data assets (tables, files, fields), enabling data discovery and management |
Metadata | “Data about data,” such as schema, owner, creation date, usage, etc. |
Schema | The structure/definition of a dataset: field names, types, constraints (see sketch below) |
Data Lineage | The record of where data comes from, how it moves, and how it’s transformed over time |
Data Stewardship | Responsibility for managing and overseeing the proper use of data assets |
Data Masking | Obscuring or hiding sensitive data values to protect privacy (see sketch below) |
Data Encryption | Securing data (at rest/in transit) by encoding it so only authorized parties can read it |
Role-Based Access Control (RBAC) | Security system that restricts data access based on user roles/permissions |
Table Access Control | Permissions system for controlling who can read/write specific tables/views (see sketch below) |
Unity Catalog | Databricks’ unified data governance solution for managing access, auditing, and cataloging |
Data Sharing | The ability to share datasets securely between teams, organizations, or platforms (e.g., Delta Sharing) |
Data Mesh | A decentralized data architecture that assigns ownership and responsibility to domain teams |
Data Fabric | An architecture and set of data services providing integrated, consistent data management across the enterprise |
DataOps | Application of DevOps practices (automation, CI/CD, monitoring) to data engineering and pipelines |
MLOps | DevOps for machine learning: automating deployment, monitoring, and lifecycle management of models |
Master Data Management (MDM) | Ensuring a consistent, accurate “single source of truth” for core business data |
Semantic Layer | An abstracted data model providing consistent business logic and definitions to analytics users |
Data Steward | A person responsible for the quality and governance of specific data assets |
Data Orchestration | See Orchestration above |
Data Integration | Combining data from different sources and formats into a unified view |
Data Transformation | Changing, cleaning, or converting data from one format to another |
Data Ingestion | The process of importing or bringing new data into your platform/lake/warehouse |
Data Cleansing | Identifying and correcting errors, duplicates, or inconsistencies in data |
Data Modeling | Defining and structuring how data is stored, connected, and related |
Data Partitioning | Dividing large datasets into segments (partitions) for faster processing and querying (see sketch below) |
Z-Ordering | A Delta Lake technique that co-locates related data in storage for faster queries (see sketch below) |
Time Travel | The ability (in Delta Lake) to query or restore data as it existed at a previous point in time (see sketch below) |
Versioning | Keeping track of changes to data or tables so you can audit or roll back |
Compaction | Merging many small files into fewer, larger files for performance, common in Delta Lake, Hudi, and Iceberg (see sketch below) |
Partition Pruning | Query optimization that skips partitions whose data can’t match the filter criteria (shown in the partitioning sketch below) |
Data Skipping | Optimization that skips irrelevant data blocks during a query |
Broadcast Join | A Spark optimization where a small table is sent to all nodes for faster joins (see sketch below) |
Checkpointing | Periodically saving the state of a stream or computation for fault tolerance (shown in the streaming sketch below) |
Pipeline DAG | Directed Acyclic Graph; visualizes the dependencies and order of tasks in a data pipeline (see the Airflow sketch below) |
Job Cluster | A cluster in Databricks that’s spun up for a specific job and then terminated |
All-Purpose Cluster | A cluster for interactive work, like notebooks and ad-hoc queries |
Serverless SQL Warehouse | Databricks’ fully managed, auto-scaling serverless compute endpoint for SQL analytics |
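
Code Sketches for Selected Terms

The sketches below are minimal illustrations, not production code: every path, table, and column name (e.g., /landing/orders.csv, analytics.orders_clean) is hypothetical, and a PySpark environment is assumed throughout.

First, ETL vs. ELT. The ETL version transforms the data before loading it; the ELT variant loads the raw data as-is and transforms it afterwards with SQL, which assumes an engine (e.g., Databricks/Delta Lake) that supports CREATE OR REPLACE TABLE.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV from a (hypothetical) landing zone
raw = spark.read.option("header", True).csv("/landing/orders.csv")

# Transform: fix types and drop bad rows *before* loading (the "T" before the "L")
clean = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .dropna(subset=["order_id", "amount"]))

# Load: write the curated result as a warehouse table
clean.write.mode("overwrite").saveAsTable("analytics.orders_clean")

# ELT variant: load the raw data first, then transform it inside the warehouse
raw.write.mode("overwrite").saveAsTable("staging.orders_raw")
spark.sql("""
    CREATE OR REPLACE TABLE analytics.orders_clean AS
    SELECT order_id, CAST(amount AS DOUBLE) AS amount
    FROM staging.orders_raw
    WHERE order_id IS NOT NULL AND amount IS NOT NULL
""")
```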
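
Structured Streaming uses the same DataFrame API as batch; the checkpointLocation option is what gives the query its fault tolerance (see Checkpointing above). Paths, schema, and table names are made up, and .toTable() needs Spark 3.1+.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Treat a folder of JSON files as an unbounded stream; streaming reads
# need an explicit schema (a DDL string works)
events = (spark.readStream
          .schema("event_id STRING, ts TIMESTAMP, payload STRING")
          .json("/landing/events/"))

# Continuously append to a Delta table; the checkpoint lets the query
# resume from where it left off after a restart or failure
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/events/")
         .outputMode("append")
         .toTable("bronze.events"))

query.awaitTermination()
```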
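
Auto Loader is Databricks-only: the cloudFiles source tracks which files it has already seen and ingests only the new ones. This sketch assumes it runs in a Databricks notebook, where spark is predefined; paths and table names are hypothetical.

```python
# "cloudFiles" is the Auto Loader source; schemaLocation stores the inferred schema
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/chk/events_schema/")
          .load("/landing/events/"))

# availableNow processes everything pending and then stops, giving
# incremental batch-style ingestion on top of a streaming source
(stream.writeStream
 .option("checkpointLocation", "/chk/autoloader/")
 .trigger(availableNow=True)
 .toTable("bronze.events"))
```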
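
A Delta Live Tables pipeline is declared rather than scripted: each decorated function defines a table, and DLT works out the dependency graph and runs it. This code executes only inside a Databricks DLT pipeline, not as a standalone script; all names are hypothetical.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders from the landing zone")
def orders_raw():
    return spark.read.json("/landing/orders/")

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # declarative quality rule: failing rows are dropped
def orders_clean():
    return (dlt.read("orders_raw")
            .withColumn("amount", F.col("amount").cast("double")))
```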
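
A schema in practice: an explicit PySpark StructType that pins down field names, types, and nullability instead of relying on inference.

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

order_schema = StructType([
    StructField("order_id", StringType(),    nullable=False),
    StructField("amount",   DoubleType(),    nullable=True),
    StructField("ts",       TimestampType(), nullable=True),
])

# Reading with an explicit schema is faster and safer than letting Spark infer one
orders = spark.read.schema(order_schema).json("/landing/orders/")
```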
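
One simple approach to masking, assuming a DataFrame orders with hypothetical email and card_number columns: hash what you only need for joins, and redact what you only need partially.

```python
from pyspark.sql import functions as F

masked = (orders
          # irreversible hash: still usable as a join key, no longer readable
          .withColumn("email_hash", F.sha2(F.col("email"), 256))
          .drop("email")
          # keep only the last four digits of the card number
          .withColumn("card_masked",
                      F.concat(F.lit("****-****-****-"),
                               F.substring("card_number", -4, 4)))
          .drop("card_number"))
```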
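
Table access control boils down to GRANT/REVOKE statements against a principal (a user or group). This is Databricks SQL syntax; the table and group names are hypothetical.

```python
# Analysts may read the curated table; an earlier grant is revoked
spark.sql("GRANT SELECT ON TABLE analytics.orders_clean TO `analysts`")
spark.sql("REVOKE SELECT ON TABLE analytics.orders_clean FROM `interns`")
```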
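
Partitioning and partition pruning work together: write the data split by a column, then filter on that column so the engine can skip every non-matching partition. The orders DataFrame and paths are the hypothetical ones from earlier sketches.

```python
# Physically lay the table out as one folder per order_date value
(orders.write
 .format("delta")
 .partitionBy("order_date")
 .mode("overwrite")
 .save("/tables/orders/"))

# The filter is on the partition column, so only that one partition is scanned
jan15 = (spark.read.format("delta").load("/tables/orders/")
         .where("order_date = '2024-01-15'"))
```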
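
Compaction, Z-Ordering, and time travel are each one-liners against a Delta table (Databricks-flavored SQL shown; the table name is hypothetical).

```python
# Compaction: merge many small files into fewer large ones
spark.sql("OPTIMIZE bronze.events")

# Z-Ordering: co-locate rows with similar event_id values to speed up lookups
spark.sql("OPTIMIZE bronze.events ZORDER BY (event_id)")

# Time travel: read an earlier version of the table, or roll back to it
old = spark.sql("SELECT * FROM bronze.events VERSION AS OF 3")
spark.sql("RESTORE TABLE bronze.events TO VERSION AS OF 3")
```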
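
A broadcast join ships the small table to every executor so the big table never has to be shuffled. The tiny facts and dims DataFrames here are invented for the demo.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.createDataFrame([(1, 100.0), (2, 75.0)], ["product_id", "amount"])
dims  = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["product_id", "name"])

# dims is small enough to copy to every executor, so facts is never shuffled
joined = facts.join(F.broadcast(dims), on="product_id", how="left")
joined.show()
```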
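
Finally, orchestration and the pipeline DAG: a minimal Apache Airflow 2.x sketch in which the task bodies are placeholders and transform is declared to depend on ingest.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("extract + load raw data")

def transform():
    print("clean and model the data")

with DAG(dag_id="daily_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
         catchup=False) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t1 >> t2  # the DAG edge: transform runs only after ingest succeeds
```

The `>>` operator is what builds the Directed Acyclic Graph: each edge records a dependency, and the scheduler runs tasks in an order that respects all of them.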