Here’s a comprehensive glossary of the key data platform, engineering, and analytics terms we discussed, covering both your earlier questions and the expanded list. Each keyword has a short, plain-language explanation, giving you a full “cheat sheet” of modern data terminology; a few of the hands-on terms are also illustrated with small code sketches after the table.
Complete Data Glossary & Terminology
Keyword | Meaning / Description |
---|---|
Raw Data Sources | The origin of data (APIs, apps, logs, sensors, files, databases, etc.) |
Data Lake | A storage system for raw or semi-structured data, supporting all formats (CSV, JSON, images, logs, etc.) |
Data Lakehouse | Combines a data lake’s flexibility with data warehouse reliability (e.g., Databricks Lakehouse) |
Data Warehouse | Centralized, structured storage for cleaned and processed data, optimized for analytics and reporting |
Data Mart | A subset of a data warehouse, focused on a specific business unit or subject area |
Delta Lake | An open-source storage layer, originally created by Databricks, that brings ACID transactions to data lakes |
ETL (Extract, Transform, Load) | Process of extracting data from sources, transforming it, and loading it into a warehouse or lake (see the first sketch after the table) |
ELT (Extract, Load, Transform) | Like ETL, but data is loaded first and then transformed inside the data warehouse/lakehouse (same sketch below) |
Data Pipeline | A workflow or series of steps that move and process data automatically |
Data Engineering | The discipline of building systems/pipelines to collect, clean, transform, and store data for further use |
Data Analytics | The practice of analyzing and visualizing data to find insights and support business decisions |
Business Intelligence (BI) | Tools and processes for building dashboards, visualizations, and reports from data (e.g., Tableau, Power BI) |
Data Science | The field of applying statistics and machine learning to data for predictions, classification, and modeling |
Machine Learning (ML) | Techniques and models that enable computers to learn from data and make predictions or decisions |
Streaming Data | Data that is generated and processed in real time as a continuous flow, e.g., IoT, logs, transactions |
Batch Processing | Processing data in discrete chunks or batches, often on a schedule |
Structured Streaming | Apache Spark’s method for processing streaming data with the same API as batch data (see sketch below) |
Auto Loader | Databricks’ tool to incrementally ingest new files as they land in a data lake (see sketch below) |
Delta Live Tables (DLT) | Databricks’ framework for declarative, automated data pipeline development and management (see sketch below) |
Orchestration | Managing and scheduling sequences of data processing tasks or pipelines (e.g., Airflow, Databricks Workflows; see the Airflow sketch below) |
Data Platform | An environment that combines data storage, processing, analytics, ML, and governance (e.g., Databricks, Snowflake) |
Data Analytics Platform | Synonym for “data platform,” emphasizing analytics and ML as part of the solution |
Data Governance | Policies and controls for data access, privacy, quality, security, cataloging, and compliance |
Data Quality | Ensuring data is accurate, complete, consistent, and reliable |
Data Catalog | An organized inventory of all data assets (tables, files, fields), enabling data discovery and management |
Metadata | “Data about data,” such as schema, owner, creation date, usage, etc. |
Schema | The structure/definition of a dataset: field names, types, constraints (see sketch below) |
Data Lineage | The record of where data comes from, how it moves, and how it’s transformed over time |
Data Stewardship | Responsibility for managing and overseeing the proper use of data assets |
Data Masking | Obscuring or hiding sensitive data values to protect privacy (see sketch below) |
Data Encryption | Securing data (at rest/in transit) by encoding it so only authorized parties can read it |
Role-Based Access Control (RBAC) | Security system that restricts data access based on user roles/permissions |
Table Access Control | Permissions system for controlling who can read/write specific tables/views (see sketch below) |
Unity Catalog | Databricks’ unified data governance solution for managing access, auditing, and cataloging |
Data Sharing | The ability to share datasets securely between teams, organizations, or platforms (e.g., Delta Sharing) |
Data Mesh | A decentralized data architecture that assigns ownership and responsibility to domain teams |
Data Fabric | An architecture and set of data services providing integrated, consistent data management across the enterprise |
DataOps | Application of DevOps practices (automation, CI/CD, monitoring) to data engineering and pipelines |
MLOps | DevOps for machine learning: automating deployment, monitoring, and lifecycle management of models |
Master Data Management (MDM) | Ensuring a consistent, accurate “single source of truth” for core business data |
Semantic Layer | An abstracted data model providing consistent business logic and definitions to analytics users |
Data Steward | A person responsible for the quality and governance of specific data assets |
Data Orchestration | See Orchestration above |
Data Integration | Combining data from different sources and formats into a unified view |
Data Transformation | Changing, cleaning, or converting data from one format to another |
Data Ingestion | The process of importing or bringing new data into your platform/lake/warehouse |
Data Cleansing | Identifying and correcting errors, duplicates, or inconsistencies in data |
Data Modeling | Defining and structuring how data is stored, connected, and related |
Data Partitioning | Dividing large datasets into segments (partitions) for faster processing and querying (see sketch below) |
Z-Ordering | A Delta Lake technique that co-locates related data in storage for faster queries (see sketch below) |
Time Travel | The ability (in Delta Lake) to query or restore data as it existed at a previous point in time (see sketch below) |
Versioning | Keeping track of changes to data or tables so you can audit or roll back |
Compaction | Merging many small files into fewer, larger files for performance, common in Delta Lake, Hudi, and Iceberg (see sketch below) |
Partition Pruning | Query optimization that skips partitions whose data can’t match the filter criteria (shown in the partitioning sketch below) |
Data Skipping | Optimization that skips irrelevant data blocks during a query |
Broadcast Join | A Spark optimization where a small table is sent to all nodes for faster joins (see sketch below) |
Checkpointing | Periodically saving the state of a stream or computation for fault tolerance (shown in the streaming sketch below) |
Pipeline DAG | Directed Acyclic Graph; visualizes the dependencies and order of tasks in a data pipeline (see the Airflow sketch below) |
Job Cluster | A cluster in Databricks that’s spun up for a specific job and then terminated |
All-Purpose Cluster | A cluster for interactive work, like notebooks and ad-hoc queries |
Serverless SQL Warehouse | Databricks’ fully managed, auto-scaling serverless compute endpoint for SQL analytics |
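
Code Sketches for Selected Terms

The sketches below are minimal illustrations, not production code: every path, table, and column name (e.g., /landing/orders.csv, analytics.orders_clean) is hypothetical, and a PySpark environment is assumed throughout.

First, ETL vs. ELT. The ETL version transforms the data before loading it; the ELT variant loads the raw data as-is and transforms it afterwards with SQL, which assumes an engine (e.g., Databricks/Delta Lake) that supports CREATE OR REPLACE TABLE.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV from a (hypothetical) landing zone
raw = spark.read.option("header", True).csv("/landing/orders.csv")

# Transform: fix types and drop bad rows *before* loading (the "T" before the "L")
clean = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .dropna(subset=["order_id", "amount"]))

# Load: write the curated result as a warehouse table
clean.write.mode("overwrite").saveAsTable("analytics.orders_clean")

# ELT variant: load the raw data first, then transform it inside the warehouse
raw.write.mode("overwrite").saveAsTable("staging.orders_raw")
spark.sql("""
    CREATE OR REPLACE TABLE analytics.orders_clean AS
    SELECT order_id, CAST(amount AS DOUBLE) AS amount
    FROM staging.orders_raw
    WHERE order_id IS NOT NULL AND amount IS NOT NULL
""")
```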
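
Structured Streaming uses the same DataFrame API as batch; the checkpointLocation option is what gives the query its fault tolerance (see Checkpointing above). Paths, schema, and table names are made up, and .toTable() needs Spark 3.1+.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Treat a folder of JSON files as an unbounded stream; streaming reads
# need an explicit schema (a DDL string works)
events = (spark.readStream
          .schema("event_id STRING, ts TIMESTAMP, payload STRING")
          .json("/landing/events/"))

# Continuously append to a Delta table; the checkpoint lets the query
# resume from where it left off after a restart or failure
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/events/")
         .outputMode("append")
         .toTable("bronze.events"))

query.awaitTermination()
```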
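
Auto Loader is Databricks-only: the cloudFiles source tracks which files it has already seen and ingests only the new ones. This sketch assumes it runs in a Databricks notebook, where spark is predefined; paths and table names are hypothetical.

```python
# "cloudFiles" is the Auto Loader source; schemaLocation stores the inferred schema
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/chk/events_schema/")
          .load("/landing/events/"))

# availableNow processes everything pending and then stops, giving
# incremental batch-style ingestion on top of a streaming source
(stream.writeStream
 .option("checkpointLocation", "/chk/autoloader/")
 .trigger(availableNow=True)
 .toTable("bronze.events"))
```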
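
A Delta Live Tables pipeline is declared rather than scripted: each decorated function defines a table, and DLT works out the dependency graph and runs it. This code executes only inside a Databricks DLT pipeline, not as a standalone script; all names are hypothetical.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders from the landing zone")
def orders_raw():
    return spark.read.json("/landing/orders/")

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # declarative quality rule: failing rows are dropped
def orders_clean():
    return (dlt.read("orders_raw")
            .withColumn("amount", F.col("amount").cast("double")))
```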
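
A schema in practice: an explicit PySpark StructType that pins down field names, types, and nullability instead of relying on inference.

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

order_schema = StructType([
    StructField("order_id", StringType(),    nullable=False),
    StructField("amount",   DoubleType(),    nullable=True),
    StructField("ts",       TimestampType(), nullable=True),
])

# Reading with an explicit schema is faster and safer than letting Spark infer one
orders = spark.read.schema(order_schema).json("/landing/orders/")
```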
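
One simple approach to masking, assuming a DataFrame orders with hypothetical email and card_number columns: hash what you only need for joins, and redact what you only need partially.

```python
from pyspark.sql import functions as F

masked = (orders
          # irreversible hash: still usable as a join key, no longer readable
          .withColumn("email_hash", F.sha2(F.col("email"), 256))
          .drop("email")
          # keep only the last four digits of the card number
          .withColumn("card_masked",
                      F.concat(F.lit("****-****-****-"),
                               F.substring("card_number", -4, 4)))
          .drop("card_number"))
```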
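
Table access control boils down to GRANT/REVOKE statements against a principal (a user or group). This is Databricks SQL syntax; the table and group names are hypothetical.

```python
# Analysts may read the curated table; an earlier grant is revoked
spark.sql("GRANT SELECT ON TABLE analytics.orders_clean TO `analysts`")
spark.sql("REVOKE SELECT ON TABLE analytics.orders_clean FROM `interns`")
```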
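
Partitioning and partition pruning work together: write the data split by a column, then filter on that column so the engine can skip every non-matching partition. The orders DataFrame and paths are the hypothetical ones from earlier sketches.

```python
# Physically lay the table out as one folder per order_date value
(orders.write
 .format("delta")
 .partitionBy("order_date")
 .mode("overwrite")
 .save("/tables/orders/"))

# The filter is on the partition column, so only that one partition is scanned
jan15 = (spark.read.format("delta").load("/tables/orders/")
         .where("order_date = '2024-01-15'"))
```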
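
Compaction, Z-Ordering, and time travel are each one-liners against a Delta table (Databricks-flavored SQL shown; the table name is hypothetical).

```python
# Compaction: merge many small files into fewer large ones
spark.sql("OPTIMIZE bronze.events")

# Z-Ordering: co-locate rows with similar event_id values to speed up lookups
spark.sql("OPTIMIZE bronze.events ZORDER BY (event_id)")

# Time travel: read an earlier version of the table, or roll back to it
old = spark.sql("SELECT * FROM bronze.events VERSION AS OF 3")
spark.sql("RESTORE TABLE bronze.events TO VERSION AS OF 3")
```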
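
A broadcast join ships the small table to every executor so the big table never has to be shuffled. The tiny facts and dims DataFrames here are invented for the demo.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.createDataFrame([(1, 100.0), (2, 75.0)], ["product_id", "amount"])
dims  = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["product_id", "name"])

# dims is small enough to copy to every executor, so facts is never shuffled
joined = facts.join(F.broadcast(dims), on="product_id", how="left")
joined.show()
```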
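
Finally, orchestration and the pipeline DAG: a minimal Apache Airflow 2.x sketch in which the task bodies are placeholders and transform is declared to depend on ingest.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("extract + load raw data")

def transform():
    print("clean and model the data")

with DAG(dag_id="daily_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
         catchup=False) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t1 >> t2  # the DAG edge: transform runs only after ingest succeeds
```

The `>>` operator is what builds the Directed Acyclic Graph: each edge records a dependency, and the scheduler runs tasks in an order that respects all of them.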