Complete Data Glossary & Terminology

Here’s a glossary of the key data platform, engineering, and analytics terms we discussed, including those from your earlier questions and the expanded list. Each keyword comes with a simple explanation, so you can use this as a “cheat sheet” of modern data terminology.



Keyword: Meaning / Description
Raw Data Sources: The origin of data (APIs, apps, logs, sensors, files, databases, etc.)
Data Lake: A storage system for raw, semi-structured, or unstructured data in any format (CSV, JSON, images, logs, etc.)
Data Lakehouse: An architecture that combines a data lake’s flexibility with a data warehouse’s reliability (e.g., Databricks Lakehouse)
Data Warehouse: Centralized, structured storage for cleaned and processed data, optimized for analytics and reporting
Data Mart: A subset of a data warehouse, focused on a specific business unit or subject area
Delta Lake: An open-source storage layer, originally developed by Databricks, that brings ACID transactions to data lakes
ETL (Extract, Transform, Load): The process of extracting data from sources, transforming it, and then loading it into a warehouse or lake
ELT (Extract, Load, Transform): Similar to ETL, but data is loaded first and then transformed inside the data warehouse/lakehouse
Data Pipeline: A workflow or series of steps that moves and processes data automatically
Data Engineering: The discipline of building systems/pipelines to collect, clean, transform, and store data for further use
Data Analytics: The practice of analyzing and visualizing data to find insights and support business decisions
Business Intelligence (BI): Tools and processes for building dashboards, visualizations, and reports from data (e.g., Tableau, Power BI)
Data Science: The field of applying statistics and machine learning to data for prediction, classification, and modeling
Machine Learning (ML): Techniques and models that enable computers to learn from data and make predictions or decisions
Streaming Data: Data that is generated continuously and processed in real time or near real time, e.g., IoT readings, logs, transactions
Batch Processing: Processing data in discrete chunks or batches, often on a schedule
Structured Streaming: Apache Spark’s stream-processing engine, which uses the same DataFrame/Dataset API as batch processing
Auto Loader: Databricks’ tool that incrementally ingests new files as they arrive in cloud storage
Delta Live Tables (DLT): Databricks’ framework for declarative, automated data pipeline development and management
Orchestration: Managing and scheduling sequences of data processing tasks or pipelines (e.g., Airflow, Databricks Workflows)
Data Platform: An environment that combines data storage, processing, analytics, ML, and governance (e.g., Databricks, Snowflake)
Data Analytics Platform: Often used as a synonym for “data platform”; emphasizes analytics and ML as part of the solution
Data Governance: Policies and controls for data access, privacy, quality, security, cataloging, and compliance
Data Quality: Ensuring data is accurate, complete, consistent, and reliable
Data Catalog: An organized inventory of all data assets (tables, files, fields), enabling data discovery and management
Metadata: “Data about data,” such as schema, owner, creation date, usage, etc.
Schema: The structure/definition of a dataset (field names, types, constraints)
Data Lineage: The record of where data comes from, how it moves, and how it’s transformed over time
Data Stewardship: Responsibility for managing and overseeing the proper use of data assets
Data Masking: Obscuring or hiding sensitive data values to protect privacy
Data Encryption: Securing data (at rest/in transit) by encoding it so only authorized parties can read it
Role-Based Access Control (RBAC): A security model that restricts data access based on user roles/permissions
Table Access Control: A permissions system for controlling who can read/write specific tables/views
Unity Catalog: Databricks’ unified data governance solution for managing access, auditing, and cataloging
Data Sharing: The ability to share datasets securely between teams, organizations, or platforms (e.g., Delta Sharing)
Data Mesh: A decentralized data architecture that assigns ownership and responsibility to domain teams
Data Fabric: An architecture and set of data services providing integrated, consistent data management across the enterprise
DataOps: The application of DevOps practices (automation, CI/CD, monitoring) to data engineering and pipelines
MLOps: DevOps for machine learning: automating deployment, monitoring, and lifecycle management of models
Master Data Management (MDM): Processes and tools for maintaining a consistent, accurate “single source of truth” for core business data
Semantic Layer: An abstracted data model providing consistent business logic and definitions to analytics users
Data Steward: A person responsible for the quality and governance of specific data assets
Data Orchestration: See Orchestration above
Data Integration: Combining data from different sources and formats into a unified view
Data Transformation: Changing, cleaning, or converting data from one format or structure to another
Data Ingestion: The process of importing or bringing new data into your platform/lake/warehouse
Data Cleansing: Identifying and correcting errors, duplicates, or inconsistencies in data
Data Modeling: Defining and structuring how data is stored, connected, and related
Data Partitioning: Dividing large datasets into segments (partitions) for faster processing and querying
Z-Ordering: A technique (in Delta Lake) to co-locate related data in storage for performance optimization
Time Travel: The ability (in Delta Lake) to query or restore data as it existed at a previous point in time
Versioning: Keeping track of changes to data or tables so you can audit or roll back
Compaction: Merging many small files into fewer, larger files for performance (common in Delta Lake, Hudi, Iceberg)
Partition Pruning: A query optimization that skips partitions of data that don’t match the filter criteria
Data Skipping: A query optimization that skips files or data blocks that cannot match the query’s filters, typically by checking min/max statistics
Broadcast Join: A Spark optimization where a small table is copied to every executor so the large table can be joined without a shuffle
Checkpointing: Periodically saving the state of a stream or computation for fault tolerance
Pipeline DAG: A directed acyclic graph (DAG) that represents the dependencies and execution order of tasks in a data pipeline
Job Cluster: A cluster in Databricks that’s spun up for a specific job and then terminated
All-Purpose Cluster: A cluster in Databricks for interactive work, like notebooks and ad-hoc queries
Serverless SQL Warehouse: Databricks’ fully managed, automatically scaling compute for SQL analytics
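
The ETL and ELT entries differ only in where the transformation runs. Here is a minimal sketch of that difference, using Python's built-in sqlite3 as a stand-in for a warehouse. The sample rows, table names, and function names are all made up for illustration.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    ("  Alice ", "12.50"),
    ("BOB", "7.25"),
    ("alice", "3.00"),
]


def run_etl(rows):
    """ETL: clean the rows in Python first, then load only the cleaned result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in rows]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
    return conn


def run_elt(rows):
    """ELT: load the raw rows untouched, then transform with SQL in the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT lower(trim(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        """
    )
    return conn


def totals(conn):
    """Total order amount per customer, sorted by customer name."""
    query = (
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()
```

Both paths end with the same cleaned orders table. The ELT path also keeps raw_orders, so the transformation can be rerun or changed later without going back to the source.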
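
Partition Pruning and Data Skipping are easy to mix up, so here is a small sketch that shows both in one query. It is a toy model in plain Python, not how any particular engine does it. The table layout, the values, and the function name are assumptions made for this example.

```python
# Hypothetical table layout: partition value -> list of "files",
# where each file is the list of order amounts it contains.
TABLE = {
    "2024-01-01": [[5, 8, 12], [40, 55]],
    "2024-01-02": [[3, 9], [60, 75, 90]],
    "2024-01-03": [[1, 2], [100, 120]],
}


def query(table, date, min_amount):
    """Return amounts >= min_amount for one date, plus the number of files read."""
    files_read = 0
    result = []
    for partition, files in table.items():
        # Partition pruning: skip whole partitions that fail the date filter.
        if partition != date:
            continue
        for values in files:
            # Data skipping: a per-file max statistic rules the file out
            # without scanning its rows. It is computed on the fly here;
            # a real engine would keep such statistics in table metadata.
            if max(values) < min_amount:
                continue
            files_read += 1
            result.extend(v for v in values if v >= min_amount)
    return result, files_read
```

For example, `query(TABLE, "2024-01-02", 50)` reads one of the six files: pruning removes two partitions, and skipping removes one file inside the remaining partition.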
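
To make the Pipeline DAG and Orchestration entries concrete, here is a sketch that derives a valid run order from task dependencies using Python's standard graphlib module (Python 3.9+). The task names are hypothetical, and a real orchestrator would also handle scheduling, retries, and parallelism.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "dashboard": {"aggregate"},
}


def execution_order(dag):
    """Return one valid order in which every task runs after its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    which is why pipeline graphs have to be acyclic.
    """
    return list(TopologicalSorter(dag).static_order())
```

Several orders can be valid at once (for instance, "aggregate" and "train_model" may run in either order), so treat the result as one acceptable schedule, not the only one.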
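
The Versioning and Time Travel entries can be pictured with a toy table that keeps every committed snapshot. This is a deliberate simplification: Delta Lake records changes in a transaction log instead of copying full snapshots. The class and method names below are invented for illustration.

```python
class VersionedTable:
    """Toy table that keeps every committed snapshot so old versions stay readable."""

    def __init__(self):
        # Version 0 is the empty table.
        self._versions = [[]]

    def append(self, rows):
        """Commit new rows as a new version and return its version number."""
        self._versions.append(self._versions[-1] + list(rows))
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest version, or an earlier one ("time travel")."""
        if version is None:
            version = len(self._versions) - 1
        return list(self._versions[version])

    def restore(self, version):
        """Roll back by committing an old snapshot as the newest version."""
        self._versions.append(list(self._versions[version]))
        return len(self._versions) - 1
```

A restore is itself a new version, so the history stays intact and the rollback can be audited or undone.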

