{"id":339,"date":"2025-08-06T01:02:59","date_gmt":"2025-08-06T01:02:59","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/?p=339"},"modified":"2025-08-06T01:03:00","modified_gmt":"2025-08-06T01:03:00","slug":"complete-data-glossary-terminology","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/complete-data-glossary-terminology\/","title":{"rendered":"Complete Data Glossary &amp; Terminology"},"content":{"rendered":"\n<p>Here\u2019s a <strong>comprehensive glossary of key data platform, engineering, and analytics terms<\/strong>. Each keyword comes with a short, plain-language explanation, so the table works as a \u201ccheat sheet\u201d of modern data terminology.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Complete Data Glossary &amp; Terminology<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Keyword<\/strong><\/th><th><strong>Meaning \/ Description<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Raw Data Sources<\/strong><\/td><td>The systems where data originates (APIs, apps, logs, sensors, files, databases, etc.)<\/td><\/tr><tr><td><strong>Data Lake<\/strong><\/td><td>A storage system that keeps raw structured, semi-structured, and unstructured data in its native format (CSV, JSON, images, logs, etc.)<\/td><\/tr><tr><td><strong>Data Lakehouse<\/strong><\/td><td>Combines a data lake\u2019s flexibility with data warehouse reliability (e.g., Databricks Lakehouse)<\/td><\/tr><tr><td><strong>Data Warehouse<\/strong><\/td><td>Centralized, structured storage for cleaned and processed data, optimized for analytics and reporting<\/td><\/tr><tr><td><strong>Data Mart<\/strong><\/td><td>A subset of a data warehouse, focused on a specific business unit or subject area<\/td><\/tr><tr><td><strong>Delta Lake<\/strong><\/td><td>Databricks\u2019 
open-source storage layer that brings ACID transactions to data lakes<\/td><\/tr><tr><td><strong>ETL (Extract, Transform, Load)<\/strong><\/td><td>Process of extracting data from sources, transforming it, and loading it into a warehouse or lake<\/td><\/tr><tr><td><strong>ELT (Extract, Load, Transform)<\/strong><\/td><td>Similar to ETL, but data is loaded first, then transformed inside the data warehouse\/lakehouse<\/td><\/tr><tr><td><strong>Data Pipeline<\/strong><\/td><td>A workflow or series of steps that move and process data automatically<\/td><\/tr><tr><td><strong>Data Engineering<\/strong><\/td><td>The discipline of building systems\/pipelines to collect, clean, transform, and store data for further use<\/td><\/tr><tr><td><strong>Data Analytics<\/strong><\/td><td>The practice of analyzing and visualizing data to find insights and support business decisions<\/td><\/tr><tr><td><strong>Business Intelligence (BI)<\/strong><\/td><td>Tools and processes for building dashboards, visualizations, and reports from data (e.g., Tableau, Power BI)<\/td><\/tr><tr><td><strong>Data Science<\/strong><\/td><td>The field of applying statistics and machine learning to data for predictions, classification, and modeling<\/td><\/tr><tr><td><strong>Machine Learning (ML)<\/strong><\/td><td>Techniques and models that enable computers to learn from data and make predictions or decisions<\/td><\/tr><tr><td><strong>Streaming Data<\/strong><\/td><td>Data that is generated and processed continuously, in real time or near real time (e.g., IoT events, logs, transactions)<\/td><\/tr><tr><td><strong>Batch Processing<\/strong><\/td><td>Processing data in discrete chunks or batches, often on a schedule<\/td><\/tr><tr><td><strong>Structured Streaming<\/strong><\/td><td>Apache Spark\u2019s stream-processing engine, which exposes the same DataFrame\/Dataset API used for batch data<\/td><\/tr><tr><td><strong>Auto Loader<\/strong><\/td><td>Databricks\u2019 tool to automatically ingest new files as they land in a data 
lake<\/td><\/tr><tr><td><strong>Delta Live Tables (DLT)<\/strong><\/td><td>Databricks\u2019 framework for declarative, automated data pipeline development and management<\/td><\/tr><tr><td><strong>Orchestration<\/strong><\/td><td>Managing and scheduling sequences of data processing tasks or pipelines (e.g., Airflow, Databricks Workflows)<\/td><\/tr><tr><td><strong>Data Platform<\/strong><\/td><td>An environment that combines data storage, processing, analytics, ML, and governance (e.g., Databricks, Snowflake)<\/td><\/tr><tr><td><strong>Data Analytics Platform<\/strong><\/td><td>Synonym for \u201cdata platform\u201d\u2014emphasizes analytics and ML as part of the solution<\/td><\/tr><tr><td><strong>Data Governance<\/strong><\/td><td>Policies and controls for data access, privacy, quality, security, cataloging, and compliance<\/td><\/tr><tr><td><strong>Data Quality<\/strong><\/td><td>Ensuring data is accurate, complete, consistent, and reliable<\/td><\/tr><tr><td><strong>Data Catalog<\/strong><\/td><td>An organized inventory of all data assets (tables, files, fields), enabling data discovery and management<\/td><\/tr><tr><td><strong>Metadata<\/strong><\/td><td>\u201cData about data,\u201d such as schema, owner, creation date, usage, etc.<\/td><\/tr><tr><td><strong>Schema<\/strong><\/td><td>The structure\/definition of a dataset (field names, types, constraints)<\/td><\/tr><tr><td><strong>Data Lineage<\/strong><\/td><td>The record of where data comes from, how it moves, and how it\u2019s transformed over time<\/td><\/tr><tr><td><strong>Data Stewardship<\/strong><\/td><td>Responsibility for managing and overseeing the proper use of data assets<\/td><\/tr><tr><td><strong>Data Masking<\/strong><\/td><td>Obscuring or hiding sensitive data values to protect privacy<\/td><\/tr><tr><td><strong>Data Encryption<\/strong><\/td><td>Securing data (at rest\/in transit) by encoding it so only authorized parties can read it<\/td><\/tr><tr><td><strong>Role-Based Access Control 
(RBAC)<\/strong><\/td><td>Security system that restricts data access based on user roles\/permissions<\/td><\/tr><tr><td><strong>Table Access Control<\/strong><\/td><td>Permissions system for controlling who can read\/write specific tables\/views<\/td><\/tr><tr><td><strong>Unity Catalog<\/strong><\/td><td>Databricks\u2019 unified data governance solution for managing access, auditing, and cataloging<\/td><\/tr><tr><td><strong>Data Sharing<\/strong><\/td><td>The ability to share datasets securely between teams, organizations, or platforms (e.g., Delta Sharing)<\/td><\/tr><tr><td><strong>Data Mesh<\/strong><\/td><td>A decentralized data architecture that assigns ownership and responsibility to domain teams<\/td><\/tr><tr><td><strong>Data Fabric<\/strong><\/td><td>An architecture and set of data services providing integrated, consistent data management across the enterprise<\/td><\/tr><tr><td><strong>DataOps<\/strong><\/td><td>Application of DevOps practices (automation, CI\/CD, monitoring) to data engineering and pipelines<\/td><\/tr><tr><td><strong>MLOps<\/strong><\/td><td>DevOps for machine learning: automating deployment, monitoring, and lifecycle management of models<\/td><\/tr><tr><td><strong>Master Data Management (MDM)<\/strong><\/td><td>Ensuring a consistent, accurate \u201csingle source of truth\u201d for core business data<\/td><\/tr><tr><td><strong>Semantic Layer<\/strong><\/td><td>An abstracted data model providing consistent business logic and definitions to analytics users<\/td><\/tr><tr><td><strong>Data Steward<\/strong><\/td><td>A person responsible for the quality and governance of specific data assets<\/td><\/tr><tr><td><strong>Data Orchestration<\/strong><\/td><td>See Orchestration above<\/td><\/tr><tr><td><strong>Data Integration<\/strong><\/td><td>Combining data from different sources and formats into a unified view<\/td><\/tr><tr><td><strong>Data Transformation<\/strong><\/td><td>Changing, cleaning, or converting data from one format to 
another<\/td><\/tr><tr><td><strong>Data Ingestion<\/strong><\/td><td>The process of importing or bringing new data into your platform\/lake\/warehouse<\/td><\/tr><tr><td><strong>Data Cleansing<\/strong><\/td><td>Identifying and correcting errors, duplicates, or inconsistencies in data<\/td><\/tr><tr><td><strong>Data Modeling<\/strong><\/td><td>Defining and structuring how data is stored, connected, and related<\/td><\/tr><tr><td><strong>Data Partitioning<\/strong><\/td><td>Dividing large datasets into segments (partitions) for faster processing and querying<\/td><\/tr><tr><td><strong>Z-Ordering<\/strong><\/td><td>A technique (in Delta Lake) to co-locate related data in storage for performance optimization<\/td><\/tr><tr><td><strong>Time Travel<\/strong><\/td><td>The ability (in Delta Lake) to query or restore data as it existed at a previous point in time<\/td><\/tr><tr><td><strong>Versioning<\/strong><\/td><td>Keeping track of changes to data or tables so you can audit or roll back<\/td><\/tr><tr><td><strong>Compaction<\/strong><\/td><td>Merging many small files into fewer, larger files for performance (common in Delta Lake, Hudi, Iceberg)<\/td><\/tr><tr><td><strong>Partition Pruning<\/strong><\/td><td>Query optimization by skipping partitions of data that don\u2019t match filter criteria<\/td><\/tr><tr><td><strong>Data Skipping<\/strong><\/td><td>Optimization by skipping irrelevant data blocks during query<\/td><\/tr><tr><td><strong>Broadcast Join<\/strong><\/td><td>A Spark optimization where a small table is sent to all nodes for faster joins<\/td><\/tr><tr><td><strong>Checkpointing<\/strong><\/td><td>Periodically saving the state of a stream or computation for fault tolerance<\/td><\/tr><tr><td><strong>Pipeline DAG<\/strong><\/td><td>Directed Acyclic Graph; visualizes the dependencies and order of tasks in a data pipeline<\/td><\/tr><tr><td><strong>Job Cluster<\/strong><\/td><td>A cluster in Databricks that\u2019s spun up for a specific job and then 
terminated<\/td><\/tr><tr><td><strong>All-Purpose Cluster<\/strong><\/td><td>A cluster for interactive work, like notebooks and ad-hoc queries<\/td><\/tr><tr><td><strong>Serverless SQL Warehouse<\/strong><\/td><td>Databricks\u2019 serverless endpoint for SQL analytics, scalable and managed<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n","protected":false},"excerpt":{"rendered":"<p>Here\u2019s a comprehensive glossary of key data platform, engineering, and analytics terms. Each keyword&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-339","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/339","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=339"}],"version-history":[{"count":1,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/339\/revisions"}],"predecessor-version":[{"id":340,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/339\/revisions\/340"}],"wp:attachment":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=339"}],"wp:term":[{"taxonomy":"category","em
beddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=339"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=339"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}