{"id":161,"date":"2025-06-21T06:01:38","date_gmt":"2025-06-21T06:01:38","guid":{"rendered":"https:\/\/dataopsschool.com\/blog\/?p=161"},"modified":"2025-06-30T13:29:58","modified_gmt":"2025-06-30T13:29:58","slug":"%f0%9f%a7%aa-great-expectations-in-devsecops-a-comprehensive-tutorial","status":"publish","type":"post","link":"https:\/\/dataopsschool.com\/blog\/%f0%9f%a7%aa-great-expectations-in-devsecops-a-comprehensive-tutorial\/","title":{"rendered":"\ud83e\uddea Great Expectations in DevSecOps: A Comprehensive Tutorial"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">\ud83d\udccc Introduction &amp; Overview<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What is <em>Great Expectations<\/em>?<\/h3>\n\n\n\n<p><strong>Great Expectations (GE)<\/strong> is an open-source, Python-based data validation, documentation, and profiling framework. It helps teams <strong>define, test, and document expectations about data<\/strong> as it flows through pipelines, so that data quality issues are detected early and automatically.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/raw.githubusercontent.com\/datarootsio\/tutorial-great-expectations\/5fb8d3b6e02d7447b65ec05918c4f610faccb252\/figures\/in_out.png\" alt=\"\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">History or Background<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Developed by <strong>Superconductive<\/strong>, GE originated as an internal tool for validating data in machine learning pipelines.<\/li>\n\n\n\n<li>Became open source in 2018.<\/li>\n\n\n\n<li>GE has since evolved to support <strong>data observability, test-driven development for data<\/strong>, and compliance checks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Why is it Relevant in DevSecOps?<\/h3>\n\n\n\n<p>In DevSecOps, the goal is to embed <strong>security and quality at every phase<\/strong> of the software development lifecycle. 
GE plays a critical role in the <strong>&#8220;Sec&#8221; and &#8220;Ops&#8221;<\/strong> of DevSecOps by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Validating <strong>data quality before it&#8217;s used in production<\/strong>.<\/li>\n\n\n\n<li>Supporting <strong>data compliance and governance standards<\/strong> like GDPR, HIPAA.<\/li>\n\n\n\n<li>Enabling <strong>automated testing and CI\/CD data validation<\/strong>, just like unit tests for code.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd11 Core Concepts &amp; Terminology<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Key Terms and Definitions<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Term<\/th><th>Definition<\/th><\/tr><\/thead><tbody><tr><td>Expectation<\/td><td>A declarative rule that data should follow (e.g., <code>column values not null<\/code>)<\/td><\/tr><tr><td>Data Context<\/td><td>A configuration environment to run GE workflows<\/td><\/tr><tr><td>Suite<\/td><td>A group of expectations applied to a dataset<\/td><\/tr><tr><td>Checkpoint<\/td><td>A specific configuration to run suites on datasets<\/td><\/tr><tr><td>Validation Result<\/td><td>The outcome of applying an expectation suite to data<\/td><\/tr><tr><td>Data Docs<\/td><td>Auto-generated documentation for validation results<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">How It Fits Into the DevSecOps Lifecycle<\/h3>\n\n\n\n<p>GE integrates naturally into these phases:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Development<\/strong>: Define expectation suites during pipeline creation.<\/li>\n\n\n\n<li><strong>Security<\/strong>: Validate PII, encryption, or data masking.<\/li>\n\n\n\n<li><strong>CI\/CD<\/strong>: Automate data tests in CI workflows using tools like GitHub Actions.<\/li>\n\n\n\n<li><strong>Operations<\/strong>: Continuously monitor data quality in 
production pipelines.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83c\udfd7\ufe0f Architecture &amp; How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Components and Internal Workflow<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Expectation Suites<\/strong>: JSON files that store the rules.<\/li>\n\n\n\n<li><strong>Batch<\/strong>: A unit of data (e.g., a file or database table) to which expectations are applied.<\/li>\n\n\n\n<li><strong>Checkpoint<\/strong>: YAML configuration that runs an expectation suite against a data batch.<\/li>\n\n\n\n<li><strong>Data Docs<\/strong>: HTML-based validation reports.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/docs.greatexpectations.io\/assets\/images\/gx_oss_process-050a4264f415a1bff3ceea3ac6f9b3a0.png\" alt=\"\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture Diagram (Described)<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>User\/CI Trigger\n     |\n     v\n&#091; Data Context ]\n     |\n     |---&gt; Reads Expectation Suite\n     |---&gt; Loads Data Batch (CSV, DB, etc.)\n     |---&gt; Executes Checkpoint\n     |\n     v\n&#091; Validation Results ]\n     |\n     v\n&#091; Data Docs HTML Report ]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Integration Points with CI\/CD and Cloud<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Tool<\/th><th>Integration Strategy<\/th><\/tr><\/thead><tbody><tr><td>GitHub Actions<\/td><td>Add GE checks as jobs in <code>.github\/workflows<\/code><\/td><\/tr><tr><td>Jenkins<\/td><td>Script-based integration via shell or Python<\/td><\/tr><tr><td>AWS S3<\/td><td>Load or store data and docs<\/td><\/tr><tr><td>Azure Data Lake<\/td><td>Source and validate structured data<\/td><\/tr><tr><td>DBs (Postgres, Snowflake, etc.)<\/td><td>Direct expectation checks on SQL 
tables<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">\u2699\ufe0f Installation &amp; Getting Started<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisites<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Python 3.8+<\/li>\n\n\n\n<li>pip<\/li>\n\n\n\n<li>Optional: Docker, Jupyter<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step-by-Step Beginner-Friendly Setup<\/h3>\n\n\n\n<p>The <code>great_expectations<\/code> commands in steps 2 to 4 belong to the legacy CLI, which newer releases (GX 1.x) no longer include; on those versions, use the Python API instead.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">\ud83d\udd39 1. Install Great Expectations<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install great_expectations\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">\ud83d\udd39 2. Initialize GE in Your Project<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>great_expectations init\n<\/code><\/pre>\n\n\n\n<p>This creates the <code>great_expectations\/<\/code> folder with the project scaffolding.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">\ud83d\udd39 3. Create Your First Expectation Suite<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>great_expectations suite new\n<\/code><\/pre>\n\n\n\n<p>Follow the prompts to create expectations using:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>CLI<\/li>\n\n\n\n<li>Jupyter Notebook<\/li>\n\n\n\n<li>YAML config<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">\ud83d\udd39 4. Run a Checkpoint<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>great_expectations checkpoint new my_checkpoint\ngreat_expectations checkpoint run my_checkpoint\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">\ud83d\udd39 5. View Data Docs<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>open great_expectations\/uncommitted\/data_docs\/local_site\/index.html\n<\/code><\/pre>\n\n\n\n<p>The <code>open<\/code> command is macOS-specific; on Linux, use <code>xdg-open<\/code> or open the file directly in a browser.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83c\udf0d Real-World Use Cases<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
<strong>Data Quality Testing in CI\/CD<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use GE to validate datasets before merging PRs in CI.<\/li>\n\n\n\n<li>Fail builds if expectations (e.g., <code>no null emails<\/code>) fail.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2. <strong>Security &amp; Compliance Validation<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enforce <code>expect_column_values_to_match_regex<\/code> for email or SSN fields.<\/li>\n\n\n\n<li>Flag unencrypted or out-of-policy data entries.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3. <strong>ML Pipeline Validation<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Check distributions, missing values, and outliers in ML training datasets.<\/li>\n\n\n\n<li>Prevent garbage in, garbage out (GIGO).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4. <strong>Healthcare &amp; Finance (Industry-Specific)<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Validate patient data formats (HIPAA compliance).<\/li>\n\n\n\n<li>Ensure transaction records follow schema (PCI-DSS, GDPR).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">\u2705 Benefits &amp; \u26a0\ufe0f Limitations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Key Advantages<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u2705 Declarative, readable tests for data<\/li>\n\n\n\n<li>\u2705 Easy integration with Python, Jupyter, and CI tools<\/li>\n\n\n\n<li>\u2705 Generates automated documentation<\/li>\n\n\n\n<li>\u2705 Supports multiple backends (files, DBs, cloud)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common Limitations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u274c Requires Python environment<\/li>\n\n\n\n<li>\u274c Performance may degrade on very large datasets<\/li>\n\n\n\n<li>\u274c Learning curve for non-data engineers<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" 
\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udee1\ufe0f Best Practices &amp; Recommendations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Security, Performance &amp; Maintenance<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Keep PII out of Data Docs<\/strong> by restricting <code>result_format<\/code> (e.g., <code>BOOLEAN_ONLY<\/code>) so that failing values are not written to reports<\/li>\n\n\n\n<li><strong>Limit test scope<\/strong> (e.g., sample datasets) to avoid performance hits<\/li>\n\n\n\n<li><strong>Version-control<\/strong> expectation suites like code<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance &amp; Automation Ideas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Automate GE runs in CI\/CD (e.g., daily or on pull requests)<\/li>\n\n\n\n<li>Align suites with <strong>compliance rulesets<\/strong> (e.g., ISO, SOC 2)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd04 Comparison with Alternatives<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Feature<\/th><th>GE<\/th><th>Deequ (Amazon)<\/th><th>Soda SQL<\/th><\/tr><\/thead><tbody><tr><td>Language<\/td><td>Python<\/td><td>Scala<\/td><td>SQL \/ YAML<\/td><\/tr><tr><td>Open Source<\/td><td>\u2705 Yes<\/td><td>\u2705 Yes<\/td><td>\u2705 Yes<\/td><\/tr><tr><td>Data Docs<\/td><td>\u2705 Beautiful HTML<\/td><td>\u274c<\/td><td>\u2705<\/td><\/tr><tr><td>CI\/CD Friendly<\/td><td>\u2705 Strong integration<\/td><td>\u26a0\ufe0f Medium<\/td><td>\u2705<\/td><\/tr><tr><td>Use Case<\/td><td>General data validation<\/td><td>ML + Big Data pipelines<\/td><td>DataOps + BI<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">When to Choose Great Expectations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u2705 If your stack is Python-based<\/li>\n\n\n\n<li>\u2705 If you want custom validation logic<\/li>\n\n\n\n<li>\u2705 For beautiful data documentation<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" 
\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83e\uddfe Conclusion<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Final Thoughts<\/h3>\n\n\n\n<p>Great Expectations is a <strong>powerful and flexible data validation framework<\/strong> that fits neatly into the DevSecOps mindset. With automation, security, and governance built into data workflows, it ensures you trust your data just as much as your code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Future Trends<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Growing use of GE in <strong>DataOps pipelines<\/strong><\/li>\n\n\n\n<li>Native plugins for <strong>dbt<\/strong>, <strong>Apache Airflow<\/strong>, and <strong>Kubernetes<\/strong><\/li>\n\n\n\n<li>Enhanced integration with <strong>cloud-native and AI\/ML workflows<\/strong><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Next Steps<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Explore <a href=\"https:\/\/docs.greatexpectations.io\/\">Great Expectations Official Docs<\/a><\/li>\n\n\n\n<li>Join the <a href=\"https:\/\/greatexpectations.io\/slack\">Slack community<\/a><\/li>\n\n\n\n<li>Try building a sample <strong>CI\/CD data validation pipeline<\/strong><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>\ud83d\udccc Introduction &amp; Overview What is Great Expectations? Great Expectations (GE) is an open-source Python-based data validation, documentation, and profiling framework. 
It helps teams define, test, and&#8230; <\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-161","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/161","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=161"}],"version-history":[{"count":2,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/161\/revisions"}],"predecessor-version":[{"id":308,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/161\/revisions\/308"}],"wp:attachment":[{"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=161"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=161"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}