Data Governance as Code
From Spreadsheets to Git
Imagine a pipeline where every governance policy—data quality checks, access rules, compliance constraints—is executed automatically at every commit. No Excel, no bureaucracy, just governance as code. That’s the paradigm shift we’re entering right now.
Why Governance as Code?
If Infrastructure as Code and Security as Code transformed their fields, why is data governance still stuck in static documents, wikis, and spreadsheets?
Two recent contributions make the case clear:
- Gable.ai describes governance as the “RoboCop of data,” emphasizing automation, modularity, CI/CD, and version control.
- Sukhpreet Kaur highlights how embedding trust at scale requires policies written and executed as code, not as slideware.
The rationale is rather simple:
- Automation: Policies are enforced at runtime, not after the fact.
- Versioning: Every change is traceable and reviewable.
- CI/CD integration: Governance becomes part of deployment, not an afterthought.
- Auditability: Compliance isn’t checked once a year—it’s continuous.
Example 1: Data Quality as Code with Python
Using Great Expectations, you can define quality rules as code:
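import pandas as pd
# Legacy API: recent great_expectations releases no longer ship this module,
# so the snippet needs an older version installed.
from great_expectations.dataset import PandasDataset


class MyDataSet(PandasDataset):
    # expectation() is a decorator factory: it takes the list of the
    # expectation's argument names (none here).
    @PandasDataset.expectation([])
    def expect_cpf_not_null(self):
        null_count = int(self["cpf"].isnull().sum())
        return {
            "success": null_count == 0,
            "result": {"observed_value": null_count},
        }


df = pd.DataFrame({"cpf": ["12345678901", None, "98765432100"]})
# One CPF is missing, so the expectation reports a failure.
print(MyDataSet(df).expect_cpf_not_null())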
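This enforces a business rule: no null values in the CPF column (CPF is the Brazilian individual taxpayer ID). If the expectation fails, your CI/CD pipeline can block the merge or deployment, just like a failing unit test, as long as the script turns the failure into a non-zero exit status.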
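To make the "block the merge" step concrete, here is a minimal, dependency-free sketch of a CI gate. The names `cpf_not_null` and `gate` are hypothetical, and the rows are plain dictionaries rather than a Great Expectations dataset; the only point is that failed checks become a non-zero exit code, which is what most CI systems use to fail a job.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
# A check takes the rows and returns an error message, or None when it passes.
Check = Callable[[List[Row]], Optional[str]]


def cpf_not_null(rows: List[Row]) -> Optional[str]:
    """Hypothetical check mirroring the rule above: every row needs a CPF."""
    missing = sum(1 for row in rows if row.get("cpf") is None)
    return f"cpf: {missing} null value(s)" if missing else None


def gate(rows: List[Row], checks: List[Check]) -> int:
    """Run every check and return a process exit code (0 = pass, 1 = fail)."""
    failures = [msg for msg in (check(rows) for check in checks) if msg]
    for msg in failures:
        print(f"GOVERNANCE CHECK FAILED: {msg}")
    return 1 if failures else 0


rows = [{"cpf": "12345678901"}, {"cpf": None}, {"cpf": "98765432100"}]
exit_code = gate(rows, [cpf_not_null])
# A CI entry point would finish with sys.exit(exit_code).
```

Keeping each check as a small function that returns a message makes the gate easy to extend: adding a rule means adding one function to the list, and the change shows up in code review like any other.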
Example 2: Detecting PII Columns Automatically
Instead of hard-coding “cpf”, you can scan a dataset for personally identifiable information (PII) and enforce governance rules automatically.
import pandas as pd

# Ordered from most to least specific: an 11-digit CPF also fits the phone pattern.
PII_PATTERNS = {
    "cpf": r"^\d{11}$",
    "email": r"[^@]+@[^@]+\.[^@]+",
    "phone": r"^\+?\d{8,15}$",
}


def detect_pii(df: pd.DataFrame) -> dict:
    """Map each column to the first PII type matched by any of its values."""
    pii_report = {}
    for col in df.columns:
        values = df[col].astype(str)  # missing values match none of the patterns
        for pii_type, pattern in PII_PATTERNS.items():
            if values.str.match(pattern).any():
                pii_report[col] = pii_type
                break  # keep the first (most specific) match
    return pii_report


# Example usage
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["alice@mail.com", "bob@mail.com", "not_an_email"],
    "phone": ["+5511999999999", "12345", None],
})
print(detect_pii(df))  # {'email': 'email', 'phone': 'phone'}
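Integrated into your pipeline, this lets any new column that looks like PII trigger masking or encryption automatically. Because a single matching value is enough to flag a column, treat the report as a list of candidates to act on, and tune the patterns to your data. This is real governance as code: detection plus automated enforcement.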
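The enforcement half can be sketched in the same spirit. The example below assumes a report shaped like the detector's output (column name mapped to PII type) and masks flagged columns with a salted SHA-256 digest. The function names, the salt handling, and the 16-character truncation are illustrative assumptions. Hashing is one-way pseudonymization, not encryption, so a real pipeline might swap in a different technique.

```python
import hashlib
from typing import Dict, List


def mask_value(value: object, salt: str) -> object:
    """Replace a value with a salted SHA-256 digest; leave missing values alone."""
    if value is None:
        return None
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability (an assumption, not a standard)


def mask_records(
    records: List[Dict[str, object]], pii_report: Dict[str, str], salt: str
) -> List[Dict[str, object]]:
    """Return copies of the records with every flagged column masked."""
    flagged = set(pii_report)
    return [
        {key: mask_value(val, salt) if key in flagged else val
         for key, val in record.items()}
        for record in records
    ]


records = [
    {"user_id": 1, "email": "alice@mail.com", "phone": "+5511999999999"},
    {"user_id": 2, "email": "bob@mail.com", "phone": None},
]
pii_report = {"email": "email", "phone": "phone"}  # shape of the detector's output
masked = mask_records(records, pii_report, salt="demo-salt")  # hypothetical salt
```

The records are plain dictionaries to keep the sketch dependency-free. With pandas, the same `mask_value` function could be applied to each flagged column via `Series.map`.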
The Bigger Picture
“Data Governance as Code” is about more than tools. It is a new mindset:
- Stop managing governance in manual processes.
- Start treating governance as software artifacts: testable, reviewable, auditable.
- Integrate governance into your DevOps / DataOps lifecycle.
This is how we truly embed trust at scale: governance that lives inside pipelines, not in binders.
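As one illustration of a policy treated as a software artifact, the sketch below expresses an access rule as a plain function with its own test. Everything in it is hypothetical: the role names, the column list, and the rule itself are placeholders chosen for the example, not a recommended policy.

```python
from dataclasses import dataclass

# Hypothetical policy data: in practice this would live in a reviewed file in Git.
PII_COLUMNS = frozenset({"cpf", "email", "phone"})
ROLES_ALLOWED_PII = frozenset({"privacy_officer", "fraud_analyst"})


@dataclass(frozen=True)
class AccessRequest:
    role: str
    column: str


def is_allowed(request: AccessRequest) -> bool:
    """Allow non-PII columns for everyone; PII columns only for listed roles."""
    if request.column not in PII_COLUMNS:
        return True
    return request.role in ROLES_ALLOWED_PII


def test_policy() -> None:
    """The policy's expected behavior, runnable in CI like any unit test."""
    assert is_allowed(AccessRequest("marketing", "user_id"))
    assert not is_allowed(AccessRequest("marketing", "cpf"))
    assert is_allowed(AccessRequest("fraud_analyst", "cpf"))


test_policy()
```

Because the rule and its test sit side by side, a change to who may read PII becomes a diff that reviewers can read and that CI can verify.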
If infrastructure and security evolved to “as code,” so can governance. Start small:
- Write your first quality check with Great Expectations.
- Define a simple policy in OPA.
- Commit them to Git and run them in CI/CD.
Governance is no longer a spreadsheet. It’s code.