Unity Catalog and Enterprise Data Governance Tools: How Should They Fit in Your Stack?
Introduction
I find myself spending a lot of time with customers and partners on one topic: how Unity Catalog co-exists with other enterprise catalogs, and whether there is a need for two catalogs.
So I figured I would write a short blog to share with the community my response to this question: ‘If I deploy Unity Catalog as part of my Databricks implementation, do I still need another data catalog?’ Well, as any good consultant would respond: ‘It depends.’
First, my background…
8 years at Informatica, 4 of which included selling and deploying data catalogs and governance solutions, and 2.5 years at Alation.
Competed frequently against Collibra and other data catalogs.
Let’s break it down into various scenarios.
For starters, Unity Catalog is Databricks’ answer to enterprises seeking a comprehensive data governance tool for data and AI assets. Let’s get past the fluffy sales taglines and boil it down to what it means.
Unity Catalog offers these benefits:
Centralized data governance across your data engineering, analytics, ML and AI assets.
Tracks upstream and downstream lineage across your data pipelines.
Enables you to perform data discovery across your data assets inside and (to a limited extent) outside of the Databricks platform.
Simplifies access management for your Databricks data assets across workspaces, metastores and clouds.
Scenario #1
If you are using the Databricks Data Intelligence Platform for data engineering and/or data science use cases, including GenAI/LLMOps, then you should (and must) be using Unity Catalog. Unity Catalog is critical for managing permissions and access governance across all your data, ML and AI assets and, more importantly, for allowing other business and technical users to search, discover and explore (more on this later) these assets. More detailed blog here
Scenario #2
If you are using the Databricks Data Intelligence Platform, do you need (yet) another enterprise data catalog/governance tool? Let’s dig in…
> If your current data catalog/governance solution is well adopted, driving value and being used for “governance” use cases (business glossaries, stewardship workflows, domain definitions, ontologies, etc.), then it’s an EMPHATIC YES: please continue to use your enterprise data catalog (Alation, Informatica, Collibra) alongside Unity Catalog.
How can Unity Catalog help? There are out-of-the-box integrations with Unity Catalog built by enterprise catalog partners.
Unity Catalog can automatically track lineage for all workload executions within Databricks, no matter how the workloads were executed: Python, PySpark or Spark SQL.
Unity Catalog will be particularly useful for understanding governance and lineage around models, features, LLMOps and the combination of data and AI assets used for MLOps, to meet both regulatory and operational use cases.
Unity Catalog can feed documentation, data lineage, data quality and other metrics to your Enterprise Data Catalog.
>> If your current Data Catalog solution
Suffers from low adoption and engagement from SMEs and end-users, AND
The use cases are mostly around data search/discovery and basic stewardship/documentation of your data assets, AND
Your enterprise is buying into the Lakehouse strategy and plans on using the Databricks Lakehouse for analytics and data science use cases
If the answer is yes, yes and yes above, then Unity Catalog could be your enterprise data catalog (See more here on how)
Some new and very exciting capabilities (still in preview) are the data exploration features that allow end-users to ask questions in natural language and get quick visualizations of the data as part of search, discovery and exploration.
CAUTION:
Below are some capability gaps (as of today, but closing fast) in Unity Catalog compared to the broader enterprise data catalog solutions in a mature market.
BI connectors, BI curation and the ability to extract end-to-end lineage to BI tools
Support for traditional data governance use cases:
Stewardship workflows
Hierarchies of glossaries, domains, etc.
Privacy law workflows and remediation plans
Connectors to less common data sources, e.g. the application stack (SAP, Salesforce, Workday)
Simplified UI for business user adoption
Integration with other business productivity tools
Scenario #3
What if we have no plans for the Databricks Data Intelligence Platform?
Well, first, you are missing out, and second, Unity Catalog will not be in consideration. Unity Catalog won’t work standalone if you don’t have any data or ML/AI assets in the Lakehouse. I hope this is obvious enough that it doesn’t merit further discussion.
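For those who prefer to read logic as code, the three scenarios above can be roughly sketched as a small Python function. Treat this as an illustration only: the names `Situation`, `Recommendation` and `recommend` are made up for this blog and are not part of any Databricks API, and the final fallback branch is my assumption standing in for "it depends" when none of the stated conditions fully apply.

```python
from dataclasses import dataclass
from enum import Enum


class Recommendation(Enum):
    # Labels are illustrative summaries of the scenarios in this blog.
    NOT_APPLICABLE = "Scenario #3: Unity Catalog is not in consideration"
    BOTH = "Scenario #2: keep your enterprise catalog alongside Unity Catalog"
    UNITY_ONLY = "Scenario #2: Unity Catalog could be your enterprise catalog"
    IT_DEPENDS = "Scenario #1: use Unity Catalog; a second catalog depends on your situation"


@dataclass(frozen=True)
class Situation:
    """Hypothetical yes/no inputs describing an organization."""
    uses_databricks: bool               # data or ML/AI assets on Databricks?
    catalog_well_adopted: bool = False  # existing catalog adopted and driving value
    governance_use_cases: bool = False  # glossaries, stewardship workflows, ontologies
    low_adoption: bool = False          # low engagement from SMEs and end-users
    discovery_only: bool = False        # mostly search/discovery, basic documentation
    lakehouse_strategy: bool = False    # enterprise is buying into the Lakehouse


def recommend(s: Situation) -> Recommendation:
    # Scenario #3: no Databricks assets, so Unity Catalog cannot stand alone.
    if not s.uses_databricks:
        return Recommendation.NOT_APPLICABLE
    # Scenario #2, first case: a healthy governance catalog stays in place.
    if s.catalog_well_adopted and s.governance_use_cases:
        return Recommendation.BOTH
    # Scenario #2, second case: all three conditions must hold (yes, yes and yes).
    if s.low_adoption and s.discovery_only and s.lakehouse_strategy:
        return Recommendation.UNITY_ONLY
    # Assumption: anything else falls back to "it depends".
    return Recommendation.IT_DEPENDS


if __name__ == "__main__":
    example = Situation(
        uses_databricks=True,
        low_adoption=True,
        discovery_only=True,
        lakehouse_strategy=True,
    )
    print(recommend(example).value)
```

Note that the second case uses AND across all three conditions, mirroring the "yes, yes and yes" wording; if only one or two hold, the sketch deliberately falls through to "it depends" rather than guessing.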
We are data people, so here is a tabular view of the decision matrix with various considerations…
I hope this helps you in your data catalog discovery journey.
I would love your feedback. Whether you agree or disagree, I would appreciate any dialogue!