Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me

Setting up a data catalog isn’t just a tool problem. My work with Azure Purview and Collibra showed that success depends on governance, metadata, and adoption.

Kuladeep Sandra

May. 27, 26 · Analysis

This project was the third catalog implementation I had led in my career. The first was a custom-built solution in 2015 that worked but required three engineers to maintain. The second was an off-the-shelf tool that nobody used because it was too cumbersome to keep current. For this third attempt, I wanted to get it right.

We implemented Azure Purview for automated discovery and technical metadata, and Collibra for business glossary, data ownership, and governance workflows. They serve different functions and are connected through a custom integration. Here is how we set it up and what surprised us.

Why Two Tools?

Azure Purview is excellent at automated technical metadata collection. Purview scans your data sources on a schedule, discovers tables and columns, infers data types, and builds an automatically maintained lineage graph. Automated discovery is its primary value. Doing this manually doesn't scale, and any manually maintained catalog falls behind the actual state of the data within months.

Purview isn't good at business governance workflows: data stewardship, business term assignment, data quality certification, access request approvals. These require human processes with approvals and audit trails that Purview's workflow capabilities do not cover adequately.

Collibra handles the governance workflow side. Business data stewards maintain the business glossary in Collibra. Ownership assignments and data quality certifications go through Collibra's workflow engine. When a data consumer wants to know what a dataset means in business terms, they look in Collibra. When they want to know where the data physically lives and what its schema is, they look in Purview.

The Purview Setup

Purview scans are configured per data source. We set up scans for our three ADLS Gen2 storage accounts, our Azure SQL databases, our Databricks Unity Catalog, and our Azure Data Factory pipelines. Scans run daily for production data sources and weekly for development.
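
The daily and weekly split can be captured in a small helper that builds the recurrence block for each scan. This is a minimal sketch: `SCAN_CADENCE` and `build_recurrence` are hypothetical names, and the field names simply mirror the trigger block in the scan configuration in this article rather than a verified Purview schema.

```python
# Hypothetical mapping from environment to scan cadence: daily for
# production sources, weekly for development sources.
SCAN_CADENCE = {
    "production": {"frequency": "Day", "interval": 1},
    "development": {"frequency": "Week", "interval": 1},
}


def build_recurrence(environment, start_time="2024-01-01T02:00:00Z"):
    """Return a recurrence block for the given environment label."""
    try:
        cadence = SCAN_CADENCE[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    return {**cadence, "startTime": start_time, "timezone": "UTC"}


production_trigger = {"recurrence": build_recurrence("production")}
development_trigger = {"recurrence": build_recurrence("development")}
```

Keeping the cadence in one table means a new environment only needs one new entry, not another copy of the scan body.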

Purview builds a lineage graph from ADF pipelines, which is genuinely useful. We can see, for any given table, which pipelines write to it and which upstream tables those pipelines read from. Lineage tracking has proved its value in three incident investigations where we needed to understand the upstream sources of a corrupted dataset.
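
The incident question ("what feeds this corrupted dataset?") is a graph traversal. The sketch below shows the idea on a plain list of edges. It does not call Purview, and the dataset names are made up for illustration.

```python
from collections import deque


def upstream_sources(edges, target):
    """Return every dataset that feeds `target`, directly or transitively.

    `edges` is an iterable of (source, destination) pairs.
    """
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical lineage edges, as they might be exported from a lineage graph.
edges = [
    ("raw/claims", "staging/claims"),
    ("raw/policies", "staging/policies"),
    ("staging/claims", "curated/claims_summary"),
    ("staging/policies", "curated/claims_summary"),
]
suspects = sorted(upstream_sources(edges, "curated/claims_summary"))
```

Here `suspects` contains both staging tables and both raw sources, which is the list you would check first when the curated table looks wrong.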

Custom classifications are worth the setup time. Purview comes with built-in classifiers for common PII patterns: email addresses, phone numbers, credit card numbers, and national ID formats for several countries. We added custom classifiers for our internal account number formats and insurance policy number patterns. Automated classification isn't perfect, about 85% accurate in our testing, but it surfaces PII-candidate columns that manual review would miss.

Python

    # Purview scan configuration (REST API)
    import requests

    # Assumption: replace with the api-version your Purview account supports.
    API_VERSION = "2023-09-01"

    def create_purview_scan(account_name, collection, data_source):
        """Create or update the daily production scan for one data source."""
        # get_auth_headers() is defined elsewhere; it must return an
        # Authorization header carrying a valid bearer token.
        url = (f"https://{account_name}.purview.azure.com/scan/datasources/"
               f"{data_source}/scans/daily-production-scan")
        body = {
            "kind": "AzureStorageMsi",
            "properties": {
                "scanRulesetName": "custom-pii-ruleset",
                "scanRulesetType": "Custom",
                "collection": {"referenceName": collection},
                "credential": {
                    "referenceName": "managed-identity",
                    "credentialType": "ManagedIdentity"
                }
            },
            "trigger": {
                "recurrence": {
                    "frequency": "Day",
                    "interval": 1,
                    "startTime": "2024-01-01T02:00:00Z",
                    "timezone": "UTC"
                }
            }
        }
        resp = requests.put(url, json=body,
                            params={"api-version": API_VERSION},
                            headers=get_auth_headers())
        resp.raise_for_status()  # fail loudly instead of returning an error body
        return resp.json()

    # Custom classifier for internal account numbers
    custom_classifier = {
        "kind": "Custom",
        "properties": {
            "classificationName": "INTERNAL_ACCOUNT_NUMBER",
            "description": "Internal account number: 'ACC' followed by 9 digits",
            "classificationRule": {
                "kind": "Regex",
                "pattern": "^ACC[0-9]{9}$",
                "minimumPercentageMatch": 75
            }
        }
    }
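
How a regex classifier with a minimum-match threshold decides on a column can be sketched locally. This illustrates the idea only and is not Purview's actual implementation: `column_matches` and the sample values are hypothetical, and ignoring empty values is an assumption about how sampling should behave.

```python
import re

# Same pattern as the custom classifier above.
ACCOUNT_PATTERN = re.compile(r"^ACC[0-9]{9}$")


def column_matches(values, pattern=ACCOUNT_PATTERN, minimum_percentage_match=75):
    """Return True if enough non-empty sample values match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return 100 * hits / len(non_empty) >= minimum_percentage_match


# Hypothetical column sample: three valid account numbers and one placeholder.
sample = ["ACC123456789", "ACC000000001", "ACC987654321", "n/a"]
is_account_column = column_matches(sample)  # 3 of 4 match (75%), so True
```

A threshold below 100% tolerates placeholder or dirty values in an otherwise consistent column, at the cost of some false positives, which is one reason a human review step is still useful.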
   

The Collibra Integration

We built a nightly sync that reads technical metadata from Purview via its REST API and creates or updates corresponding assets in Collibra. Our sync maps Purview datasets to Collibra data assets, adds technical metadata (schema, classification, lineage summary) as attributes on the Collibra asset, and creates a link between the Collibra and Purview assets so users can navigate between the business and technical views.

Building this sync took about six weeks of engineering time. It's the part of the implementation where I most seriously considered an off-the-shelf connector, but the available connectors didn't handle our specific Purview classification tagging approach correctly. Our custom sync has been running for 14 months with minimal maintenance.

Python

    # Nightly Purview-to-Collibra metadata sync (Python)
    from datetime import datetime, timezone

    def sync_purview_to_collibra(purview_client, collibra_client):
        """Sync technical metadata from Purview to Collibra assets."""
        # get_lineage_summary() and both client objects are defined elsewhere.
        sync_time = datetime.now(timezone.utc).isoformat()

        # Fetch classified data lake assets from Purview.
        # Note: this reads a single page of at most 1000 results.
        purview_assets = purview_client.discovery.query(
            keywords="*",
            filter={"and": [
                {"entityType": "azure_datalake_gen2_path"},
                {"classification": ["confidential", "restricted"]}
            ]},
            limit=1000
        )

        for asset in purview_assets['value']:
            collibra_asset = collibra_client.find_or_create_asset(
                name=asset['name'],
                domain="Data Lake Assets",
                type="Data Set"
            )
            # Sync technical metadata as attributes
            collibra_client.update_attributes(collibra_asset['id'], {
                "Technical Schema": asset.get('schema', ''),
                "Data Classification": asset.get('classification', []),
                "Purview Asset Link": asset['id'],
                "Last Scanned": asset.get('lastScanTime', ''),
                "Lineage Summary": get_lineage_summary(
                    purview_client, asset['id']),
                "Sync Timestamp": sync_time
            })

        return {"synced": len(purview_assets['value']),
                "timestamp": sync_time}
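
The create-or-update step is what keeps a nightly sync from producing duplicates on every run. Below is a minimal sketch of that behavior using an in-memory stand-in. `InMemoryCatalog` is hypothetical and does not represent Collibra's real API; it only mirrors the two method names used in the sync above.

```python
import uuid


class InMemoryCatalog:
    """Stand-in for a catalog client; assets are keyed by (domain, name)."""

    def __init__(self):
        self._assets = {}

    def find_or_create_asset(self, name, domain, type):
        # `type` shadows the builtin only to match the keyword used above.
        key = (domain, name)
        if key not in self._assets:
            self._assets[key] = {"id": str(uuid.uuid4()), "name": name,
                                 "domain": domain, "type": type,
                                 "attributes": {}}
        return self._assets[key]

    def update_attributes(self, asset_id, attributes):
        for asset in self._assets.values():
            if asset["id"] == asset_id:
                asset["attributes"].update(attributes)
                return asset
        raise KeyError(asset_id)


catalog = InMemoryCatalog()
first = catalog.find_or_create_asset("claims_2024", "Data Lake Assets", "Data Set")
second = catalog.find_or_create_asset("claims_2024", "Data Lake Assets", "Data Set")
catalog.update_attributes(first["id"], {"Data Classification": ["confidential"]})
```

Running the lookup twice returns the same asset, so a repeated sync only refreshes attributes. Keying on name assumes names are unique within a domain; if they can collide, keying on the Purview asset ID would be the safer choice.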
   

What Adoption Looked Like

Adoption was slow. We launched the catalog with a communication campaign, internal documentation, and a live demo. After three months, only about 30% of our target user base was actively using it, primarily data engineers looking up lineage information. Analysts and business stakeholders, the people Collibra was primarily meant to support, were largely not using it.

Adoption really broke through when we integrated the catalog with our data access request process. Previously, access requests went to a Jira form. We changed the process: to request access to a dataset, you start from the Collibra data asset page. Each access request is contextualized with the asset's classification, ownership, and purpose, which both the requester and the approver see during the approval workflow. Usage of Collibra for data assets grew by 300% in the month after we made this change.

Python

    # Collibra asset mapping schema for access request workflow
    ASSET_MAPPING_SCHEMA = {
        "asset_type": "Data Set",
        "domain": "Data Lake Assets",
        "attributes": {
            "Technical Name": {"type": "text", "source": "purview"},
            "Business Name": {"type": "text", "source": "steward"},
            "Data Classification": {
                "type": "single_select",
                "values": ["public", "internal", "confidential", "restricted"],
                "source": "purview"
            },
            "Owner Team": {"type": "text", "source": "steward"},
            "PII Columns": {"type": "multi_select", "source": "purview"},
            "Quality Certification": {
                "type": "single_select",
                "values": ["certified", "provisional", "uncertified"],
                "source": "steward"
            },
            "Access Request URL": {
                "type": "url",
                "template": "https://collibra.internal/access/{asset_id}"
            }
        },
        "workflow": {
            "access_request": {
                "approvers": ["asset_owner", "data_governance_lead"],
                "sla_hours": 48,
                "auto_revoke_days": 365
            }
        }
    }
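
To make the workflow settings concrete, here is a sketch of how one access request could be derived from them. `build_access_request` is a hypothetical helper, and counting the revocation window from the request time (rather than the approval time) is an assumption, not something the schema specifies.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the workflow section of the schema above.
WORKFLOW = {"approvers": ["asset_owner", "data_governance_lead"],
            "sla_hours": 48, "auto_revoke_days": 365}
URL_TEMPLATE = "https://collibra.internal/access/{asset_id}"


def build_access_request(asset_id, requester, requested_at, workflow=WORKFLOW):
    """Derive the link, approvers, and deadlines for one access request."""
    return {
        "asset_id": asset_id,
        "requester": requester,
        "url": URL_TEMPLATE.format(asset_id=asset_id),
        "approvers": list(workflow["approvers"]),
        "approval_due": requested_at + timedelta(hours=workflow["sla_hours"]),
        # Assumption: the revocation clock starts at request time.
        "revoke_on": requested_at + timedelta(days=workflow["auto_revoke_days"]),
    }


request = build_access_request(
    "asset-123", "analyst@example.com",
    datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc))
print(request["approval_due"].isoformat())  # 2024-03-03T09:00:00+00:00
```

Because the request starts from the asset page, the asset ID is always known, so the approver can follow the URL straight back to the classification and ownership details.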
   

The Honest Caveat

A data catalog requires ongoing investment that is easy to underestimate. The automated parts (Purview's scanning and discovery) largely take care of themselves. The business governance parts (glossary maintenance, stewardship assignments, and quality certifications) require human effort that must be budgeted and owned.

Our Collibra business glossary currently covers about 60% of our production datasets. The remaining 40% have technical metadata from Purview but no business context. That 40% is smaller than it was six months ago, which means we are making progress. But it's a real gap that we manage explicitly rather than pretending the catalog is complete.
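
Managing the gap explicitly starts with measuring it. A minimal sketch of a coverage check, assuming each dataset record carries a technical name and an optional business name; the field names and sample records are hypothetical.

```python
def glossary_coverage(datasets):
    """Return (percent covered, names of datasets lacking business context)."""
    uncovered = [d["technical_name"] for d in datasets
                 if not d.get("business_name")]
    if not datasets:
        return 0.0, uncovered
    covered_pct = 100 * (len(datasets) - len(uncovered)) / len(datasets)
    return covered_pct, uncovered


# Hypothetical catalog export: three datasets have business names, two do not.
datasets = [
    {"technical_name": "claims_2024", "business_name": "Claims"},
    {"technical_name": "policies", "business_name": "Active Policies"},
    {"technical_name": "payments", "business_name": "Premium Payments"},
    {"technical_name": "tmp_export_7", "business_name": ""},
    {"technical_name": "audit_log"},
]
covered_pct, backlog = glossary_coverage(datasets)
```

The list of uncovered names doubles as the steward backlog, which makes the remaining work visible instead of hiding it behind a single percentage.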
