Building Unified, Access-Aware Search in a Data Mesh

By indexing authorization alongside metadata, we delivered fast, trustable search that shows every user exactly what they’re allowed to see

By Anshul Pathak · Dec. 30, 25 · Analysis
Over the last few years, I have worked on a distributed data platform in which each engineering group owns the data it produces from heterogeneous data processing systems. The platform follows a data mesh model: every domain manages its own pipelines and storage, yet discovery in the unified system should be seamless for anyone searching across the mesh. That goal sounded simple, but it was not, considering that we host datasets from ultra and black undisclosed products.

The first complaint we heard, repeatedly:

“I know the dataset exists, but how can I find it without having access to it?”

Usually, the dataset existed. It was just invisible in someone else's catalog, labeled differently, or hidden by access rules that prevented it from being discovered at all. That feedback forced us to rebuild discovery so users could see everything they’re allowed to see, and nothing more.

This post walks through the four-layer architecture that made that possible.

What Unified Search Means in a Mesh

The platform had to surface more than datasets. Our search index needed to cover:

  • Analytical datasets across warehouses and lakes
  • ML models and feature stores
  • Lineage graphs
  • Policy and classification tags
  • Search-driven documentation and usage signals

And the most important rule: search must return only what the user can actually access, not “search then filter.” Even revealing the name of a restricted dataset is a leak.

We also discovered that access-aware search isn’t just about security; it also builds trust among our customers. When results feel incomplete, people stop using the platform.

Architectural Layers

We ended up with a four-layer architecture:

  1. Discovery – asset registration and metadata standardization
  2. Enrichment – attach lineage, compliance, and popularity data
  3. Indexing – store enriched assets in text and vector indexes
  4. Authorization – enforce access control before scoring results

Each layer does one thing well, and we kept the functional boundaries between layers strict and explicit in the implementation.

Distributed data mesh

Layer 1: Discovery

The discovery layer handles asset registration: producers publish schemas, ownership, tags, policies, and lineage to a central metadata catalog. Each asset gets a consistent representation, no matter which domain or compute platform created it.

Example metadata event:

JSON
 
{
  "dataset": "sales.customer",
  "owner": "retail_analytics",
  "tags": ["pii", "financial"],
  "policy": "restricted",
  "lineage": {
    "upstream": ["raw.transactions", "raw.customers"],
    "downstream": ["reports.customer_ltv"]
  },
  "schema": {
    "columns": [
      {"name": "customer_id", "type": "string", "pii": false},
      {"name": "email", "type": "string", "pii": true},
      {"name": "revenue", "type": "decimal", "pii": false}
    ]
  }
}


Lineage matters because permissions propagate along it. If dataset B contains data from dataset A, and A is restricted, B inherits that restriction. Our first version ignored this and accidentally exposed derived datasets.
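A minimal sketch of how that inheritance could be computed from lineage. It assumes a three-level policy ordering (only `restricted` appears in the event above; `public` and `internal` are illustrative) and an in-memory map of upstream edges. The function name `effective_policy` and the data shapes are hypothetical, not the platform's actual API.

```python
# Hypothetical policy ordering, least to most restrictive.
POLICY_RANK = {"public": 0, "internal": 1, "restricted": 2}


def effective_policy(dataset, declared, upstream):
    """Return the strictest policy among a dataset and all of its ancestors.

    declared: dict mapping dataset name -> declared policy
    upstream: dict mapping dataset name -> list of direct upstream datasets
    """
    strictest = declared[dataset]
    seen = {dataset}
    stack = list(upstream.get(dataset, []))
    while stack:
        node = stack.pop()
        if node in seen:  # guard against cycles and repeated ancestors
            continue
        seen.add(node)
        # Assumption: ancestors with no declared policy fail closed.
        policy = declared.get(node, "restricted")
        if POLICY_RANK[policy] > POLICY_RANK[strictest]:
            strictest = policy
        stack.extend(upstream.get(node, []))
    return strictest


declared = {
    "raw.customers": "restricted",
    "raw.transactions": "internal",
    "sales.customer": "internal",
    "reports.customer_ltv": "public",
}
upstream = {
    "sales.customer": ["raw.transactions", "raw.customers"],
    "reports.customer_ltv": ["sales.customer"],
}

print(effective_policy("reports.customer_ltv", declared, upstream))  # restricted
```

The report is declared `public`, but it resolves to `restricted` because one of its transitive ancestors is restricted. That is the kind of derived-dataset exposure a lineage-blind version would miss.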

Layer 2: Enrichment

Every metadata change becomes an event on Kafka. An enrichment consumer subscribes and adds context: lineage links from a graph store, compliance labels derived from tag combinations, popularity scores from query logs, and, where one is missing, a description.

Example:

Python
 
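from dataclasses import dataclass

# Assumed ordering of policy levels, least to most restrictive (illustrative)
POLICY_RANK = {"public": 0, "internal": 1, "restricted": 2}


@dataclass
class EnrichedAsset:
    # Minimal stand-in for the enriched record handed to the indexing layer
    event: object
    upstream: list
    downstream: list


class EnrichmentConsumer:
    def __init__(self, kafka_consumer, lineage_client, compliance_engine, query_logs):
        self.consumer = kafka_consumer
        self.lineage = lineage_client
        self.compliance = compliance_engine
        self.query_logs = query_logs  # source of access counts for popularity

    def process(self, event):
        # Resolve full lineage graph
        upstream = self.lineage.get_ancestors(event.dataset)
        downstream = self.lineage.get_descendants(event.dataset)

        # Derive compliance labels from tags
        if "pii" in event.tags:
            event.compliance_labels.add("gdpr_subject")
        if "financial" in event.tags:
            event.compliance_labels.add("sox_relevant")

        # Propagate restrictions from upstream: keep the stricter of the
        # declared policy and the strictest policy among the ancestors
        inherited_policy = self.compliance.max_restriction(upstream)
        event.effective_policy = max(
            event.policy, inherited_policy, key=POLICY_RANK.__getitem__
        )

        # Compute popularity score
        event.popularity = self.query_logs.get_access_count(event.dataset)

        return EnrichedAsset(event, upstream, downstream)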


The effective_policy field merges declared and inherited rules. Because security metadata often changes faster than schemas, enrichment recomputes this field for every event, not just at registration.
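The enrichment step derives compliance labels from tag combinations. One way to keep that logic out of a growing chain of `if` statements is a small rule table, sketched below. The first two rules mirror the consumer above; the combination rule and its `high_sensitivity` label are assumptions added purely for illustration.

```python
# Hypothetical rules: each maps a required tag set to a compliance label.
COMPLIANCE_RULES = [
    (frozenset({"pii"}), "gdpr_subject"),
    (frozenset({"financial"}), "sox_relevant"),
    # Illustrative combination rule; the label name is an assumption.
    (frozenset({"pii", "financial"}), "high_sensitivity"),
]


def derive_labels(tags):
    """Return every label whose required tags are all present on the asset."""
    present = set(tags)
    return {label for required, label in COMPLIANCE_RULES if required <= present}


print(sorted(derive_labels(["pii", "financial"])))
# ['gdpr_subject', 'high_sensitivity', 'sox_relevant']
```

Because the rules are data, adding a new label does not require touching the consumer code.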

Layer 3: Indexing

Enriched assets flow into two parallel indexes optimized for different query types:

  • Text search (Elasticsearch/Lucene): filters and keyword queries such as “datasets tagged PII owned by retail_analytics.”
  • Vector search (FAISS or Elastic kNN): semantic similarity like "datasets similar to our churn model" even if names differ.

Both share the same document structure:

Python
 
# Elasticsearch mapping
asset_mapping = {
    "properties": {
        "dataset": {"type": "keyword"},
        "owner": {"type": "keyword"},
        "tags": {"type": "keyword"},
        "description": {"type": "text"},
        "effective_policy": {"type": "keyword"},
        "allowed_groups": {"type": "keyword"},  # Pre-computed access list
        "embedding": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine"
        }
    }
}


The allowed_groups field is key. We pre-compute which user groups can access each asset and store that list directly in the index. That shifts permission checks from query time to index time.
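A rough sketch of what that pre-computation could look like at index time. The helper `build_index_document`, the two grant tables, the group names, and the rule that the owning group always sees its own asset are all assumptions for illustration; the real access model is not described in this post.

```python
def build_index_document(asset, policy_grants, asset_grants):
    """Flatten access rules into the document that gets indexed.

    asset: dict with at least "dataset", "owner", and "effective_policy"
    policy_grants: dict mapping policy name -> groups allowed under that policy
    asset_grants: dict mapping dataset name -> groups granted explicitly
    """
    groups = set(policy_grants.get(asset["effective_policy"], ()))
    groups |= set(asset_grants.get(asset["dataset"], ()))
    groups.add(asset["owner"])  # assumption: owners can always see their asset
    doc = dict(asset)
    doc["allowed_groups"] = sorted(groups)  # sorted for stable documents
    return doc


asset = {
    "dataset": "sales.customer",
    "owner": "retail_analytics",
    "tags": ["pii", "financial"],
    "effective_policy": "restricted",
}
policy_grants = {"restricted": ["data_governance"], "public": ["all_employees"]}
asset_grants = {"sales.customer": ["finance_analysts"]}

doc = build_index_document(asset, policy_grants, asset_grants)
print(doc["allowed_groups"])
# ['data_governance', 'finance_analysts', 'retail_analytics']
```

The trade-off is that any change to a grant means reindexing the affected documents, which is the price of moving the check out of the query path.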

Layer 4: Authorization

At query time, authorization runs before scoring or ranking. Our first approach checked permissions via API calls for every result, which increased latency from 150 ms to nearly 20 s.

The fix: keep a local cache of user-group memberships next to the search engine and apply them as pre-filters.

Python
 
class AuthorizationFilter:
    def __init__(self, group_cache, es_client, index="assets"):
        self.group_cache = group_cache  # Redis or local cache
        self.es_client = es_client      # Elasticsearch client
        self.index = index              # Index name is illustrative

    def build_filter(self, user_id):
        # Fetch user's group memberships from cache
        user_groups = self.group_cache.get_groups(user_id)

        # Build Elasticsearch terms filter
        return {
            "bool": {
                "filter": {
                    "terms": {
                        "allowed_groups": list(user_groups)
                    }
                }
            }
        }

    def execute_search(self, user_id, query):
        auth_filter = self.build_filter(user_id)

        # Merge auth filter with user's search query
        filtered_query = {
            "bool": {
                "must": query,
                "filter": auth_filter["bool"]["filter"]
            }
        }

        # Keyword arguments, as expected by the elasticsearch-py 8.x client
        return self.es_client.search(index=self.index, query=filtered_query)


This brought response time back under 200 ms and guaranteed that restricted datasets never appear in the candidate set.
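The same pre-filter can be carried over to the vector side. As far as I understand the Elasticsearch 8.x kNN search option, a `filter` placed inside the `knn` clause restricts candidates during the approximate search instead of trimming results afterwards. The sketch below only builds the request body. The helper name `build_knn_request` and the default values for `k` and `num_candidates` are assumptions; the 384 dimensions match the mapping shown earlier.

```python
def build_knn_request(query_vector, user_groups, k=10, num_candidates=100):
    """Build a kNN search body restricted to documents the user may see."""
    if len(query_vector) != 384:
        raise ValueError("expected a 384-dimensional query vector")
    return {
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
            # Same access list as the text index, applied as a pre-filter
            "filter": {"terms": {"allowed_groups": sorted(user_groups)}},
        }
    }


body = build_knn_request([0.0] * 384, {"finance_analysts", "all_employees"})
print(body["knn"]["filter"])
# {'terms': {'allowed_groups': ['all_employees', 'finance_analysts']}}
```

With FAISS, which has no built-in document-level filter of this kind, the equivalent restriction would have to be handled differently. That case is not covered here.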

Search Execution Flow

  1. Fetch user group membership from cache
  2. Build an allowed-document filter
  3. Apply the filter before scoring
  4. Run search and ranking only on the permitted corpus

Latency stays constant as authorization rules grow, and users no longer stumble across off-limits assets.
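The four steps above can be illustrated with a toy, in-memory version of the flow. Everything here is a stand-in: the corpus, the group names, and the keyword-count scoring are assumptions chosen to make the ordering visible, not a description of the production engine.

```python
def search(user_id, keyword, group_cache, corpus):
    # 1. Fetch user group membership from cache
    user_groups = group_cache.get(user_id, set())

    # 2-3. Build the allowed-document filter and apply it before scoring
    permitted = [d for d in corpus if user_groups & set(d["allowed_groups"])]

    # 4. Score and rank only the permitted corpus (toy score: keyword count)
    needle = keyword.lower()
    scored = [(d["description"].lower().count(needle), d["dataset"]) for d in permitted]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


corpus = [
    {"dataset": "sales.customer",
     "description": "Customer revenue by customer segment",
     "allowed_groups": ["finance_analysts"]},
    {"dataset": "reports.customer_ltv",
     "description": "Customer lifetime value report",
     "allowed_groups": ["all_employees"]},
    {"dataset": "raw.transactions",
     "description": "Raw transaction events",
     "allowed_groups": ["data_engineering"]},
]
group_cache = {
    "alice": {"all_employees", "finance_analysts"},
    "bob": {"all_employees"},
}

print(search("alice", "customer", group_cache, corpus))
# ['sales.customer', 'reports.customer_ltv']
print(search("bob", "customer", group_cache, corpus))
# ['reports.customer_ltv']
```

The restricted dataset is never scored for `bob`, so it cannot leak through ranking, counts, or snippets.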

 Search execution flow

The Hard Problems You Don’t See on the Whiteboard

  • Lineage means inherited access. If dataset B uses data from A, B inherits A’s restrictions. We enforce this in enrichment.
  • Security metadata changes faster than schemas. Group memberships change constantly, so index segments must refresh in real time.
  • Even embeddings follow governance. If a description contains red-flag terms, we apply tighter vector-index restrictions.
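Because group memberships change constantly, the membership cache used at query time needs a freshness bound. Below is a small sketch of a TTL cache with explicit invalidation. The class, its constructor parameters, and the 60-second default are assumptions; only the `get_groups` method name is taken from the authorization filter shown earlier.

```python
import time


class GroupCache:
    """Tiny TTL cache for user-group memberships (illustrative only)."""

    def __init__(self, fetch_groups, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch_groups  # callable: user_id -> iterable of group names
        self._ttl = ttl_seconds
        self._clock = clock         # injectable so expiry can be tested
        self._entries = {}          # user_id -> (expires_at, frozenset of groups)

    def get_groups(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is None or entry[0] <= now:
            # Missing or expired: reload from the source of truth
            groups = frozenset(self._fetch(user_id))
            self._entries[user_id] = (now + self._ttl, groups)
            return groups
        return entry[1]

    def invalidate(self, user_id):
        """Drop a user's entry, e.g. when a membership-change event arrives."""
        self._entries.pop(user_id, None)


memberships = {"alice": {"finance_analysts"}}
cache = GroupCache(lambda user: memberships.get(user, set()), ttl_seconds=30)
print(sorted(cache.get_groups("alice")))  # ['finance_analysts']
```

This covers only the user side. Changes to which groups may access an asset still require refreshing `allowed_groups` in the index itself.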

What Worked Well

After rollout, the improvements were clear:

  • Fewer duplicate tables
  • Higher reuse of curated datasets
  • Search logs exposed ownership gaps
  • More documentation contributions from data consumers

Search became both a discovery tool and a quality feedback loop.

Closing Thoughts

In a data mesh, discovery is where your indexing design gets tested. Indexing documents is easy. Keeping authorization indexed in real time is the operationally hard part.

The four-layer design - discovery, enrichment, indexing, authorization - turns an open-ended search problem into a predictable pipeline. Each layer has a clear contract, and authorization becomes data, not runtime checks.

When search results respect both data structure and access rules, engineers stop second-guessing what they see. That confidence is what makes a mesh truly work.


Opinions expressed by DZone contributors are their own.
