How to Create Data Lineage With the Tableau GraphQL Metadata API

Tableau has a rich metadata API exposed through a GraphQL interface that you can use to extract your data lineage.

By Grant Seward · Dec. 08, 20 · Tutorial

I love data. The ways it can be used to create value and express relationships never cease to amaze me. Visualizing data is often one of the most powerful ways to share insights, and Tableau is certainly one of the most popular data visualization tools on the market, if not the most popular. Its fairly intuitive UI makes it simple for non-technical users to develop rich and meaningful graphs, and there are some really nice features under the hood that speed up query performance when extracts are stored within Tableau.

My absolute favorite Tableau feature is that you can query your metadata using the same GraphQL API that Tableau itself uses. The exposed metadata includes the lineage for the fields, sheets, tables, and data stores that exist within your Tableau Site. Exposing the metadata via an extensive API like this is a really forward-thinking idea from the team behind Tableau.

How to Use the Tableau Metadata API

The Tableau Metadata API is exposed via GraphQL and is wrapped in a Python library, the Tableau Server Client. This library is one of the easiest APIs to use: Tableau has simplified all authentication and serialization so that users can focus on the query they want to execute.

Pros:

  • The graph enables many different entities and data assets within Tableau to be queried
  • The API performance is really good, even when requesting a large number of multidimensional relationships
  • The Python client is extremely simple and intuitive to use, handling the authentication and serialization for the user

Cons:

  • The documentation is sparse: it's not clear when to expect upstream or downstream data lineage assets to be provided and when they will be null
  • A "full" lineage for each data asset is not available. You can only extract lineage one step upstream or one step downstream (at least from what I could tell from using the API)
  • Tableau releases a new API version every quarter or so, but the docs do not state which features are available in which version

Let's look at some code that can be used to query your Tableau metadata.

Authentication

You can authenticate to the Tableau API with your username and password, but the more secure and recommended approach is to use a personal access token. I've also created a simple helper function below to authenticate and execute queries.

Python

import os
import tableauserverclient as TSC

TOKEN_NAME = os.environ.get('TOKEN_NAME', 'some-token')
TOKEN = os.environ.get('TOKEN', 'your-token-value')
SITE_NAME = os.environ.get('SITE_NAME', 'your-site')

# If using Tableau Online this might be 'https://prod-useast-b.online.tableau.com'
SERVER = os.environ.get('SERVER', 'your-server')
SERVER_VERSION = os.environ.get('SERVER_VERSION', '3.9')

tableau_auth = TSC.PersonalAccessTokenAuth(TOKEN_NAME, TOKEN, SITE_NAME)
server = TSC.Server(SERVER)
server.version = SERVER_VERSION

# Helper function to run queries
def run_query(query):
    with server.auth.sign_in(tableau_auth):
        resp = server.metadata.query(query)
        resp = resp['data']
        if isinstance(resp, list):
            resp = resp[0]
        return resp
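
The helper above reads `resp['data']` directly. A GraphQL response can also carry an `errors` list alongside (or instead of) `data`, so it can help to check for it before indexing. The sketch below is one way to do that with plain dictionaries; `MetadataQueryError` and `unwrap_graphql_response` are hypothetical names introduced for illustration, and the sample responses are hand-written rather than real Tableau output.

```python
class MetadataQueryError(Exception):
    """Raised when a GraphQL response reports errors or has no data."""


def unwrap_graphql_response(resp):
    """Return resp['data'], raising if the response carries errors."""
    errors = resp.get("errors")
    if errors:
        # GraphQL error entries are normally objects with a "message" key.
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise MetadataQueryError(messages)
    data = resp.get("data")
    if data is None:
        raise MetadataQueryError("response contained no data")
    return data


# Hand-written sample responses (assumptions, not real Tableau output).
ok = {"data": {"calculatedFieldsConnection": {"nodes": []}}}
bad = {"data": None, "errors": [{"message": "Unknown field 'formul'"}]}

print(unwrap_graphql_response(ok))
try:
    unwrap_graphql_response(bad)
except MetadataQueryError as exc:
    print("query failed:", exc)
```

Raising on any error is a deliberate simplification. GraphQL allows partial data to be returned together with errors, so you may prefer to log the errors and keep the data instead.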



Define the Query

The Tableau Metadata API is a fantastic way to start learning GraphQL, since Tableau handles all of the serialization for you and its graph follows a consistent, easy-to-understand set of conventions.

The function below executes a query that returns all calculated fields that exist within your Site, fetching them in pages of `batch_size`. The beautiful thing about GraphQL is that, in the same request, we can ask Tableau to return all of the fields that reference each calculated field, and go one level deeper to request all of the sheets that use each of those referencing fields.

Python

def get_all_calculated_fields(batch_size=100):
    all_calculated_fields = []
    has_next = True
    start = 0
    # Page through the results, batch_size records at a time
    while has_next:
        query = """
        {
            calculatedFieldsConnection (first: %s, offset: %s){
                nodes {
                    id
                    name
                    formula
                    referencedByFields {
                        fields {
                          id
                          name
                          sheets {
                            id
                            name
                          }
                        }
                    }
                }
                pageInfo {
                    hasNextPage
                    endCursor
                }
            }
        }
        """ % (batch_size, start)
        resp = run_query(query)
        all_calculated_fields.extend(resp['calculatedFieldsConnection']['nodes'])
        start = start + batch_size
        # Stop once Tableau reports there are no more pages
        has_next = resp['calculatedFieldsConnection']['pageInfo']['hasNextPage']

    return all_calculated_fields
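
The offset-based paging loop can be tried without a Tableau server by separating the loop from the call that fetches a page. This is a minimal sketch under that assumption: `fetch_all`, `fake_fetch_page`, and `FAKE_FIELDS` are hypothetical names, and the fake page only mimics the `nodes` and `pageInfo.hasNextPage` keys requested in the query above.

```python
def fetch_all(fetch_page, batch_size=100):
    """Collect every node by calling fetch_page(first, offset) until done."""
    items = []
    offset = 0
    while True:
        page = fetch_page(batch_size, offset)
        items.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return items
        offset += batch_size


# Stand-in for the API: serves slices of a fixed, made-up list.
FAKE_FIELDS = [{"id": str(i), "name": f"calc_{i}"} for i in range(7)]


def fake_fetch_page(first, offset):
    """Return one page shaped like the connection in the query above."""
    chunk = FAKE_FIELDS[offset:offset + first]
    return {
        "nodes": chunk,
        "pageInfo": {"hasNextPage": offset + first < len(FAKE_FIELDS)},
    }


result = fetch_all(fake_fetch_page, batch_size=3)
print(len(result))  # 7
```

In real use, `fetch_page` would build the query string, call `run_query`, and return the `calculatedFieldsConnection` part of the response.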



Create Your Data Lineage

Now that you have your metadata from Tableau, how you structure and use the output is completely up to you. This example will define edges and nodes. These are the fundamental building blocks for network relationships and data lineage.

Python

def format_nodes_and_edges(calc_fields):
    nodes = []
    edges = []
    # IDs of every calculated field, so that a referencing field that is itself
    # a calculated field can reuse its existing 'CalcField - ' node
    calc_ids = {calc['id'] for calc in calc_fields}
    for calc in calc_fields:
        # Add each calculated field to the nodes
        calc_field_name = 'CalcField - ' + calc['name']
        nodes.append(calc_field_name)

        # For each field that references the calculated field, add a node
        for ref_field in calc['referencedByFields']:
            for field in ref_field['fields']:
                # Calculated fields may show up under referenced fields; if that
                # happens, reuse the existing node instead of creating a new one
                if field['id'] in calc_ids:
                    field_name = 'CalcField - ' + field['name']
                else:
                    field_name = 'Field - ' + field['name']
                nodes.append(field_name)
                edges.append((calc_field_name, field_name))

                # Create a reference to each sheet that uses this field
                for sheet in field['sheets']:
                    sheet_name = 'Sheet - ' + sheet['name']
                    nodes.append(sheet_name)
                    edges.append((field_name, sheet_name))
    return list(set(nodes)), edges
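
The loops above index straight into `referencedByFields`, `fields`, and `sheets`. Since it is not always clear when the API will return null for lineage assets, a null-tolerant traversal may be safer. The following is a sketch only: `iter_field_sheet_pairs` is a hypothetical helper, and `sample` is hand-written data shaped like one node from the query, not real Tableau output.

```python
def iter_field_sheet_pairs(calc):
    """Yield (field, sheet) dicts for one calculated field, skipping nulls.

    A field with no sheets is yielded once with sheet set to None.
    """
    for ref in calc.get("referencedByFields") or []:
        for field in ref.get("fields") or []:
            sheets = field.get("sheets") or []
            if not sheets:
                yield field, None
            for sheet in sheets:
                yield field, sheet


# Made-up example node with nulls in two different places.
sample = {
    "id": "c1",
    "name": "Profit Ratio",
    "referencedByFields": [
        {"fields": [
            {"id": "f1", "name": "Margin", "sheets": None},
            {"id": "f2", "name": "Target",
             "sheets": [{"id": "s1", "name": "Overview"}]},
        ]},
        {"fields": None},
    ],
}

pairs = [
    (field["name"], sheet["name"] if sheet else None)
    for field, sheet in iter_field_sheet_pairs(sample)
]
print(pairs)  # [('Margin', None), ('Target', 'Overview')]
```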



View the Edges and Nodes

Running all of the functions above creates the objects required to visualize your data lineage. The nodes and edges can be plugged into just about any network visualization tool, such as NetworkX, to view the output.

Python

calculated_fields = get_all_calculated_fields()
nodes, edges = format_nodes_and_edges(calculated_fields)

nodes
# [
#   ...
#   'CalcField - Click-to-Open',
#   'Sheet - Sheet 5',
#   'CalcField - Minutes of Delay per Flight',
#   'Sheet - Opportunities ',
#   ...
# ]

edges
# [
#   ...
#   ('CalcField - Difference from Region', 'Field - State'),
#   ('Field - State', 'Sheet - Obesity Scatter Plot'),
#   ('Field - State', 'Sheet - Obesity Map'),
#   ...
# ]
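
Because the API only returns lineage one step at a time, a multi-step lineage has to be stitched together from the individual edges. A breadth-first walk over the edge list is one simple way to do this. The sketch below uses only the standard library; `downstream` is a hypothetical helper name, and the sample edges are the three shown in the output above.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every node reachable from start by following edges forward."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            # Skip nodes already visited so that cycles cannot loop forever.
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


sample_edges = [
    ('CalcField - Difference from Region', 'Field - State'),
    ('Field - State', 'Sheet - Obesity Scatter Plot'),
    ('Field - State', 'Sheet - Obesity Map'),
]

affected = downstream(sample_edges, 'CalcField - Difference from Region')
print(sorted(affected))
# ['Field - State', 'Sheet - Obesity Map', 'Sheet - Obesity Scatter Plot']
```

Reversing each `(src, dst)` pair before calling the same function gives the upstream view, for example which calculated fields feed a given sheet.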



Closing Thoughts

I applaud Tableau for enabling this form of data access, even though I believe it is a highly underutilized benefit: many companies do not make full use of this metadata. Understanding how data moves and what depends on it is critical, especially as organizations try to maintain well-managed practices and controls around how their data is used. As you look to leverage Tableau metadata and data lineage within your company, make sure that you take the extra step of connecting that lineage with the upstream processes to get a complete view.

Published at DZone with permission of Grant Seward. See the original article here.
