Microsoft Fabric: The Developer's Guide on API Automation of Security and Data Governance

The article discusses popular scenarios for automation governance in Microsoft Fabric, which may help organizations more easily govern Fabric.

Iurii Iurchenko

Mar. 19, 26 · Analysis

Likes (1)

Comment

Save

3.6K Views

While working with Data Analytics Systems, it is crucial to understand what is happening with the data, who can see specific data, which data we already have in the system, and which should be ingested. This is a typical business challenge that most companies face after implementing a new data analytics solution.

That article observes the automation of the two most critical parts of governance, which we may face in Microsoft Fabric:

How to understand which tables in which Lakehouses already exist, to avoid duplicated data ingestion and additional related costs.
How to automate hours of work for gathering permissions across workspaces in Microsoft Fabric, on who has access to what in Microsoft Fabric, especially when the company has 10-100+ workspaces, and it becomes a nightmare to manage.

Those questions are emerging across different organizations. For that purpose, we will use elegant code that requires only the necessary rights to execute Spark Notebooks and no additional security configuration, which will use Microsoft Fabric's native API to gather the required information.

To run that code, the only necessary step is to create a Spark Notebook inside Microsoft Fabric. Then, paste the code below and run it in the notebook to yield the desired results.

Solution 1: Get Already Ingested Tables Across All Lakehouses in All Workspaces

So, that solution gives us comprehensive data on which delta tables exist across all data lakes in Microsoft Fabric. To build that solution, you may utilize the following code:

    Python
   
 

   import requests
import pandas as pd
from notebookutils import mssparkutils
from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()

# 1. Get Fabric API token
token = mssparkutils.credentials.getToken(
    "https://api.fabric.microsoft.com"
)

headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json"
}

BASE_URL = "https://api.fabric.microsoft.com/v1"

# 2. Get workspaces visible to you
workspaces_response = requests.get(
    f"{BASE_URL}/workspaces",
    headers=headers
)

workspaces_response.raise_for_status()
workspaces = workspaces_response.json().get("value", [])

print(f"Found {len(workspaces)} workspaces")


# 3. Collect lakehouses + tables
rows = []

for ws in workspaces:
    ws_id = ws["id"]
    ws_name = ws["displayName"]

    # Get lakehouses
    lakehouses_response = requests.get(
        f"{BASE_URL}/workspaces/{ws_id}/lakehouses",
        headers=headers
    )
    lakehouses_response.raise_for_status()
    lakehouses = lakehouses_response.json().get("value", [])

    for lh in lakehouses:
        lh_id = lh["id"]
        lh_name = lh["displayName"]

        # Get tables
        tables_response = requests.get(
            f"{BASE_URL}/workspaces/{ws_id}/lakehouses/{lh_id}/tables",
            headers=headers
        )
        
        try:
            print("Success get tables")
            tables_response.raise_for_status()

            tables = tables_response.json().get("data", [])

            for t in tables:
                rows.append({
                    "Workspace": ws_name,
                    "Lakehouse": lh_name,
                    "TableName": t.get("name"),
                    "TableType": t.get("type"),
                    "Location": t.get("location"),
                    "format": t.get("format")
                })
        except:
            print("Failed to get tables")

# 4. Convert to Spark DataFrame
if rows:
    df = pd.DataFrame(rows)
    spark_df = spark.createDataFrame(df)
    spark_df.createOrReplaceTempView("fabric_lakehouse_inventory")
    print("Temp view created: fabric_lakehouse_inventory")
else:
    print("No lakehouses or tables found.")
  

After the Spark View is created earlier, you can run the necessary query to view the data.

    Python
   
 

   sql = '''
  select
      Workspace
      ,Lakehouse
      ,TableName
      ,TableType
      ,Location
      ,format
  from fabric_lakehouse_inventory
'''

display(spark.sql(sql))
  

This code snippet returns all tables in the lakehouses, along with their supporting information. As an example, if you need to find tables that potentially have information related to payments, you may add a filter expression to the query above, using a Spark SQL alternative to the Python one above (with a marker %sql at the beginning, to trigger Spark SQL execution).

    SQL
   
 

     %%sql
  select
      Workspace
      ,Lakehouse
      ,TableName
      ,TableType
      ,Location
      ,format
  from fabric_lakehouse_inventory
  where lower(TableName) like '%payment%'
  

This query will help identify all tables where payments may be stored.

Solution 2: Get Users Per Workspace Insights in Microsoft Fabric

Now, when we figured out how to deal with tables in Microsoft Fabric, we can deep dive into one more challenge people face in Fabric, related to understanding who has rights to what workspace in Microsoft Fabric across all workspaces.

It is a critical part because, by default, Microsoft Fabric does not provide the report with that information in a convenient way, and without full admin rights to the system.

For that purpose, we may utilize that code:

    Python
   
 

   import requests
import pandas as pd
from notebookutils import mssparkutils

# 1. Get access token from Fabric session
token = mssparkutils.credentials.getToken(
    "https://analysis.windows.net/powerbi/api"
)

headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json"
}

# 2. Get workspaces visible to you
workspaces_response = requests.get(
    "https://api.powerbi.com/v1.0/myorg/groups",
    headers=headers
)

workspaces = workspaces_response.json().get("value", [])

print(f"Found {len(workspaces)} workspaces")

# 3. Collect users per workspace
rows = []

for ws in workspaces:
    ws_id = ws["id"]
    ws_name = ws["name"]

    users_response = requests.get(
        f"https://api.powerbi.com/v1.0/myorg/groups/{ws_id}/users",
        headers=headers
    )

    users = users_response.json().get("value", [])

    for u in users:
        rows.append({
            "Workspace": ws_name,
            "Principal": u.get("identifier"),
            "Role": u.get("groupUserAccessRight"),
            "Type": u.get("principalType")
        })

# 4. Convert to DataFrame
df = pd.DataFrame(rows)
spark_df = spark.createDataFrame(df)
spark_df.createOrReplaceTempView("workspace_access_view")
print("Temp view created: workspace_access_view")
  

After running that code, we may run a query against a Spark view in a similar manner (with a marker %sql at the beginning, to trigger Spark SQL execution):

    SQL
   
 

   %%sql
select
  Workspace
  ,Principal
  ,Role
  ,Type
from workspace_access_view
  

As a result, it is easy to understand who has what access.

As you can see, neither solution requires any additional secret configuration. They natively leverage Microsoft Fabric's capabilities to obtain API tokens and proceed.

What Is Next?

Now, let us consider what else can be done or how the code above may be extended to deliver even greater value to the company.

Extension 1: Automated Data Pipeline With the Reporting

The examples above solve the common cases that often require significant human resources to handle manually. So, after creating a base notebook to review the necessary information occasionally, the logical next step may be to build a data pipeline that gathers daily snapshots of the information into some maintenance tables. These pipelines may ingest on a daily or weekly basis the data, which may help answer the following business questions:

Which tables were added recently?
Which tables were removed recently?
Who has access to that workspace on that date or date range?

The system design of that pipeline will include:

Spark notebook, which creates a Spark dataframe and appends it to the desired admin table
Data pipeline, to schedule that process
A PBI or other report on top of that admin table, to show the necessary information.

Extension 2: Security Solution to Scan Notebooks for Sensitive Information

Even though this article doesn't mention additional features that the API can bring to the system, a curious reader may extend these capabilities by others. For example, it is relatively easy to build a similar solution, using slightly different APIs that scan all Spark notebooks for the specific text. That way, it is possible to build a solution that, for example, scans Spark notebooks. It may result in building a data lineage solution for the notebooks or a security checker for the notebook code, for example, to identify open keys or other sensitive information in the code and raise an alert. The limitation here is only the organization's need and the available engineer's time to implement that solution.

Conclusion

APIs in Microsoft Fabric open the door to building many features on demand. So, the code snippets above may be very useful as examples for further expansion on governance in Microsoft Fabric. They can be easily converted into respective data pipelines, which can be run daily to provide the company with valuable insights into its data assets.

API security SPARK (programming language)

Opinions expressed by DZone contributors are their own.

Related

Trending