Microsoft AI Involuntarily Exposed a Secret Giving Access to 38TB of Confidential Data for 3 Years

The story of how an overprovisioned SAS token exposed a massive 38TB trove of private data on GitHub for nearly three years.

Thomas Segura

Oct. 14, 23 · Analysis

The Wiz Research team recently discovered that an overprovisioned SAS token had been exposed on GitHub for nearly three years. The token granted access to a 38-terabyte trove of private data. The Azure storage account behind it also held additional secrets, such as private SSH keys, inside the disk backups of two Microsoft employees. The incident shows how a single misconfigured sharing link can undermine otherwise robust data security measures.

What Happened?

Wiz Research recently disclosed a data exposure incident that it found on Microsoft’s AI GitHub repository on June 23, 2023.

The researchers managing the GitHub used an Azure Storage sharing feature through an SAS token to give access to a bucket of open-source AI training data.

This token was misconfigured, giving access to the account's entire cloud storage rather than the intended bucket.

This storage comprised 38TB of data, including disk backups of two employees’ workstations. The backups contained secrets, private keys, passwords, and more than 30,000 internal Microsoft Teams messages.

SAS (Shared Access Signatures) are signed URLs for sharing Azure Storage resources. They are configured with fine-grained controls over how a client can access the data: what resources are exposed (full account, container, or selection of files), with what permissions, and for how long. See Azure Storage documentation.
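Because a SAS token is just a set of query parameters on a URL, its scope can be read directly from the link. The following is a minimal sketch, assuming the documented SAS parameter names (`ss`, `srt`, `sr`, `sp`, `se`, `si`, `sig`); the account name, container, expiry, and signature in the example URL are made up for illustration and are not the values from the incident.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameters used by Azure Storage SAS tokens, with a short description.
SAS_FIELDS = {
    "sv": "storage service version",
    "ss": "services (account SAS only)",
    "srt": "resource types (account SAS only)",
    "sr": "resource (service SAS only)",
    "sp": "permissions",
    "st": "start time",
    "se": "expiry time",
    "si": "stored access policy identifier",
    "sig": "signature",
}


def describe_sas(url):
    """Return the SAS-related query parameters found in a URL."""
    query = parse_qs(urlsplit(url).query)
    return {name: query[name][0] for name in SAS_FIELDS if name in query}


def is_account_sas(fields):
    """Account SAS tokens carry both `ss` and `srt`; service SAS tokens carry `sr`."""
    return "ss" in fields and "srt" in fields


# Hypothetical URL: every value below is invented for this example.
example = (
    "https://exampleaccount.blob.core.windows.net/models"
    "?sv=2019-12-12&ss=b&srt=sco&sp=rwdlac"
    "&se=2050-01-01T00:00:00Z&sig=FAKE"
)

fields = describe_sas(example)
for name, value in fields.items():
    print(f"{name:>3} = {value:<22} ({SAS_FIELDS[name]})")
print("account-level token:", is_account_sas(fields))
```

In this made-up example, the presence of `ss` and `srt` marks the token as account-level, `sp=rwdlac` goes well beyond read access, and `se` sets an expiry decades away.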

After Wiz disclosed the incident to Microsoft, the SAS token was invalidated. Nearly three years elapsed between the token's first commit to GitHub (July 20, 2020) and its revocation. The Wiz Research team published a timeline of these events in its report.

As the Wiz Research team emphasized, the root cause was a misconfigured Shared Access Signature (SAS).

Data Exposure

The token allowed anyone to access an additional 38TB of data, including sensitive data such as secret keys, personal passwords, and over 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees.

The Wiz team published an excerpt of some of the most sensitive data it recovered.

As the researchers highlighted, the token also granted write permissions rather than read-only access. An attacker could therefore have injected malicious code into the storage blobs, and that code would then have been downloaded and executed by every user (presumably an AI researcher) who trusted Microsoft's reputation. The result could have been a supply chain attack.

Security Risks

According to the researchers, Account SAS tokens such as the one described in their research pose a high security risk. These tokens are highly permissive and long-lived, and they escape the monitoring perimeter of administrators.

When a user generates a new token, it is signed on the client side and doesn't trigger any Azure event. To revoke a token, an administrator needs to rotate the account key that signed it, which revokes every other token signed with that key at the same time.
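The reason creation leaves no trace and revocation is all-or-nothing follows from how the signature works: it is an HMAC-SHA256 computed locally with the account key. The sketch below illustrates that idea only. The keys are made up, and the string-to-sign is a simplified stand-in, not the exact field layout Azure specifies.

```python
import base64
import hashlib
import hmac


def sign(account_key_b64, string_to_sign):
    """HMAC-SHA256 the string with the decoded account key, then base64-encode it."""
    key = base64.b64decode(account_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def verify(account_key_b64, string_to_sign, signature):
    """Server-side check: recompute the signature with the current key and compare."""
    return hmac.compare_digest(sign(account_key_b64, string_to_sign), signature)


# Made-up keys for illustration; real account keys are generated by Azure.
old_key = base64.b64encode(b"old-account-key").decode("ascii")
new_key = base64.b64encode(b"rotated-account-key").decode("ascii")

# Simplified stand-ins for two tokens' signed fields (not Azure's real layout).
broad_token = "\n".join(["exampleaccount", "rwdlac", "b", "sco", "2050-01-01T00:00:00Z"])
narrow_token = "\n".join(["exampleaccount", "r", "b", "o", "2023-12-31T00:00:00Z"])

# Signing happens entirely offline: no request is sent, so nothing is logged.
broad_sig = sign(old_key, broad_token)
narrow_sig = sign(old_key, narrow_token)

valid_before_rotation = verify(old_key, broad_token, broad_sig)

# After rotation, every token signed with the old key fails, not just the bad one.
broad_valid_after = verify(new_key, broad_token, broad_sig)
narrow_valid_after = verify(new_key, narrow_token, narrow_sig)

print(valid_before_rotation, broad_valid_after, narrow_valid_after)
```

Nothing in this flow contacts the storage service until a token is actually used, which is why an administrator has no list of issued tokens to review.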

Ironically, the security risk of a Microsoft product feature (Azure SAS tokens) caused an incident for a Microsoft research team. The same risk was recently referenced in the second version of the Microsoft threat matrix for storage services.

Secrets Sprawl

This example underscores the pervasive issue of secrets sprawl within organizations, even those with advanced security measures. It shows how an AI research team, or any data team, can independently create tokens that put the organization at risk, and how those tokens can sidestep the security safeguards designed to protect the environment.

Mitigation Strategies

For Azure Storage Users:

1 - Avoid Account SAS Tokens

The lack of monitoring makes this feature a security hole in your perimeter. A better way to share data externally is using a Service SAS with a Stored Access Policy. This feature binds a SAS token to a policy, providing the ability to centrally manage token policies.

Better still, if you don't need this Azure Storage sharing feature, simply disable SAS access for each storage account you own.
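The preferences above can be turned into a quick check on any SAS link before it is shared. This is a rough sketch under stated assumptions: the function name `sas_findings`, the seven-day lifetime limit, and the read/list-only rule are illustrative policy choices, not Azure defaults, and the URLs are invented.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

MAX_LIFETIME = timedelta(days=7)  # assumed internal policy, not an Azure default


def sas_findings(url, now):
    """Return a list of policy concerns for a SAS URL (empty if none found)."""
    query = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    findings = []
    if "ss" in query and "srt" in query:
        findings.append("account SAS: scope is the whole storage account")
    if "si" not in query:
        findings.append("no stored access policy: not centrally managed")
    # With a stored access policy, permissions and expiry may live in the policy.
    if set(query.get("sp", "")) - {"r", "l"}:
        findings.append("grants more than read/list permissions")
    expiry = query.get("se")
    if expiry:
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires_at.tzinfo is None:  # date-only expiry values carry no offset
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        if expires_at - now > MAX_LIFETIME:
            findings.append("expiry is more than 7 days away")
    return findings


# Hypothetical URLs: account, container, policy name, and signatures are made up.
risky = (
    "https://exampleaccount.blob.core.windows.net/"
    "?sv=2019-12-12&ss=b&srt=sco&sp=rwdlac&se=2050-01-01T00:00:00Z&sig=FAKE"
)
scoped = (
    "https://exampleaccount.blob.core.windows.net/models"
    "?sv=2019-12-12&sr=c&si=read-only-policy&sig=FAKE"
)

now = datetime(2023, 10, 14, tzinfo=timezone.utc)
for finding in sas_findings(risky, now):
    print("risky: ", finding)
print("scoped:", sas_findings(scoped, now))
```

A check like this only inspects the link itself. It cannot see what the stored access policy behind `si` actually allows, so the policy still needs its own review.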

2 - Enable Azure Storage Analytics

Active SAS token usage can be monitored through the Storage Analytics logs for each of your storage accounts. Azure Metrics allows the monitoring of SAS-authenticated requests and identifies storage accounts that have been accessed through SAS tokens, for up to 93 days.

For All:

1 - Audit Your GitHub Perimeter for Sensitive Credentials

With around 90 million developer accounts, 300 million hosted repositories, and 4 million active organizations, including 90% of Fortune 100 companies, GitHub holds a much larger attack surface than meets the eye.

In 2022, GitGuardian uncovered 10 million leaked secrets on public repositories, up 67% from the previous year.

GitHub must be actively monitored as part of any organization's security perimeter. Incidents involving leaked credentials on the platform continue to cause massive breaches for large companies, and this gap in Microsoft's defenses is reminiscent of the Toyota data breach from a year ago.

On October 7, 2022, Toyota, the Japan-based automotive manufacturer, revealed that it had accidentally exposed a credential allowing access to customer data in a public GitHub repo for nearly five years. The code was public from December 2017 through September 2022.

If your company has development teams, it is likely that some of your company's secrets (API keys, tokens, passwords) end up on public GitHub. It is therefore highly recommended to audit your GitHub attack surface as part of your attack surface management program.
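As a small illustration of what such an audit looks for, the sketch below scans text for Azure Storage URLs that carry a `sig=` parameter. The regular expression and the function name `find_sas_urls` are assumptions made for this example, and the sample content is invented. A dedicated secrets scanner covers far more credential types and also searches commit history.

```python
import re

# Azure storage account names are 3-24 lowercase letters and digits.
# This pattern is a rough heuristic and may miss unusual or custom-domain URLs.
SAS_URL = re.compile(
    r"https://[a-z0-9]{3,24}\.(?:blob|file|queue|table|dfs)\.core\.windows\.net"
    r"[^\s\"'<>]*[?&]sig=[A-Za-z0-9%+/=]+"
)


def find_sas_urls(text):
    """Return (line_number, url) pairs for every SAS-looking URL in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in SAS_URL.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Invented file content: the account, container, and signature are not real.
sample = "\n".join([
    "# README",
    "Download the models from:",
    "https://exampleaccount.blob.core.windows.net/models?sv=2019-12-12&sp=r&sig=FAKE%2Fsignature",
    "Docs: https://exampleaccount.blob.core.windows.net/public/readme.txt",
])

hits = find_sas_urls(sample)
for lineno, url in hits:
    print(f"line {lineno}: {url}")
```

Only the third line is reported, because the plain documentation link on the fourth line has no signature attached.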

Final Words

Every organization, regardless of size, needs to be prepared to tackle a wide range of emerging risks. These risks often stem from insufficient monitoring of the extensive software operations inside modern enterprises. In this case, an AI research team inadvertently created and exposed a misconfigured cloud storage sharing link, bypassing security guardrails. But how many other departments (support, sales, operations, or marketing) could find themselves in a similar situation? The increasing dependence on software, data, and digital services amplifies cyber risks on a global scale.

Combating the spread of confidential information and its associated risks requires reevaluating the oversight and governance capabilities of security teams.

Published at DZone with permission of Thomas Segura. See the original article here.

Opinions expressed by DZone contributors are their own.
