Secrets in Code: Understanding Secret Detection and Its Blind Spots
Even the best secret scanners miss valid tokens in open-source projects. This research shares data and practical tips for safer API-key design.
In a world where attackers routinely scan public repositories for leaked credentials, secrets in source code represent a high-value target. But even with the growth of secret detection tools, many valid secrets still go unnoticed. It’s not because the secrets are hidden, but because the detection rules are too narrow or overcorrect in an attempt to avoid false positives. This creates a trade-off between wasting development time investigating false signals and risking a compromised account.
This article highlights research that uncovered hundreds of valid secrets from various third-party services publicly leaked on GitHub. Responsible disclosure of the specific findings is important, but the broader learnings include which types of secrets are common, the patterns in their formatting that cause them to be missed, and how scanners work so that their failure points can be improved.
Further, for platforms that are accessed with secrets, there are actionable improvements that can better protect developer communities.
What Are “Secrets” in Source Code?
When we say “secrets,” we’re not only talking about API tokens. Secrets include any sensitive value that, if exposed, could lead to unauthorized access, account compromise, or data leakage. This includes:
- API Keys: Tokens issued by services like OpenAI, GitHub, Stripe, or Gemini.
- Cloud Credentials: Access keys for managing AWS cloud resources or infrastructure.
- JWT Signing Keys: Secrets used to sign or verify JSON Web Tokens, often used in authentication logic.
- Session Tokens or OAuth Tokens: Temporary credentials for session continuity or authorization.
- One-Time Use Tokens: Password reset tokens, email verification codes, or webhook secrets.
- Sensitive User Data: Passwords or user attributes included in authentication payloads.
Secrets can be hardcoded, generated dynamically, or embedded in token structures like JWTs. Regardless of the specific form, the goal is always to keep them out of source control management systems.
How Secret Scanners Work
Secret scanners generally detect secrets using patterns. For example, a GitHub Personal Access Token (PAT) like:
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n
might be matched by a regex rule such as:
ghp_[A-Za-z0-9]{36}
To reduce false positives that string literal matching alone might flag, scanners often rely on:
- Validation: Once a match is found, some tools will try to validate the secret is in fact a secret and not a placeholder example. This can be done by contacting its respective service. Making an authentication request to an API and interpreting the response code would let the scanner know if it is an active credential.
- Word Boundaries: Ensure the pattern is surrounded by non-alphanumeric characters (e.g.
\bghp_...\b), to avoid matching base64 blobs or gibberish. - Keywords: Contextual terms nearby (e.g. “github” or “openai”) can better infer the token’s source or use.
This works well for many credential-like secrets, but for some tools this isn’t done in a way that is much more clever than running grep.
Take another example:
const s = "h@rdc0ded-s3cr3t";
const t = jwt.sign(payload, s);
There’s no unique prefix in cases like this. No format. But it’s still a secret, and if leaked, it could let an attacker forge authentication tokens. Secret scanners that only look for credential-shaped strings would miss this entirely.
A Few Common Secret Blind Spots
1. Hardcoded JWT Secrets
In a review of over 2,000 Node.js modules using popular JWT libraries, many hardcoded JWT secrets were found:
const opts = { secretOrKey: "hardcoded-secret-here" };
passport.use(new JwtStrategy(opts, verify));
These are not always caught by conventional secret scanners, because they don’t follow known token formats. If committed to source control, they can be exploited to sign or verify forged JWTs. The semantic data flow of a hardcoded secret to an authorization function can lead to much better results.
2. JWTs With Sensitive Payloads
A subtle but serious risk occurs when JWTs are constructed with entire user objects, including passwords or admin flags:
const token = jwt.sign(user, obj);
This often happens when working with ORM objects like Mongoose or Sequelize. If the model evolves over time to include sensitive fields, they may inadvertently end up inside issued tokens. The result: passwords, emails, or admin flags get leaked in every authentication response.
3. Secrets Hidden by Word Boundaries
In a separate research survey project, hundreds of leaks were detected from overfitting word boundaries. Word boundaries (\b) in regex patterns are used to reduce noise by preventing matches inside longer strings. But they also miss secrets embedded in HTML, comments, or a misplaced paste:
{/* <CardComponent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nents> */}
Scanners requiring clean boundaries around the token will miss this even if the secret is valid.
Similarly, URL-encoded secrets (like in logs or scripts) are frequently overlooked:
%22Bearer%20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n%22
Scanning GitHub Repos and Finding Missed Secrets
We wanted to learn how to better tune a tool and make adjustments for non-word boundary checks so tested it with the best secret scanning tools on the market for strengths and weaknesses: GitHub, GitGuardian, Kingfisher, Semgrep, and Trufflehog.
The main tokens discovered across a wide number of open-source projects were GitHub classic and fine-grained PATs, in addition to AI services such as OpenAI, Anthropic, Gemini, Perplexity, Huggingface, xAI, and Langsmith. Less common but also discovered were email providers and developer platform keys.
- We found that few providers we tested detected the valid tokens associated with GitHub.
- GitHub’s default secret scanning did not detect OpenAI tokens within word-boundaries, this includes push protection and once leaked within a repository.
- The other tokens varied per-provider; some detected or missed Anthropic, Gemini, Perplexity, Huggingface, xAI, Deepseek and others.
The keys were missed due to either overly strict non-word boundaries or looking for specific keywords that either were in the wrong place or did not exist in the file.

Some of the common problem classes with non-word boundaries include: unintentional placement, terminal output, encodings and escape formats, non-word character end-lines, unnecessary boundaries, or generalized regex.
Common Token Prefixes and Pattern Examples
Here's a sampling of secret token formats that scanners might detect or miss. The reasons for this include the word boundary problems but also the non-unique prefixes can prevent the ability to validate against an authorization endpoint as a true secret that has been leaked.
| Service Provider | patterns | risk factors |
|---|---|---|
| GitHub | ghp_ github_pat_ gho_ ghu_ ghr_ ghs_ |
Multiple formats to look for. Often can be missed if embedded in strings or URL-encoded. |
| OpenAI | sk- | Using a hyphen can break some boundary-based detection methods. Ambiguity due to overlap with DeepSeek, but inclusion of T3BlbkFJ pattern in some formats can be a signal, but not consistently used. |
| DeepSeek | sk- | Using a hyphen can break some boundary-based detection methods. Easily misclassified as OpenAI without additional hints. |
| Anthropic | sk-ant- | Using a hyphen can break some boundary-based detection methods. End pattern of AA and ant- helps with unique identifiecation. |
| Stripe | sk_live_ sk_test_ |
Shares prefix with other service providers creating collisions for auth validation when discovered. |
| APIDeck | sk_live_ sk_test_ |
Shares prefixes with Stripe which makes validation difficult. |
| Groq | gsk_ | Similar format but has slightly different identifier which can help with uniqueness. |
| Notion | secret_ | Common prefix for many services increases prevalence of false positives by not being able to validate authentication. |
| ConvertAPI | secret_ | Common prefix for many services increases prevalence of false positives by not being able to validate authentication. |
| LaunchDarkly | api- | Common prefix for many services increase prevalence of false positives by not being able to validate authentication. |
| Robinhood | api- | Common prefix for many services increase prevalence of false positives by not being able to validate authentication. |
| Nvidia | nvapi- | Allows string to end in a hyphen (-) which can break some boundary-based detection methods. |
This is just a sample of the many platforms that have secrets. To help safeguard them it is important to distinguish between an example placeholder and the real thing, so being able to uniquely identify the source becomes challenging.
Improving Secret Detection
To improve the accuracy and completeness of secret detection, consider the following strategies:
For Development Teams
- Avoid hardcoded secrets. Use environment variables or secret managers even if only meant to be a placeholder example because it can fire false positives and risk missing true positives when they occur.
- Use static analysis. Catch patterns like string literals in crypto functions but also data flow patterns that can cross between files (inter-file) to expose secrets in unexpected ways that can be caught.
- Automate checking your codebase. Use tools that continuously monitor source code check-ins, preferably through pre-commit hooks to identify whenever secrets are accidentally introduced into the code base. Relying on your SCM provider to do this is not often enough.
For Service Providers
- Use unique, identifiable prefixes for secrets. It helps with detection.
- Document exact token formats because the transparency makes it easier for tools to catch it. Offer validation endpoints so that development teams can be confident in any findings being true positives.
- Expire or encourage rotating tokens automatically to minimize damage.
Conclusion
Secrets aren’t always easy to spot. They’re not always wrapped in clear delimiters, and they don’t always look like credentials. Sometimes they hide in authentication logic, passed into token payloads, or hardcoded during development.
We explained how secret detection works, where it falls short, and how real-world leaks occur in ways many scanners don’t expect. From hardcoded JWT secrets to misplaced token strings, the cost of undetected secrets is high but preventable.
Comments