Why Security Scanning Isn't Enough for MCP Servers

A secure MCP server can still break production. Twenty heuristic rules score readiness by catching missing timeouts, unsafe retries, and absent error schemas.

Nik Kale

Mar. 19, 26 · Analysis


The Gap Nobody Is Talking About

The Model Context Protocol (MCP) is quickly becoming the de facto standard interface between AI agents and the tools they use. Adoption is growing rapidly: from coding assistants to enterprise automation platforms, MCP servers are replacing custom API integrations.

As MCP has grown, the security community has stepped up with tools to address the resulting threats. Cisco's open-source MCP scanner, Invariant Labs' MCP analyzer, and the OWASP MCP Cheat Sheet help organizations identify malicious MCP tool definitions, prompt injection vectors, and supply chain risks. These are significant efforts. But here's the problem: a secure MCP server can still take down your production environment.

Security scanners answer the question "Is this tool malicious?" They do not answer "Will this tool behave reliably when called 10,000 times at 3 AM during an incident?" That second question is what separates a demo from a production deployment, and it's a question almost nobody systematically asks.

I built a Readiness Analyzer to answer it, and contributed it to Cisco's MCP Scanner. Here's what I learned about the gap, and how to close it.

The Production Readiness Problem

Consider a typical MCP tool definition:

    JSON

    {
      "name": "execute_query",
      "description": "Run a database query",
      "inputSchema": {
        "type": "object",
        "properties": {
          "query": { "type": "string" }
        }
      }
    }

A security scanner would look for prompt injection patterns in the description, check whether the input schema allows dangerous inputs, and compare the tool's actual behavior with its declared behavior. All important.

But from an operational standpoint, this tool definition is a minefield:

No timeout specified. A single slow query can hang the entire agent workflow indefinitely.
No retry configuration. If the connection to the database drops, does the agent give up, retry forever, or retry with backoff?
No error response schema. What does the agent see when this tool fails: an HTTP 500, a Python traceback, or nothing?
No input validation hints. The schema accepts any string, including a SELECT * on a 500GB table.
No rate limit guidance. An autonomous agent could continuously hammer this endpoint in a tight loop.

None of these is a security vulnerability. All of them will cause production incidents.
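One way to close those gaps is sketched below as a Python dict, so it can be serialized and checked in tests. The `maxLength` keyword is standard JSON Schema and the `annotations` hints come from the MCP specification. The `x-operational` block and every value in it are assumptions made for illustration: MCP defines no standard fields for timeouts, retries, or rate limits, and the article does not say which fields the analyzer reads.

```python
import json

# Hypothetical hardened variant of the execute_query definition above.
hardened_tool = {
    "name": "execute_query",
    "description": (
        "Run a read-only SQL SELECT against the reporting database. "
        "Times out after 30 seconds. On failure returns a JSON error with "
        "'code', 'message', and 'retryable' fields."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            # Bounded string instead of "any string".
            "query": {"type": "string", "minLength": 1, "maxLength": 2000},
        },
        "required": ["query"],
        "additionalProperties": False,
    },
    # Behavior hints defined by the MCP specification.
    "annotations": {"readOnlyHint": True, "idempotentHint": True},
    # ASSUMPTION: not part of MCP; an illustrative home for operational limits.
    "x-operational": {
        "timeoutSeconds": 30,
        "maxRetries": 3,
        "backoff": "exponential-jitter",
        "rateLimitPerMinute": 60,
    },
}

print(json.dumps(hardened_tool, indent=2))
```

Stating the timeout and error shape in the description as well as in structured fields means both the LLM choosing the tool and any static checker can see them.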

From Lesson to Analyzer: 20 Heuristic Rules

After repeatedly seeing these patterns while shipping tools into production, I designed a static analysis engine with 20 heuristic rules organized into eight categories. The goal was to create a "production readiness score," a single number (0-100) that tells you whether an MCP tool is ready for real workloads.

Static readiness analysis is not unique to MCP. Teams already use readiness checklists for Kubernetes deployments, APIs, microservice health checks, and more. What makes MCP different is that tool definitions carry enough metadata to run this kind of analysis statically. Until now, though, the rules had not been documented.

The Rule Categories

Timeout Guards (HEUR-001, HEUR-002)

Missing timeouts are the most common production failure mode for MCP tools. When an agent calls a tool that triggers a network request, database query, or other external API call, and there is no timeout, a single slow response can cascade through the agent's workflow. The analyzer checks whether tool definitions include timeouts and whether they are reasonable for the type of operation.
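A heuristic of this kind can be very small. The sketch below is not the analyzer's actual rule; the pattern and the decision to look at both the description and the input property names are assumptions, shown only to illustrate what a regex-based timeout check might look like.

```python
import re

# ASSUMPTION: illustrative pattern, deliberately loose. It matches "timeout",
# "time_out", "times out", and "deadline" anywhere in the text.
TIMEOUT_PATTERN = re.compile(r"times?[\s_-]?out|deadline", re.IGNORECASE)

def mentions_timeout(tool: dict) -> bool:
    """Return True if the description or any input property name hints at a timeout."""
    if TIMEOUT_PATTERN.search(tool.get("description", "")):
        return True
    properties = tool.get("inputSchema", {}).get("properties", {})
    return any(TIMEOUT_PATTERN.search(name) for name in properties)
```

A loose pattern trades false positives for coverage, which is why thresholds and patterns in a heuristic engine usually need tuning against real tool definitions.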

Retrying (HEUR-003, HEUR-004)

Retries without a limit result in infinite loops; retries without exponential backoff result in "thundering herds." The analyzer flags tools that provide no retry configuration or that retry indefinitely without exponential backoff or jitter.
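For reference, the behavior these rules look for is a bounded retry loop with exponential backoff and jitter. The following is a minimal client-side sketch; the function name, defaults, and the choice to retry only on `ConnectionError` are assumptions for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to max_attempts times, sleeping with full jitter between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the error
            # Exponential ceiling (0.5s, 1s, 2s, ...) capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random delay in [0, ceiling] spreads out retries
            # from many clients so they do not arrive in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the loop can be tested without waiting.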

Error Handling (HEUR-005, HEUR-006, HEUR-007)

When an MCP tool fails, the agent needs structured error data to decide what to do next: retry, fall back to an alternative, or escalate to a human. The analyzer checks whether tools provide error response schemas, document error classifications, and describe their failure modes.
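To make the retry, fallback, or escalate decision concrete, here is a sketch of a classified error and the decision it enables. The category names, the `ToolError` type, and the decision policy are all hypothetical; the point is only that a classification lets the agent branch instead of guessing from a traceback.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorCategory(Enum):
    TRANSIENT = "transient"      # e.g. connection reset, timeout
    UNAVAILABLE = "unavailable"  # e.g. backend down for maintenance
    PERMANENT = "permanent"      # e.g. missing permission, unknown table

@dataclass(frozen=True)
class ToolError:
    category: ErrorCategory
    message: str

def next_action(error: ToolError, attempts_left: int) -> str:
    """Map a structured error to one of: 'retry', 'fallback', 'escalate'."""
    if error.category is ErrorCategory.TRANSIENT and attempts_left > 0:
        return "retry"
    if error.category in (ErrorCategory.TRANSIENT, ErrorCategory.UNAVAILABLE):
        return "fallback"  # retries exhausted or backend down: try an alternative
    return "escalate"      # permanent errors need a human or a different plan
```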

Quality of Description (HEUR-009, HEUR-010, HEUR-016 – 020)

This may look like a documentation issue, but it is a readiness issue. LLMs use tool descriptions to select and invoke tools. If a description is ambiguous, the tool will be misused (the wrong parameter, at the wrong time, and so on). The analyzer therefore evaluates description quality in terms of length, specificity, and whether it states preconditions, side effects, and scope limitations.

Input Validation (HEUR-011, HEUR-012)

Beyond the schema type, production tools need input validation constraints such as string length limits, enumerated values for categorical inputs, and range bounds for numeric inputs. Without them, an autonomous agent will sooner or later supply inputs that are technically valid but operationally catastrophic.
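A check along these lines can be written against standard JSON Schema keywords. The sketch below is a simplified stand-in, not the analyzer's rule: it handles only top-level properties with a single `type`, and the choice of which keywords count as a bound is an assumption.

```python
# ASSUMPTION: which JSON Schema keywords count as "a constraint" is a judgment call.
STRING_BOUNDS = {"maxLength", "enum", "pattern", "format", "const"}
NUMBER_BOUNDS = {"minimum", "maximum", "exclusiveMinimum", "exclusiveMaximum", "enum", "const"}

def unconstrained_inputs(tool: dict) -> list[str]:
    """Return names of string or numeric properties that declare no bounds."""
    findings = []
    properties = tool.get("inputSchema", {}).get("properties", {})
    for name, spec in properties.items():
        kind = spec.get("type")
        # isdisjoint() is True when the property uses none of the bounding keywords.
        if kind == "string" and STRING_BOUNDS.isdisjoint(spec):
            findings.append(name)
        elif kind in ("integer", "number") and NUMBER_BOUNDS.isdisjoint(spec):
            findings.append(name)
    return findings
```

Run against the `execute_query` definition from earlier, this reports the `query` property, since it is a bare string.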

Operational Configuration (HEUR-008, HEUR-013, HEUR-014)

Rate limits, concurrency bounds, and resource quotas are the control mechanisms that prevent a well-intentioned agent from overloading a backend service. The analyzer flags tools that perform write operations or resource-intensive queries without operational guardrails.
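As an illustration of the kind of guardrail meant here, a token bucket is one common way to cap how fast an agent can call a tool. This is a generic sketch with hypothetical names, not something the scanner ships; it only shows what a documented rate limit would let a client enforce.

```python
import time
from typing import Callable

class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

An agent loop would call `try_acquire()` before each tool invocation and wait or yield when it returns `False`.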

Resource Management (HEUR-015)

Tools that open connections, file handles, or sessions need corresponding cleanup semantics. The analyzer checks whether tools that establish resources describe their lifecycle, which is particularly important for long-running agent workflows that invoke hundreds of tools in a single session.

Safety Checks

Safety checks are cross-cutting rules that will identify patterns such as missing idempotence declarations on write operations, no pagination on list endpoints, and modifications to state without describing reversibility.
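Two of these patterns can be approximated from a tool's name, its input schema, and its MCP `annotations`. In the sketch below, `idempotentHint` is a hint defined by the MCP specification, while the verb list, the pagination parameter names, and the name-prefix matching are assumptions chosen for illustration.

```python
# ASSUMPTION: verb prefixes and paging parameter names are illustrative guesses.
WRITE_VERBS = ("create", "update", "delete", "write", "send", "insert")
PAGING_PARAMS = {"cursor", "limit", "page", "page_size", "offset"}

def safety_findings(tool: dict) -> list[str]:
    """Flag write tools without an idempotence hint and list tools without paging."""
    findings = []
    name = tool.get("name", "").lower()
    annotations = tool.get("annotations", {})
    properties = tool.get("inputSchema", {}).get("properties", {})

    # Write-like tools should say whether repeating a call is safe.
    if name.startswith(WRITE_VERBS) and "idempotentHint" not in annotations:
        findings.append("write operation without idempotence declaration")

    # List-like tools should expose some way to page through results.
    if name.startswith(("list", "search")) and PAGING_PARAMS.isdisjoint(properties):
        findings.append("list endpoint without pagination")
    return findings
```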

The Readiness Score

Each finding carries a severity weight (HIGH, MEDIUM, LOW, INFO). The analyzer aggregates these into a readiness score from 0-100, with a production-ready threshold of 70. This isn't a pass/fail; it's a signal to engineering teams about where to invest effort before deployment.

A score of 92 indicates that this tool was built with great care and will likely meet your organization's operational requirements. Conversely, a readiness score of 55 indicates that this tool works as expected during demonstration but may struggle to meet the demands of a real-world production environment.
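The article does not give the aggregation formula, so the following is only a sketch of one plausible scheme: start at 100 and subtract a penalty per finding. The penalty weights are invented for illustration, and treating the threshold of 70 as inclusive is also an assumption.

```python
# ASSUMPTION: these penalty weights are invented, not the analyzer's real values.
PENALTY = {"HIGH": 25, "MEDIUM": 10, "LOW": 5, "INFO": 1}
PRODUCTION_READY_THRESHOLD = 70  # threshold stated in the article

def readiness_score(severities: list[str]) -> int:
    """Start from 100 and subtract a penalty per finding, flooring at 0."""
    return max(0, 100 - sum(PENALTY[s] for s in severities))

def is_production_ready(score: int) -> bool:
    # ASSUMPTION: a score equal to the threshold counts as ready.
    return score >= PRODUCTION_READY_THRESHOLD
```

Under these made-up weights, a tool with one HIGH and two MEDIUM findings would land below the threshold, while a tool with a single INFO finding would stay well above it.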

Architecture: Designed for Extension

The Readiness Analyzer follows a provider abstraction pattern with three tiers:

Tier 1: The Heuristic Engine (Zero Dependencies)

This is a self-contained engine that performs static analysis using regular expressions, string matching, and schema inspection. It makes no API calls, uses no external services, and requires no special configuration. This was a deliberate design decision: the baseline scanner should run in CI/CD pipelines, air-gapped environments, and even on a developer's laptop without requiring any configuration beyond installing the package.

Tier 2: OPA Policy Provider (Optional)

If your organization already has policy-based infrastructure in place, the analyzer can evaluate each tool's definition against Rego policies. This lets teams write their own operational standards (for example, all tools in the payments namespace must specify a timeout under 5 seconds) and have them enforced automatically.
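The payments example can be stated precisely. OPA evaluates policies written in Rego; the sketch below expresses the same rule in plain Python purely to show the logic, not as something OPA would run. The `payments.` name prefix and the `timeoutSeconds` field are hypothetical, since the article does not specify how namespaces or timeouts appear in a tool definition.

```python
from typing import Optional

def payments_timeout_violation(tool: dict, limit_seconds: float = 5.0) -> Optional[str]:
    """Return a violation message, or None when the tool complies or is out of scope."""
    name = tool.get("name", "")
    if not name.startswith("payments."):
        return None  # the policy applies only to the payments namespace
    timeout = tool.get("timeoutSeconds")  # hypothetical field
    if timeout is None:
        return f"{name}: no timeout specified"
    if timeout >= limit_seconds:
        return f"{name}: timeout {timeout}s is not under {limit_seconds}s"
    return None
```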

Tier 3: LLM Semantic Analysis (Optional)

For a deeper assessment, the analyzer can use an LLM to evaluate properties that cannot be checked statically: whether the documented error handling is actually helpful, whether the described failure modes are comprehensive, and whether the scope of the tool is well defined. This tier is optional primarily because it requires both an API key and network access.

The key design principle is progressive capability: the tool is useful with zero configuration and becomes more powerful as you add integrations.
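The three tiers suggest a common provider interface with one implementation that is always on. The sketch below shows that shape under assumed names (`ReadinessProvider`, `HeuristicProvider`, `PolicyProvider`, `run_analysis`); it is not the scanner's actual class layout, and the checks inside each provider are placeholders.

```python
from typing import Callable, Iterable, Protocol

class ReadinessProvider(Protocol):
    """Anything that turns a tool definition into a list of finding messages."""
    def analyze(self, tool: dict) -> list[str]: ...

class HeuristicProvider:
    """Tier 1 stand-in: pure string inspection, no network or configuration."""
    def analyze(self, tool: dict) -> list[str]:
        description = tool.get("description", "").lower()
        return [] if "timeout" in description else ["no timeout mentioned"]

class PolicyProvider:
    """Tier 2 stand-in: delegates to a caller-supplied policy function."""
    def __init__(self, policy: Callable[[dict], list[str]]) -> None:
        self.policy = policy

    def analyze(self, tool: dict) -> list[str]:
        return self.policy(tool)

def run_analysis(tool: dict, optional: Iterable[ReadinessProvider] = ()) -> list[str]:
    """Always run the heuristic tier, then any optional providers that were configured."""
    findings: list[str] = []
    for provider in [HeuristicProvider(), *optional]:
        findings.extend(provider.analyze(tool))
    return findings
```

With no optional providers the function still returns heuristic findings, which is the zero-configuration behavior the design aims for; adding a provider only extends the result.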

Integrating With Existing Security Scanning

The Readiness Analyzer complements the existing MCP Scanner engines rather than replacing them. A typical scan now looks like:

    Shell

    mcp-scanner --analyzers yara,readiness --server-url http://localhost:8000/mcp

The output includes both security findings and readiness findings:

    Shell

    === MCP Scanner Detailed Results ===

    Tool: execute_query
    Status: completed
    Safe: No
    Findings:
      • [HIGH] HEUR-001: Tool 'execute_query' does not specify a timeout.
        Category: MISSING_TIMEOUT_GUARD
        Readiness Score: 55
        Production Ready: No

      • [MEDIUM] HEUR-003: Tool 'execute_query' does not specify a retry limit.
        Category: UNSAFE_RETRY_LOOP

      • [MEDIUM] HEUR-006: Tool 'execute_query' does not define an error response schema.
        Category: MISSING_ERROR_SCHEMA

    Tool: get_user
    Status: completed
    Safe: Yes
    Findings:
      • [INFO] HEUR-012: Tool 'get_user' input schema lacks validation hints.
        Category: NON_DETERMINISTIC_RESPONSE
        Readiness Score: 92
        Production Ready: Yes

This gives teams a complete picture: is this tool safe (security) and ready (operations)?

Lessons from the Contribution Process

Building the analyzer was one challenge. Getting it accepted into an open-source project with several maintainers, continuous integration (CI) checks, and code scanning was another.

A few things I learned that might help others contribute to security tooling projects:

Complement, don't compete. The MCP Scanner already had three powerful security analysis engines. A proposal for "the best security scanner" would likely have been met with skepticism by the maintainers. Instead, I identified a gap the existing engines did not address: operational readiness. The contribution expanded the project's value proposition rather than questioning its existing architecture.

Start with zero dependencies. The heuristic engine requires no API keys, external services, or optional packages. This made integration dramatically simpler and reduced the review surface. The OPA and LLM tiers came as optional extensions, not requirements.

Bring data, not opinions. When the maintainers asked for evidence that the rules worked, I provided an analysis of false positives and true positives across numerous test cases. When a reviewer ran the analyzer against a corpus of 2,300+ skills and found that some rules were too noisy, the response was to adjust thresholds based on empirical data, not to argue about them in theory.

What's Next

The 20 heuristic rules are a starting point. As MCP adoption matures and more tools move into production, the readiness taxonomy will need to grow. Areas that I'm actively researching:

Multi-tool interaction patterns. Individual tool readiness is necessary but not sufficient. When an agent chains three separate tools to perform a task (query a database, transform the results, write to an API), failures compound: the chain succeeds only if every step does, and each added tool multiplies the ways the workflow can fail. Analyzing these interactions requires a graph-based view that none of today's scanners provide.
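A quick calculation puts a number on how chaining tools compounds risk. The sketch assumes each step fails independently, which real systems often violate, and the 99% per-step figure is made up for the example.

```python
import math

def chain_success_probability(step_success: list[float]) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_success)

# Three tools that each succeed 99% of the time:
p = chain_success_probability([0.99, 0.99, 0.99])
print(f"{p:.4f}")  # prints 0.9703
```

Three individually reliable tools already lose about three calls in a hundred as a chain, and the same arithmetic over longer chains is what a graph-based analysis would need to capture.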

Runtime behavioral validation. Static analysis finds configuration gaps, but it cannot catch a tool that returns valid-looking data during testing and degrades quietly under load. Connecting readiness scanning to runtime telemetry, for example through OpenTelemetry traces of actual tool invocations, would create a feedback loop in which production behavior informs readiness scores.

Organizational policy integration. Every organization has different operational standards. The timeout requirements for a financial company differ from those for a media company. Deeper OPA integration and a library of policy templates would allow teams to capture their standards as reusable, shareable rule packs.

Where to Find the Rules

The Readiness Analyzer is available now as part of Cisco's open-source MCP Scanner:

    Shell

    pip install cisco-ai-mcp-scanner
    mcp-scanner --analyzers readiness --server-url http://localhost:8000/mcp

Repository: github.com/cisco-ai-defense/mcp-scanner

The tool scans MCP servers for both security threats and production readiness issues. It works as a CLI, a REST API, or an integrated component in CI/CD pipelines. No API keys are required for the readiness analyzer; it runs purely on static analysis.

If you are deploying MCP servers into production, scan them not just for security but also for readiness.


Opinions expressed by DZone contributors are their own.
