Zscaler MCP on AWS: Review of AgentCore Deployment, Critical Bugs, and Security Fixes

Hey guys! Let's dive deep into the deployment of Zscaler's MCP (Model Context Protocol) AgentCore on AWS, focusing on the nitty-gritty details, the problems we ran into, and most importantly, the solutions. We'll be looking at the official Zscaler MCP AgentCore Docker image and exploring some critical issues that need addressing. Get ready for a deep dive into the code, security vulnerabilities, and potential fixes to get the most out of your Zscaler MCP setup on AWS.

Unveiling the Issues: Critical Bugs in Zscaler MCP AgentCore

First off, we've spotted some critical bugs within the official Zscaler MCP AgentCore Docker image (specifically zscaler/zscaler-mcp-server:0.4.0-bedrock). These aren't just minor annoyances; these are fundamental flaws that prevent the server from working as it should, especially when dealing with standard MCP clients. On top of that, it doesn't quite align with the security best practices that AWS recommends. Let's break down the major problems and what we can do about them. For reference, the image we're talking about is: 709825985650.dkr.ecr.us-east-1.amazonaws.com/zscaler/zscaler-mcp-server:0.4.0-bedrock.

The tools/list Bug: A Deep Dive

The handle_tools_list() function is where the trouble begins, guys. We found a few critical bugs here that completely mess up the MCP protocol, making it impossible for standard MCP clients to discover the available tools. Let's see what's going wrong.

The Buggy Implementation

Here's a simplified look at the buggy code:

async def handle_tools_list() -> Dict[str, Any]:
    tools = mcp_server.server.list_tools()  # ❌ Missing await
    
    return {
        "status": "success",
        "tool": "tools/list",
        "result": [json.dumps(tools, indent=2)]  # ❌ Double serialization
    }

The Problems, Guys!

  1. Missing await keyword: The async call isn't awaited, so list_tools() hands back a coroutine object instead of the actual tool list. Yikes!
  2. Double JSON serialization: The tools get serialized into a JSON string, and then that string is wrapped in an array. Why, though?
  3. Incorrect response format: The function returns {"status": "success", "result": [...]} instead of the MCP-compliant {"tools": [...]}. This is a big no-no.
  4. Object serialization failure: It tries to serialize Python Tool objects without converting them into dictionaries. This is a recipe for disaster.

What the Output Looks Like (Broken)

Here's what you get, which is not what we want:

{
  "status": "success",
  "tool": "tools/list",
  "result": [
    "[{\"name\": \"zpa_list_app_segments\", ...}]"  // ❌ String, not object
  ]
}

What the Output Should Look Like (MCP Protocol)

This is what we're aiming for. It's clean and follows the MCP spec:

{
  "tools": [
    {
      "name": "zpa_list_app_segments",
      "description": "List all application segments in ZPA",
      "inputSchema": {
        "type": "object",
        "properties": {...}
      }
    }
  ]
}

The Impact

  • ❌ Breaks all standard MCP clients (Claude Desktop, QuickSuite, you name it).
  • ❌ Violates the MCP protocol specification. Come on, guys!
  • ❌ Tools are undiscoverable and unusable.
  • ⚠️ Might work with Genesis (which wraps everything), masking the bug. Sneaky.

Proposed Fix

Here's the suggested fix:

async def handle_tools_list() -> Dict[str, Any]:
    # Get the list of tools from the MCP server
    tools = await mcp_server.server.list_tools()  # ✅ Added await
    
    # Convert Tool objects to dictionaries for JSON serialization
    tools_list = []
    for tool in tools:
        tool_dict = {
            "name": tool.name,
            "description": tool.description,
        }
        # MCP spec uses inputSchema (camelCase)
        if hasattr(tool, 'inputSchema'):
            tool_dict["inputSchema"] = tool.inputSchema
        tools_list.append(tool_dict)
    
    # Return MCP protocol format: {"tools": [...]}
    return {"tools": tools_list}  # ✅ Correct format

The Fix in a Nutshell

Essentially, we add the await keyword, correctly format the response, and make sure that the Tool objects are converted to dictionaries. This means our standard MCP clients can discover and use the tools. It’s all about getting the format right, so the clients can do their job.
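
To sanity-check the fix outside a full MCP client, a quick asyncio call will confirm the response shape. This is a minimal sketch only; it assumes the patched handle_tools_list() and its mcp_server dependency are importable from the wrapper module:

import asyncio

async def check_tools_list():
    # Call the patched handler and verify the MCP-compliant shape
    result = await handle_tools_list()
    assert "tools" in result, "Response must use the {'tools': [...]} format"
    assert all(isinstance(t, dict) for t in result["tools"]), "Each tool must be a plain dict"
    print(f"Discovered {len(result['tools'])} tools")

asyncio.run(check_tools_list())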

Security Alert: AWS Secrets Manager Support Needed

This is crucial, guys. The current setup requires you to pass Zscaler API credentials as plain-text environment variables, which is a big no-no when it comes to AWS security best practices. We need to fix this ASAP.

The Problem: Plain-Text Credentials

Here's what the current implementation looks like:

# Credentials must be passed as plain-text environment variables
ENV ZSCALER_CLIENT_ID=iq7u4xxxxxk6
ENV ZSCALER_CLIENT_SECRET=supersecretvalue123  # ❌ Plain text!
ENV ZSCALER_CUSTOMER_ID=2xxxxxxxxxxxx8

The Risks, Explained

| Risk | Impact |
| --- | --- |
| ECS Task Definition Exposure | Anyone with ecs:DescribeTaskDefinition can read secrets |
| CloudFormation Exposure | Secrets visible in stack parameters and outputs |
| Container Inspection | docker inspect reveals all environment variables |
| No Encryption at Rest | Credentials stored in plain text in AWS APIs |
| No Audit Trail | No CloudTrail logs for credential access |
| No Rotation Support | Requires redeployment to update credentials |
| Compliance Failures | Fails SOC 2, PCI-DSS, HIPAA, and ISO 27001 audits |

An Example of the Exposure

Anyone with ECS read permissions can easily extract your secrets:

# Anyone with ECS read permissions can extract secrets
aws ecs describe-task-definition --task-definition zscaler-mcp

# Output exposes credentials in plain text:
{
  "environment": [
    {"name": "ZSCALER_CLIENT_SECRET", "value": "supersecretvalue123"}
  ]
}

The Solution: AWS Secrets Manager Integration

Here’s how we should do it:

import json
import logging
import os

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

# Fetch credentials from Secrets Manager if configured
secret_arn = os.environ.get('ZSCALER_SECRET_ARN')
if secret_arn:
    try:
        # The region is the fourth field of the secret ARN
        region = secret_arn.split(':')[3]
        client = boto3.client('secretsmanager', region_name=region)
        response = client.get_secret_value(SecretId=secret_arn)
        secret = json.loads(response['SecretString'])
        
        # Set all secret keys as environment variables
        for key, value in secret.items():
            os.environ[key] = str(value)
        
        logger.info("Loaded credentials from Secrets Manager")
    except ClientError as e:
        logger.error(f"Failed to fetch credentials: {e}")
        raise
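
For the lookup above to work at runtime, the ECS task role (or the AgentCore execution role) also needs permission to read the secret. Here's a minimal sketch of that policy, with a placeholder account ID and secret name; if the secret is encrypted with a customer-managed KMS key, kms:Decrypt on that key is needed as well:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:zscaler-mcp-credentials-*"
    }
  ]
}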

The Benefits

  • ✅ Credentials are encrypted at rest with AWS KMS.
  • ✅ IAM-based access control.
  • ✅ CloudTrail audit logging.
  • ✅ Automatic rotation support.
  • ✅ Compliance with SOC 2, PCI-DSS, HIPAA.
  • ✅ Zero plain-text credential exposure.
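
End to end, storing the credentials and wiring only the ARN into the deployment could look like this (a sketch; the secret name, region, and account ID are illustrative, and the credential values are the same placeholders used above):

# Create the secret holding the Zscaler API credentials as a JSON blob
aws secretsmanager create-secret \
  --name zscaler-mcp-credentials \
  --secret-string '{"ZSCALER_CLIENT_ID":"iq7u4xxxxxk6","ZSCALER_CLIENT_SECRET":"supersecretvalue123","ZSCALER_CUSTOMER_ID":"2xxxxxxxxxxxx8"}'

# The container only ever sees the ARN; no plain-text secrets in the task definition
ZSCALER_SECRET_ARN=arn:aws:secretsmanager:us-east-1:123456789012:secret:zscaler-mcp-credentials-AbCdEf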

Missing Features: Protocol Negotiation and Client Support

Let’s move on to some other missing pieces that are preventing Zscaler MCP from working smoothly. These include things like not handling the MCP initialize and ping methods, and a lack of support for standard MCP clients.

MCP Protocol Negotiation

The Missing Implementation

We need to handle the initialize and ping methods. Here's what's missing:

# No handling for these required MCP methods:
# - initialize (protocol version negotiation)
# - ping (health check)

The Impact

  • ❌ Cannot negotiate protocol versions with clients.
  • ❌ No support for MCP 2024-11-05 or 2025-03-26 protocols.
  • ❌ Breaks the handshake with standard MCP clients.
  • ❌ No health check mechanism.

The Solution

if method == "ping":
    logger.info("Handling MCP ping request")
    result = {}  # MCP spec: ping returns empty object
    
elif method == "initialize":
    logger.info("Handling MCP initialize request")
    # Support both 2024-11-05 and 2025-03-26 protocol versions
    client_protocol = payload.get("params", {}).get("protocolVersion", "2024-11-05")
    logger.info(f"Client requested protocol version: {client_protocol}")
    result = {
        "protocolVersion": client_protocol,  # Echo back client's version
        "capabilities": {"tools": {}},
        "serverInfo": {"name": "zscaler-mcp", "version": "1.0.0"}
    }
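
With those handlers in place, the handshake with a standard client looks roughly like this. The request follows the MCP specification; the response shows what the code above produces once it is wrapped in JSON-RPC by the content-negotiation logic described in the next section:

// Client request (JSON-RPC 2.0, per the MCP spec)
{"jsonrpc": "2.0", "id": 1, "method": "initialize",
 "params": {"protocolVersion": "2024-11-05", "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1.0"}}}

// Expected server response
{"jsonrpc": "2.0", "id": 1,
 "result": {"protocolVersion": "2024-11-05", "capabilities": {"tools": {}},
            "serverInfo": {"name": "zscaler-mcp", "version": "1.0.0"}}}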

Standard MCP Client Support

The Limitation

The image currently only supports the AWS Genesis NDJSON format, not standard MCP clients like Claude Desktop or QuickSuite, which use JSON-RPC or SSE (Server-Sent Events).

# Only returns Genesis NDJSON format
return StreamingResponse(
    generate_streaming_response(response_data, session_id),
    media_type="application/x-ndjson",  # Genesis only
)

The Impact

  • ❌ Cannot be used with Claude Desktop.
  • ❌ Cannot be used with QuickSuite.
  • ❌ Cannot be used with standard MCP testing tools.
  • ❌ Limited to AWS Genesis runtime only.

The Solution

Add content negotiation based on request format:

# Check if this is a standard MCP client or Genesis
is_jsonrpc = payload.get("jsonrpc") == "2.0"
accept_header = request.headers.get("accept", "")
prefers_sse = "text/event-stream" in accept_header

if is_jsonrpc:
    # Standard JSON-RPC response for MCP clients
    response_content = {
        "jsonrpc": "2.0",
        "id": payload.get("id"),
        "result": result
    }
    
    if prefers_sse:
        # SSE format for streaming clients
        async def sse_generator():
            yield f"data: {json.dumps(response_content)}\n\n"
        
        return StreamingResponse(
            sse_generator(),
            media_type="text/event-stream",
        )
    else:
        # Standard JSON response
        return JSONResponse(content=response_content)
else:
    # Genesis streaming NDJSON response
    return StreamingResponse(
        generate_streaming_response(response_data, session_id),
        media_type="application/x-ndjson",
    )
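
Once content negotiation is in place, a plain curl call is enough to exercise both paths. This is a sketch only; the port mapping and the /invocations path are assumptions and should be adjusted to whatever route the image actually exposes:

# Standard JSON-RPC request (returns a single JSON response)
curl -s http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}'

# Same request, but asking for SSE streaming
curl -s http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}'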

Service Filtering: Keeping Things Lean

The current setup loads all Zscaler services (ZPA, ZIA, ZDX, ZCC, ZIdentity) without the ability to filter. This often leads to exceeding MCP client tool limits.

The Problem with Too Many Tools

The Zscaler MCP server exposes a ton of tools:

  • ZPA: ~30 tools
  • ZIA: ~40 tools
  • ZDX: ~15 tools
  • ZCC: ~10 tools
  • ZIdentity: ~10 tools

Many MCP clients have hard limits on the number of tools they can handle:

| MCP Client | Tool Limit | Result with All Services |
| --- | --- | --- |
| Claude Desktop | ~50 tools | ❌ Fails to load or truncates |
| Some Genesis Agents | ~100 tools | ⚠️ Performance degradation |
| QuickSuite | ~200 tools | ✅ Works but slow |
| Custom Clients | Varies | ❌ May fail silently |

The Real-World Impact

When testing with Claude Desktop:

# Without filtering (100+ tools)
❌ Error: "Too many tools provided. Maximum 50 tools supported."

# With filtering to only ZPA (30 tools)
✅ Success: All tools loaded and functional

The Impacts of Loading Everything

  • 🚫 Client Compatibility: Exceeds tool limits in Claude Desktop and other clients.
  • 💰 Higher AWS costs: Bedrock charges per tool invocation.
  • ⏱️ Slower startup: Initializes all services even if they are unused.
  • 🔧 No flexibility: Cannot disable unused services.
  • 📊 Harder debugging: More tools to troubleshoot.
  • ⚡ Performance degradation: Large tool lists slow down the client UX.

The Solution

# Read ZSCALER_MCP_SERVICES environment variable to filter services
services_env = os.environ.get('ZSCALER_MCP_SERVICES', '')

if services_env:
    enabled_services = set(s.strip() for s in services_env.split(',') if s.strip())
    logger.info(f"Filtering to services: {enabled_services}")
    mcp_server = ZscalerMCPServer(enabled_services=enabled_services)
else:
    logger.info("Loading all services")
    mcp_server = ZscalerMCPServer()

How to Use It

# Only enable ZPA and ZIA
ZSCALER_MCP_SERVICES="zpa,zia"
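
In an ECS task definition (or the AgentCore runtime's environment configuration), that is just a regular environment entry. A sketch of the relevant fragment, combined with the Secrets Manager ARN from earlier:

{
  "environment": [
    {"name": "ZSCALER_MCP_SERVICES", "value": "zpa,zia"},
    {"name": "ZSCALER_SECRET_ARN", "value": "arn:aws:secretsmanager:us-east-1:123456789012:secret:zscaler-mcp-credentials-AbCdEf"}
  ]
}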

Logging Configuration

Right now, the official image uses fixed INFO level logging, which isn't very helpful for detailed debugging or production. Let's fix this.

The Problem: Fixed Logging Level

logging.basicConfig(
    level=logging.INFO,  # ❌ Fixed, cannot change
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

The Impact

  • πŸ› Harder debugging: Can't enable DEBUG logs.
  • πŸ“Š No traffic inspection: Can't log HTTP headers/bodies.
  • πŸ” Limited troubleshooting: Missing critical diagnostic information.

The Solution

# Configure logging with environment variable
log_level = os.environ.get('LOG_LEVEL', 'INFO').upper()
logging.basicConfig(
    level=getattr(logging, log_level, logging.INFO),
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger.info(f"Logging level set to: {log_level}")

# Optional HTTP traffic logging middleware
@app.middleware("http")
async def log_request_response(request: Request, call_next):
    if os.environ.get('LOG_HEADERS', 'false').lower() == 'true':
        logger.info(f"Request: {request.method} {request.url.path}")
        logger.info(f"Headers: {dict(request.headers)}")
    
    response = await call_next(request)
    return response

Usage

# Enable debug logging
LOG_LEVEL=DEBUG

# Enable HTTP traffic logging
LOG_HEADERS=true
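
Once configurable logging lands, local testing with the image is straightforward (a sketch; the port mapping is illustrative):

docker run --rm -p 8080:8080 \
  -e LOG_LEVEL=DEBUG \
  -e LOG_HEADERS=true \
  -e ZSCALER_MCP_SERVICES="zpa,zia" \
  709825985650.dkr.ecr.us-east-1.amazonaws.com/zscaler/zscaler-mcp-server:0.4.0-bedrock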

Summary of Issues and Recommended Actions

Let’s put it all together in a quick summary table:

| Issue | Severity | Impact | Status |
| --- | --- | --- | --- |
| tools/list bug | 🔴 Critical | Breaks MCP clients | Not fixed |
| No Secrets Manager | 🔴 Critical | Security vulnerability | Not implemented |
| No protocol negotiation | 🟡 High | Breaks handshake | Not implemented |
| Genesis-only support | 🟡 High | Limited compatibility | Not implemented |
| No service filtering | 🟡 Medium | Higher costs | Not implemented |
| Fixed logging | 🟡 Medium | Harder debugging | Not implemented |

Recommended Actions

  1. Immediate (Critical):
    • Fix the tools/list async/await and response format bug.
    • Add AWS Secrets Manager support for credential management.
  2. High Priority:
    • Implement MCP initialize and ping methods.
    • Add JSON-RPC and SSE support for standard MCP clients.
  3. Medium Priority:
    • Add service filtering via environment variable.
    • Implement configurable logging levels.

Testing and Contributing

We've validated these issues by:

  1. Extracting the official Docker image filesystem.
  2. Comparing with a working production implementation.
  3. Testing with multiple MCP clients.
  4. Reviewing MCP protocol specification compliance.

Test Environment:

  • Image: zscaler/zscaler-mcp-server:0.4.0-bedrock
  • Platform: linux/arm64
  • Extracted: /tmp/zscaler-official/app/web_server.py
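
For anyone who wants to reproduce the review, the image filesystem can be pulled apart with standard Docker tooling (a sketch; the Marketplace ECR registry requires an authenticated pull from a subscribed AWS account):

# Authenticate against the AWS Marketplace ECR registry and pull the image
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 709825985650.dkr.ecr.us-east-1.amazonaws.com
docker pull 709825985650.dkr.ecr.us-east-1.amazonaws.com/zscaler/zscaler-mcp-server:0.4.0-bedrock

# Export the container filesystem without running it
docker create --name zscaler-official \
  709825985650.dkr.ecr.us-east-1.amazonaws.com/zscaler/zscaler-mcp-server:0.4.0-bedrock
mkdir -p /tmp/zscaler-official
docker export zscaler-official | tar -x -C /tmp/zscaler-official
docker rm zscaler-official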

We’re ready to help. We have working implementations of all these fixes and are happy to contribute them back to the project. Just let us know the best way to do it!

Recommendation: Open-Source the AgentCore Build

Here’s a suggestion that we think would make a big difference in terms of usability, security, and overall adoption.

The Current Situation

The AgentCore/Bedrock-specific build is only available as a pre-built Docker image in AWS Marketplace ECR.

  • Image: 709825985650.dkr.ecr.us-east-1.amazonaws.com/zscaler/zscaler-mcp-server:0.4.0-bedrock
  • Source code: Not available in the public repository
  • Build process: Undocumented

The Inconsistency

It's odd because the rest of the Zscaler MCP project is fully open source.

| Component | Status | Repository |
| --- | --- | --- |
| Core MCP Server | ✅ Open Source | zscaler/zscaler-sdk-python-mcp |
| All Tool Implementations | ✅ Open Source | Public GitHub |
| ZPA Tools | ✅ Open Source | Public GitHub |
| ZIA Tools | ✅ Open Source | Public GitHub |
| ZDX Tools | ✅ Open Source | Public GitHub |
| ZCC Tools | ✅ Open Source | Public GitHub |
| ZIdentity Tools | ✅ Open Source | Public GitHub |
| AgentCore Wrapper | ❌ Closed | Only pre-built image |

The Problem

Why hide only the AgentCore wrapper? It's just an HTTP adapter (~300 lines) that translates Genesis NDJSON to MCP protocol calls. It has no proprietary logic, algorithms, or competitive advantages.

Why This Is Problematic

  1. Security Concerns: Users can’t audit the build process, verify what’s in the container, or validate security practices.
  2. Lack of Transparency: The build process is hidden, and there's no visibility into dependencies or configurations.
  3. Easy to Reverse Engineer: Container images are easy to extract, making the obscurity pointless.
  4. Hinders Adoption: Enterprise customers often need source code review, custom builds, and vulnerability scanning.
  5. Prevents Bug Fixes: Users can't submit fixes or validate proposed solutions.

Recommended Approach

Make the AgentCore/Genesis wrapper code publicly available in the repository.

zscaler-mcp/
├── src/
│   └── zscaler_mcp/
│       ├── server.py          # Core MCP server (already public)
│       ├── tools/             # Tool implementations (already public)
│       └── web_server.py      # Genesis wrapper (currently hidden)
├── docker/
│   ├── Dockerfile             # Build instructions (currently hidden)
│   └── requirements.txt       # Dependencies (currently hidden)
└── docs/
    └── agentcore-deployment.md  # Deployment guide (currently missing)

Benefits of Making It Public

  1. ✅ Increased Trust
  2. ✅ Better Security
  3. ✅ Faster Bug Fixes
  4. ✅ Improved Quality
  5. ✅ Easier Adoption
  6. ✅ Community Growth
  7. ✅ Better Documentation
  8. ✅ Reduced Support Burden

Critical for Enterprise Adoption

Enterprise security requirements often include source code review, custom container builds, vulnerability scanning, and supply chain security.

Real-World Enterprise Blockers

Without source code and a Dockerfile, enterprises can't build from source, scan dependencies, generate SBOMs, or apply internal security policies.

Enterprise Approval Process

The current closed-source approach blocks enterprises from adopting the solution, regardless of its technical merit.

Conclusion

We strongly urge Zscaler to make the AgentCore build publicly available. The current model creates unnecessary friction, reduces trust, and hinders adoption. Making the code public would align with industry best practices and accelerate adoption of the Zscaler MCP server in AWS environments.

Precedent

Most successful MCP server implementations are fully open source, including Anthropic's, AWS's, and the community's.
