
Runbook: MCP Service

| Version | Date       | Author           | Change Description |
| ------- | ---------- | ---------------- | ------------------ |
| 1.0     | 2025-09-17 | Gemini Assistant | Initial Draft      |

1. Service Overview

  • Purpose: The Model Context Protocol (MCP) Service acts as a secure adapter, exposing a standard set of internal tools (e.g., queryUsers) that the ai_service can consume via the MCP standard.
  • Criticality: High (Tier 2). If this service is down, the AI assistant cannot perform any actions.
  • Owners: Core Infrastructure Team (#infra-oncall on Slack).
  • Key Dependencies:
    • Internal: ai_service (the only client), auth_service, and other internal APIs that the tools connect to.
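
The adapter pattern above can be sketched as a registry that maps MCP tool names to handlers and wraps each call in a success/failure envelope. This is a minimal illustration, not the service's actual code; the handler body and envelope fields are assumptions.

```python
def query_users(args):
    # Hypothetical handler body: the real tool would call a downstream
    # internal API (e.g. auth_service) and return its response.
    return {"users": [], "filter": args.get("filter")}

# Registry mapping MCP tool names to their handlers.
TOOLS = {"queryUsers": query_users}

def invoke_tool(name, args):
    """Dispatch a tool call and wrap the result in an outcome envelope."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"outcome": "failure", "error": f"unknown tool: {name}"}
    try:
        return {"outcome": "success", "result": handler(args)}
    except Exception as exc:  # downstream errors surface as tool failures
        return {"outcome": "failure", "error": str(exc)}
```

The `outcome` field in this sketch is what feeds the success/failure partition of `mcp_tool_invocations_total` below.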

2. On-Call & Escalation

  • Primary On-Call: PagerDuty schedule "Core Infrastructure On-Call".
  • Escalation: Escalate to the "Core Infrastructure Lead".

3. Monitoring and Alerting

  • Primary Dashboard: [Link to Grafana Dashboard for MCP Service]
  • Log Location: service:"mcp-service"
  • Key Metrics:
    • mcp_tool_invocations_total: Total number of tool calls, partitioned by tool name (queryUsers, etc.) and outcome (success/failure).
    • mcp_tool_latency_seconds_p95: 95th-percentile latency per tool. Sustained high latency usually points to a slow downstream internal API.
    • internal_api_errors_total: Number of errors received when a tool calls a downstream service (e.g., auth_service).
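
To make the alerting logic concrete, here is a sketch of how a per-tool failure rate can be derived from `mcp_tool_invocations_total` samples. The sample format and the 5% threshold are assumptions; in practice this is a PromQL expression evaluated by the Grafana/Prometheus backend.

```python
def failure_rates(samples):
    """samples: list of (tool_name, outcome, count) tuples.

    Returns {tool_name: failure_fraction}.
    """
    totals, failures = {}, {}
    for tool, outcome, count in samples:
        totals[tool] = totals.get(tool, 0) + count
        if outcome == "failure":
            failures[tool] = failures.get(tool, 0) + count
    return {t: failures.get(t, 0) / totals[t] for t in totals}

def failing_tools(samples, threshold=0.05):
    """Tools whose failure rate exceeds the (assumed) alert threshold."""
    return sorted(t for t, r in failure_rates(samples).items() if r > threshold)
```

For example, a tool with 30 failures out of 100 invocations has a 0.30 failure rate and would trip a 5% threshold, while one with 5 out of 100 would not.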

4. Common Alerts and Troubleshooting Steps

Alert: HighToolExecutionFailureRate (P2)

  • Meaning: A specific tool (e.g., manageUserGroup) is failing frequently.
  • Diagnosis:
    1. Check the Grafana dashboard to identify which tool is failing.
    2. Filter the logs for that tool: service:"mcp-service" AND tool_name:"manageUserGroup" AND outcome:"failure".
    3. The log should contain the error returned from the downstream internal API (e.g., a 404 from auth_service because the user was not found).
  • Resolution:
    • The problem is almost always with the downstream service the tool is calling. Use the error message to identify the correct downstream runbook (e.g., runbook-auth-service.md) and begin troubleshooting there.
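
Diagnosis step 3 amounts to grouping the filtered failure logs by downstream service and status code, so the dominant pair tells you which runbook to open. A minimal sketch of that aggregation, assuming log fields named tool_name, outcome, downstream, and status (the real field names may differ):

```python
from collections import Counter

def top_downstream_errors(log_entries, tool):
    """Count (downstream service, status code) pairs among one tool's failures,
    most frequent first."""
    counts = Counter(
        (e["downstream"], e["status"])
        for e in log_entries
        if e["tool_name"] == tool and e["outcome"] == "failure"
    )
    return counts.most_common()
```

If the top entry is, say, ("auth_service", 404), that points you at runbook-auth-service.md rather than at the MCP service itself.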

Alert: MCPServiceDown (P1)

  • Meaning: The service is not responding to health checks.
  • Diagnosis: Confirm the outage from the client side: the ai_service logs will show repeated connection errors to mcp-service.
  • Resolution:
    1. Check the deployment status: kubectl get pods -l app=mcp-service -n production.
    2. If pods are crashing (e.g., CrashLoopBackOff), inspect the pod logs for panics with kubectl logs. A crash loop that started after a recent deployment likely requires rolling that deployment back.
    3. If no pods are running, check the deployment configuration.
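
The branching in steps 1-3 can be summarized as a small triage function over simplified pod records (here `state` stands in for the pod phase or container waiting reason reported by kubectl, and the restart threshold and advice strings are assumptions):

```python
def triage(pods):
    """pods: list of {"state": str, "restarts": int} records,
    one per pod returned by the kubectl query in step 1."""
    if not pods:
        # Step 3: nothing scheduled at all.
        return "no pods running: check the deployment configuration"
    if any(p["state"] == "CrashLoopBackOff" or p["restarts"] > 3 for p in pods):
        # Step 2: crash-looping pods.
        return "pods crashing: check logs for panics and consider rolling back the last deployment"
    # Pods exist and look healthy; the problem is likely elsewhere.
    return "pods up: investigate the health-check path and networking to mcp-service"
```

This is only a decision-aid sketch; the authoritative signal is always the live kubectl output and pod logs.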