
Runbook: MCP Service

| Version | Date       | Author           | Change Description |
| ------- | ---------- | ---------------- | ------------------ |
| 1.0     | 2025-09-17 | Gemini Assistant | Initial Draft      |

1. Service Overview

  • Purpose: The Model Context Protocol (MCP) Service acts as a secure adapter, exposing a standard set of internal tools (e.g., queryUsers) that the ai_service can consume via the MCP standard.
  • Criticality: High (Tier 2). If this service is down, the AI assistant cannot perform any actions.
  • Owners: Core Infrastructure Team (#infra-oncall on Slack).
  • Key Dependencies:
    • Internal: ai_service (the only client), auth_service, and other internal APIs that the tools connect to.
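
The adapter pattern above can be sketched as a registry that maps MCP tool names to handlers and wraps each call in a success/failure envelope. This is a minimal illustration, not the service's actual code; the handler body and envelope fields are assumptions.

```python
def query_users(args):
    # Hypothetical handler body: the real tool would call a downstream
    # internal API (e.g. auth_service) and return its response.
    return {"users": [], "filter": args.get("filter")}

# Registry mapping MCP tool names to their handlers.
TOOLS = {"queryUsers": query_users}

def invoke_tool(name, args):
    """Dispatch a tool call and wrap the result in an outcome envelope."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"outcome": "failure", "error": f"unknown tool: {name}"}
    try:
        return {"outcome": "success", "result": handler(args)}
    except Exception as exc:  # downstream errors surface as tool failures
        return {"outcome": "failure", "error": str(exc)}
```

The `outcome` field in this sketch is what feeds the success/failure partition of `mcp_tool_invocations_total` below.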

2. On-Call & Escalation

  • Primary On-Call: PagerDuty schedule "Core Infrastructure On-Call".
  • Escalation: Escalate to the "Core Infrastructure Lead".

3. Monitoring and Alerting

  • Primary Dashboard: [Link to Grafana Dashboard for MCP Service]
  • Log Location: service:"mcp-service"
  • Key Metrics:
    • mcp_tool_invocations_total: Total number of tool calls, partitioned by tool name (queryUsers, etc.) and outcome (success/failure).
    • mcp_tool_latency_seconds_p95: 95th-percentile latency per tool. Sustained high latency usually points to a slow downstream internal API.
    • internal_api_errors_total: Number of errors received when a tool calls a downstream service (e.g., auth_service).
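
To make the alerting logic concrete, here is a sketch of how a per-tool failure rate can be derived from `mcp_tool_invocations_total` samples. The sample format and the 5% threshold are assumptions; in practice this is a PromQL expression evaluated by the Grafana/Prometheus backend.

```python
def failure_rates(samples):
    """samples: list of (tool_name, outcome, count) tuples.

    Returns {tool_name: failure_fraction}.
    """
    totals, failures = {}, {}
    for tool, outcome, count in samples:
        totals[tool] = totals.get(tool, 0) + count
        if outcome == "failure":
            failures[tool] = failures.get(tool, 0) + count
    return {t: failures.get(t, 0) / totals[t] for t in totals}

def failing_tools(samples, threshold=0.05):
    """Tools whose failure rate exceeds the (assumed) alert threshold."""
    return sorted(t for t, r in failure_rates(samples).items() if r > threshold)
```

For example, a tool with 30 failures out of 100 invocations has a 0.30 failure rate and would trip a 5% threshold, while one with 5 out of 100 would not.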

4. Common Alerts and Troubleshooting Steps

Alert: HighToolExecutionFailureRate (P2)

  • Meaning: A specific tool (e.g., manageUserGroup) is failing frequently.
  • Diagnosis:
    1. Check the Grafana dashboard to identify which tool is failing.
    2. Filter the logs for that tool: service:"mcp-service" AND tool_name:"manageUserGroup" AND outcome:"failure".
    3. The log should contain the error returned from the downstream internal API (e.g., a 404 from auth_service because the user was not found).
  • Resolution:
    • The problem is almost always with the downstream service the tool is calling. Use the error message to identify the correct downstream runbook (e.g., runbook-auth-service.md) and begin troubleshooting there.
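
Diagnosis step 3 amounts to grouping the filtered failure logs by downstream service and status code, so the dominant pair tells you which runbook to open. A minimal sketch of that aggregation, assuming log fields named tool_name, outcome, downstream, and status (the real field names may differ):

```python
from collections import Counter

def top_downstream_errors(log_entries, tool):
    """Count (downstream service, status code) pairs among one tool's failures,
    most frequent first."""
    counts = Counter(
        (e["downstream"], e["status"])
        for e in log_entries
        if e["tool_name"] == tool and e["outcome"] == "failure"
    )
    return counts.most_common()
```

If the top entry is, say, ("auth_service", 404), that points you at runbook-auth-service.md rather than at the MCP service itself.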

Alert: MCPServiceDown (P1)

  • Meaning: The service is not responding to health checks.
  • Diagnosis: Confirm the outage from the client side: the ai_service logs will show repeated connection errors to mcp-service.
  • Resolution:
    1. Check the deployment status: kubectl get pods -l app=mcp-service -n production.
    2. If pods are crashing (e.g., CrashLoopBackOff), inspect the pod logs for panics with kubectl logs. A crash loop that started after a recent deployment likely requires rolling that deployment back.
    3. If no pods are running, check the deployment configuration.
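
The branching in steps 1-3 can be summarized as a small triage function over simplified pod records (here `state` stands in for the pod phase or container waiting reason reported by kubectl, and the restart threshold and advice strings are assumptions):

```python
def triage(pods):
    """pods: list of {"state": str, "restarts": int} records,
    one per pod returned by the kubectl query in step 1."""
    if not pods:
        # Step 3: nothing scheduled at all.
        return "no pods running: check the deployment configuration"
    if any(p["state"] == "CrashLoopBackOff" or p["restarts"] > 3 for p in pods):
        # Step 2: crash-looping pods.
        return "pods crashing: check logs for panics and consider rolling back the last deployment"
    # Pods exist and look healthy; the problem is likely elsewhere.
    return "pods up: investigate the health-check path and networking to mcp-service"
```

This is only a decision-aid sketch; the authoritative signal is always the live kubectl output and pod logs.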