Runbook: AI Service
Version : 1.0
Date : 2025-09-17
Author : Gemini Assistant
Change Description : Initial Draft
1. Service Overview
Purpose : The AI Service is the core orchestration component for the AI-Enhanced Workflow. It uses LangGraph to interpret user intent, manage conversational state, and execute tasks by calling tools on the mcp_service.
Criticality : High (Tier 2). An outage makes the AI chat functionality unusable but does not affect the core application's manual functionality.
Owners : AI/ML Team (#ai-oncall on Slack).
Key Dependencies :
Internal : api_gateway (for receiving requests), mcp_service (critical, for tool execution).
Infrastructure : Redis (critical, for session state), MongoDB (critical, for chat history).
2. On-Call & Escalation
Primary On-Call : PagerDuty schedule "AI Service On-Call".
Escalation : Escalate to the "AI/ML Lead" PagerDuty schedule.
3. Monitoring and Alerting
Primary Dashboard : [Link to Grafana Dashboard for AI Service]
Log Location : service:"ai-service"
Key Metrics :
langgraph_errors_total: A spike indicates issues with the orchestration logic (e.g., invalid state transitions, tool calling failures).
llm_query_latency_seconds_p95: High latency here points to issues with the external LLM provider.
mcp_call_errors_total: Count of failed calls to the mcp_service. Helps distinguish mcp_service faults from faults inside the AI Service itself.
redis_connection_errors_total: Any increase (a rate greater than zero) is a critical alert.
mongodb_write_errors_total: Any increase (a rate greater than zero) is a critical alert.
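The check on the two connection and write error metrics can be sketched in Python as follows. This is an illustration only, assuming the `_total` metrics are cumulative, Prometheus-style counters sampled at two points in time; the helper names `counter_delta` and `critical_counters` are hypothetical and not part of the service.

```python
from typing import Dict, List

# Counters that the Key Metrics list treats as critical on any error.
CRITICAL_COUNTERS = {
    "redis_connection_errors_total",
    "mongodb_write_errors_total",
}


def counter_delta(previous: float, current: float) -> float:
    """Increase of a cumulative counter between two samples.

    A drop in value is treated as a counter reset (for example after a
    process restart), so the current value is the increase since the reset.
    """
    return current if current < previous else current - previous


def critical_counters(previous: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Names of critical counters that increased between the two snapshots."""
    flagged = []
    for name in sorted(CRITICAL_COUNTERS):
        delta = counter_delta(previous.get(name, 0.0), current.get(name, 0.0))
        if delta > 0:
            flagged.append(name)
    return flagged


# Made-up sample values for illustration.
before = {"redis_connection_errors_total": 3, "mongodb_write_errors_total": 0}
after = {"redis_connection_errors_total": 5, "mongodb_write_errors_total": 0}
print(critical_counters(before, after))  # ['redis_connection_errors_total']
```

Comparing two samples rather than the raw value avoids re-flagging an old error on a counter that never returns to zero.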
4. Common Alerts and Troubleshooting Steps
Alert: HighLangGraphErrorRate (P2)
Meaning : The agent orchestration is failing frequently.
Diagnosis :
Check logs for "error":"LangGraph execution failed". The log entry should contain the specific node and state that caused the failure.
Look for errors related to unexpected LLM output or failed tool validation.
Resolution :
This is likely a code bug in the graph definition. Roll back the last deployment and create a high-priority ticket with the diagnostic information.
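To collect the diagnostic information for the ticket, the failures can be grouped by graph node. The sketch below assumes the logs are exported as JSON lines and that the failing node is stored under a field named `node`; that field name, the `failures_by_node` helper, and the node names in the sample are assumptions for illustration, not confirmed details of the service's log schema.

```python
import json
from collections import Counter
from typing import Iterable

FAILURE_MESSAGE = "LangGraph execution failed"


def failures_by_node(log_lines: Iterable[str]) -> Counter:
    """Count LangGraph execution failures per graph node."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        if entry.get("error") == FAILURE_MESSAGE:
            # "node" is an assumed field name; adjust to the real schema.
            counts[entry.get("node", "<unknown>")] += 1
    return counts


# Hypothetical log lines for illustration.
sample = [
    '{"service": "ai-service", "error": "LangGraph execution failed", "node": "call_tool"}',
    '{"service": "ai-service", "error": "LangGraph execution failed", "node": "call_tool"}',
    '{"service": "ai-service", "error": "LangGraph execution failed", "node": "parse_intent"}',
    '{"service": "ai-service", "message": "request completed"}',
    "not json",
]
print(failures_by_node(sample).most_common())
# [('call_tool', 2), ('parse_intent', 1)]
```

A single node dominating the counts supports the "code bug in the graph definition" hypothesis and narrows down what to include in the ticket.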
Alert: RedisConnectionFailed (P1)
Meaning : The service cannot connect to Redis. Because session state is stored there, active conversations will fail.
Diagnosis : Check the service logs for "error":"Redis connection refused".
Resolution : This is an infrastructure issue. Escalate immediately to the database/cache infrastructure team.
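Before escalating, it can help to confirm from the affected host that the Redis endpoint is refusing TCP connections. The following is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical helper, and the host and port must be replaced with the real Redis endpoint (6379 is only the Redis default).

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_reach("<redis-host>", 6379)` returning False from the service's network confirms a connectivity problem. A True result only shows that the port is open, not that Redis is healthy.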
Alert: MCPServiceUnreachable (P2)
Meaning : The ai_service cannot connect to the mcp_service to execute tools.
Diagnosis : Logs will show errors like "mcp_service call failed" or "upstream connection error".
Resolution : The fault lies with the mcp_service. Switch to runbook-mcp-service.md and continue troubleshooting there.
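The log signatures in this section can be collected into a small lookup for quick triage. This is a sketch only: the `Triage` type and `triage` function are hypothetical helpers, and the matching is plain substring search against the messages quoted in the alerts above.

```python
from typing import NamedTuple, Optional


class Triage(NamedTuple):
    alert: str
    priority: str
    action: str


MCP_UNREACHABLE = Triage(
    "MCPServiceUnreachable", "P2", "Switch to runbook-mcp-service.md"
)

# Log substrings and responses taken from the alert entries in this runbook.
RULES = [
    (
        "LangGraph execution failed",
        Triage(
            "HighLangGraphErrorRate",
            "P2",
            "Roll back the last deployment and file a high-priority ticket",
        ),
    ),
    (
        "Redis connection refused",
        Triage(
            "RedisConnectionFailed",
            "P1",
            "Escalate to the database/cache infrastructure team",
        ),
    ),
    ("mcp_service call failed", MCP_UNREACHABLE),
    ("upstream connection error", MCP_UNREACHABLE),
]


def triage(log_message: str) -> Optional[Triage]:
    """Return the first matching alert for a log message, or None."""
    for needle, result in RULES:
        if needle in log_message:
            return result
    return None


match = triage('{"error":"Redis connection refused"}')
print(match.alert, match.priority)  # RedisConnectionFailed P1
```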