
Runbook: AI Service

Version | Date       | Author           | Change Description
1.0     | 2025-09-17 | Gemini Assistant | Initial Draft

1. Service Overview

  • Purpose: The AI Service is the core orchestration component for the AI-Enhanced Workflow. It uses LangGraph to interpret user intent, manage conversational state, and execute tasks by calling tools on the mcp_service (see the orchestration sketch after this list).
  • Criticality: High (Tier 2). An outage will render the AI chat functionality unusable, but will not impact the core application's manual functionality.
  • Owners: AI/ML Team (#ai-oncall on Slack).
  • Key Dependencies:
    • Internal: api_gateway (for receiving requests), mcp_service (critical, for tool execution).
    • Infrastructure: Redis (critical, for session state), MongoDB (critical, for chat history).
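
For orientation, the orchestration loop can be pictured as a small LangGraph state graph: interpret the user's intent, execute a tool via the mcp_service, and return a response. The sketch below is a minimal illustration only; the node names, state fields, and placeholder functions are assumptions, not the service's actual graph definition.

```python
# Minimal LangGraph sketch of the agent loop, for orientation only.
# Node names, state fields, and the placeholder bodies are hypothetical.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    user_input: str   # latest user message
    tool_result: str  # output of the last tool call, if any
    response: str     # final answer returned to the user


def interpret_intent(state: AgentState) -> dict:
    # In the real service this step calls the LLM to decide which tool to run.
    return {"response": f"Understood request: {state['user_input']}"}


def execute_tool(state: AgentState) -> dict:
    # Placeholder for the call to mcp_service; the real client is not shown here.
    return {"tool_result": "tool output"}


builder = StateGraph(AgentState)
builder.add_node("interpret_intent", interpret_intent)
builder.add_node("execute_tool", execute_tool)
builder.add_edge(START, "interpret_intent")
builder.add_edge("interpret_intent", "execute_tool")
builder.add_edge("execute_tool", END)
graph = builder.compile()

if __name__ == "__main__":
    print(graph.invoke({"user_input": "summarise my open tickets",
                        "tool_result": "", "response": ""}))
```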

2. On-Call & Escalation

  • Primary On-Call: PagerDuty schedule "AI Service On-Call".
  • Escalation: Escalate to the "AI/ML Lead" PagerDuty schedule.

3. Monitoring and Alerting

  • Primary Dashboard: [Link to Grafana Dashboard for AI Service]
  • Log Location: service:"ai-service"
  • Key Metrics (a hedged instrumentation sketch follows this list):
    • langgraph_errors_total: A spike indicates issues with the orchestration logic (e.g., invalid state transitions, tool calling failures).
    • llm_query_latency_seconds_p95: High latency here points to issues with the external LLM provider.
    • mcp_call_errors_total: Count of failed calls to the mcp_service. A rise here while other error metrics stay flat helps isolate the fault to the downstream service.
    • redis_connection_errors_total: Any value greater than zero is a critical alert.
    • mongodb_write_errors_total: Any value greater than zero is a critical alert.
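
For reference, counters and histograms behind these metrics are typically registered with a Prometheus client at service startup; the p95 shown in the dashboard is derived from the latency histogram. The sketch below is a hedged illustration assuming prometheus_client in Python: the metric names mirror this runbook, but the help strings and exposition port are assumptions.

```python
# Hypothetical instrumentation of the runbook's key metrics with prometheus_client.
# Metric names mirror the runbook; help strings and the port are assumptions.
from prometheus_client import Counter, Histogram, start_http_server

langgraph_errors_total = Counter(
    "langgraph_errors_total",
    "LangGraph orchestration failures (invalid transitions, tool-call errors)")
llm_query_latency_seconds = Histogram(
    "llm_query_latency_seconds",
    "Latency of calls to the external LLM provider (p95 derived in Grafana)")
mcp_call_errors_total = Counter(
    "mcp_call_errors_total", "Failed calls to mcp_service")
redis_connection_errors_total = Counter(
    "redis_connection_errors_total", "Failed connections to Redis")
mongodb_write_errors_total = Counter(
    "mongodb_write_errors_total", "Failed chat-history writes to MongoDB")

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for the Prometheus scraper
```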

4. Common Alerts and Troubleshooting Steps

Alert: HighLangGraphErrorRate (P2)

  • Meaning: The agent orchestration is failing frequently.
  • Diagnosis:
    1. Check logs for "error":"LangGraph execution failed". The log entry should contain the specific node and state that caused the failure (a sketch of the corresponding log statement follows this alert).
    2. Look for errors related to unexpected LLM output or failed tool validation.
  • Resolution:
    • This is likely a code bug in the graph definition. Roll back the last deployment and create a high-priority ticket with the diagnostic information.
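
The node and state fields mentioned in the diagnosis come from structured error logging around graph execution. The sketch below is a hypothetical illustration using Python's standard logging module; the wrapper function and the exact field names are assumptions about the service's log schema.

```python
# Hypothetical error-logging wrapper that produces the entry referenced above.
# Field names ("node", "state") are assumptions about the log schema.
import json
import logging

logger = logging.getLogger("ai-service")


def run_node(node_name: str, state: dict, node_fn) -> dict:
    try:
        return node_fn(state)
    except Exception:
        # Emits the "LangGraph execution failed" entry, including the failing
        # node and the state it received, then re-raises for the alerting path.
        logger.exception(json.dumps({
            "error": "LangGraph execution failed",
            "node": node_name,
            "state": state,
        }, default=str))
        raise
```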

Alert: RedisConnectionFailed (P1)

  • Meaning: The service cannot connect to the Redis cache. Active conversations will fail.
  • Diagnosis: Check the service logs for "error":"Redis connection refused" (a quick connectivity probe is sketched below).
  • Resolution: This is an infrastructure issue. Escalate immediately to the database/cache infrastructure team.
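
To confirm which side of the connection is failing before escalating, a quick reachability probe can be run from the service host. The sketch below uses the redis Python client; the REDIS_HOST/REDIS_PORT environment variables and their defaults are assumptions, as the real connection settings live in the service configuration.

```python
# Quick Redis reachability probe; REDIS_HOST/REDIS_PORT are assumed env vars.
import os

import redis

client = redis.Redis(
    host=os.environ.get("REDIS_HOST", "localhost"),
    port=int(os.environ.get("REDIS_PORT", "6379")),
    socket_connect_timeout=2,
)

try:
    client.ping()
    print("Redis reachable: PING succeeded")
except redis.exceptions.ConnectionError as exc:
    print(f"Redis unreachable: {exc}")  # matches the 'connection refused' symptom
```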

Alert: MCPServiceUnreachable (P2)

  • Meaning: The ai_service cannot connect to the mcp_service to execute tools.
  • Diagnosis: Logs will show errors like "mcp_service call failed" or "upstream connection error" (a reachability probe is sketched below).
  • Resolution: The fault lies with the mcp_service. Switch to runbook-mcp-service.md and begin troubleshooting there.
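
Before switching runbooks, a quick reachability check against the mcp_service can confirm where the fault lies. The sketch below uses requests; the MCP_SERVICE_URL environment variable and the /health endpoint are assumptions, so consult runbook-mcp-service.md for the actual endpoints.

```python
# Probe mcp_service reachability; the URL and /health path are assumptions.
import os

import requests

base_url = os.environ.get("MCP_SERVICE_URL", "http://mcp-service:8080")

try:
    resp = requests.get(f"{base_url}/health", timeout=3)
    print(f"mcp_service responded: HTTP {resp.status_code}")
except requests.exceptions.RequestException as exc:
    print(f"mcp_service unreachable: {exc}")  # matches 'upstream connection error'
```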