
Runbook: AI Service

Version | Date       | Author           | Change Description
1.0     | 2025-09-17 | Gemini Assistant | Initial Draft

1. Service Overview

  • Purpose: The AI Service is the core orchestration component for the AI-Enhanced Workflow. It uses LangGraph to interpret user intent, manage conversational state, and execute tasks by calling tools on the mcp_service (see the orchestration sketch after this list).
  • Criticality: High (Tier 2). An outage will render the AI chat functionality unusable, but will not impact the core application's manual functionality.
  • Owners: AI/ML Team (#ai-oncall on Slack).
  • Key Dependencies:
    • Internal: api_gateway (for receiving requests), mcp_service (critical, for tool execution).
    • Infrastructure: Redis (critical, for session state), MongoDB (critical, for chat history).
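
For orientation, the orchestration loop can be pictured as a small LangGraph state graph: interpret the user's intent, execute a tool via the mcp_service, and return a response. The sketch below is a minimal illustration only; the node names, state fields, and placeholder functions are assumptions, not the service's actual graph definition.

```python
# Minimal LangGraph sketch of the agent loop, for orientation only.
# Node names, state fields, and the placeholder bodies are hypothetical.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    user_input: str   # latest user message
    tool_result: str  # output of the last tool call, if any
    response: str     # final answer returned to the user


def interpret_intent(state: AgentState) -> dict:
    # In the real service this step calls the LLM to decide which tool to run.
    return {"response": f"Understood request: {state['user_input']}"}


def execute_tool(state: AgentState) -> dict:
    # Placeholder for the call to mcp_service; the real client is not shown here.
    return {"tool_result": "tool output"}


builder = StateGraph(AgentState)
builder.add_node("interpret_intent", interpret_intent)
builder.add_node("execute_tool", execute_tool)
builder.add_edge(START, "interpret_intent")
builder.add_edge("interpret_intent", "execute_tool")
builder.add_edge("execute_tool", END)
graph = builder.compile()

if __name__ == "__main__":
    print(graph.invoke({"user_input": "summarise my open tickets",
                        "tool_result": "", "response": ""}))
```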

2. On-Call & Escalation

  • Primary On-Call: PagerDuty schedule "AI Service On-Call".
  • Escalation: Escalate to the "AI/ML Lead" PagerDuty schedule.

3. Monitoring and Alerting

  • Primary Dashboard: [Link to Grafana Dashboard for AI Service]
  • Log Location: service:"ai-service"
  • Key Metrics (a hedged instrumentation sketch follows this list):
    • langgraph_errors_total: A spike indicates issues with the orchestration logic (e.g., invalid state transitions, tool calling failures).
    • llm_query_latency_seconds_p95: High latency here points to issues with the external LLM provider.
    • mcp_call_errors_total: Count of failed calls to the mcp_service. A rise here while other error metrics stay flat helps isolate the fault to the downstream service.
    • redis_connection_errors_total: Any value greater than zero is a critical alert.
    • mongodb_write_errors_total: Any value greater than zero is a critical alert.
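
For reference, counters and histograms behind these metrics are typically registered with a Prometheus client at service startup; the p95 shown in the dashboard is derived from the latency histogram. The sketch below is a hedged illustration assuming prometheus_client in Python: the metric names mirror this runbook, but the help strings and exposition port are assumptions.

```python
# Hypothetical instrumentation of the runbook's key metrics with prometheus_client.
# Metric names mirror the runbook; help strings and the port are assumptions.
from prometheus_client import Counter, Histogram, start_http_server

langgraph_errors_total = Counter(
    "langgraph_errors_total",
    "LangGraph orchestration failures (invalid transitions, tool-call errors)")
llm_query_latency_seconds = Histogram(
    "llm_query_latency_seconds",
    "Latency of calls to the external LLM provider (p95 derived in Grafana)")
mcp_call_errors_total = Counter(
    "mcp_call_errors_total", "Failed calls to mcp_service")
redis_connection_errors_total = Counter(
    "redis_connection_errors_total", "Failed connections to Redis")
mongodb_write_errors_total = Counter(
    "mongodb_write_errors_total", "Failed chat-history writes to MongoDB")

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for the Prometheus scraper
```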

4. Common Alerts and Troubleshooting Steps

Alert: HighLangGraphErrorRate (P2)

  • Meaning: The agent orchestration is failing frequently.
  • Diagnosis:
    1. Check logs for "error":"LangGraph execution failed". The log entry should contain the specific node and state that caused the failure (a sketch of the corresponding log statement follows this alert).
    2. Look for errors related to unexpected LLM output or failed tool validation.
  • Resolution:
    • This is likely a code bug in the graph definition. Roll back the last deployment and create a high-priority ticket with the diagnostic information.
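
The node and state fields mentioned in the diagnosis come from structured error logging around graph execution. The sketch below is a hypothetical illustration using Python's standard logging module; the wrapper function and the exact field names are assumptions about the service's log schema.

```python
# Hypothetical error-logging wrapper that produces the entry referenced above.
# Field names ("node", "state") are assumptions about the log schema.
import json
import logging

logger = logging.getLogger("ai-service")


def run_node(node_name: str, state: dict, node_fn) -> dict:
    try:
        return node_fn(state)
    except Exception:
        # Emits the "LangGraph execution failed" entry, including the failing
        # node and the state it received, then re-raises for the alerting path.
        logger.exception(json.dumps({
            "error": "LangGraph execution failed",
            "node": node_name,
            "state": state,
        }, default=str))
        raise
```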

Alert: RedisConnectionFailed (P1)

  • Meaning: The service cannot connect to the Redis cache. Active conversations will fail.
  • Diagnosis: Check the service logs for "error":"Redis connection refused" (a quick connectivity probe is sketched below).
  • Resolution: This is an infrastructure issue. Escalate immediately to the database/cache infrastructure team.
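
To confirm which side of the connection is failing before escalating, a quick reachability probe can be run from the service host. The sketch below uses the redis Python client; the REDIS_HOST/REDIS_PORT environment variables and their defaults are assumptions, as the real connection settings live in the service configuration.

```python
# Quick Redis reachability probe; REDIS_HOST/REDIS_PORT are assumed env vars.
import os

import redis

client = redis.Redis(
    host=os.environ.get("REDIS_HOST", "localhost"),
    port=int(os.environ.get("REDIS_PORT", "6379")),
    socket_connect_timeout=2,
)

try:
    client.ping()
    print("Redis reachable: PING succeeded")
except redis.exceptions.ConnectionError as exc:
    print(f"Redis unreachable: {exc}")  # matches the 'connection refused' symptom
```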

Alert: MCPServiceUnreachable (P2)

  • Meaning: The ai_service cannot connect to the mcp_service to execute tools.
  • Diagnosis: Logs will show errors like "mcp_service call failed" or "upstream connection error" (a reachability probe is sketched below).
  • Resolution: The fault lies with the mcp_service. Switch to runbook-mcp-service.md and begin troubleshooting there.
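
Before switching runbooks, a quick reachability check against the mcp_service can confirm where the fault lies. The sketch below uses requests; the MCP_SERVICE_URL environment variable and the /health endpoint are assumptions, so consult runbook-mcp-service.md for the actual endpoints.

```python
# Probe mcp_service reachability; the URL and /health path are assumptions.
import os

import requests

base_url = os.environ.get("MCP_SERVICE_URL", "http://mcp-service:8080")

try:
    resp = requests.get(f"{base_url}/health", timeout=3)
    print(f"mcp_service responded: HTTP {resp.status_code}")
except requests.exceptions.RequestException as exc:
    print(f"mcp_service unreachable: {exc}")  # matches 'upstream connection error'
```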