Runbook: API Gateway
| Version | Date | Author | Change Description |
|---|---|---|---|
| 1.0 | 2025-09-11 | Senior Systems Analyst | Initial Draft |
1. Service Overview
- Purpose: The API Gateway is the single, unified entry point for all client-side requests. It handles protocol translation (REST <-> gRPC), orchestrates the SSO flow, and performs session validation and authorization checks.
- Criticality: Critical (Tier 1). A full outage of this service will make the entire application inaccessible to all users. Degraded performance will affect every single user action.
- Owners: Core Infrastructure Team (`#infra-oncall` on Slack).
- Key Dependencies:
  - External: User's Browser/Client.
  - Internal: AuthService (critical dependency for every protected request), AIService (critical for the `/api/ai` endpoint).
2. On-Call & Escalation
- Primary On-Call: PagerDuty schedule "APIGateway On-Call".
- Escalation: If the primary on-call does not acknowledge within 15 minutes, escalate to the "Core Infrastructure Lead" PagerDuty schedule.
3. Monitoring and Alerting
- Primary Dashboard: [Link to Grafana Dashboard for API Gateway]
- Log Location: Logs are streamed to our central logging platform. Use the following query:
      service:"api-gateway"
  - To view live logs in production:
        kubectl logs -f -l app=api-gateway -n production
- Key Metrics:
  - `http_requests_total`: Total number of incoming HTTP requests, partitioned by path and status code (2xx, 4xx, 5xx). A spike in 5xx is a key indicator of problems.
  - `http_request_latency_seconds_p99`: 99th percentile latency for HTTP requests. If this exceeds 1s, it's a major issue.
  - `upstream_grpc_errors_total`: The number of errors the gateway receives when calling backend services (like AuthService). This helps distinguish gateway problems from upstream problems.
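As a quick sanity check on the dashboard, the 5xx share of `http_requests_total` can be worked out by hand from two counts taken over the same time window. The helper below is a minimal sketch; the function name `error_rate_pct` and the idea of reading the two counts off the dashboard are assumptions, not part of the gateway tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the gateway tooling).
# Usage: error_rate_pct <total_requests> <5xx_requests>
# Prints the 5xx share as a percentage with two decimals.
error_rate_pct() {
  local total="$1" errors="$2"
  # Avoid dividing by zero when no requests were seen in the window.
  if [ "$total" -eq 0 ]; then
    echo "0.00"
    return 0
  fi
  awk -v t="$total" -v e="$errors" 'BEGIN { printf "%.2f\n", (e / t) * 100 }'
}
```

For example, `error_rate_pct 2000 150` prints `7.50`, which is above the 5% example threshold used by the HighHttp5xxErrorRate alert.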
4. Common Alerts and Troubleshooting Steps
Alert: HighHttp5xxErrorRate (P1)
- Meaning: The percentage of HTTP responses with a 5xx status code has exceeded a critical threshold (e.g., 5%) over the last 5 minutes. The application is likely down or severely impaired for many users.
- Initial Diagnosis:
  - Check the gateway logs immediately for fatal errors (`-E` is required so that `|` is treated as alternation rather than a literal character):
        kubectl logs -l app=api-gateway -n production --since=10m | grep -E "error|panic|fatal"
  - Look for the specific error code.
    - `500 Internal Server Error`: The problem is likely within the gateway itself (a code bug).
    - `503 Service Unavailable`: The gateway cannot reach an upstream service, almost certainly the `AuthService`.
  - Check the status of the `AuthService` deployment. Is it running? Are its pods restarting?
- Resolution Steps:
  - If `500` errors and logs show a panic: This was likely caused by a recent deployment. Immediately initiate a rollback (see SOPs below).
  - If `503` errors: The issue is with the `AuthService`. Switch to `runbook-auth-service.md` and begin troubleshooting it. The gateway is functioning correctly but its dependency is down.
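The triage above can be sketched as two small shell helpers: one counts suspicious log lines, the other maps the dominant status code to the next step in this runbook. Both function names and the returned labels are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Count log lines (read from stdin) that mention an error, panic, or fatal condition.
# -E enables alternation with |, -i ignores case, -c prints the number of matching lines.
count_fatal_lines() {
  # grep exits non-zero when nothing matches; "|| true" keeps the printed 0 usable.
  grep -Eci 'error|panic|fatal' || true
}

# Map the dominant 5xx status code to the next step described in this runbook.
next_step_for_status() {
  case "$1" in
    500) echo "rollback" ;;                        # likely a gateway bug from a recent deployment
    503) echo "switch-to-auth-service-runbook" ;;  # upstream unreachable
    *)   echo "investigate-further" ;;             # not covered by this section
  esac
}
```

A possible use is `kubectl logs -l app=api-gateway -n production --since=10m | count_fatal_lines`, followed by `next_step_for_status 503` once the dominant code is known.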
Alert: HighApiLatency (P2)
- Meaning: The p99 latency for API Gateway responses is high. Users are experiencing a slow, sluggish application.
- Initial Diagnosis:
  - Check the Grafana dashboard. Is the latency in the gateway itself, or is it in the upstream gRPC calls to `AuthService`? The dashboard should have panels for both.
  - If upstream latency is high: The bottleneck is the `AuthService`. Switch to its runbook.
  - If gateway latency is high but upstream is normal: The gateway itself is the problem. Check its resource utilization (CPU/Memory).
- Resolution Steps:
  - If gateway resource utilization is high: Scale up the number of pods to handle the load. The replica count of 5 is an example; check the current count first and pick a higher value, otherwise the command scales the deployment down.
        kubectl scale deployment/api-gateway -n production --replicas=5
  - If no obvious cause: Perform a rolling restart of the gateway pods.
        kubectl rollout restart deployment/api-gateway -n production
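The gateway-versus-upstream decision above can be written down as a small comparison. This is only a sketch: the function name `latency_culprit` is hypothetical, the two inputs are p99 values in seconds read off the dashboard panels, and applying the 1s figure from the Key Metrics section to the upstream panel as well is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the gateway tooling).
# Usage: latency_culprit <gateway_p99_seconds> <upstream_p99_seconds> [threshold_seconds]
# Prints "upstream", "gateway", or "ok".
latency_culprit() {
  local gateway_p99="$1" upstream_p99="$2" threshold="${3:-1}"
  # awk handles the floating-point comparison; "+ 0" forces numeric context.
  awk -v g="$gateway_p99" -v u="$upstream_p99" -v t="$threshold" 'BEGIN {
    if (u + 0 > t + 0)      print "upstream"   # bottleneck is the upstream service
    else if (g + 0 > t + 0) print "gateway"    # upstream is normal, gateway is slow
    else                    print "ok"
  }'
}
```

For example, `latency_culprit 1.8 0.2` prints `gateway`, which corresponds to the "check resource utilization" branch above.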
Alert: HighAiServiceErrorRate (P2)
- Meaning: The `/api/ai` endpoint is returning a high number of 5xx errors.
- Initial Diagnosis:
  - This indicates a problem with the upstream `ai_service`.
  - The gateway logs will show gRPC errors for calls to the `ai_service`.
- Resolution Steps:
  - The gateway is functioning correctly. Switch to `runbook-ai-service.md` and begin troubleshooting it.
Alert: AuthServiceUnreachable (P1)
- Meaning: The API Gateway's gRPC client cannot establish a connection to the `AuthService`. All protected endpoints will fail.
- Initial Diagnosis:
  - The gateway logs will be filled with "upstream connection error" or "service unavailable" for gRPC calls.
  - Confirm DNS resolution and network connectivity from a gateway pod to the `AuthService` Kubernetes service.
        # Get a pod name
        GATEWAY_POD=$(kubectl get pods -l app=api-gateway -n production -o jsonpath='{.items[0].metadata.name}')
        # Exec into the pod and try to reach the auth service
        kubectl exec -it "$GATEWAY_POD" -n production -- sh
        # Inside the pod:
        # nc -vz auth-service.production.svc.cluster.local <GRPC_PORT>
- Resolution Steps:
  - If the connectivity check fails, the problem is either with the `AuthService` deployment (it has no running pods) or a Kubernetes networking issue.
  - Check the status of the `AuthService` pods: `kubectl get pods -l app=auth-service -n production`.
  - If `AuthService` pods are running, this is likely a network policy or service discovery issue. Escalate to the core infrastructure team.
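The interactive check above can also be run in one step without opening a shell in the pod. The sketch below assumes, as the manual steps do, that `nc` is available in the gateway container image; the function name `check_auth_connectivity` is hypothetical, and the gRPC port must be supplied by the caller because this runbook does not state it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the gateway tooling).
# Usage: check_auth_connectivity <grpc_port> [namespace]
# Prints "reachable" or "unreachable"; returns 0, 1, or 2 (no gateway pod found).
check_auth_connectivity() {
  local port="$1" ns="${2:-production}" pod
  # Pick the first gateway pod; the jsonpath lookup fails if there are none.
  pod=$(kubectl get pods -l app=api-gateway -n "$ns" \
        -o jsonpath='{.items[0].metadata.name}') || return 2
  [ -n "$pod" ] || { echo "no gateway pod found" >&2; return 2; }
  # Run nc inside the pod; send its diagnostics to stderr so stdout stays clean.
  if kubectl exec "$pod" -n "$ns" -- \
       nc -vz "auth-service.${ns}.svc.cluster.local" "$port" >&2; then
    echo "reachable"
  else
    echo "unreachable"
    return 1
  fi
}
```

An `unreachable` result leads into the resolution steps above: check the `AuthService` pods first, then suspect network policy or service discovery.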
5. Standard Operating Procedures (SOPs)
- How to Roll Back a Deployment:
      # See the history of deployments
      kubectl rollout history deployment/api-gateway -n production
      # Roll back to the previous version
      kubectl rollout undo deployment/api-gateway -n production
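During an incident it helps to confirm that the rollback actually finished instead of assuming it did. The sketch below wraps the rollback in a function and then waits on `kubectl rollout status`; the function name `rollback_gateway` and the 120-second timeout are assumptions to adjust as needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the gateway tooling).
# Usage: rollback_gateway [namespace]
# Rolls the api-gateway deployment back one revision and waits for the rollout.
rollback_gateway() {
  local ns="${1:-production}"
  # Stop here if the undo itself is rejected (for example, no previous revision).
  kubectl rollout undo deployment/api-gateway -n "$ns" || return 1
  # Block until the rolled-back pods are ready or the timeout expires.
  kubectl rollout status deployment/api-gateway -n "$ns" --timeout=120s
}
```

A non-zero exit status from `rollback_gateway` means the rollback did not complete cleanly and the 5xx rate should not be expected to recover yet.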
6. How to Run Unit and Integration Tests
To run the unit and integration tests for the API Gateway service, navigate to the src/services/api_gateway-ts directory and execute the following commands:
cd src/services/api_gateway-ts
npm install
npm test
This will execute all tests defined in the project and report the results.
7. How to Build and Deploy
Building the API Gateway
To build the API Gateway service for production, navigate to the src/services/api_gateway-ts directory and run:
cd src/services/api_gateway-ts
npm install
npm run build
This will compile the TypeScript code into JavaScript and place the output in the dist directory.
Deploying the API Gateway
The API Gateway service is containerized and deployed to Kubernetes.
Manual Deployment (for emergencies or specific environments):
- Ensure the service is built (see "Building the API Gateway" above).
- Build the Docker image:
      docker build -t your-registry/api-gateway:latest .
      docker push your-registry/api-gateway:latest
- Apply the Kubernetes manifests:
      kubectl apply -f k8s/api-gateway-deployment.yaml
      kubectl apply -f k8s/api-gateway-service.yaml
For automated deployments, refer to the CI/CD pipelines (e.g., services-ci.yml in .github/workflows).
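For a manual deployment, chaining the steps so that a failure stops the sequence avoids pushing or applying a half-built image. The following is a minimal sketch, not the CI/CD pipeline's actual logic; the function name `deploy_gateway` is hypothetical, and the image reference is passed in because `your-registry` is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the gateway tooling).
# Usage: deploy_gateway <image_reference>
# Run from the directory that contains the Dockerfile and the k8s/ manifests.
deploy_gateway() {
  local image="$1"   # for example your-registry/api-gateway:latest
  # Each step runs only if the previous one succeeded.
  docker build -t "$image" . &&
  docker push "$image" &&
  kubectl apply -f k8s/api-gateway-deployment.yaml &&
  kubectl apply -f k8s/api-gateway-service.yaml
}
```

If the manifest already references the same image tag, `kubectl apply` may report no change; in that case `kubectl rollout restart deployment/api-gateway -n production` starts a new rollout.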
8. How to Run Locally (Dev Env)
To run the API Gateway service locally in a development environment:
- Ensure Docker Compose services are running: The API Gateway depends on the Auth Service. Make sure all necessary services are up and running using `docker-compose up -d`.
- Navigate to the API Gateway directory:
      cd src/services/api_gateway-ts
- Install dependencies:
      npm install
- Start the development server:
      npm run start:dev
  The API Gateway will typically be available at `http://localhost:3000` (as configured in `docker-compose.yml`).