
Runbook: API Gateway

Version | Date       | Author                 | Change Description
1.0     | 2025-09-11 | Senior Systems Analyst | Initial Draft

1. Service Overview

  • Purpose: The API Gateway is the single, unified entry point for all client-side requests. It handles protocol translation (REST <-> gRPC), orchestrates the SSO flow, and performs session validation and authorization checks.
  • Criticality: Critical (Tier 1). A full outage of this service will make the entire application inaccessible to all users. Degraded performance will affect every single user action.
  • Owners: Core Infrastructure Team (#infra-oncall on Slack).
  • Key Dependencies:
    • External: User's Browser/Client.
    • Internal: AuthService (critical dependency for every protected request), AIService (critical for the /api/ai endpoint).

2. On-Call & Escalation

  • Primary On-Call: PagerDuty schedule "APIGateway On-Call".
  • Escalation: If the primary on-call does not acknowledge within 15 minutes, escalate to the "Core Infrastructure Lead" PagerDuty schedule.

3. Monitoring and Alerting

  • Primary Dashboard: [Link to Grafana Dashboard for API Gateway]
  • Log Location: Logs are streamed to our central logging platform. Use the following query: service:"api-gateway"
    • To view live logs in production: kubectl logs -f -l app=api-gateway -n production
  • Key Metrics:
    • http_requests_total: Total number of incoming HTTP requests, partitioned by path and status code (2xx, 4xx, 5xx). A spike in 5xx is a key indicator of problems.
    • http_request_latency_seconds_p99: 99th percentile latency for HTTP requests. If this exceeds 1s, it's a major issue.
    • upstream_grpc_errors_total: The number of errors the gateway receives when calling backend services (like AuthService). This helps distinguish gateway problems from upstream problems.
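    • Spot-checking metrics from a terminal: if the gateway exposes a Prometheus-style /metrics endpoint (an assumption; the container port 3000 is taken from the local-dev section and may differ in production), the raw counters can be inspected directly:
      # Port-forward to the gateway and fetch the metrics page (port 3000 and /metrics path are assumptions)
      kubectl port-forward deployment/api-gateway 9000:3000 -n production &
      sleep 2
      curl -s http://localhost:9000/metrics | grep -E "http_requests_total|http_request_latency_seconds|upstream_grpc_errors_total"
      # Stop the background port-forward when done
      kill %1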

4. Common Alerts and Troubleshooting Steps

Alert: HighHttp5xxErrorRate (P1)

  • Meaning: The percentage of HTTP responses with a 5xx status code has exceeded a critical threshold (e.g., 5%) over the last 5 minutes. The application is likely down or severely impaired for many users.

  • Initial Diagnosis:

    1. Check the gateway logs immediately for fatal errors:
      kubectl logs -l app=api-gateway -n production --since=10m | grep -iE "error|panic|fatal"
      
    2. Look for the specific error code.
      • 500 Internal Server Error: The problem is likely within the gateway itself (a code bug).
      • 503 Service Unavailable: The gateway cannot reach an upstream service. If errors span many endpoints, that service is almost certainly the AuthService; a spike limited to /api/ai points to the AIService instead (see the HighAiServiceErrorRate alert below).
    3. Check the status of the AuthService deployment. Is it running? Are its pods restarting?
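      A quick way to check (the deployment name auth-service is an assumption; the app=auth-service label matches the one used elsewhere in this runbook):
      kubectl get deployment auth-service -n production
      kubectl get pods -l app=auth-service -n production
      # Look for CrashLoopBackOff, Pending pods, or a climbing restart count
      kubectl describe pods -l app=auth-service -n production | grep -iA3 "last state"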
  • Resolution Steps:

    • If 500 errors and logs show a panic: This was likely caused by a recent deployment. Immediately initiate a rollback (see SOPs below).
    • If 503 errors: The issue is with the AuthService. Switch to the runbook-auth-service.md and begin troubleshooting it. The gateway is functioning correctly but its dependency is down.

Alert: HighApiLatency (P2)

  • Meaning: The p99 latency for API Gateway responses has exceeded its threshold. Users are experiencing a sluggish application.

  • Initial Diagnosis:

    1. Check the Grafana dashboard. Is the latency in the gateway itself, or is it in the upstream gRPC calls to AuthService? The dashboard should have panels for both.
    2. If upstream latency is high: The bottleneck is the AuthService. Switch to its runbook.
    3. If gateway latency is high but upstream is normal: The gateway itself is the problem. Check its resource utilization (CPU/Memory).
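      A quick resource check from the command line (kubectl top requires the metrics-server add-on, which is assumed to be installed in the cluster):
      kubectl top pods -l app=api-gateway -n production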
  • Resolution Steps:

    • If gateway resource utilization is high: Scale up the number of replicas to handle the load (the target of 5 below is an example; check the current count first, as shown in the final step of this list).
      kubectl scale deployment/api-gateway -n production --replicas=5
      
    • If no obvious cause: Perform a rolling restart of the gateway pods.
      kubectl rollout restart deployment/api-gateway -n production
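    • In either case, confirm the current state before and after acting (these mirror checks used elsewhere in this runbook):
      # Current replica count and pod status
      kubectl get deployment api-gateway -n production
      # Wait for the scale-up or rolling restart to finish
      kubectl rollout status deployment/api-gateway -n production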
      

Alert: HighAiServiceErrorRate (P2)

  • Meaning: The /api/ai endpoint is returning a high number of 5xx errors.
  • Initial Diagnosis:
    1. This indicates a problem with the upstream ai_service.
    2. The gateway logs will show gRPC errors for calls to the ai_service.
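      To confirm this from a terminal (the app=ai-service label and the log search string are assumptions; adjust them to match the AIService deployment and log format):
      kubectl logs -l app=api-gateway -n production --since=10m | grep -i "ai_service"
      kubectl get pods -l app=ai-service -n production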
  • Resolution Steps:
    1. The gateway is functioning correctly. Switch to the runbook-ai-service.md and begin troubleshooting it.

Alert: AuthServiceUnreachable (P1)

  • Meaning: The API Gateway's gRPC client cannot establish a connection to the AuthService. All protected endpoints will fail.
  • Initial Diagnosis:
    1. The gateway logs will be filled with "upstream connection error" or "service unavailable" for gRPC calls.
    2. Confirm DNS resolution and network connectivity from a gateway pod to the AuthService Kubernetes service.
      # Get a pod name
      GATEWAY_POD=$(kubectl get pods -l app=api-gateway -n production -o jsonpath='{.items[0].metadata.name}')
      
      # Exec into the pod and try to reach the auth service
      kubectl exec -it $GATEWAY_POD -n production -- sh
      # Inside the pod:
      # nc -vz auth-service.production.svc.cluster.local <GRPC_PORT>
      
  • Resolution Steps:
    1. If the connectivity check fails, the problem is either with the AuthService deployment (it has no running pods) or a Kubernetes networking issue.
    2. Check the status of the AuthService pods: kubectl get pods -l app=auth-service -n production
    3. If AuthService pods are running, this is likely a network policy or service discovery issue. Escalate to the core infrastructure team.
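      Before escalating, capture the Service and endpoint state; an empty Endpoints list means no ready AuthService pods are backing the Service (the Service name auth-service is inferred from the DNS name used in the connectivity check above):
      kubectl get svc auth-service -n production
      kubectl get endpoints auth-service -n production
      # Network policies that could be blocking gateway-to-auth traffic
      kubectl get networkpolicy -n production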

5. Standard Operating Procedures (SOPs)

  • How to Roll Back a Deployment:
    # See the history of deployments
    kubectl rollout history deployment/api-gateway -n production
    
    # Roll back to the previous version
    kubectl rollout undo deployment/api-gateway -n production
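    # Optional follow-ups: watch the rollback complete, or target a specific revision
    # (the revision number 2 below is only an example; take it from the history output above)
    kubectl rollout status deployment/api-gateway -n production
    kubectl rollout undo deployment/api-gateway -n production --to-revision=2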
    

6. How to Run Unit and Integration Tests

To run the unit and integration tests for the API Gateway service, navigate to the src/services/api_gateway-ts directory and execute the following commands:

cd src/services/api_gateway-ts
npm install
npm test

This will execute all tests defined in the project and report the results.
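During iterative development you may want to run only a subset of the tests. If the test runner is Jest (an assumption; check the "test" script in package.json), a filename pattern or watch mode can be passed through npm:

npm test -- gateway      # run only test files whose path matches "gateway" (example pattern)
npm test -- --watch      # re-run affected tests on file changes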

7. How to Build and Deploy

Building the API Gateway

To build the API Gateway service for production, navigate to the src/services/api_gateway-ts directory and run:

cd src/services/api_gateway-ts
npm install
npm run build

This will compile the TypeScript code into JavaScript and place the output in the dist directory.
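To sanity-check the build output without starting the server (the entry point name dist/index.js is an assumption; confirm it against the "main" field in package.json):

ls dist
node --check dist/index.js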

Deploying the API Gateway

The API Gateway service is containerized and deployed to Kubernetes.

Manual Deployment (for emergencies or specific environments):

  1. Ensure the service is built (see "Building the API Gateway" above).
  2. Build the Docker image:
    docker build -t your-registry/api-gateway:latest .
    docker push your-registry/api-gateway:latest
    
  3. Apply the Kubernetes manifests:
    kubectl apply -f k8s/api-gateway-deployment.yaml
    kubectl apply -f k8s/api-gateway-service.yaml
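  4. Optionally, verify the rollout before closing out (the same checks used in the troubleshooting sections above):
    kubectl rollout status deployment/api-gateway -n production
    kubectl get pods -l app=api-gateway -n production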
    

For automated deployments, refer to the CI/CD pipelines (e.g., services-ci.yml in .github/workflows).

8. How to Run Locally (Dev Env)

To run the API Gateway service locally in a development environment:

  1. Ensure Docker Compose services are running: the API Gateway depends on the Auth Service, so bring up all required services with docker-compose up -d.
  2. Navigate to the API Gateway directory:
    cd src/services/api_gateway-ts
    
  3. Install dependencies:
    npm install
    
  4. Start the development server:
    npm run start:dev
    
    The API Gateway will typically be available at http://localhost:3000 (as configured in docker-compose.yml).
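    Once it is running, a quick request from another terminal confirms the gateway is serving traffic (the exact health-check path depends on the gateway's route configuration; hitting the root is only a smoke test and may return a 404):
    curl -i http://localhost:3000/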