Runbook: AuthService

Version | Date       | Author                 | Change Description
1.0     | 2025-09-11 | Senior Systems Analyst | Initial Draft

1. Service Overview

  • Purpose: The AuthService is the central and sole authority for user identity, authentication, and authorization.
  • Criticality: Critical (Tier 1). A full outage of this service will prevent all users from logging in and will block all authenticated API calls, effectively rendering the entire application unusable. Degraded performance will impact every user interaction.
  • Owners: Core Infrastructure Team (#infra-oncall on Slack).
  • Key Dependencies:
    • External: Google Identity Platform (for SSO).
    • Internal: PostgreSQL Database (Primary Datastore), Redis (Session/Permission Cache).

2. On-Call & Escalation

  • Primary On-Call: PagerDuty schedule "AuthService On-Call".
  • Escalation: If the primary on-call does not acknowledge within 15 minutes, escalate to the "Core Infrastructure Lead" PagerDuty schedule.

3. Monitoring and Alerting

  • Primary Dashboard: [Link to Grafana Dashboard for AuthService]
  • Log Location: Logs are streamed to our central logging platform. Use the following query: service:"auth-service"
    • To view live logs in production: kubectl logs -f -l app=auth-service -n production
  • Key Metrics:
    • grpc_requests_total: Total number of gRPC requests, partitioned by method and status code. A spike in non-OK codes is a key indicator of problems.
    • grpc_request_latency_seconds_p99: 99th percentile latency. If this exceeds 500ms, it triggers a HighAuthLatency alert.
    • db_connection_pool_active: Number of active database connections. Should not approach the configured maximum.
    • redis_cache_hit_ratio: The ratio of cache hits to total lookups. A sudden drop indicates a problem with Redis or the caching logic.
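  • Spot-checking metrics: To check these metrics directly without going through Grafana, you can port-forward to a pod's metrics endpoint. A minimal sketch, assuming the service exposes Prometheus-format metrics on port 9090 (the real port may differ; check the Deployment spec):
    # Forward the (assumed) metrics port and scrape it locally
    kubectl port-forward deployment/auth-service -n production 9090:9090 &
    sleep 2
    curl -s localhost:9090/metrics | grep -E 'grpc_requests_total|grpc_request_latency_seconds_p99|db_connection_pool_active|redis_cache_hit_ratio'
    kill %1   # stop the port-forward when done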

4. Common Alerts and Troubleshooting Steps

Alert: HighAuthLatency (P2)

  • Meaning: The p99 latency for gRPC requests to the AuthService has exceeded 500ms for more than 5 minutes. Users are experiencing slow logins and API responses.

  • Initial Diagnosis:

    1. Check the logs for errors:
      kubectl logs -l app=auth-service -n production --since=15m | grep "error"
      
    2. Check for slow database queries: Access the database admin panel and look for long-running queries in the pg_stat_activity view (see the query sketch at the end of this alert).
    3. Check resource utilization: Look at the CPU and Memory graphs on the Grafana dashboard. Is the service resource-starved?
  • Resolution Steps:

    • If high CPU/Memory: The service may be under high load. Temporarily scale up the number of pods:
      kubectl scale deployment/auth-service -n production --replicas=5
      
      Investigate the source of the load spike after the incident.
    • If slow database queries: Identify the problematic query and, if it can be safely killed, cancel it (see the query sketch at the end of this alert). Escalate to the database team if the issue persists.
    • If no obvious cause: Perform a rolling restart of the service to see if the issue is transient.
      kubectl rollout restart deployment/auth-service -n production
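
  • Query sketch (referenced above): A minimal sketch of the slow-query check and the query cancel, assuming psql and the DATABASE_* environment variables are available (for example, inside an auth-service pod, as in the DatabaseConnectivityFailure alert below); the 30-second threshold is an arbitrary starting point:
    # List queries that have been running for more than 30 seconds
    psql -h "$DATABASE_HOST" -U "$DATABASE_USER" -d "$DATABASE_NAME" -c "
      SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
      FROM pg_stat_activity
      WHERE state <> 'idle' AND now() - query_start > interval '30 seconds'
      ORDER BY duration DESC;"

    # Cancel a confirmed-safe query by pid (pg_cancel_backend stops the query;
    # pg_terminate_backend drops the whole connection - escalate if in doubt)
    psql -h "$DATABASE_HOST" -U "$DATABASE_USER" -d "$DATABASE_NAME" -c "SELECT pg_cancel_backend(<pid>);"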
      

Alert: DatabaseConnectivityFailure (P1)

  • Meaning: The AuthService cannot connect to the PostgreSQL database. The service is HARD DOWN.
  • Initial Diagnosis:
    1. Check the service logs: they will be filled with "failed to connect to database" errors.
    2. Attempt to connect to the DB from the pod:
      # Get a pod name
      AUTH_POD=$(kubectl get pods -l app=auth-service -n production -o jsonpath='{.items[0].metadata.name}')
      
      # Exec into the pod and try to connect
      kubectl exec -it $AUTH_POD -n production -- sh
      # Inside the pod:
      # psql -h $DATABASE_HOST -U $DATABASE_USER -d $DATABASE_NAME
      
  • Resolution Steps:
    1. If the connection from the pod fails, the issue is with the database or the network.
    2. IMMEDIATELY ESCALATE to the database team and the network team. This is not an application-level issue.
    3. Check the status of the Cloud SQL/RDS instance in the cloud provider console.
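  • Non-interactive connectivity check: A quicker variant of the check in step 2 of the diagnosis, assuming the Postgres client tools (pg_isready) are present in the application image; 5432 is the default port and may differ:
    # Exit code 0 means the database is accepting connections
    kubectl exec $AUTH_POD -n production -- sh -c 'pg_isready -h "$DATABASE_HOST" -p 5432 -U "$DATABASE_USER"'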

Alert: RedisCacheUnreachable (P2)

  • Meaning: The AuthService cannot connect to the Redis cache. Performance will be severely degraded, and database load will be high.
  • Initial Diagnosis:
    1. Logs will show "failed to connect to redis" errors.
    2. Ping the Redis host from an application pod (see the command sketch at the end of this alert).
  • Resolution Steps:
    1. Follow the disaster recovery plan for the Redis cluster.
    2. If the cache cannot be recovered quickly, as a last resort, disable caching via the feature flag environment variable (CACHE_ENABLED=false) and perform a rolling restart of the AuthService. Warning: This will significantly increase database load. Monitor the database closely.
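  • Command sketches: Minimal sketches for the steps above, assuming redis-cli is available in the application image and that REDIS_HOST is the environment variable holding the cache address (adjust names to the actual deployment):
    # Diagnosis: ping Redis from an application pod (a healthy cache answers "PONG")
    AUTH_POD=$(kubectl get pods -l app=auth-service -n production -o jsonpath='{.items[0].metadata.name}')
    kubectl exec $AUTH_POD -n production -- sh -c 'redis-cli -h "$REDIS_HOST" ping'

    # Last resort: disable caching; changing the pod template triggers the rolling restart
    kubectl set env deployment/auth-service -n production CACHE_ENABLED=false
    kubectl rollout status deployment/auth-service -n production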

5. Standard Operating Procedures (SOPs)

  • How to Roll Back a Deployment:
    # See the history of deployments
    kubectl rollout history deployment/auth-service -n production
    
    # Roll back to the previous version
    kubectl rollout undo deployment/auth-service -n production
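
    # Confirm the rollback completed and the pods are healthy
    kubectl rollout status deployment/auth-service -n production

    # To roll back to a specific revision from the history listing (replace <revision>)
    kubectl rollout undo deployment/auth-service -n production --to-revision=<revision>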
    

6. How to Run Unit and Integration Tests

To run the unit and integration tests for the Auth Service, navigate to the src/services/auth-ts directory and execute the following commands:

cd src/services/auth-ts
npm install
npm test

This will execute all tests defined in the project and report the results.

7. How to Build and Deploy

Building the Auth Service

To build the Auth Service for production, navigate to the src/services/auth-ts directory and run:

cd src/services/auth-ts
npm install
npm run build

This will compile the TypeScript code into JavaScript and place the output in the dist directory.

Deploying the Auth Service

The Auth Service is containerized and deployed to Kubernetes.

Manual Deployment (for emergencies or specific environments):

  1. Ensure the service is built (see "Building the Auth Service" above).
  2. Build the Docker image:
    docker build -t your-registry/auth-service:latest .
    docker push your-registry/auth-service:latest
    
  3. Apply the Kubernetes manifests:
    kubectl apply -f k8s/auth-service-deployment.yaml
    kubectl apply -f k8s/auth-service-service.yaml
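
  Note that re-applying unchanged manifests pinned to the latest tag will not restart the pods, so a newly pushed image may not be picked up. A minimal sketch of a manual deploy using a unique tag instead, assuming the same registry placeholder and that the container inside the Deployment is also named auth-service:
    # Tag the image with the current git commit (hypothetical tag scheme)
    TAG=$(git rev-parse --short HEAD)
    docker build -t your-registry/auth-service:$TAG .
    docker push your-registry/auth-service:$TAG

    # Point the Deployment at the new tag and watch the rollout
    kubectl set image deployment/auth-service auth-service=your-registry/auth-service:$TAG -n production
    kubectl rollout status deployment/auth-service -n production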
    

For automated deployments, refer to the CI/CD pipelines (e.g., services-ci.yml in .github/workflows).