# Runbook: AuthService
| Version | Date | Author | Change Description |
|---|---|---|---|
| 1.0 | 2025-09-11 | Senior Systems Analyst | Initial Draft |
## 1. Service Overview
- Purpose: The AuthService is the central and sole authority for user identity, authentication, and authorization.
- Criticality: Critical (Tier 1). A full outage of this service will prevent all users from logging in and will block all authenticated API calls, effectively rendering the entire application unusable. Degraded performance will impact every user interaction.
- Owners: Core Infrastructure Team (`#infra-oncall` on Slack).
- Key Dependencies:
- External: Google Identity Platform (for SSO).
- Internal: PostgreSQL Database (Primary Datastore), Redis (Session/Permission Cache).
## 2. On-Call & Escalation
- Primary On-Call: PagerDuty schedule "AuthService On-Call".
- Escalation: If the primary on-call does not acknowledge within 15 minutes, escalate to the "Core Infrastructure Lead" PagerDuty schedule.
## 3. Monitoring and Alerting
- Primary Dashboard: [Link to Grafana Dashboard for AuthService]
- Log Location: Logs are streamed to our central logging platform. Use the following query:

      service:"auth-service"

- To view live logs in production:

      kubectl logs -f -l app=auth-service -n production
- Key Metrics:
  - `grpc_requests_total`: Total number of gRPC requests, partitioned by method and status code. A spike in non-OK codes is a key indicator of problems.
  - `grpc_request_latency_seconds_p99`: 99th percentile latency. If this exceeds 500ms, it triggers a `HighAuthLatency` alert.
  - `db_connection_pool_active`: Number of active database connections. Should not approach the configured maximum.
  - `redis_cache_hit_ratio`: The ratio of cache hits to total lookups. A sudden drop indicates a problem with Redis or the caching logic.
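As a quick sanity check during an incident, `redis_cache_hit_ratio` can be recomputed by hand from raw hit and lookup counts. The sketch below uses illustrative sample values in place of real counter reads:

```shell
# Illustrative spot-check of the cache hit ratio; the counter values here
# are sample numbers standing in for real metric reads.
hits=9500
lookups=10000
ratio=$(awk -v h="$hits" -v t="$lookups" 'BEGIN { printf "%.2f", h / t }')
echo "redis_cache_hit_ratio=$ratio"
```

A sudden drop well below the service's normal baseline (often ~0.9 or higher) is worth investigating even before an alert fires.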
## 4. Common Alerts and Troubleshooting Steps
### Alert: HighAuthLatency (P2)
- Meaning: The p99 latency for gRPC requests to the AuthService has exceeded 500ms for more than 5 minutes. Users are experiencing slow logins and API responses.
- Initial Diagnosis:
  - Check the logs for errors:

        kubectl logs -l app=auth-service -n production --since=15m | grep "error"

  - Check for slow database queries: Access the database admin panel and look for long-running queries in the `pg_stat_activity` view.
  - Check resource utilization: Look at the CPU and Memory graphs on the Grafana dashboard. Is the service resource-starved?
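To speed up the log check, error lines can be grouped by message so the dominant failure stands out. This sketch runs the same pipeline on an inline sample standing in for `kubectl logs` output:

```shell
# Group error log lines by message so the most frequent failure surfaces
# first. The inline sample stands in for real `kubectl logs` output.
sample='level=error msg="db timeout"
level=info msg="login ok"
level=error msg="db timeout"
level=error msg="redis miss"'
printf '%s\n' "$sample" | grep 'level=error' | sort | uniq -c | sort -rn
```

In production, pipe `kubectl logs -l app=auth-service -n production --since=15m` into the same `grep | sort | uniq -c | sort -rn` chain.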
- Resolution Steps:
  - If high CPU/Memory: The service may be under high load. Temporarily scale up the number of pods, and investigate the source of the load spike after the incident:

        kubectl scale deployment/auth-service -n production --replicas=5

  - If slow database queries: Identify the problematic query. If it can be safely killed, do so. Escalate to the database team if the issue persists.
  - If no obvious cause: Perform a rolling restart of the service to see if the issue is transient:

        kubectl rollout restart deployment/auth-service -n production
### Alert: DatabaseConnectivityFailure (P1)
- Meaning: The AuthService cannot connect to the PostgreSQL database. The service is HARD DOWN.
- Initial Diagnosis:
  - Check the service logs: they will be filled with "failed to connect to database" errors.
  - Attempt to connect to the DB from the pod:

        # Get a pod name
        AUTH_POD=$(kubectl get pods -l app=auth-service -n production -o jsonpath='{.items[0].metadata.name}')
        # Exec into the pod and try to connect
        kubectl exec -it $AUTH_POD -n production -- sh
        # Inside the pod:
        # psql -h $DATABASE_HOST -U $DATABASE_USER -d $DATABASE_NAME
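If the pod image lacks `psql`, a plain-shell reachability probe still works. The sketch below assumes the pod exposes a single connection URL (an assumption; the exact env layout varies, and the URL shown is a sample) and extracts host and port from it:

```shell
# Hypothetical: extract host and port from a single connection URL so a
# reachability probe can run without psql installed. The URL is a sample.
DATABASE_URL='postgres://auth:secret@db.internal:5432/authdb'
hostport=${DATABASE_URL#*@}    # -> db.internal:5432/authdb
hostport=${hostport%%/*}       # -> db.internal:5432
host=${hostport%%:*}
port=${hostport##*:}
echo "probing $host:$port"
# Inside the pod, follow with e.g.: nc -z "$host" "$port" && echo reachable
```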
- Resolution Steps:
  - If the connection from the pod fails, the issue is with the database or the network.
  - IMMEDIATELY ESCALATE to the database team and the network team. This is not an application-level issue.
  - Check the status of the Cloud SQL/RDS instance in the cloud provider console.
### Alert: RedisCacheUnreachable (P2)
- Meaning: The AuthService cannot connect to the Redis cache. Performance will be severely degraded, and database load will be high.
- Initial Diagnosis:
  - Logs will show "failed to connect to redis" errors.
  - Ping the Redis host from an application pod.
- Resolution Steps:
  - Follow the disaster recovery plan for the Redis cluster.
  - If the cache cannot be recovered quickly, as a last resort, disable caching via the feature flag environment variable (`CACHE_ENABLED=false`) and perform a rolling restart of the AuthService. Warning: This will significantly increase database load. Monitor the database closely.
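One way to flip the flag without editing manifests is `kubectl set env`, which triggers its own rollout. The sketch below shows that command (commented out) together with the assumed flag semantics, where anything other than `true` disables the cache; verify this matches the service's actual parsing before relying on it:

```shell
# Flip the cache flag on the deployment (commented out here; kubectl set env
# triggers its own rollout, so a separate restart may be unnecessary):
# kubectl set env deployment/auth-service -n production CACHE_ENABLED=false

# Assumed flag semantics: anything other than "true" disables the cache.
CACHE_ENABLED=false
if [ "$CACHE_ENABLED" = "true" ]; then
  echo "cache enabled"
else
  echo "cache disabled"
fi
```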
## 5. Standard Operating Procedures (SOPs)
- How to Roll Back a Deployment:

      # See the history of deployments
      kubectl rollout history deployment/auth-service -n production
      # Roll back to the previous version
      kubectl rollout undo deployment/auth-service -n production
## 6. How to Run Unit and Integration Tests
To run the unit and integration tests for the Auth Service, navigate to the `src/services/auth-ts` directory and execute the following commands:

    cd src/services/auth-ts
    npm install
    npm test
This will execute all tests defined in the project and report the results.
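For orientation, the `npm test` entry above maps to a `scripts` block in the project's `package.json`. The fragment below is a hypothetical sketch only; the script commands and the choice of Jest are assumptions, so check the actual file:

```json
{
  "scripts": {
    "build": "tsc",
    "test": "jest --runInBand"
  }
}
```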
## 7. How to Build and Deploy
### Building the Auth Service
To build the Auth Service for production, navigate to the `src/services/auth-ts` directory and run:

    cd src/services/auth-ts
    npm install
    npm run build

This will compile the TypeScript code into JavaScript and place the output in the `dist` directory.
### Deploying the Auth Service
The Auth Service is containerized and deployed to Kubernetes.
Manual Deployment (for emergencies or specific environments):
- Ensure the service is built (see "Building the Auth Service" above).
- Build and push the Docker image:

      docker build -t your-registry/auth-service:latest .
      docker push your-registry/auth-service:latest

- Apply the Kubernetes manifests:

      kubectl apply -f k8s/auth-service-deployment.yaml
      kubectl apply -f k8s/auth-service-service.yaml
For automated deployments, refer to the CI/CD pipelines (e.g., `services-ci.yml` in `.github/workflows`).
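A common refinement of the manual steps above (a suggestion, not current policy) is tagging the image with the current commit instead of `latest`, so a rollback maps to an exact build:

```shell
# Derive an image tag from the current commit so every deploy is traceable
# ("dev" is a fallback when run outside a git checkout).
TAG=$(git rev-parse --short HEAD 2>/dev/null || echo dev)
IMAGE="your-registry/auth-service:${TAG}"
echo "building ${IMAGE}"
# docker build -t "$IMAGE" .
# docker push "$IMAGE"
# Then point the deployment at the exact tag (container name assumed):
# kubectl set image deployment/auth-service auth-service="$IMAGE" -n production
```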