Runbook: AuthService

Version | Date       | Author                 | Change Description
1.0     | 2025-09-11 | Senior Systems Analyst | Initial Draft

1. Service Overview

  • Purpose: The AuthService is the central and sole authority for user identity, authentication, and authorization.
  • Criticality: Critical (Tier 1). A full outage of this service will prevent all users from logging in and will block all authenticated API calls, effectively rendering the entire application unusable. Degraded performance will impact every user interaction.
  • Owners: Core Infrastructure Team (#infra-oncall on Slack).
  • Key Dependencies:
    • External: Google Identity Platform (for SSO).
    • Internal: PostgreSQL Database (Primary Datastore), Redis (Session/Permission Cache).

2. On-Call & Escalation

  • Primary On-Call: PagerDuty schedule "AuthService On-Call".
  • Escalation: If the primary on-call does not acknowledge within 15 minutes, escalate to the "Core Infrastructure Lead" PagerDuty schedule.

3. Monitoring and Alerting

  • Primary Dashboard: [Link to Grafana Dashboard for AuthService]
  • Log Location: Logs are streamed to our central logging platform. Use the following query: service:"auth-service"
    • To view live logs in production: kubectl logs -f -l app=auth-service -n production
  • Key Metrics:
    • grpc_requests_total: Total number of gRPC requests, partitioned by method and status code. A spike in non-OK codes is a key indicator of problems.
    • grpc_request_latency_seconds_p99: 99th percentile latency. If this exceeds 500ms, it triggers a HighAuthLatency alert.
    • db_connection_pool_active: Number of active database connections. Should not approach the configured maximum.
    • redis_cache_hit_ratio: The ratio of cache hits to total lookups. A sudden drop indicates a problem with Redis or the caching logic.
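  • Spot-checking metrics: To check these metrics directly without going through Grafana, you can port-forward to a pod's metrics endpoint. A minimal sketch, assuming the service exposes Prometheus-format metrics on port 9090 (the real port may differ; check the Deployment spec):
    # Forward the (assumed) metrics port and scrape it locally
    kubectl port-forward deployment/auth-service -n production 9090:9090 &
    sleep 2
    curl -s localhost:9090/metrics | grep -E 'grpc_requests_total|grpc_request_latency_seconds_p99|db_connection_pool_active|redis_cache_hit_ratio'
    kill %1   # stop the port-forward when done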

4. Common Alerts and Troubleshooting Steps

Alert: HighAuthLatency (P2)

  • Meaning: The p99 latency for gRPC requests to the AuthService has exceeded 500ms for more than 5 minutes. Users are experiencing slow logins and API responses.

  • Initial Diagnosis:

    1. Check the logs for errors:
      kubectl logs -l app=auth-service -n production --since=15m | grep "error"
      
    2. Check for slow database queries: Access the database admin panel and look for long-running queries in the pg_stat_activity view (see the query sketch at the end of this alert).
    3. Check resource utilization: Look at the CPU and Memory graphs on the Grafana dashboard. Is the service resource-starved?
  • Resolution Steps:

    • If high CPU/Memory: The service may be under high load. Temporarily scale up the number of pods:
      kubectl scale deployment/auth-service -n production --replicas=5
      
      Investigate the source of the load spike after the incident.
    • If slow database queries: Identify the problematic query and, if it can be safely killed, cancel it (see the query sketch at the end of this alert). Escalate to the database team if the issue persists.
    • If no obvious cause: Perform a rolling restart of the service to see if the issue is transient.
      kubectl rollout restart deployment/auth-service -n production
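
  • Query sketch (referenced above): A minimal sketch of the slow-query check and the query cancel, assuming psql and the DATABASE_* environment variables are available (for example, inside an auth-service pod, as in the DatabaseConnectivityFailure alert below); the 30-second threshold is an arbitrary starting point:
    # List queries that have been running for more than 30 seconds
    psql -h "$DATABASE_HOST" -U "$DATABASE_USER" -d "$DATABASE_NAME" -c "
      SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
      FROM pg_stat_activity
      WHERE state <> 'idle' AND now() - query_start > interval '30 seconds'
      ORDER BY duration DESC;"

    # Cancel a confirmed-safe query by pid (pg_cancel_backend stops the query;
    # pg_terminate_backend drops the whole connection - escalate if in doubt)
    psql -h "$DATABASE_HOST" -U "$DATABASE_USER" -d "$DATABASE_NAME" -c "SELECT pg_cancel_backend(<pid>);"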
      

Alert: DatabaseConnectivityFailure (P1)

  • Meaning: The AuthService cannot connect to the PostgreSQL database. The service is HARD DOWN.
  • Initial Diagnosis:
    1. Check the service logs: they will be filled with "failed to connect to database" errors.
    2. Attempt to connect to the DB from the pod:
      # Get a pod name
      AUTH_POD=$(kubectl get pods -l app=auth-service -n production -o jsonpath='{.items[0].metadata.name}')
      
      # Exec into the pod and try to connect
      kubectl exec -it $AUTH_POD -n production -- sh
      # Inside the pod:
      # psql -h $DATABASE_HOST -U $DATABASE_USER -d $DATABASE_NAME
      
  • Resolution Steps:
    1. If the connection from the pod fails, the issue is with the database or the network.
    2. IMMEDIATELY ESCALATE to the database team and the network team. This is not an application-level issue.
    3. Check the status of the Cloud SQL/RDS instance in the cloud provider console.
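  • Non-interactive connectivity check: A quicker variant of the check in step 2 of the diagnosis, assuming the Postgres client tools (pg_isready) are present in the application image; 5432 is the default port and may differ:
    # Exit code 0 means the database is accepting connections
    kubectl exec $AUTH_POD -n production -- sh -c 'pg_isready -h "$DATABASE_HOST" -p 5432 -U "$DATABASE_USER"'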

Alert: RedisCacheUnreachable (P2)

  • Meaning: The AuthService cannot connect to the Redis cache. Performance will be severely degraded, and database load will be high.
  • Initial Diagnosis:
    1. Logs will show "failed to connect to redis" errors.
    2. Ping the Redis host from an application pod (see the command sketch at the end of this alert).
  • Resolution Steps:
    1. Follow the disaster recovery plan for the Redis cluster.
    2. If the cache cannot be recovered quickly, as a last resort, disable caching via the feature flag environment variable (CACHE_ENABLED=false) and perform a rolling restart of the AuthService. Warning: This will significantly increase database load. Monitor the database closely.
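  • Command sketches: Minimal sketches for the steps above, assuming redis-cli is available in the application image and that REDIS_HOST is the environment variable holding the cache address (adjust names to the actual deployment):
    # Diagnosis: ping Redis from an application pod (a healthy cache answers "PONG")
    AUTH_POD=$(kubectl get pods -l app=auth-service -n production -o jsonpath='{.items[0].metadata.name}')
    kubectl exec $AUTH_POD -n production -- sh -c 'redis-cli -h "$REDIS_HOST" ping'

    # Last resort: disable caching; changing the pod template triggers the rolling restart
    kubectl set env deployment/auth-service -n production CACHE_ENABLED=false
    kubectl rollout status deployment/auth-service -n production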

5. Standard Operating Procedures (SOPs)

  • How to Roll Back a Deployment:
    # See the history of deployments
    kubectl rollout history deployment/auth-service -n production
    
    # Roll back to the previous version
    kubectl rollout undo deployment/auth-service -n production
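
    # Confirm the rollback completed and the pods are healthy
    kubectl rollout status deployment/auth-service -n production

    # To roll back to a specific revision from the history listing (replace <revision>)
    kubectl rollout undo deployment/auth-service -n production --to-revision=<revision>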
    

6. How to Run Unit and Integration Tests

To run the unit and integration tests for the Auth Service, navigate to the src/services/auth-ts directory and execute the following commands:

cd src/services/auth-ts
npm install
npm test

This will execute all tests defined in the project and report the results.

7. How to Build and Deploy

Building the Auth Service

To build the Auth Service for production, navigate to the src/services/auth-ts directory and run:

cd src/services/auth-ts
npm install
npm run build

This will compile the TypeScript code into JavaScript and place the output in the dist directory.

Deploying the Auth Service

The Auth Service is containerized and deployed to Kubernetes.

Manual Deployment (for emergencies or specific environments):

  1. Ensure the service is built (see "Building the Auth Service" above).
  2. Build the Docker image:
    docker build -t your-registry/auth-service:latest .
    docker push your-registry/auth-service:latest
    
  3. Apply the Kubernetes manifests:
    kubectl apply -f k8s/auth-service-deployment.yaml
    kubectl apply -f k8s/auth-service-service.yaml
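
  Note that re-applying unchanged manifests pinned to the latest tag will not restart the pods, so a newly pushed image may not be picked up. A minimal sketch of a manual deploy using a unique tag instead, assuming the same registry placeholder and that the container inside the Deployment is also named auth-service:
    # Tag the image with the current git commit (hypothetical tag scheme)
    TAG=$(git rev-parse --short HEAD)
    docker build -t your-registry/auth-service:$TAG .
    docker push your-registry/auth-service:$TAG

    # Point the Deployment at the new tag and watch the rollout
    kubectl set image deployment/auth-service auth-service=your-registry/auth-service:$TAG -n production
    kubectl rollout status deployment/auth-service -n production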
    

For automated deployments, refer to the CI/CD pipelines (e.g., services-ci.yml in .github/workflows).