
Incident Response Runbook - Production Services (Python)

Purpose
This runbook provides step-by-step instructions for responding to production incidents. Use this guide when you receive:

  • Automated alerts from Sentry or monitoring dashboards.
  • Manual reports from team members or internal communications.

Target audience: On-call engineers and team members responding to live production incidents.

Services covered by this guide: Python backend services running on Google Cloud Run.

  • HF-API
  • Webhooks-API
  • Forestfleet-API

🚨 When an Incident Alert Fires

First, stay calm. Production incidents are stressful, but following this systematic approach will help you resolve issues efficiently.

What constitutes an incident?

  • Service is completely down or significantly degraded
  • Error rates above normal thresholds (typically >10% 5xx errors)
  • Response times significantly slower than baseline (currently ~200ms)
  • Data loss or corruption

Step 1: Immediate Response (0-5 minutes)

Acknowledge and Take Ownership

When you receive an alert:

  1. Acknowledge the alert immediately: take ownership of the Sentry alert, or respond to the internal communication in the case of a manual report.
  2. Take Linear ticket ownership: automated Sentry alerts create a Linear ticket. Once the alert is raised, take ownership of the ticket.
  3. Open the outages Slack channel: send a first message to #outage-emergency-channel indicating that the alert has been acknowledged and is being investigated.
  4. Declare the incident: after first sight of the alert, declare the incident by posting:
    @here 🚨 INCIDENT DECLARED
    Service: [service-name]
    Issue: [brief description or symptoms]
    Responder: @your-name
    

Quick Smoke Test

Verify the issue is real before diving deep:

  • Visit the Cloud Run services status and check if the services are operational and serving traffic.
  • Look at the services dashboards to confirm the metrics show problems:
    - HF-API - Grafana
    - Forestfleet-API - Grafana
    - Webhooks-API - Grafana
  • Check if you can reproduce the issue, calling from local or directly in production (a scripted smoke test is sketched below).
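
If you want to script the reproduction check, here is a minimal smoke-test sketch in Python. The base URLs and /health paths are placeholders, not the real service endpoints - substitute the actual Cloud Run service URLs and health routes.

# Minimal smoke-test sketch. URLs and /health paths are placeholders --
# replace them with the real Cloud Run service URLs and health endpoints.
import requests

SERVICES = {
    "HF-API": "https://hf-api.example.run.app/health",
    "Webhooks-API": "https://webhooks-api.example.run.app/health",
    "Forestfleet-API": "https://forestfleet-api.example.run.app/health",
}

for name, url in SERVICES.items():
    try:
        resp = requests.get(url, timeout=5)
        latency_ms = resp.elapsed.total_seconds() * 1000
        print(f"{name}: HTTP {resp.status_code} in {latency_ms:.0f} ms")
    except requests.RequestException as exc:
        print(f"{name}: request failed ({exc})")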

If everything looks normal: The alert might be a false positive, but don't dismiss it yet. Continue investigating to be sure.


Step 2: Assess Impact & Scope (5-15 minutes)

Understand What's Broken

Check user impact:

  • Is this affecting all users or a subset?
  • Which features/endpoints are impacted?

Investigation path – HF-API:
  1. Check Grafana dashboard for request volume drops, increases in 5xx error codes, or outbound request failures to upstream services (Wunder).
  2. Review error rates by endpoint in API Gateway Endpoints.
  3. Check Cloud Run service health and last release.

Investigation path – Webhooks-API:
  1. Check Grafana dashboard for request volume drops, increases in 5xx error codes, or outbound request failures to upstream services (Wunder, Supabase, Hookdeck).
  2. Check Cloud Run service health and last release.
  3. Review Hookdeck webhook issues for any related incidents.

Determine Service Scope

Service boundaries:

  • Is this isolated to one service, or does it affect more than one?
  • Check dependencies [Wunder|Hookdeck|Supabase|Directus]
  • Look for upstream/downstream service impacts

The overall idea here is to verify whether the issue is caused by and scoped to our domain, or belongs to outside services.


Step 3: Classify the Incident

Assign Severity Level

TBD levels and classifications

Use this classification to determine urgency and communication needs:

P0 - Critical (All hands on deck)

P1 - High (Urgent response needed)

P2 - Medium (Important but not critical)

P3 - Low (Monitor and fix when possible)


Step 4: Initial Investigation

Start with the Obvious

Check recent changes first - most incidents are caused by recent deployments or configuration changes:
  1. Review deployments in the last 24 hours (see the sketch below for listing recent Cloud Run revisions).
  2. Check if any configuration was modified. Cloud Run logs are very helpful for this.
  3. Look for Terraform infrastructure changes.
  4. Review any database migrations or schema changes.
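
A quick way to list recent revisions is the gcloud CLI; the sketch below simply wraps it in Python. The service name "hf-api" and region "europe-west1" are placeholder assumptions - use the real service and region.

# Sketch: list recent Cloud Run revisions for a service.
# "hf-api" and "europe-west1" are placeholders -- use the real service name and region.
import subprocess

subprocess.run(
    [
        "gcloud", "run", "revisions", "list",
        "--service", "hf-api",
        "--region", "europe-west1",
        "--limit", "10",
    ],
    check=True,
)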

Examine Logs and Metrics

Log Analysis Strategy:

# Start with error-level logs from the affected timeframe
# Look for patterns like:
- Connection refused/timeout errors
- Database query failures  
- External API failures
- Memory/resource exhaustion
- Authentication/authorization failures
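
To pull error-level logs for the affected timeframe programmatically, here is a sketch using the google-cloud-logging client. The service name, one-hour lookback window, and result limit are assumptions - adjust the filter to the affected service and incident timeframe.

# Sketch: fetch recent ERROR-level Cloud Run logs for one service.
# The service name and lookback window are assumptions -- adjust to the incident.
from datetime import datetime, timedelta, timezone

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()  # uses Application Default Credentials
since = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
log_filter = (
    'resource.type="cloud_run_revision" '
    'AND resource.labels.service_name="hf-api" '
    f'AND severity>=ERROR AND timestamp>="{since}"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=50
):
    print(entry.timestamp, entry.severity, entry.payload)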

Key Metrics to Check:

  • Error rate: what percentage of requests are failing?
  • Latency: are successful requests slower than normal?
  • Throughput: is request volume normal or dropping?
  • Resource usage: CPU, memory, database connections, number of instances running.

Use Error Tracking System

In Sentry [HF-API | Webhooks-API | Forestfleet-API (https://humanforest-dev.sentry.io/issues/?project=4504859846115328&statsPeriod=7d)]:

  1. Look for error spikes that correlate with the incident timeline.
  2. Check if there are new error types introduced recently.
  3. Review the error frequency and affected user count.
  4. Look at the error stack traces for clues about the root cause.


Step 5: Communication During Investigation

Keep Stakeholders Informed

Internal Communication:

  • Post updates in #outage-emergency-channel, as replies to the initial thread, every 10-15 minutes for P0/P1 incidents.
  • Proposed format:

πŸ” UPDATE
Status: Still investigating
Findings: [what you've discovered]
Next steps: [what you're trying next]
ETA for next update: [15 minutes]
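
If you prefer to script these updates, here is a sketch that posts the message above to Slack via an incoming webhook. The webhook URL is a placeholder assumption - use the webhook configured for #outage-emergency-channel.

# Sketch: post a status update via a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder -- use the webhook configured for the channel.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

update = (
    "🔍 UPDATE\n"
    "Status: Still investigating\n"
    "Findings: [what you've discovered]\n"
    "Next steps: [what you're trying next]\n"
    "ETA for next update: [15 minutes]"
)

resp = requests.post(SLACK_WEBHOOK_URL, json={"text": update}, timeout=5)
resp.raise_for_status()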

External Communication (if needed):

  • If the issue is related to Wunder, escalate directly via email to emergency.support@wundermobility.com and via #ext-wunder-support. Always tag your direct manager.

When to Escalate

Escalate immediately if:

  • You cannot identify the root cause within 15 minutes
  • The issue is beyond your area of expertise
  • You need additional resources or permissions
  • The incident is escalating in severity


Step 6: Common Investigation Paths

Error Rate Spikes (5xx Errors)

Common patterns:

  • Application crashes: check for out-of-memory errors, unhandled exceptions
  • Configuration errors: recent config changes causing startup failures
  • Upstream service failures: external APIs returning errors (check Wunder, Directus, Supabase)

Investigation approach:
  1. Sample recent error logs to identify error types.
  2. Check if errors correlate with specific endpoints (see the sketch below).
  3. Review recent deployments for introduced bugs.
  4. Review the last Cloud Build execution to check whether any internal dependency was updated and has conflicts.
  5. Verify all dependencies are healthy.
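
For step 2 of the list above, a small sketch that tallies sampled structured log lines by request path follows. The log format (JSON lines with an "httpRequest.requestUrl" field, as in Cloud Run request logs) is an assumption - adjust the field names to the actual log format.

# Sketch: tally sampled error logs by request path to spot affected endpoints.
# Assumes JSON log lines with an "httpRequest" block -- adjust field names as needed.
import json
from collections import Counter
from urllib.parse import urlparse

def errors_by_endpoint(log_lines):
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        url = entry.get("httpRequest", {}).get("requestUrl", "")
        if url:
            counts[urlparse(url).path] += 1
    return counts

# Example with two sampled lines (hypothetical data):
sample = [
    '{"httpRequest": {"requestUrl": "https://hf-api.example.com/v1/reservations", "status": 500}}',
    '{"httpRequest": {"requestUrl": "https://hf-api.example.com/v1/reservations", "status": 503}}',
]
print(errors_by_endpoint(sample).most_common(5))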

Complete Service Outage

Immediate actions:
  1. Check if the API Gateway is healthy.
  2. Verify DNS resolution is working: Cloudflare (see the sketch below).
  3. Check if your service instances are running and healthy.
  4. Look for infrastructure-level issues (network, compute resources).
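
Step 2 of the list above can be checked quickly from any machine. The hostname below is a placeholder for the real public domain managed in Cloudflare.

# Sketch: confirm DNS resolution for the public API hostname.
# "api.example.com" is a placeholder -- use the real domain managed in Cloudflare.
import socket

hostname = "api.example.com"
try:
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(hostname, 443)})
    print(f"{hostname} resolves to: {', '.join(addresses)}")
except socket.gaierror as exc:
    print(f"DNS resolution failed for {hostname}: {exc}")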


Step 7: Mitigation and Recovery

P0 full service degradation

In extreme cases where the service affects the customer experience, it may be necessary to activate the KillSwitch. To do this, please contact your manager.

The kill switch sends a Pusher message to a cached channel (30 min cache) with the reservations service status to the App.

Choose Your Mitigation Strategy

Quick wins (try first):

  • Restart the service if it appears stuck or corrupted --> redeploy.
  • Roll back to the previous release - the last known good version.
  • Scale up resources if you see resource exhaustion.
  • Clear caches if cache corruption is suspected:
    - HF-API cache clear
    - HF-API territories cache clear --> hf_api_refresh_cached_territories
  • Enable maintenance mode to reduce load while fixing.

More involved actions:

  • Disable problematic features using feature flags, if they exist.
  • Implement temporary workarounds in the load balancer or in code to add a hotfix.

Rollback Strategy

When to rollback:

  • Issue started shortly after a deployment
  • You've identified the problematic change
  • No quick fix is available

How to rollback safely:
  1. Identify the last known good version.
  2. Check if any database migrations need to be reverted.
  3. Reroute traffic to the last healthy revision (see the sketch below).
  4. Monitor metrics during rollback to confirm resolution.
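
For step 3 of the list above, traffic can be pinned to a known-good Cloud Run revision with gcloud; the sketch below wraps the command in Python. The service, revision, and region names are placeholder assumptions - confirm the real revision name first (for example with gcloud run revisions list).

# Sketch: pin 100% of traffic to a known-good Cloud Run revision.
# Service, revision, and region are placeholders -- confirm the revision name first.
import subprocess

subprocess.run(
    [
        "gcloud", "run", "services", "update-traffic", "hf-api",
        "--to-revisions", "hf-api-00042-abc=100",
        "--region", "europe-west1",
    ],
    check=True,
)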


Step 8: Verify Resolution

Confirm the Fix

Don't just trust metrics - validate manually:
  1. Test the previously failing functionality yourself.
  2. Check that error rates have returned to baseline.
  3. Verify response times are within the normal range.
  4. Monitor for at least 10-15 minutes to ensure stability (a polling sketch is shown below).
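
The 10-15 minute stability check in step 4 can be scripted as a simple polling loop. The health URL, 30-second interval, and 15-minute duration below are assumptions - tune them to the service.

# Sketch: poll the service for ~15 minutes and report any failed checks.
# The health URL, interval, and duration are assumptions.
import time

import requests

HEALTH_URL = "https://hf-api.example.run.app/health"  # placeholder
checks = 30  # 30 checks x 30 s = 15 minutes
failures = 0

for _ in range(checks):
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
        if resp.status_code >= 500:
            failures += 1
    except requests.RequestException:
        failures += 1
    time.sleep(30)

print(f"{failures} failed checks out of {checks}")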

Watch for secondary effects:

  • Check dependent services haven't been impacted
  • Monitor for any queue backlogs that need processing
  • Verify background jobs are running normally


Step 9: Incident Closure

Communication

Announce resolution: send a final message to the previously started thread (and also to #outage-emergency-channel):

✅ INCIDENT RESOLVED
Duration: [X hours/minutes]
Impact: [brief summary]
Root cause: [one-line explanation]
Resolution: [what fixed it]
Next steps: [follow-up actions if any]

Documentation

Capture while it's fresh:

Create a new Postmortem report following this template.

Some actions that need to be taken:

  • Timeline of events and actions taken
  • Root cause analysis (even if preliminary)
  • What worked and what didn't during resolution
  • Follow-up tasks to prevent recurrence

Step 10: Post-Incident Actions

Immediate Follow-up

For P0/P1 incidents:

  • Schedule a post-mortem meeting within 48 hours
  • Create tickets for any identified preventive measures
  • Update monitoring/alerting based on learnings

For all incidents:

  • Update this runbook if you found gaps or improvements
  • Share learnings with the team
  • Review if the incident classification was accurate


Emergency Contacts & Resources

Who to Contact

Situation                | Contact                                    | Method
Need help investigating  | @tech-ops-team or @tech-rev-team in Slack  | Immediate
P0/P1 escalation needed  | Engineering Manager                        | Phone + Slack
Customers                | CS Team                                    | #outage-emergency-channel / @Sam Roberts / @Gareth

Monitoring & Dashboards:


Remember: The goal is to restore service as quickly as possible while gathering enough information to prevent future occurrences. When in doubt, ask for help - incidents are team efforts, not individual challenges.