
Incident Response Runbook - Production Services (Python)

Purpose
This runbook provides step-by-step instructions for responding to production incidents. Use this guide when you receive:

  • Automated alerts from Sentry or monitoring dashboards.
  • Manual reports from team members or internal communications.

Target audience: On-call engineers and team members responding to live production incidents.

Services covered by this guide: Python backend services running on Google Cloud Run.

  • HF-API
  • Webhooks-API
  • Forestfleet-API

🚨 When an Incident Alert Fires

First, stay calm. Production incidents are stressful, but following this systematic approach will help you resolve issues efficiently.

What constitutes an incident?

  • Service is completely down or significantly degraded
  • Error rates above normal thresholds (typically >10% 5xx errors)
  • Response times significantly slower than baseline (currently ~200ms)
  • Data loss or corruption

Step 1: Immediate Response (0-5 minutes)

Acknowledge and Take Ownership

When you receive an alert:

  1. Acknowledge the alert immediately: take ownership of the Sentry alert, or respond to the internal communication in the case of a manual report.
  2. Take Linear ticket ownership: automated Sentry alerts create a Linear ticket. Once the alert is raised, take ownership of the ticket.
  3. Open the outages Slack channel: send a first message to #outage-emergency-channel indicating that the alert has been acknowledged and is being investigated.
  4. Declare the incident: after first sight of the alert, declare the incident by posting:
    @here 🚨 INCIDENT DECLARED
    Service: [service-name]
    Issue: [brief description or symptoms]
    Responder: @your-name
    

Quick Smoke Test

Verify the issue is real before diving deep:

  • Visit the Cloud Run services status and check if the services are operational and serving traffic.
  • Look at the services dashboards to confirm the metrics show problems:
    - HF-API - Grafana
    - Forestfleet-API - Grafana
    - Webhooks-API - Grafana
  • Check if you can reproduce the issue, calling from local or directly in production (a scripted smoke test is sketched below).
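
If you want to script the reproduction check, here is a minimal smoke-test sketch in Python. The base URLs and /health paths are placeholders, not the real service endpoints - substitute the actual Cloud Run service URLs and health routes.

# Minimal smoke-test sketch. URLs and /health paths are placeholders --
# replace them with the real Cloud Run service URLs and health endpoints.
import requests

SERVICES = {
    "HF-API": "https://hf-api.example.run.app/health",
    "Webhooks-API": "https://webhooks-api.example.run.app/health",
    "Forestfleet-API": "https://forestfleet-api.example.run.app/health",
}

for name, url in SERVICES.items():
    try:
        resp = requests.get(url, timeout=5)
        latency_ms = resp.elapsed.total_seconds() * 1000
        print(f"{name}: HTTP {resp.status_code} in {latency_ms:.0f} ms")
    except requests.RequestException as exc:
        print(f"{name}: request failed ({exc})")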

If everything looks normal: The alert might be a false positive, but don't dismiss it yet. Continue investigating to be sure.


Step 2: Assess Impact & Scope (5-15 minutes)

Understand What's Broken

Check user impact:

  • Is this affecting all users or a subset?
  • Which features/endpoints are impacted?

Investigation path – HF-API:
  1. Check Grafana dashboard for request volume drops, increases in 5xx error codes, or outbound request failures to upstream services (Wunder).
  2. Review error rates by endpoint in API Gateway Endpoints.
  3. Check Cloud Run service health and last release.

Investigation path – Webhooks-API:
  1. Check Grafana dashboard for request volume drops, increases in 5xx error codes, or outbound request failures to upstream services (Wunder, Supabase, Hookdeck).
  2. Check Cloud Run service health and last release.
  3. Review Hookdeck webhook issues for any related incidents.

Determine Service Scope

Service boundaries:

  • Is this isolated to one service, or does it affect more than one?
  • Check dependencies [Wunder|Hookdeck|Supabase|Directus]
  • Look for upstream/downstream service impacts

The overall idea here is to verify whether the issue is caused by and scoped to our domain, or belongs to outside services.


Step 3: Classify the Incident

Assign Severity Level

TBD levels and classifications

Use this classification to determine urgency and communication needs:

P0 - Critical (All hands on deck)

P1 - High (Urgent response needed)

P2 - Medium (Important but not critical)

P3 - Low (Monitor and fix when possible)


Step 4: Initial Investigation

Start with the Obvious

Check recent changes first - most incidents are caused by recent deployments or configuration changes:
  1. Review deployments in the last 24 hours (see the sketch below for listing recent Cloud Run revisions).
  2. Check if any configuration was modified. Cloud Run logs are very helpful for this.
  3. Look for Terraform infrastructure changes.
  4. Review any database migrations or schema changes.
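
A quick way to list recent revisions is the gcloud CLI; the sketch below simply wraps it in Python. The service name "hf-api" and region "europe-west1" are placeholder assumptions - use the real service and region.

# Sketch: list recent Cloud Run revisions for a service.
# "hf-api" and "europe-west1" are placeholders -- use the real service name and region.
import subprocess

subprocess.run(
    [
        "gcloud", "run", "revisions", "list",
        "--service", "hf-api",
        "--region", "europe-west1",
        "--limit", "10",
    ],
    check=True,
)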

Examine Logs and Metrics

Log Analysis Strategy:

# Start with error-level logs from the affected timeframe
# Look for patterns like:
- Connection refused/timeout errors
- Database query failures  
- External API failures
- Memory/resource exhaustion
- Authentication/authorization failures
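
To pull error-level logs for the affected timeframe programmatically, here is a sketch using the google-cloud-logging client. The service name, one-hour lookback window, and result limit are assumptions - adjust the filter to the affected service and incident timeframe.

# Sketch: fetch recent ERROR-level Cloud Run logs for one service.
# The service name and lookback window are assumptions -- adjust to the incident.
from datetime import datetime, timedelta, timezone

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()  # uses Application Default Credentials
since = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
log_filter = (
    'resource.type="cloud_run_revision" '
    'AND resource.labels.service_name="hf-api" '
    f'AND severity>=ERROR AND timestamp>="{since}"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=50
):
    print(entry.timestamp, entry.severity, entry.payload)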

Key Metrics to Check:

  • Error rate: what percentage of requests are failing?
  • Latency: are successful requests slower than normal?
  • Throughput: is request volume normal or dropping?
  • Resource usage: CPU, memory, database connections, number of instances running.

Use Error Tracking System

In Sentry [HF-API | Webhooks-API | Forestfleet-API (https://humanforest-dev.sentry.io/issues/?project=4504859846115328&statsPeriod=7d)]:

  1. Look for error spikes that correlate with the incident timeline.
  2. Check if there are new error types introduced recently.
  3. Review the error frequency and affected user count.
  4. Look at the error stack traces for clues about the root cause.


Step 5: Communication During Investigation

Keep Stakeholders Informed

Internal Communication:

  • Post updates in #outage-emergency-channel, as replies to the initial thread, every 10-15 minutes for P0/P1 incidents.
  • Proposed format:

πŸ” UPDATE
Status: Still investigating
Findings: [what you've discovered]
Next steps: [what you're trying next]
ETA for next update: [15 minutes]
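
If you prefer to script these updates, here is a sketch that posts the message above to Slack via an incoming webhook. The webhook URL is a placeholder assumption - use the webhook configured for #outage-emergency-channel.

# Sketch: post a status update via a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder -- use the webhook configured for the channel.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

update = (
    "🔍 UPDATE\n"
    "Status: Still investigating\n"
    "Findings: [what you've discovered]\n"
    "Next steps: [what you're trying next]\n"
    "ETA for next update: [15 minutes]"
)

resp = requests.post(SLACK_WEBHOOK_URL, json={"text": update}, timeout=5)
resp.raise_for_status()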

External Communication (if needed):

  • If the issue is related to Wunder, escalate directly via email to emergency.support@wundermobility.com and via #ext-wunder-support. Always tag your direct manager.

When to Escalate

Escalate immediately if:

  • You cannot identify the root cause within 15 minutes
  • The issue is beyond your area of expertise
  • You need additional resources or permissions
  • The incident is escalating in severity


Step 6: Common Investigation Paths

Error Rate Spikes (5xx Errors)

Common patterns:

  • Application crashes: check for out-of-memory errors, unhandled exceptions
  • Configuration errors: recent config changes causing startup failures
  • Upstream service failures: external APIs returning errors (check Wunder, Directus, Supabase)

Investigation approach:
  1. Sample recent error logs to identify error types.
  2. Check if errors correlate with specific endpoints (see the sketch below).
  3. Review recent deployments for introduced bugs.
  4. Review the last Cloud Build execution to check whether any internal dependency was updated and has conflicts.
  5. Verify all dependencies are healthy.
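
For step 2 of the list above, a small sketch that tallies sampled structured log lines by request path follows. The log format (JSON lines with an "httpRequest.requestUrl" field, as in Cloud Run request logs) is an assumption - adjust the field names to the actual log format.

# Sketch: tally sampled error logs by request path to spot affected endpoints.
# Assumes JSON log lines with an "httpRequest" block -- adjust field names as needed.
import json
from collections import Counter
from urllib.parse import urlparse

def errors_by_endpoint(log_lines):
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        url = entry.get("httpRequest", {}).get("requestUrl", "")
        if url:
            counts[urlparse(url).path] += 1
    return counts

# Example with two sampled lines (hypothetical data):
sample = [
    '{"httpRequest": {"requestUrl": "https://hf-api.example.com/v1/reservations", "status": 500}}',
    '{"httpRequest": {"requestUrl": "https://hf-api.example.com/v1/reservations", "status": 503}}',
]
print(errors_by_endpoint(sample).most_common(5))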

Complete Service Outage

Immediate actions:
  1. Check if the API Gateway is healthy.
  2. Verify DNS resolution is working: Cloudflare (see the sketch below).
  3. Check if your service instances are running and healthy.
  4. Look for infrastructure-level issues (network, compute resources).
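
Step 2 of the list above can be checked quickly from any machine. The hostname below is a placeholder for the real public domain managed in Cloudflare.

# Sketch: confirm DNS resolution for the public API hostname.
# "api.example.com" is a placeholder -- use the real domain managed in Cloudflare.
import socket

hostname = "api.example.com"
try:
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(hostname, 443)})
    print(f"{hostname} resolves to: {', '.join(addresses)}")
except socket.gaierror as exc:
    print(f"DNS resolution failed for {hostname}: {exc}")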


Step 7: Mitigation and Recovery

P0 full service degradation

In extreme cases where the service affects the customer experience, it may be necessary to activate the KillSwitch. To do this, please contact your manager.

The kill switch sends a Pusher message to a cached channel (30 min cache) with the reservations service status to the App.

Choose Your Mitigation Strategy

Quick wins (try first):

  • Restart the service if it appears stuck or corrupted --> redeploy.
  • Roll back to the previous release - the last known good version.
  • Scale up resources if you see resource exhaustion.
  • Clear caches if cache corruption is suspected:
    - HF-API cache clear
    - HF-API territories cache clear --> hf_api_refresh_cached_territories
  • Enable maintenance mode to reduce load while fixing.

More involved actions:

  • Disable problematic features using feature flags, if they exist.
  • Implement temporary workarounds in the load balancer or in code to add a hotfix.

Rollback Strategy

When to rollback:

  • Issue started shortly after a deployment
  • You've identified the problematic change
  • No quick fix is available

How to rollback safely:
  1. Identify the last known good version.
  2. Check if any database migrations need to be reverted.
  3. Reroute traffic to the last healthy revision (see the sketch below).
  4. Monitor metrics during rollback to confirm resolution.
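
For step 3 of the list above, traffic can be pinned to a known-good Cloud Run revision with gcloud; the sketch below wraps the command in Python. The service, revision, and region names are placeholder assumptions - confirm the real revision name first (for example with gcloud run revisions list).

# Sketch: pin 100% of traffic to a known-good Cloud Run revision.
# Service, revision, and region are placeholders -- confirm the revision name first.
import subprocess

subprocess.run(
    [
        "gcloud", "run", "services", "update-traffic", "hf-api",
        "--to-revisions", "hf-api-00042-abc=100",
        "--region", "europe-west1",
    ],
    check=True,
)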


Step 8: Verify Resolution

Confirm the Fix

Don't just trust metrics - validate manually:
  1. Test the previously failing functionality yourself.
  2. Check that error rates have returned to baseline.
  3. Verify response times are within the normal range.
  4. Monitor for at least 10-15 minutes to ensure stability (a polling sketch is shown below).
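
The 10-15 minute stability check in step 4 can be scripted as a simple polling loop. The health URL, 30-second interval, and 15-minute duration below are assumptions - tune them to the service.

# Sketch: poll the service for ~15 minutes and report any failed checks.
# The health URL, interval, and duration are assumptions.
import time

import requests

HEALTH_URL = "https://hf-api.example.run.app/health"  # placeholder
checks = 30  # 30 checks x 30 s = 15 minutes
failures = 0

for _ in range(checks):
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
        if resp.status_code >= 500:
            failures += 1
    except requests.RequestException:
        failures += 1
    time.sleep(30)

print(f"{failures} failed checks out of {checks}")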

Watch for secondary effects:

  • Check dependent services haven't been impacted
  • Monitor for any queue backlogs that need processing
  • Verify background jobs are running normally


Step 9: Incident Closure

Communication

Announce resolution: send a final message to the previously started thread (and also to #outage-emergency-channel):

✅ INCIDENT RESOLVED
Duration: [X hours/minutes]
Impact: [brief summary]
Root cause: [one-line explanation]
Resolution: [what fixed it]
Next steps: [follow-up actions if any]

Documentation

Capture while it's fresh:

Create a new Postmortem report following this template.

Some actions that need to be taken:

  • Timeline of events and actions taken
  • Root cause analysis (even if preliminary)
  • What worked and what didn't during resolution
  • Follow-up tasks to prevent recurrence

Step 10: Post-Incident Actions

Immediate Follow-up

For P0/P1 incidents:

  • Schedule a post-mortem meeting within 48 hours
  • Create tickets for any identified preventive measures
  • Update monitoring/alerting based on learnings

For all incidents:

  • Update this runbook if you found gaps or improvements
  • Share learnings with the team
  • Review if the incident classification was accurate


Emergency Contacts & Resources

Who to Contact

Situation                | Contact                                    | Method
Need help investigating  | @tech-ops-team or @tech-rev-team in Slack  | Immediate
P0/P1 escalation needed  | Engineering Manager                        | Phone + Slack
Customers                | CS Team                                    | #outage-emergency-channel / @Sam Roberts / @Gareth

Monitoring & Dashboards:


Remember: The goal is to restore service as quickly as possible while gathering enough information to prevent future occurrences. When in doubt, ask for help - incidents are team efforts, not individual challenges.