Traditional incident response plans assume a change-averse enterprise with a security operations centre. Modern engineering teams ship multiple times a day, run their own infrastructure, and may not have a dedicated security team. This lesson covers incident response adapted for teams that move fast.
## Before the incident
### Define severity levels
Agree on these before an incident happens, not during one:
| Severity | Definition | Response | Example |
|---|---|---|---|
| SEV-1 | Active breach, data exfiltration, service fully compromised | All hands, immediate | Attacker has shell access to production |
| SEV-2 | Confirmed exploit, limited impact, or high-risk vulnerability actively targeted | Dedicated responders, within 1h | SQL injection exploited but WAF is partially blocking |
| SEV-3 | Suspicious activity, potential compromise, unconfirmed | Investigate during business hours | Unusual outbound traffic from one pod |
| SEV-4 | Vulnerability disclosed, no evidence of exploitation | Triage at next standup | New CVE in a dependency you use |
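Encoding the severity matrix in code keeps definitions consistent across alerting tools and paging rules. A minimal sketch (the `RESPONSE_SLA_MINUTES` mapping and class are illustrative, not from any particular tool; timings mirror the table above):

```python
from dataclasses import dataclass

# Response-time targets per severity, mirroring the table above.
# The numbers are illustrative; align them with your on-call agreements.
RESPONSE_SLA_MINUTES = {
    "SEV-1": 0,      # all hands, immediate
    "SEV-2": 60,     # dedicated responders within 1h
    "SEV-3": 480,    # investigate during business hours
    "SEV-4": 1440,   # triage at next standup
}

@dataclass
class Incident:
    title: str
    severity: str

    def response_deadline_minutes(self) -> int:
        """How quickly someone must be actively responding."""
        return RESPONSE_SLA_MINUTES[self.severity]

incident = Incident("Auth brute force", "SEV-2")
print(incident.response_deadline_minutes())  # 60
```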
### Build a lightweight playbook
You do not need a 50-page document. You need answers to these questions, written down and accessible:
- Who is on call? Rotation schedule with contact details.
- Where do we communicate? Dedicated Slack channel (#incident-response), bridge call link.
- Who can make containment decisions? (e.g., "Any on-call engineer can block an IP or disable an account without approval.")
- Where are the runbooks? Link to your wiki/repo.
- Who do we notify externally? Legal, customers, regulators — and at what severity.
- Where do credentials live? Break-glass accounts for emergency access.
### Break-glass access
Production access should be limited day-to-day, but you need a way to get elevated access during an incident:
- Break-glass accounts stored in a secrets manager, with access logged and audited
- Just-in-time access via tools like AWS SSM Session Manager, Teleport, or StrongDM
- Pre-approved emergency roles that can be assumed with MFA + approval from a second person
Test break-glass access quarterly. If the first time you try it is during an incident, it will not work.
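The quarterly test is easy to automate as a drill script: retrieve the break-glass credential, time it, and write an audit entry. A sketch with a stubbed secrets-manager call (`fetch_break_glass_secret` is a hypothetical stand-in; a real version would call AWS Secrets Manager, Vault, or similar):

```python
import json
import time

def fetch_break_glass_secret(name: str) -> str:
    # Hypothetical stand-in for your secrets manager call.
    return "s3cr3t-" + name

def break_glass_drill(secret_name: str, operator: str) -> dict:
    """Retrieve the break-glass credential and record an audit entry."""
    start = time.monotonic()
    secret = fetch_break_glass_secret(secret_name)
    elapsed = time.monotonic() - start
    audit = {
        "event": "break_glass_drill",
        "secret_name": secret_name,  # never log the secret itself
        "operator": operator,
        "retrieved": bool(secret),
        "elapsed_seconds": round(elapsed, 2),
    }
    print(json.dumps(audit))
    return audit

result = break_glass_drill("prod-emergency-admin", "alice")
assert result["retrieved"], "break-glass retrieval failed: fix before a real incident"
```

Run it from the same scheduler as your other recurring jobs and alert if the drill fails or takes longer than your target.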
## During the incident
### The first 15 minutes
When an alert fires or someone reports a potential incident:
1. Acknowledge and assess (2 minutes)
- Acknowledge the alert in PagerDuty / your alerting tool
- Open the incident channel (#incident-2025-03-15-auth-breach)
- Post initial context: "Auth service showing 500 failed logins/min from multiple IPs targeting admin accounts"
- Assign severity: SEV-2
2. Assemble the team (3 minutes)
Page the minimum people needed:
- On-call engineer for the affected service
- Security on-call (if you have one)
- Incident commander (can be the same person initially)
3. Contain (10 minutes)
The priority is to stop the bleeding, not to understand the root cause:
| Scenario | Containment action |
|---|---|
| Credential stuffing | Block source IPs at WAF, enable CAPTCHA, force MFA |
| Compromised service account | Revoke the credential, redeploy with a new one |
| Data exfiltration in progress | Isolate the affected service (network policy, kill pod) |
| Malicious dependency | Roll back to the last known-good deployment |
| Compromised admin account | Disable the account, revoke all sessions |
Containment is not fixing. It is stopping further damage. You can investigate properly once the active threat is neutralised.
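For the credential-stuffing row above, the containment action is an IP block. Validating the CIDR before you push it to the WAF avoids a fat-fingered rule during the rush; a minimal in-memory sketch using the stdlib `ipaddress` module (the `Denylist` class is illustrative; a real version would call your WAF's API in `block`):

```python
import ipaddress

class Denylist:
    """In-memory denylist sketch; a real WAF or firewall API call
    would replace the set operations in block()."""
    def __init__(self):
        self.blocked = set()

    def block(self, cidr: str) -> None:
        # Validate before applying: a bad rule during an incident is costly.
        net = ipaddress.ip_network(cidr, strict=False)
        self.blocked.add(net)

    def is_blocked(self, ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.blocked)

dl = Denylist()
dl.block("203.0.113.0/24")  # documentation range, stands in for attacker IPs
print(dl.is_blocked("203.0.113.42"))  # True
print(dl.is_blocked("198.51.100.7"))  # False
```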
### Roles during an incident
Keep it simple. Three roles are enough for most teams:
**Incident Commander (IC):**
- Owns the incident timeline and communication
- Makes decisions when the team cannot agree
- Does NOT do the technical investigation
**Technical Lead:**
- Investigates root cause and impact
- Executes containment actions
- Documents technical findings
**Communications Lead:**
- Updates stakeholders (management, customers if needed)
- Tracks action items
- Manages the incident channel (keeps it focused)
For SEV-3 and SEV-4, one person can fill all three roles.
### Communication during an incident
**In the incident channel:**
- 14:23 [IC] Incident opened. Auth service brute force in progress. SEV-2.
- 14:25 [Tech] Source: 47 distinct IPs, all from AS12345. Targeting admin@, finance@, ceo@.
- 14:28 [Tech] Blocked AS12345 at WAF. Attack volume dropping.
- 14:30 [IC] No evidence of successful compromise. Checking auth logs for successful logins from those IPs.
- 14:35 [Tech] Confirmed: zero successful logins from the blocked range. User accounts unaffected.
- 14:40 [IC] Downgrading to SEV-3. Monitoring. Will file post-incident review.
**To stakeholders (SEV-1 and SEV-2):**
Update every 30 minutes, even if there is nothing new. The template:
- Subject: [SEV-2] Auth service brute force — Update #2
- Status: Contained
- Impact: No customer data compromised. No successful unauthorised logins.
- Actions taken: Blocked attack source at WAF. Monitoring for resumption.
- Next update: 15:30 UTC or sooner if status changes.
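Templating the update removes one decision from a stressful moment. A small sketch that renders the fields above as plain text (the function name and signature are illustrative):

```python
def format_update(severity: str, title: str, n: int, status: str,
                  impact: str, actions: str, next_update: str) -> str:
    """Render the stakeholder update template as plain text."""
    return (
        f"Subject: [{severity}] {title} — Update #{n}\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"Actions taken: {actions}\n"
        f"Next update: {next_update}"
    )

msg = format_update(
    "SEV-2", "Auth service brute force", 2,
    status="Contained",
    impact="No customer data compromised. No successful unauthorised logins.",
    actions="Blocked attack source at WAF. Monitoring for resumption.",
    next_update="15:30 UTC or sooner if status changes.",
)
print(msg)
```

Wire this into a `/incident-update` chat command or a cron reminder so the 30-minute cadence does not depend on anyone remembering it.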
### Evidence preservation
During containment, preserve evidence before you clean up:
- Snapshot affected instances/volumes before terminating them
- Export relevant logs to a separate, immutable location
- Save network captures if available
- Screenshot dashboards showing the attack timeline
- Preserve the CI/CD audit trail if the build pipeline was compromised
Do not destroy evidence in the rush to restore service.
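When exporting logs, record a cryptographic digest at collection time so any later tampering (or accidental modification) is detectable. A minimal sketch using stdlib `hashlib` (file names and the manifest format are illustrative):

```python
import hashlib
import json
from pathlib import Path

def preserve(path: Path, manifest: Path) -> str:
    """Record a SHA-256 digest of an evidence file in an append-only
    manifest, so later tampering is detectable."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = {"file": str(path), "sha256": digest}
    with manifest.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return digest

# Example: hash an exported log before moving it to immutable storage.
log = Path("auth-incident.log")
log.write_text("14:23 alert fired\n")
print(preserve(log, Path("evidence-manifest.jsonl")))
```

Store the manifest itself somewhere responders cannot modify (e.g. an object-lock bucket), otherwise the digests prove nothing.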
## After the incident
### Blameless post-incident review
Run a post-incident review (PIR) within 48 hours while memories are fresh. The format:
## Post-Incident Review: Auth Service Brute Force
**Date:** 2025-03-15
**Severity:** SEV-2
**Duration:** 17 minutes (14:23 – 14:40 UTC)
**IC:** Alice
### Timeline
- 14:23 — Alert fired: >20 failed logins/min to admin accounts
- 14:25 — On-call engineer acknowledged, opened incident channel
- 14:28 — Source IPs identified and blocked at WAF
- 14:35 — Confirmed no successful compromise
- 14:40 — Incident downgraded to SEV-3, monitoring continued
### What went well
- Alert fired within 2 minutes of attack start
- WAF block was applied in <5 minutes
- Clear communication in incident channel
### What could be improved
- No automated rate limiting on login endpoint — relied on manual WAF rule
- Break-glass access to WAF took 3 minutes to obtain (MFA token issue)
### Action items
| Action | Owner | Due |
|--------|-------|-----|
| Add rate limiting to /api/auth/login (10 req/min per IP) | Bob | 2025-03-22 |
| Test break-glass access monthly instead of quarterly | Alice | 2025-04-01 |
| Add geographic anomaly detection to login alerts | Carol | 2025-03-29 |
The goal is learning, not blame. Focus on system improvements, not individual failures.
### Action item tracking
PIR action items are useless if they go into a document nobody reads. Track them where you track work:
- Create Jira/Linear tickets from each action item
- Tag them with `incident-followup`
- Review completion in the next sprint planning
- Track the overall closure rate of PIR action items (target: >90% closed within SLA)
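The closure-rate metric is a small calculation worth pinning down: an open item only counts against you once it is past due. A sketch (the item shape and the counting rule are illustrative; adapt to your tracker's export format):

```python
from datetime import date

def closure_rate(items, today):
    """Fraction of PIR action items closed on time.
    Open items already past their due date count as misses."""
    met = total = 0
    for item in items:
        if item["closed_on"] is not None:
            total += 1
            met += item["closed_on"] <= item["due"]
        elif today > item["due"]:
            total += 1  # open and overdue: counts against the rate
    return met / total if total else 1.0

items = [
    {"due": date(2025, 3, 22), "closed_on": date(2025, 3, 20)},
    {"due": date(2025, 4, 1),  "closed_on": date(2025, 3, 30)},
    {"due": date(2025, 3, 29), "closed_on": None},  # still open, overdue
]
rate = closure_rate(items, today=date(2025, 4, 5))
print(f"{rate:.0%}")  # 67%, below the 90% target
```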
### Tabletop exercises
Practice incident response before you need it. A tabletop exercise is a walkthrough of a hypothetical incident scenario:
**Format:**
- Facilitator presents a scenario ("Attacker has compromised a developer's laptop with access to the production Kubernetes cluster")
- Team walks through their response step by step
- Facilitator introduces complications ("The developer's SSH key was also used to push code 2 hours ago")
- Discussion on gaps, missing runbooks, unclear escalation paths
Run quarterly. Keep it to 45 minutes. Rotate scenarios between application-layer attacks, infrastructure compromise, supply chain attacks, and insider threats.
## Incident response for common scenarios
### Compromised dependency
- Identify which services use the dependency and which version
- Check if the compromised code path is reachable in your application
- Roll back to a known-good version or patch
- Check build logs — was the compromised version used in any recent deployment?
- Scan running containers for the compromised package
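The "which services use the compromised version" check can be scripted against each environment. A sketch for a Python service using stdlib `importlib.metadata` (the package name and version set are illustrative placeholder data, not a real advisory):

```python
from importlib.metadata import PackageNotFoundError, version

# Versions known to be compromised; illustrative data, not a real advisory.
COMPROMISED = {"examplelib": {"1.4.2", "1.4.3"}}

def affected(package: str) -> bool:
    """True if the installed version of `package` is in the compromised set."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False  # not installed, not affected
    return installed in COMPROMISED.get(package, set())

print(affected("examplelib"))
```

Run the same check inside running containers, not just against lockfiles: the deployed image is what matters.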
### Leaked credential
- Revoke the credential immediately
- Rotate to a new credential
- Audit usage logs for the compromised credential — was it used by anyone other than CI/CD?
- Check for lateral movement — what could the credential access?
- Determine how it leaked (commit history, CI logs, error messages)
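Determining how a credential leaked usually starts with pattern-matching over commit history, CI logs, and error output. A minimal regex sketch (the pattern set is a small illustrative sample; dedicated scanners ship hundreds of rules):

```python
import re

# A few well-known credential shapes; extend with your own token formats.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def find_secrets(text: str) -> list:
    """Return the names of credential patterns found in `text`."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

leaked = 'AWS_KEY = "AKIAIOSFODNN7EXAMPLE"'  # AWS's documented example key
print(find_secrets(leaked))  # ['aws_access_key']
```

For real scanning, prefer an established tool (gitleaks, trufflehog, or your platform's secret scanning); this sketch only shows the idea.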
### Ransomware or data destruction
- Isolate affected systems from the network
- Do NOT pay the ransom
- Restore from backups (verify backups were not compromised)
- Preserve forensic evidence
- Engage legal and law enforcement if customer data was affected
## Summary
Prepare before incidents happen: define severity levels, build lightweight playbooks, test break-glass access. During incidents, contain first, investigate second, and communicate continuously. After incidents, run blameless PIRs within 48 hours and track action items to completion. Practice with tabletop exercises quarterly. The difference between a 15-minute incident and a 15-hour one is almost always preparation.
