JustAppSec

Incident Response for Teams That Ship Daily

Playbooks, communication, and containment for fast-moving engineering teams.

Traditional incident response plans assume a change-averse enterprise with a security operations centre. Modern engineering teams ship multiple times a day, run their own infrastructure, and may not have a dedicated security team. This lesson covers incident response adapted for teams that move fast.

Before the incident

Define severity levels

Agree on these before an incident happens, not during one:

| Severity | Definition | Response | Example |
|----------|------------|----------|---------|
| SEV-1 | Active breach, data exfiltration, service fully compromised | All hands, immediate | Attacker has shell access to production |
| SEV-2 | Confirmed exploit, limited impact, or high-risk vulnerability actively targeted | Dedicated responders, within 1h | SQL injection exploited but WAF is partially blocking |
| SEV-3 | Suspicious activity, potential compromise, unconfirmed | Investigate during business hours | Unusual outbound traffic from one pod |
| SEV-4 | Vulnerability disclosed, no evidence of exploitation | Triage at next standup | New CVE in a dependency you use |
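
The severity matrix above can be encoded as a simple triage helper, so alerting tooling assigns a consistent level. This is a hypothetical sketch; the condition names are illustrative, not from any real tool:

```python
def assign_severity(active_breach: bool,
                    confirmed_exploit: bool,
                    suspicious_activity: bool,
                    disclosed_vuln: bool) -> str:
    """Return the highest applicable severity for the observed signals."""
    if active_breach:
        return "SEV-1"
    if confirmed_exploit:
        return "SEV-2"
    if suspicious_activity:
        return "SEV-3"
    if disclosed_vuln:
        return "SEV-4"
    return "NO-INCIDENT"
```

Checking conditions from most to least severe means an incident with multiple signals always gets the highest applicable level.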

Build a lightweight playbook

You do not need a 50-page document. You need answers to these questions, written down and accessible:

  1. Who is on call? Rotation schedule with contact details.
  2. Where do we communicate? Dedicated Slack channel (#incident-response), bridge call link.
  3. Who can make containment decisions? (e.g., "Any on-call engineer can block an IP or disable an account without approval.")
  4. Where are the runbooks? Link to your wiki/repo.
  5. Who do we notify externally? Legal, customers, regulators — and at what severity.
  6. Where do credentials live? Break-glass accounts for emergency access.
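
The answers to questions like these can live in a small machine-readable structure, so a bot or script can answer "who do we notify at this severity?" during an incident. A minimal sketch; the channel name, parties, and thresholds are placeholders for your own:

```python
# Hypothetical machine-readable playbook fragment.
PLAYBOOK = {
    "comms_channel": "#incident-response",
    "containment_authority": "any on-call engineer",
    # Minimum severity that triggers each external notification
    "external_notify": {
        "legal": "SEV-2",
        "customers": "SEV-1",
        "regulators": "SEV-1",
    },
}

# Least to most severe, so index() gives a comparable rank
SEV_ORDER = ["SEV-4", "SEV-3", "SEV-2", "SEV-1"]

def who_to_notify(severity: str) -> list:
    """Return external parties whose notification threshold is met."""
    rank = SEV_ORDER.index(severity)
    return sorted(
        party for party, threshold in PLAYBOOK["external_notify"].items()
        if rank >= SEV_ORDER.index(threshold)
    )
```

For example, a SEV-2 would (under these placeholder thresholds) notify legal only, while a SEV-1 notifies all three parties.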

Break-glass access

Production access should be limited day-to-day, but you need a way to get elevated access during an incident:

  • Break-glass accounts stored in a secrets manager, with access logged and audited
  • Just-in-time access via tools like AWS SSM Session Manager, Teleport, or StrongDM
  • Pre-approved emergency roles that can be assumed with MFA + approval from a second person

Test break-glass access quarterly. If the first time you try it is during an incident, it will not work.

During the incident

The first 15 minutes

When an alert fires or someone reports a potential incident:

1. Acknowledge and assess (2 minutes)

- Acknowledge the alert in PagerDuty / your alerting tool
- Open the incident channel (#incident-2025-03-15-auth-breach)
- Post initial context: "Auth service showing 500 failed logins/min from multiple IPs targeting admin accounts"
- Assign severity: SEV-2

2. Assemble the team (3 minutes)

Page the minimum people needed:

  • On-call engineer for the affected service
  • Security on-call (if you have one)
  • Incident commander (can be the same person initially)

3. Contain (10 minutes)

The priority is stopping the bleeding, not understanding the root cause:

| Scenario | Containment action |
|----------|--------------------|
| Credential stuffing | Block source IPs at WAF, enable CAPTCHA, force MFA |
| Compromised service account | Revoke the credential, redeploy with a new one |
| Data exfiltration in progress | Isolate the affected service (network policy, kill pod) |
| Malicious dependency | Roll back to the last known-good deployment |
| Compromised admin account | Disable the account, revoke all sessions |

Containment is not fixing. It is stopping further damage. You can investigate properly once the active threat is neutralised.
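
Under pressure, nobody should be inventing containment steps from scratch. One option is to expose the scenario-to-action mapping above through on-call tooling (a chatbot command, a runbook index). A sketch, with entries taken from the table and a fallback for anything unlisted:

```python
# Scenario keys and actions mirror the containment table; extend with
# your own scenarios as you encounter them.
CONTAINMENT = {
    "credential_stuffing": "Block source IPs at WAF, enable CAPTCHA, force MFA",
    "compromised_service_account": "Revoke the credential, redeploy with a new one",
    "data_exfiltration": "Isolate the affected service (network policy, kill pod)",
    "malicious_dependency": "Roll back to the last known-good deployment",
    "compromised_admin_account": "Disable the account, revoke all sessions",
}

def first_action(scenario: str) -> str:
    """Return the first containment action for a known scenario."""
    return CONTAINMENT.get(
        scenario, "Unknown scenario: escalate to the incident commander"
    )
```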

Roles during an incident

Keep it simple. Three roles are enough for most teams:

Incident Commander (IC):

  • Owns the incident timeline and communication
  • Makes decisions when the team cannot agree
  • Does NOT do the technical investigation

Technical Lead:

  • Investigates root cause and impact
  • Executes containment actions
  • Documents technical findings

Communications Lead:

  • Updates stakeholders (management, customers if needed)
  • Tracks action items
  • Manages the incident channel (keeps it focused)

For SEV-3 and SEV-4, one person can fill all three roles.

Communication during an incident

In the incident channel:

14:23 [IC] Incident opened. Auth service brute force in progress. SEV-2.
14:25 [Tech] Source: 47 distinct IPs, all from AS12345. Targeting admin@, finance@, ceo@.
14:28 [Tech] Blocked AS12345 at WAF. Attack volume dropping.
14:30 [IC] No evidence of successful compromise. Checking auth logs for successful logins from those IPs.
14:35 [Tech] Confirmed: zero successful logins from the blocked range. User accounts unaffected.
14:40 [IC] Downgrading to SEV-3. Monitoring. Will file post-incident review.

To stakeholders (SEV-1 and SEV-2):

Update every 30 minutes, even if there is nothing new. The template:

Subject: [SEV-2] Auth service brute force — Update #2

Status: Contained
Impact: No customer data compromised. No successful unauthorised logins.
Actions taken: Blocked attack source at WAF. Monitoring for resumption.
Next update: 15:30 UTC or sooner if status changes.
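
Filling in this template by hand is error-prone at 2 a.m.; a small renderer keeps the format consistent. A hypothetical helper, assuming the fields shown in the template above:

```python
from datetime import datetime, timedelta, timezone

def render_update(sev: str, title: str, n: int, status: str,
                  impact: str, actions: str, interval_min: int = 30) -> str:
    """Render a stakeholder update in the template's format, with the
    next-update time computed from the cadence (default 30 minutes)."""
    next_update = (datetime.now(timezone.utc)
                   + timedelta(minutes=interval_min)).strftime("%H:%M UTC")
    return (
        f"Subject: [{sev}] {title} — Update #{n}\n\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"Actions taken: {actions}\n"
        f"Next update: {next_update} or sooner if status changes."
    )
```

Computing the next-update time automatically also enforces the cadence: the promise in the footer is always 30 minutes out, not whatever the sender guessed.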

Evidence preservation

During containment, preserve evidence before you clean up:

  • Snapshot affected instances/volumes before terminating them
  • Export relevant logs to a separate, immutable location
  • Save network captures if available
  • Screenshot dashboards showing the attack timeline
  • Preserve the CI/CD audit trail if the build pipeline was compromised

Do not destroy evidence in the rush to restore service.

After the incident

Blameless post-incident review

Run a post-incident review (PIR) within 48 hours while memories are fresh. The format:

## Post-Incident Review: Auth Service Brute Force
**Date:** 2025-03-15
**Severity:** SEV-2
**Duration:** 17 minutes (14:23 – 14:40 UTC)
**IC:** Alice

### Timeline
- 14:23 — Alert fired: >20 failed logins/min to admin accounts
- 14:25 — On-call engineer acknowledged, opened incident channel
- 14:28 — Source IPs identified and blocked at WAF
- 14:35 — Confirmed no successful compromise
- 14:40 — Incident downgraded to SEV-3, monitoring continued

### What went well
- Alert fired within 2 minutes of attack start
- WAF block was applied in <5 minutes
- Clear communication in incident channel

### What could be improved
- No automated rate limiting on login endpoint — relied on manual WAF rule
- Break-glass access to WAF took 3 minutes to obtain (MFA token issue)

### Action items
| Action | Owner | Due |
|--------|-------|-----|
| Add rate limiting to /api/auth/login (10 req/min per IP) | Bob | 2025-03-22 |
| Test break-glass access monthly instead of quarterly | Alice | 2025-04-01 |
| Add geographic anomaly detection to login alerts | Carol | 2025-03-29 |

The goal is learning, not blame. Focus on system improvements, not individual failures.
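
The first action item in the example (per-IP rate limiting on the login endpoint) can be sketched as a sliding-window counter. The limit and window below are illustrative, not a recommendation, and this is a sketch rather than production middleware:

```python
import time
from collections import defaultdict
from typing import Optional

class LoginRateLimiter:
    """Allow at most `limit` requests per `window_s` seconds per client IP."""

    def __init__(self, limit: int = 10, window_s: int = 60):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(list)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True if this request is within the limit, else False."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the window
        hits = [t for t in self._hits[ip] if now - t < self.window_s]
        self._hits[ip] = hits
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Passing `now` explicitly makes the limiter testable with a fake clock, which is exactly the kind of thing a PIR action item should require before the ticket is closed.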

Action item tracking

PIR action items are useless if they go into a document nobody reads. Track them where you track work:

  • Create Jira/Linear tickets from each action item
  • Tag them with incident-followup
  • Review completion in the next sprint planning
  • Track the overall closure rate of PIR action items (target: >90% closed within SLA)
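
That closure-rate metric is straightforward to compute from a ticket export. A sketch, assuming hypothetical field names (map them to whatever your Jira/Linear export actually produces):

```python
def closure_rate(tickets: list) -> float:
    """Fraction of incident-followup tickets closed within their SLA."""
    followups = [t for t in tickets if "incident-followup" in t["tags"]]
    if not followups:
        return 1.0  # vacuously on target: nothing outstanding
    closed_in_sla = [t for t in followups
                     if t["status"] == "closed" and t["closed_on_time"]]
    return len(closed_in_sla) / len(followups)
```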

Tabletop exercises

Practice incident response before you need it. A tabletop exercise is a walkthrough of a hypothetical incident scenario:

Format:

  1. Facilitator presents a scenario ("Attacker has compromised a developer's laptop with access to the production Kubernetes cluster")
  2. Team walks through their response step by step
  3. Facilitator introduces complications ("The developer's SSH key was also used to push code 2 hours ago")
  4. Discussion on gaps, missing runbooks, unclear escalation paths

Run quarterly. Keep it to 45 minutes. Rotate scenarios between application-layer attacks, infrastructure compromise, supply chain attacks, and insider threats.

Incident response for common scenarios

Compromised dependency

  1. Identify which services use the dependency and which version
  2. Check if the compromised code path is reachable in your application
  3. Roll back to a known-good version or patch
  4. Check build logs — was the compromised version used in any recent deployment?
  5. Scan running containers for the compromised package

Leaked credential

  1. Revoke the credential immediately
  2. Rotate to a new credential
  3. Audit usage logs for the compromised credential — was it used by anyone other than CI/CD?
  4. Check for lateral movement — what could the credential access?
  5. Determine how it leaked (commit history, CI logs, error messages)
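
Step 5 usually means scanning commit history and CI logs for credential-shaped strings. A sketch using two patterns: the well-known AWS access key ID prefix, plus a deliberately loose generic pattern that is illustrative and will produce false positives:

```python
import re

PATTERNS = {
    # AWS access key IDs start with AKIA (long-term) or ASIA (temporary)
    "aws_access_key_id": re.compile(r"\b(AKIA|ASIA)[0-9A-Z]{16}\b"),
    # Loose heuristic: "api_key"/"secret" assigned a longish value
    "generic_api_key": re.compile(r"(?i)\b(api[_-]?key|secret)\s*[:=]\s*\S{8,}"),
}

def scan_for_credentials(text: str) -> list:
    """Return the names of credential patterns found in `text`."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(text))
```

Dedicated tools (gitleaks, trufflehog, and similar) ship far more patterns and entropy checks; the point here is only that "how did it leak?" is an automatable search, not a manual read-through.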

Ransomware or data destruction

  1. Isolate affected systems from the network
  2. Do NOT pay the ransom
  3. Restore from backups (verify backups were not compromised)
  4. Preserve forensic evidence
  5. Engage legal and law enforcement if customer data was affected

Summary

Prepare before incidents happen: define severity levels, build lightweight playbooks, test break-glass access. During incidents, contain first, investigate second, and communicate continuously. After incidents, run blameless PIRs within 48 hours and track action items to completion. Practice with tabletop exercises quarterly. The difference between a 15-minute incident and a 15-hour one is almost always preparation.


This training content is AI-assisted and reviewed by our team, but issues may be missed and best practices evolve rapidly. Send corrections to [email protected].