Every alert that pages someone should require action. Once false positives exceed 20%, the team starts ignoring everything, including the real alerts.
What to monitor
Application: 401/403 spikes (credential stuffing), 500 spikes (exploitation), unusual response volume (exfiltration), logins from a new geography.
Infrastructure: Outbound connections to unusual IPs, unexpected processes, IAM role assumption from an unexpected source.
Business logic: Gift card redemption spike, mass account creation, password reset spike.
Severity levels
| Sev | Response | Example |
|---|---|---|
| P1 | Immediate | Active breach, RCE |
| P2 | 1 hour | Credential stuffing |
| P3 | 1 day | Unusual login |
| P4 | Next triage | Single failed login |
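As a minimal sketch, the table above could be encoded so that routing code and humans read the same response targets. The names `Severity`, `RESPONSE_TARGET`, and `should_page` are hypothetical, and two encodings are assumptions: "Immediate" is modeled as a zero-length window and "Next triage" as `None`. The rule that only P1 and P2 page a human is also an assumption, not something the table states.

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"
    P4 = "P4"


# Response targets from the severity table. timedelta(0) stands in for
# "Immediate" and None for "Next triage" (assumed encodings).
RESPONSE_TARGET = {
    Severity.P1: timedelta(0),
    Severity.P2: timedelta(hours=1),
    Severity.P3: timedelta(days=1),
    Severity.P4: None,
}


def should_page(severity: Severity) -> bool:
    """Page a human only when the response target is one hour or less (assumed policy)."""
    target = RESPONSE_TARGET[severity]
    return target is not None and target <= timedelta(hours=1)
```

Under this policy P3 and P4 go to a queue instead of a pager, which keeps pages limited to alerts that need action now.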
Alert structure
Every alert: enough context to start investigating without first querying logs (who, what, where, how many). Links to the runbook and to the relevant logs for going deeper.
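One way to enforce this is to give alerts a fixed shape and reject any that arrive without context or links. The sketch below is illustrative only: the `Alert` class, its field names, and the `missing_fields` helper are hypothetical, and the URLs in the usage example are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    title: str
    severity: str                                   # "P1" .. "P4"
    summary: str                                    # what happened, in one sentence
    context: dict = field(default_factory=dict)     # e.g. source IP, user, counts
    runbook_url: str = ""
    log_urls: list = field(default_factory=list)

    def missing_fields(self) -> list:
        """Return the names of empty fields a responder would need."""
        missing = []
        if not self.context:
            missing.append("context")
        if not self.runbook_url:
            missing.append("runbook_url")
        if not self.log_urls:
            missing.append("log_urls")
        return missing


# Usage: a complete alert reports nothing missing.
alert = Alert(
    title="401 spike on /login",
    severity="P2",
    summary="Failed logins from one IP are far above baseline.",
    context={"source_ip": "203.0.113.7", "failed_logins": 4200},
    runbook_url="https://example.com/runbooks/credential-stuffing",
    log_urls=["https://example.com/logs?ip=203.0.113.7"],
)
```

A check like `alert.missing_fields()` could run in review or CI for alert definitions, so incomplete alerts are caught before they page anyone.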
Reducing noise
Deduplication: 50 identical alerts = 1 alert with count.
Correlation: A single 404 is normal. 404s on 500 distinct /admin/* paths from the same IP = attack.
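Baseline comparison: Alert on percentage change against a baseline, not raw counts. A raw threshold tuned for quiet hours fires constantly at peak traffic.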
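The three techniques above can each be sketched in a few lines. This is a simplified illustration, not a production pipeline: the function names and the `(ip, path, status)` record shape are assumptions, alerts are assumed to be hashable values such as strings or tuples, and no time window is modeled (a real system would evaluate these over a sliding window).

```python
from collections import Counter, defaultdict


def deduplicate(alerts):
    """Collapse identical alerts into (alert, count) pairs in first-seen order."""
    return list(Counter(alerts).items())


def scanning_ips(requests, prefix="/admin/", threshold=500):
    """Return IPs that got 404s on at least `threshold` distinct paths under `prefix`.

    `requests` is an iterable of (ip, path, status) tuples.
    """
    paths_by_ip = defaultdict(set)
    for ip, path, status in requests:
        if status == 404 and path.startswith(prefix):
            paths_by_ip[ip].add(path)
    return {ip for ip, paths in paths_by_ip.items() if len(paths) >= threshold}


def percent_change(current, baseline):
    """Relative change of `current` against `baseline`, in percent."""
    if baseline == 0:
        # No baseline traffic: any activity counts as an unbounded increase.
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline * 100.0
```

Counting distinct paths rather than total requests is what separates a scanner from a client retrying one broken link.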
Runbooks
Every P1/P2 links to a runbook: What does this mean? How to investigate? How to contain? Who to contact?
Tuning
Track every alert outcome. Review weekly. Tune thresholds. Retire stale alerts. After incidents: "What alert would have caught this?"
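Tracking outcomes makes the weekly review mechanical. The sketch below assumes each fired alert is recorded as a hypothetical `(alert_name, was_actionable)` pair, and it applies the 20% false-positive ceiling per alert. Applying that figure per alert rather than across all alerts is an assumption of this example.

```python
from collections import defaultdict


def false_positive_rates(outcomes):
    """Per-alert false-positive rate from (alert_name, was_actionable) records."""
    totals = defaultdict(int)
    false_positives = defaultdict(int)
    for name, was_actionable in outcomes:
        totals[name] += 1
        if not was_actionable:
            false_positives[name] += 1
    return {name: false_positives[name] / totals[name] for name in totals}


def needs_tuning(outcomes, max_rate=0.20):
    """Sorted names of alerts whose false-positive rate exceeds `max_rate`."""
    rates = false_positive_rates(outcomes)
    return sorted(name for name, rate in rates.items() if rate > max_rate)
```

Alerts returned by `needs_tuning` are candidates for a threshold change or retirement at the next review.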
The takeaway
Monitor app, infra, business logic. Clear severity with response expectations. Enough context to investigate. Link to runbooks. Deduplicate, correlate, tune continuously.
