Good logs mean nothing if no one looks at them. This lesson covers how to build a monitoring and alerting pipeline that surfaces real threats without burying your team in false positives.
## The alert quality problem
Most security alerting fails for one of two reasons:
- Too many alerts — the team ignores them all (alert fatigue)
- Too few alerts — real attacks go unnoticed
The goal is not "alert on everything." The goal is: every alert that pages someone should require action.
### Measuring alert quality
| Metric | Target | How to measure |
|---|---|---|
| True positive rate | > 80% | Alerts that led to actual investigation / total alerts |
| Mean time to detect (MTTD) | < 15 minutes for critical | Time from attack to first alert |
| Mean time to acknowledge (MTTA) | < 5 minutes for P1 | Time from alert to human acknowledgement |
| Alert-to-incident ratio | Track trend | Alerts that became incidents / total alerts |
| Noise ratio | < 20% | Alerts dismissed without action / total alerts |
If more than 20% of your alerts are false positives, your team will start ignoring all of them.
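These ratios are easy to compute from a log of alert outcomes. A minimal sketch, assuming each closed alert is labeled with one of three illustrative outcomes (the label names are not from any particular tool):

```python
from collections import Counter

def alert_quality(outcomes):
    """Compute quality ratios from a list of alert outcomes.

    Each outcome is one of: 'true_positive' (led to real action),
    'false_positive' (rule misfired), 'dismissed' (closed without action).
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    if total == 0:
        return {"true_positive_rate": 0.0, "noise_ratio": 0.0}
    return {
        "true_positive_rate": counts["true_positive"] / total,
        # Noise = everything that did not require action.
        "noise_ratio": (counts["false_positive"] + counts["dismissed"]) / total,
    }
```

Feeding this a week of triage labels tells you immediately whether you are above the 80% true-positive target or drifting past the 20% noise ceiling.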
## What to monitor

### Application-layer signals
| Signal | What it indicates |
|---|---|
| Spike in 401/403 responses | Credential stuffing, broken auth, enumeration |
| Spike in 400/422 responses | Input fuzzing, API abuse |
| Spike in 500 errors | Exploitation attempt triggering unhandled exceptions |
| Unusual data volume in responses | Data exfiltration |
| New API endpoints receiving traffic | Shadow endpoints, forgotten debug routes |
| Login from new geolocation | Account takeover |
| Session created without login | Session fixation or token theft |
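Most of the "spike" signals above reduce to the same mechanism: count matching events in a sliding window and compare against a threshold. A minimal sketch of that pattern (window size and threshold are illustrative and would be tuned per signal):

```python
from collections import deque

class SpikeDetector:
    """Flags when the number of events seen in the last
    `window_seconds` exceeds `threshold`.

    Feed it event timestamps in ascending order, e.g. one call
    per 401 response observed.
    """
    def __init__(self, window_seconds=60, threshold=100):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()

    def observe(self, ts):
        self.events.append(ts)
        # Evict events that have fallen out of the window.
        while self.events and ts - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.threshold
```

The same class, instantiated with different thresholds, covers 401 spikes, 5xx spikes, and password-reset surges; only the event stream changes.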
### Infrastructure signals
| Signal | What it indicates |
|---|---|
| Outbound connections to unusual IPs | Compromised service phoning home |
| DNS queries to uncommon domains | DNS exfiltration, C2 communication |
| CPU/memory spike on a specific pod | Cryptomining, DoS, runaway exploit |
| Unexpected process execution | Container escape, RCE |
| IAM role assumption from unexpected source | Lateral movement in cloud |
| S3/blob storage access from new IP | Data theft |
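For the outbound-connection and storage-access signals, a common approach is to alert when a service contacts a destination it has never contacted before. A hedged sketch (in practice the baseline set would be learned over a quiet training period, not hard-coded):

```python
def new_destinations(connections, baseline):
    """Return outbound connections whose destination IP is not in
    the service's learned baseline set.

    connections: iterable of dicts with a 'dst_ip' key (illustrative
    field name); baseline: set of previously seen destination IPs.
    """
    return [c for c in connections if c["dst_ip"] not in baseline]
```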
### Business logic signals
| Signal | What it indicates |
|---|---|
| Gift card / coupon redemption spike | Fraud, brute-forced codes |
| Unusually large order or transfer | Account takeover, financial fraud |
| Mass account creation | Spam, bot registration |
| Password reset volume spike | Credential stuffing precursor |
## Building effective alerts

### Severity levels
Define clear severity levels and response expectations:
| Severity | Response time | Example | Notification |
|---|---|---|---|
| P1 — Critical | Immediate (page) | Active data breach, RCE detected | PagerDuty, phone call |
| P2 — High | Within 1 hour | Credential stuffing in progress, privilege escalation | Slack + PagerDuty |
| P3 — Medium | Within 1 business day | Unusual login pattern, failed MFA spike | Slack channel |
| P4 — Low | Next triage meeting | Single failed login, minor config drift | Dashboard only |
### Alert structure
Every alert should contain enough context to begin investigation without querying logs:
```json
{
  "alert": "Brute force login detected",
  "severity": "P2",
  "timestamp": "2025-03-15T14:23:01Z",
  "source": "auth-api",
  "details": {
    "targetUser": "[email protected]",
    "sourceIp": "203.0.113.45",
    "failedAttempts": 47,
    "timeWindow": "5 minutes",
    "geoLocation": "Unknown VPN provider"
  },
  "links": {
    "runbook": "https://wiki.internal/runbooks/brute-force",
    "logs": "https://grafana.internal/explore?query=...",
    "dashboard": "https://grafana.internal/d/auth-security"
  }
}
```
### Alert routing
Route alerts to the right people at the right time:
```yaml
# Example Alertmanager routing
route:
  receiver: default-slack
  routes:
    - match:
        severity: P1
      receiver: pagerduty-oncall
      repeat_interval: 5m
    - match:
        severity: P2
      receiver: security-slack-urgent
      repeat_interval: 15m
    - match:
        severity: P3
      receiver: security-slack
      repeat_interval: 4h
      group_wait: 10m
      group_by: [alertname, service]
```
## Reducing noise
Deduplication: Group repeated alerts within a time window. 50 identical alerts in 5 minutes should produce 1 alert with a count.
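A minimal sketch of window-based deduplication (the key format is illustrative): the first alert for a key is emitted, and repeats inside the window only increment a counter.

```python
class Deduplicator:
    """Emit an alert the first time a key is seen within a window;
    suppress repeats and keep a count for the eventual summary."""
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # key -> (first_timestamp, count)

    def observe(self, key, ts):
        first_ts, count = self.seen.get(key, (None, 0))
        if first_ts is None or ts - first_ts > self.window:
            self.seen[key] = (ts, 1)
            return True   # fresh alert: emit it
        self.seen[key] = (first_ts, count + 1)
        return False      # duplicate: suppressed, count updated

    def count(self, key):
        return self.seen[key][1]
```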
Correlation: Alert on patterns, not individual events. A single 404 is normal. 500 distinct 404s to /admin/* from the same IP in 2 minutes is an attack.
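The 404 example can be expressed as a correlation rule: track distinct paths per source IP within a window and alert past a threshold. A simplified sketch using fixed time buckets rather than a true sliding window (thresholds illustrative):

```python
from collections import defaultdict

def scan_suspects(events, window=120, threshold=500):
    """Identify scanning behaviour from 404 events.

    events: iterable of (timestamp, source_ip, path) tuples for 404
    responses. Returns the set of IPs that requested more than
    `threshold` distinct paths within one `window`-second bucket.
    """
    buckets = defaultdict(set)  # (ip, bucket index) -> distinct paths
    for ts, ip, path in events:
        buckets[(ip, ts // window)].add(path)
    return {ip for (ip, _), paths in buckets.items() if len(paths) > threshold}
```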
Baseline comparison: "10 failed logins" means different things for a service with 100 users versus one with 10 million. Use percentage-based thresholds or statistical baselines.
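One common statistical baseline is a z-score against recent history: alert only when the current count sits several standard deviations above the mean for that metric. A sketch using only the standard library (the threshold of 3 is a conventional starting point, not a rule):

```python
import statistics

def is_anomalous(history, current, z_threshold=3.0):
    """Return True if `current` is anomalously high versus history.

    history: recent per-interval counts for the same metric
    (e.g. failed logins per 5-minute window over the last day).
    """
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # Flat baseline: any increase at all is notable.
        return current > mean
    return (current - mean) / stdev > z_threshold
```

This is how "10 failed logins" becomes meaningful: the same count is a huge deviation for a 100-user service and invisible noise for a 10-million-user one.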
Suppression during maintenance: Automatically suppress alerts during planned deployments or maintenance windows:
```yaml
inhibit_rules:
  - source_match:
      alertname: DeploymentInProgress
    target_match:
      severity: P3
    equal: [service]
```
## Monitoring stack options
### Open-source stack

```
Prometheus → Alertmanager → PagerDuty / Slack
     ↓
  Grafana (dashboards)

Loki → Grafana alerting (log-based)
```
Prometheus for metrics (request rates, error rates, latency). Loki for log aggregation. Grafana for dashboards and log-based alerting.
### Cloud-native options
| Provider | Metrics | Logs | Alerting |
|---|---|---|---|
| AWS | CloudWatch Metrics | CloudWatch Logs | CloudWatch Alarms, SNS |
| GCP | Cloud Monitoring | Cloud Logging | Alerting Policies |
| Azure | Azure Monitor | Log Analytics | Action Groups |
### SaaS platforms
Datadog, Splunk, Elastic Security, Sumo Logic — these combine metrics, logs, and alerting into a single platform with built-in security detection rules.
## Security dashboards
Build dashboards that give at-a-glance situational awareness:
### Authentication dashboard
- Login success/failure rate over time
- Failed logins by user (top 10)
- Failed logins by source IP (top 10)
- Geographic distribution of logins
- MFA adoption rate
- Active sessions count
### API security dashboard
- Request rate by endpoint
- Error rate by status code (4xx, 5xx)
- Requests by authentication method
- Rate limit triggers
- Top clients by request volume
- Unusual request size distribution
### Deployment and change dashboard
- Recent deployments (service, version, deployer)
- Configuration changes
- Permission changes (IAM, RBAC)
- New infrastructure provisioned
## Runbooks
Every P1 and P2 alert should link to a runbook. A runbook answers:
- What does this alert mean? One-sentence explanation.
- Is this urgent? Criteria for escalation.
- How do I investigate? Specific queries to run, dashboards to check.
- How do I contain it? Immediate actions (block IP, disable account, revoke token).
- Who do I contact? Escalation path.
Example runbook snippet:
```markdown
## Brute Force Login Alert

**What:** More than 20 failed logins to a single account in 5 minutes.

**Investigate:**
1. Check if the target account exists and is active
2. Check the source IP — is it a known VPN/proxy? Multiple accounts targeted?
3. Check if any login succeeded after the failures

**Contain:**
- If from a single IP: block at WAF for 24h
- If the account was compromised: disable the account, revoke sessions, notify the user
- If distributed (many IPs): enable CAPTCHA on login, escalate to P1

**Escalate to:** Security on-call if the account was compromised or the attack is distributed
```
## Tuning over time
Alerting is never "done." Build a feedback loop:
- Track every alert outcome — was it a true positive, false positive, or noise?
- Review weekly — which alerts fired most? Which were all false positives?
- Tune thresholds — adjust based on baseline changes (new feature, traffic growth)
- Retire stale alerts — if an alert has not fired a true positive in 6 months, question whether it is still relevant
- Add new rules — after every incident, ask: "What alert would have caught this?"
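The review step can be driven by per-rule statistics rather than memory. A sketch that flags low-precision rules as candidates for tuning or retirement (field names and the 0.8 threshold are illustrative, chosen to match the true-positive target above):

```python
def review_rules(outcomes, min_precision=0.8):
    """Flag alert rules whose precision has fallen below target.

    outcomes: list of (rule_name, was_true_positive) pairs collected
    during triage. Returns a sorted list of rule names to tune or retire.
    """
    stats = {}  # rule -> (times fired, true positives)
    for rule, tp in outcomes:
        fired, hits = stats.get(rule, (0, 0))
        stats[rule] = (fired + 1, hits + (1 if tp else 0))
    return sorted(rule for rule, (fired, hits) in stats.items()
                  if hits / fired < min_precision)
```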
## Summary
Monitor application, infrastructure, and business logic signals. Define severity levels with clear response expectations and route alerts to the right people. Every alert should contain enough context to begin investigation immediately and link to a runbook. Deduplicate, correlate, and baseline to reduce noise. Track alert quality metrics and tune continuously — an alert that is always ignored is worse than no alert at all.
