JustAppSec

Monitoring and Alerting for Security Events

From noise to signal — building alerts that surface real threats.

Good logs mean nothing if no one looks at them. This lesson covers how to build a monitoring and alerting pipeline that surfaces real threats without burying your team in false positives.

The alert quality problem

Most security alerting fails for one of two reasons:

  1. Too many alerts — the team ignores them all (alert fatigue)
  2. Too few alerts — real attacks go unnoticed

The goal is not "alert on everything." The goal is: every alert that pages someone should require action.

Measuring alert quality

| Metric | Target | How to measure |
| --- | --- | --- |
| True positive rate | > 80% | Alerts that led to actual investigation / total alerts |
| Mean time to detect (MTTD) | < 15 minutes for critical | Time from attack to first alert |
| Mean time to acknowledge (MTTA) | < 5 minutes for P1 | Time from alert to human acknowledgement |
| Alert-to-incident ratio | Track trend | Alerts that became incidents / total alerts |
| Noise ratio | < 20% | Alerts dismissed without action / total alerts |

If more than 20% of your alerts are false positives, your team will start ignoring all of them.
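These ratios are easy to compute once every alert outcome is recorded at triage time. A minimal sketch, assuming each alert is labeled "incident", "investigated", or "dismissed" (the labels are illustrative; use whatever your triage tool records):

```python
from collections import Counter

def alert_quality(outcomes: list[str]) -> dict[str, float]:
    """Compute alert-quality ratios from a list of triage outcomes.

    Each outcome is one of: "incident", "investigated", "dismissed".
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    # Anything that warranted investigation counts as a true positive.
    investigated = counts["investigated"] + counts["incident"]
    return {
        "true_positive_rate": investigated / total,
        "noise_ratio": counts["dismissed"] / total,
        "alert_to_incident": counts["incident"] / total,
    }

q = alert_quality(["incident", "investigated", "dismissed", "dismissed", "investigated"])
# noise_ratio is 0.4 here -- above the 20% target, so these rules need tuning
```

Running this weekly over the triage log gives the trend data the review meeting needs.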

What to monitor

Application-layer signals

| Signal | What it indicates |
| --- | --- |
| Spike in 401/403 responses | Credential stuffing, broken auth, enumeration |
| Spike in 400/422 responses | Input fuzzing, API abuse |
| Spike in 500 errors | Exploitation attempt triggering unhandled exceptions |
| Unusual data volume in responses | Data exfiltration |
| New API endpoints receiving traffic | Shadow endpoints, forgotten debug routes |
| Login from new geolocation | Account takeover |
| Session created without login | Session fixation or token theft |

Infrastructure signals

| Signal | What it indicates |
| --- | --- |
| Outbound connections to unusual IPs | Compromised service phoning home |
| DNS queries to uncommon domains | DNS exfiltration, C2 communication |
| CPU/memory spike on a specific pod | Cryptomining, DoS, runaway exploit |
| Unexpected process execution | Container escape, RCE |
| IAM role assumption from unexpected source | Lateral movement in cloud |
| S3/blob storage access from new IP | Data theft |

Business logic signals

| Signal | What it indicates |
| --- | --- |
| Gift card / coupon redemption spike | Fraud, brute-forced codes |
| Unusually large order or transfer | Account takeover, financial fraud |
| Mass account creation | Spam, bot registration |
| Password reset volume spike | Credential stuffing precursor |

Building effective alerts

Severity levels

Define clear severity levels and response expectations:

| Severity | Response time | Example | Notification |
| --- | --- | --- | --- |
| P1 — Critical | Immediate (page) | Active data breach, RCE detected | PagerDuty, phone call |
| P2 — High | Within 1 hour | Credential stuffing in progress, privilege escalation | Slack + PagerDuty |
| P3 — Medium | Within 1 business day | Unusual login pattern, failed MFA spike | Slack channel |
| P4 — Low | Next triage meeting | Single failed login, minor config drift | Dashboard only |

Alert structure

Every alert should contain enough context to begin investigation without querying logs:

{
  "alert": "Brute force login detected",
  "severity": "P2",
  "timestamp": "2025-03-15T14:23:01Z",
  "source": "auth-api",
  "details": {
    "targetUser": "[email protected]",
    "sourceIp": "203.0.113.45",
    "failedAttempts": 47,
    "timeWindow": "5 minutes",
    "geoLocation": "Unknown VPN provider"
  },
  "links": {
    "runbook": "https://wiki.internal/runbooks/brute-force",
    "logs": "https://grafana.internal/explore?query=...",
    "dashboard": "https://grafana.internal/d/auth-security"
  }
}

Alert routing

Route alerts to the right people at the right time:

# Example AlertManager routing
route:
  receiver: default-slack
  routes:
    - match:
        severity: P1
      receiver: pagerduty-oncall
      repeat_interval: 5m

    - match:
        severity: P2
      receiver: security-slack-urgent
      repeat_interval: 15m

    - match:
        severity: P3
      receiver: security-slack
      repeat_interval: 4h
      group_wait: 10m
      group_by: [alert_name, service]

Reducing noise

Deduplication: Group repeated alerts within a time window. 50 identical alerts in 5 minutes should produce 1 alert with a count.
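A deduplicator can be a small piece of state in the alert pipeline. A sketch, assuming alerts are keyed by name plus a grouping key such as source IP (the names and the 5-minute window are illustrative):

```python
WINDOW = 300  # seconds -- suppress identical alerts within 5 minutes

class Deduplicator:
    """Collapse repeated (alert_name, key) pairs into one alert with a count."""

    def __init__(self):
        self._seen = {}  # (name, key) -> [first_seen_ts, count]

    def ingest(self, name: str, key: str, ts: float):
        """Return an alert dict the first time; None for in-window repeats."""
        k = (name, key)
        entry = self._seen.get(k)
        if entry is not None and ts - entry[0] < WINDOW:
            entry[1] += 1  # repeat: just bump the counter
            return None
        self._seen[k] = [ts, 1]  # new alert or window expired: emit
        return {"alert": name, "key": key}

    def count(self, name: str, key: str) -> int:
        entry = self._seen.get((name, key))
        return entry[1] if entry else 0

d = Deduplicator()
d.ingest("BruteForce", "203.0.113.45", ts=0.0)   # emitted
d.ingest("BruteForce", "203.0.113.45", ts=30.0)  # suppressed, count incremented
```

A real implementation would also expire old keys and attach the final count when the window closes, but the shape is the same.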

Correlation: Alert on patterns, not individual events. A single 404 is normal. 500 distinct 404s to /admin/* from the same IP in 2 minutes is an attack.
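The correlation rule above boils down to counting *distinct* missing paths per source IP, not raw 404s. A sketch, assuming events have already been restricted to the correlation window (tuple shape and threshold are illustrative):

```python
from collections import defaultdict

def scanning_ips(events, distinct_threshold: int = 500) -> list[str]:
    """Flag source IPs probing many distinct nonexistent paths.

    `events` is an iterable of (source_ip, path, status) tuples already
    restricted to the correlation window (e.g. the last 2 minutes).
    """
    missing = defaultdict(set)
    for ip, path, status in events:
        if status == 404:
            missing[ip].add(path)  # distinct paths, so repeats don't inflate
    return sorted(ip for ip, paths in missing.items()
                  if len(paths) >= distinct_threshold)
```

Note that an IP retrying the same missing path 600 times never trips this rule; only breadth of probing does, which is exactly the scanning pattern.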

Baseline comparison: "10 failed logins" means different things for a service with 100 users versus one with 10 million. Use percentage-based thresholds or statistical baselines.
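One way to implement a statistical baseline is a z-score against the same metric's recent history; the 3-standard-deviation threshold below is an illustrative default:

```python
import statistics

def is_anomalous(current: float, history: list[float],
                 z_threshold: float = 3.0) -> bool:
    """Compare the current value against a statistical baseline.

    `history` holds the same metric at comparable times (e.g. the same
    5-minute slot on previous days), so "10 failed logins" only alerts
    if it is far outside this particular service's norm.
    """
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current > mean  # flat baseline: any increase is notable
    return (current - mean) / stdev > z_threshold
```

For strongly seasonal traffic, compare against the same time-of-day and day-of-week rather than a flat rolling window, or the morning peak will alert every day.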

Suppression during maintenance: Automatically suppress alerts during planned deployments or maintenance windows:

inhibit_rules:
  - source_match:
      alertname: DeploymentInProgress
    target_match:
      severity: P3
    equal: [service]

Monitoring stack options

Open-source stack

Prometheus  →  Alertmanager  →  PagerDuty / Slack
     ↓
  Grafana (dashboards)

Loki  →  Grafana alerting (log-based)

Prometheus for metrics (request rates, error rates, latency). Loki for log aggregation. Grafana for dashboards and log-based alerting.
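As a concrete example, a Prometheus alerting rule for the 500-error signal from earlier might look like the following. The metric name `http_requests_total` follows common client-library conventions but is an assumption; match it to your own instrumentation:

```yaml
# Sketch of a Prometheus alerting rule; metric names are illustrative.
# Fires P2 when more than 5% of requests return 5xx for 5 minutes.
groups:
  - name: app-security
    rules:
      - alert: High5xxErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "5xx error rate above 5% -- possible exploitation attempt"
          runbook: "https://wiki.internal/runbooks/error-spike"
```

The `for: 5m` clause is doing noise reduction here: a single bad scrape never pages anyone.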

Cloud-native options

| Provider | Metrics | Logs | Alerting |
| --- | --- | --- | --- |
| AWS | CloudWatch Metrics | CloudWatch Logs | CloudWatch Alarms, SNS |
| GCP | Cloud Monitoring | Cloud Logging | Alerting Policies |
| Azure | Azure Monitor | Log Analytics | Action Groups |

SaaS platforms

Datadog, Splunk, Elastic Security, Sumo Logic — these combine metrics, logs, and alerting into a single platform with built-in security detection rules.

Security dashboards

Build dashboards that give at-a-glance situational awareness:

Authentication dashboard

  • Login success/failure rate over time
  • Failed logins by user (top 10)
  • Failed logins by source IP (top 10)
  • Geographic distribution of logins
  • MFA adoption rate
  • Active sessions count

API security dashboard

  • Request rate by endpoint
  • Error rate by status code (4xx, 5xx)
  • Requests by authentication method
  • Rate limit triggers
  • Top clients by request volume
  • Unusual request size distribution

Deployment and change dashboard

  • Recent deployments (service, version, deployer)
  • Configuration changes
  • Permission changes (IAM, RBAC)
  • New infrastructure provisioned

Runbooks

Every P1 and P2 alert should link to a runbook. A runbook answers:

  1. What does this alert mean? One-sentence explanation.
  2. Is this urgent? Criteria for escalation.
  3. How do I investigate? Specific queries to run, dashboards to check.
  4. How do I contain it? Immediate actions (block IP, disable account, revoke token).
  5. Who do I contact? Escalation path.

Example runbook snippet:

## Brute Force Login Alert

**What:** More than 20 failed logins to a single account in 5 minutes.

**Investigate:**
1. Check if the target account exists and is active
2. Check the source IP — is it a known VPN/proxy? Multiple accounts targeted?
3. Check if any login succeeded after the failures

**Contain:**
- If from a single IP: block at WAF for 24h
- If account was compromised: disable account, revoke sessions, notify user
- If distributed (many IPs): enable CAPTCHA on login, escalate to P1

**Escalate to:** Security on-call if account was compromised or attack is distributed

Tuning over time

Alerting is never "done." Build a feedback loop:

  1. Track every alert outcome — was it a true positive, false positive, or noise?
  2. Review weekly — which alerts fired most? Which were all false positives?
  3. Tune thresholds — adjust based on baseline changes (new feature, traffic growth)
  4. Retire stale alerts — if an alert has not fired a true positive in 6 months, question whether it is still relevant
  5. Add new rules — after every incident, ask: "What alert would have caught this?"

Summary

Monitor application, infrastructure, and business logic signals. Define severity levels with clear response expectations and route alerts to the right people. Every alert should contain enough context to begin investigation immediately and link to a runbook. Deduplicate, correlate, and baseline to reduce noise. Track alert quality metrics and tune continuously — an alert that is always ignored is worse than no alert at all.


This training content is AI-assisted and reviewed by our team, but issues may be missed and best practices evolve rapidly. Send corrections to [email protected].