SentinelLog Anomaly Detector

Original Idea

Log Anomaly Detector: A background service that learns baselines and posts Slack alerts on spikes.

Product Requirements Document (PRD): SentinelLog Anomaly Detector

1. Executive Summary

SentinelLog is an intelligent, background-service-first observability platform designed to eliminate the manual toil of setting static thresholds for system logs. By leveraging high-performance Go-based ingestion and streaming statistical algorithms, SentinelLog learns the unique "heartbeat" of a distributed system. It automatically identifies volume spikes and pattern deviations, delivering actionable alerts via Slack and PagerDuty. Designed for the 2026 infrastructure landscape, it prioritizes zero-copy processing and sub-100ms end-to-end detection latency.

2. Problem Statement

In modern distributed architectures, the sheer volume of logs makes manual monitoring impossible. DevOps and SRE teams face three primary issues:

  1. Alert Fatigue: Static thresholds fail to account for natural cyclical patterns (e.g., higher traffic on Monday mornings), leading to "noisy" alerts.
  2. Silent Failures: Critical but subtle anomalies are often missed because they don't hit a high enough arbitrary "ceiling" to trigger a traditional alarm.
  3. Configuration Overhead: Manually updating hundreds of alert rules as services evolve is a significant operational burden.

3. Goals & Success Metrics

  • Zero-Config Baselines: 90% of log sources should require no manual threshold setting after a 24-hour learning period.
  • High-Performance Ingestion: Support >1,000,000 log events per second with <100ms end-to-end processing latency.
  • Reduced MTTR: Reduce Mean Time to Resolution for volume-based incidents by 40% through faster anomaly detection.
  • Precision Alerting: Maintain a false-positive rate of <5% through adaptive 24-hour retraining cycles.

4. User Personas

  • DevOps Engineer (Jordan): Needs to integrate log sources quickly via CLI or Kubernetes sidecars and ensure the system scales with traffic.
  • SRE (Sam): Focuses on minimizing noise and ensuring that when an alert fires in Slack, it represents a genuine infrastructure deviation.
  • Backend Architect (Alex): Wants a high-level view of service health and historical trend comparisons to plan capacity.

5. User Stories

  • As a DevOps Engineer, I want to stream logs via a simple HTTP endpoint so that I can integrate any service regardless of the language it’s written in.
  • As an SRE, I want the system to learn that "High Traffic Mondays" are normal so that I don't get paged for expected surges.
  • As a Security Analyst, I want immediate alerts when log volume for an auth-service spikes by 3 standard deviations, as this may indicate a brute-force attack.
  • As a Developer, I want to see a visual chart of the spike compared to the baseline so I can quickly identify the magnitude of the issue.

6. Functional Requirements

6.1 Real-Time Ingestion

  • Support for HTTP/Syslog and OTLP-compliant structured JSON.
  • Zero-Copy Networking on the TCP path: on Linux, Go serves io.Copy between socket pairs via the splice(2) syscall, keeping log bytes out of user space for maximum throughput (see the relay sketch below).
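
A minimal sketch of that syslog-over-TCP path: when both ends of io.Copy are *net.TCPConn on Linux, the runtime splices kernel buffers directly. The port and upstream address are illustrative, not part of the spec.

    package main

    import (
        "io"
        "log"
        "net"
    )

    // relay forwards raw syslog bytes from src to dst. With two *net.TCPConn
    // on Linux, io.Copy is satisfied via the splice(2) syscall, so log bytes
    // never pass through user space.
    func relay(dst, src *net.TCPConn) {
        defer src.Close()
        defer dst.Close()
        if _, err := io.Copy(dst, src); err != nil {
            log.Printf("relay: %v", err)
        }
    }

    func main() {
        ln, err := net.Listen("tcp", ":5514") // syslog ingest port (illustrative)
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Printf("accept: %v", err)
                continue
            }
            go func(c net.Conn) {
                // Forward to the internal detection pipeline (address illustrative).
                upstream, err := net.Dial("tcp", "127.0.0.1:6000")
                if err != nil {
                    log.Printf("dial upstream: %v", err)
                    c.Close()
                    return
                }
                relay(upstream.(*net.TCPConn), c.(*net.TCPConn))
            }(conn)
        }
    }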

6.2 Statistical Detection Engine

  • Online Learning: Use alexander-yu/stream to calculate the moving average and standard deviation in $O(1)$ time per event.
  • Z-Score Analysis: Trigger alerts when current volume exceeds the baseline by a configurable sensitivity (default: 3-Sigma).
  • Adaptive Windowing: Implement a 24-hour "Seasonality" window to adjust baselines based on time-of-day (see the detection sketch after this list).
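
For illustration, a minimal sketch of the per-source detection loop using Welford's online algorithm directly (alexander-yu/stream exposes equivalent moment statistics; the Detector type and Observe method here are hypothetical, not the library's API):

    package detect

    import "math"

    // Detector keeps a running mean and standard deviation via Welford's
    // algorithm, giving O(1) updates per log-volume sample.
    type Detector struct {
        n     float64
        mean  float64
        m2    float64 // running sum of squared deviations from the mean
        Sigma float64 // sensitivity threshold, e.g. 3 for 3-Sigma
    }

    // Observe folds one per-interval volume sample into the baseline and
    // reports whether it is anomalous. It flags deviations in either
    // direction; restrict to positive z-scores for spike-only alerting.
    func (d *Detector) Observe(x float64) (z float64, anomalous bool) {
        d.n++
        delta := x - d.mean
        d.mean += delta / d.n
        d.m2 += delta * (x - d.mean)

        if d.n < 2 {
            return 0, false // not enough samples to form a baseline yet
        }
        stddev := math.Sqrt(d.m2 / (d.n - 1))
        if stddev == 0 {
            return 0, false
        }
        z = (x - d.mean) / stddev
        return z, math.Abs(z) >= d.Sigma
    }

The seasonality requirement above would keep one Detector per (source, hour-of-day) bucket, so Monday-morning volume is scored against Monday-morning baselines.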

6.3 Alerting & Integration

  • Slack Webhooks: Post rich Block Kit messages with "Acknowledge" and "Silence" buttons (see the sketch after this list).
  • PagerDuty V2 Events: Send alerts with dedup_key to prevent redundant paging.
  • Sensitivity Controls: Allow users to toggle between Low (5-Sigma), Medium (3-Sigma), and High (2-Sigma) sensitivity per source.
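
A minimal sketch of the Slack delivery path, assuming an incoming-webhook URL injected from Vault. The block layout and action IDs are illustrative, and interactive Acknowledge/Silence buttons additionally require a Slack app with interactivity enabled:

    package alert

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    // PostSpikeAlert sends a Block Kit message for an anomaly. webhookURL is
    // assumed to be injected from Vault; service and sigma describe the spike.
    func PostSpikeAlert(webhookURL, service string, sigma float64) error {
        payload := map[string]any{
            "blocks": []map[string]any{
                {
                    "type": "section",
                    "text": map[string]any{
                        "type": "mrkdwn",
                        "text": fmt.Sprintf(":rotating_light: *%s* log volume is %.1f sigma above baseline", service, sigma),
                    },
                },
                {
                    "type": "actions",
                    "elements": []map[string]any{
                        {"type": "button", "text": map[string]any{"type": "plain_text", "text": "Acknowledge"}, "action_id": "ack_anomaly"},
                        {"type": "button", "text": map[string]any{"type": "plain_text", "text": "Silence"}, "action_id": "silence_source"},
                    },
                },
            },
        }
        body, err := json.Marshal(payload)
        if err != nil {
            return err
        }
        resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("slack webhook returned %s", resp.Status)
        }
        return nil
    }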

6.4 Visualization (Nice-to-Have)

  • Real-time Grafana-style dashboards using uPlot for 60fps time-series rendering.
  • Root cause clustering to group similar error messages during a spike.

7. Technical Requirements

7.1 Tech Stack (2026 Standards)

  • Backend: Go 1.25.6 (utilizing the "Green Tea" garbage collector, which can cut GC overhead by up to 40%).
  • Frontend: React 19.2.x (with React Compiler 1.0+) and Vite 7.3.1.
  • Styling: Tailwind CSS 4.1.x (Oxide engine).
  • Database: TimescaleDB 2.24+ (utilizing UUIDv7 primary keys and Direct-to-Columnstore ingestion).
  • Infrastructure: AWS EKS with Karpenter for node scaling and KEDA for event-driven pod scaling.

7.2 Integrations

  • Slack API: Socket Mode, so interactive button events reach SentinelLog without exposing a public HTTP endpoint.
  • OpenTelemetry (OTEL): Standardized ingestion path for trace-linked logs.
  • HashiCorp Vault: For managing Slack/PagerDuty webhook secrets.

8. Data Model

8.1 LogSource (Entity)

  • source_id: UUIDv7 (Primary Key, Time-ordered)
  • api_key: Hash (For ingestion auth)
  • retention_days: Integer (Default 30)

8.2 BaselineProfile (Entity)

  • profile_id: UUIDv7
  • source_id: UUIDv7 (FK)
  • window_start: Timestamp
  • moving_mean: Float64
  • moving_stddev: Float64

8.3 AnomalyEvent (Entity)

  • event_id: UUIDv7
  • source_id: UUIDv7 (FK)
  • deviation_score: Float64
  • raw_data_sample: JSONB (Stores snippet of logs during spike)
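
These entities map directly onto Go structs. A sketch, assuming github.com/google/uuid for the UUIDv7 values and json.RawMessage standing in for the JSONB sample:

    package model

    import (
        "encoding/json"
        "time"

        "github.com/google/uuid"
    )

    // LogSource is a registered ingestion source (8.1).
    type LogSource struct {
        SourceID      uuid.UUID `json:"source_id"`      // UUIDv7, time-ordered
        APIKeyHash    string    `json:"api_key"`        // hash of the ingestion key
        RetentionDays int       `json:"retention_days"` // default 30
    }

    // BaselineProfile is one learned window of normal behavior (8.2).
    type BaselineProfile struct {
        ProfileID    uuid.UUID `json:"profile_id"`
        SourceID     uuid.UUID `json:"source_id"`
        WindowStart  time.Time `json:"window_start"`
        MovingMean   float64   `json:"moving_mean"`
        MovingStddev float64   `json:"moving_stddev"`
    }

    // AnomalyEvent records a detected deviation (8.3).
    type AnomalyEvent struct {
        EventID        uuid.UUID       `json:"event_id"`
        SourceID       uuid.UUID       `json:"source_id"`
        DeviationScore float64         `json:"deviation_score"`
        RawDataSample  json.RawMessage `json:"raw_data_sample"` // log snippet from the spike
    }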

9. API Specification

9.1 Ingest Logs

  • Endpoint: POST /v1/ingest
  • Auth: Header X-Sentinel-Key
  • Payload:
    {
      "timestamp": "2026-01-19T14:00:00Z",
      "service": "auth-api",
      "level": "error",
      "message": "Connection timeout"
    }
    
  • Response: 202 Accepted (Processed asynchronously)
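
A minimal handler sketch for this endpoint, assuming a buffered channel feeds the detection engine. LogEvent mirrors the payload above; the key check is reduced to a presence test here (the real middleware validates against the Redis key cache):

    package api

    import (
        "encoding/json"
        "net/http"
        "time"
    )

    // LogEvent mirrors the /v1/ingest payload.
    type LogEvent struct {
        Timestamp time.Time `json:"timestamp"`
        Service   string    `json:"service"`
        Level     string    `json:"level"`
        Message   string    `json:"message"`
    }

    // IngestHandler validates the API key, enqueues the event, and returns
    // 202 Accepted; detection happens asynchronously downstream.
    func IngestHandler(queue chan<- LogEvent) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if r.Header.Get("X-Sentinel-Key") == "" { // placeholder for the Redis-backed check
                http.Error(w, "missing API key", http.StatusUnauthorized)
                return
            }
            var ev LogEvent
            if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
                http.Error(w, "malformed payload", http.StatusBadRequest)
                return
            }
            select {
            case queue <- ev:
                w.WriteHeader(http.StatusAccepted) // 202: processed asynchronously
            default:
                http.Error(w, "ingest queue full", http.StatusServiceUnavailable)
            }
        }
    }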

9.2 Get Anomaly Details

  • Endpoint: GET /v1/anomalies/{id}
  • Response: 200 OK with JSON containing deviation_score and uPlot compatible data arrays.

10. UI/UX Requirements

  • The Dashboard: Must feature a "Live Heartbeat" view using uPlot.
  • Interaction: Users should be able to drag-select a time range to see "Semantic Clusters" of logs.
  • Theming: Default to "Midnight Pro" dark mode (Tailwind 4.1 native support).
  • Alert Configuration: A simple 3-step wizard: 1. Generate Key -> 2. Pipe Logs -> 3. Set Slack Webhook.

11. Non-Functional Requirements

  • Security: AES-256 encryption for log data at rest; HMAC-SHA256 signature verification for incoming webhooks (see the sketch after this list).
  • Availability: 99.9% uptime for the ingestion endpoint.
  • Scalability: Use KEDA to scale Go worker pods based on SQS queue depth.
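
The HMAC-SHA256 check is small enough to sketch with the standard library alone. The header name and hex encoding are assumptions about the wire format, not a settled spec:

    package security

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
    )

    // VerifySignature recomputes the HMAC-SHA256 of body under secret and
    // compares it to the hex-encoded signature taken from the request
    // header (e.g. X-Sentinel-Signature, name illustrative), in constant time.
    func VerifySignature(secret, body []byte, signature string) bool {
        mac := hmac.New(sha256.New, secret)
        mac.Write(body) // hash.Hash.Write never returns an error
        expected := hex.EncodeToString(mac.Sum(nil))
        return hmac.Equal([]byte(expected), []byte(signature))
    }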

12. Out of Scope

  • Long-term historical log storage (SentinelLog is for detection, not an archive like Elasticsearch).
  • Direct log modification or remediation (we alert, we don't fix).
  • Support for non-structured plain text logs without a predefined regex.

13. Risks & Mitigations

  • Risk: Thundering herd during global outages causing the detector to crash.
    • Mitigation: Implement NATS JetStream as a durable buffer between ingestion and the detection engine (see the sketch after this list).
  • Risk: Massive data costs in TimescaleDB.
    • Mitigation: Use Direct-to-Columnstore and Tiered Storage (S3) for data older than 7 days.
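
A sketch of that JetStream buffer setup using github.com/nats-io/nats.go; the stream and subject names are illustrative:

    package buffer

    import (
        "errors"

        "github.com/nats-io/nats.go"
    )

    // NewPublisher connects to NATS and idempotently creates a file-backed
    // JetStream stream, so ingest bursts are absorbed on disk instead of
    // overwhelming the detection engine.
    func NewPublisher(url string) (nats.JetStreamContext, error) {
        nc, err := nats.Connect(url)
        if err != nil {
            return nil, err
        }
        js, err := nc.JetStream()
        if err != nil {
            return nil, err
        }
        _, err = js.AddStream(&nats.StreamConfig{
            Name:     "SENTINEL_LOGS",    // stream name is illustrative
            Subjects: []string{"logs.>"}, // one subject per source, e.g. logs.auth-api
            Storage:  nats.FileStorage,
        })
        if err != nil && !errors.Is(err, nats.ErrStreamNameAlreadyInUse) {
            return nil, err
        }
        return js, nil
    }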

14. Implementation Tasks

Phase 1: Project Setup

  • [ ] Initialize Go backend with version 1.25.6
  • [ ] Scaffold React 19.2 frontend with Vite 7.3 and Tailwind 4.1 (Oxide)
  • [ ] Configure PostgreSQL with TimescaleDB 2.24 extension
  • [ ] Set up GitHub Actions for CI with pinned 2026 Go and Node toolchains

Phase 2: Ingestion & Storage

  • [ ] Implement POST /v1/ingest with zero-copy buffer pools
  • [ ] Create TimescaleDB Hypertables using UUIDv7 primary keys
  • [ ] Configure Direct-to-Columnstore policy for high-volume log chunks
  • [ ] Implement API key validation middleware using Redis for caching

Phase 3: Detection Engine

  • [ ] Integrate alexander-yu/stream for online mean/std-dev calculations
  • [ ] Build the 24-hour retraining cron job (goroutine + time.Ticker; see the sketch after this phase)
  • [ ] Implement the Z-Score logic (3-Sigma thresholding)
  • [ ] Create logic for AnomalyEvent persistence when thresholds are breached
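
The retraining job needs no external scheduler; a sketch using a plain goroutine and time.Ticker, where retrain stands in for the real baseline-recompute call:

    package engine

    import (
        "context"
        "log"
        "time"
    )

    // StartRetrainLoop recomputes baselines every 24 hours until ctx is
    // cancelled. retrain is a placeholder for the real recompute function.
    func StartRetrainLoop(ctx context.Context, retrain func() error) {
        go func() {
            ticker := time.NewTicker(24 * time.Hour)
            defer ticker.Stop()
            for {
                select {
                case <-ctx.Done():
                    return
                case <-ticker.C:
                    if err := retrain(); err != nil {
                        log.Printf("retrain: %v", err)
                    }
                }
            }
        }()
    }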

Phase 4: Alerting & UI

  • [ ] Build Slack Block Kit builder for anomaly notifications
  • [ ] Implement PagerDuty V2 Event integration with de-duplication
  • [ ] Develop the uPlot time-series component in React
  • [ ] Build the "Source Configuration" dashboard with Tailwind 4.1

Phase 5: Cloud & Scaling

  • [ ] Write Helm charts for EKS deployment
  • [ ] Configure Karpenter NodePools for c7g (Graviton) instances
  • [ ] Set up KEDA ScaledObjects based on ingestion queue depth
  • [ ] Implement Vault sidecar for secret injection of Slack tokens