Original Idea
Log Anomaly Detector: a background service that learns baselines and posts Slack alerts on spikes.
Product Requirements Document (PRD): SentinelLog Anomaly Detector
1. Executive Summary
SentinelLog is an intelligent, background-service-first observability platform designed to eliminate the manual toil of setting static thresholds for system logs. By leveraging high-performance Go-based ingestion and streaming statistical algorithms, SentinelLog learns the unique "heartbeat" of a distributed system. It automatically identifies volume spikes and pattern deviations, delivering actionable alerts via Slack and PagerDuty. Designed for the 2026 infrastructure landscape, it prioritizes zero-copy processing and sub-millisecond detection latency.
2. Problem Statement
In modern distributed architectures, the sheer volume of logs makes manual monitoring impossible. DevOps and SRE teams face two primary issues:
- Alert Fatigue: Static thresholds fail to account for natural cyclical patterns (e.g., higher traffic on Monday mornings), leading to "noisy" alerts.
- Silent Failures: Critical but subtle anomalies are often missed because they don't hit a high enough arbitrary "ceiling" to trigger a traditional alarm.
- Configuration Overhead: Manually updating hundreds of alert rules as services evolve is a significant operational burden.
3. Goals & Success Metrics
- Zero-Config Baselines: 90% of log sources should require no manual threshold setting after a 24-hour learning period.
- High-Performance Ingestion: Support >1,000,000 log events per second with <100ms end-to-end processing latency.
- Reduced MTTR: Reduce Mean Time to Resolution for volume-based incidents by 40% through faster anomaly detection.
- Precision Alerting: Maintain a false-positive rate of <5% through adaptive 24-hour retraining cycles.
4. User Personas
- DevOps Engineer (Jordan): Needs to integrate log sources quickly via CLI or Kubernetes sidecars and ensure the system scales with traffic.
- SRE (Sam): Focuses on minimizing noise and ensuring that when an alert fires in Slack, it represents a genuine infrastructure deviation.
- Backend Architect (Alex): Wants a high-level view of service health and historical trend comparisons to plan capacity.
5. User Stories
- As a DevOps Engineer, I want to stream logs via a simple HTTP endpoint so that I can integrate any service regardless of the language it’s written in.
- As an SRE, I want the system to learn that "High Traffic Mondays" are normal so that I don't get paged for expected surges.
- As a Security Analyst, I want immediate alerts when log volume for an auth-service spikes by 3 standard deviations, as this may indicate a brute-force attack.
- As a Developer, I want to see a visual chart of the spike compared to the baseline so I can quickly identify the magnitude of the issue.
6. Functional Requirements
6.1 Real-Time Ingestion
- Support for HTTP/Syslog and OTLP-compliant structured JSON.
- Implementation of Zero-Copy Networking using Go’s `splice()` syscall support for maximum throughput (see the sketch below).
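The zero-copy path is worth illustrating. On Linux, Go's `io.Copy` uses the `splice(2)` syscall automatically when both ends are `*net.TCPConn`, so a syslog relay can move bytes kernel-side without staging them in user space. The sketch below is illustrative only; the listen port and the internal pipeline address are placeholders, not decided interfaces.

```go
// Minimal sketch of a TCP syslog relay. On Linux, io.Copy between two
// *net.TCPConn values lets the Go runtime use the splice(2) syscall, so
// payload bytes are moved kernel-side without copying into user space.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":5514") // syslog-over-TCP ingest port (assumed)
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept: %v", err)
			continue
		}
		go relay(conn)
	}
}

// relay forwards raw log bytes to the internal detection pipeline.
func relay(src net.Conn) {
	defer src.Close()
	dst, err := net.Dial("tcp", "pipeline:9000") // internal hop (assumed)
	if err != nil {
		log.Printf("dial pipeline: %v", err)
		return
	}
	defer dst.Close()
	// io.Copy detects TCPConn-to-TCPConn transfers and uses splice on Linux.
	if _, err := io.Copy(dst, src); err != nil {
		log.Printf("relay: %v", err)
	}
}
```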
6.2 Statistical Detection Engine
- Online Learning: Use `alexander-yu/stream` to calculate Moving Average and Standard Deviation in $O(1)$ time.
- Z-Score Analysis: Trigger alerts when current volume exceeds the baseline by a configurable sensitivity (Default: 3-Sigma; a minimal sketch follows this list).
- Adaptive Windowing: Implement a 24-hour "Seasonality" window to adjust baselines based on time-of-day.
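As a minimal sketch of the detection math (not the `alexander-yu/stream` API itself, which the production engine would use), the accumulator below keeps a running mean and standard deviation with Welford's algorithm and flags counts that deviate by more than a configurable sigma multiple:

```go
// Illustrative online baseline + z-score check (Welford's algorithm).
// The production engine would use alexander-yu/stream; this standalone
// sketch just shows the 3-sigma decision the PRD describes.
package detector

import "math"

// Baseline keeps a running mean and variance in O(1) per update.
type Baseline struct {
	n    float64
	mean float64
	m2   float64 // running sum of squared deviations
}

// Observe folds one per-interval log count into the baseline.
func (b *Baseline) Observe(count float64) {
	b.n++
	delta := count - b.mean
	b.mean += delta / b.n
	b.m2 += delta * (count - b.mean)
}

// StdDev returns the running sample standard deviation (0 until 2 samples exist).
func (b *Baseline) StdDev() float64 {
	if b.n < 2 {
		return 0
	}
	return math.Sqrt(b.m2 / (b.n - 1))
}

// IsAnomalous reports whether count deviates from the baseline by more
// than sigma standard deviations (the PRD default is sigma = 3).
func (b *Baseline) IsAnomalous(count, sigma float64) bool {
	sd := b.StdDev()
	if sd == 0 {
		return false // not enough history yet
	}
	return math.Abs(count-b.mean)/sd > sigma
}
```

Per-source seasonality would come from keeping one such baseline per (source, time-of-day bucket) pair, retrained on the 24-hour cycle described above.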
6.3 Alerting & Integration
- Slack Webhooks: Post rich Block Kit messages with "Acknowledge" and "Silence" buttons.
- PagerDuty V2 Events: Send alerts with `dedup_key` to prevent redundant paging (see the sketch after this list).
- Sensitivity Controls: Allow users to toggle between Low (5-Sigma), Medium (3-Sigma), and High (2-Sigma) sensitivity per source.
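For the PagerDuty path, a hedged sketch of an Events API v2 trigger with a `dedup_key` is shown below; the routing key handling and the `(source, window)` dedup scheme are assumptions, not final design.

```go
// Sketch of triggering a PagerDuty Events API v2 alert with a dedup_key so
// repeated detections of the same anomaly collapse into a single incident.
// Field names follow the public Events v2 schema; the dedup scheme is illustrative.
package alerting

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type pdEvent struct {
	RoutingKey  string    `json:"routing_key"`
	EventAction string    `json:"event_action"`
	DedupKey    string    `json:"dedup_key"`
	Payload     pdPayload `json:"payload"`
}

type pdPayload struct {
	Summary  string `json:"summary"`
	Source   string `json:"source"`
	Severity string `json:"severity"`
}

// TriggerAnomaly pages once per (source, window) pair; later spikes in the
// same window reuse the dedup_key and only update the existing incident.
func TriggerAnomaly(routingKey, sourceID, window string, score float64) error {
	ev := pdEvent{
		RoutingKey:  routingKey,
		EventAction: "trigger",
		DedupKey:    fmt.Sprintf("sentinellog-%s-%s", sourceID, window),
		Payload: pdPayload{
			Summary:  fmt.Sprintf("Log volume anomaly on %s (z=%.1f)", sourceID, score),
			Source:   sourceID,
			Severity: "critical",
		},
	}
	body, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	resp, err := http.Post("https://events.pagerduty.com/v2/enqueue",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("pagerduty returned %s", resp.Status)
	}
	return nil
}
```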
6.4 Visualization (Nice-to-Have)
- Real-time Grafana-style dashboards using uPlot for 60fps time-series rendering.
- Root cause clustering to group similar error messages during a spike.
7. Technical Requirements
7.1 Tech Stack (2026 Standards)
- Backend: Go 1.25.6 (utilizing "Green Tea" GC for 40% lower overhead).
- Frontend: React 19.2.x (with React Compiler 1.0+) and Vite 7.3.1.
- Styling: Tailwind CSS 4.1.x (Oxide engine).
- Database: TimescaleDB 2.24+ (utilizing UUIDv7 primary keys and Direct-to-Columnstore ingestion).
- Infrastructure: AWS EKS with Karpenter for node scaling and KEDA for event-driven pod scaling.
7.2 Integrations
- Slack API: Socket Mode for secure, internal-only communication.
- OpenTelemetry (OTEL): Standardized ingestion path for trace-linked logs.
- HashiCorp Vault: For managing Slack/PagerDuty webhook secrets.
8. Data Model
8.1 LogSource (Entity)
- `source_id`: UUIDv7 (Primary Key, Time-ordered)
- `api_key`: Hash (For ingestion auth)
- `retention_days`: Integer (Default 30)
8.2 BaselineProfile (Entity)
- `profile_id`: UUIDv7
- `source_id`: UUIDv7 (FK)
- `window_start`: Timestamp
- `moving_mean`: Float64
- `moving_stddev`: Float64
8.3 AnomalyEvent (Entity)
- `event_id`: UUIDv7
- `source_id`: UUIDv7 (FK)
- `deviation_score`: Float
- `raw_data_sample`: JSONB (Stores snippet of logs during spike)
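One possible Go mapping of these entities (types and tags are illustrative; the authoritative schema lives in TimescaleDB):

```go
// One possible Go mapping of the Section 8 entities. UUIDv7 values are kept
// as plain strings to stay dependency-free in this sketch; a real build would
// likely use a uuid package and generated query types.
package model

import (
	"encoding/json"
	"time"
)

// LogSource is a registered ingestion source (8.1).
type LogSource struct {
	SourceID      string `json:"source_id"`      // UUIDv7, time-ordered primary key
	APIKeyHash    string `json:"-"`              // stored as a hash, never the raw key
	RetentionDays int    `json:"retention_days"` // default 30
}

// BaselineProfile is a learned per-window baseline (8.2).
type BaselineProfile struct {
	ProfileID    string    `json:"profile_id"`
	SourceID     string    `json:"source_id"` // FK to LogSource
	WindowStart  time.Time `json:"window_start"`
	MovingMean   float64   `json:"moving_mean"`
	MovingStdDev float64   `json:"moving_stddev"`
}

// AnomalyEvent records a detected deviation (8.3).
type AnomalyEvent struct {
	EventID        string          `json:"event_id"`
	SourceID       string          `json:"source_id"` // FK to LogSource
	DeviationScore float64         `json:"deviation_score"`
	RawDataSample  json.RawMessage `json:"raw_data_sample"` // JSONB snippet captured during the spike
}
```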
9. API Specification
9.1 Ingest Logs
- Endpoint: `POST /v1/ingest`
- Auth: Header `X-Sentinel-Key`
- Payload: `{ "timestamp": "2026-01-19T14:00:00Z", "service": "auth-api", "level": "error", "message": "Connection timeout" }`
- Response: `202 Accepted` (processed asynchronously)
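A minimal client sketch for this endpoint, assuming a placeholder host and API key:

```go
// Minimal client-side sketch of the 9.1 ingest call: one structured log
// event POSTed with the X-Sentinel-Key header. Endpoint host and key are
// placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

type logEvent struct {
	Timestamp time.Time `json:"timestamp"`
	Service   string    `json:"service"`
	Level     string    `json:"level"`
	Message   string    `json:"message"`
}

func main() {
	ev := logEvent{
		Timestamp: time.Now().UTC(),
		Service:   "auth-api",
		Level:     "error",
		Message:   "Connection timeout",
	}
	body, err := json.Marshal(ev)
	if err != nil {
		log.Fatal(err)
	}

	req, err := http.NewRequest(http.MethodPost,
		"https://sentinellog.example.com/v1/ingest", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Sentinel-Key", "sl_live_xxx") // per-source API key (placeholder)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted { // ingestion is async: expect 202
		log.Fatalf("unexpected status: %s", resp.Status)
	}
	fmt.Println("event accepted")
}
```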
9.2 Get Anomaly Details
- Endpoint: `GET /v1/anomalies/{id}`
- Response: `200 OK` with JSON containing `deviation_score` and uPlot-compatible data arrays.
10. UI/UX Requirements
- The Dashboard: Must feature a "Live Heartbeat" view using uPlot.
- Interaction: Users should be able to drag-select a time range to see "Semantic Clusters" of logs.
- Theming: Default to "Midnight Pro" dark mode (Tailwind 4.1 native support).
- Alert Configuration: A simple 3-step wizard: 1. Generate Key -> 2. Pipe Logs -> 3. Set Slack Webhook.
11. Non-Functional Requirements
- Security: AES-256 encryption for log data at rest; HMAC-SHA256 signature verification for incoming webhooks (see the verification sketch after this list).
- Availability: 99.9% uptime for the ingestion endpoint.
- Scalability: Use KEDA to scale Go worker pods based on SQS queue depth.
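A sketch of the webhook signature check, assuming the signature arrives as a hex-encoded HMAC-SHA256 of the raw request body (providers differ; Slack, for instance, signs a versioned timestamp-plus-body string):

```go
// Sketch of HMAC-SHA256 signature verification for inbound webhooks. The
// header name and the exact string that gets signed vary by provider, so
// treat "signature over the raw body" as an assumption.
package webhook

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// Verify recomputes the HMAC-SHA256 of body with the shared secret and
// compares it to the hex signature from the request header in constant time.
func Verify(secret, body []byte, hexSignature string) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body) // hash.Hash.Write never returns an error
	expected := mac.Sum(nil)

	received, err := hex.DecodeString(hexSignature)
	if err != nil {
		return false
	}
	return hmac.Equal(expected, received)
}
```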
12. Out of Scope
- Long-term historical log storage (SentinelLog is for detection, not an archive like Elasticsearch).
- Direct log modification or remediation (we alert, we don't fix).
- Support for non-structured plain text logs without a predefined regex.
13. Risks & Mitigations
- Risk: Thundering herd during global outages causing the detector to crash.
- Mitigation: Implement NATS JetStream as a buffer between ingestion and the detection engine (see the sketch at the end of this section).
- Risk: Massive data costs in TimescaleDB.
- Mitigation: Use Direct-to-Columnstore and Tiered Storage (S3) for data older than 7 days.
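A minimal sketch of the JetStream buffer between ingestion and detection, assuming the `nats.go` client and illustrative stream/subject names:

```go
// Ingestion-side buffering with NATS JetStream: the HTTP handlers publish raw
// events to a durable stream and the detection engine consumes at its own
// pace, absorbing thundering-herd bursts. Stream and subject names are
// illustrative.
package buffer

import "github.com/nats-io/nats.go"

// Connect declares the LOGS stream and returns a JetStream context.
func Connect(url string) (nats.JetStreamContext, error) {
	nc, err := nats.Connect(url)
	if err != nil {
		return nil, err
	}
	js, err := nc.JetStream()
	if err != nil {
		return nil, err
	}
	// Declare the buffer stream (assumes it is safe to create on startup).
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "LOGS",
		Subjects: []string{"logs.>"},
	})
	if err != nil {
		return nil, err
	}
	return js, nil
}

// Publish enqueues one raw log event for the detection engine.
func Publish(js nats.JetStreamContext, service string, payload []byte) error {
	_, err := js.Publish("logs."+service, payload)
	return err
}
```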
14. Implementation Tasks
Phase 1: Project Setup
- [ ] Initialize Go backend with version 1.25.6
- [ ] Scaffold React 19.2 frontend with Vite 7.3 and Tailwind 4.1 (Oxide)
- [ ] Configure PostgreSQL with TimescaleDB 2.24 extension
- [ ] Set up GitHub Actions for CI with Go/Node 2026-LTS environments
Phase 2: Ingestion & Storage
- [ ] Implement `POST /v1/ingest` with zero-copy buffer pools
- [ ] Create TimescaleDB Hypertables using UUIDv7 primary keys
- [ ] Configure `Direct-to-Columnstore` policy for high-volume log chunks
- [ ] Implement API key validation middleware using Redis for caching (see the sketch after this list)
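A sketch of the key-validation middleware, assuming the `go-redis` v9 client, a `SourceLookup` callback for the primary store, and an `apikey:<sha256>` cache-key layout (all illustrative):

```go
// Sketch of the Phase 2 API-key middleware: the hashed key is looked up in
// Redis first and only falls back to the primary store on a miss.
package middleware

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

type ctxKey string

// SourceIDKey carries the authenticated source ID to downstream handlers.
const SourceIDKey ctxKey = "sourceID"

// SourceLookup resolves a hashed API key to a source ID (e.g. from TimescaleDB).
type SourceLookup func(ctx context.Context, keyHash string) (string, error)

// Auth validates X-Sentinel-Key, caching successful lookups for 5 minutes.
func Auth(rdb *redis.Client, lookup SourceLookup, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		raw := r.Header.Get("X-Sentinel-Key")
		if raw == "" {
			http.Error(w, "missing API key", http.StatusUnauthorized)
			return
		}
		sum := sha256.Sum256([]byte(raw))
		cacheKey := "apikey:" + hex.EncodeToString(sum[:])

		sourceID, err := rdb.Get(r.Context(), cacheKey).Result()
		if err == redis.Nil {
			// Cache miss: hit the primary store, then populate the cache.
			sourceID, err = lookup(r.Context(), hex.EncodeToString(sum[:]))
			if err != nil {
				http.Error(w, "invalid API key", http.StatusUnauthorized)
				return
			}
			rdb.Set(r.Context(), cacheKey, sourceID, 5*time.Minute)
		} else if err != nil {
			http.Error(w, "auth backend unavailable", http.StatusServiceUnavailable)
			return
		}
		ctx := context.WithValue(r.Context(), SourceIDKey, sourceID)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```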
Phase 3: Detection Engine
- [ ] Integrate `alexander-yu/stream` for online mean/std-dev calculations
- [ ] Build the 24-hour retraining cron job (goroutine + ticker; see the sketch after this list)
- [ ] Implement the Z-Score logic (3-Sigma thresholding)
- [ ] Create logic for `AnomalyEvent` persistence when thresholds are breached
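A sketch of the retraining loop, with `retrainAll` standing in for the real baseline recomputation:

```go
// Phase 3 retraining loop: a goroutine with a time.Ticker that rebuilds each
// source's 24-hour seasonal baseline once per day. retrainAll is a stand-in
// for the actual recomputation over stored BaselineProfile rows.
package engine

import (
	"context"
	"log"
	"time"
)

// StartRetrainer recomputes baselines every interval until ctx is cancelled.
func StartRetrainer(ctx context.Context, interval time.Duration, retrainAll func(context.Context) error) {
	go func() {
		ticker := time.NewTicker(interval) // e.g. 24 * time.Hour
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := retrainAll(ctx); err != nil {
					log.Printf("baseline retraining failed: %v", err)
				}
			}
		}
	}()
}
```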
Phase 4: Alerting & UI
- [ ] Build Slack Block Kit builder for anomaly notifications (see the sketch after this list)
- [ ] Implement PagerDuty V2 Event integration with de-duplication
- [ ] Develop the uPlot time-series component in React
- [ ] Build the "Source Configuration" dashboard with Tailwind 4.1
Phase 5: Cloud & Scaling
- [ ] Write Helm charts for EKS deployment
- [ ] Configure Karpenter NodePools for `c7g` (Graviton) instances
- [ ] Set up KEDA ScaledObjects based on ingestion queue depth
- [ ] Implement Vault sidecar for secret injection of Slack tokens