SentinelLog Anomaly Detector

Original Idea

Log Anomaly Detector: A background service that learns baselines and posts Slack alerts on spikes.

Product Requirements Document (PRD): SentinelLog Anomaly Detector

1. Executive Summary

SentinelLog is an intelligent, background-service-first observability platform designed to eliminate the manual toil of setting static thresholds for system logs. By leveraging high-performance Go-based ingestion and streaming statistical algorithms, SentinelLog learns the unique "heartbeat" of a distributed system. It automatically identifies volume spikes and pattern deviations, delivering actionable alerts via Slack and PagerDuty. Designed for the 2026 infrastructure landscape, it prioritizes zero-copy processing and sub-100ms end-to-end detection latency.

2. Problem Statement

In modern distributed architectures, the sheer volume of logs makes manual monitoring impossible. DevOps and SRE teams face three primary issues:

  1. Alert Fatigue: Static thresholds fail to account for natural cyclical patterns (e.g., higher traffic on Monday mornings), leading to "noisy" alerts.
  2. Silent Failures: Critical but subtle anomalies are often missed because they don't hit a high enough arbitrary "ceiling" to trigger a traditional alarm.
  3. Configuration Overhead: Manually updating hundreds of alert rules as services evolve is a significant operational burden.

3. Goals & Success Metrics

  • Zero-Config Baselines: 90% of log sources should require no manual threshold setting after a 24-hour learning period.
  • High-Performance Ingestion: Support >1,000,000 log events per second with <100ms end-to-end processing latency.
  • Reduced MTTR: Reduce Mean Time to Resolution for volume-based incidents by 40% through faster anomaly detection.
  • Precision Alerting: Maintain a false-positive rate of <5% through adaptive 24-hour retraining cycles.

4. User Personas

  • DevOps Engineer (Jordan): Needs to integrate log sources quickly via CLI or Kubernetes sidecars and ensure the system scales with traffic.
  • SRE (Sam): Focuses on minimizing noise and ensuring that when an alert fires in Slack, it represents a genuine infrastructure deviation.
  • Backend Architect (Alex): Wants a high-level view of service health and historical trend comparisons to plan capacity.

5. User Stories

  • As a DevOps Engineer, I want to stream logs via a simple HTTP endpoint so that I can integrate any service regardless of the language it’s written in.
  • As an SRE, I want the system to learn that "High Traffic Mondays" are normal so that I don't get paged for expected surges.
  • As a Security Analyst, I want immediate alerts when log volume for an auth-service spikes by 3 standard deviations, as this may indicate a brute-force attack.
  • As a Developer, I want to see a visual chart of the spike compared to the baseline so I can quickly identify the magnitude of the issue.

6. Functional Requirements

6.1 Real-Time Ingestion

  • Support for HTTP/Syslog and OTLP-compliant structured JSON.
  • Zero-Copy Networking on the TCP path: on Linux, Go serves io.Copy between socket pairs via the splice(2) syscall, keeping log bytes out of user space for maximum throughput (see the relay sketch below).
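
A minimal sketch of that syslog-over-TCP path: when both ends of io.Copy are *net.TCPConn on Linux, the runtime splices kernel buffers directly. The port and upstream address are illustrative, not part of the spec.

    package main

    import (
        "io"
        "log"
        "net"
    )

    // relay forwards raw syslog bytes from src to dst. With two *net.TCPConn
    // on Linux, io.Copy is satisfied via the splice(2) syscall, so log bytes
    // never pass through user space.
    func relay(dst, src *net.TCPConn) {
        defer src.Close()
        defer dst.Close()
        if _, err := io.Copy(dst, src); err != nil {
            log.Printf("relay: %v", err)
        }
    }

    func main() {
        ln, err := net.Listen("tcp", ":5514") // syslog ingest port (illustrative)
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Printf("accept: %v", err)
                continue
            }
            go func(c net.Conn) {
                // Forward to the internal detection pipeline (address illustrative).
                upstream, err := net.Dial("tcp", "127.0.0.1:6000")
                if err != nil {
                    log.Printf("dial upstream: %v", err)
                    c.Close()
                    return
                }
                relay(upstream.(*net.TCPConn), c.(*net.TCPConn))
            }(conn)
        }
    }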

6.2 Statistical Detection Engine

  • Online Learning: Use alexander-yu/stream to calculate the moving average and standard deviation in $O(1)$ time per event.
  • Z-Score Analysis: Trigger alerts when current volume exceeds the baseline by a configurable sensitivity (default: 3-Sigma).
  • Adaptive Windowing: Implement a 24-hour "Seasonality" window to adjust baselines based on time-of-day (see the detection sketch after this list).
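
For illustration, a minimal sketch of the per-source detection loop using Welford's online algorithm directly (alexander-yu/stream exposes equivalent moment statistics; the Detector type and Observe method here are hypothetical, not the library's API):

    package detect

    import "math"

    // Detector keeps a running mean and standard deviation via Welford's
    // algorithm, giving O(1) updates per log-volume sample.
    type Detector struct {
        n     float64
        mean  float64
        m2    float64 // running sum of squared deviations from the mean
        Sigma float64 // sensitivity threshold, e.g. 3 for 3-Sigma
    }

    // Observe folds one per-interval volume sample into the baseline and
    // reports whether it is anomalous. It flags deviations in either
    // direction; restrict to positive z-scores for spike-only alerting.
    func (d *Detector) Observe(x float64) (z float64, anomalous bool) {
        d.n++
        delta := x - d.mean
        d.mean += delta / d.n
        d.m2 += delta * (x - d.mean)

        if d.n < 2 {
            return 0, false // not enough samples to form a baseline yet
        }
        stddev := math.Sqrt(d.m2 / (d.n - 1))
        if stddev == 0 {
            return 0, false
        }
        z = (x - d.mean) / stddev
        return z, math.Abs(z) >= d.Sigma
    }

The seasonality requirement above would keep one Detector per (source, hour-of-day) bucket, so Monday-morning volume is scored against Monday-morning baselines.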

6.3 Alerting & Integration

  • Slack Webhooks: Post rich Block Kit messages with "Acknowledge" and "Silence" buttons (see the sketch after this list).
  • PagerDuty V2 Events: Send alerts with dedup_key to prevent redundant paging.
  • Sensitivity Controls: Allow users to toggle between Low (5-Sigma), Medium (3-Sigma), and High (2-Sigma) sensitivity per source.
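
A minimal sketch of the Slack delivery path, assuming an incoming-webhook URL injected from Vault. The block layout and action IDs are illustrative, and interactive Acknowledge/Silence buttons additionally require a Slack app with interactivity enabled:

    package alert

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    // PostSpikeAlert sends a Block Kit message for an anomaly. webhookURL is
    // assumed to be injected from Vault; service and sigma describe the spike.
    func PostSpikeAlert(webhookURL, service string, sigma float64) error {
        payload := map[string]any{
            "blocks": []map[string]any{
                {
                    "type": "section",
                    "text": map[string]any{
                        "type": "mrkdwn",
                        "text": fmt.Sprintf(":rotating_light: *%s* log volume is %.1f sigma above baseline", service, sigma),
                    },
                },
                {
                    "type": "actions",
                    "elements": []map[string]any{
                        {"type": "button", "text": map[string]any{"type": "plain_text", "text": "Acknowledge"}, "action_id": "ack_anomaly"},
                        {"type": "button", "text": map[string]any{"type": "plain_text", "text": "Silence"}, "action_id": "silence_source"},
                    },
                },
            },
        }
        body, err := json.Marshal(payload)
        if err != nil {
            return err
        }
        resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("slack webhook returned %s", resp.Status)
        }
        return nil
    }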

6.4 Visualization (Nice-to-Have)

  • Real-time Grafana-style dashboards using uPlot for 60fps time-series rendering.
  • Root cause clustering to group similar error messages during a spike.

7. Technical Requirements

7.1 Tech Stack (2026 Standards)

  • Backend: Go 1.25.6 (utilizing the "Green Tea" garbage collector, which can cut GC overhead by up to 40%).
  • Frontend: React 19.2.x (with React Compiler 1.0+) and Vite 7.3.1.
  • Styling: Tailwind CSS 4.1.x (Oxide engine).
  • Database: TimescaleDB 2.24+ (utilizing UUIDv7 primary keys and Direct-to-Columnstore ingestion).
  • Infrastructure: AWS EKS with Karpenter for node scaling and KEDA for event-driven pod scaling.

7.2 Integrations

  • Slack API: Socket Mode, so interactive button events reach SentinelLog without exposing a public HTTP endpoint.
  • OpenTelemetry (OTEL): Standardized ingestion path for trace-linked logs.
  • HashiCorp Vault: For managing Slack/PagerDuty webhook secrets.

8. Data Model

8.1 LogSource (Entity)

  • source_id: UUIDv7 (Primary Key, Time-ordered)
  • api_key: Hash (For ingestion auth)
  • retention_days: Integer (Default 30)

8.2 BaselineProfile (Entity)

  • profile_id: UUIDv7
  • source_id: UUIDv7 (FK)
  • window_start: Timestamp
  • moving_mean: Float64
  • moving_stddev: Float64

8.3 AnomalyEvent (Entity)

  • event_id: UUIDv7
  • source_id: UUIDv7 (FK)
  • deviation_score: Float64
  • raw_data_sample: JSONB (Stores snippet of logs during spike)
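
These entities map directly onto Go structs. A sketch, assuming github.com/google/uuid for the UUIDv7 values and json.RawMessage standing in for the JSONB sample:

    package model

    import (
        "encoding/json"
        "time"

        "github.com/google/uuid"
    )

    // LogSource is a registered ingestion source (8.1).
    type LogSource struct {
        SourceID      uuid.UUID `json:"source_id"`      // UUIDv7, time-ordered
        APIKeyHash    string    `json:"api_key"`        // hash of the ingestion key
        RetentionDays int       `json:"retention_days"` // default 30
    }

    // BaselineProfile is one learned window of normal behavior (8.2).
    type BaselineProfile struct {
        ProfileID    uuid.UUID `json:"profile_id"`
        SourceID     uuid.UUID `json:"source_id"`
        WindowStart  time.Time `json:"window_start"`
        MovingMean   float64   `json:"moving_mean"`
        MovingStddev float64   `json:"moving_stddev"`
    }

    // AnomalyEvent records a detected deviation (8.3).
    type AnomalyEvent struct {
        EventID        uuid.UUID       `json:"event_id"`
        SourceID       uuid.UUID       `json:"source_id"`
        DeviationScore float64         `json:"deviation_score"`
        RawDataSample  json.RawMessage `json:"raw_data_sample"` // log snippet from the spike
    }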

9. API Specification

9.1 Ingest Logs

  • Endpoint: POST /v1/ingest
  • Auth: Header X-Sentinel-Key
  • Payload:
    {
      "timestamp": "2026-01-19T14:00:00Z",
      "service": "auth-api",
      "level": "error",
      "message": "Connection timeout"
    }
    
  • Response: 202 Accepted (Processed asynchronously)
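
A minimal handler sketch for this endpoint, assuming a buffered channel feeds the detection engine. LogEvent mirrors the payload above; the key check is reduced to a presence test here (the real middleware validates against the Redis key cache):

    package api

    import (
        "encoding/json"
        "net/http"
        "time"
    )

    // LogEvent mirrors the /v1/ingest payload.
    type LogEvent struct {
        Timestamp time.Time `json:"timestamp"`
        Service   string    `json:"service"`
        Level     string    `json:"level"`
        Message   string    `json:"message"`
    }

    // IngestHandler validates the API key, enqueues the event, and returns
    // 202 Accepted; detection happens asynchronously downstream.
    func IngestHandler(queue chan<- LogEvent) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if r.Header.Get("X-Sentinel-Key") == "" { // placeholder for the Redis-backed check
                http.Error(w, "missing API key", http.StatusUnauthorized)
                return
            }
            var ev LogEvent
            if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
                http.Error(w, "malformed payload", http.StatusBadRequest)
                return
            }
            select {
            case queue <- ev:
                w.WriteHeader(http.StatusAccepted) // 202: processed asynchronously
            default:
                http.Error(w, "ingest queue full", http.StatusServiceUnavailable)
            }
        }
    }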

9.2 Get Anomaly Details

  • Endpoint: GET /v1/anomalies/{id}
  • Response: 200 OK with JSON containing deviation_score and uPlot compatible data arrays.

10. UI/UX Requirements

  • The Dashboard: Must feature a "Live Heartbeat" view using uPlot.
  • Interaction: Users should be able to drag-select a time range to see "Semantic Clusters" of logs.
  • Theming: Default to "Midnight Pro" dark mode (Tailwind 4.1 native support).
  • Alert Configuration: A simple 3-step wizard: 1. Generate Key -> 2. Pipe Logs -> 3. Set Slack Webhook.

11. Non-Functional Requirements

  • Security: AES-256 encryption for log data at rest; HMAC-SHA256 signature verification for incoming webhooks (see the sketch after this list).
  • Availability: 99.9% uptime for the ingestion endpoint.
  • Scalability: Use KEDA to scale Go worker pods based on SQS queue depth.
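
The HMAC-SHA256 check is small enough to sketch with the standard library alone. The header name and hex encoding are assumptions about the wire format, not a settled spec:

    package security

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
    )

    // VerifySignature recomputes the HMAC-SHA256 of body under secret and
    // compares it to the hex-encoded signature taken from the request
    // header (e.g. X-Sentinel-Signature, name illustrative), in constant time.
    func VerifySignature(secret, body []byte, signature string) bool {
        mac := hmac.New(sha256.New, secret)
        mac.Write(body) // hash.Hash.Write never returns an error
        expected := hex.EncodeToString(mac.Sum(nil))
        return hmac.Equal([]byte(expected), []byte(signature))
    }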

12. Out of Scope

  • Long-term historical log storage (SentinelLog is for detection, not an archive like Elasticsearch).
  • Direct log modification or remediation (we alert, we don't fix).
  • Support for non-structured plain text logs without a predefined regex.

13. Risks & Mitigations

  • Risk: Thundering herd during global outages causing the detector to crash.
    • Mitigation: Implement NATS JetStream as a durable buffer between ingestion and the detection engine (see the sketch after this list).
  • Risk: Massive data costs in TimescaleDB.
    • Mitigation: Use Direct-to-Columnstore and Tiered Storage (S3) for data older than 7 days.
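
A sketch of that JetStream buffer setup using github.com/nats-io/nats.go; the stream and subject names are illustrative:

    package buffer

    import (
        "errors"

        "github.com/nats-io/nats.go"
    )

    // NewPublisher connects to NATS and idempotently creates a file-backed
    // JetStream stream, so ingest bursts are absorbed on disk instead of
    // overwhelming the detection engine.
    func NewPublisher(url string) (nats.JetStreamContext, error) {
        nc, err := nats.Connect(url)
        if err != nil {
            return nil, err
        }
        js, err := nc.JetStream()
        if err != nil {
            return nil, err
        }
        _, err = js.AddStream(&nats.StreamConfig{
            Name:     "SENTINEL_LOGS",    // stream name is illustrative
            Subjects: []string{"logs.>"}, // one subject per source, e.g. logs.auth-api
            Storage:  nats.FileStorage,
        })
        if err != nil && !errors.Is(err, nats.ErrStreamNameAlreadyInUse) {
            return nil, err
        }
        return js, nil
    }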

14. Implementation Tasks

Phase 1: Project Setup

  • [ ] Initialize Go backend with version 1.25.6
  • [ ] Scaffold React 19.2 frontend with Vite 7.3 and Tailwind 4.1 (Oxide)
  • [ ] Configure PostgreSQL with TimescaleDB 2.24 extension
  • [ ] Set up GitHub Actions for CI with pinned 2026 Go and Node toolchains

Phase 2: Ingestion & Storage

  • [ ] Implement POST /v1/ingest with zero-copy buffer pools
  • [ ] Create TimescaleDB Hypertables using UUIDv7 primary keys
  • [ ] Configure Direct-to-Columnstore policy for high-volume log chunks
  • [ ] Implement API key validation middleware using Redis for caching

Phase 3: Detection Engine

  • [ ] Integrate alexander-yu/stream for online mean/std-dev calculations
  • [ ] Build the 24-hour retraining cron job (goroutine + time.Ticker; see the sketch after this phase)
  • [ ] Implement the Z-Score logic (3-Sigma thresholding)
  • [ ] Create logic for AnomalyEvent persistence when thresholds are breached
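
The retraining job needs no external scheduler; a sketch using a plain goroutine and time.Ticker, where retrain stands in for the real baseline-recompute call:

    package engine

    import (
        "context"
        "log"
        "time"
    )

    // StartRetrainLoop recomputes baselines every 24 hours until ctx is
    // cancelled. retrain is a placeholder for the real recompute function.
    func StartRetrainLoop(ctx context.Context, retrain func() error) {
        go func() {
            ticker := time.NewTicker(24 * time.Hour)
            defer ticker.Stop()
            for {
                select {
                case <-ctx.Done():
                    return
                case <-ticker.C:
                    if err := retrain(); err != nil {
                        log.Printf("retrain: %v", err)
                    }
                }
            }
        }()
    }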

Phase 4: Alerting & UI

  • [ ] Build Slack Block Kit builder for anomaly notifications
  • [ ] Implement PagerDuty V2 Event integration with de-duplication
  • [ ] Develop the uPlot time-series component in React
  • [ ] Build the "Source Configuration" dashboard with Tailwind 4.1

Phase 5: Cloud & Scaling

  • [ ] Write Helm charts for EKS deployment
  • [ ] Configure Karpenter NodePools for c7g (Graviton) instances
  • [ ] Set up KEDA ScaledObjects based on ingestion queue depth
  • [ ] Implement Vault sidecar for secret injection of Slack tokens