Postmortem Pro

Developer

Original Idea

Incident Postmortem Builder: a web app that builds timelines from logs and turns them into actionable follow-ups.

Product Requirements Document (PRD): Postmortem Pro

1. Executive Summary

Postmortem Pro is a specialized web application designed to transform the chaotic aftermath of an engineering incident into a structured, insightful, and actionable postmortem report. By automating log ingestion, providing an interactive "drag-and-drop" timeline builder, and facilitating real-time collaborative analysis, the platform reduces the time-to-report while ensuring critical follow-up tasks are never forgotten. It bridges the gap between raw infrastructure data (logs/traces) and human narrative (RCA).

2. Problem Statement

Engineering teams struggle with "Incident Archaeology"—the manual, error-prone process of hunting through fragmented logs across multiple platforms to reconstruct what happened. This leads to:

  • Delayed Root Cause Analysis (RCA): Memories fade while reports sit in drafts.
  • Context Loss: Critical Slack/Teams discussions are lost in the scroll-back.
  • Lack of Accountability: Follow-up actions created in docs are rarely synced to Jira or GitHub, leading to recurring incidents.

3. Goals & Success Metrics

Business Goals

  • Reduce the average time to complete a postmortem report by 50%.
  • Increase the completion rate of postmortem follow-up actions.
  • Standardize the quality of RCA across the engineering organization.

Success Metrics (KPIs)

  • Mean Time to Report (MTTRp): Time from incident resolution to "Published" report status.
  • Action Item Sync Rate: Percentage of postmortem action items successfully exported to Jira/GitHub.
  • Log-to-Timeline Conversion: Number of log entries promoted to timeline events per incident.
  • Collaboration Depth: Average number of concurrent editors per report.
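
The ratio-style KPIs above reduce to simple arithmetic over report and action-item records. The following is a minimal sketch of that arithmetic, not a prescribed implementation; the ActionItem type, its Exported field, and both function names are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// ActionItem is a hypothetical, trimmed-down record used only in this sketch.
type ActionItem struct {
	Exported bool // true once the item exists in Jira or GitHub
}

// SyncRate returns the percentage of action items exported to an external
// tracker. It returns 0 for an empty slice to avoid dividing by zero.
func SyncRate(items []ActionItem) float64 {
	if len(items) == 0 {
		return 0
	}
	exported := 0
	for _, it := range items {
		if it.Exported {
			exported++
		}
	}
	return 100 * float64(exported) / float64(len(items))
}

// TimeToReport is one incident's contribution to MTTRp: the gap between
// incident resolution and the report reaching "Published" status.
func TimeToReport(resolvedAt, publishedAt time.Time) time.Duration {
	return publishedAt.Sub(resolvedAt)
}

func main() {
	items := []ActionItem{{true}, {true}, {false}, {true}}
	fmt.Printf("sync rate: %.1f%%\n", SyncRate(items))

	resolved := time.Date(2026, 1, 10, 14, 0, 0, 0, time.UTC)
	published := resolved.Add(36 * time.Hour)
	fmt.Println("time to report:", TimeToReport(resolved, published))
}
```

MTTRp itself would be the mean of TimeToReport across all incidents in the reporting period.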

4. User Personas

  • Incident Commander (IC): Needs to quickly build a high-level narrative and delegate RCA sections.
  • Site Reliability Engineer (SRE): Needs to ingest massive log files and identify the exact millisecond of failure.
  • Engineering Manager (EM): Needs to track high-level trends and ensure action items are assigned and funded.
  • Platform Engineer: Needs to understand infrastructure-wide impact (Blast Radius).

5. User Stories

  • As an SRE, I want to upload a 500MB log file so that I can filter for errors and drag them into a chronological timeline.
  • As an Incident Commander, I want to pull in a Slack thread so that the key decisions made during the heat of the moment are preserved.
  • As an Engineering Manager, I want a "Five Whys" template so that my team performs deep root cause analysis rather than surface-level fixes.
  • As a Developer, I want my assigned follow-up task to automatically appear in my GitHub Issues so I don't have to check a separate tool.

6. Functional Requirements

6.1. Log Management & Ingestion

  • Automated Ingestion: CLI-based upload or direct File Upload (CSV, JSON, Plain Text).
  • Masking Engine: Automated PII and secret redaction using Go-based masq and gitleaks patterns.
  • Search & Filter: High-speed querying of ingested logs via ClickHouse Inverted Index.
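
To make the Masking Engine requirement concrete, the sketch below redacts two kinds of sensitive token with the Go standard library only. It deliberately does not use the masq or gitleaks APIs; the two regular expressions, the placeholder strings, and the MaskLine name are illustrative assumptions, and a production rule set would be far broader.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only (assumptions for this sketch).
var (
	// emailRe matches common e-mail address shapes.
	emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	// awsKeyRe matches AWS-style access key IDs: "AKIA" plus 16 characters.
	awsKeyRe = regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)
)

// MaskLine replaces e-mail addresses and AWS-style access key IDs in a
// single log line with fixed placeholders.
func MaskLine(line string) string {
	line = emailRe.ReplaceAllString(line, "[REDACTED_EMAIL]")
	line = awsKeyRe.ReplaceAllString(line, "[REDACTED_SECRET]")
	return line
}

func main() {
	raw := "login failed for alice@example.com key=AKIAIOSFODNN7EXAMPLE"
	fmt.Println(MaskLine(raw))
}
```

Running the masking step before the line is written to storage means raw secrets never reach the search index.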

6.2. Interactive Timeline Builder

  • Visual Interface: A millisecond-accurate timeline using @cyca/react-timeline-editor.
  • Event Promotion: One-click "Promote to Timeline" from raw log entries.
  • Annotation: Ability to add manual notes, images, or "Decision" markers to any point on the timeline.

6.3. Collaborative Editor

  • Real-time Editing: Google Docs-style multi-user editing using Tiptap and Liveblocks.
  • RCA Templates: Pre-built blocks for "Five Whys," "Fishbone (Ishikawa)," and "Blast Radius."
  • Contextual Mentions: @mention team members and #link to infrastructure components.

6.4. Integration & Sync

  • Bi-directional Task Sync: Create Jira/GitHub issues from the doc; update the doc when the issue status changes.
  • Communication Capture: Extract context from Slack/Teams using the 2026 "Agentic Context" pattern (Teams Workflows App).

7. Technical Requirements

7.1. Tech Stack (2026 Standard)

  • Frontend: Next.js 16.1.2 with TypeScript, Tailwind CSS, and Turbopack for builds. Use the "use cache" directive for data fetching.
  • Backend: Go 1.25.6 (opting into the experimental "Green Tea" GC via GOEXPERIMENT=greenteagc for low-latency log processing).
  • Database (Analytics): ClickHouse on AWS EKS, using S3 Express One Zone for the hot tier.
  • Database (Transactional): PostgreSQL 17 (RDS).
  • Real-time Sync: Yjs with Liveblocks for document CRDTs.
  • Infrastructure: AWS EKS with Karpenter for Graviton4 (r8g) node orchestration.

7.2. Integration Specifics

  • GitHub: Octokit Go SDK (OpenAPI-generated).
  • Jira: go-atlassian library for Jira Cloud v3 API.
  • Messaging: Microsoft Teams Workflows App (Power Automate) to replace legacy connectors.

8. Data Model

8.1. Incident

  • id: UUID
  • title: String
  • severity: Enum (P0, P1, P2)
  • status: Enum (Draft, Published, Archived)
  • window_start: Timestamp
  • window_end: Timestamp

8.2. LogEntry (ClickHouse)

  • timestamp: DateTime64(3)
  • service_name: LowCardinality(String)
  • level: Enum8
  • message: String (Inverted Indexed)
  • metadata: Map(String, String)
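
As an illustration of how a JSON log line might map onto the LogEntry fields above, here is a minimal Go sketch. The JSON key names, the use of a plain string for level, and the ParseJSONLine name are assumptions; the PRD does not define the wire format of uploaded logs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LogEntry mirrors the fields listed in 8.2. The JSON key names are
// assumptions made for this sketch.
type LogEntry struct {
	Timestamp   time.Time         `json:"timestamp"`
	ServiceName string            `json:"service_name"`
	Level       string            `json:"level"`
	Message     string            `json:"message"`
	Metadata    map[string]string `json:"metadata"`
}

// ParseJSONLine decodes one JSON log line and truncates the timestamp to
// millisecond precision so it matches DateTime64(3).
func ParseJSONLine(line []byte) (LogEntry, error) {
	var e LogEntry
	if err := json.Unmarshal(line, &e); err != nil {
		return LogEntry{}, fmt.Errorf("parse log line: %w", err)
	}
	e.Timestamp = e.Timestamp.UTC().Truncate(time.Millisecond)
	return e, nil
}

func main() {
	line := []byte(`{"timestamp":"2026-01-10T14:03:07.123456Z",` +
		`"service_name":"checkout","level":"error",` +
		`"message":"upstream timeout","metadata":{"region":"us-east-1"}}`)
	entry, err := ParseJSONLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(entry.Timestamp.Format(time.RFC3339Nano), entry.ServiceName, entry.Message)
}
```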

8.3. TimelineEvent

  • id: UUID
  • incident_id: UUID (FK)
  • type: Enum (Log, Manual, Chat)
  • content: Text
  • offset_ms: Int64
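
The "Promote to Timeline" action can be read as a conversion from a log entry to a TimelineEvent. The sketch below assumes that offset_ms is measured from the incident's window_start; the PRD does not state the reference point, so treat that as an assumption. The Promote and mustTime names are hypothetical, and the event id is omitted for brevity.

```go
package main

import (
	"fmt"
	"time"
)

// EventType mirrors the TimelineEvent type enum in 8.3.
type EventType string

const (
	EventLog    EventType = "Log"
	EventManual EventType = "Manual"
	EventChat   EventType = "Chat"
)

// TimelineEvent mirrors 8.3, minus the id field.
type TimelineEvent struct {
	IncidentID string
	Type       EventType
	Content    string
	OffsetMs   int64
}

// Promote converts a log line into a timeline event. OffsetMs is measured
// from windowStart (an assumption of this sketch); a log line that precedes
// the window yields a negative offset.
func Promote(incidentID string, windowStart, logTime time.Time, message string) TimelineEvent {
	return TimelineEvent{
		IncidentID: incidentID,
		Type:       EventLog,
		Content:    message,
		OffsetMs:   logTime.Sub(windowStart).Milliseconds(),
	}
}

// mustTime is a demo helper that parses an RFC 3339 timestamp or panics.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339Nano, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	start := mustTime("2026-01-10T14:00:00Z")
	ev := Promote("inc-1", start, mustTime("2026-01-10T14:00:01.250Z"), "upstream timeout")
	fmt.Printf("%+v\n", ev)
}
```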

8.4. ActionItem

  • id: UUID
  • external_provider: Enum (Jira, GitHub)
  • external_id: String
  • assignee_email: String
  • sync_status: Enum (In-Sync, Error, Pending)

9. API Specification (Key Endpoints)

POST /api/v1/incidents/{id}/logs/upload

  • Request: Multipart form data (Log file).
  • Action: Triggers Go worker to parse, mask PII, and stream to ClickHouse.
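
The parse, mask, and stream steps of the upload worker can be sketched as a line-by-line pipeline that never holds the whole file in memory. This is a minimal sketch using only the standard library; the Ingest signature, the sink callback standing in for the ClickHouse writer, and the 1 MiB per-line cap are all assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Ingest reads r line by line, applies mask to each line, and hands the
// result to sink (a stand-in for the ClickHouse writer). It returns the
// number of lines processed.
func Ingest(r io.Reader, mask func(string) string, sink func(string) error) (int, error) {
	sc := bufio.NewScanner(r)
	// Raise the default 64 KiB token limit so long log lines do not abort
	// the scan. The 1 MiB cap is an arbitrary choice for this sketch.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	n := 0
	for sc.Scan() {
		if err := sink(mask(sc.Text())); err != nil {
			return n, fmt.Errorf("sink line %d: %w", n+1, err)
		}
		n++
	}
	return n, sc.Err()
}

// ingestString is a demo helper that runs Ingest over an in-memory string
// and collects the masked lines.
func ingestString(input string, mask func(string) string) ([]string, error) {
	var out []string
	_, err := Ingest(strings.NewReader(input), mask, func(s string) error {
		out = append(out, s)
		return nil
	})
	return out, err
}

func main() {
	redact := func(s string) string { return strings.ReplaceAll(s, "hunter2", "[REDACTED]") }
	lines, err := ingestString("password=hunter2\nhealth check ok\n", redact)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Because Ingest accepts any io.Reader, the same function could consume the multipart file part directly in the upload handler.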

GET /api/v1/incidents/{id}/timeline

  • Response: Chronological list of TimelineEvents with millisecond offsets.
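
One possible shape for the timeline response is a JSON array sorted by offset. The sketch below shows the ordering step and the encoding; the JSON field names and the TimelineJSON helper are assumptions, since the PRD does not fix a response schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// TimelineEvent carries the response fields; the JSON names are assumptions.
type TimelineEvent struct {
	ID       string `json:"id"`
	Type     string `json:"type"`
	Content  string `json:"content"`
	OffsetMs int64  `json:"offset_ms"`
}

// SortChronologically orders events by offset in place. SliceStable keeps
// the original relative order of events that share the same millisecond.
func SortChronologically(events []TimelineEvent) {
	sort.SliceStable(events, func(i, j int) bool {
		return events[i].OffsetMs < events[j].OffsetMs
	})
}

// TimelineJSON returns the events as a chronologically ordered JSON array.
// It sorts a copy, so the caller's slice is left untouched, and it encodes
// an empty timeline as [] rather than null.
func TimelineJSON(events []TimelineEvent) (string, error) {
	sorted := make([]TimelineEvent, len(events))
	copy(sorted, events)
	SortChronologically(sorted)
	b, err := json.Marshal(sorted)
	if err != nil {
		return "", fmt.Errorf("encode timeline: %w", err)
	}
	return string(b), nil
}

func main() {
	out, err := TimelineJSON([]TimelineEvent{
		{ID: "b", Type: "Manual", Content: "rollback started", OffsetMs: 900},
		{ID: "a", Type: "Log", Content: "upstream timeout", OffsetMs: 120},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```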

PATCH /api/v1/action-items/{id}/sync

  • Request: { "status": "closed" }
  • Action: Updates Jira/GitHub ticket and reflects change in Postmortem Pro.
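
The sync endpoint has to update the external ticket and record the outcome locally. The sketch below shows one way to tie the request body to the ActionItem sync_status values. The Provider interface, the fakeProvider stub, the local Status field, and the accepted status values ("open" and "closed") are assumptions; real Jira or GitHub calls would sit behind the interface.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// SyncStatus mirrors the sync_status enum in 8.4.
type SyncStatus string

const (
	SyncInSync  SyncStatus = "In-Sync"
	SyncError   SyncStatus = "Error"
	SyncPending SyncStatus = "Pending"
)

// Provider is a hypothetical abstraction over the Jira and GitHub clients.
type Provider interface {
	SetStatus(externalID, status string) error
}

// ActionItem is trimmed to the fields this sketch needs.
type ActionItem struct {
	ExternalID string
	Status     string
	SyncStatus SyncStatus
}

type syncRequest struct {
	Status string `json:"status"`
}

// ApplySync decodes the PATCH body, pushes the status to the provider, and
// records the outcome in SyncStatus.
func ApplySync(item *ActionItem, body []byte, p Provider) error {
	var req syncRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return fmt.Errorf("decode body: %w", err)
	}
	if req.Status != "open" && req.Status != "closed" {
		return fmt.Errorf("unsupported status %q", req.Status)
	}
	item.SyncStatus = SyncPending
	if err := p.SetStatus(item.ExternalID, req.Status); err != nil {
		item.SyncStatus = SyncError
		return fmt.Errorf("provider update: %w", err)
	}
	item.Status = req.Status
	item.SyncStatus = SyncInSync
	return nil
}

// fakeProvider stands in for a real client in this sketch.
type fakeProvider struct{ fail bool }

func (f fakeProvider) SetStatus(externalID, status string) error {
	if f.fail {
		return errors.New("provider unavailable")
	}
	return nil
}

func main() {
	item := &ActionItem{ExternalID: "42", Status: "open"}
	err := ApplySync(item, []byte(`{ "status": "closed" }`), fakeProvider{})
	fmt.Println(item.Status, item.SyncStatus, err)
}
```

Leaving the local status unchanged when the provider call fails keeps the document from claiming a state that the external tracker does not hold.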

10. UI/UX Requirements

  • Dark Mode First: Designed for SRE "War Room" environments.
  • Density Controls: Allow users to toggle between "Compact" and "Spaced" log views.
  • The "Split View": Left pane contains the raw log explorer; right pane contains the collaborative document. Dragging from left to right creates a timeline event.

11. Non-Functional Requirements

  • Performance: Log search results should return in <200ms for 100M+ rows (leveraging ClickHouse Inverted Index).
  • Security: AES-256 encryption at rest; SOC 2-compliant audit logging for all PII access.
  • Availability: 99.9% uptime; regional failover for the log ingestion pipeline.

12. Out of Scope

  • Real-time alerting and monitoring (Datadog/PagerDuty replacement).
  • Automated incident resolution (AI-driven auto-remediation).
  • Native mobile applications (Web-first).

13. Risks & Mitigations

  • Risk: High log volume crashing the ingestion engine.
    • Mitigation: Implement Kafka/MSK as a buffer and rely on the container-aware GOMAXPROCS default introduced in Go 1.25 so ingestion workers size themselves to their CPU limits as compute scales.
  • Risk: PII leakage in the postmortem report.
    • Mitigation: Two-tier masking (application-level via masq + ingestion-level via OTel processors).

14. Implementation Tasks

Phase 1: Project Setup & Core Infra

  • [ ] Initialize Next.js 16.1.2 project with TypeScript and Turbopack.
  • [ ] Setup Go 1.25.6 backend service with slog and masq integration.
  • [ ] Deploy ClickHouse on EKS using Karpenter and S3 Express One Zone.
  • [ ] Configure Clerk/Auth0 for SSO-based authentication.

Phase 2: Log Ingestion & Storage

  • [ ] Build Go-based log parser with Gitleaks secret detection library.
  • [ ] Implement ClickHouse schema with Inverted Index V2 for full-text search.
  • [ ] Create UI for file upload and raw log explorer (virtualized list).

Phase 3: Timeline & Editor

  • [ ] Integrate @cyca/react-timeline-editor for the builder UI.
  • [ ] Implement Tiptap collaborative editor with Liveblocks sync.
  • [ ] Build the "Drag-to-Promote" interaction between log explorer and timeline.

Phase 4: External Integrations

  • [ ] Implement bi-directional Jira sync using go-atlassian.
  • [ ] Implement GitHub Issues sync using Octokit Go SDK.
  • [ ] Build Slack/Teams event listener for ambient context capture.

Phase 5: Polishing & Export

  • [ ] Implement PDF and Markdown export functionality.
  • [ ] Add RCA templates (Five Whys, Fishbone, Blast Radius).
  • [ ] Final security audit and PII masking verification.