Original Idea
Incident Postmortem Builder: A web app that builds timelines from logs and turns them into actionable follow-ups.
Product Requirements Document (PRD): Postmortem Pro
1. Executive Summary
Postmortem Pro is a specialized web application designed to transform the chaotic aftermath of an engineering incident into a structured, insightful, and actionable postmortem report. By automating log ingestion, providing an interactive "drag-and-drop" timeline builder, and facilitating real-time collaborative analysis, the platform reduces the time-to-report while ensuring critical follow-up tasks are never forgotten. It bridges the gap between raw infrastructure data (logs/traces) and human narrative (RCA).
2. Problem Statement
Engineering teams struggle with "Incident Archaeology"—the manual, error-prone process of hunting through fragmented logs across multiple platforms to reconstruct what happened. This leads to:
- Delayed Root Cause Analysis (RCA): Memories fade while reports sit in drafts.
- Context Loss: Critical Slack/Teams discussions are lost in the scroll-back.
- Lack of Accountability: Follow-up actions created in docs are rarely synced to Jira or GitHub, leading to recurring incidents.
3. Goals & Success Metrics
Business Goals
- Reduce the average time to complete a postmortem report by 50%.
- Increase the completion rate of postmortem follow-up actions.
- Standardize the quality of RCA across the engineering organization.
Success Metrics (KPIs)
- Mean Time to Report (MTTRp): Time from incident resolution to "Published" report status.
- Action Item Sync Rate: % of postmortem action items successfully exported to Jira/GitHub.
- Log-to-Timeline Conversion: Number of log entries promoted to timeline events per incident.
- Collaboration Depth: Average number of concurrent editors per report.
4. User Personas
- Incident Commander (IC): Needs to quickly build a high-level narrative and delegate RCA sections.
- Site Reliability Engineer (SRE): Needs to ingest massive log files and identify the exact millisecond of failure.
- Engineering Manager (EM): Needs to track high-level trends and ensure action items are assigned and funded.
- Platform Engineer: Needs to understand infrastructure-wide impact (Blast Radius).
5. User Stories
- As an SRE, I want to upload a 500MB log file so that I can filter for errors and drag them into a chronological timeline.
- As an Incident Commander, I want to pull in a Slack thread so that the key decisions made during the heat of the moment are preserved.
- As a Manager, I want a "Five Whys" template so that my team performs deep root cause analysis rather than surface-level fixes.
- As a Developer, I want my assigned follow-up task to automatically appear in my GitHub Issues so I don't have to check a separate tool.
6. Functional Requirements
6.1. Log Management & Ingestion
- Automated Ingestion: CLI-based upload or direct File Upload (CSV, JSON, Plain Text).
- Masking Engine: Automated PII and secret redaction using Go-based `masq` and `gitleaks` patterns.
- Search & Filter: High-speed querying of ingested logs via a ClickHouse inverted index.
6.2. Interactive Timeline Builder
- Visual Interface: A millisecond-accurate timeline using `@cyca/react-timeline-editor`.
- Event Promotion: One-click "Promote to Timeline" from raw log entries.
- Annotation: Ability to add manual notes, images, or "Decision" markers to any point on the timeline.
6.3. Collaborative Editor
- Real-time Editing: Google Docs-style multi-user editing using Tiptap and Liveblocks.
- RCA Templates: Pre-built blocks for "Five Whys," "Fishbone (Ishikawa)," and "Blast Radius."
- Contextual Mentions: @mention team members and #link to infrastructure components.
6.4. Integration & Sync
- Bi-directional Task Sync: Create Jira/GitHub issues from the doc; update the doc when the issue status changes.
- Communication Capture: Extract context from Slack/Teams using the 2026 "Agentic Context" pattern (Teams Workflows App).
7. Technical Requirements
7.1. Tech Stack (2026 Standard)
- Frontend: Next.js 16.1.2 with TypeScript, Tailwind CSS, and Turbopack for builds. Use the `use cache` directive for data fetching.
- Backend: Go 1.25.6 (utilizing the "Green Tea" GC for low-latency log processing).
- Database (Analytics): ClickHouse on AWS EKS, using S3 Express One Zone for the hot tier.
- Database (Transactional): PostgreSQL 17 (RDS).
- Real-time Sync: Yjs with Liveblocks for document CRDTs.
- Infrastructure: AWS EKS with Karpenter for Graviton4 (`r8g`) node orchestration.
7.2. Integration Specifics
- GitHub: Octokit Go SDK (OpenAPI-generated).
- Jira: `go-atlassian` library for the Jira Cloud v3 API.
- Messaging: Microsoft Teams Workflows App (Power Automate) to replace legacy connectors.
8. Data Model
8.1. Incident
- `id`: UUID
- `title`: String
- `severity`: Enum (P0, P1, P2)
- `status`: Enum (Draft, Published, Archived)
- `window_start`: Timestamp
- `window_end`: Timestamp
8.2. LogEntry (ClickHouse)
- `timestamp`: DateTime64(3)
- `service_name`: LowCardinality(String)
- `level`: Enum8
- `message`: String (inverted-indexed)
- `metadata`: Map(String, String)
8.3. TimelineEvent
- `id`: UUID
- `incident_id`: UUID (FK)
- `type`: Enum (Log, Manual, Chat)
- `content`: Text
- `offset_ms`: Int64
8.4. ActionItem
- `id`: UUID
- `external_provider`: Enum (Jira, GitHub)
- `external_id`: String
- `assignee_email`: String
- `sync_status`: Enum (In-Sync, Error, Pending)
9. API Specification (Key Endpoints)
POST /api/v1/incidents/{id}/logs/upload
- Request: Multipart form data (Log file).
- Action: Triggers Go worker to parse, mask PII, and stream to ClickHouse.
GET /api/v1/incidents/{id}/timeline
- Response: Chronological list of `TimelineEvent`s with millisecond offsets.
PATCH /api/v1/action-items/{id}/sync
- Request: `{ "status": "closed" }`
- Action: Updates the Jira/GitHub ticket and reflects the change in Postmortem Pro.
10. UI/UX Requirements
- Dark Mode First: Designed for SRE "War Room" environments.
- Density Controls: Allow users to toggle between "Compact" and "Spaced" log views.
- The "Split View": Left pane contains the raw log explorer; right pane contains the collaborative document. Dragging from left to right creates a timeline event.
11. Non-Functional Requirements
- Performance: Log search results should return in <200ms for 100M+ rows (leveraging ClickHouse Inverted Index).
- Security: AES-256 encryption at rest; SOC2-compliant audit logging for all PII access.
- Availability: 99.9% uptime; regional failover for the log ingestion pipeline.
12. Out of Scope
- Real-time alerting and monitoring (Datadog/PagerDuty replacement).
- Automated incident resolution (AI-driven auto-remediation).
- Native mobile applications (Web-first).
13. Risks & Mitigations
- Risk: High log volume crashing the ingestion engine.
- Mitigation: Implement Kafka/MSK as a buffer and use Go 1.25's container-aware runtime to scale compute.
- Risk: PII leakage in the postmortem report.
- Mitigation: Two-tier masking (application-level via `masq` + ingestion-level via OTel processors).
14. Implementation Tasks
Phase 1: Project Setup & Core Infra
- [ ] Initialize Next.js 16.1.2 project with TypeScript and Turbopack.
- [ ] Set up Go 1.25.6 backend service with `slog` and `masq` integration.
- [ ] Deploy ClickHouse on EKS using Karpenter and S3 Express One Zone.
- [ ] Configure Clerk/Auth0 for SSO-based authentication.
Phase 2: Log Ingestion & Storage
- [ ] Build Go-based log parser with Gitleaks secret detection library.
- [ ] Implement ClickHouse schema with Inverted Index V2 for full-text search.
- [ ] Create UI for file upload and raw log explorer (virtualized list).
Phase 3: Timeline & Editor
- [ ] Integrate `@cyca/react-timeline-editor` for the builder UI.
- [ ] Implement Tiptap collaborative editor with Liveblocks sync.
- [ ] Build the "Drag-to-Promote" interaction between log explorer and timeline.
Phase 4: External Integrations
- [ ] Implement bi-directional Jira sync using `go-atlassian`.
- [ ] Implement GitHub Issues sync using the Octokit Go SDK.
- [ ] Build Slack/Teams event listener for ambient context capture.
Phase 5: Polishing & Export
- [ ] Implement PDF and Markdown export functionality.
- [ ] Add RCA templates (Five Whys, Fishbone).
- [ ] Final security audit and PII masking verification.