Product Requirements Document (PRD): FlagForge

1. Executive Summary

FlagForge is a high-performance, developer-first Feature Flagging and Toggle Management platform. By decoupling code deployments from feature releases, FlagForge enables engineering teams to perform low-risk canary rollouts, targeted user segmentation, and real-time kill-switch actions. The system is designed for ultra-low latency (<20ms evaluation) and enterprise-grade governance through immutable audit logs and multi-tenant RBAC.

2. Problem Statement

Modern engineering teams face high risks during production deployments because code changes and feature activations are coupled. Currently, teams struggle with:

Deployment Anxiety: Inability to "turn off" a buggy feature without a full rollback.
Rigid Targeting: Hardcoded logic for beta testers or regional rollouts requires code redeployments.
Lack of Visibility: No central source of truth or audit trail for who changed what configuration and when.
Performance Bottlenecks: Existing solutions often add significant latency to application load times.

3. Goals & Success Metrics

Goal 1: Sub-20ms flag evaluation latency at the edge.
Goal 2: 99.99% availability for the evaluation API.
Goal 3: 100% traceability for configuration changes via immutable logs.
Metrics:
- MTTR (Mean Time To Recovery): Reduced by 50% relative to the pre-adoption baseline, using real-time kill switches.
- Deployment Frequency: Increased by 3x by allowing dark launches.
- SDK Overhead: Less than 5 KB added to client bundles.
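For scale, the 99.99% availability goal corresponds to the following downtime budget for the evaluation API. This is plain arithmetic, and it assumes availability is measured as the fraction of wall-clock time the API is up.

```latex
\begin{aligned}
\text{budget}_{\text{year}}    &= (1 - 0.9999) \times 365.25 \times 24 \times 60\ \text{min} \approx 52.6\ \text{min} \\
\text{budget}_{\text{30 days}} &= (1 - 0.9999) \times 30 \times 24 \times 60\ \text{min} = 4.32\ \text{min}
\end{aligned}
```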

4. User Personas

Software Engineer (Devin): Wants easy-to-integrate SDKs and local caching so the app doesn't slow down.
Product Manager (Patty): Wants to toggle features for specific "Beta" users without asking engineers for help.
SRE/DevOps (Sam): Needs to see audit logs during an incident and perform percentage-based rollouts to monitor system health.
QA Engineer (Quincy): Needs to force-enable specific flag variants for testing purposes in staging environments.

5. User Stories

As an Engineer, I want to define a flag in the console so that I can wrap my new code in a conditional block before it is finished.
As a PM, I want to roll out a feature to 10% of users in the US so that I can measure impact before a full release.
As an SRE, I want an immutable audit log of all changes so that I can identify the root cause of a configuration-induced outage.
As an Admin, I want to require a "four-eyes" approval for any production flag change to prevent accidental "kill-switch" triggers.
As a Developer, I want real-time updates via SSE so that my application reacts to flag changes instantly without polling.

6. Functional Requirements

6.1 Flag Management

Flag Types: Support for Boolean (On/Off) and Multivariate (Strings, JSON, Numbers).
Environments: Default support for Development, Staging, and Production with unique API keys.
Kill Switch: Global override to disable any flag instantly across all segments.

6.2 Targeting & Rollouts

Attribute-Based Rules: Target users by email, region, plan type, or custom attributes.
Percentage Rollouts: Deterministic canary releases using MurmurHash3 to ensure user stickiness.
Scheduled Toggles: Ability to schedule a flag to turn on/off at a specific UTC timestamp.
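The deterministic percentage rollout can be sketched as follows. This is a minimal TypeScript sketch under several assumptions:

- It uses MurmurHash3 x86 32-bit with seed 0.
- Each user is bucketed by a hypothetical key of the form flagKey:userId.
- There are 100 buckets.
- The helper names rolloutBucket and inRollout are illustrative only.

The production seed, key format, and bucket granularity are not specified in this document. If more than one component computes buckets (for example the Go evaluation service and the edge logic), all of them must use the identical hash variant, seed, and key encoding. Otherwise users will flip between variants.

```typescript
// Minimal MurmurHash3 (x86, 32-bit) over the UTF-8 bytes of `key`.
function murmur3_32(key: string, seed = 0): number {
  const data = new TextEncoder().encode(key);
  const c1 = 0xcc9e2d51;
  const c2 = 0x1b873593;
  const len = data.length;
  const nBlocks = len >>> 2; // number of whole 4-byte blocks
  let h1 = seed >>> 0;

  // Body: mix each little-endian 4-byte block into h1.
  for (let i = 0; i < nBlocks; i++) {
    const o = i * 4;
    let k1 = data[o] | (data[o + 1] << 8) | (data[o + 2] << 16) | (data[o + 3] << 24);
    k1 = Math.imul(k1, c1);
    k1 = (k1 << 15) | (k1 >>> 17);
    k1 = Math.imul(k1, c2);
    h1 ^= k1;
    h1 = (h1 << 13) | (h1 >>> 19);
    h1 = (Math.imul(h1, 5) + 0xe6546b64) | 0;
  }

  // Tail: mix the remaining 1 to 3 bytes, if any.
  const tail = nBlocks * 4;
  const rem = len & 3;
  if (rem > 0) {
    let k1 = 0;
    if (rem === 3) k1 ^= data[tail + 2] << 16;
    if (rem >= 2) k1 ^= data[tail + 1] << 8;
    k1 ^= data[tail];
    k1 = Math.imul(k1, c1);
    k1 = (k1 << 15) | (k1 >>> 17);
    k1 = Math.imul(k1, c2);
    h1 ^= k1;
  }

  // Finalization: fold in the length, then the fmix32 avalanche step.
  h1 ^= len;
  h1 ^= h1 >>> 16;
  h1 = Math.imul(h1, 0x85ebca6b);
  h1 ^= h1 >>> 13;
  h1 = Math.imul(h1, 0xc2b2ae35);
  h1 ^= h1 >>> 16;
  return h1 >>> 0; // unsigned 32-bit result
}

// Hypothetical key format "flagKey:userId"; 100 buckets give 1% granularity.
function rolloutBucket(flagKey: string, userId: string): number {
  return murmur3_32(`${flagKey}:${userId}`) % 100;
}

// A user is in the rollout when their fixed bucket is below the percentage.
function inRollout(flagKey: string, userId: string, percentage: number): boolean {
  return rolloutBucket(flagKey, userId) < percentage;
}
```

A user's bucket never changes, so raising a rollout from 10% to 25% only adds users; nobody already exposed is removed. Including the flag key in the hashed string keeps bucket assignments independent across flags.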

6.3 Governance & Security

Audit Logs: Immutable record of actor, action, timestamp, old_state, and new_state.
Four-Eyes Approval: Optional workflow requiring a second authorized user to approve production changes.
RBAC: Define roles (Admin, Editor, Viewer) at the Project and Environment levels.
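The governance rules above can be expressed as a pure authorization check. This minimal sketch rests on three assumptions:

- An environment-level role, when present, overrides the project-level role.
- Only Admins and Editors may approve a change.
- The approver must be a different user from the requester.

The names Membership, ChangeRequest, effectiveRole, and canApprove are hypothetical, not a defined FlagForge API.

```typescript
type Role = "Admin" | "Editor" | "Viewer";

// A user's role assignments; environment-level roles are keyed by environment name.
interface Membership {
  userId: string;
  projectRole?: Role;
  environmentRoles: Record<string, Role>;
}

// A pending change that needs a second approver.
interface ChangeRequest {
  requestedBy: string; // userId of the change author
  environment: string; // e.g. "Production"
}

// Assumption: an environment-level role overrides the project-level role.
function effectiveRole(m: Membership, environment: string): Role | undefined {
  return m.environmentRoles[environment] ?? m.projectRole;
}

// Four-eyes check: a second, distinct user with write access must approve.
function canApprove(cr: ChangeRequest, approver: Membership): boolean {
  if (approver.userId === cr.requestedBy) return false; // no self-approval
  const role = effectiveRole(approver, cr.environment);
  return role === "Admin" || role === "Editor";
}
```

Keeping the check free of I/O lets the API layer enforce it on every approval and makes the rule easy to unit-test.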

7. Technical Requirements

7.1 Tech Stack (2026 Standards)

Frontend: React v19.2.1 (utilizing Server Components and the use() API), Tailwind CSS v4.x (Oxide engine), TanStack Query v5.90.19.
Backend: Go (Golang) v1.26 (utilizing encoding/json/v2 for lower-allocation JSON parsing and the "Green Tea" GC).
Database:
- PostgreSQL: Primary store for configuration and Audit Logs (using Row-Level Security).
- Redis: Low-latency caching and Pub/Sub for real-time flag invalidation.
Auth/Multi-tenancy: Clerk (Organization Management for B2B multi-tenancy).

7.2 Architecture Patterns

Real-time Updates: Server-Sent Events (SSE) over HTTP/3 for unidirectional push.
Deterministic Hashing: murmur3 for consistent bucket assignment in canary rollouts.
Edge Delivery: CloudFront Functions run the evaluation logic at the edge in <1ms, reading flag definitions from CloudFront KeyValueStore.

8. Data Model

9. API Specification

9.1 Evaluation API (SDK Consumption)

POST /v1/evaluate

Request: { "flagKey": "new-ui", "context": { "userId": "123", "region": "us-east-1" } }
Response: { "value": true, "variant": "on", "reason": "rule_match" }
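The reason field implies a fixed evaluation order. The sketch below shows one plausible ordering, written in TypeScript for illustration; the evaluator itself is specified in Go. The steps run in this order:

1. Kill switch.
2. Scheduled toggle.
3. Attribute rules, first match wins.
4. Percentage rollout.
5. Default.

Only rule_match appears in the specification above. The other reason codes (killed, scheduled_off, rollout, default), the rule and flag shapes, and all type names are hypothetical. The deterministic bucketing function (MurmurHash3-based per Section 6.2) is passed in as a parameter so the example stays self-contained.

```typescript
type Value = boolean | string | number | Record<string, unknown>;
type Context = { userId: string } & Record<string, string>;

interface Rule {
  attribute: string;         // e.g. "region" or "email"
  op: "equals" | "endsWith"; // hypothetical operator set
  operand: string;
  variant: string;           // variant served on match
}

interface Flag {
  key: string;
  killSwitch: boolean;
  enabledFrom?: string;      // scheduled-on time, ISO 8601 UTC
  rules: Rule[];
  rollout?: { percentage: number; variant: string };
  variants: Record<string, Value>;
  offVariant: string;
  defaultVariant: string;
}

interface EvaluateResponse {
  value: Value;
  variant: string;
  reason: "killed" | "scheduled_off" | "rule_match" | "rollout" | "default";
}

function evaluate(
  flag: Flag,
  ctx: Context,
  bucket: (flagKey: string, userId: string) => number, // expected range 0..99
  now: Date = new Date(),
): EvaluateResponse {
  const serve = (variant: string, reason: EvaluateResponse["reason"]): EvaluateResponse => ({
    value: flag.variants[variant],
    variant,
    reason,
  });

  // 1. The kill switch overrides everything else.
  if (flag.killSwitch) return serve(flag.offVariant, "killed");

  // 2. Scheduled toggle: the flag stays off until the configured UTC timestamp.
  if (flag.enabledFrom && now.getTime() < Date.parse(flag.enabledFrom)) {
    return serve(flag.offVariant, "scheduled_off");
  }

  // 3. Attribute rules, evaluated in order; the first match wins.
  for (const rule of flag.rules) {
    const actual = ctx[rule.attribute];
    if (actual === undefined) continue;
    const matched = rule.op === "equals" ? actual === rule.operand : actual.endsWith(rule.operand);
    if (matched) return serve(rule.variant, "rule_match");
  }

  // 4. Deterministic percentage rollout.
  if (flag.rollout && bucket(flag.key, ctx.userId) < flag.rollout.percentage) {
    return serve(flag.rollout.variant, "rollout");
  }

  // 5. Fallthrough.
  return serve(flag.defaultVariant, "default");
}
```

In this simplified model, rules and the rollout are separate stages. Targeting a percentage within a segment, as in the "10% of users in the US" story, would need rules that carry their own rollout percentage.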

9.2 Management API (Console)

PATCH /v1/flags/:id/target

Request: { "rollout_percentage": 25, "rules": [...] }
Security: Requires a valid Clerk JWT and Editor permissions in the target environment.
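Before persisting a targeting change, the Management API needs to validate the request body. This minimal sketch makes the following assumptions:

- Both fields are optional in a PATCH, but any field that is present must be valid.
- rollout_percentage is an integer from 0 to 100.
- rules is an array. The element schema is elided in the specification above, so it is not checked here.

The function name and error messages are hypothetical.

```typescript
// Returns a list of validation errors; an empty list means the body is acceptable.
function validateTargetPatch(body: unknown): string[] {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return ["body must be a JSON object"];
  }
  const b = body as Record<string, unknown>;
  const errors: string[] = [];

  if ("rollout_percentage" in b) {
    const pct = b.rollout_percentage;
    if (typeof pct !== "number" || !Number.isInteger(pct) || pct < 0 || pct > 100) {
      errors.push("rollout_percentage must be an integer between 0 and 100");
    }
  }

  // Element schema for rules is not specified; only the container type is checked.
  if ("rules" in b && !Array.isArray(b.rules)) {
    errors.push("rules must be an array");
  }
  return errors;
}
```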

10. UI/UX Requirements

Dashboard: A list view of flags with status indicators (Active, Inactive, Stale).
Rule Builder: A "natural language" style UI for creating rules (e.g., "If email ends with @company.com → show Variant A").
Diff View: When requesting an approval, show a JSON/Visual diff of the rule changes.
Performance Metrics: Small sparklines in the UI showing the evaluation frequency of each flag.
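The Diff View and the audit log's old_state/new_state pair both need a structural diff. This minimal sketch assumes that states are plain JSON values and that a flat list of changed paths is enough for display. Arrays are compared element by element by index, so an insertion shows up as a change at every later index. The Change shape and the function names are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface Change {
  path: string;  // e.g. "rules[0].operand"
  before?: Json; // undefined when the value was added
  after?: Json;  // undefined when the value was removed
}

function isObject(v: Json | undefined): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Recursively collects leaf-level differences between two JSON values.
function diffJson(before: Json | undefined, after: Json | undefined, path = ""): Change[] {
  const changes: Change[] = [];
  if (Array.isArray(before) && Array.isArray(after)) {
    const n = Math.max(before.length, after.length);
    for (let i = 0; i < n; i++) {
      changes.push(...diffJson(before[i], after[i], `${path}[${i}]`));
    }
    return changes;
  }
  if (isObject(before) && isObject(after)) {
    const b = before;
    const a = after;
    // Union of keys: all keys of `before`, plus keys that exist only in `after`.
    const keys = Object.keys(b).concat(Object.keys(a).filter((k) => !(k in b)));
    for (const k of keys) {
      changes.push(...diffJson(b[k], a[k], path ? `${path}.${k}` : k));
    }
    return changes;
  }
  // Leaves, or a type change (e.g. object to array): compare serialized forms.
  if (JSON.stringify(before) === JSON.stringify(after)) return changes;
  changes.push({ path: path || "(root)", before, after });
  return changes;
}
```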

11. Non-Functional Requirements

Latency: Evaluation endpoint P99 < 20ms.
Scalability: Support up to 100,000 concurrent SSE connections per Go node.
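Immutability: Audit logs must be append-only. PostgreSQL RLS must define no UPDATE or DELETE policies on the audit table (with FORCE ROW LEVEL SECURITY so the table owner is also bound), and the application role should be granted only INSERT and SELECT.
SDK Safety: SDKs must fail open, returning the default value, if the API is unreachable.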
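Fail-open behavior can be sketched as a wrapper around the evaluation call. This is a minimal TypeScript sketch, not the FlagForge SDK API, and it rests on these assumptions:

- getFlag is a hypothetical helper.
- The Bearer authorization header format is hypothetical.
- The timeout is 200 ms.
- fetchFn is injectable and defaults to the global fetch, so the failure path can be exercised in tests.

Any of the following returns the caller's default value: a network error, a timeout, a non-2xx status, or a body whose value has a different JSON type than the default.

```typescript
// Minimal subset of fetch that this sketch relies on.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

async function getFlag<T>(
  baseUrl: string,
  apiKey: string,
  flagKey: string,
  context: Record<string, string>,
  defaultValue: T,
  fetchFn: FetchLike = fetch,
  timeoutMs = 200,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchFn(`${baseUrl}/v1/evaluate`, {
      method: "POST",
      // Hypothetical auth scheme: environment-scoped API key as a bearer token.
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
      body: JSON.stringify({ flagKey, context }),
      signal: controller.signal,
    });
    if (!res.ok) return defaultValue;
    const body = (await res.json()) as { value?: unknown };
    // Assumption: a value whose JSON type differs from the default is treated as malformed.
    return typeof body?.value === typeof defaultValue ? (body.value as T) : defaultValue;
  } catch {
    return defaultValue; // network error, timeout (abort), or invalid JSON
  } finally {
    clearTimeout(timer);
  }
}
```

Usage: const showNewUi = await getFlag(baseUrl, apiKey, "new-ui", { userId: "123" }, false); evaluates to false whenever the API cannot answer in time.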

12. Out of Scope

Native Mobile SDKs (Swift/Kotlin) for Phase 1 (Web SDK only).
Automatic flag cleanup (detecting unused code in GitHub).
Advanced ML-driven automated rollbacks (manual kill-switch only for MVP).

13. Risks & Mitigations

Risk: Flag evaluation adds latency to client apps.
- Mitigation: Local in-memory caching in SDKs + SSE for background updates.
Risk: Database load from millions of SDK calls.
- Mitigation: Redis cache layer + CloudFront KeyValueStore for edge evaluation.
Risk: Accidental production outages via misconfiguration.
- Mitigation: Mandatory "Four-Eyes" approval and environment-scoped API keys.
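The SDK-side mitigation (in-memory caching plus background updates) can be sketched as a small stale-while-revalidate cache. This minimal sketch assumes a hypothetical FlagCache class, a caller-supplied load function that fetches one flag value, and a 30-second freshness window. It behaves as follows:

- Once a value is cached, reads never wait on the network.
- A stale entry triggers at most one background refresh.
- applyUpdate is the hook an SSE message handler would call when the server pushes a change.

```typescript
interface Entry<T> {
  value: T;
  fetchedAt: number; // ms timestamp of the last successful load or push
}

class FlagCache<T> {
  private entries: Record<string, Entry<T>> = {};
  private inFlight: Record<string, Promise<void>> = {};
  private load: (flagKey: string) => Promise<T>;
  private maxAgeMs: number;
  private now: () => number;

  constructor(load: (flagKey: string) => Promise<T>, maxAgeMs = 30000, now: () => number = Date.now) {
    this.load = load;
    this.maxAgeMs = maxAgeMs;
    this.now = now;
  }

  // Returns the cached value immediately (possibly stale) and refreshes it in the
  // background when stale. On a cold miss, returns defaultValue and starts a load.
  get(flagKey: string, defaultValue: T): T {
    const entry = this.entries[flagKey];
    if (!entry || this.now() - entry.fetchedAt > this.maxAgeMs) {
      void this.revalidate(flagKey);
    }
    return entry ? entry.value : defaultValue;
  }

  // Hook for pushed updates. Hypothetical browser wiring (endpoint and payload are assumptions):
  //   const es = new EventSource("/v1/stream");
  //   es.onmessage = (e) => { const { flagKey, value } = JSON.parse(e.data); cache.applyUpdate(flagKey, value); };
  applyUpdate(flagKey: string, value: T): void {
    this.entries[flagKey] = { value, fetchedAt: this.now() };
  }

  // Starts at most one background load per flag; a failed load keeps the stale value.
  revalidate(flagKey: string): Promise<void> {
    const pending = this.inFlight[flagKey];
    if (pending) return pending;
    const done = () => {
      delete this.inFlight[flagKey];
    };
    const p = this.load(flagKey).then(
      (value) => {
        this.applyUpdate(flagKey, value);
        done();
      },
      () => done(), // load failed: keep serving whatever is cached
    );
    this.inFlight[flagKey] = p;
    return p;
  }
}
```

This sketch omits one ordering case: a background load that completes after a newer SSE push would overwrite the pushed value. A production version would need to discard it, for example by comparing a server-side version number.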

14. Implementation Tasks

Phase 1: Project Setup & Core API

[ ] Initialize Go 1.26 Backend with Fiber v3.
[ ] Initialize React 19.2.1 Frontend with Tailwind 4.x.
[ ] Configure Clerk for Multi-tenant Organization management.
[ ] Set up PostgreSQL schema with audit schema isolation.

Phase 2: Flag Evaluation Engine

[ ] Implement MurmurHash3 logic for percentage rollouts in Go.
[ ] Build Rule Evaluation engine (Bexpr or RuleGo).
[ ] Create /v1/evaluate high-concurrency endpoint.
[ ] Implement Redis caching strategy for flag definitions.

Phase 3: Dashboard & Governance

[ ] Build Flag Management UI (CRUD operations).
[ ] Implement "Four-Eyes" approval workflow (Change Requests).
[ ] Build Immutable Audit Log viewer with JSONB diffing.
[ ] Integrate SSE for real-time dashboard updates.

Phase 4: Edge & SDKs

[ ] Develop TypeScript Lightweight SDK with stale-while-revalidate caching.
[ ] Deploy CloudFront Functions for Edge Evaluation logic.
[ ] Implement Redis Pub/Sub to trigger SSE events on flag changes.
[ ] Run end-to-end load tests to verify that P99 evaluation latency stays below 20ms.