VaultCheck CLI

Operations

Original Idea

CLI Backup Verifier: a CLI tool that verifies backups, runs checksum tests, and emails a summary.

Product Requirements Document (PRD): VaultCheck CLI

1. Executive Summary

VaultCheck CLI is a high-performance, Go-based command-line utility designed for DevOps and System Administrators to bridge the "integrity gap" in backup workflows. Unlike standard backup tools that focus on data movement, VaultCheck focuses on data certainty. It performs recursive checksum validation, cross-platform comparison (local to cloud), and provides automated reporting. Built on Go 1.25.6, it leverages modern concurrency patterns and resource throttling to ensure verification does not disrupt production workloads.

2. Problem Statement

The "Backup Paradox" states that a backup is only as good as its last successful restore, yet most organizations verify backups only during a crisis. Current solutions tend to be:

  1. Too manual: Requiring custom scripts that are hard to maintain.
  2. Resource heavy: Saturating Disk IO or Network Bandwidth during business hours.
  3. Incomplete: Failing to account for bit-rot or S3 ETag complexities in cloud storage.

VaultCheck provides a standardized, automated, and performant way to guarantee that backup archives are bit-perfect matches of their source.

3. Goals & Success Metrics

Goals

  • Data Integrity: Ensure 100% detection of bit-rot or truncated files.
  • Operational Efficiency: Reduce manual verification time by 90% via automation.
  • Resource Safety: Prevent system slowdowns using intelligent IO and CPU throttling.

Success Metrics

  • Zero False Negatives: 100% detection rate of modified files in test suites.
  • Performance: Achieve hashing speeds within 10% of theoretical hardware limits (NVMe/Network).
  • Reliability: 99.9% successful completion rate for scheduled CRON tasks.

4. User Personas

  • DevOps Engineer (Jordan): Wants to integrate backup verification into CI/CD pipelines to ensure snapshots are valid before moving to production.
  • System Administrator (Pat): Needs a tool to run nightly on local file servers and receive a Slack summary if corruption is detected.
  • Security Auditor (Alex): Requires a tamper-proof log (SQLite history) showing that data integrity has been checked weekly for compliance (SOC2/ISO27001).

5. User Stories

  • As a DevOps Engineer, I want to define verification profiles in YAML so I can version-control my backup check configurations.
  • As a SysAdmin, I want to limit the tool's IO usage to 10MB/s during work hours so that I don't impact user performance.
  • As an SRE, I want verification metrics pushed to Prometheus so I can alert on "time since last successful verification."
  • As a Security Lead, I want my cloud credentials stored in the OS Keyring rather than plaintext files.

6. Functional Requirements

6.1 Core Verification Engine

  • Recursive Hashing: Support for MD5 (legacy) and SHA-256 (standard).
  • Fast-Path Verification: Option to skip files where mtime and size match the local manifest.
  • Cloud Comparison: Native integration with AWS S3 and Azure Blob Storage to compare local files against cloud objects.
  • Parallel Processing: Concurrent hashing using a bounded worker pool sized from runtime.GOMAXPROCS(0).

6.2 Throttling & Scheduling

  • Bandwidth Limiting: Global and per-profile IO limits, expressed as throughput per second (e.g., --max-io 50MB for 50 MB/s).
  • Production Hour Logic: Dynamic throttling based on time-of-day windows.
  • Exit Code Mapping: Standardized codes (0: Success, 1: Corruption, 2: System Error) for CRON/CI integration.

6.3 Reporting & Notifications

  • Multi-Channel Alerts: Integration with Slack, PagerDuty, and Email (SendGrid).
  • Structured Logging: JSON-based logs using log/slog for ingestion into ELK/Loki.
  • Historical Tracking: Local SQLite database to store the status of the last 1,000 runs.

7. Technical Requirements

7.1 Tech Stack

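  • Language: Go 1.25.6 (Swiss Table maps are the default map implementation since Go 1.24; the Green Tea GC is an opt-in experiment in Go 1.25, enabled at build time with GOEXPERIMENT=greenteagc).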
  • CLI Framework: Cobra v1.10.2.
  • Database: ncruces/go-sqlite3 (WASM-based, CGO-free driver for portability).
  • Secret Management: 99designs/keyring for OS-level credential storage.
  • Concurrency: golang.org/x/sync/errgroup with SetLimit for resource bounding.
  • Logging: Native log/slog with slog.LevelVar for dynamic level control.

7.2 Integrations

  • Cloud: AWS SDK for Go v2, Azure SDK for Go (azblob).
  • Observability: Prometheus Pushgateway for ephemeral metrics.
  • Notifications: nikoksr/notify for multi-channel dispatch.

8. Data Model

8.1 VerificationProfile (YAML)

id: "daily-s3-backup"
source: "/mnt/data/backups"
destination: "s3://my-vault-bucket/archive"
algorithm: "sha256"
throttling:
  max_io_mb: 10
  prod_hours: "09:00-17:00"
notifications:
  channels: ["slack", "email"]

8.2 SQLite Schema

  • verification_runs: id (UUID), profile_id (String), timestamp (DateTime), status (Enum), files_processed (Int), bytes_verified (BigInt), corruption_count (Int).
  • file_manifest: file_path (String), last_hash (String), last_size (BigInt), last_seen (DateTime).
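In Go, the two tables above might map onto types like the following. This is a sketch only: the status values ("ok", "corrupt", "error") are assumptions chosen to line up with the exit codes in section 6.2, and the pruning statement is one possible way to keep only the last 1,000 runs mentioned in section 6.3.

```go
package main

import (
	"fmt"
	"time"
)

// Status is the enum stored in verification_runs.status.
// The concrete values are assumptions, not taken from the PRD.
type Status string

const (
	StatusOK      Status = "ok"
	StatusCorrupt Status = "corrupt"
	StatusError   Status = "error"
)

// ParseStatus validates a status string read from the database.
func ParseStatus(s string) (Status, error) {
	switch st := Status(s); st {
	case StatusOK, StatusCorrupt, StatusError:
		return st, nil
	}
	return "", fmt.Errorf("unknown status %q", s)
}

// VerificationRun mirrors one row of verification_runs.
type VerificationRun struct {
	ID              string // UUID
	ProfileID       string
	Timestamp       time.Time
	Status          Status
	FilesProcessed  int
	BytesVerified   int64
	CorruptionCount int
}

// FileManifest mirrors one row of file_manifest.
type FileManifest struct {
	FilePath string
	LastHash string
	LastSize int64
	LastSeen time.Time
}

// pruneSQL keeps only the 1,000 most recent runs.
const pruneSQL = `DELETE FROM verification_runs
WHERE id NOT IN (
    SELECT id FROM verification_runs ORDER BY timestamp DESC LIMIT 1000
)`
```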

9. API Specification (Internal CLI)

  • vaultcheck verify --profile [name]: Triggers a run.
  • vaultcheck auth set --service [s3|azure]: Prompts for credentials and stores them in Keyring.
  • vaultcheck history --days 7: Returns a table of recent runs.
  • vaultcheck health: Self-test of cloud connectivity and IO permissions.

10. UI/UX Requirements

  • Terminal Output: Use ANSI colors for status (Green: OK, Red: Fail).
  • Progress Visualization: Real-time progress bars showing "Bytes Processed" and "Estimated Time Remaining."
  • Interactive Config: A vaultcheck init wizard to help users build their first YAML profile.
  • Summary Table: A clean ASCII table printed at the end of every manual execution.

11. Non-Functional Requirements

  • Performance: Must use io.CopyBuffer with a 128KB+ buffer to maximize NVMe throughput.
  • Security: Enforce "Read-Only" mode for all source and destination paths to ensure the tool never modifies backup data.
  • Memory Footprint: Maximum 100MB RAM usage regardless of the number of files (streaming IO).
  • Portability: Single binary distribution for Linux, macOS, and Windows via GitHub Releases.

12. Out of Scope

  • Backup Execution: VaultCheck will not perform the actual backup; it only verifies existing ones.
  • Encryption: The tool will not encrypt files; it assumes encryption-at-rest is handled by the storage provider.
  • GUI: No graphical user interface; this is a pure CLI tool.

13. Risks & Mitigations

  • Risk: S3 ETags don't match local MD5s for multipart uploads.
    • Mitigation: Implement rclone-style logic to detect multipart chunks and calculate local ETags accordingly.
  • Risk: High CPU usage during SHA-256 calculation.
    • Mitigation: Use Go's crypto/sha256, which uses hardware acceleration where available (e.g., SHA-NI/AVX2 on amd64, ARMv8 Crypto Extensions on arm64), and enforce worker pool limits.
  • Risk: Network egress costs during cloud verification.
    • Mitigation: Default to "Metadata-Only" verification with an optional flag for "Full Deep-Check."

14. Implementation Tasks

Phase 1: Project Setup

  • [ ] Initialize project with Go 1.25.6
  • [ ] Set up Cobra v1.10.2 for command routing
  • [ ] Configure golangci-lint with strict performance and security rules
  • [ ] Implement slog structured logging with JSON/Text handlers

Phase 2: Core Hashing Engine

  • [ ] Implement errgroup worker pool for concurrent file walking
  • [ ] Build hashing logic using io.CopyBuffer (128KB buffers) and sync.Pool
  • [ ] Create Tier 1 (Metadata) and Tier 2 (Deep Hash) logic
  • [ ] Implement os.OpenRoot (Go 1.24+) for secure directory traversal

Phase 3: State & Security

  • [ ] Integrate ncruces/go-sqlite3 for history and manifest tracking
  • [ ] Implement 99designs/keyring for secure cloud credential storage
  • [ ] Build YAML profile parser and validator

Phase 4: Cloud & Notifications

  • [ ] Add AWS S3 SDK v2 integration for object listing
  • [ ] Add Azure Blob Storage SDK integration
  • [ ] Integrate nikoksr/notify for Slack and SendGrid alerts
  • [ ] Implement Prometheus Pushgateway client for run metrics

Phase 5: Throttling & Polish

  • [ ] Implement shapeio for bandwidth limiting
  • [ ] Add time-window logic for production hour throttling
  • [ ] Build ANSI progress bars and summary tables
  • [ ] Create GitHub Actions workflows for cross-platform binary releases (GitHub Releases and Homebrew)