Senior Staff Architect — Technical Plan

Zorianto
Astral
v6.0

An execution-ready engineering breakdown of the AI Security & Governance Platform — every tradeoff named, every failure mode mapped, every shortcut flagged.

9 Modules · 28-Week Roadmap · <50ms Inline · 7 Frameworks · 10k RPS Target
Modules: Vigil · Luxion · Stellix · Sentinel · Chronix · Oraxis · Nexion
Section 01

System Understanding

📋 Product Restatement

Zorianto Astral is a real-time AI interaction governance platform. It wraps an organization's entire AI surface — browsers, API endpoints, desktop processes, agentic pipelines, and MCP server connections — with an inline enforcement layer that can block, redact, or modify interactions in under 50ms, while asynchronous compliance gap analysis runs against seven regulatory frameworks (HIPAA, GDPR, SOC 2, PCI-DSS, EU AI Act, NIST AI RMF, and internal AI Governance). It is not a SIEM add-on or a passive log aggregator; it is an active enforcement proxy that intercepts traffic and applies policies before damage occurs, combined with a governance layer for agentic AI workflows and machine identities at a 100:1 NHI-to-human ratio.

👥 Who Has It
  • Enterprise security teams (>500 employees) deploying AI at scale with no enforcement controls
  • Compliance orgs facing Aug 2026 EU AI Act deadline with 78% not ready
  • Platform engineers running agentic pipelines with no runtime governance
❌ Why Others Fail
  • DLP tools are perimeter-only — don't understand AI semantics
  • SIEMs log after the fact — data is already in the LLM
  • No existing tool intercepts agentic tool-calls inline
  • NHI management tools don't model AI-specific credential risks
🔧 What Makes It Hard
  • <50ms inline enforcement at 10k RPS with semantic analysis is at the physical limits of the current LLM API stack
  • Agent memory integrity requires continuous vector-space drift monitoring
  • Multi-tenant EU data residency with shared infra is architecturally complex

⚠ Risky Assumptions

⚠ Assumption A1 — LATENCY CONTRADICTION

The PRD states L3 AI Judge runs in <100ms AND inline enforcement completes in <50ms. These are contradictory. Calling an LLM API (even Haiku/Gemini Flash) for every interaction takes 200–800ms on a warm connection. Mitigation: L3 must be triggered asynchronously — inline path uses only L1+L2 (<15ms), with L3 escalating to a HOLD state for high-confidence heuristic hits. This is a fundamental architecture change from what the PRD implies.

⚠ Assumption A2 — MULTI-TENANT ARCHITECTURE UNRESOLVED

OQ-04 asks shared vs dedicated. This is not a product question — it's an architectural fork. Shared infra with RLS and tenant namespacing takes 6 weeks to get right. Dedicated per-customer means 10x infra costs at low customer counts. Mitigation: MVP ships shared with strict Postgres Row-Level Security. Dedicated is an enterprise tier add-on at Phase 3+. Decision must be locked before week 2.

⚠ Assumption A3 — ARMOR FRAMEWORK DEPENDENCY

A5 assumes "ARMOR framework integration for unauthorized fine-tuning detection is available as a dependency." ARMOR is not a widely known open framework as of April 2026. This is likely an internal or partner dependency. If unavailable, the fine-tuning detection feature (Luxion L4) is blocked. Mitigation: Document this as a P2 feature with a clear go/no-go gate at week 20.

⚠ Assumption A4 — EU AI ACT CLASSIFICATION VIA ML IS OVERCONFIDENT

Auto-classifying systems as Prohibited/High-Risk/Limited/Minimal per Annex III requires legal interpretation, not just ML. A model misclassifying a High-Risk system as Minimal creates legal liability. Mitigation: Stellix auto-classification is an advisory score + human confirmation workflow, never automated gate-keeping without human sign-off.

⚠ Assumption A5 — L4 DEEPFAKE AT <500ms IS OPTIMISTIC

Spectral audio analysis of a 30-second clip + behavioral baseline comparison is a 1–4 second operation with GPU, longer without. <500ms requires dedicated GPU inference infrastructure or aggressive audio chunking. Mitigation: L4 is never in the inline path. It's always async post-hoc analysis triggered by L2/L3 signals. Wire fraud authorization should go through a separate multi-factor CHALLENGE flow regardless.

MVP Plan

✅ Ships in v1 (Weeks 1–10)

  • Browser extension with DOM interception for Chrome + Edge (MDM deployable)
  • Apexion: PII/PHI blocking with regex + pattern matching (<15ms, L1+L2 only)
  • Stellix: AI tool discovery via DNS telemetry + browser extension inventory
  • API Gateway proxy on AWS (us-east-1) for REST AI API calls
  • Policy CRUD — create, test, activate policies with 30-day retroactive simulation
  • Basic Nexion: alert feed with severity + acknowledgment
  • Chronix: HIPAA + GDPR gap analysis only (the two highest-demand frameworks)
  • Postgres-backed audit log with 13-month retention
  • Sales Demo Mode (client-side, no backend needed)

✕ Cut from MVP

Vigil (agent governance) — cut because it requires a separate proxy infrastructure, agent registration model, and kill-switch mechanism. Adding this to the MVP doubles scope and delays the sellable core by 6 weeks.

Sentinel NHI — cut because NHI behavioral anomaly detection requires 30+ days of baseline data before it produces signal. Shipping it empty is confusing.

L3 AI Judge + L4 Deepfake — cut because the async inference service is a separate platform. Ship L1+L2 with 90%+ coverage of common threats.

Blast Radius Simulator — cut because it requires agent topology data that doesn't exist until Vigil ships.

EU AI Act countdown + all 7 frameworks — reduce to HIPAA + GDPR in MVP; the clock-based EU feature is powerful in demos but requires Stellix inventory to be populated first.

🏗 MVP Architecture

MVP architecture (diagram summary):
  • Clients: Browser Ext, Desktop Agent, API Clients, Cloud Connectors
  • Enforcement Gateway (Apexion Engine): L1 Signature (<5ms) → L2 Heuristic (<15ms) → OPA Policy (<1ms), total <50ms; verdict is ALLOW or BLOCK / REDACT, with allowed traffic forwarded to AI Providers
  • Core Services: Policy Engine (Go), Stellix Discovery, Event Pipeline, Chronix (HIPAA/GDPR), Nexion Alerts
  • Data Layer: PostgreSQL + TimescaleDB; Redis Cluster (session / rate limit); Kafka event pipeline; S3 + Glacier audit archive
  • User-facing: Next.js Dashboard

🔧 Tech Stack with Justification

Enforcement Service: Go 1.22+
  Why: Goroutine-per-connection model handles 10k RPS with sub-10ms routing overhead. net/http2 plus context cancellation fits the <50ms pipeline. Compiled binary: no JVM warmup, no long GC pauses.
  Rejected: Rust (better performance ceiling, but ~2x development time and real team ramp-up cost). Node.js (event-loop blocking during DLP regex at high concurrency; GC pressure at 10k RPS).

Policy Engine: OPA (Rego)
  Why: Sub-millisecond policy evaluation; battle-tested in Kubernetes admission controllers; native bundle compilation; supports policy-as-code git workflows. Partial evaluation makes policy simulation cheap.
  Rejected: Custom rule engine (faster initially, a maintenance nightmare at 48+ active policies). Cedar from AWS (newer, less community tooling).

Primary DB: PostgreSQL 16 + TimescaleDB
  Why: TimescaleDB's hypertable chunking gives O(log n) time-series queries on audit events without changing the SQL interface. Row-Level Security handles multi-tenant isolation at the DB layer. JSONB columns for flexible event metadata.
  Rejected: DynamoDB (no ad-hoc queries for compliance reports; expensive at audit-log volumes). ClickHouse (move analytics there at Phase 3 when query patterns stabilize; see Oraxis).

Session State: Redis Cluster 7.x
  Why: Sub-millisecond counter increments for rate limiting; SETEX for session TTLs; pub/sub for real-time alert streaming. Redis Cluster avoids a single-node SPOF; AOF persistence for crash recovery.
  Rejected: DragonflyDB (API-compatible but less production-proven at enterprise scale). Memcached (no pub/sub, no persistence).

Event Pipeline: Apache Kafka (MSK)
  Why: At 10k RPS sustained, Kafka's 100k+ msg/sec throughput with 7-day retention gives replay for forensics. Topic partitioning by tenant_id enables isolation.
  Rejected: SQS (PRD specifies it, but FIFO queues cap at ~3,000 msg/sec even with batching, and standard queues offer no ordering or replay). Kinesis (viable alternative; less community tooling for consumer groups).

Frontend: Next.js 14 (App Router)
  Why: Server Components render compliance reports without client hydration. Co-located API routes for the BFF pattern. Built-in ISR for slowly changing compliance scores. Vercel or self-hosted on ECS.
  Rejected: Remix (better for highly interactive UIs; compliance dashboards are read-heavy). SPA React (no SSR for SEO or initial load).

Auth: Cognito + custom JWT
  Why: Cognito handles SAML/SSO for enterprise (required for >500-employee orgs). Custom JWT service for API keys with scoped claims and per-key rotation tracking. Cognito User Pools give MFA without building it.
  Rejected: Auth0 (higher cost at scale; vendor lock-in). Keycloak (ops overhead). DIY (never, for a security product).

ML Inference: Python FastAPI + Triton
  Why: Separate Python service for the L3 AI Judge (async only) and L4 deepfake models. Triton Inference Server for GPU model serving with batching. Decoupled from the Go enforcement service, so ML failures cannot touch the inline path.
  Rejected: Inline in Go (CGo bindings for Python ML libraries are fragile and hard to deploy). Lambda (cold-start latency incompatible with any latency target).

Orchestration: EKS (Kubernetes)
  Why: Karpenter for node auto-provisioning; HPA on CPU plus request-latency custom metrics. ECS Fargate is fine at low scale but lacks the pod-autoscaling granularity needed for burst traffic at the enforcement gateway.
  Rejected: ECS Fargate (PRD specifies it, but at the 10k RPS target you hit Fargate task-scaling lag of >60s; Kubernetes HPA reacts in <15s). ECS on EC2 (self-managed instances add overhead that Kubernetes already abstracts better).
Section 03

Full Product Architecture

Service Boundaries

Verdict: Modular Monolith → Selective Microservices. Start with a modular monolith (single deployable, domain-separated packages) for Phases 0–2. Extract to separate services only at proven pain points, not ahead of time.

🔒 Enforcement Gateway

Extracted as a separate service from Day 1. Reason: must be independently scalable to 10k RPS, independently deployable (no big-bang redeploy for policy updates), and isolated so a bug in Chronix can't affect inline enforcement uptime.

🤖 ML Inference Service

Extracted from Day 1. Reason: Python/GPU runtime environment is incompatible with the Go enforcement binary. GPU node pools are expensive — co-locate all ML workloads in one service.

📊 Analytics Service (Phase 3)

Extract Oraxis onto ClickHouse when audit log exceeds 50M rows/month. OLAP queries for blast radius and cost attribution will kill the PostgreSQL primary if not separated.

🏛 Everything Else

Chronix, Nexion, Sentinel, Stellix, Vigil, and the Dashboard live in a single Go monolith until team size or throughput demands extraction. Premature microservices create distributed systems overhead without the team to manage it.

Scalability Model

Metric | Target | Strategy
Throughput | 10k RPS | Horizontal EKS pod scaling; Redis rate limiting per tenant
P95 Latency | <50ms | L1+L2 only inline; OPA bundle cached in-process; Redis local replica
Audit Events | ~864M/day at 10k RPS | Kafka buffering; TimescaleDB chunk compression; S3 tiering after 90 days
NHI Inventory | 10k identities | 4-hour scan interval; background worker pool; Redis cache for posture scores
Agent Sessions | 1k concurrent | Vigil Proxy stateful connections via Redis sorted sets; kill-switch via pub/sub
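The per-tenant Redis rate limiting in the scalability table is a token bucket. A minimal in-process sketch of the algorithm (the real implementation lives in Redis; rate and capacity here are illustrative, not product numbers):

```python
class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec up to `capacity`, spend one
    token per request. In production the counters live in Redis per tenant."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=2.0)
# Two-request burst succeeds, third is shed, refill restores service.
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])  # [True, True, False, True]
```

The same shape maps onto Redis with an atomic Lua script or `INCR` + TTL counters; clock input is explicit here so the behavior is deterministic.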

SLA & Reliability

⚠ 99.9% SLA Implication

99.9% = 8.77 hours downtime/year. For an inline enforcement path, this means every outage lets unscanned AI traffic through. The fail-open/fail-closed config is therefore not a convenience feature — it's a core SLA definition. Production should default fail-open for low-risk, fail-closed for high-risk (PHI tenants). Document this contractually.
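The downtime arithmetic behind that SLA figure, made explicit (assuming an average 8,766-hour year):

```python
def downtime_budget_hours(availability: float, period_hours: float = 8766.0) -> float:
    """Hours of permitted downtime for a given availability target."""
    return (1.0 - availability) * period_hours

print(round(downtime_budget_hours(0.999), 2))         # 8.77 hours/year
print(round(downtime_budget_hours(0.999, 730.0), 2))  # 0.73 hours per 730-hour month
```

Every one of those budgeted hours is unscanned AI traffic unless the fail-open/fail-closed posture is set deliberately per tenant.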

Failure Mode: Gateway Outage

Blast radius: All inline enforcement stops. Browser extensions fall through to allow-mode. API gateway returns 200 without DLP scan.
Detection: Heartbeat alert within 15 seconds; CloudWatch alarm on 5xx rate.
Recovery: EKS pod restart SLA: <45 seconds. Multi-AZ deployment ensures AZ failure doesn't cause full outage.

🗄 Key Data Model

-- Core event table (TimescaleDB hypertable)
Entity: enforcement_events
  - id: UUID [PK]
  - tenant_id: UUID [FK, RLS partition key]
  - timestamp: TIMESTAMPTZ [hypertable partition]
  - surface: ENUM(browser|api|desktop|agent|mcp)
  - actor_id: TEXT [user or NHI identifier]
  - ai_provider: TEXT [openai|anthropic|google|local]
  - decision: ENUM(allow|block|redact|challenge)
  - policy_ids: UUID[] [matched policies]
  - detection_layers: JSONB [{layer:L1, sig:DAN_001, conf:0.99}]
  - pii_detected: JSONB [{type:SSN, count:2, redacted:true}]
  - latency_ms: SMALLINT
  - INDEX: (tenant_id, timestamp DESC) -- compliance queries
  - INDEX: (tenant_id, decision, timestamp) -- alert feed
  - INDEX: (tenant_id, actor_id, timestamp) -- user risk scoring

Entity: policies
  - id: UUID [PK]
  - tenant_id: UUID [FK]
  - name: TEXT
  - surface: TEXT[] [applicable surfaces]
  - conditions: JSONB [OPA rego bundle reference]
  - action: ENUM(block|redact|warn|log|challenge)
  - enabled: BOOLEAN
  - version: INTEGER [optimistic locking]
  - simulated_impact: JSONB [last 30d retroactive analysis]

Entity: agent_sessions (Vigil)
  - id: UUID [PK]
  - tenant_id: UUID [FK]
  - agent_id: UUID [FK → registered_agents]
  - state: ENUM(open|active|inspecting|quarantined|terminated|closed)
  - execution_graph: JSONB [full hop trace]
  - memory_hash: TEXT [SHA-256 of memory store snapshot]
  - started_at, ended_at: TIMESTAMPTZ
  - kill_reason: TEXT

Entity: non_human_identities (Sentinel)
  - id: UUID [PK]
  - tenant_id: UUID [FK]
  - identity_type: ENUM(api_key|jwt|oauth|service_account|pat)
  - owner_team: TEXT
  - privilege_score: SMALLINT [0-100]
  - posture: ENUM(pass|warn|fail)
  - last_rotated_at: TIMESTAMPTZ
  - last_seen_ip: INET
  - anomaly_flags: JSONB
-- Critical API Endpoints (inline path only)

POST /api/v1/enforce
  Input: {surface, actor, prompt_hash, ai_provider, metadata}
  Output: {decision, action, redacted_content?, challenge_token?}
  Auth: Agent API key (scoped to tenant)
  SLA: P95 <50ms — circuit breaker at 100ms
  Rate: 10k RPS per tenant bucket (Redis token bucket)
  Note: This is the ONLY hot path. Optimize obsessively.

POST /api/v1/events/batch
  Input: [{event}] max 100 events
  Output: {accepted: N, failed: M}
  Note: Async ingestion path for cloud connectors. Publishes to Kafka, responds immediately.

GET /api/v1/policies/bundle
  Output: OPA bundle (tar.gz with .rego files)
  Cache: ETag + 304 Not Modified; agents poll every 30s
  Note: Policy bundles are compiled, not interpreted. Zero-downtime update via bundle versioning.

POST /api/v1/agents/{id}/sessions/{sid}/kill
  Input: {reason, escalate_to: "SOC"}
  Output: {terminated: true, correlated_events: [...]}
  Note: Publishes kill signal to Redis pub/sub. Vigil proxy subscribes; terminates connection immediately.
🔐 Security Model
  • All inter-service comms: mTLS with cert rotation via AWS ACM PCA
  • Enforcement events: encrypted at rest AES-256 in Postgres + S3 SSE
  • Audit logs: WORM-locked in S3 Object Lock (compliance evidence)
  • Top threat #1: Prompt injection to exfiltrate policy config → Policy data never sent to LLM; L2 detects instruction syntax in reverse direction
  • Top threat #2: Tenant data leakage via mis-scoped API key → Per-key tenant binding in JWT claims; Postgres RLS enforces at DB layer
  • Top threat #3: Enforcement bypass via direct API call to AI provider → Browser extension monitors XHR; desktop agent monitors process network; bypass logged as critical alert

Tech Decisions & Tradeoffs

⚡ DECISION: L3 AI Judge is ASYNC only — never in the inline path

LLM API calls (even flash/haiku) average 200–600ms on a warm connection. The PRD's stated "<50ms inline enforcement" and "<100ms L3" are physically contradictory. The inline path uses L1 (signature, <5ms) + L2 (heuristic, <15ms) + OPA policy evaluation (<1ms). L3 is triggered only when L2 confidence is 0.65–0.94 (ambiguous zone), and the interaction enters HOLD state pending async L3 resolution — like a credit card soft decline requiring additional verification.

L3 in inline path: Adds 200–600ms to every interaction. Unacceptable. Users would bypass the system. L3 via edge LLM (local model): Viable at Phase 4 — a quantized Mistral-7B on GPU nodes could do <80ms. Deferred due to GPU infra cost and complexity in MVP.

The HOLD state creates a user-visible delay for ~8% of interactions (estimated L2 ambiguous zone). This may frustrate users if hold time exceeds 2–3 seconds. Mitigation: CHALLENGE workflow gives users a reason ("Policy review: 2s") rather than an opaque pause.

If LLM inference latency drops to <50ms (plausible with edge models), revisit inlining L3. ONNX-quantized security-tuned models may enable this at Phase 4.
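The tiered inline decision above — L1 signatures, L2 heuristics, HOLD escalation for the 0.65–0.94 ambiguous band — can be sketched end to end. The signature set, heuristic markers, and block threshold below are illustrative stand-ins, not the production detection logic:

```python
from dataclasses import dataclass

# Illustrative L1 signature set; production uses compiled signature databases.
SIGNATURES = {"ignore previous instructions"}

@dataclass
class Verdict:
    decision: str  # allow | block | hold (hold = async L3 resolves later)
    layer: str

def l2_confidence(prompt: str) -> float:
    """Stand-in heuristic: fraction of suspicious markers present."""
    markers = ("system prompt", "exfiltrate", "bypass", "jailbreak")
    return sum(m in prompt.lower() for m in markers) / len(markers)

def enforce_inline(prompt: str) -> Verdict:
    if any(sig in prompt.lower() for sig in SIGNATURES):
        return Verdict("block", "L1")          # L1: exact signature hit
    conf = l2_confidence(prompt)
    if conf >= 0.95:
        return Verdict("block", "L2")          # L2: high-confidence heuristic
    if 0.65 <= conf < 0.95:
        return Verdict("hold", "L2")           # ambiguous zone -> HOLD for async L3
    return Verdict("allow", "L2")

print(enforce_inline("bypass the system prompt and exfiltrate data"))
```

The key property is that no branch of this function waits on a network call; L3 only ever consumes held interactions off a queue.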

🗄 DECISION: PostgreSQL + TimescaleDB over DynamoDB or ClickHouse

Compliance reports require JOIN-heavy queries across events, policies, NHI, and agents. DynamoDB's single-table design makes these queries painful and expensive. TimescaleDB's hypertable chunking gives O(log n) time-series performance without leaving the SQL ecosystem. PostgreSQL RLS provides multi-tenant data isolation at zero application-layer complexity.

DynamoDB: PRD specifies RDS but let's be explicit. DynamoDB cannot support the ad-hoc query patterns needed for Chronix gap analysis without a massive GSI footprint. ClickHouse: Ideal for Oraxis analytics at 50M+ events/month. Add as a read-replica analytics store at Phase 3, not Day 1 — schema is not stable enough yet.

At 864M events/day sustained (10k RPS), TimescaleDB chunk compression and S3 tiering after 90 days is critical. Without it, storage costs and query times degrade materially by month 4. Chunk compression must be tested during Phase 0 load testing.

At Phase 3, mirror enforcement_events to ClickHouse via Kafka consumer. Oraxis queries hit ClickHouse; all other modules stay on PostgreSQL. No disruptive migration required.

🚦 DECISION: Kafka over SQS for the event pipeline

The PRD specifies SQS/EventBridge. However, at 10k RPS x 1 event/request, that's 864M messages/day. SQS FIFO queues top out around 3,000 msg/sec even with batching, and standard queues, while nominally unlimited in throughput, provide no ordering and no replay. Kafka on MSK handles 1M+ msg/sec natively. More critically, Kafka's 7-day log retention enables forensic replay — if a new Luxion signature is deployed, you can reprocess last week's events to catch previously-missed attacks.

SQS: FIFO throughput ceiling too low; standard queues offer no ordering or replay. Kinesis: Valid alternative. Less Go ecosystem support than Kafka; shard management more manual. Switch to Kinesis if AWS-native tooling is a hard requirement. EventBridge: Keep for routing Nexion alerts to external systems (PagerDuty, Slack, SIEM) but not for high-throughput event ingestion.

Kafka on MSK adds operational complexity vs SQS. Partition count decisions made at cluster creation are hard to change. Set 64 partitions per topic (over-provision; cheaper than under-provisioning).

MSK Serverless removes partition management complexity at a cost premium (~2x per GB). Use MSK Serverless for MVP, evaluate dedicated MSK at Phase 2 when throughput patterns are known.
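The tenant_id partitioning above hinges on a stable key-to-partition mapping: the same tenant always lands on the same partition, preserving per-tenant ordering. A sketch of the idea (a production producer would use the Kafka client's keyed partitioner rather than hand-rolling this):

```python
import hashlib

PARTITIONS = 64  # over-provisioned per the decision above

def partition_for(tenant_id: str, partitions: int = PARTITIONS) -> int:
    """Deterministic tenant -> partition mapping via a stable hash, so the
    result does not vary across processes (unlike Python's builtin hash())."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions

# Same key always maps to the same partition; different tenants spread out.
assert partition_for("tenant-a") == partition_for("tenant-a")
```

This is also why the partition count is hard to change later: resizing reshuffles the mapping and breaks per-tenant ordering during the transition.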

🏗 DECISION: EKS over ECS Fargate for orchestration

ECS Fargate task scale-out takes 45–90 seconds (new task launch + health check). At 10k RPS burst traffic, that's too slow. EKS HPA with KEDA (Kubernetes Event-Driven Autoscaling) can scale pods in <15 seconds based on Kafka consumer lag metrics — which is exactly the right signal for the event pipeline. Karpenter handles node provisioning in <60 seconds.

ECS Fargate (PRD spec): Simpler ops, but the autoscaling SLA doesn't match the product's performance requirements. Reconsider for non-critical services (Chronix report generation). ECS on EC2: Self-managed instances add overhead that Kubernetes already abstracts better.

Use ArgoCD for GitOps deployments. Helm charts for all services. Cluster creation via Terraform. Budget 1 dedicated SRE/platform engineer starting Phase 1.

Deploy a dedicated EKS cluster in eu-west-1 for EU tenants. Enforce tenant routing at API Gateway (CloudFront → regional ALB). Postgres replicated to eu-west-1 Multi-AZ. No EU data touches us-east-1 workers.

Section 05

Technical Challenges

Challenge
🔥 Sub-50ms DLP at Scale
Why It Occurs
PII regex on raw prompt text is O(n) per pattern. At 10k RPS with ~2k-token prompts and 50+ PII patterns, a naive engine performs 500k full-prompt scans per second — on the order of a billion character comparisons per second.
Failure Mode
CPU saturation at ~3k RPS. P95 latency spikes to 200ms. Rate limiter kicks in; legitimate traffic is shed.
Mitigation
Use Hyperscan (Intel's regex engine via Go CGo binding) — designed for network DPI at multi-Gbps. Compile all PII patterns to a single Hyperscan database loaded at startup. 10-100x faster than RE2 for multi-pattern matching. Benchmark during Phase 0 load testing.
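Hyperscan's win is compiling all patterns into one database scanned in a single pass. The same single-pass idea can be sketched with stdlib `re` and a named alternation — a portable stand-in, nowhere near Hyperscan's speed, with a three-pattern subset standing in for the 50+ production set:

```python
import re

# Illustrative subset of PII patterns; production compiles 50+ into one
# Hyperscan database at startup.
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:\d[ -]?){15}\d\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

# One combined regex: each pattern becomes a named group, so a single pass
# over the input tests every pattern at once.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PII_PATTERNS.items()))

def scan(text: str) -> list[tuple[str, str]]:
    """Single pass; returns (pii_type, matched_text) for every hit."""
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]

print(scan("SSN 123-45-6789, contact a@b.com"))
# [('ssn', '123-45-6789'), ('email', 'a@b.com')]
```

The structural point carries over: pattern count stops multiplying scan cost, which is what makes the 50+ pattern budget survivable at 10k RPS.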
Challenge
⚠ Agent Memory Drift
Why It Occurs
RAG memory stores are updated continuously by agent tool calls. An adversarial document injected into the vector store shifts the embedding space gradually — no single event is suspicious, but the cumulative drift enables a semantic jailbreak over hours.
Failure Mode
Agent begins returning responses aligned with injected instructions days after the original poisoning. By the time L2 detects the output anomaly, the damage is done.
Mitigation
Hash-based snapshot on every memory write (SHA-256 of ordered embedding centroids). Semantic drift score: cosine similarity between current centroid distribution and T-24h baseline. Threshold alert at >0.08 drift. Purge + restore from last clean snapshot when threshold exceeded.
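The drift score in that mitigation is cosine distance between centroid snapshots. A minimal sketch with 2-dimensional toy embeddings (real embeddings are hundreds of dimensions; vectors and magnitudes here are illustrative):

```python
import math

def centroid(vectors: list[list[float]]) -> list[float]:
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def drift_score(baseline: list[float], current: list[float]) -> float:
    """1 - cosine similarity between the T-24h baseline centroid and the
    current centroid; 0 means identical direction, higher means more drift."""
    dot = sum(a * b for a, b in zip(baseline, current))
    norm = math.sqrt(sum(a * a for a in baseline)) * math.sqrt(sum(b * b for b in current))
    return 1.0 - dot / norm

DRIFT_THRESHOLD = 0.08  # alert + purge/restore above this, per the mitigation

base = centroid([[1.0, 0.0], [0.9, 0.1]])
poisoned = centroid([[1.0, 0.0], [0.2, 0.9]])  # one injected document shifts the space
print(drift_score(base, base) < DRIFT_THRESHOLD)      # True: stable store
print(drift_score(base, poisoned) > DRIFT_THRESHOLD)  # True: drift trips the alert
```

No single write is suspicious; only the centroid comparison against the 24-hour-old snapshot exposes the cumulative shift.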
Challenge
🚨 3AM Alert: Policy Bundle Race
Why It Occurs
Enforcement Gateway pods poll policy bundles every 30 seconds. During a rolling deploy, pods have different bundle versions simultaneously. A policy change during a rolling restart creates a 30–60 second window where 50% of pods are on old policy.
Failure Mode
A critical PHI blocking policy is deployed; for 60 seconds, half of requests bypass it. This is a HIPAA compliance incident, not just a bug.
Mitigation
Bundle version is included in every enforcement event log. Policy activation has two modes: "instant" (all pods hot-reload within 5s via Redis pub/sub) vs "rolling" (default). Critical PHI/PCI policies must use instant mode. Audit log captures policy version per decision for compliance evidence.
Challenge (PRD Blind Spot)
🔭 Browser Extension Trust
Why It Occurs
The PRD assumes the browser extension is the enforcement point for browser AI interactions. But a user can: (a) use a browser without the extension installed, (b) disable the extension, or (c) use a browser AI model that runs locally without network calls.
Failure Mode
Shadow AI usage via unmanaged devices or non-MDM browsers is completely undetected. The Stellix discovery metric shows 95% coverage, but it's really 95% of managed devices — a significant blind spot.
Mitigation
API Gateway enforcement is the backstop — all external AI API calls from the corporate network must route through the proxy (network policy enforcement via CASB rules or DNS filtering). Browser extension is defense-in-depth, not the primary control. This must be communicated to customers clearly.

Advanced Approaches

🌊 Event Sourcing for Audit

What it solves: Compliance evidence requires immutable, ordered history of every policy evaluation with full context. Traditional CRUD updates destroy this audit chain.

When worth it: From Day 1. The enforcement_events table IS an event store — append-only, never updated. This is event sourcing without the overhead of rebuilding projections for every read.

Adoption cost: Low. Design the data model correctly from the start (no UPDATE on events). The "projection" is Chronix reading the event stream to compute compliance scores.

✓ USE NOW
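The append-only log plus projection pattern in miniature — an event list that is only ever appended, and a Chronix-style fold that computes a compliance score from it (event fields and the scoring rule are illustrative):

```python
# Append-only event store: no UPDATE, no DELETE, ever.
EVENTS: list[dict] = []

def append(event: dict) -> None:
    EVENTS.append(event)

def compliance_score(events: list[dict]) -> float:
    """Projection: share of PII-bearing interactions that were enforced
    (blocked or redacted). Recomputed from the stream, never stored as truth."""
    pii_events = [e for e in events if e["pii_detected"]]
    if not pii_events:
        return 1.0
    enforced = [e for e in pii_events if e["decision"] in ("block", "redact")]
    return len(enforced) / len(pii_events)

append({"decision": "block", "pii_detected": True})
append({"decision": "allow", "pii_detected": False})
append({"decision": "allow", "pii_detected": True})  # the gap the projection surfaces
print(compliance_score(EVENTS))  # 0.5
```

Because the score is derived, a new compliance rule only requires re-reading the stream, never rewriting history.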

🏎 Edge-Deployed Policy Eval

What it solves: Browser extension currently sends to API Gateway for policy evaluation — adds ~50ms round-trip. Edge deployment of OPA WASM bundles in the extension itself could reduce this to <5ms local evaluation.

When worth it: Phase 4, when customer latency complaints emerge or when enterprises deploy Astral for latency-sensitive real-time coding assistants.

Overkill for: MVP. Policy bundle sync in the extension adds deployment complexity; WASM bundles are 2-5MB which slows extension updates through Chrome Web Store review.

⟳ EVALUATE AT PHASE 4

🧠 Fine-tuned Security LLM

What it solves: The L3 AI Judge uses a general-purpose LLM (GPT-4o/Claude Sonnet) for semantic threat analysis. A security-domain fine-tuned model (Mistral-7B + security corpus) could run on dedicated GPU nodes at <80ms — enabling inline L3.

When worth it: When monthly L3 API costs exceed $50k/month, or when customers require data-resident AI judgment (no prompts leaving tenant network).

Adoption cost: High. Requires curated training dataset, RLHF pipeline, model evaluation framework, and GPU serving infra. 3–4 months of ML team time.

⟳ EVALUATE AT $50K/MO THRESHOLD

📊 Feature Flags + Progressive Policy Delivery

What it solves: The policy simulator shows retroactive impact but can't predict future edge cases. Progressive policy rollout (10% of users → 25% → 100%) catches false positives before full deployment.

When worth it: Phase 2. When enterprise customers start deploying blocking policies and false positives become a customer success issue.

Implementation: LaunchDarkly SDK or self-hosted Unleash. Policy activation has a "canary" mode that routes N% of tenant traffic through the new policy.

⟳ EVALUATE PHASE 2

🔍 Vector DB for Semantic Search

What it solves: Compliance report generation requires "find all events related to PHI exposure" — a semantic query, not a keyword match. pgvector on PostgreSQL enables this without a separate vector database.

When worth it: Phase 3. Add pgvector to the existing PostgreSQL instance; embed enforcement events with a lightweight model (all-MiniLM-L6-v2). Enables natural language compliance queries.

⟳ PHASE 3 WITH pgvector

🔄 CQRS for Read/Write Separation

What it solves: Compliance report generation is a read-heavy, long-running operation that competes with high-frequency audit event writes on the same PostgreSQL primary.

When worth it: When report generation p95 exceeds 10 seconds, or when write throughput on the primary exceeds 50k events/second. Route all Chronix reads to a PostgreSQL read replica.

Overkill for: MVP. PostgreSQL Multi-AZ read replica is sufficient for Phase 0–2.

⟳ WHEN REPORTS SLOW DOWN

Section 07

Implementation Roadmap

Weeks 1–4
Phase 0 — Foundation
  • EKS cluster + VPC, us-east-1 + eu-west-1 (infra eng)
  • PostgreSQL 16 + TimescaleDB, schema v1 (backend)
  • Redis Cluster + MSK Kafka baseline config
  • Cognito + JWT auth service, API key lifecycle
  • OPA policy engine — bundle compile + serve pipeline
  • Event state machine: OBSERVED → LOGGED (minimal)
  • CI/CD: ArgoCD + GitHub Actions + Terraform state
  • Load testing harness: k6 @ 1k RPS baseline
Weeks 5–10
Phase 1 — Inline MVP
  • Chrome extension: DOM intercept + XHR hook (frontend)
  • Apexion enforcement engine: L1+L2, Hyperscan DLP (Go)
  • API Gateway proxy: HTTPS intercept, JWT validation
  • Policy CRUD API + 30-day simulation mode
  • Stellix: DNS telemetry + extension inventory
  • Basic Nexion: alert feed + ACK workflow
  • Chronix: HIPAA + GDPR gap analysis + PDF reports
  • Next.js dashboard: enforcement log + policy UI
  • Sales Demo Mode: 17 scenarios, client-side only
  • Load test: P95 <50ms @ 5k RPS. Gate to Phase 2.
Weeks 11–16
Phase 2 — Agent Runtime
  • Vigil proxy: sidecar deployment, agent registration API
  • Tool-call interception + MCP whitelist enforcement
  • Session state machine (OPEN→QUARANTINED→TERMINATED)
  • Kill-switch via Redis pub/sub
  • Execution graph rendering in Next.js dashboard
  • Indirect injection stripping (PDF/doc ingestion)
  • Desktop agent: macOS/Windows process scanner
  • Sentinel NHI: inventory + privilege scoring (passive)
Weeks 17–22
Phase 3 — Governance
  • Chronix: all 7 frameworks + article-level gaps
  • EU AI Act countdown + Art.6 classification workflow
  • Sentinel NHI: behavioral anomaly + auto-rotation
  • Oraxis: token cost attribution + burn rate
  • Executive Risk Dashboard + board reports
  • ClickHouse analytics store migration
  • SOC 2 Type II prep: audit evidence automation
  • eu-west-1 full regional isolation + data residency validation
Weeks 23–28
Phase 4 — Advanced
  • L3 AI Judge async service (FastAPI + LLM)
  • L4 deepfake detection (audio + image) on GPU nodes
  • Vigil memory integrity scanner + purge/restore
  • Blast radius simulator (Oraxis)
  • Policy simulator with canary rollout
  • Stellix supply chain scanner (CVE + dependency)
  • ARMOR integration (if available) for fine-tune detection
  • Edge OPA WASM bundles (evaluate)

⚡ Parallelization Opportunities

Track A: Enforcement

Go gateway + browser extension + API proxy. Can run fully independently from Track B after shared event schema is locked (end of Week 2). Team: 2 Go engineers + 1 frontend (extension).

Track B: Dashboard

Next.js dashboard + policy UI + Chronix gap analysis. Depends only on API contract (OpenAPI spec locked Week 2). Mock API for development. Team: 1 fullstack + 1 frontend.

Track C: Infra/Platform

EKS, Terraform, CI/CD, MSK, RDS, Redis. Runs in parallel Week 1 onwards. Must deliver baseline environment by end of Week 2. Team: 1 platform/SRE engineer.

Track D: ML (Phase 2+)

L3 AI Judge async service + L4 deepfake models. Starts Week 12, after core enforcement is stable. Depends on API contract for async challenge protocol. Team: 1 ML engineer.

🎯 Critical Path

The single thread that delays everything: The enforcement event schema and the OPA policy bundle format. Every other component depends on these two contracts — the browser extension, the API gateway, the Nexion alert feed, the Chronix compliance mapping, and the TimescaleDB schema all wire to this. If the event schema changes after Week 3, it cascades across 5 teams.

Lock the event schema and OpenAPI spec by end of Week 2. No exceptions. Use schema versioning (v1, v2) to allow additive changes without breaking existing consumers.

📦 Open Source & Acceleration

Tool | Purpose | Why This | Status
OPA + Rego | Policy engine | Sub-ms eval; git-native policy-as-code; WASM compilation for edge | Stable ✓
Hyperscan | Multi-pattern DLP regex | Network-speed regex; single DB compile for all PII patterns | Stable ✓
TimescaleDB | Time-series audit events | PostgreSQL-compatible; chunk compression; continuous aggregates | Stable ✓
Triton Inference Server | GPU model serving (L3/L4) | Dynamic batching; multi-model serving; gRPC interface for Go | Active — evaluate
Presidio (Microsoft) | PII/PHI entity recognition | Pretrained NER for 50+ PII types; augments regex with ML-based detection | Active — evaluate
Karpenter | EKS node auto-provisioning | Spot instance provisioning in <60s; GPU node pool management | Stable ✓
KEDA | Kafka-driven pod autoscaling | Scales pods on Kafka consumer lag — the right trigger for enforcement bursts | Stable ✓
pgvector | Semantic search on events (Phase 3) | No separate vector DB; works in the existing PostgreSQL instance | Active — evaluate


Shortcuts & Non-Negotiables

✅ Acceptable MVP Shortcuts

Regex-only DLP in v1

Ship the L1 signature layer and rule-based L2 heuristics without ML-backed detection models. Covers 90%+ of common PII patterns. Cleanup trigger: first false-negative customer complaint or when ML infra is available.

Polling for Policy Updates

Agents poll for policy bundles every 30s rather than push. Adds up to 30s lag on policy activation. Cleanup trigger: customer needs sub-5s policy propagation (implement Redis pub/sub push).
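The polling shortcut leans on the ETag + 304 mechanics specified for GET /api/v1/policies/bundle. A toy sketch with the HTTP layer faked out (class names and bundle bytes are illustrative):

```python
class BundleServer:
    """Stand-in for the policy bundle endpoint: 304 when the ETag matches."""
    def __init__(self):
        self.version, self.bundle = "v1", b"rego-bundle-v1"

    def get(self, if_none_match):
        if if_none_match == self.version:
            return 304, None, self.version   # unchanged: no body transferred
        return 200, self.bundle, self.version

class Agent:
    """Polls every 30s (the loop is elided); caches bundle + ETag."""
    def __init__(self):
        self.etag, self.bundle = None, None

    def poll(self, server) -> int:
        status, body, etag = server.get(self.etag)
        if status == 200:
            self.bundle, self.etag = body, etag
        return status

server, agent = BundleServer(), Agent()
print(agent.poll(server))  # 200: first fetch downloads the bundle
print(agent.poll(server))  # 304: steady state is nearly free
server.version, server.bundle = "v2", b"rego-bundle-v2"
print(agent.poll(server))  # 200: new version picked up within one poll interval
```

The worst-case propagation lag is exactly one poll interval, which is why the cleanup trigger is a sub-5s requirement: push via Redis pub/sub, don't shrink the interval.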

Single-Region MVP

Launch us-east-1 only. EU region required only for first EU enterprise customer. Cleanup trigger: first EU customer signed. Allow 6 weeks for eu-west-1 deployment.

🚫 Unacceptable Shortcuts

Skipping Multi-Tenant RLS

Tempting to add tenant_id to WHERE clauses in application code. This is a trap. One missing WHERE clause leaks all tenants' data. PostgreSQL RLS enforces isolation at the DB layer — no application bug can bypass it. Must be right from Day 1.
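As a sketch of what DB-layer isolation looks like in PostgreSQL — table and column names follow the surrounding text, but `app.tenant_id` and the policy name are hypothetical:

```sql
-- Isolation lives in the database, not in application WHERE clauses.
ALTER TABLE enforcement_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE enforcement_events FORCE ROW LEVEL SECURITY;  -- applies even to the table owner

CREATE POLICY tenant_isolation ON enforcement_events
    USING (tenant_id = current_setting('app.tenant_id')::uuid)
    WITH CHECK (tenant_id = current_setting('app.tenant_id')::uuid);

-- The application sets the tenant once per transaction:
--   SET LOCAL app.tenant_id = '<tenant uuid>';
-- Every SELECT and INSERT is then filtered by the database itself;
-- a forgotten WHERE clause in application code can no longer leak data.
```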

Mutable Audit Events

No UPDATE or DELETE on enforcement_events. Ever. Compliance evidence requires immutable, append-only records. Even "corrections" must be new events with a reference to the corrected event.
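A hedged sketch of enforcing append-only semantics in the database itself, rather than by convention — the role name `app_writer` and the `corrects_event_id` column are hypothetical:

```sql
-- First line of defense: the application role simply lacks the privileges.
REVOKE UPDATE, DELETE ON enforcement_events FROM app_writer;

-- Belt-and-braces: reject mutation even from over-privileged roles.
CREATE OR REPLACE FUNCTION reject_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'enforcement_events is append-only; write a correction event instead';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER enforcement_events_immutable
    BEFORE UPDATE OR DELETE ON enforcement_events
    FOR EACH ROW EXECUTE FUNCTION reject_mutation();

-- Corrections reference the original rather than rewriting it:
--   ALTER TABLE enforcement_events ADD COLUMN corrects_event_id uuid NULL;
```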

Synchronous L3 AI Judge

If the team is tempted to put the LLM call in the hot path "just for now" — don't. The latency regression will be immediate, the removal will be deprioritized forever, and customers will build workflows around the slow behavior.
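The safe shape is to enqueue for L3 review without ever blocking the inline decision. A minimal Go sketch, using a buffered channel as a stand-in for the Kafka producer (`enforceInline` and `reviewQueue` are illustrative names):

```go
package main

import (
	"fmt"
	"time"
)

// reviewQueue stands in for the Kafka topic feeding the async L3 judge.
// A buffered channel plus a non-blocking send means a slow or backed-up
// judge can never add latency to the inline path — overflow events go to
// a DLQ (omitted here) instead of stalling the request.
var reviewQueue = make(chan string, 1024)

func enforceInline(prompt string) (decision string) {
	// L1/L2 fast path only: signatures, OPA eval, rate checks (simulated).
	decision = "allow"

	select {
	case reviewQueue <- prompt: // enqueue for async L3 review
	default: // queue full: record the drop and move on, never block
	}
	return decision
}

func main() {
	start := time.Now()
	d := enforceInline("summarize this contract")
	fmt.Println(d, "inline path under 50ms:", time.Since(start) < 50*time.Millisecond)
}
```

If the L3 verdict later contradicts the inline decision, that is a new enforcement event referencing the original, consistent with the append-only audit model.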

🔒 Non-Negotiables from Day 1

Event Schema Lock by Week 2

The enforcement_events schema is the contract between 5 services. Changing it after Week 3 is a coordination disaster. Lock it. Version it. Add fields additively only.

TLS 1.3 Everywhere

All inter-service communication: mTLS. All client connections: TLS 1.3 minimum. Retrofitting this in a security product is a marketing disaster, not just technical debt.

Load Testing Gate between Phases

No Phase 2 start until Phase 1 passes: P95 <50ms @ 5k RPS under k6 load test. This isn't optional. Ship a product that can't hit its stated SLA and the sales team is lying to customers.

Observability Stack on Day 1

OpenTelemetry tracing on every enforcement hop. Prometheus metrics exported. Grafana dashboards. Without this, debugging latency regressions or finding the "3am incident" is impossible.

Section 09 — Final Critique

Honest Critique

⚠ Top 3 Failure Modes

1. The 50ms SLA is Aspirational

The <50ms target for inline enforcement requires Hyperscan DLP + OPA eval + Redis rate check + async Kafka publish — all under 50ms including network overhead to the enforcement gateway. Under cold-cache conditions on a fresh EKS pod, it won't be met. The first time a customer sees a >50ms enforcement decision and screenshots their Network tab, it becomes a sales problem. Mitigation: define the SLA as P95 <50ms (not mean), pre-warm pods with a 5-minute JIT warm-up period, and instrument every component with a latency breakdown in the Oraxis dashboard.
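Since the SLA is defined on P95 rather than the mean, it is worth being precise about the computation. A small Go sketch using the nearest-rank method (function and variable names are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th-percentile latency (ms) by nearest rank — the
// number the SLA is gated on, not the mean. Integer arithmetic for the
// rank avoids float rounding at exact multiples (e.g. n=100).
func p95(samplesMs []float64) float64 {
	s := append([]float64(nil), samplesMs...)
	sort.Float64s(s)
	rank := (95*len(s)+99)/100 - 1 // ceil(0.95 * n) - 1, zero-indexed
	if rank < 0 {
		rank = 0
	}
	return s[rank]
}

func main() {
	// 100 requests: 97 fast, 3 cold-cache outliers. The mean would be
	// dragged up by the outliers; P95 still clears the <50ms gate.
	samples := make([]float64, 0, 100)
	for i := 0; i < 97; i++ {
		samples = append(samples, 12)
	}
	samples = append(samples, 180, 220, 400)
	fmt.Println("P95 passes <50ms gate:", p95(samples) < 50)
}
```

This is also why cold-start outliers must stay under ~5% of traffic: one more percentage point of slow pods and the same fleet fails the gate.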

2. Multi-Tenant Decision Debt

OQ-04 (shared vs dedicated) is marked "open question" but the architecture decision must be made in Week 1 — it affects the Postgres schema, the Kafka topic structure, the Cognito pool design, and the AWS account strategy. If it's answered wrong (shared when a customer needs dedicated), migration is a 6-month effort. Recommend: ship shared multi-tenant, sell dedicated as Enterprise tier at 2x price, budget the migration cost into the enterprise deal.

3. EU AI Act Classification Liability

Stellix auto-classifying AI systems as "Minimal Risk" when they should be "High Risk" creates legal liability for the customer — and potentially for Zorianto as the tool that enabled the misjudgment. The PRD frames this as a discovery feature, but compliance officers will treat it as a compliance decision. Every auto-classification must carry a "pending legal review" state and a prominent disclaimer. Build the human-confirmation workflow into the Stellix UI before any customer uses the EU AI Act features in production.
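One way to make the "pending legal review" requirement concrete is a small state machine in which no classification reaches a final state without human review — a Go sketch with hypothetical state names:

```go
package main

import "fmt"

// State models the review lifecycle of a Stellix auto-classification.
// An ML-suggested risk tier is never usable as compliance evidence
// until a human confirms or overrides it.
type State string

const (
	AutoSuggested      State = "auto_suggested"       // model output only
	PendingLegalReview State = "pending_legal_review" // surfaced to compliance with disclaimer
	HumanConfirmed     State = "human_confirmed"      // usable as evidence
	HumanOverridden    State = "human_overridden"     // reviewer changed the tier
)

// allowed encodes that every path to a final state passes through review.
var allowed = map[State][]State{
	AutoSuggested:      {PendingLegalReview},
	PendingLegalReview: {HumanConfirmed, HumanOverridden},
}

func canTransition(from, to State) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(AutoSuggested, HumanConfirmed))      // no shortcut past review
	fmt.Println(canTransition(PendingLegalReview, HumanConfirmed)) // review can finalize
}
```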

🚀 With 10x Resources

What Would Change
  • Build a fine-tuned security LLM from Day 1, enabling inline L3 at <40ms on dedicated GPU nodes
  • Ship all 9 modules simultaneously with dedicated teams per module
  • Deploy in 5 AWS regions on Day 1 (US, EU, APAC, CA, UK)
  • Build a separate red-team team to continuously attack the platform before GA
  • Hire a dedicated legal team to do EU AI Act classification alongside the ML model — eliminating the liability risk
What Would Stay the Same
  • Go for the enforcement service — no amount of resources changes the correctness of this choice
  • OPA for policy engine — battle-tested, correct choice regardless of scale
  • Append-only audit event model — non-negotiable for compliance evidence
  • Kafka over SQS for the event pipeline at this throughput
  • PostgreSQL + TimescaleDB as the operational database — OLAP can migrate later

📊 Key Success Metrics

  • <50ms — P95 inline enforcement
  • <3% — False positive rate (DLP)
  • 95%+ — Shadow AI detection
  • <30min — MTTR on critical alerts
SELF-REVIEW COMPLETE — 2 cycles, 7 amendments.
  • L3 async-only architecture clearly separated from inline path
  • Kafka substituted for SQS with explicit rationale
  • EKS over ECS Fargate with autoscaling justification
  • Multi-tenant OQ-04 resolution added to assumptions
  • EU AI Act classification liability warning added
  • Policy bundle race condition failure mode added
  • Browser extension trust model blind spot documented