Senior Staff Architect — Technical Plan

Zorianto
Astral
v6.0

An execution-ready engineering breakdown of the AI Security & Governance Platform — every tradeoff named, every failure mode mapped, every shortcut flagged.

9 Modules · 28-Week Roadmap · <50ms Inline · 7 Frameworks · 10k RPS Target
Modules: Vigil · Luxion · Stellix · Sentinel · Chronix · Oraxis · Nexion
Section 01

System Understanding

📋 Product Restatement

Zorianto Astral is a real-time AI interaction governance platform. It wraps an organization's entire AI surface — browsers, API endpoints, desktop processes, agentic pipelines, and MCP server connections — with an inline enforcement layer that can block, redact, or modify interactions in under 50ms, while asynchronous compliance gap analysis runs against seven regulatory frameworks (HIPAA, GDPR, SOC 2, PCI-DSS, EU AI Act, NIST AI RMF, and internal AI Governance). It is not a SIEM add-on or a passive log aggregator; it is an active enforcement proxy that intercepts traffic and applies policies before damage occurs, combined with a governance layer for agentic AI workflows and machine identities at a 100:1 NHI-to-human ratio.

👥 Who Has It
  • Enterprise security teams (>500 employees) deploying AI at scale with no enforcement controls
  • Compliance orgs facing Aug 2026 EU AI Act deadline with 78% not ready
  • Platform engineers running agentic pipelines with no runtime governance
❌ Why Others Fail
  • DLP tools are perimeter-only — don't understand AI semantics
  • SIEMs log after the fact — data is already in the LLM
  • No existing tool intercepts agentic tool-calls inline
  • NHI management tools don't model AI-specific credential risks
🔧 What Makes It Hard
  • <50ms inline enforcement at 10k RPS with semantic analysis is at the physical limits of the current LLM API stack
  • Agent memory integrity requires continuous vector-space drift monitoring
  • Multi-tenant EU data residency with shared infra is architecturally complex

⚠ Risky Assumptions

⚠ Assumption A1 — LATENCY CONTRADICTION

The PRD states L3 AI Judge runs in <100ms AND inline enforcement completes in <50ms. These are contradictory. Calling an LLM API (even Haiku/Gemini Flash) for every interaction takes 200–800ms on a warm connection. Mitigation: L3 must be triggered asynchronously — inline path uses only L1+L2 (<15ms), with L3 escalating to a HOLD state for high-confidence heuristic hits. This is a fundamental architecture change from what the PRD implies.

⚠ Assumption A2 — MULTI-TENANT ARCHITECTURE UNRESOLVED

OQ-04 asks shared vs dedicated. This is not a product question — it's an architectural fork. Shared infra with RLS and tenant namespacing takes 6 weeks to get right. Dedicated per-customer means 10x infra costs at low customer counts. Mitigation: MVP ships shared with strict Postgres Row-Level Security. Dedicated is an enterprise tier add-on at Phase 3+. Decision must be locked before week 2.

⚠ Assumption A3 — ARMOR FRAMEWORK DEPENDENCY

A5 assumes "ARMOR framework integration for unauthorized fine-tuning detection is available as a dependency." ARMOR is not a widely known open framework as of April 2026. This is likely an internal or partner dependency. If unavailable, the fine-tuning detection feature (Luxion L4) is blocked. Mitigation: Document this as a P2 feature with a clear go/no-go gate at week 20.

⚠ Assumption A4 — EU AI ACT CLASSIFICATION VIA ML IS OVERCONFIDENT

Auto-classifying systems as Prohibited/High-Risk/Limited/Minimal per Annex III requires legal interpretation, not just ML. A model misclassifying a High-Risk system as Minimal creates legal liability. Mitigation: Stellix auto-classification is an advisory score + human confirmation workflow, never automated gate-keeping without human sign-off.

⚠ Assumption A5 — L4 DEEPFAKE AT <500ms IS OPTIMISTIC

Spectral audio analysis of a 30-second clip + behavioral baseline comparison is a 1–4 second operation with GPU, longer without. <500ms requires dedicated GPU inference infrastructure or aggressive audio chunking. Mitigation: L4 is never in the inline path. It's always async post-hoc analysis triggered by L2/L3 signals. Wire fraud authorization should go through a separate multi-factor CHALLENGE flow regardless.

MVP Plan

✅ Ships in v1 (Weeks 1–10)

  • Browser extension with DOM interception for Chrome + Edge (MDM deployable)
  • Apexion: PII/PHI blocking with regex + pattern matching (<15ms, L1+L2 only)
  • Stellix: AI tool discovery via DNS telemetry + browser extension inventory
  • API Gateway proxy on AWS (us-east-1) for REST AI API calls
  • Policy CRUD — create, test, activate policies with 30-day retroactive simulation
  • Basic Nexion: alert feed with severity + acknowledgment
  • Chronix: HIPAA + GDPR gap analysis only (the two highest-demand frameworks)
  • Postgres-backed audit log with 13-month retention
  • Sales Demo Mode (client-side, no backend needed)

✕ Cut from MVP

Vigil (agent governance) — cut because it requires a separate proxy infrastructure, agent registration model, and kill-switch mechanism. Adding this to the MVP doubles scope and delays the sellable core by 6 weeks.

Sentinel NHI — cut because NHI behavioral anomaly detection requires 30+ days of baseline data before it produces signal. Shipping it empty is confusing.

L3 AI Judge + L4 Deepfake — cut because the async inference service is a separate platform. Ship L1+L2 with 90%+ coverage of common threats.

Blast Radius Simulator — cut because it requires agent topology data that doesn't exist until Vigil ships.

EU AI Act countdown + all 7 frameworks — reduce to HIPAA + GDPR in MVP; the clock-based EU feature is powerful in demos but requires Stellix inventory to be populated first.

🏗 MVP Architecture

MVP architecture (diagram summary):
  • Clients: Browser Ext, Desktop Agent, API Clients, Cloud Connectors
  • Enforcement Gateway (Apexion Engine): L1 Signature (<5ms) → L2 Heuristic (<15ms) → OPA Policy (<1ms), total <50ms; verdict is ALLOW or BLOCK / REDACT, with allowed traffic forwarded to AI Providers
  • Core Services: Policy Engine (Go), Stellix Discovery, Event Pipeline, Chronix (HIPAA/GDPR), Nexion Alerts
  • Data Layer: PostgreSQL + TimescaleDB; Redis Cluster (session / rate limit); Kafka event pipeline; S3 + Glacier audit archive
  • User-facing: Next.js Dashboard

🔧 Tech Stack with Justification

Enforcement Service: Go 1.22+
  Why: Goroutine-per-connection model handles 10k RPS with sub-10ms routing overhead. net/http2 plus context cancellation fits the <50ms pipeline. Compiled binary: no JVM warmup, no long GC pauses.
  Rejected: Rust (better performance ceiling, but ~2x development time and real team ramp-up cost). Node.js (event-loop blocking during DLP regex at high concurrency; GC pressure at 10k RPS).

Policy Engine: OPA (Rego)
  Why: Sub-millisecond policy evaluation; battle-tested in Kubernetes admission controllers; native bundle compilation; supports policy-as-code git workflows. Partial evaluation makes policy simulation cheap.
  Rejected: Custom rule engine (faster initially, a maintenance nightmare at 48+ active policies). Cedar from AWS (newer, less community tooling).

Primary DB: PostgreSQL 16 + TimescaleDB
  Why: TimescaleDB's hypertable chunking gives O(log n) time-series queries on audit events without changing the SQL interface. Row-Level Security handles multi-tenant isolation at the DB layer. JSONB columns for flexible event metadata.
  Rejected: DynamoDB (no ad-hoc queries for compliance reports; expensive at audit-log volumes). ClickHouse (move analytics there at Phase 3 when query patterns stabilize; see Oraxis).

Session State: Redis Cluster 7.x
  Why: Sub-millisecond counter increments for rate limiting; SETEX for session TTLs; pub/sub for real-time alert streaming. Redis Cluster avoids a single-node SPOF; AOF persistence for crash recovery.
  Rejected: DragonflyDB (API-compatible but less production-proven at enterprise scale). Memcached (no pub/sub, no persistence).

Event Pipeline: Apache Kafka (MSK)
  Why: At 10k RPS sustained, Kafka's 100k+ msg/sec throughput with 7-day retention gives replay for forensics. Topic partitioning by tenant_id enables isolation.
  Rejected: SQS (PRD specifies it, but FIFO queues cap at ~3,000 msg/sec even with batching, and standard queues offer no ordering or replay). Kinesis (viable alternative; less community tooling for consumer groups).

Frontend: Next.js 14 (App Router)
  Why: Server Components render compliance reports without client hydration. Co-located API routes for the BFF pattern. Built-in ISR for slowly changing compliance scores. Vercel or self-hosted on ECS.
  Rejected: Remix (better for highly interactive UIs; compliance dashboards are read-heavy). SPA React (no SSR for SEO or initial load).

Auth: Cognito + custom JWT
  Why: Cognito handles SAML/SSO for enterprise (required for >500-employee orgs). Custom JWT service for API keys with scoped claims and per-key rotation tracking. Cognito User Pools give MFA without building it.
  Rejected: Auth0 (higher cost at scale; vendor lock-in). Keycloak (ops overhead). DIY (never, for a security product).

ML Inference: Python FastAPI + Triton
  Why: Separate Python service for the L3 AI Judge (async only) and L4 deepfake models. Triton Inference Server for GPU model serving with batching. Decoupled from the Go enforcement service, so ML failures cannot touch the inline path.
  Rejected: Inline in Go (CGo bindings for Python ML libraries are fragile and hard to deploy). Lambda (cold-start latency incompatible with any latency target).

Orchestration: EKS (Kubernetes)
  Why: Karpenter for node auto-provisioning; HPA on CPU plus request-latency custom metrics. ECS Fargate is fine at low scale but lacks the pod-autoscaling granularity needed for burst traffic at the enforcement gateway.
  Rejected: ECS Fargate (PRD specifies it, but at the 10k RPS target you hit Fargate task-scaling lag of >60s; Kubernetes HPA reacts in <15s). ECS on EC2 (self-managed instances add overhead that Kubernetes already abstracts better).
Section 03

Full Product Architecture

Service Boundaries

Verdict: Modular Monolith → Selective Microservices. Start with a modular monolith (single deployable, domain-separated packages) for Phases 0–2. Extract to separate services only at proven pain points, not ahead of time.

🔒 Enforcement Gateway

Extracted as a separate service from Day 1. Reason: must be independently scalable to 10k RPS, independently deployable (no big-bang redeploy for policy updates), and isolated so a bug in Chronix can't affect inline enforcement uptime.

🤖 ML Inference Service

Extracted from Day 1. Reason: Python/GPU runtime environment is incompatible with the Go enforcement binary. GPU node pools are expensive — co-locate all ML workloads in one service.

📊 Analytics Service (Phase 3)

Extract Oraxis onto ClickHouse when audit log exceeds 50M rows/month. OLAP queries for blast radius and cost attribution will kill the PostgreSQL primary if not separated.

🏛 Everything Else

Chronix, Nexion, Sentinel, Stellix, Vigil, and the Dashboard live in a single Go monolith until team size or throughput demands extraction. Premature microservices create distributed systems overhead without the team to manage it.

Scalability Model

Metric | Target | Strategy
Throughput | 10k RPS | Horizontal EKS pod scaling; Redis rate limiting per tenant
P95 Latency | <50ms | L1+L2 only inline; OPA bundle cached in-process; Redis local replica
Audit Events | ~864M/day at 10k RPS | Kafka buffering; TimescaleDB chunk compression; S3 tiering after 90 days
NHI Inventory | 10k identities | 4-hour scan interval; background worker pool; Redis cache for posture scores
Agent Sessions | 1k concurrent | Vigil Proxy stateful connections via Redis sorted sets; kill-switch via pub/sub
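The per-tenant Redis rate limiting in the scalability table is a token bucket. A minimal in-process sketch of the algorithm (the real implementation lives in Redis; rate and capacity here are illustrative, not product numbers):

```python
class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec up to `capacity`, spend one
    token per request. In production the counters live in Redis per tenant."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=2.0)
# Two-request burst succeeds, third is shed, refill restores service.
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])  # [True, True, False, True]
```

The same shape maps onto Redis with an atomic Lua script or `INCR` + TTL counters; clock input is explicit here so the behavior is deterministic.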

SLA & Reliability

⚠ 99.9% SLA Implication

99.9% = 8.77 hours downtime/year. For an inline enforcement path, this means every outage lets unscanned AI traffic through. The fail-open/fail-closed config is therefore not a convenience feature — it's a core SLA definition. Production should default fail-open for low-risk, fail-closed for high-risk (PHI tenants). Document this contractually.
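The downtime arithmetic behind that SLA figure, made explicit (assuming an average 8,766-hour year):

```python
def downtime_budget_hours(availability: float, period_hours: float = 8766.0) -> float:
    """Hours of permitted downtime for a given availability target."""
    return (1.0 - availability) * period_hours

print(round(downtime_budget_hours(0.999), 2))         # 8.77 hours/year
print(round(downtime_budget_hours(0.999, 730.0), 2))  # 0.73 hours per 730-hour month
```

Every one of those budgeted hours is unscanned AI traffic unless the fail-open/fail-closed posture is set deliberately per tenant.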

Failure Mode: Gateway Outage

Blast radius: All inline enforcement stops. Browser extensions fall through to allow-mode. API gateway returns 200 without DLP scan.
Detection: Heartbeat alert within 15 seconds; CloudWatch alarm on 5xx rate.
Recovery: EKS pod restart SLA: <45 seconds. Multi-AZ deployment ensures AZ failure doesn't cause full outage.

🗄 Key Data Model

-- Core event table (TimescaleDB hypertable)
Entity: enforcement_events
  - id: UUID [PK]
  - tenant_id: UUID [FK, RLS partition key]
  - timestamp: TIMESTAMPTZ [hypertable partition]
  - surface: ENUM(browser|api|desktop|agent|mcp)
  - actor_id: TEXT [user or NHI identifier]
  - ai_provider: TEXT [openai|anthropic|google|local]
  - decision: ENUM(allow|block|redact|challenge)
  - policy_ids: UUID[] [matched policies]
  - detection_layers: JSONB [{layer:L1, sig:DAN_001, conf:0.99}]
  - pii_detected: JSONB [{type:SSN, count:2, redacted:true}]
  - latency_ms: SMALLINT
  - INDEX: (tenant_id, timestamp DESC) -- compliance queries
  - INDEX: (tenant_id, decision, timestamp) -- alert feed
  - INDEX: (tenant_id, actor_id, timestamp) -- user risk scoring

Entity: policies
  - id: UUID [PK]
  - tenant_id: UUID [FK]
  - name: TEXT
  - surface: TEXT[] [applicable surfaces]
  - conditions: JSONB [OPA rego bundle reference]
  - action: ENUM(block|redact|warn|log|challenge)
  - enabled: BOOLEAN
  - version: INTEGER [optimistic locking]
  - simulated_impact: JSONB [last 30d retroactive analysis]

Entity: agent_sessions (Vigil)
  - id: UUID [PK]
  - tenant_id: UUID [FK]
  - agent_id: UUID [FK → registered_agents]
  - state: ENUM(open|active|inspecting|quarantined|terminated|closed)
  - execution_graph: JSONB [full hop trace]
  - memory_hash: TEXT [SHA-256 of memory store snapshot]
  - started_at, ended_at: TIMESTAMPTZ
  - kill_reason: TEXT

Entity: non_human_identities (Sentinel)
  - id: UUID [PK]
  - tenant_id: UUID [FK]
  - identity_type: ENUM(api_key|jwt|oauth|service_account|pat)
  - owner_team: TEXT
  - privilege_score: SMALLINT [0-100]
  - posture: ENUM(pass|warn|fail)
  - last_rotated_at: TIMESTAMPTZ
  - last_seen_ip: INET
  - anomaly_flags: JSONB
-- Critical API Endpoints (inline path only)

POST /api/v1/enforce
  Input: {surface, actor, prompt_hash, ai_provider, metadata}
  Output: {decision, action, redacted_content?, challenge_token?}
  Auth: Agent API key (scoped to tenant)
  SLA: P95 <50ms — circuit breaker at 100ms
  Rate: 10k RPS per tenant bucket (Redis token bucket)
  Note: This is the ONLY hot path. Optimize obsessively.

POST /api/v1/events/batch
  Input: [{event}] max 100 events
  Output: {accepted: N, failed: M}
  Note: Async ingestion path for cloud connectors. Publishes to Kafka, responds immediately.

GET /api/v1/policies/bundle
  Output: OPA bundle (tar.gz with .rego files)
  Cache: ETag + 304 Not Modified; agents poll every 30s
  Note: Policy bundles are compiled, not interpreted. Zero-downtime update via bundle versioning.

POST /api/v1/agents/{id}/sessions/{sid}/kill
  Input: {reason, escalate_to: "SOC"}
  Output: {terminated: true, correlated_events: [...]}
  Note: Publishes kill signal to Redis pub/sub. Vigil proxy subscribes; terminates connection immediately.
🔐 Security Model
  • All inter-service comms: mTLS with cert rotation via AWS ACM PCA
  • Enforcement events: encrypted at rest AES-256 in Postgres + S3 SSE
  • Audit logs: WORM-locked in S3 Object Lock (compliance evidence)
  • Top threat #1: Prompt injection to exfiltrate policy config → Policy data never sent to LLM; L2 detects instruction syntax in reverse direction
  • Top threat #2: Tenant data leakage via mis-scoped API key → Per-key tenant binding in JWT claims; Postgres RLS enforces at DB layer
  • Top threat #3: Enforcement bypass via direct API call to AI provider → Browser extension monitors XHR; desktop agent monitors process network; bypass logged as critical alert

Tech Decisions & Tradeoffs

⚡ DECISION: L3 AI Judge is ASYNC only — never in the inline path

LLM API calls (even flash/haiku) average 200–600ms on a warm connection. The PRD's stated "<50ms inline enforcement" and "<100ms L3" are physically contradictory. The inline path uses L1 (signature, <5ms) + L2 (heuristic, <15ms) + OPA policy evaluation (<1ms). L3 is triggered only when L2 confidence is 0.65–0.94 (ambiguous zone), and the interaction enters HOLD state pending async L3 resolution — like a credit card soft decline requiring additional verification.

L3 in inline path: Adds 200–600ms to every interaction. Unacceptable. Users would bypass the system. L3 via edge LLM (local model): Viable at Phase 4 — a quantized Mistral-7B on GPU nodes could do <80ms. Deferred due to GPU infra cost and complexity in MVP.

The HOLD state creates a user-visible delay for ~8% of interactions (estimated L2 ambiguous zone). This may frustrate users if hold time exceeds 2–3 seconds. Mitigation: CHALLENGE workflow gives users a reason ("Policy review: 2s") rather than an opaque pause.

If LLM inference latency drops to <50ms (plausible with edge models), revisit inlining L3. ONNX-quantized security-tuned models may enable this at Phase 4.
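The tiered inline decision above — L1 signatures, L2 heuristics, HOLD escalation for the 0.65–0.94 ambiguous band — can be sketched end to end. The signature set, heuristic markers, and block threshold below are illustrative stand-ins, not the production detection logic:

```python
from dataclasses import dataclass

# Illustrative L1 signature set; production uses compiled signature databases.
SIGNATURES = {"ignore previous instructions"}

@dataclass
class Verdict:
    decision: str  # allow | block | hold (hold = async L3 resolves later)
    layer: str

def l2_confidence(prompt: str) -> float:
    """Stand-in heuristic: fraction of suspicious markers present."""
    markers = ("system prompt", "exfiltrate", "bypass", "jailbreak")
    return sum(m in prompt.lower() for m in markers) / len(markers)

def enforce_inline(prompt: str) -> Verdict:
    if any(sig in prompt.lower() for sig in SIGNATURES):
        return Verdict("block", "L1")          # L1: exact signature hit
    conf = l2_confidence(prompt)
    if conf >= 0.95:
        return Verdict("block", "L2")          # L2: high-confidence heuristic
    if 0.65 <= conf < 0.95:
        return Verdict("hold", "L2")           # ambiguous zone -> HOLD for async L3
    return Verdict("allow", "L2")

print(enforce_inline("bypass the system prompt and exfiltrate data"))
```

The key property is that no branch of this function waits on a network call; L3 only ever consumes held interactions off a queue.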

🗄 DECISION: PostgreSQL + TimescaleDB over DynamoDB or ClickHouse

Compliance reports require JOIN-heavy queries across events, policies, NHI, and agents. DynamoDB's single-table design makes these queries painful and expensive. TimescaleDB's hypertable chunking gives O(log n) time-series performance without leaving the SQL ecosystem. PostgreSQL RLS provides multi-tenant data isolation at zero application-layer complexity.

DynamoDB: PRD specifies RDS but let's be explicit. DynamoDB cannot support the ad-hoc query patterns needed for Chronix gap analysis without a massive GSI footprint. ClickHouse: Ideal for Oraxis analytics at 50M+ events/month. Add as a read-replica analytics store at Phase 3, not Day 1 — schema is not stable enough yet.

At 864M events/day sustained (10k RPS), TimescaleDB chunk compression and S3 tiering after 90 days is critical. Without it, storage costs and query times degrade materially by month 4. Chunk compression must be tested during Phase 0 load testing.

At Phase 3, mirror enforcement_events to ClickHouse via Kafka consumer. Oraxis queries hit ClickHouse; all other modules stay on PostgreSQL. No disruptive migration required.

🚦 DECISION: Kafka over SQS for the event pipeline

The PRD specifies SQS/EventBridge. However, at 10k RPS x 1 event/request, that's 864M messages/day. SQS FIFO queues top out around 3,000 msg/sec even with batching, and standard queues, while nominally unlimited in throughput, provide no ordering and no replay. Kafka on MSK handles 1M+ msg/sec natively. More critically, Kafka's 7-day log retention enables forensic replay — if a new Luxion signature is deployed, you can reprocess last week's events to catch previously-missed attacks.

SQS: FIFO throughput ceiling too low; standard queues offer no ordering or replay. Kinesis: Valid alternative. Less Go ecosystem support than Kafka; shard management more manual. Switch to Kinesis if AWS-native tooling is a hard requirement. EventBridge: Keep for routing Nexion alerts to external systems (PagerDuty, Slack, SIEM) but not for high-throughput event ingestion.

Kafka on MSK adds operational complexity vs SQS. Partition count decisions made at cluster creation are hard to change. Set 64 partitions per topic (over-provision; cheaper than under-provisioning).

MSK Serverless removes partition management complexity at a cost premium (~2x per GB). Use MSK Serverless for MVP, evaluate dedicated MSK at Phase 2 when throughput patterns are known.
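The tenant_id partitioning above hinges on a stable key-to-partition mapping: the same tenant always lands on the same partition, preserving per-tenant ordering. A sketch of the idea (a production producer would use the Kafka client's keyed partitioner rather than hand-rolling this):

```python
import hashlib

PARTITIONS = 64  # over-provisioned per the decision above

def partition_for(tenant_id: str, partitions: int = PARTITIONS) -> int:
    """Deterministic tenant -> partition mapping via a stable hash, so the
    result does not vary across processes (unlike Python's builtin hash())."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions

# Same key always maps to the same partition; different tenants spread out.
assert partition_for("tenant-a") == partition_for("tenant-a")
```

This is also why the partition count is hard to change later: resizing reshuffles the mapping and breaks per-tenant ordering during the transition.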

🏗 DECISION: EKS over ECS Fargate for orchestration

ECS Fargate task scale-out takes 45–90 seconds (new task launch + health check). At 10k RPS burst traffic, that's too slow. EKS HPA with KEDA (Kubernetes Event-Driven Autoscaling) can scale pods in <15 seconds based on Kafka consumer lag metrics — which is exactly the right signal for the event pipeline. Karpenter handles node provisioning in <60 seconds.

ECS Fargate (PRD spec): Simpler ops, but the autoscaling SLA doesn't match the product's performance requirements. Reconsider for non-critical services (Chronix report generation). ECS on EC2: Self-managed instances add overhead that Kubernetes already abstracts better.

Use ArgoCD for GitOps deployments. Helm charts for all services. Cluster creation via Terraform. Budget 1 dedicated SRE/platform engineer starting Phase 1.

Deploy a dedicated EKS cluster in eu-west-1 for EU tenants. Enforce tenant routing at API Gateway (CloudFront → regional ALB). Postgres replicated to eu-west-1 Multi-AZ. No EU data touches us-east-1 workers.

Section 05

Technical Challenges

Challenge
🔥 Sub-50ms DLP at Scale
Why It Occurs
PII regex on raw prompt text is O(n) per pattern. At 10k RPS with ~2k-token prompts and 50+ PII patterns, a naive engine performs 500k full-prompt scans per second — on the order of a billion character comparisons per second.
Failure Mode
CPU saturation at ~3k RPS. P95 latency spikes to 200ms. Rate limiter kicks in; legitimate traffic is shed.
Mitigation
Use Hyperscan (Intel's regex engine via Go CGo binding) — designed for network DPI at multi-Gbps. Compile all PII patterns to a single Hyperscan database loaded at startup. 10-100x faster than RE2 for multi-pattern matching. Benchmark during Phase 0 load testing.
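Hyperscan's win is compiling all patterns into one database scanned in a single pass. The same single-pass idea can be sketched with stdlib `re` and a named alternation — a portable stand-in, nowhere near Hyperscan's speed, with a three-pattern subset standing in for the 50+ production set:

```python
import re

# Illustrative subset of PII patterns; production compiles 50+ into one
# Hyperscan database at startup.
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:\d[ -]?){15}\d\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

# One combined regex: each pattern becomes a named group, so a single pass
# over the input tests every pattern at once.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PII_PATTERNS.items()))

def scan(text: str) -> list[tuple[str, str]]:
    """Single pass; returns (pii_type, matched_text) for every hit."""
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]

print(scan("SSN 123-45-6789, contact a@b.com"))
# [('ssn', '123-45-6789'), ('email', 'a@b.com')]
```

The structural point carries over: pattern count stops multiplying scan cost, which is what makes the 50+ pattern budget survivable at 10k RPS.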
Challenge
⚠ Agent Memory Drift
Why It Occurs
RAG memory stores are updated continuously by agent tool calls. An adversarial document injected into the vector store shifts the embedding space gradually — no single event is suspicious, but the cumulative drift enables a semantic jailbreak over hours.
Failure Mode
Agent begins returning responses aligned with injected instructions days after the original poisoning. By the time L2 detects the output anomaly, the damage is done.
Mitigation
Hash-based snapshot on every memory write (SHA-256 of ordered embedding centroids). Semantic drift score: cosine similarity between current centroid distribution and T-24h baseline. Threshold alert at >0.08 drift. Purge + restore from last clean snapshot when threshold exceeded.
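The drift score in that mitigation is cosine distance between centroid snapshots. A minimal sketch with 2-dimensional toy embeddings (real embeddings are hundreds of dimensions; vectors and magnitudes here are illustrative):

```python
import math

def centroid(vectors: list[list[float]]) -> list[float]:
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def drift_score(baseline: list[float], current: list[float]) -> float:
    """1 - cosine similarity between the T-24h baseline centroid and the
    current centroid; 0 means identical direction, higher means more drift."""
    dot = sum(a * b for a, b in zip(baseline, current))
    norm = math.sqrt(sum(a * a for a in baseline)) * math.sqrt(sum(b * b for b in current))
    return 1.0 - dot / norm

DRIFT_THRESHOLD = 0.08  # alert + purge/restore above this, per the mitigation

base = centroid([[1.0, 0.0], [0.9, 0.1]])
poisoned = centroid([[1.0, 0.0], [0.2, 0.9]])  # one injected document shifts the space
print(drift_score(base, base) < DRIFT_THRESHOLD)      # True: stable store
print(drift_score(base, poisoned) > DRIFT_THRESHOLD)  # True: drift trips the alert
```

No single write is suspicious; only the centroid comparison against the 24-hour-old snapshot exposes the cumulative shift.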
Challenge
🚨 3AM Alert: Policy Bundle Race
Why It Occurs
Enforcement Gateway pods poll policy bundles every 30 seconds. During a rolling deploy, pods have different bundle versions simultaneously. A policy change during a rolling restart creates a 30–60 second window where 50% of pods are on old policy.
Failure Mode
A critical PHI blocking policy is deployed; for 60 seconds, half of requests bypass it. This is a HIPAA compliance incident, not just a bug.
Mitigation
Bundle version is included in every enforcement event log. Policy activation has two modes: "instant" (all pods hot-reload within 5s via Redis pub/sub) vs "rolling" (default). Critical PHI/PCI policies must use instant mode. Audit log captures policy version per decision for compliance evidence.
Challenge (PRD Blind Spot)
🔭 Browser Extension Trust
Why It Occurs
The PRD assumes the browser extension is the enforcement point for browser AI interactions. But a user can: (a) use a browser without the extension installed, (b) disable the extension, or (c) use a browser AI model that runs locally without network calls.
Failure Mode
Shadow AI usage via unmanaged devices or non-MDM browsers is completely undetected. The Stellix discovery metric shows 95% coverage, but it's really 95% of managed devices — a significant blind spot.
Mitigation
API Gateway enforcement is the backstop — all external AI API calls from the corporate network must route through the proxy (network policy enforcement via CASB rules or DNS filtering). Browser extension is defense-in-depth, not the primary control. This must be communicated to customers clearly.

Advanced Approaches

🌊 Event Sourcing for Audit

What it solves: Compliance evidence requires immutable, ordered history of every policy evaluation with full context. Traditional CRUD updates destroy this audit chain.

When worth it: From Day 1. The enforcement_events table IS an event store — append-only, never updated. This is event sourcing without the overhead of rebuilding projections for every read.

Adoption cost: Low. Design the data model correctly from the start (no UPDATE on events). The "projection" is Chronix reading the event stream to compute compliance scores.

✓ USE NOW
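The append-only log plus projection pattern in miniature — an event list that is only ever appended, and a Chronix-style fold that computes a compliance score from it (event fields and the scoring rule are illustrative):

```python
# Append-only event store: no UPDATE, no DELETE, ever.
EVENTS: list[dict] = []

def append(event: dict) -> None:
    EVENTS.append(event)

def compliance_score(events: list[dict]) -> float:
    """Projection: share of PII-bearing interactions that were enforced
    (blocked or redacted). Recomputed from the stream, never stored as truth."""
    pii_events = [e for e in events if e["pii_detected"]]
    if not pii_events:
        return 1.0
    enforced = [e for e in pii_events if e["decision"] in ("block", "redact")]
    return len(enforced) / len(pii_events)

append({"decision": "block", "pii_detected": True})
append({"decision": "allow", "pii_detected": False})
append({"decision": "allow", "pii_detected": True})  # the gap the projection surfaces
print(compliance_score(EVENTS))  # 0.5
```

Because the score is derived, a new compliance rule only requires re-reading the stream, never rewriting history.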

🏎 Edge-Deployed Policy Eval

What it solves: Browser extension currently sends to API Gateway for policy evaluation — adds ~50ms round-trip. Edge deployment of OPA WASM bundles in the extension itself could reduce this to <5ms local evaluation.

When worth it: Phase 4, when customer latency complaints emerge or when enterprises deploy Astral for latency-sensitive real-time coding assistants.

Overkill for: MVP. Policy bundle sync in the extension adds deployment complexity; WASM bundles are 2-5MB which slows extension updates through Chrome Web Store review.

⟳ EVALUATE AT PHASE 4

🧠 Fine-tuned Security LLM

What it solves: The L3 AI Judge uses a general-purpose LLM (GPT-4o/Claude Sonnet) for semantic threat analysis. A security-domain fine-tuned model (Mistral-7B + security corpus) could run on dedicated GPU nodes at <80ms — enabling inline L3.

When worth it: When monthly L3 API costs exceed $50k/month, or when customers require data-resident AI judgment (no prompts leaving tenant network).

Adoption cost: High. Requires curated training dataset, RLHF pipeline, model evaluation framework, and GPU serving infra. 3–4 months of ML team time.

⟳ EVALUATE AT $50K/MO THRESHOLD

📊 Feature Flags + Progressive Policy Delivery

What it solves: The policy simulator shows retroactive impact but can't predict future edge cases. Progressive policy rollout (10% of users → 25% → 100%) catches false positives before full deployment.

When worth it: Phase 2. When enterprise customers start deploying blocking policies and false positives become a customer success issue.

Implementation: LaunchDarkly SDK or self-hosted Unleash. Policy activation has a "canary" mode that routes N% of tenant traffic through the new policy.

⟳ EVALUATE PHASE 2

🔍 Vector DB for Semantic Search

What it solves: Compliance report generation requires "find all events related to PHI exposure" — a semantic query, not a keyword match. pgvector on PostgreSQL enables this without a separate vector database.

When worth it: Phase 3. Add pgvector to the existing PostgreSQL instance; embed enforcement events with a lightweight model (all-MiniLM-L6-v2). Enables natural language compliance queries.

⟳ PHASE 3 WITH pgvector

🔄 CQRS for Read/Write Separation

What it solves: Compliance report generation is a read-heavy, long-running operation that competes with high-frequency audit event writes on the same PostgreSQL primary.

When worth it: When report generation p95 exceeds 10 seconds, or when write throughput on the primary exceeds 50k events/second. Route all Chronix reads to a PostgreSQL read replica.

Overkill for: MVP. PostgreSQL Multi-AZ read replica is sufficient for Phase 0–2.

⟳ WHEN REPORTS SLOW DOWN

Section 07

Implementation Roadmap

Weeks 1–4
Phase 0 — Foundation
  • EKS cluster + VPC, us-east-1 + eu-west-1 (infra eng)
  • PostgreSQL 16 + TimescaleDB, schema v1 (backend)
  • Redis Cluster + MSK Kafka baseline config
  • Cognito + JWT auth service, API key lifecycle
  • OPA policy engine — bundle compile + serve pipeline
  • Event state machine: OBSERVED → LOGGED (minimal)
  • CI/CD: ArgoCD + GitHub Actions + Terraform state
  • Load testing harness: k6 @ 1k RPS baseline
Weeks 5–10
Phase 1 — Inline MVP
  • Chrome extension: DOM intercept + XHR hook (frontend)
  • Apexion enforcement engine: L1+L2, Hyperscan DLP (Go)
  • API Gateway proxy: HTTPS intercept, JWT validation
  • Policy CRUD API + 30-day simulation mode
  • Stellix: DNS telemetry + extension inventory
  • Basic Nexion: alert feed + ACK workflow
  • Chronix: HIPAA + GDPR gap analysis + PDF reports
  • Next.js dashboard: enforcement log + policy UI
  • Sales Demo Mode: 17 scenarios, client-side only
  • Load test: P95 <50ms @ 5k RPS. Gate to Phase 2.
Weeks 11–16
Phase 2 — Agent Runtime
  • Vigil proxy: sidecar deployment, agent registration API
  • Tool-call interception + MCP whitelist enforcement
  • Session state machine (OPEN→QUARANTINED→TERMINATED)
  • Kill-switch via Redis pub/sub
  • Execution graph rendering in Next.js dashboard
  • Indirect injection stripping (PDF/doc ingestion)
  • Desktop agent: macOS/Windows process scanner
  • Sentinel NHI: inventory + privilege scoring (passive)
Weeks 17–22
Phase 3 — Governance
  • Chronix: all 7 frameworks + article-level gaps
  • EU AI Act countdown + Art.6 classification workflow
  • Sentinel NHI: behavioral anomaly + auto-rotation
  • Oraxis: token cost attribution + burn rate
  • Executive Risk Dashboard + board reports
  • ClickHouse analytics store migration
  • SOC 2 Type II prep: audit evidence automation
  • eu-west-1 full regional isolation + data residency validation
Weeks 23–28
Phase 4 — Advanced
  • L3 AI Judge async service (FastAPI + LLM)
  • L4 deepfake detection (audio + image) on GPU nodes
  • Vigil memory integrity scanner + purge/restore
  • Blast radius simulator (Oraxis)
  • Policy simulator with canary rollout
  • Stellix supply chain scanner (CVE + dependency)
  • ARMOR integration (if available) for fine-tune detection
  • Edge OPA WASM bundles (evaluate)

⚡ Parallelization Opportunities

Track A: Enforcement

Go gateway + browser extension + API proxy. Can run fully independently from Track B after shared event schema is locked (end of Week 2). Team: 2 Go engineers + 1 frontend (extension).

Track B: Dashboard

Next.js dashboard + policy UI + Chronix gap analysis. Depends only on API contract (OpenAPI spec locked Week 2). Mock API for development. Team: 1 fullstack + 1 frontend.

Track C: Infra/Platform

EKS, Terraform, CI/CD, MSK, RDS, Redis. Runs in parallel Week 1 onwards. Must deliver baseline environment by end of Week 2. Team: 1 platform/SRE engineer.

Track D: ML (Phase 2+)

L3 AI Judge async service + L4 deepfake models. Starts Week 12, after core enforcement is stable. Depends on API contract for async challenge protocol. Team: 1 ML engineer.

🎯 Critical Path

The single thread that delays everything: The enforcement event schema and the OPA policy bundle format. Every other component depends on these two contracts — the browser extension, the API gateway, the Nexion alert feed, the Chronix compliance mapping, and the TimescaleDB schema all wire to this. If the event schema changes after Week 3, it cascades across 5 teams.

Lock the event schema and OpenAPI spec by end of Week 2. No exceptions. Use schema versioning (v1, v2) to allow additive changes without breaking existing consumers.

📦 Open Source & Acceleration

Tool | Purpose | Why This | Status
OPA + Rego | Policy engine | Sub-ms eval; git-native policy-as-code; WASM compilation for edge | Stable ✓
Hyperscan | Multi-pattern DLP regex | Network-speed regex; single DB compile for all PII patterns | Stable ✓
TimescaleDB | Time-series audit events | PostgreSQL-compatible; chunk compression; continuous aggregates | Stable ✓
Triton Inference Server | GPU model serving (L3/L4) | Dynamic batching; multi-model serving; gRPC interface for Go | Active — evaluate
Presidio (Microsoft) | PII/PHI entity recognition | Pretrained NER for 50+ PII types; augments regex with ML-based detection | Active — evaluate
Karpenter | EKS node auto-provisioning | Spot instance provisioning in <60s; GPU node pool management | Stable ✓
KEDA | Kafka-driven pod autoscaling | Scales pods on Kafka consumer lag — the right trigger for enforcement bursts | Stable ✓
pgvector | Semantic search on events (Phase 3) | No separate vector DB; works in the existing PostgreSQL instance | Active — evaluate


Shortcuts & Non-Negotiables

✅ Acceptable MVP Shortcuts

Regex-only DLP in v1

Ship the L1 signature layer and rule-based L2 heuristics without ML-backed detection models. Covers 90%+ of common PII patterns. Cleanup trigger: first false-negative customer complaint or when ML infra is available.

Polling for Policy Updates

Agents poll for policy bundles every 30s rather than push. Adds up to 30s lag on policy activation. Cleanup trigger: customer needs sub-5s policy propagation (implement Redis pub/sub push).
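The polling shortcut leans on the ETag + 304 mechanics specified for GET /api/v1/policies/bundle. A toy sketch with the HTTP layer faked out (class names and bundle bytes are illustrative):

```python
class BundleServer:
    """Stand-in for the policy bundle endpoint: 304 when the ETag matches."""
    def __init__(self):
        self.version, self.bundle = "v1", b"rego-bundle-v1"

    def get(self, if_none_match):
        if if_none_match == self.version:
            return 304, None, self.version   # unchanged: no body transferred
        return 200, self.bundle, self.version

class Agent:
    """Polls every 30s (the loop is elided); caches bundle + ETag."""
    def __init__(self):
        self.etag, self.bundle = None, None

    def poll(self, server) -> int:
        status, body, etag = server.get(self.etag)
        if status == 200:
            self.bundle, self.etag = body, etag
        return status

server, agent = BundleServer(), Agent()
print(agent.poll(server))  # 200: first fetch downloads the bundle
print(agent.poll(server))  # 304: steady state is nearly free
server.version, server.bundle = "v2", b"rego-bundle-v2"
print(agent.poll(server))  # 200: new version picked up within one poll interval
```

The worst-case propagation lag is exactly one poll interval, which is why the cleanup trigger is a sub-5s requirement: push via Redis pub/sub, don't shrink the interval.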

Single-Region MVP

Launch us-east-1 only. EU region required only for first EU enterprise customer. Cleanup trigger: first EU customer signed. Allow 6 weeks for eu-west-1 deployment.

🚫 Unacceptable Shortcuts

Skipping Multi-Tenant RLS

Tempting to add tenant_id to WHERE clauses in application code. This is a trap. One missing WHERE clause leaks all tenants' data. PostgreSQL RLS enforces isolation at the DB layer — no application bug can bypass it. Must be right from Day 1.
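As a sketch of what DB-layer isolation looks like in PostgreSQL — table and column names follow the surrounding text, but `app.tenant_id` and the policy name are hypothetical:

```sql
-- Isolation lives in the database, not in application WHERE clauses.
ALTER TABLE enforcement_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE enforcement_events FORCE ROW LEVEL SECURITY;  -- applies even to the table owner

CREATE POLICY tenant_isolation ON enforcement_events
    USING (tenant_id = current_setting('app.tenant_id')::uuid)
    WITH CHECK (tenant_id = current_setting('app.tenant_id')::uuid);

-- The application sets the tenant once per transaction:
--   SET LOCAL app.tenant_id = '<tenant uuid>';
-- Every SELECT and INSERT is then filtered by the database itself;
-- a forgotten WHERE clause in application code can no longer leak data.
```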

Mutable Audit Events

No UPDATE or DELETE on enforcement_events. Ever. Compliance evidence requires immutable, append-only records. Even "corrections" must be new events with a reference to the corrected event.
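A hedged sketch of enforcing append-only semantics in the database itself, rather than by convention — the role name `app_writer` and the `corrects_event_id` column are hypothetical:

```sql
-- First line of defense: the application role simply lacks the privileges.
REVOKE UPDATE, DELETE ON enforcement_events FROM app_writer;

-- Belt-and-braces: reject mutation even from over-privileged roles.
CREATE OR REPLACE FUNCTION reject_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'enforcement_events is append-only; write a correction event instead';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER enforcement_events_immutable
    BEFORE UPDATE OR DELETE ON enforcement_events
    FOR EACH ROW EXECUTE FUNCTION reject_mutation();

-- Corrections reference the original rather than rewriting it:
--   ALTER TABLE enforcement_events ADD COLUMN corrects_event_id uuid NULL;
```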

Synchronous L3 AI Judge

If the team is tempted to put the LLM call in the hot path "just for now" — don't. The latency regression will be immediate, the removal will be deprioritized forever, and customers will build workflows around the slow behavior.
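The safe shape is to enqueue for L3 review without ever blocking the inline decision. A minimal Go sketch, using a buffered channel as a stand-in for the Kafka producer (`enforceInline` and `reviewQueue` are illustrative names):

```go
package main

import (
	"fmt"
	"time"
)

// reviewQueue stands in for the Kafka topic feeding the async L3 judge.
// A buffered channel plus a non-blocking send means a slow or backed-up
// judge can never add latency to the inline path — overflow events go to
// a DLQ (omitted here) instead of stalling the request.
var reviewQueue = make(chan string, 1024)

func enforceInline(prompt string) (decision string) {
	// L1/L2 fast path only: signatures, OPA eval, rate checks (simulated).
	decision = "allow"

	select {
	case reviewQueue <- prompt: // enqueue for async L3 review
	default: // queue full: record the drop and move on, never block
	}
	return decision
}

func main() {
	start := time.Now()
	d := enforceInline("summarize this contract")
	fmt.Println(d, "inline path under 50ms:", time.Since(start) < 50*time.Millisecond)
}
```

If the L3 verdict later contradicts the inline decision, that is a new enforcement event referencing the original, consistent with the append-only audit model.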

🔒 Non-Negotiables from Day 1

Event Schema Lock by Week 2

The enforcement_events schema is the contract between 5 services. Changing it after Week 3 is a coordination disaster. Lock it. Version it. Add fields additively only.

TLS 1.3 Everywhere

All inter-service communication: mTLS. All client connections: TLS 1.3 minimum. Retrofitting this in a security product is a marketing disaster, not just technical debt.

Load Testing Gate between Phases

No Phase 2 start until Phase 1 passes: P95 <50ms @ 5k RPS under k6 load test. This isn't optional. Ship a product that can't hit its stated SLA and the sales team is lying to customers.

Observability Stack on Day 1

OpenTelemetry tracing on every enforcement hop. Prometheus metrics exported. Grafana dashboards. Without this, debugging latency regressions or finding the "3am incident" is impossible.

Section 09 — Final Critique

Honest Critique

⚠ Top 3 Failure Modes

1. The 50ms SLA is Aspirational

The <50ms target for inline enforcement requires Hyperscan DLP + OPA eval + Redis rate check + async Kafka publish — all under 50ms including network overhead to the enforcement gateway. Under cold-cache conditions on a fresh EKS pod, it won't be met. The first time a customer sees a >50ms enforcement decision and screenshots their Network tab, it becomes a sales problem. Mitigation: define the SLA as P95 <50ms (not mean), pre-warm pods with a 5-minute JIT warm-up period, and instrument every component with a latency breakdown in the Oraxis dashboard.
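Since the SLA is defined on P95 rather than the mean, it is worth being precise about the computation. A small Go sketch using the nearest-rank method (function and variable names are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th-percentile latency (ms) by nearest rank — the
// number the SLA is gated on, not the mean. Integer arithmetic for the
// rank avoids float rounding at exact multiples (e.g. n=100).
func p95(samplesMs []float64) float64 {
	s := append([]float64(nil), samplesMs...)
	sort.Float64s(s)
	rank := (95*len(s)+99)/100 - 1 // ceil(0.95 * n) - 1, zero-indexed
	if rank < 0 {
		rank = 0
	}
	return s[rank]
}

func main() {
	// 100 requests: 97 fast, 3 cold-cache outliers. The mean would be
	// dragged up by the outliers; P95 still clears the <50ms gate.
	samples := make([]float64, 0, 100)
	for i := 0; i < 97; i++ {
		samples = append(samples, 12)
	}
	samples = append(samples, 180, 220, 400)
	fmt.Println("P95 passes <50ms gate:", p95(samples) < 50)
}
```

This is also why cold-start outliers must stay under ~5% of traffic: one more percentage point of slow pods and the same fleet fails the gate.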

2. Multi-Tenant Decision Debt

OQ-04 (shared vs dedicated) is marked "open question" but the architecture decision must be made in Week 1 — it affects the Postgres schema, the Kafka topic structure, the Cognito pool design, and the AWS account strategy. If it's answered wrong (shared when a customer needs dedicated), migration is a 6-month effort. Recommend: ship shared multi-tenant, sell dedicated as Enterprise tier at 2x price, budget the migration cost into the enterprise deal.

3. EU AI Act Classification Liability

Stellix auto-classifying AI systems as "Minimal Risk" when they should be "High Risk" creates legal liability for the customer — and potentially for Zorianto as the tool that enabled the misjudgment. The PRD frames this as a discovery feature, but compliance officers will treat it as a compliance decision. Every auto-classification must carry a "pending legal review" state and a prominent disclaimer. Build the human-confirmation workflow into the Stellix UI before any customer uses the EU AI Act features in production.
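One way to make the "pending legal review" requirement concrete is a small state machine in which no classification reaches a final state without human review — a Go sketch with hypothetical state names:

```go
package main

import "fmt"

// State models the review lifecycle of a Stellix auto-classification.
// An ML-suggested risk tier is never usable as compliance evidence
// until a human confirms or overrides it.
type State string

const (
	AutoSuggested      State = "auto_suggested"       // model output only
	PendingLegalReview State = "pending_legal_review" // surfaced to compliance with disclaimer
	HumanConfirmed     State = "human_confirmed"      // usable as evidence
	HumanOverridden    State = "human_overridden"     // reviewer changed the tier
)

// allowed encodes that every path to a final state passes through review.
var allowed = map[State][]State{
	AutoSuggested:      {PendingLegalReview},
	PendingLegalReview: {HumanConfirmed, HumanOverridden},
}

func canTransition(from, to State) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(AutoSuggested, HumanConfirmed))      // no shortcut past review
	fmt.Println(canTransition(PendingLegalReview, HumanConfirmed)) // review can finalize
}
```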

🚀 With 10x Resources

What Would Change
  • Build a fine-tuned security LLM from Day 1, enabling inline L3 at <40ms on dedicated GPU nodes
  • Ship all 9 modules simultaneously with dedicated teams per module
  • Deploy in 5 AWS regions on Day 1 (US, EU, APAC, CA, UK)
  • Build a separate red-team team to continuously attack the platform before GA
  • Hire a dedicated legal team to do EU AI Act classification alongside the ML model — eliminating the liability risk
What Would Stay the Same
  • Go for the enforcement service — no amount of resources changes the correctness of this choice
  • OPA for policy engine — battle-tested, correct choice regardless of scale
  • Append-only audit event model — non-negotiable for compliance evidence
  • Kafka over SQS for the event pipeline at this throughput
  • PostgreSQL + TimescaleDB as the operational database — OLAP can migrate later

📊 Key Success Metrics

  • <50ms — P95 inline enforcement
  • <3% — False positive rate (DLP)
  • 95%+ — Shadow AI detection
  • <30min — MTTR on critical alerts
SELF-REVIEW COMPLETE — 2 cycles, 7 amendments.
  • L3 async-only architecture clearly separated from inline path
  • Kafka substituted for SQS with explicit rationale
  • EKS over ECS Fargate with autoscaling justification
  • Multi-tenant OQ-04 resolution added to assumptions
  • EU AI Act classification liability warning added
  • Policy bundle race condition failure mode added
  • Browser extension trust model blind spot documented