# Architecture
WatchLLM is built for high-performance semantic caching and observability at the global edge.
## System Overview
The system is architected to minimize latency while providing robust governance over AI usage.
### 1. Edge Proxy (Cloudflare Workers)
The core "brain" of WatchLLM. It handles all incoming API requests under /v1/*.
- Validation: Performs millisecond-latency API key validation and plan lookup via Supabase.
- Normalization: Standardizes prompt message arrays and parameters into a canonical form, which is hashed for cache lookups (see the hashing sketch after this list).
- Semantic Engine: Integrates with Pinecone or D1-backed vector storage to perform semantic similarity checks on prompt embeddings.
- Provider Gateway: A multi-client router that handles authentication and response mapping for OpenAI, Anthropic, and Groq.
- Streaming: Implements a passthrough transformer that injects headers and logs usage without delaying the first byte to the client (see the streaming sketch after this list).
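
The normalization step can be thought of as "stable-stringify, then digest." Below is a minimal sketch assuming an OpenAI-style chat body; the canonicalHash helper and the exact set of fields it folds in are illustrative, not the shipped implementation:

```ts
// Illustrative canonical hashing for cache keys. Only fields that affect the
// completion are included; the field set here is an assumption for the sketch.
interface ChatBody {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
}

async function canonicalHash(body: ChatBody): Promise<string> {
  const canonical = JSON.stringify({
    model: body.model,
    // Trim incidental whitespace so formatting noise doesn't defeat cache hits.
    messages: body.messages.map((m) => ({ role: m.role, content: m.content.trim() })),
    temperature: body.temperature ?? 1,
    max_tokens: body.max_tokens ?? null,
  });
  // Workers expose the Web Crypto API, so SHA-256 needs no imports.
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(canonical));
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
}
```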
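
For streaming, a passthrough of this kind can be built on the standard TransformStream API available in Workers. A sketch under that assumption; the onDone callback is a hypothetical hook for the usage logger:

```ts
// Illustrative streaming passthrough. The upstream body is forwarded chunk by
// chunk while byte counts are tallied for the usage log; because nothing is
// buffered, time-to-first-byte is unaffected.
function passthrough(upstream: Response, onDone: (bytes: number) => void): Response {
  let bytes = 0;
  const counter = new TransformStream<Uint8Array, Uint8Array>({
    transform(chunk, controller) {
      bytes += chunk.byteLength;   // tally, but forward immediately
      controller.enqueue(chunk);
    },
    flush() {
      onDone(bytes);               // fires after the last upstream chunk
    },
  });
  // Copy upstream headers and add cache/trace metadata before the first byte.
  const headers = new Headers(upstream.headers);
  headers.set('X-WatchLLM-Cache', 'MISS');
  return new Response(upstream.body!.pipeThrough(counter), {
    status: upstream.status,
    headers,
  });
}
```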
### 2. Management Console (Next.js)
The unified dashboard for controlling your AI infrastructure.
- Project Governance: Set per-project rate limits, semantic thresholds, and provider configurations.
- Observability: Real-time visualization of usage logs, powered by a highly indexed PostgreSQL backend.
- A/B Engine: Configure traffic splits and variant rules for live model testing (see the sketch after this list).
- Billing & Subscriptions: Integrated with Stripe for automated seat and usage-based billing.
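
A traffic split like this typically reduces to sticky weighted selection. A minimal sketch; the Variant shape and the idea of hashing a per-caller key are assumptions about how such an engine could be wired, not the console's actual schema:

```ts
// Illustrative variant picker for A/B splits. Hashing a stable per-caller key
// makes the assignment sticky: the same caller always lands on the same variant.
interface Variant { model: string; weight: number } // weights assumed to sum to 100

function pickVariant(variants: Variant[], stickyKey: string): Variant {
  // Cheap 32-bit FNV-1a string hash; any stable hash works here.
  let h = 0x811c9dc5;
  for (const c of stickyKey) {
    h ^= c.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  let point = h % 100;
  for (const v of variants) {
    if (point < v.weight) return v;
    point -= v.weight;
  }
  return variants[variants.length - 1]; // guard against rounding gaps
}
```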
### 3. Data & Infrastructure Layer
- Persistent Source of Truth (Supabase): Stores project configurations, API keys, and detailed usage_logs.
- Global In-Memory Cache (Upstash Redis): Provides sub-10ms response times for deterministic prompt hits and rate-limiting counters (see the sketch after this list).
- Vector Storage (Cloudflare D1): Stores prompt embeddings and their associated response data for semantic retrieval.
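
On the deterministic path, the canonical hash maps directly to a cached response body in Redis. A sketch using the @upstash/redis Cloudflare client; the cache: key prefix and the one-hour TTL are assumed defaults for illustration:

```ts
import { Redis } from '@upstash/redis/cloudflare';

// Illustrative exact-match cache keyed by the canonical prompt hash.
interface RedisEnv {
  UPSTASH_REDIS_REST_URL: string;
  UPSTASH_REDIS_REST_TOKEN: string;
}

async function exactLookup(env: RedisEnv, hash: string): Promise<string | null> {
  const redis = Redis.fromEnv(env);
  return redis.get<string>(`cache:${hash}`);
}

async function exactStore(env: RedisEnv, hash: string, body: string): Promise<void> {
  const redis = Redis.fromEnv(env);
  // ex sets a TTL in seconds; one hour is an assumption, not a product default.
  await redis.set(`cache:${hash}`, body, { ex: 3600 });
}
```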
## Data Flow (Request Lifecycle)
1. Ingress: Client sends an OpenAI-compatible request to the Proxy domain.
2. Auth & Limits: Proxy validates the lgw_key and checks rate limits in Redis.
3. Cache Lookup:
   - Step A: Exact match check in Redis.
   - Step B: (If Step A misses) Semantic similarity check in the vector store.
4. Upstream Call: (If the cache misses) Proxy forwards the request to the configured provider.
5. Egress: Proxy returns the response with X-WatchLLM-* headers and asynchronously logs the event for analytics.
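
Put together, the lifecycle reads as a short chain of early returns. The sketch below is illustrative glue, typed loosely: validateKey, checkRateLimit, semanticLookup, and forwardToProvider are hypothetical stand-ins for the components above, while canonicalHash, exactLookup, passthrough, and logUsage refer to the other sketches on this page:

```ts
// Illustrative request lifecycle for the edge proxy. All helpers are hypothetical
// stand-ins wired to the numbered stages above.
export default {
  async fetch(req: Request, env: any, ctx: ExecutionContext): Promise<Response> {
    // 1-2. Ingress, auth & limits.
    const project = await validateKey(req.headers.get('Authorization'), env);
    if (!project) return new Response('invalid key', { status: 401 });
    if (!(await checkRateLimit(project.id, env))) {
      return new Response('rate limited', { status: 429 });
    }

    // 3. Cache lookup: exact hash first (Step A), semantic fallback (Step B).
    const body = (await req.clone().json()) as ChatBody;
    const hash = await canonicalHash(body);
    const cached = (await exactLookup(env, hash)) ?? (await semanticLookup(env, body));
    if (cached) {
      ctx.waitUntil(logUsage(project.id, { cached: true }, env)); // out-of-band
      return new Response(cached, { headers: { 'X-WatchLLM-Cache': 'HIT' } });
    }

    // 4-5. Upstream call and egress: stream through, log asynchronously.
    const upstream = await forwardToProvider(body, project, env);
    ctx.waitUntil(logUsage(project.id, { cached: false }, env));
    return passthrough(upstream, (bytes) => console.log('streamed bytes:', bytes));
  },
};
```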
## Performance & Scaling
- Cold Start Optimization: The Worker bundle is kept minimal to ensure sub-100ms cold starts across Cloudflare's global network.
- Asynchronous Logging: Usage logging happens out of band, so it never blocks the prompt response to the client (see the sketch below).
- Plan-Based Routing: Traffic is prioritized based on the project's subscription tier.
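
In Workers, "out of band" maps directly onto ctx.waitUntil, which keeps a promise alive after the response has been returned to the client. A sketch of the logging side; the Supabase REST call follows the standard PostgREST convention, but the usage_logs column names are assumptions:

```ts
// Illustrative fire-and-forget usage logger. The endpoint follows Supabase's
// PostgREST convention (/rest/v1/<table>); the row shape is an assumption.
interface LogEnv {
  SUPABASE_URL: string;
  SUPABASE_SERVICE_KEY: string;
}

async function logUsage(projectId: string, entry: { cached: boolean }, env: LogEnv): Promise<void> {
  await fetch(`${env.SUPABASE_URL}/rest/v1/usage_logs`, {
    method: 'POST',
    headers: {
      apikey: env.SUPABASE_SERVICE_KEY,
      Authorization: `Bearer ${env.SUPABASE_SERVICE_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      project_id: projectId,
      cached: entry.cached,
      created_at: new Date().toISOString(),
    }),
  });
}

// Inside the fetch handler, ctx.waitUntil(logUsage(id, { cached: false }, env))
// returns control to the client immediately while the insert completes.
```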