Frequently Asked Questions#
Common questions about WatchLLM, answered.
General#
What is WatchLLM?#
WatchLLM is an edge proxy that sits between your application and AI providers (OpenAI, Anthropic, Groq). It automatically caches repetitive queries using semantic similarity, reducing API costs by 30-50% without requiring code changes.
How does semantic caching work?#
Instead of only matching exact prompts, WatchLLM understands the meaning behind your prompts. For example, "What is Python?" and "Explain what Python is" would match semantically and return the cached response, saving you an API call.
Which AI providers are supported?#
WatchLLM supports:
- OpenAI — GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo
- Anthropic — Claude 3 Opus, Sonnet, Haiku
- Groq — Llama, Mixtral, Gemma
- 100+ models via OpenRouter integration
Is WatchLLM open source?#
Yes! WatchLLM is open source and available on GitHub. You can self-host it or use our managed service.
Integration#
How long does integration take?#
Most developers integrate WatchLLM in under 5 minutes. It's a 3-line code change — just update your baseURL and apiKey in your existing OpenAI SDK setup.
Do I need to change my code?#
Minimal changes. WatchLLM is a drop-in replacement for the OpenAI API. You only need to:
- Change baseURL to https://proxy.watchllm.dev/v1
- Use your WatchLLM API key instead of your provider key
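A minimal sketch of that change with the official openai Node SDK (the environment variable name and model below are placeholders):

```ts
import OpenAI from "openai";

// Point the existing OpenAI client at the WatchLLM proxy instead of api.openai.com.
const client = new OpenAI({
  baseURL: "https://proxy.watchllm.dev/v1",
  apiKey: process.env.WATCHLLM_API_KEY, // your WatchLLM key, e.g. lgw_proj_...
});

// Everything else stays the same.
const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is Python?" }],
});
console.log(completion.choices[0].message.content);
```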
Does it work with streaming responses?#
Yes! WatchLLM fully supports streaming (stream: true). Cached responses are streamed back with the same chunked format as the original provider response.
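A sketch of a streaming call through the proxy, assuming the same client setup as above:

```ts
// Cached responses are replayed as chunks, so this loop works for hits and misses alike.
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain semantic caching in one paragraph." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```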
Does it work with function calling / tools?#
Yes. Function calling, tool use, and structured outputs are fully supported and cached correctly.
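As an illustration, a standard tools request passes through the proxy unchanged; this sketch reuses the client configured earlier, and the get_weather tool is hypothetical:

```ts
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What's the weather in Berlin?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather", // hypothetical tool, for illustration only
        description: "Look up the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
});

// Tool calls come back exactly as the provider returned them.
console.log(response.choices[0].message.tool_calls);
```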
Can I use it with LangChain / LlamaIndex?#
Yes. Since WatchLLM is OpenAI-compatible, it works with any framework that uses the OpenAI SDK, including LangChain, LlamaIndex, Vercel AI SDK, and others.
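For example, with LangChain's @langchain/openai package, pointing the model at the proxy is a configuration change (a sketch; only the baseURL and the key are WatchLLM-specific):

```ts
import { ChatOpenAI } from "@langchain/openai";

// Any framework built on the OpenAI SDK just needs the proxy baseURL and a WatchLLM key.
const model = new ChatOpenAI({
  model: "gpt-4o",
  apiKey: process.env.WATCHLLM_API_KEY,
  configuration: { baseURL: "https://proxy.watchllm.dev/v1" },
});

const reply = await model.invoke("What is Python?");
console.log(reply.content);
```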
Pricing & Billing#
Is there a free tier?#
Yes! The Free plan includes 50,000 requests/month with no credit card required. It's enough for development and small projects.
What counts as a request?#
Every API call to the proxy counts as one request, regardless of whether it's a cache hit or miss. This includes chat completions, text completions, and embedding requests.
What happens when I exceed my monthly quota?#
- Free plan: Switches to cache-only mode (only serves cached responses, no new upstream calls)
- Paid plans: Overage billing at $0.40-$0.50 per 1,000 additional requests, up to the plan's overage cap
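For example, at the $0.40 rate, going 10,000 requests over quota would add $4.00 (10 × $0.40) to that month's bill, provided the overage stays within your plan's cap.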
Can I cancel anytime?#
Yes. Cancel your subscription at any time from the dashboard. Your plan stays active until the end of the current billing period.
Caching#
How do I know if a response was cached?#
Check the response headers:
- X-WatchLLM-Cache: HIT (exact cache match)
- X-WatchLLM-Cache: HIT-SEMANTIC (semantic similarity match)
- X-WatchLLM-Cache: MISS (new request, forwarded to the provider)
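A quick way to inspect the header is to call the proxy's chat completions endpoint directly with fetch (since WatchLLM is OpenAI-compatible, the path mirrors the OpenAI API):

```ts
const res = await fetch("https://proxy.watchllm.dev/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.WATCHLLM_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: "What is Python?" }],
  }),
});

// Prints HIT, HIT-SEMANTIC, or MISS.
console.log(res.headers.get("X-WatchLLM-Cache"));
```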
Can I control the cache sensitivity?#
Yes. Adjust the similarity threshold in your project settings (default: 95%). Lower values = more aggressive caching, higher values = stricter matching.
Can I clear the cache?#
Yes. You can clear your project's cache from the dashboard under Settings → Cache Management.
How long are responses cached?#
Cache TTL (time-to-live) is configurable per project. Default is 24 hours. Adjust it in Settings → Cache Configuration.
Does caching work across different users?#
Yes, by default. If two different users send semantically similar prompts, the second user gets the cached response. You can scope caching per user by including a user identifier in your requests.
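One way to attach that identifier is the standard OpenAI user field; whether WatchLLM keys the cache on this particular field is an assumption in this sketch, so check your project settings:

```ts
// Attaching a per-user identifier so cached responses are not shared across users.
// Using the OpenAI `user` field for this is an assumption, for illustration only.
const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Summarize my open support tickets." }],
  user: "user_12345", // hypothetical end-user ID
});
```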
Security#
Is my data secure?#
Yes. All data is encrypted in transit (TLS 1.3) and at rest (AES-256). See our Security docs for full details.
Are my API keys safe?#
WatchLLM API keys are stored securely. BYOK provider keys are encrypted with AES-256. Keys are never logged or exposed in response headers.
Where is my data stored?#
- Cache: Cloudflare's global edge network and Upstash Redis
- Database: Supabase (PostgreSQL) with Row Level Security
- Logs: Retained per your plan's data retention policy
Is WatchLLM GDPR compliant?#
Yes. We support data export, account deletion, and data portability. See our Security & Privacy documentation for details.
Technical#
What's the latency overhead?#
- Cache hit: Typically 5-15ms overhead (faster than provider calls)
- Cache miss: Approximately 20-50ms overhead for cache lookup + logging
- WatchLLM runs on Cloudflare's edge network, close to your users
Does it support multiple projects?#
Yes. Create separate projects in the dashboard, each with its own API keys, cache settings, and analytics.
Can I self-host WatchLLM?#
Yes! See our Self-Hosting Guide for detailed deployment instructions. You can run the entire stack on your own infrastructure.
What happens if WatchLLM is down?#
If the proxy experiences issues, requests will fail with a 503 error. We recommend implementing fallback logic in your application to call providers directly in case of proxy unavailability.
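A minimal sketch of that fallback, assuming you keep both a WatchLLM key and a direct provider key available:

```ts
import OpenAI from "openai";

const proxy = new OpenAI({
  baseURL: "https://proxy.watchllm.dev/v1",
  apiKey: process.env.WATCHLLM_API_KEY,
});
const direct = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function chatWithFallback(
  params: OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming,
) {
  try {
    // Prefer the proxy so cache hits still save cost and latency.
    return await proxy.chat.completions.create(params);
  } catch (err) {
    // On a 503 (or any proxy failure), call the provider directly.
    console.warn("WatchLLM proxy unavailable, falling back to the provider:", err);
    return await direct.chat.completions.create(params);
  }
}
```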
Troubleshooting#
My cache hit rate is low#
- Check your similarity threshold — it might be too strict (try 90-92%)
- Ensure prompts are consistent in format
- Review the Troubleshooting guide for optimization tips
I'm getting 429 errors#
You're hitting your rate limit. See Rate Limits for details on limits per plan and how to handle them.
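A minimal retry sketch for 429s; honoring a Retry-After header is an assumption here, so it falls back to exponential backoff when the header is absent:

```ts
// Retry a request on 429, waiting for Retry-After when present, otherwise backing off.
async function fetchWithRetry(url: string, init: RequestInit, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const retryAfter = Number(res.headers.get("Retry-After"));
    const delayMs = retryAfter > 0 ? retryAfter * 1000 : 2 ** attempt * 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```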
My API key isn't working#
- Verify the key starts with lgw_proj_ or lgw_test_
- Check that the key is active in your dashboard
- Ensure you're using the correct baseURL
- See Authentication for troubleshooting
Still Have Questions?#
- Email: kiwi092020@gmail.com
- GitHub: github.com/kaadipranav/WATCHLLM
- Discord: Join our community for real-time help