Frequently Asked Questions#
Common questions about WatchLLM, answered.
General#
What is WatchLLM?#
WatchLLM is an edge proxy that sits between your application and AI providers (OpenAI, Anthropic, Groq). It automatically caches repetitive queries using semantic similarity, reducing API costs by 30-50% without requiring code changes.
How does semantic caching work?#
Instead of only matching exact prompts, WatchLLM understands the meaning behind your prompts. For example, "What is Python?" and "Explain what Python is" would match semantically and return the cached response, saving you an API call.
Which AI providers are supported?#
WatchLLM supports:
- OpenAI — GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo
- Anthropic — Claude 3 Opus, Sonnet, Haiku
- Groq — Llama, Mixtral, Gemma
- 100+ models via OpenRouter integration
Is WatchLLM open source?#
Yes! WatchLLM is open source and available on GitHub. You can self-host it or use our managed service.
Integration#
How long does integration take?#
Most developers integrate WatchLLM in under 5 minutes. It's a 3-line code change — just update your baseURL and apiKey in your existing OpenAI SDK setup.
Do I need to change my code?#
Minimal changes. WatchLLM is a drop-in replacement for the OpenAI API. You only need to:
- Change baseURL to https://proxy.watchllm.dev/v1
- Use your WatchLLM API key instead of your provider key
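A minimal sketch of that change with the official openai Node SDK (the environment variable name and model below are placeholders):

```ts
import OpenAI from "openai";

// Point the existing OpenAI client at the WatchLLM proxy instead of api.openai.com.
const client = new OpenAI({
  baseURL: "https://proxy.watchllm.dev/v1",
  apiKey: process.env.WATCHLLM_API_KEY, // your WatchLLM key, e.g. lgw_proj_...
});

// Everything else stays the same.
const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is Python?" }],
});
console.log(completion.choices[0].message.content);
```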
Does it work with streaming responses?#
Yes! WatchLLM fully supports streaming (stream: true). Cached responses are streamed back with the same chunked format as the original provider response.
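A sketch of a streaming call through the proxy, assuming the same client setup as above:

```ts
// Cached responses are replayed as chunks, so this loop works for hits and misses alike.
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain semantic caching in one paragraph." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```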
Does it work with function calling / tools?#
Yes. Function calling, tool use, and structured outputs are fully supported and cached correctly.
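As an illustration, a standard tools request passes through the proxy unchanged; this sketch reuses the client configured earlier, and the get_weather tool is hypothetical:

```ts
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What's the weather in Berlin?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather", // hypothetical tool, for illustration only
        description: "Look up the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
});

// Tool calls come back exactly as the provider returned them.
console.log(response.choices[0].message.tool_calls);
```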
Can I use it with LangChain / LlamaIndex?#
Yes. Since WatchLLM is OpenAI-compatible, it works with any framework that uses the OpenAI SDK, including LangChain, LlamaIndex, Vercel AI SDK, and others.
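For example, with LangChain's @langchain/openai package, pointing the model at the proxy is a configuration change (a sketch; only the baseURL and the key are WatchLLM-specific):

```ts
import { ChatOpenAI } from "@langchain/openai";

// Any framework built on the OpenAI SDK just needs the proxy baseURL and a WatchLLM key.
const model = new ChatOpenAI({
  model: "gpt-4o",
  apiKey: process.env.WATCHLLM_API_KEY,
  configuration: { baseURL: "https://proxy.watchllm.dev/v1" },
});

const reply = await model.invoke("What is Python?");
console.log(reply.content);
```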
Pricing & Billing#
Is there a free tier?#
Yes! The Free plan includes 50,000 requests/month with no credit card required. It's enough for development and small projects.
What counts as a request?#
Every API call to the proxy counts as one request, regardless of whether it's a cache hit or miss. This includes chat completions, text completions, and embedding requests.
What happens when I exceed my monthly quota?#
- Free plan: Switches to cache-only mode (only serves cached responses, no new upstream calls)
- Paid plans: Overage billing at $0.40-$0.50 per 1,000 additional requests, up to the plan's overage cap
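For example, at the $0.40 rate, going 10,000 requests over quota would add $4.00 (10 × $0.40) to that month's bill, provided the overage stays within your plan's cap.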
Can I cancel anytime?#
Yes. Cancel your subscription at any time from the dashboard. Your plan stays active until the end of the current billing period.
Caching#
How do I know if a response was cached?#
Check the response headers:
- X-WatchLLM-Cache: HIT (exact cache match)
- X-WatchLLM-Cache: HIT-SEMANTIC (semantic similarity match)
- X-WatchLLM-Cache: MISS (new request, forwarded to the provider)
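A quick way to inspect the header is to call the proxy's chat completions endpoint directly with fetch (since WatchLLM is OpenAI-compatible, the path mirrors the OpenAI API):

```ts
const res = await fetch("https://proxy.watchllm.dev/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.WATCHLLM_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: "What is Python?" }],
  }),
});

// Prints HIT, HIT-SEMANTIC, or MISS.
console.log(res.headers.get("X-WatchLLM-Cache"));
```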
Can I control the cache sensitivity?#
Yes. Adjust the similarity threshold in your project settings (default: 95%). Lower values = more aggressive caching, higher values = stricter matching.
Can I clear the cache?#
Yes. You can clear your project's cache from the dashboard under Settings → Cache Management.
How long are responses cached?#
Cache TTL (time-to-live) is configurable per project. Default is 24 hours. Adjust it in Settings → Cache Configuration.
Does caching work across different users?#
Yes, by default. If two different users send semantically similar prompts, the second user gets the cached response. You can scope caching per user by including a user identifier in your requests.
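One way to attach that identifier is the standard OpenAI user field; whether WatchLLM keys the cache on this particular field is an assumption in this sketch, so check your project settings:

```ts
// Attaching a per-user identifier so cached responses are not shared across users.
// Using the OpenAI `user` field for this is an assumption, for illustration only.
const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Summarize my open support tickets." }],
  user: "user_12345", // hypothetical end-user ID
});
```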
Security#
Is my data secure?#
Yes. All data is encrypted in transit (TLS 1.3) and at rest (AES-256). See our Security docs for full details.
Are my API keys safe?#
WatchLLM API keys are stored securely. BYOK provider keys are encrypted with AES-256. Keys are never logged or exposed in response headers.
Where is my data stored?#
- Cache: Cloudflare's global edge network and Upstash Redis
- Database: Supabase (PostgreSQL) with Row Level Security
- Logs: Retained per your plan's data retention policy
Is WatchLLM GDPR compliant?#
Yes. We support data export, account deletion, and data portability. See our Security & Privacy documentation for details.
Technical#
What's the latency overhead?#
- Cache hit: Typically 5-15ms overhead (faster than provider calls)
- Cache miss: Approximately 20-50ms overhead for cache lookup + logging
- WatchLLM runs on Cloudflare's edge network, close to your users
Does it support multiple projects?#
Yes. Create separate projects in the dashboard, each with its own API keys, cache settings, and analytics.
Can I self-host WatchLLM?#
Yes! See our Self-Hosting Guide for detailed deployment instructions. You can run the entire stack on your own infrastructure.
What happens if WatchLLM is down?#
If the proxy experiences issues, requests will fail with a 503 error. We recommend implementing fallback logic in your application to call providers directly in case of proxy unavailability.
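A minimal sketch of that fallback, assuming you keep both a WatchLLM key and a direct provider key available:

```ts
import OpenAI from "openai";

const proxy = new OpenAI({
  baseURL: "https://proxy.watchllm.dev/v1",
  apiKey: process.env.WATCHLLM_API_KEY,
});
const direct = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function chatWithFallback(
  params: OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming,
) {
  try {
    // Prefer the proxy so cache hits still save cost and latency.
    return await proxy.chat.completions.create(params);
  } catch (err) {
    // On a 503 (or any proxy failure), call the provider directly.
    console.warn("WatchLLM proxy unavailable, falling back to the provider:", err);
    return await direct.chat.completions.create(params);
  }
}
```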
Troubleshooting#
My cache hit rate is low#
- Check your similarity threshold — it might be too strict (try 90-92%)
- Ensure prompts are consistent in format
- Review the Troubleshooting guide for optimization tips
I'm getting 429 errors#
You're hitting your rate limit. See Rate Limits for details on limits per plan and how to handle them.
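A minimal retry sketch for 429s; honoring a Retry-After header is an assumption here, so it falls back to exponential backoff when the header is absent:

```ts
// Retry a request on 429, waiting for Retry-After when present, otherwise backing off.
async function fetchWithRetry(url: string, init: RequestInit, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const retryAfter = Number(res.headers.get("Retry-After"));
    const delayMs = retryAfter > 0 ? retryAfter * 1000 : 2 ** attempt * 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```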
My API key isn't working#
- Verify the key starts with lgw_proj_ or lgw_test_
- Check that the key is active in your dashboard
- Ensure you're using the correct baseURL
- See Authentication for troubleshooting
Still Have Questions?#
- Email: kiwi092020@gmail.com
- GitHub: github.com/kaadipranav/WATCHLLM
- Discord: Join our community for real-time help