Troubleshooting#
Solutions to common issues when integrating or using WatchLLM.
Authentication#
401 Unauthorized#
This usually means your API key is invalid or missing.
- Check Key: Ensure the key exists in your project settings and starts with `lgw_`.
- Check Header: Verify you're using the `Authorization: Bearer <key>` format (see the example after this list).
- Check Project: If a project is archived or deleted, its keys stop working immediately.
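If you're assembling requests by hand rather than through an SDK, a minimal `fetch` call like the one below is a quick way to confirm the header format. It assumes the OpenAI-compatible `/v1/chat/completions` path on the proxy; the key, model, and prompt are placeholders.

```javascript
// Minimal header check against the proxy. The key below is a placeholder;
// use a key from your project settings (it must start with lgw_).
const response = await fetch("https://proxy.watchllm.dev/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": "Bearer lgw_your_project_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: "ping" }],
  }),
});

if (response.status === 401) {
  console.error("Still unauthorized - re-check the key and the project status.");
}
```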
BYOK Authentication Failed#
If you're using Bring Your Own Key (BYOK) and getting auth errors:
- Provider Key Missing: Ensure you've added the API key for the specific provider (OpenAI/Anthropic/Groq) in your project settings.
- Key Format: Verify the key format matches the provider (OpenAI: `sk-`, Anthropic: `sk-ant-`, Groq: `gsk_`); a quick prefix check is sketched after this list.
- Provider Credits: Check that your provider account has sufficient credits and the key is active.
- Key Rotation: Try generating a new API key from the provider if the current one seems invalid.
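As a sanity check before saving a provider key, you can verify that its prefix matches the provider. The helper below is purely illustrative and mirrors the formats listed above; it is not part of any WatchLLM SDK.

```javascript
// Illustrative prefix check for BYOK provider keys (not a WatchLLM API).
const KEY_PREFIXES = {
  openai: "sk-",
  anthropic: "sk-ant-",
  groq: "gsk_",
};

function looksLikeProviderKey(provider, key) {
  const prefix = KEY_PREFIXES[provider];
  return prefix !== undefined && key.startsWith(prefix);
}

console.log(looksLikeProviderKey("openai", "sk-abc123")); // true
console.log(looksLikeProviderKey("groq", "sk-abc123"));   // false: Groq keys start with gsk_
```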
Caching Issues#
Unexpected Cache Misses#
If you expect a HIT but get a MISS:
- Temperature: High temperatures (e.g., `1.0`) might result in variations that miss the cache if the threshold is tight.
- Normalization: WatchLLM normalizes whitespace and casing, but major prompt changes will always trigger a miss.
- Threshold: You can adjust the `SEMANTIC_CACHE_THRESHOLD` in your project settings to be more or less permissive. A cache-friendly request pattern is sketched after this list.
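The sketch below shows a cache-friendly request pattern: a low temperature and a byte-for-byte identical prompt on repeat calls. The `WATCHLLM_API_KEY` environment variable name is illustrative.

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY, // your lgw_ key; variable name is illustrative
  baseURL: "https://proxy.watchllm.dev/v1",
});

// A low temperature and an identical prompt string make repeat calls far more
// likely to land within the semantic cache threshold.
const prompt = "Summarize the following ticket in one sentence: ...";

const first = await client.chat.completions.create({
  model: "gpt-4o",
  temperature: 0,
  messages: [{ role: "user", content: prompt }],
});

// Sending the exact same prompt again will typically be served from cache.
const second = await client.chat.completions.create({
  model: "gpt-4o",
  temperature: 0,
  messages: [{ role: "user", content: prompt }],
});
```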
Semantic Cache Not Working as Expected#
For BYOK users experiencing semantic caching issues:
- Model Compatibility: Semantic caching works best with embedding models. Ensure your BYOK setup includes access to text-embedding-ada-002 or similar.
- Threshold Tuning: Start with a threshold of 0.8-0.9 for most use cases. Lower values (0.7) catch more similar prompts but may increase false positives.
- Prompt Length: Very short prompts (< 10 words) may not cache well semantically. Consider using exact matching for short prompts.
- Language Mixing: If mixing languages in prompts, the semantic similarity may be lower than expected.
Performance#
High Latency#
WatchLLM adds ~20ms of overhead for cache lookups. If you experience higher latency:
- Provider Status: Check if the upstream provider (OpenAI/Anthropic) is having issues.
- Region: The proxy runs at the edge, but database lookups might add latency if the project is far from a Supabase region.
- BYOK vs OpenRouter: BYOK typically has lower latency than OpenRouter for direct provider access.
BYOK Performance Issues#
If BYOK requests are slower than expected:
- Provider Regions: Your provider API key might be configured for a different region than your WatchLLM project.
- Rate Limits: Check if you've hit provider rate limits. WatchLLM doesn't add rate limiting beyond what providers enforce.
- Model Selection: Some models (like GPT-4) have higher latency than faster models (GPT-3.5, Claude Instant).
Model Selection#
"Model not found" Error#
- BYOK Setup: If using BYOK, ensure you've configured the correct provider key and are using native model names (e.g., `gpt-4o`, not `openai/gpt-4o`).
- OpenRouter Format: For OpenRouter models, use the full format, like `openai/gpt-4o` or `anthropic/claude-3-sonnet`.
- Model Availability: Some models may be in beta or limited availability. Check provider documentation for current model status.
Inconsistent Model Behavior#
- BYOK vs OpenRouter: Models accessed via BYOK use the provider's native parameters, while OpenRouter may apply different defaults.
- Temperature Settings: Ensure temperature and other parameters are set consistently across requests for better caching (see the shared-defaults sketch below).
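One way to keep parameters consistent is to define them once and reuse them at every call site. The values below are illustrative.

```javascript
// Shared request defaults keep temperature and other parameters identical
// across call sites, which helps both reproducibility and cache hit rates.
const REQUEST_DEFAULTS = { model: "gpt-4o", temperature: 0.2 };

// Each call site spreads the same defaults and adds only its messages, e.g.:
const params = {
  ...REQUEST_DEFAULTS,
  messages: [{ role: "user", content: "Classify this support ticket: ..." }],
};
// await client.chat.completions.create(params);
```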
Billing & Usage#
Unexpected Charges#
- BYOK Costs: With BYOK, you pay providers directly. Monitor your provider dashboard for usage.
- Cache Bypass: Requests that bypass cache (due to high temperature or unique prompts) will always incur provider costs.
- Rate Limits: WatchLLM doesn't enforce rate limits beyond provider defaults. Monitor your usage to avoid overages.
Usage Analytics Not Showing#
- Project Settings: Ensure usage analytics is enabled in your project configuration.
- Time Zones: Usage data is aggregated by UTC. Check your timezone settings if data seems off.
- BYOK Tracking: BYOK usage is tracked separately from OpenRouter usage in analytics.
Semantic Cache Optimization#
Improving Cache Hit Rates#
- Prompt Consistency: Use consistent prompt formatting, casing, and structure across similar requests.
- Threshold Adjustment: Experiment with different semantic thresholds (0.7-0.95) based on your use case.
- Prompt Templates: Pre-normalize prompts in your application before sending to reduce normalization overhead (a helper is sketched after this list).
- Batch Similar Requests: Group similar prompts together to maximize cache benefits.
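A minimal client-side normalization helper, assuming lowercasing is acceptable for your prompts; WatchLLM already normalizes whitespace and casing, so this simply removes one more source of variation before the request leaves your application.

```javascript
// Collapse whitespace and lowercase before sending so near-identical prompts
// map to the same cache entry. Purely illustrative; adjust to your needs.
function normalizePrompt(prompt) {
  return prompt.trim().replace(/\s+/g, " ").toLowerCase();
}

console.log(normalizePrompt("  Summarize   THIS article\n")); // "summarize this article"
```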
Cache Invalidation Issues#
- Manual Clearing: Use the dashboard to clear cache for specific prompts or entire projects when needed.
- Version Updates: Clear cache when updating prompt templates or model versions to ensure fresh responses.
- Stale Responses: If cached responses become outdated, temporarily disable caching or use exact matching.
Integration Issues#
"Connection refused" or Network Errors#
If you're getting connection errors when making requests:
- CORS Issues: Ensure your requests include the proper headers. WatchLLM supports CORS for web applications.
- Firewall/Proxy: Check if your network or corporate firewall is blocking requests to `proxy.watchllm.dev`.
- HTTPS Required: All requests must use HTTPS. HTTP requests will be rejected.
- Rate Limiting: If you're making too many requests too quickly, you may hit rate limits. Implement exponential backoff (see the retry sketch after this list).
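A hypothetical retry helper with exponential backoff is sketched below; the delays and retry cap are illustrative and not prescribed by WatchLLM.

```javascript
// Retry on HTTP 429 and transient network failures with exponential backoff
// (500ms, 1s, 2s, ...). Tune the delays and cap to your own traffic.
async function fetchWithBackoff(url, init, maxRetries = 4) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url, init);
      if (res.status !== 429 || attempt >= maxRetries) return res;
    } catch (err) {
      if (attempt >= maxRetries) throw err; // network error and out of retries
    }
    await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** attempt));
  }
}
```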
SDK Integration Problems#
- OpenAI SDK Version: Ensure you're using OpenAI SDK v4+ for best compatibility.
- Base URL Configuration: Double-check that `baseURL` is set to `https://proxy.watchllm.dev/v1` (note the `/v1`).
- Environment Variables: Make sure your API key is loaded correctly in your environment (see the setup sketch after this list).
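A minimal setup that covers both points, assuming the OpenAI SDK v4+ and an illustrative `WATCHLLM_API_KEY` environment variable:

```javascript
import OpenAI from "openai";

// Fail fast if the key isn't loaded; the variable name is illustrative.
if (!process.env.WATCHLLM_API_KEY) {
  throw new Error("WATCHLLM_API_KEY is not set - check your environment.");
}

// Note the /v1 suffix on baseURL.
const client = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: "https://proxy.watchllm.dev/v1",
});
```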
Data & Analytics#
Analytics Data Not Appearing#
- Project Selection: Ensure you're viewing analytics for the correct project.
- Time Range: Analytics data may take a few minutes to appear after the first requests.
- BYOK Tracking: BYOK requests are tracked separately from OpenRouter requests.
- Cache Metrics: Cache hit/miss data is calculated in real-time but may have slight delays.
Usage Limits and Quotas#
- Free Tier Limits: 50k requests/month, 10 requests/minute.
- Starter Plan: 250k requests/month, 50 requests/minute.
- Pro Plan: 1M requests/month, 200 requests/minute.
- Rate Limit Behavior: Exceeding rate limits returns HTTP 429. Implement proper retry logic.
Advanced Configuration#
Custom Cache TTL Settings#
- Default TTL: Completions cache for 1 hour, embeddings for 24 hours.
- Custom TTL: Can be configured per project in advanced settings.
- Cache Invalidation: Use the dashboard to manually clear cache when needed.
Environment-Specific Setup#
- Development: Use test keys and monitor usage carefully.
- Staging: Test with production-like data volumes.
- Production: Enable all monitoring and set up proper error handling.
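One way to keep environments separate is a distinct project (and therefore a distinct `lgw_` key) per environment; the variable names below are placeholders.

```javascript
// Illustrative environment-based key selection.
const keyByEnv = {
  development: process.env.WATCHLLM_DEV_KEY,
  staging: process.env.WATCHLLM_STAGING_KEY,
  production: process.env.WATCHLLM_PROD_KEY,
};

const env = process.env.NODE_ENV || "development";
const apiKey = keyByEnv[env];
if (!apiKey) {
  throw new Error(`No WatchLLM key configured for environment "${env}"`);
}
```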
Common BYOK Setup Mistakes#
Provider Key Configuration#
- Wrong Key Format: Each provider has a specific key format (OpenAI: `sk-`, Anthropic: `sk-ant-`, Groq: `gsk_`).
- Key Permissions: Ensure your API keys have the necessary permissions for the models you want to use.
- Account Verification: Some providers require account verification before certain models are accessible.
Model Name Confusion#
```javascript
import OpenAI from "openai";

// ✅ Correct BYOK usage
const client = new OpenAI({
  apiKey: "lgw_proj_your_key",
  baseURL: "https://proxy.watchllm.dev/v1"
});

// Use native model names
await client.chat.completions.create({
  model: "gpt-4o", // ✅ Direct OpenAI
  messages: [...]
});

// ❌ Wrong - mixing OpenRouter format with BYOK
await client.chat.completions.create({
  model: "openai/gpt-4o", // ❌ Don't use this with BYOK
  messages: [...]
});
```
Performance Optimization#
Maximizing Cache Hit Rates#
- Prompt Normalization: Use consistent formatting, casing, and structure.
- Temperature Settings: Lower temperatures (0.0-0.3) cache better than high temperatures.
- Prompt Length: Longer, more descriptive prompts cache better semantically.
- Batch Processing: Group similar requests to improve cache efficiency.
Monitoring & Alerting#
- Response Times: WatchLLM adds ~20-50ms overhead for cache lookups.
- Error Rates: Monitor for increased error rates that might indicate provider issues.
- Usage Patterns: Set up alerts for unusual usage spikes or drops.
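A simple way to start is to time each call and log failures; the 2-second threshold, environment variable name, and logging calls below are illustrative stand-ins for your own metrics pipeline.

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY, // illustrative variable name
  baseURL: "https://proxy.watchllm.dev/v1",
});

// Log slow responses and surface errors so provider issues show up quickly.
async function timedCompletion(prompt) {
  const start = Date.now();
  try {
    const result = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    });
    const elapsed = Date.now() - start;
    if (elapsed > 2000) console.warn(`Slow completion: ${elapsed}ms`);
    return result;
  } catch (err) {
    console.error("Completion request failed:", err);
    throw err;
  }
}
```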