Troubleshooting#

Solutions to common issues when integrating or using WatchLLM.

Authentication#

401 Unauthorized#

This usually means your API key is invalid or missing.

  • Check Key: Ensure the key exists in your project settings and starts with lgw_.
  • Check Header: Verify you're using the Authorization: Bearer <key> format (see the raw request sketch below).
  • Check Project: If a project is archived or deleted, its keys will stop working immediately.
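
If you want to rule out SDK configuration entirely, a raw fetch call makes the header format explicit. A minimal sketch (the key and model name are placeholders):

// Minimal authenticated request against the proxy; a 401 here means the
// key itself is the problem, not your SDK setup.
const response = await fetch("https://proxy.watchllm.dev/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": "Bearer lgw_proj_your_key", // must start with lgw_
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: "ping" }],
  }),
});
console.log(response.status); // 401 -> re-check the key and project status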

BYOK Authentication Failed#

If you're using Bring Your Own Key (BYOK) and getting auth errors:

  • Provider Key Missing: Ensure you've added the API key for the specific provider (OpenAI/Anthropic/Groq) in your project settings.
  • Key Format: Verify the key format matches the provider (OpenAI: sk-, Anthropic: sk-ant-, Groq: gsk_); a prefix-check sketch follows this list.
  • Provider Credits: Check that your provider account has sufficient credits and the key is active.
  • Key Rotation: Try generating a new API key from the provider if the current one seems invalid.
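
A quick prefix check before saving a key can catch copy-paste mistakes early. This helper is purely illustrative, not part of any WatchLLM SDK:

// Illustrative sanity check for the provider key prefixes listed above.
const PROVIDER_PREFIXES = {
  openai: "sk-",
  anthropic: "sk-ant-",
  groq: "gsk_",
};

function looksLikeProviderKey(provider, key) {
  const prefix = PROVIDER_PREFIXES[provider];
  return Boolean(prefix) && key.startsWith(prefix);
}

console.log(looksLikeProviderKey("groq", "gsk_abc123"));   // true
console.log(looksLikeProviderKey("openai", "gsk_abc123")); // false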

Caching Issues#

Unexpected Cache Misses#

If you expect a HIT but get a MISS:

  • Temperature: High temperatures (e.g., 1.0) might result in variations that miss the cache if the threshold is tight.
  • Normalization: WatchLLM normalizes whitespace and casing, but major prompt changes will always trigger a miss.
  • Threshold: You can adjust the SEMANTIC_CACHE_THRESHOLD in your project settings to be more or less permissive. The timing probe below can help confirm whether the cache is engaging at all.
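
One way to check whether the cache is engaging is to send the same prompt twice at temperature 0 and compare latency; a large drop on the second call usually means a HIT. This is a heuristic sketch, not an official diagnostic:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY, // your lgw_ key
  baseURL: "https://proxy.watchllm.dev/v1",
});

// Time an identical, deterministic request.
async function timedRequest(prompt) {
  const start = Date.now();
  await client.chat.completions.create({
    model: "gpt-4o",
    temperature: 0, // deterministic requests cache best
    messages: [{ role: "user", content: prompt }],
  });
  return Date.now() - start;
}

const first = await timedRequest("Summarize the plot of Hamlet.");
const second = await timedRequest("Summarize the plot of Hamlet.");
console.log(`first: ${first}ms, second: ${second}ms`);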

Semantic Cache Not Working as Expected#

For BYOK users experiencing semantic caching issues:

  • Model Compatibility: Semantic caching works best with embedding models. Ensure your BYOK setup includes access to text-embedding-ada-002 or similar.
  • Threshold Tuning: Start with a threshold of 0.8-0.9 for most use cases. Lower values (0.7) catch more similar prompts but may increase false positives. The similarity sketch after this list shows how to measure where your prompts land.
  • Prompt Length: Very short prompts (< 10 words) may not cache well semantically. Consider using exact matching for short prompts.
  • Language Mixing: If mixing languages in prompts, the semantic similarity may be lower than expected.
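
To see where your threshold would land for a given pair of prompts, you can embed both and compute their cosine similarity directly. A sketch assuming BYOK access to the embedding model mentioned above:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: "https://proxy.watchllm.dev/v1",
});

// Plain cosine similarity between two embedding vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const { data } = await client.embeddings.create({
  model: "text-embedding-ada-002",
  input: ["Summarize this article.", "Give me a summary of this article."],
});

// Compare the score against your configured threshold (e.g. 0.8-0.9).
console.log(cosine(data[0].embedding, data[1].embedding));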

Performance#

High Latency#

WatchLLM adds roughly 20-50ms of overhead for cache lookups. If you experience latency well beyond that, work through the checks below; the A/B timing sketch after the list can help isolate where the time goes:

  • Provider Status: Check if the upstream provider (OpenAI/Anthropic) is having issues.
  • Region: The proxy runs at the edge, but database lookups might add latency if the project is far from a Supabase region.
  • BYOK vs OpenRouter: BYOK typically has lower latency than OpenRouter for direct provider access.
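
To separate gateway overhead from provider latency, time the same request through WatchLLM and directly against the provider (BYOK users already have a provider key). A rough A/B sketch; run several iterations before drawing conclusions:

import OpenAI from "openai";

const viaProxy = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: "https://proxy.watchllm.dev/v1",
});
const direct = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Wall-clock time for a single small completion.
async function timeRequest(client) {
  const start = Date.now();
  await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "ping" }],
  });
  return Date.now() - start;
}

console.log(`proxy:  ${await timeRequest(viaProxy)}ms`);
console.log(`direct: ${await timeRequest(direct)}ms`);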

BYOK Performance Issues#

If BYOK requests are slower than expected:

  • Provider Regions: Your provider API key might be configured for a different region than your WatchLLM project.
  • Rate Limits: Check if you've hit provider rate limits. WatchLLM doesn't add rate limiting beyond what providers enforce.
  • Model Selection: Larger models (like GPT-4) have higher latency than lighter, faster ones (GPT-3.5, Claude Instant).

Model Selection#

"Model not found" Error#

  • BYOK Setup: If using BYOK, ensure you've configured the correct provider key and are using native model names (e.g., gpt-4o not openai/gpt-4o).
  • OpenRouter Format: For OpenRouter models, use the full format like openai/gpt-4o or anthropic/claude-3-sonnet.
  • Model Availability: Some models may be in beta or limited availability. Check provider documentation for current model status.

Inconsistent Model Behavior#

  • BYOK vs OpenRouter: Models accessed via BYOK use the provider's native parameters, while OpenRouter may apply different defaults.
  • Temperature Settings: Ensure temperature and other parameters are set consistently across requests for better caching.

Billing & Usage#

Unexpected Charges#

  • BYOK Costs: With BYOK, you pay providers directly. Monitor your provider dashboard for usage.
  • Cache Bypass: Requests that bypass cache (due to high temperature or unique prompts) will always incur provider costs.
  • Rate Limits: WatchLLM doesn't enforce rate limits beyond provider defaults. Monitor your usage to avoid overages.

Usage Analytics Not Showing#

  • Project Settings: Ensure usage analytics is enabled in your project configuration.
  • Time Zones: Usage data is aggregated in UTC. Check your timezone settings if the data seems off.
  • BYOK Tracking: BYOK usage is tracked separately from OpenRouter usage in analytics.

Semantic Cache Optimization#

Improving Cache Hit Rates#

  • Prompt Consistency: Use consistent prompt formatting, casing, and structure across similar requests.
  • Threshold Adjustment: Experiment with different semantic thresholds (0.7-0.95) based on your use case.
  • Prompt Templates: Pre-normalize prompts in your application before sending, so near-identical prompts map to the same cache entry (a helper is sketched after this list).
  • Batch Similar Requests: Group similar prompts together to maximize cache benefits.
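
A pre-normalization step can be as small as trimming and collapsing whitespace. An illustrative helper that runs in your application before the request is sent:

// Collapse whitespace so near-identical prompts map to the same cache entry.
function normalizePrompt(raw) {
  return raw.trim().replace(/\s+/g, " ");
}

const a = normalizePrompt("Summarize:\n\n  The quick brown fox...");
const b = normalizePrompt("Summarize: The quick brown fox...");
console.log(a === b); // true -> both hit the same cache entry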

Cache Invalidation Issues#

  • Manual Clearing: Use the dashboard to clear cache for specific prompts or entire projects when needed.
  • Version Updates: Clear cache when updating prompt templates or model versions to ensure fresh responses.
  • Stale Responses: If cached responses become outdated, temporarily disable caching or use exact matching.

Integration Issues#

"Connection refused" or Network Errors#

If you're getting connection errors when making requests:

  • CORS Issues: Ensure your requests include the proper headers. WatchLLM supports CORS for web applications.
  • Firewall/Proxy: Check if your network or corporate firewall is blocking requests to proxy.watchllm.dev.
  • HTTPS Required: All requests must use HTTPS. HTTP requests will be rejected.
  • Rate Limiting: If you're making too many requests too quickly, you may hit rate limits. Implement exponential backoff, as sketched below.
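
A minimal backoff wrapper for 429s might look like the following; the helper name and retry budget are illustrative:

// Retry with exponential backoff on HTTP 429 (rate limited).
async function withBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = err?.status ?? err?.response?.status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      const delayMs = Math.min(1000 * 2 ** attempt, 30_000); // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage with any request function, e.g. an OpenAI SDK call:
// const reply = await withBackoff(() =>
//   client.chat.completions.create({ model: "gpt-4o", messages: [...] })
// );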

SDK Integration Problems#

  • OpenAI SDK Version: Ensure you're using OpenAI SDK v4+ for best compatibility.
  • Base URL Configuration: Double-check that baseURL is set to https://proxy.watchllm.dev/v1 (note the /v1).
  • Environment Variables: Make sure your API key is loaded correctly in your environment (a complete wiring sketch follows).
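
Putting those three checks together, a correct setup looks like this (the environment variable name is your choice; WATCHLLM_API_KEY is illustrative):

import OpenAI from "openai"; // OpenAI SDK v4+

const client = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY, // your lgw_ key, loaded from the environment
  baseURL: "https://proxy.watchllm.dev/v1", // note the /v1
});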

Data & Analytics#

Analytics Data Not Appearing#

  • Project Selection: Ensure you're viewing analytics for the correct project.
  • Ingestion Delay: Analytics data may take a few minutes to appear after your first requests.
  • BYOK Tracking: BYOK requests are tracked separately from OpenRouter requests.
  • Cache Metrics: Cache hit/miss data is calculated in real-time but may have slight delays.

Usage Limits and Quotas#

  • Free Tier Limits: 50k requests/month, 10 requests/minute.
  • Starter Plan: 250k requests/month, 50 requests/minute.
  • Pro Plan: 1M requests/month, 200 requests/minute.
  • Rate Limit Behavior: Exceeding rate limits returns HTTP 429. Implement proper retry logic (see the backoff sketch under Integration Issues) and consider client-side throttling, as sketched below.
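
Beyond retrying on 429, you can space requests client-side to stay under your plan's per-minute limit. A simple throttle sketch (the numbers come from the plans above):

// Space requests so they stay under a per-minute limit.
function makeThrottle(requestsPerMinute) {
  const minIntervalMs = 60_000 / requestsPerMinute;
  let nextSlot = 0;
  return async function throttle() {
    const now = Date.now();
    const wait = Math.max(0, nextSlot - now);
    nextSlot = Math.max(now, nextSlot) + minIntervalMs;
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  };
}

const throttle = makeThrottle(10); // Free tier: 10 requests/minute
// await throttle(); // call before each request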

Advanced Configuration#

Custom Cache TTL Settings#

  • Default TTL: Completions cache for 1 hour, embeddings for 24 hours.
  • Custom TTL: Can be configured per project in advanced settings.
  • Cache Invalidation: Use the dashboard to manually clear cache when needed.

Environment-Specific Setup#

  • Development: Use test keys and monitor usage carefully.
  • Staging: Test with production-like data volumes.
  • Production: Enable all monitoring and set up proper error handling.

Common BYOK Setup Mistakes#

Provider Key Configuration#

  • Wrong Key Format: Each provider has specific key formats (OpenAI: sk-, Anthropic: sk-ant-, Groq: gsk_).
  • Key Permissions: Ensure your API keys have the necessary permissions for the models you want to use.
  • Account Verification: Some providers require account verification before certain models are accessible.

Model Name Confusion#

BYOK expects native model names, while OpenRouter routing uses the prefixed format:

// ✅ Correct BYOK usage
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "lgw_proj_your_key",
  baseURL: "https://proxy.watchllm.dev/v1"
});
 
// Use native model names
await client.chat.completions.create({
  model: "gpt-4o", // ✅ Direct OpenAI
  messages: [...]
});
 
// ❌ Wrong - mixing OpenRouter format with BYOK
await client.chat.completions.create({
  model: "openai/gpt-4o", // ❌ Don't use this with BYOK
  messages: [...]
});

Performance Optimization#

Maximizing Cache Hit Rates#

  • Prompt Normalization: Use consistent formatting, casing, and structure.
  • Temperature Settings: Lower temperatures (0.0-0.3) cache better than high temperatures.
  • Prompt Length: Longer, more descriptive prompts cache better semantically.
  • Batch Processing: Group similar requests to improve cache efficiency.

Monitoring & Alerting#

  • Response Times: WatchLLM adds ~20-50ms overhead for cache lookups.
  • Error Rates: Monitor for increased error rates that might indicate provider issues.
  • Usage Patterns: Set up alerts for unusual usage spikes or drops.
