WatchLLM Integration Guide#

A beginner's guide to integrating WatchLLM into your application

Get automatic caching, observability, and agent debugging with minimal code changes.


Table of Contents#

  1. Quick Start
  2. Feature 1: Automatic Caching
  3. Feature 2: Observability & Analytics
  4. Feature 3: Agent Debugger
  5. Advanced Features
  6. Troubleshooting

Quick Start#

Step 1: Sign Up & Get API Key#

  1. Go to watchllm.com and create an account
  2. Create a new project
  3. Copy your API key from the dashboard (starts with lgw_proj_...)

Step 2: Install SDK#

Node.js / TypeScript:

npm install @watchllm/sdk-node

Python:

pip install watchllm

Step 3: Update Your Code#

Point your existing OpenAI/Anthropic client at the WatchLLM proxy, as shown in the sketch below. That's it!
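
For example, with the OpenAI Node SDK the only change is the client constructor (a minimal sketch; full before/after examples follow in Feature 1):

import OpenAI from 'openai';

// Point the standard OpenAI SDK at the WatchLLM proxy
const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,     // your WatchLLM project key (lgw_proj_...)
  baseURL: 'https://proxy.watchllm.dev/v1'  // WatchLLM proxy endpoint
});

// The rest of your code stays the same.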


Feature 1: Automatic Caching#

What you get: Automatic response caching (exact match + semantic similarity), streaming cache replay, 50-80% cost reduction.

Node.js / TypeScript#

BEFORE (Direct OpenAI)#

import OpenAI from 'openai';
 
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});
 
async function chatWithAI(userMessage: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: userMessage }
    ],
    temperature: 0.7
  });
  
  return response.choices[0].message.content;
}
 
// Usage
const answer = await chatWithAI("What is the capital of France?");
console.log(answer);

AFTER (With WatchLLM - Caching Enabled)#

import OpenAI from 'openai';
 
const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,  // Changed: Use WatchLLM key
  baseURL: 'https://proxy.watchllm.dev/v1'  // Changed: Point to WatchLLM
});
 
async function chatWithAI(userMessage: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: userMessage }
    ],
    temperature: 0.7
  });
  
  // Check if response was cached
  console.log('Cache status:', response.headers?.['x-cache']); // "HIT" or "MISS"
  
  return response.choices[0].message.content;
}
 
// Usage
const answer = await chatWithAI("What is the capital of France?");
console.log(answer);
 
// Second call with same question = instant cached response!
const cachedAnswer = await chatWithAI("What is the capital of France?");

That's it! Caching is automatic. Identical requests return cached responses instantly.


Python#

BEFORE (Direct OpenAI)#

import os
from openai import OpenAI
 
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
 
def chat_with_ai(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message}
        ],
        temperature=0.7
    )
    
    return response.choices[0].message.content
 
# Usage
answer = chat_with_ai("What is the capital of France?")
print(answer)

AFTER (With WatchLLM - Caching Enabled)#

import os
from openai import OpenAI
 
client = OpenAI(
    api_key=os.environ["WATCHLLM_API_KEY"],  # Changed
    base_url="https://proxy.watchllm.dev/v1"  # Changed
)
 
def chat_with_ai(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message}
        ],
        temperature=0.7
    )
    
    # Check cache status
    print(f"Cache: {response.headers.get('x-cache')}")  # "HIT" or "MISS"
    
    return response.choices[0].message.content
 
# Usage
answer = chat_with_ai("What is the capital of France?")
print(answer)
 
# Cached on second call!
cached_answer = chat_with_ai("What is the capital of France?")

Streaming Responses with Cache#

BEFORE (Streaming without cache)#

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a poem" }],
  stream: true
});
 
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

AFTER (Streaming with WatchLLM - Replay Cache)#

// Just change baseURL - streaming cache works automatically!
const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: 'https://proxy.watchllm.dev/v1'
});
 
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a poem" }],
  stream: true
});
 
// First call: streams from OpenAI, buffers for cache
// Second call: replays from cache with realistic timing!
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

Cache headers in streaming:

  • First response: X-Cache: MISS (real API call)
  • Second response: X-Cache: HIT (replayed from cache)
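
If you want to check these headers programmatically, the OpenAI Node SDK exposes the raw HTTP response via .withResponse(). A minimal sketch, assuming the proxy sets X-Cache on streaming responses as described above:

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: 'https://proxy.watchllm.dev/v1'
});

// .withResponse() returns both the stream and the raw HTTP response,
// so the cache header can be read before consuming any chunks.
const { data: stream, response } = await openai.chat.completions
  .create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Write a poem' }],
    stream: true
  })
  .withResponse();

console.log('Cache status:', response.headers.get('x-cache')); // "MISS" first, "HIT" on replay

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}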

Feature 2: Observability & Analytics#

What you get: Request logs, cost tracking, latency metrics, error monitoring, model usage breakdown.

Node.js / TypeScript#

BEFORE (No observability)#

import OpenAI from 'openai';
 
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});
 
async function processUserRequest(userId: string, question: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: question }]
  });
  
  return response.choices[0].message.content;
}
 
// No visibility into costs, errors, or performance!

AFTER (With WatchLLM - Full Observability)#

import OpenAI from 'openai';
import { WatchLLMClient } from '@watchllm/sdk-node';
 
// Initialize WatchLLM SDK for observability
const watchllm = new WatchLLMClient({
  apiKey: process.env.WATCHLLM_API_KEY,
  projectId: process.env.WATCHLLM_PROJECT_ID
});
 
const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: 'https://proxy.watchllm.dev/v1'
});
 
async function processUserRequest(userId: string, question: string) {
  // Optional: Add custom metadata for better tracking
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: question }],
    user: userId  // Track per-user usage
  });
  
  // Optional: Log custom events
  await watchllm.logEvent({
    event_type: 'llm_request',
    metadata: {
      user_id: userId,
      question_length: question.length,
      success: true
    }
  });
  
  return response.choices[0].message.content;
}
 
// Now view analytics in dashboard:
// - Total requests & costs
// - Average latency
// - Error rates
// - Model usage breakdown
// - Cost per user

Dashboard Access: Go to watchllm.com/dashboard/analytics to see:

  • Request volume charts
  • Cost breakdown by model
  • Performance metrics (P50, P95, P99 latency)
  • Cache hit rates
  • Error tracking

Python#

BEFORE (No observability)#

import os
from openai import OpenAI
 
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
 
def process_user_request(user_id: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    
    return response.choices[0].message.content
 
# No cost tracking, no error monitoring

AFTER (With WatchLLM - Full Observability)#

import os
from openai import OpenAI
from watchllm import WatchLLMClient
 
# Initialize WatchLLM for observability
watchllm = WatchLLMClient(
    api_key=os.environ["WATCHLLM_API_KEY"],
    project_id=os.environ["WATCHLLM_PROJECT_ID"]
)
 
client = OpenAI(
    api_key=os.environ["WATCHLLM_API_KEY"],
    base_url="https://proxy.watchllm.dev/v1"
)
 
def process_user_request(user_id: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        user=user_id  # Track per-user usage
    )
    
    # Optional: Log custom events
    watchllm.log_event({
        "event_type": "llm_request",
        "metadata": {
            "user_id": user_id,
            "question_length": len(question),
            "success": True
        }
    })
    
    return response.choices[0].message.content
 
# All requests automatically tracked in WatchLLM dashboard!

Advanced: LangChain Integration#

BEFORE (LangChain without observability)#

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
 
const model = new ChatOpenAI({
  openAIApiKey: process.env.OPENAI_API_KEY,
  modelName: "gpt-4o"
});
 
const response = await model.invoke([
  new HumanMessage("Explain quantum computing")
]);

AFTER (LangChain with WatchLLM)#

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { WatchLLMClient } from "@watchllm/sdk-node";
 
// Initialize WatchLLM
const watchllm = new WatchLLMClient({
  apiKey: process.env.WATCHLLM_API_KEY,
  projectId: process.env.WATCHLLM_PROJECT_ID
});
 
// Use WatchLLM's LangChain callback handler
const model = new ChatOpenAI({
  openAIApiKey: process.env.WATCHLLM_API_KEY,
  configuration: {
    baseURL: 'https://proxy.watchllm.dev/v1'
  },
  modelName: "gpt-4o",
  callbacks: [watchllm.getLangChainCallbackHandler()]  // Auto-track all LangChain calls
});
 
const response = await model.invoke([
  new HumanMessage("Explain quantum computing")
]);
 
// All LangChain calls now tracked with full context!

Feature 3: Agent Debugger#

What you get: Step-by-step agent execution tracking, cost attribution per step, cache hit visualization, automatic fixture generation.

Node.js / TypeScript#

BEFORE (Agent without debugging)#

import OpenAI from 'openai';
 
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});
 
async function researchAgent(topic: string) {
  const steps = [];
  
  // Step 1: Generate search queries
  const queries = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ 
      role: "user", 
      content: `Generate 3 search queries for: ${topic}` 
    }]
  });
  steps.push({ step: 'query_gen', result: queries.choices[0].message.content });
  
  // Step 2: Summarize results
  const summary = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ 
      role: "user", 
      content: `Summarize research on: ${topic}` 
    }]
  });
  steps.push({ step: 'summary', result: summary.choices[0].message.content });
  
  // No visibility into:
  // - Which step cost how much?
  // - Which steps were cached?
  // - How long each step took?
  // - Where the agent got stuck?
  
  return summary.choices[0].message.content;
}

AFTER (With WatchLLM Agent Debugger)#

import OpenAI from 'openai';
import { WatchLLMClient } from '@watchllm/sdk-node';
 
const watchllm = new WatchLLMClient({
  apiKey: process.env.WATCHLLM_API_KEY,
  projectId: process.env.WATCHLLM_PROJECT_ID
});
 
const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: 'https://proxy.watchllm.dev/v1'
});
 
async function researchAgent(topic: string) {
  // Start tracking agent run
  const run = await watchllm.startAgentRun('research_agent', { topic });
  
  try {
    // Step 1: Generate search queries
    const stepStart1 = Date.now();
    const queries = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ 
        role: "user", 
        content: `Generate 3 search queries for: ${topic}` 
      }]
    });
    
    // Log step with metadata
    await run.logStep({
      name: 'query_generation',
      input: { topic },
      output: { queries: queries.choices[0].message.content },
      metadata: {
        model: 'gpt-4o',
        tokens_used: queries.usage?.total_tokens,
        duration_ms: Date.now() - stepStart1,
        cache_hit: queries.headers?.['x-cache'] === 'HIT'
      }
    });
    
    // Step 2: Summarize results
    const stepStart2 = Date.now();
    const summary = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ 
        role: "user", 
        content: `Summarize research on: ${topic}` 
      }]
    });
    
    await run.logStep({
      name: 'summarization',
      input: { topic },
      output: { summary: summary.choices[0].message.content },
      metadata: {
        model: 'gpt-4o',
        tokens_used: summary.usage?.total_tokens,
        duration_ms: Date.now() - stepStart2,
        cache_hit: summary.headers?.['x-cache'] === 'HIT'
      }
    });
    
    // Mark run as complete
    await run.complete({ 
      status: 'success',
      final_output: summary.choices[0].message.content 
    });
    
    return summary.choices[0].message.content;
    
  } catch (error) {
    // Track errors
    await run.complete({ 
      status: 'error',
      error: error.message 
    });
    throw error;
  }
}
 
// Now debug in dashboard:
// - See cost breakdown per step
// - Identify cache hits/misses
// - Detect loops and wasted calls
// - Replay agent execution timeline

Dashboard View: Go to watchllm.com/dashboard/observability/agent-runs and click on a run to see:

Research Agent Run #abc123
├─ Step 1: query_generation (2.3s, $0.0045, cached)
│  ├─ Input: { topic: "quantum computing" }
│  └─ Output: 3 search queries generated
│
└─ Step 2: summarization (1.8s, $0.0089, cache miss)
   ├─ Input: { topic: "quantum computing" }
   └─ Output: Research summary

Total Cost: $0.0134
Total Time: 4.1s
Cache Savings: $0.0045 (33%)

Python#

BEFORE (Agent without debugging)#

import os
from openai import OpenAI
 
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
 
def research_agent(topic: str) -> str:
    # Step 1: Generate queries
    queries = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Generate 3 search queries for: {topic}"
        }]
    )
    
    # Step 2: Summarize
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Summarize research on: {topic}"
        }]
    )
    
    # No debugging, no cost breakdown
    return summary.choices[0].message.content

AFTER (With WatchLLM Agent Debugger)#

import os
import time
from openai import OpenAI
from watchllm import WatchLLMClient
 
watchllm = WatchLLMClient(
    api_key=os.environ["WATCHLLM_API_KEY"],
    project_id=os.environ["WATCHLLM_PROJECT_ID"]
)
 
client = OpenAI(
    api_key=os.environ["WATCHLLM_API_KEY"],
    base_url="https://proxy.watchllm.dev/v1"
)
 
def research_agent(topic: str) -> str:
    # Start agent run tracking
    run = watchllm.start_agent_run('research_agent', {'topic': topic})
    
    try:
        # Step 1: Generate queries
        step_start = time.time()
        queries = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Generate 3 search queries for: {topic}"
            }]
        )
        
        run.log_step({
            'name': 'query_generation',
            'input': {'topic': topic},
            'output': {'queries': queries.choices[0].message.content},
            'metadata': {
                'model': 'gpt-4o',
                'tokens_used': queries.usage.total_tokens,
                'duration_ms': (time.time() - step_start) * 1000,
                'cache_hit': queries.headers.get('x-cache') == 'HIT'
            }
        })
        
        # Step 2: Summarize
        step_start = time.time()
        summary = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Summarize research on: {topic}"
            }]
        )
        
        run.log_step({
            'name': 'summarization',
            'input': {'topic': topic},
            'output': {'summary': summary.choices[0].message.content},
            'metadata': {
                'model': 'gpt-4o',
                'tokens_used': summary.usage.total_tokens,
                'duration_ms': (time.time() - step_start) * 1000,
                'cache_hit': summary.headers.get('x-cache') == 'HIT'
            }
        })
        
        # Complete the run
        run.complete(
            status='success',
            final_output=summary.choices[0].message.content
        )
        
        return summary.choices[0].message.content
        
    except Exception as e:
        run.complete(status='error', error=str(e))
        raise

AutoGPT / LangGraph Integration#

For complex agents using frameworks like LangGraph or AutoGPT:

import { WatchLLMClient } from '@watchllm/sdk-node';
import { ChatOpenAI } from "@langchain/openai";
 
const watchllm = new WatchLLMClient({
  apiKey: process.env.WATCHLLM_API_KEY,
  projectId: process.env.WATCHLLM_PROJECT_ID
});
 
// LangGraph example
const model = new ChatOpenAI({
  openAIApiKey: process.env.WATCHLLM_API_KEY,
  configuration: {
    baseURL: 'https://proxy.watchllm.dev/v1'
  },
  callbacks: [watchllm.getLangChainCallbackHandler()]  // Auto-track everything
});
 
// Now ALL agent steps are automatically tracked:
// - Tool calls
// - Chain executions
// - Retrieval steps
// - Final outputs

Advanced Features#

1. Cost Kill Switch#

Prevent runaway agent costs by setting a budget:

const run = await watchllm.startAgentRun('expensive_agent', input, {
  maxCost: 5.00  // Stop if cost exceeds $5
});
 
try {
  // Agent runs...
  // If total cost > $5, throws error automatically
} catch (error) {
  if (error.code === 'cost_limit_exceeded') {
    console.log('Agent stopped due to budget limit');
    console.log('Total cost:', error.details.accumulated_cost);
  }
}

2. Semantic Caching#

Enable semantic similarity caching (not just exact matches):

// In dashboard: Settings → Caching
// Set "Semantic Cache Threshold" to 0.95 (95% similarity)
 
// Now similar questions return cached responses:
await chat("What is the capital of France?");  // Cache MISS
await chat("What's France's capital city?");   // Cache HIT (semantic match!)

3. Custom Cache TTL#

Set different cache durations per endpoint:

// In dashboard: Settings → Caching → TTL Settings
// Configure:
// - /v1/chat/completions: 24 hours
// - /v1/embeddings: 7 days
// - /v1/completions: 1 hour

4. Customer Billing (Pass-Through)#

Track costs per end-user for billing:

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [...],
  user: "customer_abc123"  // Customer ID
});
 
// View per-customer costs in dashboard:
// watchllm.com/dashboard/billing/customers

5. BYOK (Bring Your Own Key)#

Use your own OpenAI/Anthropic keys instead of shared pool:

  1. Go to Settings → API Keys
  2. Add your OpenAI API key
  3. Select "Use my own key" for the project
  4. All requests now use YOUR key (still get caching + observability!)

Environment Variables#

Create a .env file:

# WatchLLM Configuration
WATCHLLM_API_KEY=lgw_proj_your_key_here
WATCHLLM_PROJECT_ID=your_project_id
 
# Optional: For BYOK
OPENAI_API_KEY=sk-your_openai_key  # Only if using BYOK
ANTHROPIC_API_KEY=sk-ant-your_key  # Only if using BYOK
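
If you load these with dotenv (an assumption; any environment loader works), client setup looks like this:

import 'dotenv/config';
import OpenAI from 'openai';
import { WatchLLMClient } from '@watchllm/sdk-node';

// OpenAI-compatible client routed through the WatchLLM proxy
const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: 'https://proxy.watchllm.dev/v1'
});

// SDK client for event logging and agent run tracking
const watchllm = new WatchLLMClient({
  apiKey: process.env.WATCHLLM_API_KEY,
  projectId: process.env.WATCHLLM_PROJECT_ID
});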

Troubleshooting#

Issue: Caching not working#

Check:

  1. Are you using identical request parameters? (model, temperature, messages must match exactly)
  2. Is your temperature higher than necessary? Lower temperature means more deterministic requests and a better cache hit rate
  3. Check cache header: console.log(response.headers?.['x-cache'])

Fix:

// Use consistent temperature for better caching
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  temperature: 0.0,  // Deterministic = better cache hits
  messages: [...]
});

Issue: Observability events not showing up#

Check:

  1. Is WATCHLLM_PROJECT_ID set correctly?
  2. Did you initialize the WatchLLM SDK client?
  3. Allow 30-60 seconds for events to appear in dashboard (async ingestion)

Fix:

// Ensure SDK is initialized
const watchllm = new WatchLLMClient({
  apiKey: process.env.WATCHLLM_API_KEY,
  projectId: process.env.WATCHLLM_PROJECT_ID  // Must be set!
});
 
// Optional: Manually flush events
await watchllm.flush();  // Force send pending events

Issue: Agent debugger showing incomplete data#

Check:

  1. Are you calling run.logStep() for each step?
  2. Did you call run.complete() at the end?
  3. Is the agent run ID being passed consistently?

Fix:

const run = await watchllm.startAgentRun('my_agent', input);
 
// Log EVERY step
await run.logStep({...});  // Step 1
await run.logStep({...});  // Step 2
 
// MUST call complete
await run.complete({ status: 'success' });  // Required!

Issue: High latency on first request#

Explanation: The first request must hit the real API and populate the cache. This is expected.

Expected behavior:

  • Request 1: 2000ms (cache MISS, real API call)
  • Request 2: 50ms (cache HIT, instant response)

The cache hit rate improves over time as the cache warms up.
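
You can verify the warm-up yourself by timing two identical requests (a sketch; exact latencies will vary):

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.WATCHLLM_API_KEY,
  baseURL: 'https://proxy.watchllm.dev/v1'
});

async function timedRequest(label: string) {
  const start = Date.now();
  await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,  // deterministic parameters improve cache hits
    messages: [{ role: 'user', content: 'What is the capital of France?' }]
  });
  console.log(`${label}: ${Date.now() - start}ms`);
}

await timedRequest('Request 1');  // cache MISS, real API call
await timedRequest('Request 2');  // cache HIT, near-instant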


Issue: Streaming responses feel slower#

Check: Are you consuming a cached stream? Cached streams replay with a 30ms delay between chunks to preserve a realistic streaming UX.

Fix (if needed):

// Disable realistic replay timing (instant stream)
const stream = await openai.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: "Write a poem" }],
    stream: true
  },
  {
    headers: {
      'X-WatchLLM-Stream-Instant': 'true'  // Instant replay
    }
  }
);

Next Steps#

  1. View Analytics Dashboard: watchllm.com/dashboard/analytics
  2. Monitor Agent Runs: watchllm.com/dashboard/observability/agent-runs
  3. Configure Caching: watchllm.com/dashboard/settings/caching
  4. Join Discord: Get help from the community
  5. Read API Docs: Full API reference at docs.watchllm.com

Summary#

What changed in your code?

  1. Caching: Change baseURL to https://proxy.watchllm.dev/v1 (2 lines)
  2. Observability: Initialize WatchLLM SDK client (3 lines), optional event logging
  3. Agent Debugger: Wrap agent in startAgentRun() and call logStep() per step

What you get:

  • Automatic response caching (exact + semantic)
  • Real-time cost & performance analytics
  • Request/error tracking
  • Agent execution debugging
  • Cost kill switch
  • Streaming cache replay
  • BYOK support
  • Customer billing attribution

Impact:

  • 50-80% cost reduction via caching
  • Full observability into LLM usage
  • Debug complex agents step-by-step
  • Prevent runaway costs

All with minimal code changes!
