WatchLLM Integration Guide#
A beginner's guide to integrating WatchLLM into your application
Get automatic caching, observability, and agent debugging with minimal code changes.
Table of Contents#
- Quick Start
- Feature 1: Automatic Caching
- Feature 2: Observability & Analytics
- Feature 3: Agent Debugger
- Advanced Features
- Troubleshooting
Quick Start#
Step 1: Sign Up & Get API Key#
- Go to watchllm.com and create an account
- Create a new project
- Copy your API key from the dashboard (starts with lgw_proj_...)
Step 2: Install SDK#
Node.js / TypeScript:
npm install @watchllm/sdk-node
Python:
pip install watchllm
Step 3: Update Your Code#
Replace direct OpenAI/Anthropic API calls with the WatchLLM proxy. That's it!
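In practice the only code change is the client constructor: point baseURL at the WatchLLM proxy and use your WatchLLM key. A minimal sketch with the OpenAI Node SDK (the full before/after examples follow in Feature 1):
import OpenAI from 'openai';
// Before: new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
// After: same client, routed through the WatchLLM proxy
const openai = new OpenAI({
apiKey: process.env.WATCHLLM_API_KEY, // WatchLLM project key (lgw_proj_...)
baseURL: 'https://proxy.watchllm.dev/v1' // WatchLLM proxy endpoint
});
// Existing calls stay exactly the same
const completion = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hello!' }]
});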
Feature 1: Automatic Caching#
What you get: Automatic response caching (exact match + semantic similarity), streaming cache replay, 50-80% cost reduction.
Node.js / TypeScript#
BEFORE (Direct OpenAI)#
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
async function chatWithAI(userMessage: string) {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: userMessage }
],
temperature: 0.7
});
return response.choices[0].message.content;
}
// Usage
const answer = await chatWithAI("What is the capital of France?");
console.log(answer);
AFTER (With WatchLLM - Caching Enabled)#
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.WATCHLLM_API_KEY, // Changed: Use WatchLLM key
baseURL: 'https://proxy.watchllm.dev/v1' // Changed: Point to WatchLLM
});
async function chatWithAI(userMessage: string) {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: userMessage }
],
temperature: 0.7
});
// Check if response was cached
console.log('Cache status:', response.headers?.['x-cache']); // "HIT" or "MISS"
return response.choices[0].message.content;
}
// Usage
const answer = await chatWithAI("What is the capital of France?");
console.log(answer);
// Second call with same question = instant cached response!
const cachedAnswer = await chatWithAI("What is the capital of France?");
That's it! Caching is automatic. Identical requests return cached responses instantly.
Python#
BEFORE (Direct OpenAI)#
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def chat_with_ai(user_message: str) -> str:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": user_message}
],
temperature=0.7
)
return response.choices[0].message.content
# Usage
answer = chat_with_ai("What is the capital of France?")
print(answer)
AFTER (With WatchLLM - Caching Enabled)#
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ["WATCHLLM_API_KEY"], # Changed
base_url="https://proxy.watchllm.dev/v1" # Changed
)
def chat_with_ai(user_message: str) -> str:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": user_message}
],
temperature=0.7
)
# Check cache status
print(f"Cache: {response.headers.get('x-cache')}") # "HIT" or "MISS"
return response.choices[0].message.content
# Usage
answer = chat_with_ai("What is the capital of France?")
print(answer)
# Cached on second call!
cached_answer = chat_with_ai("What is the capital of France?")
Streaming Responses with Cache#
BEFORE (Streaming without cache)#
const stream = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Write a poem" }],
stream: true
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
AFTER (Streaming with WatchLLM - Replay Cache)#
// Just change baseURL - streaming cache works automatically!
const openai = new OpenAI({
apiKey: process.env.WATCHLLM_API_KEY,
baseURL: 'https://proxy.watchllm.dev/v1'
});
const stream = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Write a poem" }],
stream: true
});
// First call: streams from OpenAI, buffers for cache
// Second call: replays from cache with realistic timing!
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
Cache headers in streaming:
- First response: X-Cache: MISS (real API call)
- Second response: X-Cache: HIT (replayed from cache)
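If you want to verify these headers yourself on a streaming call, one option is to grab the raw fetch Response alongside the parsed stream. A sketch, assuming your openai-node version exposes the withResponse() helper and that the proxy forwards X-Cache on streamed responses:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.WATCHLLM_API_KEY,
baseURL: 'https://proxy.watchllm.dev/v1'
});
// withResponse() resolves to the parsed stream plus the raw fetch Response,
// so the cache header can be inspected before consuming the stream
const { data: stream, response } = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Write a poem" }],
stream: true
}).withResponse();
console.log('Cache status:', response.headers.get('x-cache')); // "MISS" first, "HIT" on replay
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}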
Feature 2: Observability & Analytics#
What you get: Request logs, cost tracking, latency metrics, error monitoring, model usage breakdown.
Node.js / TypeScript#
BEFORE (No observability)#
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
async function processUserRequest(userId: string, question: string) {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: question }]
});
return response.choices[0].message.content;
}
// No visibility into costs, errors, or performance!
AFTER (With WatchLLM - Full Observability)#
import OpenAI from 'openai';
import { WatchLLMClient } from '@watchllm/sdk-node';
// Initialize WatchLLM SDK for observability
const watchllm = new WatchLLMClient({
apiKey: process.env.WATCHLLM_API_KEY,
projectId: process.env.WATCHLLM_PROJECT_ID
});
const openai = new OpenAI({
apiKey: process.env.WATCHLLM_API_KEY,
baseURL: 'https://proxy.watchllm.dev/v1'
});
async function processUserRequest(userId: string, question: string) {
// Optional: Add custom metadata for better tracking
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: question }],
user: userId // Track per-user usage
});
// Optional: Log custom events
await watchllm.logEvent({
event_type: 'llm_request',
metadata: {
user_id: userId,
question_length: question.length,
success: true
}
});
return response.choices[0].message.content;
}
// Now view analytics in dashboard:
// - Total requests & costs
// - Average latency
// - Error rates
// - Model usage breakdown
// - Cost per user
Dashboard Access: Go to watchllm.com/dashboard/analytics to see:
- Request volume charts
- Cost breakdown by model
- Performance metrics (P50, P95, P99 latency)
- Cache hit rates
- Error tracking
Python#
BEFORE (No observability)#
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def process_user_request(user_id: str, question: str) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": question}]
)
return response.choices[0].message.content
# No cost tracking, no error monitoring
AFTER (With WatchLLM - Full Observability)#
import os
from openai import OpenAI
from watchllm import WatchLLMClient
# Initialize WatchLLM for observability
watchllm = WatchLLMClient(
api_key=os.environ["WATCHLLM_API_KEY"],
project_id=os.environ["WATCHLLM_PROJECT_ID"]
)
client = OpenAI(
api_key=os.environ["WATCHLLM_API_KEY"],
base_url="https://proxy.watchllm.dev/v1"
)
def process_user_request(user_id: str, question: str) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": question}],
user=user_id # Track per-user usage
)
# Optional: Log custom events
watchllm.log_event({
"event_type": "llm_request",
"metadata": {
"user_id": user_id,
"question_length": len(question),
"success": True
}
})
return response.choices[0].message.content
# All requests automatically tracked in WatchLLM dashboard!
Advanced: LangChain Integration#
BEFORE (LangChain without observability)#
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
const model = new ChatOpenAI({
openAIApiKey: process.env.OPENAI_API_KEY,
modelName: "gpt-4o"
});
const response = await model.invoke([
new HumanMessage("Explain quantum computing")
]);
AFTER (LangChain with WatchLLM)#
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { WatchLLMClient } from "@watchllm/sdk-node";
// Initialize WatchLLM
const watchllm = new WatchLLMClient({
apiKey: process.env.WATCHLLM_API_KEY,
projectId: process.env.WATCHLLM_PROJECT_ID
});
// Use WatchLLM's LangChain callback handler
const model = new ChatOpenAI({
openAIApiKey: process.env.WATCHLLM_API_KEY,
configuration: {
baseURL: 'https://proxy.watchllm.dev/v1'
},
modelName: "gpt-4o",
callbacks: [watchllm.getLangChainCallbackHandler()] // Auto-track all LangChain calls
});
const response = await model.invoke([
new HumanMessage("Explain quantum computing")
]);
// All LangChain calls now tracked with full context!
Feature 3: Agent Debugger#
What you get: Step-by-step agent execution tracking, cost attribution per step, cache hit visualization, automatic fixture generation.
Node.js / TypeScript#
BEFORE (Agent without debugging)#
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
async function researchAgent(topic: string) {
const steps = [];
// Step 1: Generate search queries
const queries = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "user",
content: `Generate 3 search queries for: ${topic}`
}]
});
steps.push({ step: 'query_gen', result: queries.choices[0].message.content });
// Step 2: Summarize results
const summary = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "user",
content: `Summarize research on: ${topic}`
}]
});
steps.push({ step: 'summary', result: summary.choices[0].message.content });
// No visibility into:
// - Which step cost how much?
// - Which steps were cached?
// - How long each step took?
// - Where the agent got stuck?
return summary.choices[0].message.content;
}
AFTER (With WatchLLM Agent Debugger)#
import OpenAI from 'openai';
import { WatchLLMClient } from '@watchllm/sdk-node';
const watchllm = new WatchLLMClient({
apiKey: process.env.WATCHLLM_API_KEY,
projectId: process.env.WATCHLLM_PROJECT_ID
});
const openai = new OpenAI({
apiKey: process.env.WATCHLLM_API_KEY,
baseURL: 'https://proxy.watchllm.dev/v1'
});
async function researchAgent(topic: string) {
// Start tracking agent run
const run = await watchllm.startAgentRun('research_agent', { topic });
try {
// Step 1: Generate search queries
const stepStart1 = Date.now();
const queries = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "user",
content: `Generate 3 search queries for: ${topic}`
}]
});
// Log step with metadata
await run.logStep({
name: 'query_generation',
input: { topic },
output: { queries: queries.choices[0].message.content },
metadata: {
model: 'gpt-4o',
tokens_used: queries.usage?.total_tokens,
duration_ms: Date.now() - stepStart1,
cache_hit: queries.headers?.['x-cache'] === 'HIT'
}
});
// Step 2: Summarize results
const stepStart2 = Date.now();
const summary = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "user",
content: `Summarize research on: ${topic}`
}]
});
await run.logStep({
name: 'summarization',
input: { topic },
output: { summary: summary.choices[0].message.content },
metadata: {
model: 'gpt-4o',
tokens_used: summary.usage?.total_tokens,
duration_ms: Date.now() - stepStart2,
cache_hit: summary.headers?.['x-cache'] === 'HIT'
}
});
// Mark run as complete
await run.complete({
status: 'success',
final_output: summary.choices[0].message.content
});
return summary.choices[0].message.content;
} catch (error) {
// Track errors
await run.complete({
status: 'error',
error: error.message
});
throw error;
}
}
// Now debug in dashboard:
// - See cost breakdown per step
// - Identify cache hits/misses
// - Detect loops and wasted calls
// - Replay agent execution timeline
Dashboard View: Go to watchllm.com/dashboard/observability/agent-runs and click on a run to see:
Research Agent Run #abc123
├─ Step 1: query_generation (2.3s, $0.0045, cached)
│ ├─ Input: { topic: "quantum computing" }
│ └─ Output: 3 search queries generated
│
└─ Step 2: summarization (1.8s, $0.0089, cache miss)
├─ Input: { topic: "quantum computing" }
└─ Output: Research summary
Total Cost: $0.0134
Total Time: 4.1s
Cache Savings: $0.0045 (33%)
Python#
BEFORE (Agent without debugging)#
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def research_agent(topic: str) -> str:
# Step 1: Generate queries
queries = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"Generate 3 search queries for: {topic}"
}]
)
# Step 2: Summarize
summary = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"Summarize research on: {topic}"
}]
)
# No debugging, no cost breakdown
return summary.choices[0].message.content
AFTER (With WatchLLM Agent Debugger)#
import os
import time
from openai import OpenAI
from watchllm import WatchLLMClient
watchllm = WatchLLMClient(
api_key=os.environ["WATCHLLM_API_KEY"],
project_id=os.environ["WATCHLLM_PROJECT_ID"]
)
client = OpenAI(
api_key=os.environ["WATCHLLM_API_KEY"],
base_url="https://proxy.watchllm.dev/v1"
)
def research_agent(topic: str) -> str:
# Start agent run tracking
run = watchllm.start_agent_run('research_agent', {'topic': topic})
try:
# Step 1: Generate queries
step_start = time.time()
queries = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"Generate 3 search queries for: {topic}"
}]
)
run.log_step({
'name': 'query_generation',
'input': {'topic': topic},
'output': {'queries': queries.choices[0].message.content},
'metadata': {
'model': 'gpt-4o',
'tokens_used': queries.usage.total_tokens,
'duration_ms': (time.time() - step_start) * 1000,
'cache_hit': queries.headers.get('x-cache') == 'HIT'
}
})
# Step 2: Summarize
step_start = time.time()
summary = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"Summarize research on: {topic}"
}]
)
run.log_step({
'name': 'summarization',
'input': {'topic': topic},
'output': {'summary': summary.choices[0].message.content},
'metadata': {
'model': 'gpt-4o',
'tokens_used': summary.usage.total_tokens,
'duration_ms': (time.time() - step_start) * 1000,
'cache_hit': summary.headers.get('x-cache') == 'HIT'
}
})
# Complete the run
run.complete(
status='success',
final_output=summary.choices[0].message.content
)
return summary.choices[0].message.content
except Exception as e:
run.complete(status='error', error=str(e))
raise
AutoGPT / LangGraph Integration#
For complex agents using frameworks like LangGraph or AutoGPT:
Automatic Tracking (Recommended)#
import { WatchLLMClient } from '@watchllm/sdk-node';
import { ChatOpenAI } from "@langchain/openai";
const watchllm = new WatchLLMClient({
apiKey: process.env.WATCHLLM_API_KEY,
projectId: process.env.WATCHLLM_PROJECT_ID
});
// LangGraph example
const model = new ChatOpenAI({
openAIApiKey: process.env.WATCHLLM_API_KEY,
configuration: {
baseURL: 'https://proxy.watchllm.dev/v1'
},
callbacks: [watchllm.getLangChainCallbackHandler()] // Auto-track everything
});
// Now ALL agent steps are automatically tracked:
// - Tool calls
// - Chain executions
// - Retrieval steps
// - Final outputs
Advanced Features#
1. Cost Kill Switch#
Prevent runaway agent costs by setting a budget:
const run = await watchllm.startAgentRun('expensive_agent', input, {
maxCost: 5.00 // Stop if cost exceeds $5
});
try {
// Agent runs...
// If total cost > $5, throws error automatically
} catch (error) {
if (error.code === 'cost_limit_exceeded') {
console.log('Agent stopped due to budget limit');
console.log('Total cost:', error.details.accumulated_cost);
}
}
2. Semantic Caching#
Enable semantic similarity caching (not just exact matches):
// In dashboard: Settings → Caching
// Set "Semantic Cache Threshold" to 0.95 (95% similarity)
// Now similar questions return cached responses:
await chat("What is the capital of France?"); // Cache MISS
await chat("What's France's capital city?"); // Cache HIT (semantic match!)3. Custom Cache TTL#
Set different cache durations per endpoint:
// In dashboard: Settings → Caching → TTL Settings
// Configure:
// - /v1/chat/completions: 24 hours
// - /v1/embeddings: 7 days
// - /v1/completions: 1 hour
4. Customer Billing (Pass-Through)#
Track costs per end-user for billing:
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [...],
user: "customer_abc123" // Customer ID
});
// View per-customer costs in dashboard:
// watchllm.com/dashboard/billing/customers
5. BYOK (Bring Your Own Key)#
Use your own OpenAI/Anthropic keys instead of shared pool:
- Go to Settings → API Keys
- Add your OpenAI API key
- Select "Use my own key" for the project
- All requests now use YOUR key (still get caching + observability!)
Environment Variables#
Create a .env file:
# WatchLLM Configuration
WATCHLLM_API_KEY=lgw_proj_your_key_here
WATCHLLM_PROJECT_ID=your_project_id
# Optional: For BYOK
OPENAI_API_KEY=sk-your_openai_key # Only if using BYOK
ANTHROPIC_API_KEY=sk-ant-your_key # Only if using BYOK
Troubleshooting#
Issue: Caching not working#
Check:
- Are you using identical request parameters? (model, temperature, messages must match exactly)
- Is temperature > 0? Lower temperature = better cache hit rate
- Check cache header: console.log(response.headers?.['x-cache'])
Fix:
// Use consistent temperature for better caching
const response = await openai.chat.completions.create({
model: "gpt-4o",
temperature: 0.0, // Deterministic = better cache hits
messages: [...]
});
Issue: Observability events not showing up#
Check:
- Is WATCHLLM_PROJECT_ID set correctly?
- Did you initialize the WatchLLM SDK client?
- Allow 30-60 seconds for events to appear in dashboard (async ingestion)
Fix:
// Ensure SDK is initialized
const watchllm = new WatchLLMClient({
apiKey: process.env.WATCHLLM_API_KEY,
projectId: process.env.WATCHLLM_PROJECT_ID // Must be set!
});
// Optional: Manually flush events
await watchllm.flush(); // Force send pending events
Issue: Agent debugger showing incomplete data#
Check:
- Are you calling run.logStep() for each step?
- Did you call run.complete() at the end?
- Is the agent run ID being passed consistently?
Fix:
const run = await watchllm.startAgentRun('my_agent', input);
// Log EVERY step
await run.logStep({...}); // Step 1
await run.logStep({...}); // Step 2
// MUST call complete
await run.complete({ status: 'success' }); // Required!
Issue: High latency on first request#
Explanation: The first request must hit the real API and populate the cache. This is expected.
Expected behavior:
- Request 1: 2000ms (cache MISS, real API call)
- Request 2: 50ms (cache HIT, instant response)
The cache hit rate improves over time as the cache warms up.
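To see the warm-up effect yourself, a quick timing sketch (assumes the chatWithAI helper and WatchLLM-configured client from Feature 1):
// Time two identical requests; the second should be served from cache
async function measure(label: string) {
const start = Date.now();
await chatWithAI("What is the capital of France?");
console.log(`${label}: ${Date.now() - start}ms`);
}
await measure('Request 1 (expect MISS)'); // e.g. ~2000ms, real API call
await measure('Request 2 (expect HIT)'); // e.g. ~50ms, cached replay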
Issue: Streaming responses feel slower#
Check: Are you using the cached stream? It replays with a 30ms chunk delay for realistic UX.
Fix (if needed):
// Disable realistic replay timing (instant stream)
const stream = await openai.chat.completions.create({
model: "gpt-4o",
messages: [...],
stream: true
}, {
// In the Node SDK, custom headers go in the second (options) argument
headers: {
'X-WatchLLM-Stream-Instant': 'true' // Instant replay
}
});
Next Steps#
- View Analytics Dashboard: watchllm.com/dashboard/analytics
- Monitor Agent Runs: watchllm.com/dashboard/observability/agent-runs
- Configure Caching: watchllm.com/dashboard/settings/caching
- Join Discord: Get help from the community
- Read API Docs: Full API reference at docs.watchllm.com
Summary#
What changed in your code?
- Caching: Change baseURL to https://proxy.watchllm.dev/v1 (2 lines)
- Observability: Initialize the WatchLLM SDK client (3 lines), optional event logging
- Agent Debugger: Wrap the agent in startAgentRun() and call logStep() per step
What you get:
✓ Automatic response caching (exact + semantic)
✓ Real-time cost & performance analytics
✓ Request/error tracking
✓ Agent execution debugging
✓ Cost kill switch
✓ Streaming cache replay
✓ BYOK support
✓ Customer billing attribution
Impact:
- 50-80% cost reduction via caching
- Full observability into LLM usage
- Debug complex agents step-by-step
- Prevent runaway costs
All with minimal code changes!