Rate Limits#
WatchLLM enforces rate limits at multiple layers to ensure fair usage and protect infrastructure.
Plan-Based Rate Limits#
Each plan has specific rate limits and monthly quotas:
| Plan | Price | Requests/Minute | Monthly Quota | Data Retention |
|---|---|---|---|---|
| Free | $0/mo | 10 rpm | 50,000 requests | 7 days |
| Starter | $29/mo | 50 rpm | 250,000 requests | 30 days |
| Pro | $49/mo | 200 rpm | 1,000,000 requests | 90 days |
How Rate Limiting Works#
Per-Key Rate Limiting#
Rate limits are applied per API key using a sliding window algorithm:
- Each API key has a Redis counter: `ratelimit:{keyId}:minute`
- The counter increments with each request
- The window resets every 60 seconds
- When the limit is reached, requests return `429 Too Many Requests`
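
For illustration, here is a minimal sketch of the counter behavior described above, assuming the ioredis client. This is not the proxy's actual implementation; the plan limit is supplied by the caller:

```javascript
import Redis from 'ioredis';

const redis = new Redis(); // assumes a local Redis instance for the sketch

// Minimal sketch of the per-key counter described above.
// `limit` is the key's plan limit, e.g. 50 rpm on Starter.
async function isAllowed(keyId, limit) {
  const key = `ratelimit:${keyId}:minute`;
  const count = await redis.incr(key); // increments on every request
  if (count === 1) {
    await redis.expire(key, 60); // start the 60-second window on the first hit
  }
  return count <= limit; // false -> caller responds with 429
}
```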
IP-Based Rate Limiting#
An additional defense-in-depth layer limits requests per IP address:
- 120 requests/minute per IP
- 30 requests/10 seconds burst limit
- IPs are hashed (SHA-256) for privacy
- After 5 violations in 5 minutes → 5-minute IP block
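
For illustration only, the violation tracking could look roughly like the sketch below (in-memory and single-process; the proxy's real logic is not published in these docs):

```javascript
import { createHash } from 'node:crypto';

const violations = new Map();   // hashed IP -> { count, firstAt }
const blockedUntil = new Map(); // hashed IP -> block expiry (ms)

// IPs are hashed before being stored, so raw addresses never persist
const hashIp = (ip) => createHash('sha256').update(ip).digest('hex');

function recordViolation(ip) {
  const key = hashIp(ip);
  const now = Date.now();
  const entry = violations.get(key);
  if (!entry || now - entry.firstAt > 5 * 60_000) {
    violations.set(key, { count: 1, firstAt: now }); // start a new 5-minute window
    return;
  }
  entry.count += 1;
  if (entry.count >= 5) {
    blockedUntil.set(key, now + 5 * 60_000); // 5 violations -> 5-minute block
    violations.delete(key);
  }
}

const isBlocked = (ip) => (blockedUntil.get(hashIp(ip)) ?? 0) > Date.now();
```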
Rate Limit Response Headers#
Every response includes rate limit information:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 50
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1706054460
```

When rate limited:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 50
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706054460
```

```json
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 60 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
```
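
Clients can use these headers to throttle proactively instead of waiting for a 429. A rough sketch, assuming the OpenAI-compatible `/chat/completions` path on the proxy and treating `X-RateLimit-Reset` as a Unix timestamp in seconds (as the example value suggests):

```javascript
// Proactive throttling sketch: read the headers and pause when headroom runs out.
// The path and API key below are placeholders.
async function throttledRequest(body) {
  const res = await fetch('https://proxy.watchllm.dev/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: 'Bearer lgw_proj_your_key',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  });

  const remaining = Number(res.headers.get('X-RateLimit-Remaining') ?? '1');
  const reset = Number(res.headers.get('X-RateLimit-Reset') ?? '0'); // Unix seconds

  if (remaining === 0) {
    const waitMs = Math.max(0, reset * 1000 - Date.now());
    await new Promise((r) => setTimeout(r, waitMs)); // wait for the window to reset
  }
  return res;
}
```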
Monthly Quota#

In addition to per-minute rate limits, each plan has a monthly request quota:
- Counted requests: All requests that hit the proxy (including cache hits)
- Quota reset: First day of each calendar month at 00:00 UTC
- Overage behavior (a rough cost sketch follows this list):
- Free plan: Requests are blocked after quota is reached
- Starter plan: $0.50 per 1,000 additional requests (up to 200k overage)
- Pro plan: $0.40 per 1,000 additional requests (up to 750k overage)
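
To make the overage pricing concrete, here is a back-of-the-envelope estimate. The helper below is hypothetical (it is not a billing API; actual charges are computed by WatchLLM):

```javascript
// Hypothetical helper for estimating overage charges from the pricing above.
const plans = {
  free:    { quota: 50_000,    perThousand: null, maxOverage: 0 },
  starter: { quota: 250_000,   perThousand: 0.50, maxOverage: 200_000 },
  pro:     { quota: 1_000_000, perThousand: 0.40, maxOverage: 750_000 },
};

function estimateOverageCost(plan, requests) {
  const { quota, perThousand, maxOverage } = plans[plan];
  if (perThousand === null) return 0; // Free plan: requests are blocked, not billed
  const overage = Math.min(Math.max(requests - quota, 0), maxOverage);
  return (overage / 1000) * perThousand;
}

console.log(estimateOverageCost('starter', 300_000)); // 50,000 over quota -> $25
```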
Handling Rate Limits#
Recommended Client-Side Strategy#
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'lgw_proj_your_key',
  baseURL: 'https://proxy.watchllm.dev/v1',
  maxRetries: 3, // Built-in retry with exponential backoff
});

// The SDK automatically retries on 429 responses
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
});
```

Custom Retry Logic#
```javascript
// Wrap any call that may return 429 and retry after the server-suggested delay
async function withRetry(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Only retry on 429, and only while attempts remain
      if (error.status === 429 && i < maxRetries - 1) {
        const retryAfter = error.headers?.['retry-after'] || 60; // seconds
        console.log(`Rate limited. Retrying in ${retryAfter}s...`);
        await new Promise(r => setTimeout(r, retryAfter * 1000));
      } else {
        throw error;
      }
    }
  }
}
```
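
For example, wrapping a completion call (reusing the `client` from the SDK snippet above):

```javascript
// Pass the call as a thunk so it can be re-invoked on each retry
const completion = await withRetry(() =>
  client.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
  })
);
```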
Python Retry Example#

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="lgw_proj_your_key",
    base_url="https://proxy.watchllm.dev/v1",
    max_retries=3,  # Built-in retry
)

# Or handle manually:
def call_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
            )
        except RateLimitError as e:
            if attempt < max_retries - 1:
                wait = int(e.response.headers.get("retry-after", 60))
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
            else:
                raise
```

Best Practices#
- Use exponential backoff — Don't retry immediately after a 429 (see the sketch after this list)
- Cache client-side too — Avoid sending duplicate requests
- Monitor your usage — Check the dashboard analytics for usage trends
- Batch requests — Group multiple prompts when possible
- Upgrade your plan — If you consistently hit limits, consider upgrading
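
On the first point: when the `Retry-After` header is unavailable, or you want to avoid synchronized retries across many clients, a common pattern is exponential backoff with jitter. A minimal sketch:

```javascript
// Exponential backoff with jitter: wait roughly 1s, 2s, 4s, ... (capped), randomized.
async function backoffRetry(fn, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status !== 429 || attempt >= maxRetries - 1) throw error;
      const base = Math.min(1000 * 2 ** attempt, 30_000); // cap the delay at 30 seconds
      const delay = base / 2 + Math.random() * (base / 2); // jitter spreads out retries
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```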
Need Higher Limits?#
If your use case requires higher rate limits or quotas beyond the Pro plan, contact us to discuss Enterprise options with custom limits.