Prompt caching API
Implementing a prompt caching API in your LLM stack
PromptCacheAI is a provider-agnostic prompt caching API and LLM cache. Ask PromptCacheAI first, fall back to your model provider on misses, then save the response for exact-match and semantic reuse.
1. Create an API key
Visit Settings → API Keys after signing in. Save the generated key somewhere secure: you will only see it once.
API keys are tenant-scoped. Every request must include the header X-API-Key: YOUR_API_KEY.
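One way to keep the key out of your source code is to read it from an environment variable and build the required headers once. This is a minimal sketch, not part of the official SDK; the PROMPTCACHEAI_KEY variable name is just the convention used in the full example later in this guide.

// Sketch: read the PromptCacheAI key from the environment and build the
// headers that every request to the API must carry.
const PROMPTCACHE_HEADERS = {
  "X-API-Key": process.env.PROMPTCACHEAI_KEY!, // never hard-code the key
  "Content-Type": "application/json",
};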
2. Ask the prompt caching API before your AI model
Send the complete prompt (including system/instruction context) to PromptCacheAI. If an exact or semantic similarity match is found, the cached response is returned instantly and you can skip the expensive LLM call. The /chat response includes a unique prompt_hash representing your prompt. Use this same value when saving your model’s output back to PromptCacheAI.
curl https://api.prompt-cache.ai/v1/chat \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"namespace": "support-bot",
"provider": "openai",
"model": "gpt-4o-mini",
"prompt": "How do I reset my password?"
}'
Replace YOUR_API_KEY with a key generated at Settings → API Keys.
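For reference, the response carries the fields this guide relies on: cached, prompt_hash, and (on a hit) response. The exact payload may contain additional fields; the shape below is illustrative only, showing a hit.

{
  "cached": true,
  "prompt_hash": "<PROMPT_HASH>",
  "response": "To reset your password, click ..."
}

On a miss, cached is false and you proceed to your model provider, keeping prompt_hash for the save step.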
3. Call your model if there was a miss
When PromptCacheAI responds with cached: false, call your provider exactly as you do today (OpenAI, Anthropic, Vertex, etc.). PromptCacheAI is provider-agnostic—keep your existing retries, streaming, and safety filters.
4. Save the response for future requests
Once you receive the LLM output, store it with the same namespace and prompt_hash returned by the /chat endpoint. This ensures future identical or similar prompts hit the cache. This is what turns the flow into an application-layer prompt cache instead of a provider-specific optimization.
curl https://api.prompt-cache.ai/v1/cache/save \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt_hash": "<PROMPT_HASH_FROM_CHAT_RESPONSE>",
"namespace": "support-bot",
"response": "To reset your password, click ..."
}'
You don’t need to specify a TTL when saving: entries automatically inherit the TTL configured for their namespace.
Putting it all together
You can integrate PromptCacheAI with just a few lines of code.
import OpenAI from "openai";
import fetch from "node-fetch";
const API_BASE = "https://api.prompt-cache.ai/v1";
const openai = new OpenAI({ apiKey: process.env.OPENAI_KEY! }); // model provider key
export async function answerWithCache(namespace: string, prompt: string) {
const apiKey = process.env.PROMPTCACHEAI_KEY!; // PromptCacheAI API key
// 1) Ask PromptCacheAI
const chatRes = await fetch(`${API_BASE}/chat`, {
method: "POST",
headers: {
"X-API-Key": apiKey,
"Content-Type": "application/json",
},
body: JSON.stringify({
namespace,
provider: "openai",
model: "gpt-4o-mini",
prompt,
}),
}).then((res) => res.json());
if (chatRes.cached) {
return { source: "cache", text: chatRes.response };
}
// 2) Cache miss → call provider
const completion = await openai.responses.create({
model: "gpt-4o-mini",
input: prompt,
});
// The Responses API exposes the generated text via output_text
const text = completion.output_text ?? "";
// 3) Save new response
await fetch(`${API_BASE}/cache/save`, {
method: "POST",
headers: {
"X-API-Key": apiKey,
"Content-Type": "application/json",
},
body: JSON.stringify({
prompt_hash: chatRes.prompt_hash,
namespace,
response: text,
}),
});
return { source: "provider", text };
}
Namespaces
A namespace is a completely separate cache. Entries in one namespace are never visible to another — including similarity matching and TTL behavior.
Use different namespaces when:
- You want caching behavior specific to a model or provider
- You separate environments (e.g., prod vs dev)
- You operate multiple apps and don’t want them sharing knowledge
This keeps your cache predictable and prevents unintended sharing.
Example: support-bot-prod and support-bot-dev never share cached responses.
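A common pattern is to derive the namespace from the deployment environment so prod and dev traffic never mix. The sketch below reuses the answerWithCache helper from the example above; the NODE_ENV check is an assumption about your setup, not a requirement of the API.

// Sketch: pick a namespace per environment so caches never overlap.
async function demoNamespaces() {
  const env = process.env.NODE_ENV === "production" ? "prod" : "dev";
  const namespace = `support-bot-${env}`; // e.g. "support-bot-prod" or "support-bot-dev"
  const answer = await answerWithCache(namespace, "How do I reset my password?");
  console.log(answer.source, answer.text);
}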
Similarity matching
If PromptCacheAI doesn’t find an exact match in a namespace, it looks for a semantically similar prompt. If the meaning is close enough, the existing cached answer is reused instead of calling the model again.
Similarity matching is based entirely on the prompt itself: model, provider, and temperature do not affect it. If the wording or phrasing is similar, you may still get a hit:
Exact:
"What is the capital of France?" → Exact hit
Similar wording:
"capital of france" → Similarity hit
"What city is France’s capital?" → Similarity hit
Different model/provider/temperature:
"What is the capital of France?" → Similarity hitTip: Use separate namespaces if you want model-specific behavior or strict separation between apps (e.g., gpt-4o-prod vs gemini-prod).
TTL strategy
Each namespace has one TTL value applied to every cached entry. Configure it in Settings → Cache TTL.
When a cache entry expires, PromptCacheAI keeps the prompt fingerprint (for fast lookup), but treats it as a miss. Your app simply calls the model again and saves the fresh response (/cache/save).
Observability
The dashboard (/dashboard) surfaces hit rates, savings, and raw entries. Use the filters to verify new namespaces, confirm TTL behavior, and spot high-miss workloads that need prompt normalization.
The dashboard helps you see which cached prompts are reused most often, so you can revise those responses directly when needed.
Future cache hits return the edited version, and saving an edit refreshes the entry's cache lifetime.
Editing an entry whose response is currently empty seeds it with a reusable cached answer, so future hits can return that saved version.
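If the dashboard shows a high-miss workload, light prompt normalization before calling /chat can raise hit rates for trivially different inputs. What counts as safe normalization depends on your traffic, and it also changes the prompt your provider sees on a miss; the sketch below only trims whitespace and lowercases, which is an assumption about what helps rather than an official recommendation.

// Hypothetical normalizer: collapse whitespace and lowercase so trivially
// different phrasings map to the same cache entry.
function normalizePrompt(prompt: string): string {
  return prompt.trim().replace(/\s+/g, " ").toLowerCase();
}

// Usage: normalize before asking PromptCacheAI, e.g.
// await answerWithCache("support-bot", normalizePrompt("  How do I reset my password? "));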
If you are still comparing architectures, read the "what is prompt caching" guide or the "prompt caching vs semantic caching" comparison.