Developer guide

Implementing Prompt Cache in your LLM stack

PromptCacheAI sits between your application and your AI model provider. The flow is simple: ask PromptCacheAI first, fall back to your AI provider on a cache miss, then save the fresh response.

1. Create an API key

Visit Settings → API Keys after signing in. Save the generated key somewhere secure: you will only see it once.

API keys are tenant scoped. Every request must include the header X-API-Key: YOUR_API_KEY.

2. Ask PromptCacheAI before your AI model

Send the complete prompt (including system/instruction context) to PromptCacheAI. If an exact or similarity match is found, the cached response is returned instantly and you can skip the expensive LLM call. The /chat response includes a unique prompt_hash representing your prompt. Use this same value when saving your model’s output back to PromptCacheAI.

curl https://api.prompt-cache.ai/v1/chat \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "support-bot",
    "provider": "openai",
    "model": "gpt-4o-mini",
    "prompt": "How do I reset my password?"
  }'

Replace YOUR_API_KEY with a key generated at Settings → API Keys.
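As an illustration, a cache hit might look like the following (the exact response shape may include additional fields; cached, response, and prompt_hash are the fields used by the integration example later in this guide, and <PROMPT_HASH> stands in for the real value):

```json
{
  "cached": true,
  "response": "To reset your password, click ...",
  "prompt_hash": "<PROMPT_HASH>"
}
```

On a miss, cached is false and no response is present, but prompt_hash is still returned so you can save your model's output in step 4.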

3. Call your model if there was a miss

When PromptCacheAI responds with cached: false, call your provider exactly as you do today (OpenAI, Anthropic, Vertex, etc.). PromptCacheAI is provider-agnostic—keep your existing retries, streaming, and safety filters.

4. Save the response for future requests

Once you receive the LLM output, store it with the same namespace and prompt_hash returned by the /chat endpoint. This ensures future identical or similar prompts hit the cache.

curl https://api.prompt-cache.ai/v1/cache/save \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt_hash": "<PROMPT_HASH_FROM_CHAT_RESPONSE>",
    "namespace": "support-bot",
    "response": "To reset your password, click ..."
  }'

You don’t need to specify a TTL when saving — entries automatically inherit the TTL configured for their namespace.

Putting it all together

You can integrate PromptCacheAI with just a few lines of code.

import OpenAI from "openai";
import fetch from "node-fetch";

const API_BASE = "https://api.prompt-cache.ai/v1";
const openai = new OpenAI({ apiKey: process.env.OPENAI_KEY! }); // model provider key

export async function answerWithCache(namespace: string, prompt: string) {
  const apiKey = process.env.PROMPTCACHEAI_KEY!; // PromptCacheAI API key

  // 1) Ask PromptCacheAI
  const chatRes = await fetch(`${API_BASE}/chat`, {
    method: "POST",
    headers: {
      "X-API-Key": apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      namespace,
      provider: "openai",
      model: "gpt-4o-mini",
      prompt,
    }),
  }).then((res) => res.json());

  if (chatRes.cached) {
    return { source: "cache", text: chatRes.response };
  }

  // 2) Cache miss → call provider
  const completion = await openai.responses.create({
    model: "gpt-4o-mini",
    input: prompt,
  });
  const text = completion.output_text ?? "";

  // 3) Save new response
  await fetch(`${API_BASE}/cache/save`, {
    method: "POST",
    headers: {
      "X-API-Key": apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      prompt_hash: chatRes.prompt_hash,
      namespace,
      response: text,
    }),
  });

  return { source: "provider", text };
}

Namespaces

A namespace is a completely separate cache. Entries in one namespace are never visible to another; this isolation applies to similarity matching and TTL behavior as well.

Use different namespaces when:

- your environments differ (for example, production vs. development)
- you want model- or provider-specific answers
- separate apps should never share cached responses

This keeps your cache predictable and prevents unintended sharing.

Example: support-bot-prod and support-bot-dev never share cached responses.
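One lightweight way to enforce this separation is to derive the namespace from the app name and environment. The helper below is a hypothetical sketch (buildNamespace is not part of PromptCacheAI; it just produces names in the style above):

```typescript
// Hypothetical helper: derive a PromptCacheAI namespace from app + environment
// so prod and dev traffic can never share cache entries.
function buildNamespace(app: string, env: "prod" | "dev"): string {
  return `${app}-${env}`;
}

// e.g. buildNamespace("support-bot", "prod") === "support-bot-prod"
```

Passing the result as the namespace field in every /chat and /cache/save call keeps the environment split automatic rather than relying on each call site to remember it.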

Similarity matching

If PromptCacheAI doesn’t find an exact match in a namespace, it looks for a semantically similar prompt. If the meaning is close enough, the existing cached answer is reused instead of calling the model again.

Similarity matching is based entirely on the prompt itself: model, provider, and temperature do not affect it. If the wording is close in meaning, you may still get a hit:

Exact:
"What is the capital of France?"  → Exact hit

Similar wording:
"capital of france"               → Similarity hit
"What city is France’s capital?"  → Similarity hit

Different model/provider/temperature:
"What is the capital of France?"  → Similarity hit

Tip: Use separate namespaces if you want model-specific behavior or strict separation between apps (e.g., gpt-4o-prod vs gemini-prod).

TTL strategy

Each namespace has one TTL value applied to every cached entry. Configure it in Settings → Cache TTL.

When a cache entry expires, PromptCacheAI keeps the prompt fingerprint (for fast lookup), but treats it as a miss. Your app simply calls the model again and saves the fresh response (/cache/save).

Observability

The dashboard (/dashboard) surfaces hit rates, savings, and raw entries. Use the filters to verify new namespaces, confirm TTL behavior, and spot high-miss workloads that need prompt normalization.
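High miss rates often come from prompts that differ only superficially (extra whitespace, casing). A minimal normalization sketch, applied before calling /chat — this helper is an assumption for illustration, not part of any PromptCacheAI SDK:

```typescript
// Hypothetical pre-processing step: normalize prompts before sending them to
// /chat so trivially different spellings map to the same cache entry.
function normalizePrompt(prompt: string): string {
  return prompt
    .trim()               // drop leading/trailing whitespace
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .toLowerCase();       // ignore casing differences
}
```

Normalization trades a little prompt fidelity for a higher hit rate; only lowercase if casing never changes the meaning of your prompts.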