30% Lower RAG Costs with Ettin Reranker and Vercel AI Gateway

A practical, code-first guide: integrating Ettin Reranker into RAG in Next.js/TS and connecting it with caching and limits in Vercel AI Gateway. Fewer tokens, better relevance—without changing the main model.

Key takeaways

  • Reranking before context submission typically reduces prompt tokens by 25–40%.
  • Vercel AI Gateway adds caching and rate limits in front of the main model, without changing the model itself.
  • A stable prompt (instruction for the model) = higher cache hit-rate and lower costs.
  • A/B: compare top-8 without reranking vs. top-30 → Ettin top-4; measure tokens and quality.
  • AEO/GEO: concise, unambiguous KB + JSON-LD = shorter and cheaper responses.

RAG (Retrieval-Augmented Generation) is a pattern in which an LLM answers from fragments retrieved from your knowledge base rather than solely from memory. It works, but it can be costly, mainly because of the tokens fed into the main model. Below, I’ll show how to add Ettin Reranker (a recent model available on Hugging Face) plus caching and limits from Vercel AI Gateway in Next.js/TS to cut token usage by ~30% without changing the main model. Short, to the point, with code.

Architecture: Where to Integrate Ettin and Where We Save Costs

The main cost of RAG comes from long contexts. Retrieval returns candidates at the fragment level. The reranker is a cross-encoder: it takes (question, fragment) pairs and computes a relevance score for each fragment. We then sort the fragments by score in descending order, truncate to the top-N (e.g., 3–5), and optionally deduplicate by source. From these short, most relevant fragments we assemble the context, with separators between them. The result: 25–40% fewer prompt tokens and less ambiguous answers.
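The selection step described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions: the ScoredFragment shape, the sourceId field, and the assembleContext name are introduced for the example and are not part of any library.

```typescript
// Hypothetical shape of a reranked fragment; field names are assumptions.
interface ScoredFragment {
  text: string;
  score: number;    // relevance score from the reranker (higher = more relevant)
  sourceId: string; // e.g. document URL or ID, used for deduplication
}

interface ContextOptions {
  topN: number;      // how many fragments to keep
  maxChars: number;  // per-fragment character cap
  separator: string; // delimiter placed between fragments
}

function assembleContext(
  fragments: ScoredFragment[],
  { topN, maxChars, separator }: ContextOptions,
): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...fragments].sort((a, b) => b.score - a.score);
  for (const f of sorted) {
    if (kept.length >= topN) break;
    if (seen.has(f.sourceId)) continue; // keep only the best fragment per source
    seen.add(f.sourceId);
    kept.push(f.text.trim().slice(0, maxChars));
  }
  return kept.join(separator);
}

const context = assembleContext(
  [
    { text: 'Reset via Settings.', score: 0.91, sourceId: 'kb/password' },
    { text: 'Passwords expire after 90 days.', score: 0.42, sourceId: 'kb/policy' },
    { text: 'Older reset instructions.', score: 0.88, sourceId: 'kb/password' },
  ],
  { topN: 2, maxChars: 1000, separator: '\n---\n' },
);
```

In this example the second kb/password fragment is dropped as a duplicate source, so the context contains the password fragment followed by the policy fragment.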

Step-by-step architecture (Next.js/TypeScript):

  • 1) User submits a question.
  • 2) Retrieval: vector-based (e.g., pgvector/Pinecone) + optionally BM25 → top-20..50 candidates.
  • 3) Ettin Reranker evaluates pairs (question, document) → returns a sorted list.
  • 4) Assemble a short context: e.g., top-4 fragments, each ≤ 1000 characters.
  • 5) Build a stable prompt (instruction for the model) with a clear style and format.
  • 6) Call the main model through Vercel AI Gateway (cache + limits).
  • 7) Log usage and response, optionally save to analytics.
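Step 2 leaves open how vector and BM25 results are combined into one candidate list. One common option, which the article does not prescribe, is reciprocal rank fusion (RRF). The sketch below assumes each retriever returns document IDs in ranked order; fuseRankings, the constant k = 60, and the sample IDs are illustrative choices.

```typescript
// Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank),
// where rank starts at 1 within each list.
function fuseRankings(rankings: string[][], k = 60, limit = 30): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .slice(0, limit)
    .map(([id]) => id);
}

// Hypothetical ranked IDs from two retrievers.
const vectorHits = ['doc-a', 'doc-b', 'doc-c'];
const bm25Hits = ['doc-b', 'doc-d', 'doc-a'];
const candidates = fuseRankings([vectorHits, bm25Hits]);
```

Documents found by both retrievers (doc-a, doc-b) rise above those found by only one, and the fused list is what would be handed to the reranker in step 3.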

Integrating Ettin Reranker in Next.js/TypeScript

The reranker is a cross-encoder model: it takes a (query, document) pair and returns a relevance score. The simplest way to invoke it is through Hugging Face Inference (or your own endpoint). Keep the token server-side: store it as an environment variable (Vercel → Project Settings → Environment Variables, or .env.local) and call the Inference API only from server-side code (a Route Handler in app/api/.../route.ts or a Server Action). Do not place the token in client components or pass it to the browser. Optionally, set export const runtime = 'nodejs', limit the token's scope, and/or enable an IP allowlist on the HF side.
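A small guard can make the "server-side only" rule harder to break by accident. The helper below is a sketch; requireServerSecret is a hypothetical name, and the environment is passed in explicitly (in a Route Handler you would pass process.env) so the function stays easy to test.

```typescript
function requireServerSecret(
  env: Record<string, string | undefined>,
  name: string,
): string {
  // Next.js inlines NEXT_PUBLIC_* variables into the client bundle,
  // so a secret must never use that prefix.
  if (name.startsWith('NEXT_PUBLIC_')) {
    throw new Error(`${name} would be exposed to the browser; rename it`);
  }
  const value = env[name];
  if (!value) {
    throw new Error(`Missing environment variable ${name}`);
  }
  return value;
}

// Usage with a stand-in environment object:
const token = requireServerSecret({ HF_API_TOKEN: 'hf_example' }, 'HF_API_TOKEN');
```

Failing fast on a missing or mis-prefixed variable surfaces configuration mistakes at the first request instead of as an opaque 401 from the Inference API.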

Skeleton in app/api/ask/route.ts (pseudo-TS, simplified):

    // Next.js server-only (Route Handler): the token does not reach the browser.
    export const runtime = 'nodejs';

    // retrieveCandidates is your own retrieval helper (example interface), not shown here.

    // 2) Reranking with Ettin (HF Inference API)
    async function rerankWithEttin(query: string, docs: string[], topN = 4) {
      const res = await fetch('https://api-inference.huggingface.co/models/<ettin-model-id>', {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${process.env.HF_API_TOKEN}`,
          'Content-Type': 'application/json',
        },
        // Assumed payload shape; check the schema your endpoint expects.
        body: JSON.stringify({ query, documents: docs, top_n: topN }),
      });
      if (!res.ok) throw new Error(`Reranker request failed: ${res.status}`);
      // Expect a list of { index, score }. Check the schema.
      const ranked: { index: number; score: number }[] = await res.json();
      return ranked.map((r) => ({ text: docs[r.index], score: r.score }));
    }

    export async function POST(req: Request) {
      const { query } = await req.json();
      // 1) Candidates from vectors (example interface)
      const candidates = await retrieveCandidates({ query, topK: 30 });
      // 3) Usage
      const ranked = await rerankWithEttin(query, candidates.map((c) => c.text), 4);
      const context = ranked.map((r) => r.text).join('\n---\n');
      // 4) Hand `context` to the LLM layer (see next section);
      //    it is returned here only so the skeleton is complete.
      return Response.json({ context });
    }
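Because the skeleton only assumes that the endpoint returns a list of { index, score }, it is worth validating the response before indexing into the documents. The following is a sketch under that same assumption; parseRerankResponse and RankedDoc are hypothetical names, and the expected shape should be adjusted to whatever your endpoint actually documents.

```typescript
interface RankedDoc {
  text: string;
  score: number;
}

// Assumes an array of { index, score } objects; malformed entries are skipped.
function parseRerankResponse(raw: unknown, docs: string[], topN: number): RankedDoc[] {
  if (!Array.isArray(raw)) {
    throw new Error('Reranker response is not an array');
  }
  const ranked: RankedDoc[] = [];
  for (const item of raw) {
    const { index, score } = (item ?? {}) as { index?: unknown; score?: unknown };
    // Reject non-integer or out-of-range indices.
    if (typeof index !== 'number' || !Number.isInteger(index) || index < 0 || index >= docs.length) continue;
    // Reject missing, NaN, or infinite scores.
    if (typeof score !== 'number' || !Number.isFinite(score)) continue;
    ranked.push({ text: docs[index], score });
  }
  // Sort defensively in case the endpoint does not return results in order.
  return ranked.sort((a, b) => b.score - a.score).slice(0, topN);
}

const parsed = parseRerankResponse(
  [{ index: 1, score: 0.2 }, { index: 0, score: 0.9 }, { index: 7, score: 1 }, null],
  ['alpha', 'beta'],
  4,
);
```

Here the out-of-range entry (index 7) and the null entry are ignored, and the two valid results come back ordered by score.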

Vercel AI Gateway: Caching + Limits Without Hassle

The Gateway sits between the application and the LLM provider. We gain response caching, tokens/min limits, and budget metrics. Most importantly, keep the prompt (the instruction for the model) stable, with the same system message, format, and order of fields, so that the cache hits more often.
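One way to keep the prompt stable is to build it in a single place and normalize the user's question first. The sketch below assumes that cache hits depend on the request being textually identical, which is worth verifying against your Gateway configuration; buildPrompt and normalize are hypothetical helper names.

```typescript
const SYSTEM_PROMPT = 'Respond briefly and unambiguously. If data is missing, say so.';

// Collapse whitespace so trivially different inputs map to the same prompt.
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, ' ');
}

// Fixed field order and fixed wording: only the context and question vary.
function buildPrompt(context: string, query: string): { system: string; prompt: string } {
  return {
    system: SYSTEM_PROMPT,
    prompt: `Context:\n${context.trim()}\n\nQuestion: ${normalize(query)}\n\nAnswer in 5-7 sentences.`,
  };
}

const a = buildPrompt('KB fragment', '  How do I   reset my password? ');
const b = buildPrompt('KB fragment', 'How do I reset my password?');
```

Both calls produce the same prompt string, so two users asking the same question with different spacing would send an identical request.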

Configuration (high-level): in the Vercel dashboard → AI Gateway → add provider (e.g., OpenAI), enable caching (TTL 1–6 hours for questions to a static KB), set RPM/TPM limits and budget in USD.

Calling from AI SDK through the Gateway (Next.js/TS):

    import { generateText } from 'ai';
    import { createOpenAI } from '@ai-sdk/openai';

    // createOpenAI builds a provider instance with a custom key and base URL.
    const llm = createOpenAI({
      apiKey: process.env.VERCEL_AI_KEY,
      baseURL: process.env.VERCEL_AI_URL,
    });

    const res = await generateText({
      model: llm('gpt-4o-mini'), // provider and model mapped through the Gateway
      temperature: 0.2,
      maxTokens: 400, // renamed to maxOutputTokens in AI SDK 5
      system: 'Respond briefly and unambiguously. If data is missing, say so.',
      prompt: `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer in 5-7 sentences.`,
    });

    // res.usage.promptTokens, res.usage.completionTokens → log to metrics
    // (inputTokens / outputTokens in AI SDK 5)
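Logged token counts are easier to act on once they are converted to money. The sketch below shows the arithmetic only; the Pricing shape and the prices themselves are placeholders for illustration, so substitute your provider's current rates.

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
}

interface Pricing {
  inputPerMillionUsd: number;  // price per 1M prompt tokens
  outputPerMillionUsd: number; // price per 1M completion tokens
}

function estimateCostUsd(usage: Usage, pricing: Pricing): number {
  return (
    (usage.promptTokens * pricing.inputPerMillionUsd +
      usage.completionTokens * pricing.outputPerMillionUsd) /
    1_000_000
  );
}

// Placeholder prices, not real rates.
const examplePricing: Pricing = { inputPerMillionUsd: 1, outputPerMillionUsd: 2 };
const cost = estimateCostUsd({ promptTokens: 2000, completionTokens: 500 }, examplePricing);
```

With these placeholder prices, 2000 prompt tokens and 500 completion tokens come to 0.003 USD; since prompt tokens usually dominate in RAG, trimming the context is where most of the saving shows up.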

A/B Testing: Confirm Cost Reduction and No Quality Regression

Run a small test harness. Dataset: 50–100 real questions plus short reference answers (1–3 sentences). Two branches:

  • A) Baseline: retrieval top-8 without reranking.
  • B) With Ettin: retrieval top-30 → rerank top-4.

Collect usage.promptTokens, usage.completionTokens, total cost, and latency. Quality: a simple contains/does-not-contain check against the reference, or scoring by a cheaper judge model.
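The contains/does-not-contain check can be as simple as the function below. It is a sketch that assumes each reference answer has been reduced to a few key phrases; containsJudge and normalizeForMatch are hypothetical names, and the normalization is ASCII-only, so it would need adjusting for non-English knowledge bases.

```typescript
// Lowercase, replace punctuation with spaces, and collapse whitespace.
function normalizeForMatch(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, ' ').replace(/\s+/g, ' ').trim();
}

// Pass only if every key phrase from the reference appears in the answer.
function containsJudge(answer: string, keyPhrases: string[]): boolean {
  const haystack = normalizeForMatch(answer);
  return keyPhrases.every((phrase) => haystack.includes(normalizeForMatch(phrase)));
}

const pass = containsJudge(
  'Go to Settings, then choose "Change password".',
  ['settings', 'change password'],
);
const fail = containsJudge('I do not know.', ['settings']);
```

This kind of check is cheap and deterministic, which makes it a reasonable first filter before spending tokens on a judge model.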

Simplified runner sketch (TS, minimal):

    // dataset, buildContext, askLLM, judge, and report are your own helpers.
    const variants = {
      baseline: { topK: 8, rerank: false },
      ettin: { topK: 30, rerank: true, topN: 4 },
    };

    for (const [name, v] of Object.entries(variants)) {
      const stats = { q: 0, tokens: 0, pass: 0 };
      for (const q of dataset) {
        const ctx = await buildContext(q, v); // with or without rerank
        const out = await askLLM(ctx, q);
        stats.tokens += out.usage.totalTokens;
        stats.pass += judge(out.text, q.reference) ? 1 : 0;
        stats.q++;
      }
      report(name, stats);
    }

Expected: ~25–35% fewer tokens with a comparable or better pass rate. If the reduction is below 20%, increase the number of candidates for reranking or shorten the individual fragments.

How to do it in practice (steps, no prior experience with experiments needed):

  • 1) Offline: disable caching in the Gateway (or set different cache keys per variant), fix the system prompt and seed; run both variants on the same dataset in the same order.
  • 2) Production: split 50/50 on a stable hash (e.g., hash(userId + query) % 2). Log: variant, prompt/completion/total tokens, latency P50/P95, cost, code version, documents in context.
  • 3) Quality: pairwise comparison (A vs B) with a judge model (e.g., gpt-4o-mini) and a rubric ‘Is the answer correct, complete, and aligned with the KB (yes/no)?’. Additionally, count binary ‘pass@1’.
  • 4) Sample size: min
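The stable 50/50 split and the P50/P95 latency figures from step 2 can be sketched as follows. The hash is an FNV-1a-style function over UTF-16 code units, chosen here only because it is short and deterministic; any stable hash would do. The function names and the nearest-rank percentile method are assumptions for the example.

```typescript
type Variant = 'baseline' | 'ettin';

// 32-bit FNV-1a-style hash: same input always yields the same bucket,
// across processes and deployments.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

function assignVariant(userId: string, query: string): Variant {
  return fnv1a(`${userId}:${query}`) % 2 === 0 ? 'baseline' : 'ettin';
}

// Nearest-rank percentile over a list of samples (e.g. latencies in ms).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latenciesMs = [10, 20, 30, 40];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
```

Hashing on userId plus query keeps a given user's repeated question in the same branch, which avoids mixing cached and uncached answers for the same request.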

AEO/GEO: Shorter Answers Through Better KB and JSON-LD

AEO (Answer Engine Optimization) is a set of practices for preparing content so that ‘answer engines’ (LLMs, Google AI Overviews, Perplexity) can easily extract concise, unambiguous answers. GEO (used here as Graph/Entity Optimization; the acronym is more commonly expanded as Generative Engine Optimization) focuses on clearly marking entities, relationships, and sources in content and metadata (e.g., schema.org) so models understand who/what/where/when. Both approaches reduce ambiguity, which lets the model generate shorter and cheaper answers. Guidelines for the KB:

  • Chunks of 400–700 words, each with a unique H3 and an ISO date.
  • An FAQ section with each article (question phrased the way a user would ask it, answer in 1–3 sentences).
  • A glossary of terms (unambiguous definitions).
  • Canonical links and consistent, repeatable header formats.
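The word-count guideline can be enforced with a simple paragraph-based chunker. This is a sketch, not a full ingestion pipeline: chunkByWords is a hypothetical helper, it only enforces the upper bound, and it leaves headings and dates to be attached by whatever produces the paragraphs.

```typescript
// Groups paragraphs into chunks of at most `maxWords` words.
// A single paragraph longer than the cap is kept whole as its own chunk.
function chunkByWords(paragraphs: string[], maxWords = 700): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let count = 0;
  for (const p of paragraphs) {
    const words = p.trim().split(/\s+/).filter(Boolean).length;
    if (words === 0) continue; // skip empty paragraphs
    // Start a new chunk when adding this paragraph would exceed the cap.
    if (count > 0 && count + words > maxWords) {
      chunks.push(current.join('\n\n'));
      current = [];
      count = 0;
    }
    current.push(p.trim());
    count += words;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}

// Tiny cap so the behavior is visible in a short example.
const chunks = chunkByWords(['a b c', 'd e', 'f g h i'], 5);
```

Splitting on paragraph boundaries rather than at a fixed word offset keeps each fragment readable on its own, which matters once the reranker is scoring fragments individually.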

Minimal JSON-LD FAQ (insert on KB pages):

    {
      "@context": "https://schema.org",
      "@type": "FAQPage",
      "mainEntity": [
        {
          "@type": "Question",
          "name": "How to reset the password?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "Go to Settings → Change Password→
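Since the snippet above is cut off in the source, here is a separate sketch that generates a complete FAQPage document from question/answer pairs. The schema.org types are standard; buildFaqJsonLd, the FaqItem shape, and the placeholder answer text are assumptions for the example.

```typescript
interface FaqItem {
  question: string;
  answer: string;
}

function buildFaqJsonLd(items: FaqItem[]): string {
  const doc = {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: items.map((item) => ({
      '@type': 'Question',
      name: item.question,
      acceptedAnswer: { '@type': 'Answer', text: item.answer },
    })),
  };
  // Escape "<" so the JSON can sit safely inside a <script> tag.
  return JSON.stringify(doc).replace(/</g, '\\u003c');
}

const jsonLd = buildFaqJsonLd([
  // Placeholder answer text; use the real answer from your KB.
  { question: 'How to reset the password?', answer: 'Placeholder answer text for the example.' },
]);
```

The resulting string would go inside a script element of type application/ld+json on the KB page; generating it from the same source as the visible FAQ keeps the two from drifting apart.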

Combining Ettin Reranker and Vercel AI Gateway is a quick win: fewer tokens, more relevant context, and cost control without changing the main model. If you’d like, I can assist with a 2–3 hour review and POC on your knowledge base—with A/B testing and a rollout plan.

Frequently asked questions

Do I need to change the main model to use Ettin Reranker?

No. The reranker runs before the main model and only selects the context. Simply add a reranking step and keep calling the existing model through Vercel AI Gateway.

Will the reranker slow down responses?

The reranker adds a short, additional step for evaluating fragments, so latency may increase slightly. In practice, you usually recover this due to the shorter context (fewer tokens to process) and cache hits in the Gateway. Start with the scheme: fetch 20–30 candidates, select the top-4 short fragments. If latency is still an issue, reduce the number of candidates or host the reranker closer to the application.

What top-k and top-n values should I start with?

Practically: retrieval top-30, Ettin top-4. If the domain is broad, increase top-30 to top-50. Always validate A/B on your own data.

Is caching in Vercel AI Gateway safe for dynamic data?

Use caching for stable content (KB, regulations). For dynamic data, set a short TTL or disable caching. Remember to anonymize prompts with PII.
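A basic redaction pass before the prompt is sent (and therefore before it can be cached or logged) might look like the sketch below. It is deliberately simple: the two regular expressions only catch common email and phone formats, redactPii is a hypothetical helper, and real PII handling usually needs a dedicated tool and review.

```typescript
function redactPii(text: string): string {
  return text
    // Email addresses such as name@example.com
    .replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, '[email]')
    // Phone-like runs of at least 9 digits with optional spaces, dots, dashes, parentheses
    .replace(/\+?\d[\d\s().-]{7,}\d/g, '[phone]');
}

const redacted = redactPii('Contact jan@example.com or +48 123 456 789.');
```

Replacing values with fixed placeholders also helps cache hit rates, because two prompts that differ only in a user's email address become identical.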

Can I host Ettin on-prem?

Yes, if the model's license permits self-hosting. In that case you call your own endpoint instead of the Inference API. The Gateway still handles only the main model.
