raj
Full Stack Developer

3 Rewrites to a RAG Pipeline My Client Liked — And Better AI Acceptance

How 3 rewrites of a RAG pipeline — from LLM summarization to Voyage Context 3 — unified 3 content sources, cut costs 450x, and improved AI draft acceptance.

Tags: Artificial Intelligence, RAG, LLMs, AI Agents

Published on: Wednesday, Mar 25, 2026


Table of contents
The Setup: Three Knowledge Sources
Version 1: Embed Individual Turns
Version 2: Summarize Then Embed
The Insight: Voyage Context 3 Already Does This
Version 3: One Model, One Pipeline, One Table
Hybrid Search: Vector + BM25 + RRF + Rerank
The Numbers
Tradeoffs
What I Took Away

Spent the last few months reworking the retrieval pipeline for a customer support platform I work on. Three complete rewrites in five days before landing somewhere both I and my client were happy with.

The platform has an AI assistant that drafts responses for every incoming customer message. When I started, agents accepted only 0.2% of those drafts without edits. After three rewrites of the retrieval pipeline, that number climbed to 10.9%, and to 35% for drafts that agents triggered manually.

Here’s what each version looked like, what broke, and why the final rewrite worked.

The Setup: Three Knowledge Sources

The platform’s AI has three knowledge sources:

  1. Uploaded documents — product manuals, pricing sheets, policies
  2. Scraped website pages — the client’s site, crawled and indexed
  3. Resolved support interactions — real conversations between agents and customers

Documents and web pages were straightforward. Chunk the text, embed with voyage-context-3, store in pgvector, hybrid search at query time. Standard stuff.

Support interactions were a different story.

Version 1: Embed Individual Turns

The first approach was simple: take each agent response, embed it, store it in a conversation_turn_embeddings table. At query time, search across both the KB articles and these individual turns.

Problems showed up fast:

  • Low coverage — only 7.7% of conversations had enough messages to be useful
  • Fragmented context — a single agent reply like "€120 extra" means nothing without knowing the customer asked about pricing
  • Corrections lost — when agents corrected themselves or added follow-up details without a customer prompt, those never got captured
  • AI pollution risk — automated and AI responses could leak into the knowledge base, creating a feedback loop

Individual turns lacked the conversation-level context that made them useful in the first place.

Version 2: Summarize Then Embed

The second version tried to solve the context problem by adding an LLM preprocessing step:

  1. Take resolved conversations
  2. Slide a 50-message window (10-message overlap) across the messages
  3. Send each window to Claude Haiku for summarization
  4. Embed the summaries with voyage-4-large
  5. Store in a separate conversation_chunks table
  6. Search with a separate function and a separate query embedding
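The windowing in step 2 can be sketched as follows. This is a minimal illustration rather than the platform's code: the function name `sliding_windows` is hypothetical, and it assumes "10-message overlap" means each window repeats the last 10 messages of the previous one.

```python
from typing import List, Sequence


def sliding_windows(
    messages: Sequence[str], size: int = 50, overlap: int = 10
) -> List[List[str]]:
    """Split messages into windows of `size` that share `overlap` messages."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # 40 new messages per window with the defaults
    windows: List[List[str]] = []
    start = 0
    while start < len(messages):
        windows.append(list(messages[start:start + size]))
        if start + size >= len(messages):
            break  # this window already reaches the last message
        start += step
    return windows
```

Every window after the first costs one summarization call, which is where the per-chunk LLM cost described below comes from.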

This actually worked. The summaries captured context properly, and the AI started retrieving relevant past interactions. But the architecture had problems:

  • Two embedding models — voyage-context-3 for documents, voyage-4-large for conversations
  • Two tables — ai_content and conversation_chunks
  • Two search functions — each with its own query embedding
  • An LLM call per chunk — Claude Haiku ran on every 50-message window just for preprocessing

The cost to embed 2,818 resolved conversations (59,000+ messages) was roughly $85 — almost entirely from the Haiku summarization step. And every time a conversation was reopened and re-resolved, the entire window had to be re-summarized.

It was working, but it was expensive and fragile. Two parallel pipelines meant twice the surface area for bugs.

The Insight: Voyage Context 3 Already Does This

Turns out the summarization step was solving a problem that Voyage Context 3 already handles natively.

Voyage Context 3 is a contextualized embedding model. When you pass it all chunks from a document in a single API call, it embeds each chunk with awareness of the full document context. This is exactly what the Haiku summarization was trying to achieve — giving each chunk enough context to be meaningful on its own — but Voyage does it at the embedding level, without an LLM call.
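A minimal sketch of what "all chunks in a single API call" looks like. It assumes `client` is a `voyageai.Client()` and that the Python client exposes `contextualized_embed` with a nested `inputs` list (one inner list per document) and a `results[i].embeddings` response shape. That is how I understand the Voyage documentation, so check the current docs before relying on it. The helper name `embed_document_chunks` is hypothetical.

```python
from typing import Any, List


def embed_document_chunks(client: Any, chunks: List[str]) -> List[List[float]]:
    """Embed all chunks of ONE document together so each vector sees its siblings."""
    # Assumption: `client` behaves like voyageai.Client(). The inner list groups
    # chunks that belong to the same document (here, the same conversation).
    response = client.contextualized_embed(
        inputs=[chunks],
        model="voyage-context-3",
        input_type="document",
    )
    # One result per inner list; its embeddings line up with `chunks` by position.
    return response.results[0].embeddings
```

The important design point is the grouping: chunks from different conversations go in different inner lists, so context never bleeds across documents.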

From Voyage’s own benchmarks: Context 3 outperforms Anthropic’s contextual retrieval (which uses LLM-prepended context) by 6.76–20.54% on retrieval tasks. It eliminates the LLM preprocessing step entirely.

Once I understood this, the path was clear.

Version 3: One Model, One Pipeline, One Table

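The final version unified everything. All three sources go through the same pipeline: format the source as text, split it into chunks, embed the chunks with voyage-context-3, and store them in one table.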
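As a rough sketch of that shared path (not the platform's actual code), the only per-source difference is how the text is produced. After that, every source runs through the same split and embed steps. The names `index_source`, `split`, and `embed` are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def index_source(
    text: str,
    split: Callable[[str], List[str]],
    embed: Callable[[List[str]], List[Vector]],
) -> List[Tuple[str, Vector]]:
    """Chunk one source's text and embed all of its chunks in a single call."""
    chunks = split(text)
    if not chunks:
        return []
    vectors = embed(chunks)  # one call per source keeps the chunks contextualized
    if len(vectors) != len(chunks):
        raise ValueError("embedder must return one vector per chunk")
    return list(zip(chunks, vectors))
```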

For conversations specifically, the formatting step filters to human-only messages (no AI responses, no automated messages), labels each line as [Customer] or [Agent], and includes voice transcripts and image descriptions.
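A minimal sketch of such a formatting step. The `Message` shape and the sender values ("customer", "agent", "ai", "system") are assumptions for illustration, and voice transcripts and image descriptions are assumed to already be present in `text`.

```python
from dataclasses import dataclass
from typing import Iterable

# Only human senders get a label; anything else is filtered out.
LABELS = {"customer": "[Customer]", "agent": "[Agent]"}


@dataclass
class Message:
    sender: str  # hypothetical values: "customer", "agent", "ai", "system"
    text: str


def format_conversation(messages: Iterable[Message]) -> str:
    """Render human-only messages as one labelled line per message."""
    lines = []
    for message in messages:
        label = LABELS.get(message.sender)
        if label is None or not message.text.strip():
            continue  # drop AI drafts, automated messages, and empty bodies
        # Collapse internal whitespace so each message stays on its own line.
        lines.append(f"{label} {' '.join(message.text.split())}")
    return "\n".join(lines)
```

Filtering at this stage is what keeps AI-written replies out of the knowledge base, which addresses the feedback-loop risk from Version 1.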

The recursive character splitter tries "\n\n" → "\n" → ". " → " " before hard-splitting, which means it naturally breaks between messages since each is on its own line. A 2,000-character chunk fits roughly 10–50 messages.
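To make the splitting behaviour concrete, here is a small self-contained recursive splitter. It is a sketch of the general technique, not the splitter the platform uses, and the separator order is my reading of the sentence above.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", ". ", " ")  # coarsest first (assumed order)


def split_text(
    text: str, max_chars: int = 2000, separators: Sequence[str] = SEPARATORS
) -> List[str]:
    """Split on the coarsest separator that keeps chunks within max_chars."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: hard-split at the character limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too long: retry with the next, finer separator.
            chunks.extend(split_text(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Because formatted conversations contain no blank lines, the first separator that matches is "\n", so chunk boundaries fall between messages rather than inside them.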

The single content_chunks table uses a polymorphic foreign key design: each row points to exactly one parent, which is either a document, a connector content item, or a conversation.
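The "exactly one parent" rule can be modelled in application code like this. The field names (`document_id`, `connector_content_id`, `conversation_id`) are hypothetical, since the real column names are not shown here. In Postgres the same invariant would typically be a CHECK constraint on the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ContentChunk:
    chunk_index: int
    text: str
    embedding: List[float]
    # Hypothetical nullable foreign keys; exactly one must be set.
    document_id: Optional[str] = None
    connector_content_id: Optional[str] = None
    conversation_id: Optional[str] = None

    def __post_init__(self) -> None:
        parents = (self.document_id, self.connector_content_id, self.conversation_id)
        if sum(p is not None for p in parents) != 1:
            raise ValueError("a chunk must reference exactly one parent")

    @property
    def source_type(self) -> str:
        """Name of the parent kind this chunk belongs to."""
        if self.document_id is not None:
            return "document"
        if self.connector_content_id is not None:
            return "connector_content"
        return "conversation"
```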

Hybrid Search: Vector + BM25 + RRF + Rerank

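Search uses Reciprocal Rank Fusion (RRF) to combine vector similarity and full-text search results. Each result's score is the sum of 1 / (k + rank) over the two ranked lists, so items that rank well in both rise to the top.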
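A minimal sketch of RRF over two ranked lists of chunk IDs. The constant `k = 60` is the value commonly used in the literature and is an assumption here, not necessarily what the platform's search function uses.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) across lists."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda item: scores[item], reverse=True)


vector_hits = ["a", "b", "c"]  # ordered by vector similarity
bm25_hits = ["b", "d", "a"]    # ordered by full-text score
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF only looks at ranks, so the cosine distances and full-text scores never need to be normalized against each other.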

The top candidates from RRF are then reranked with voyage-rerank-2 to get the final relevance ordering. This two-stage approach — cheap hybrid search for recall, expensive reranking for precision — works well in practice.

The Numbers

| Metric | Before (V2) | After (V3) |
|---|---|---|
| Embedding models | 2 (voyage-context-3 + voyage-4-large) | 1 (voyage-context-3) |
| Tables | 2 (ai_content + conversation_chunks) | 1 (content_chunks) |
| Search functions | 2 | 1 |
| Query embeddings per search | 2 | 1 |
| LLM preprocessing | Claude Haiku per window | None |
| Cost to embed 2,818 conversations | ~$85 | $0.19 |
| Cost per re-resolve | ~$0.03 | $0.002 |
| AI draft acceptance rate | 0.2% | 10.9% |

The acceptance rate here measures drafts accepted without any edits — the strictest possible metric. When agents deliberately trigger the AI (manual mode), the acceptance rate reaches 35%.
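The cost figures can be sanity-checked with a few lines of arithmetic, using only the numbers quoted in this post:

```python
v2_total = 85.00       # approximate V2 cost in dollars (Haiku summaries + embeddings)
v3_total = 0.19        # V3 cost in dollars (contextualized embeddings only)
conversations = 2818

reduction = v2_total / v3_total               # about 447, rounded to "450x" in the post
v2_per_conversation = v2_total / conversations  # about $0.03 per conversation

print(f"cost reduction: {reduction:.0f}x")
print(f"V2 cost per conversation: ${v2_per_conversation:.3f}")
```

The V2 per-conversation figure lines up with the roughly $0.03 cost per re-resolve, since re-resolving a conversation meant re-summarizing it.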

Tradeoffs

One tradeoff worth mentioning: Voyage Context 3 needs all chunks from a document together in a single API call for contextualization to work properly. So when a conversation reopens and re-resolves, I delete all existing chunks and re-embed from scratch. It’s not incremental.

At $0.002 per re-embed, this doesn’t matter in practice. But it’s a design constraint worth knowing about.
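A sketch of the delete-and-re-embed step, using an in-memory dict as a stand-in for the chunks table. The names `reindex_conversation`, `store`, and `embed` are hypothetical. In a real database the delete and insert would run in one transaction.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, List[float]]  # (chunk text, embedding)


def reindex_conversation(
    store: Dict[str, List[Row]],
    conversation_id: str,
    chunks: List[str],
    embed: Callable[[List[str]], List[List[float]]],
) -> None:
    """Replace every stored chunk of a conversation with freshly embedded ones."""
    # Embed first, so a failed embedding call leaves the old rows untouched.
    vectors = embed(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embedder must return one vector per chunk")
    # Assigning the key drops all previous rows for this conversation at once.
    store[conversation_id] = list(zip(chunks, vectors))
```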

What I Took Away

The RAG landscape has moved fast — agentic RAG, vectorless approaches, GraphRAG for multi-hop reasoning. Each solves different problems. But the principle that guided these three rewrites holds:

Before adding complexity, check if a simpler architecture can do the same job.

In my case, one embedding model replaced an LLM plus a separate pipeline. The retrieval quality improved. The cost dropped 450x. And the client’s team finally started accepting the AI’s suggestions.

Sometimes the right rewrite isn’t adding more — it’s finding the tool that was designed for exactly your problem.
