
3 Rewrites to a RAG Pipeline My Client Liked — And Better AI Acceptance
How 3 rewrites of a RAG pipeline — from LLM summarization to Voyage Context 3 — unified 3 content sources, cut costs 450x, and improved AI draft acceptance.
Published on: Wednesday, Mar 25, 2026
Spent the last few months reworking the retrieval pipeline for a customer support platform I work on. Three complete rewrites in five days before landing somewhere both I and my client were happy with.
The platform has an AI assistant that drafts responses for every incoming customer message. When I started, agents accepted only 0.2% of those drafts without edits. After three rewrites of the retrieval pipeline, that number climbed to 10.9% — with the manual trigger acceptance rate hitting 35%.
Here’s what each version looked like, what broke, and why the final rewrite worked.
The Setup: Three Knowledge Sources
The platform’s AI has three knowledge sources:
- Uploaded documents — product manuals, pricing sheets, policies
- Scraped website pages — the client’s site, crawled and indexed
- Resolved support interactions — real conversations between agents and customers
Documents and web pages were straightforward. Chunk the text, embed with voyage-context-3, store in pgvector, hybrid search at query time. Standard stuff.
Support interactions were a different story.
Version 1: Embed Individual Turns
The first approach was simple: take each agent response, embed it, store it in a conversation_turn_embeddings table. At query time, search across both the KB articles and these individual turns.
Problems showed up fast:
- Low coverage — only 7.7% of conversations had enough messages to be useful
- Fragmented context — a single agent reply like "€120 extra" means nothing without knowing the customer asked about pricing
- Corrections lost — when agents corrected themselves or added follow-up details without a customer prompt, those never got captured
- AI pollution risk — automated and AI responses could leak into the knowledge base, creating a feedback loop
Individual turns lacked the conversation-level context that made them useful in the first place.
Version 2: Summarize Then Embed

The second version tried to solve the context problem by adding an LLM preprocessing step:
- Take resolved conversations
- Slide a 50-message window (10-message overlap) across the messages
- Send each window to Claude Haiku for summarization
- Embed the summaries with voyage-4-large
- Store in a separate conversation_chunks table
- Search with a separate function and a separate query embedding
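The windowing step above can be sketched roughly as follows. This is a minimal illustration, not the production code: the function name `sliding_windows` and its parameters are hypothetical, and only the 50-message window and 10-message overlap come from the description above.

```python
from typing import List, Sequence


def sliding_windows(
    messages: Sequence[str], size: int = 50, overlap: int = 10
) -> List[List[str]]:
    """Split messages into windows of `size` that share `overlap` messages."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts this many messages later
    windows: List[List[str]] = []
    start = 0
    while start < len(messages):
        windows.append(list(messages[start:start + size]))
        if start + size >= len(messages):
            break  # this window already reached the end of the conversation
        start += step
    return windows
```

Each window would then go to the summarizer separately, which is why the LLM cost in this version grows with conversation length.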
This actually worked. The summaries captured context properly, and the AI started retrieving relevant past interactions. But the architecture had problems:
- Two embedding models: voyage-context-3 for documents, voyage-4-large for conversations
- Two tables: ai_content and conversation_chunks
- Two search functions: each with its own query embedding
- An LLM call per chunk: Claude Haiku ran on every 50-message window just for preprocessing
The cost to embed 2,818 resolved conversations (59,000+ messages) was roughly $85 — almost entirely from the Haiku summarization step. And every time a conversation was reopened and re-resolved, the entire window had to be re-summarized.
It was working, but it was expensive and fragile. Two parallel pipelines meant twice the surface area for bugs.
The Insight: Voyage Context 3 Already Does This

Turns out the summarization step was solving a problem that Voyage Context 3 already handles natively.
Voyage Context 3 is a contextualized embedding model. When you pass it all chunks from a document in a single API call, it embeds each chunk with awareness of the full document context. This is exactly what the Haiku summarization was trying to achieve — giving each chunk enough context to be meaningful on its own — but Voyage does it at the embedding level, without an LLM call.
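As a rough sketch of what "all chunks in a single API call" looks like: the helper below takes a client object and sends one document's chunks together. It assumes the client exposes a `contextualized_embed` method shaped like the one in Voyage's Python SDK (one inner list per document, one result per document); the helper name `embed_document_chunks` is hypothetical.

```python
from typing import Any, List


def embed_document_chunks(client: Any, chunks: List[str]) -> List[List[float]]:
    """Embed all chunks of ONE document in a single contextualized call."""
    response = client.contextualized_embed(
        inputs=[chunks],  # one inner list per document; here, a single document
        model="voyage-context-3",
        input_type="document",
    )
    # One result per input document; exactly one was sent.
    return response.results[0].embeddings
```

With the official SDK this would presumably be called as `embed_document_chunks(voyageai.Client(), chunks)`, with the API key supplied through the environment. The important part is that the chunks are not embedded one request at a time, since that would drop the document-level context.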
From Voyage’s own benchmarks: Context 3 outperforms Anthropic’s contextual retrieval (which uses LLM-prepended context) by 6.76–20.54% on retrieval tasks. It eliminates the LLM preprocessing step entirely.
Once I understood this, the path was clear.
Version 3: One Model, One Pipeline, One Table
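The final version unified everything. All three sources go through the same pipeline: format the content as text, split it with a recursive character splitter into chunks of up to 2,000 characters, embed all of a source's chunks together with voyage-context-3, and store them in a single content_chunks table.
For conversations specifically, the formatting step filters to human-only messages (no AI responses, no automated messages), labels each line as [Customer] or [Agent], and includes voice transcripts and image descriptions.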
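A minimal sketch of that formatting step, under assumptions: the `Message` shape and the `sender` values are hypothetical, and voice transcripts and image descriptions are assumed to already be present in `text` by the time this runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    sender: str  # hypothetical values: "customer", "agent", "ai", "automation"
    text: str    # assumed to already include transcripts / image descriptions


LABELS = {"customer": "[Customer]", "agent": "[Agent]"}


def format_conversation(messages: List[Message]) -> str:
    """Keep human messages only and put each one on its own labeled line."""
    lines = []
    for message in messages:
        label = LABELS.get(message.sender)
        if label is None:
            continue  # drop AI and automated messages
        # Collapse internal whitespace so one message never spans several lines.
        text = " ".join(message.text.split())
        if text:
            lines.append(f"{label} {text}")
    return "\n".join(lines)
```

Keeping one message per line matters for the next step, because the splitter prefers to break on newlines.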
The recursive character splitter tries "\n\n" → "\n" → ". " → " " (a single space) before hard-splitting. Because each message is on its own line, it naturally breaks between messages. A 2,000-character chunk fits roughly 10–50 messages.
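To make the splitting behavior concrete, here is a simplified recursive splitter. It is a sketch of the general technique, not the library the pipeline actually uses; the function name and the exact separator tuple are assumptions, and unlike most library splitters it does no chunk overlap.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", ". ", " ")  # assumed order, coarsest first


def recursive_split(
    text: str, max_chars: int = 2000, separators: Sequence[str] = SEPARATORS
) -> List[str]:
    """Split on the coarsest separator that works, falling back to finer ones."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-split at fixed offsets.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_chars, finer)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # A single piece is still too long: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks
```

On newline-delimited conversation text, the "\n" level is usually enough, so chunks end on message boundaries and the finer separators only apply to unusually long messages.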
The single content_chunks table uses a polymorphic foreign key design: each row points to exactly one of a document, a connector content item, or a conversation.
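The "exactly one parent" rule can be illustrated with a small Python model of a row. The field names below (`document_id`, `connector_content_id`, `conversation_id`, `chunk_index`) are hypothetical, not the real column names; in the database itself the same rule would typically be enforced with a check constraint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentChunk:
    chunk_index: int
    text: str
    # Exactly one of these three references may be set (hypothetical names).
    document_id: Optional[int] = None
    connector_content_id: Optional[int] = None
    conversation_id: Optional[int] = None

    def __post_init__(self) -> None:
        owners = [self.document_id, self.connector_content_id, self.conversation_id]
        if sum(owner is not None for owner in owners) != 1:
            raise ValueError("a chunk must reference exactly one source")

    @property
    def source_type(self) -> str:
        if self.document_id is not None:
            return "document"
        if self.connector_content_id is not None:
            return "connector_content"
        return "conversation"
```

The payoff of this design is that search only ever queries one table, and the source type is derived from whichever reference is populated.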
Hybrid Search: Vector + BM25 + RRF + Rerank
Search uses Reciprocal Rank Fusion (RRF) to combine the vector similarity and BM25 full-text search results into a single ranking.
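RRF itself is small enough to show in full. The sketch below fuses any number of ranked ID lists; the function name is hypothetical, and `k = 60` is the commonly used default constant, assumed here because the value used in this pipeline is not stated.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda item: scores[item], reverse=True)


vector_hits = ["a", "b", "c"]  # chunk ids ordered by vector similarity
bm25_hits = ["b", "d", "a"]    # chunk ids ordered by full-text score
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because RRF only looks at ranks, it needs no score normalization between cosine similarity and BM25, which is what makes it a convenient way to merge the two result sets.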
The top candidates from RRF are then reranked with Voyage's rerank-2 model to get the final relevance ordering. This two-stage approach (cheap hybrid search for recall, expensive reranking for precision) works well in practice.
The Numbers
| Metric | Before (V2) | After (V3) |
|---|---|---|
| Embedding models | 2 (voyage-context-3 + voyage-4-large) | 1 (voyage-context-3) |
| Tables | 2 (ai_content + conversation_chunks) | 1 (content_chunks) |
| Search functions | 2 | 1 |
| Query embeddings per search | 2 | 1 |
| LLM preprocessing | Claude Haiku per window | None |
| Cost to embed 2,818 conversations | ~$85 | $0.19 |
| Cost per re-resolve | ~$0.03 | $0.002 |
| AI draft acceptance rate | 0.2% | 10.9% |
The acceptance rate here measures drafts accepted without any edits — the strictest possible metric. When agents deliberately trigger the AI (manual mode), the acceptance rate reaches 35%.
Tradeoffs
One worth mentioning: Voyage Context 3 needs all chunks from a document together in a single API call for contextualization to work properly. So when a conversation reopens and re-resolves, I delete all existing chunks and re-embed from scratch. It’s not incremental.
At $0.002 per re-embed, this doesn’t matter in practice. But it’s a design constraint worth knowing about.
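The delete-and-re-embed constraint can be sketched with an in-memory stand-in for the table. Everything here is illustrative: `ChunkStore`, `reindex_conversation`, and the `embed_all` callback are hypothetical names, and a real implementation would do the swap inside a database transaction.

```python
from typing import Callable, Dict, List, Tuple

Embedding = List[float]
# Hypothetical stand-in for content_chunks: conversation id -> (text, embedding) rows.
ChunkStore = Dict[int, List[Tuple[str, Embedding]]]


def reindex_conversation(
    store: ChunkStore,
    conversation_id: int,
    chunks: List[str],
    embed_all: Callable[[List[str]], List[Embedding]],
) -> int:
    """Replace every stored chunk of a conversation with a fresh full re-embed."""
    # Embed first, with ALL chunks in one call, so a failure leaves old rows intact.
    embeddings = embed_all(chunks)
    if len(embeddings) != len(chunks):
        raise ValueError("expected one embedding per chunk")
    # Overwrite rather than append: old chunks were embedded with stale context.
    store[conversation_id] = list(zip(chunks, embeddings))
    return len(chunks)
```

Appending only the new chunks would be cheaper, but their embeddings would not share context with the older ones, which is the reason the whole conversation is re-embedded.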
What I Took Away
The RAG landscape has moved fast — agentic RAG, vectorless approaches, GraphRAG for multi-hop reasoning. Each solves different problems. But the principle that guided these three rewrites holds:
Before adding complexity, check if a simpler architecture can do the same job.
In my case, one embedding model replaced an LLM plus a separate pipeline. The retrieval quality improved. The cost dropped 450x. And the client’s team finally started accepting the AI’s suggestions.
Sometimes the right rewrite isn’t adding more — it’s finding the tool that was designed for exactly your problem.
