AlgoCoder · AI & LLM Engineering · Case AI-02

RAG Pipeline Accuracy Remediation for a Knowledge Assistant That Wasn't Working

The assistant returned wrong answers often enough that users had stopped using it.


Abstract

An organization had built a RAG-based knowledge assistant over its internal documentation: policies, procedures, technical references, and customer-facing FAQ content. The assistant launched to internal users with reasonable expectations, but users were abandoning it because its answer quality fell short of them.

I. Problem Statement

The assistant was returning wrong answers at a rate users couldn't accept. The failures fell into three groups. Some answers cited the right document but extracted the wrong information from it. Some cited documents that weren't relevant to the question. Some confidently fabricated information that wasn't in any document. Users had learned to distrust the assistant and were going back to manual document search even when the assistant would have been faster.

II. Methodology

The engagement was a systematic accuracy remediation covering chunking, embedding, retrieval, reranking, grounding, and evaluation.

The chunking strategy was rebuilt. The original implementation had used a fixed-size chunking approach that frequently split related content across chunks and combined unrelated content into the same chunk. The new chunking respected document structure — section boundaries, list groupings, table integrity — so that each chunk represented a coherent unit of meaning rather than an arbitrary slice of text.
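The structure-aware approach described above can be sketched roughly as follows. This is a minimal illustration, not the engagement's actual implementation: it assumes Markdown-style documents where headings start with `#` and where lists and tables are contiguous blocks separated by blank lines. The names `Chunk`, `split_sections`, `chunk_by_structure`, and the `max_chars` budget are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # section heading the chunk belongs to
    text: str     # one or more whole blocks from that section


def split_sections(doc: str) -> list[tuple[str, list[str]]]:
    """Split a document into (heading, blocks) pairs at '#' headings."""
    sections, heading, blocks = [], "", []
    for block in re.split(r"\n\s*\n", doc.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            first_line, _, rest = block.partition("\n")
            if blocks:
                sections.append((heading, blocks))
            heading, blocks = first_line.lstrip("# ").strip(), []
            if rest.strip():
                blocks.append(rest.strip())
        else:
            blocks.append(block)
    if blocks:
        sections.append((heading, blocks))
    return sections


def chunk_by_structure(doc: str, max_chars: int = 800) -> list[Chunk]:
    """Pack whole blocks into chunks without crossing section boundaries.

    A block (paragraph, list, or table) is never split, so a single
    oversized block becomes its own chunk even if it exceeds max_chars.
    """
    chunks = []
    for heading, blocks in split_sections(doc):
        current, size = [], 0
        for block in blocks:
            if current and size + len(block) > max_chars:
                chunks.append(Chunk(heading, "\n\n".join(current)))
                current, size = [], 0
            current.append(block)
            size += len(block)
        if current:
            chunks.append(Chunk(heading, "\n\n".join(current)))
    return chunks
```

Carrying the heading with each chunk is one possible way to keep section context available at retrieval time; whether the original pipeline did this is not stated in the case.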

The embedding model was reviewed and replaced. The original model had been chosen for cost; the cost difference for the larger, stronger model was material at the assistant's query volume but justified by the accuracy improvement. Embeddings were regenerated against the new chunking and the new model.

A reranking layer was added between retrieval and generation. The retrieval layer pulled a candidate set wider than the generation layer would use; the reranker — a model specifically trained for relevance scoring — selected the best candidates. The two-stage retrieve-and-rerank pattern materially improved the quality of context passed to the generation model.

Grounding was strengthened. The generation prompt was rewritten to require explicit citation of source chunks, and the model was instructed to refuse to answer when the retrieved context didn't support an answer. Hallucination, meaning confident wrong answers in the absence of grounding, became substantially less common because the prompt explicitly instructed the model to admit when it didn't know.
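A grounding prompt of this kind, paired with a cheap post-check on the output, might look like the sketch below. The prompt wording, the `REFUSAL` string, and the `[n]` citation format are illustrative assumptions; the case does not state the actual prompt text.

```python
import re

# Hypothetical refusal sentence the model is asked to return verbatim.
REFUSAL = "I don't know based on the provided documents."


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below. Cite each claim as [n].\n"
        f"If the sources do not support an answer, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def is_grounded(answer: str, num_chunks: int) -> bool:
    """Accept an answer only if it is the refusal or cites valid chunk numbers.

    This checks that citations exist and point at real chunks. It does not
    verify that the cited chunk actually supports the claim.
    """
    if answer.strip() == REFUSAL:
        return True
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= n <= num_chunks for n in cited)
```

An answer that fails `is_grounded` could be retried or replaced with the refusal text; which fallback suits a given assistant is a product decision.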

An evaluation suite was built. A representative set of questions with known good answers was assembled from real user queries. The evaluation ran against every change to the pipeline and surfaced regressions before they reached users. Quality became measurable rather than anecdotal.
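A regression harness along these lines can be quite small. The sketch below assumes the pipeline is exposed as a function from question to answer string and that correctness can be approximated by required substrings; real suites often need a stricter or model-assisted grader. `EvalCase`, `run_eval`, `accuracy`, and `regressions` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    must_contain: list[str]  # facts a correct answer has to mention


def run_eval(
    pipeline: Callable[[str], str], cases: list[EvalCase]
) -> dict[str, bool]:
    """Run every case through the pipeline and record pass or fail."""
    results = {}
    for case in cases:
        answer = pipeline(case.question).lower()
        results[case.question] = all(
            fact.lower() in answer for fact in case.must_contain
        )
    return results


def accuracy(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> list[str]:
    """Questions that passed on the baseline but fail on the candidate."""
    return [q for q, ok in baseline.items() if ok and not candidate.get(q, False)]
```

Tracking per-question regressions alongside the aggregate score matters because a change can raise overall accuracy while breaking questions that previously worked.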

User feedback was instrumented. Users could mark answers as correct or incorrect; the feedback flowed into a queue for the team's review and into the evaluation suite over time, expanding the test coverage as real usage surfaced new question patterns.
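The flow from feedback to test coverage could be modeled as below. This is a sketch under stated assumptions: feedback is a simple record, the review step produces a human-verified reference answer per question, and the evaluation set is a mapping from question to reference answer. None of these shapes are given in the case; `Feedback`, `review_queue`, and `promote` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    question: str
    answer: str
    correct: bool  # the user's thumbs-up or thumbs-down


def review_queue(feedback: list[Feedback]) -> list[Feedback]:
    """Order items for review with answers marked incorrect first."""
    return sorted(feedback, key=lambda f: f.correct)


def promote(
    feedback: list[Feedback],
    reference_answers: dict[str, str],
    eval_set: dict[str, str],
) -> int:
    """Add reviewed questions to the evaluation set.

    Only questions with a human-verified reference answer are promoted,
    and existing entries are never overwritten. Returns the number added.
    """
    added = 0
    for fb in feedback:
        ref = reference_answers.get(fb.question)
        if ref is not None and fb.question not in eval_set:
            eval_set[fb.question] = ref
            added += 1
    return added
```

Keeping a human review step between raw feedback and the evaluation set guards against users marking a correct answer as wrong, or the reverse.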

III. Results & Discussion

Answer quality improved substantially against both the evaluation suite and user feedback. Usage recovered as users encountered enough correct answers to rebuild trust, and the "I'll just search the documents myself" behavior decreased noticeably. The team could now measure accuracy directly instead of depending on user reports to know whether a change helped or hurt.


AI-02 · Case 2 of 12 in AI & LLM Engineering
