Semantic Distance Failures in RAG: Why Your AI Retrieves the Wrong Chunks

(Even when the answer is right there)

Baljeet Dogra

Your documents contain the answer. You know they do. But your RAG pipeline returns something adjacent, something vaguely related, or, worst of all, something confidently wrong. This is a semantic distance failure, and it is one of the most common reasons production RAG systems underperform despite having the right data.

What semantic distance actually means in a RAG context

When a user submits a query, your retriever converts it into a vector embedding and searches for nearest neighbours in your vector store. "Nearest" usually means highest cosine similarity: the cosine of the angle between two vectors in high-dimensional space.

Semantic distance is the complement of that similarity (cosine distance is 1 minus cosine similarity). Small distance = high conceptual similarity. Large distance = the retriever thinks those two things are not closely related.

The failure happens when the mathematical distance between a query embedding and the correct chunk embedding is larger than the distance to an incorrect but superficially similar chunk. The retriever picks the wrong chunk. The LLM hallucinates on top of it. Your user loses trust.
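To make the failure concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up toy values, not output from any real embedding model; they are chosen only so that the distractor sits closer to the query than the correct chunk does.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance used for ranking: 0 means same direction, larger means further apart."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical toy "embeddings", for illustration only.
query = [0.9, 0.1, 0.3]
chunks = {
    "correct": [0.4, 0.8, 0.3],     # right answer, different vocabulary
    "distractor": [0.8, 0.2, 0.1],  # wrong answer, similar surface wording
}

# Rank chunks the way a retriever would: smallest distance first.
ranked = sorted(chunks, key=lambda name: cosine_distance(query, chunks[name]))
print(ranked)  # ['distractor', 'correct']
```

The retriever is doing exactly what it was asked to do. The geometry is what is wrong.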

The four core failure modes

1. Vocabulary mismatch between query and document

The most fundamental failure. The user asks in their language; your documents are written in a different register.

User query: "What happens if I miss a mortgage payment?"

Document: "Consequences of payment default under a regulated credit agreement"

These mean the same thing. But their embeddings may sit far apart because surface vocabulary diverges significantly. Embedding models learn from co-occurrence patterns—terms that rarely appear together in training will be embedded further apart, even if semantically equivalent in your domain.

Fix: Query rewriting and HyDE (Hypothetical Document Embeddings). Instead of embedding the raw query, generate a hypothetical answer in the document's register and embed that instead. You are now searching document-space with a document-shaped query.
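A rough sketch of the HyDE flow, under heavy assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `fake_llm` is a hypothetical stub returning a canned answer where a real system would prompt an LLM. Only the shape of the pipeline is the point here.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query_text, documents, top_k=1):
    """Rank documents by similarity to whatever text is embedded as the query."""
    q = embed(query_text)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

def hyde_search(query, generate_hypothetical, documents, top_k=1):
    """Embed a hypothetical answer in the documents' register instead of the raw query."""
    return search(generate_hypothetical(query), documents, top_k)

def fake_llm(query):
    # Hypothetical stub: a real system would ask an LLM to draft this answer.
    return "payment default under a regulated credit agreement leads to arrears charges"

documents = [
    "Consequences of payment default under a regulated credit agreement",
    "What to do if you miss a delivery of your parcel",
]
query = "What happens if I miss a mortgage payment?"

raw_hit = search(query, documents)[0]                  # matches on "what", "if", "miss"
hyde_hit = hyde_search(query, fake_llm, documents)[0]  # matches on the legal register
```

In this toy setup the raw query lands on the parcel-delivery document, while the hypothetical answer lands on the credit-agreement clause. The hypothetical answer does not need to be factually correct; it only needs to sound like your documents.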

2. Chunk boundary failures

The right section is in your index, but the answer is split across a chunk boundary. Each chunk is incomplete, so neither scores highly enough to be retrieved.

A policy document defines a term on page 4 and applies it in a critical clause on page 7. Fixed-size chunking splits them into separate, unconnected chunks with no overlap and no shared context.

Why it happens: Fixed-size chunking is fast and simple but semantically blind. It cuts sentences, separates definitions from their applications, and severs cause from effect.

Fix: Use semantic chunking (split on meaning boundaries, not character counts), add overlapping windows, and implement parent-child chunk retrieval—retrieve small chunks for precision but pass the parent chunk to the LLM for context.
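Overlapping windows and parent-child retrieval can be sketched in a few lines. This is an illustration rather than a production chunker: the function names are my own, chunk sizes are in words rather than tokens, and a simple word-overlap match stands in for vector search.

```python
def sliding_chunks(words, size, overlap):
    """Split a list of words into windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def build_parent_child_index(sections, child_size=8, overlap=2):
    """Return (child_text, parent_id) pairs: children are searched, parents are returned."""
    index = []
    for parent_id, section in enumerate(sections):
        for child in sliding_chunks(section.split(), child_size, overlap):
            index.append((" ".join(child), parent_id))
    return index

def retrieve_parent(query, index, sections):
    """Match on small children (toy word overlap), then hand back the full parent."""
    q = set(query.lower().split())
    _, parent_id = max(index, key=lambda item: len(q & set(item[0].lower().split())))
    return sections[parent_id]

sections = [
    "A default event means any missed scheduled payment. On a default event "
    "the lender may charge arrears fees and report to credit agencies.",
    "Premiums are calculated annually based on the insured value of the property.",
]
index = build_parent_child_index(sections)
context = retrieve_parent("arrears fees", index, sections)
```

The match happens on the small child chunk that mentions arrears fees, but the LLM receives the whole parent section, including the definition of a default event that the child alone would have lost.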

3. Query-document asymmetry

Short queries are embedded very differently from long, rich document passages. A three-word query like "liability exclusion clause" gives the model very little context to encode, while a 400-word chunk gives it a great deal. Both come out as dense vectors of the same dimensionality, but they rarely land close together. The geometric relationship is inherently asymmetric.

Why it happens: Most general-purpose embedders were trained on symmetric similarity tasks—comparing sentences of similar length and structure.

Fix: Use embeddings trained for asymmetric retrieval, such as bge-large, e5-mistral, or Cohere's embed-v3, and apply the query-side and document-side instructions, prefixes, or input types that each model documents. These are built to handle the asymmetry between a short question and a long answer.
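In practice, much of this fix comes down to tagging each text with its role before it reaches the embedder. The helper below is a hypothetical sketch: the "query: " and "passage: " strings follow the convention used by the E5 family, and other models use different instructions or a separate input-type parameter, so treat the prefix table as an assumption to check against your model's documentation.

```python
# Assumed E5-style prefixes; replace with whatever your embedding model documents.
PREFIXES = {"query": "query: ", "document": "passage: "}

def prepare_for_embedding(text, role, prefixes=PREFIXES):
    """Prepend the role-specific prefix and normalise whitespace before embedding."""
    if role not in prefixes:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(prefixes)}")
    return prefixes[role] + " ".join(text.split())

query_input = prepare_for_embedding("liability exclusion clause", "query")
doc_input = prepare_for_embedding("The insurer shall not be liable for ...", "document")
```

The easy mistake is applying the prefix at query time but not at indexing time, or the other way round. Both sides need to match how the model was trained.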

4. Semantic dilution in dense documents

Particularly common in legal, financial, and technical documents. A single chunk covers multiple concepts because the source document is dense. Its embedding becomes an average of several ideas—pulled toward none of them strongly.

A 500-word clause covering premium calculation, exclusions, and dispute resolution simultaneously has an embedding that sits between all three topics. A query about dispute resolution retrieves a less relevant but more focused chunk instead.

Why it happens: Vector embeddings collapse a chunk's entire meaning into a single point. Polysemantic chunks create ambiguous points that sit between clusters rather than inside them.

Fix: Proposition indexing. Instead of indexing chunks as they appear, decompose each chunk into atomic propositions and embed each one independently. Retrieve propositions, then reconstruct parent context before passing to the LLM.
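The dilution effect, and why proposition indexing helps, shows up even with toy numbers. In this sketch the three axes stand for premium calculation, exclusions, and dispute resolution; every vector is hand-written for illustration, and the propositions are hand-written too (a real pipeline would typically use an LLM to extract them).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Approximate a multi-topic chunk embedding as the average of its parts."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def best_parent(query_vec, index):
    """index holds (vector, parent_id) pairs; return the parent of the closest vector."""
    return max(index, key=lambda entry: cosine(query_vec, entry[0]))[1]

# Axes: (premium calculation, exclusions, dispute resolution). Hypothetical values.
propositions = {
    "Premiums are recalculated annually.": [1.0, 0.0, 0.0],
    "Flood damage is excluded.": [0.0, 1.0, 0.0],
    "Disputes go to arbitration.": [0.0, 0.0, 1.0],
}
dense_clause = mean_vector(list(propositions.values()))  # one vector for the whole clause
other_chunk = [0.2, 0.3, 0.7]                            # focused, but less relevant
query = [0.0, 0.0, 1.0]                                  # a question about disputes

chunk_index = [(dense_clause, "clause"), (other_chunk, "other")]
proposition_index = [(vec, "clause") for vec in propositions.values()] + [(other_chunk, "other")]

chunk_level = best_parent(query, chunk_index)              # 'other'
proposition_level = best_parent(query, proposition_index)  # 'clause'
```

With one embedding per chunk, the averaged clause loses to the more focused chunk. With one embedding per proposition, the arbitration sentence wins outright and points back to its parent clause, which is what gets passed to the LLM.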

The compounding effect

These failures rarely occur in isolation. A vocabulary mismatch pushes the right chunk down the ranking. A chunk boundary failure means neither adjacent chunk compensates. Query-document asymmetry can push the right chunk out of the candidate set entirely, so your reranker never sees it and cannot catch the error either.

By the time the LLM receives its context window, the correct information is not there—and a plausible but wrong chunk is. The LLM generates a fluent, confident answer from the context it received. The failure was upstream. The hallucination is just the symptom.

A diagnostic framework

Before rebuilding your pipeline, locate the failure:

| Test | What it reveals |
| --- | --- |
| Embed the query and the expected answer chunk manually, then compute cosine similarity directly | Whether the retriever can find the right chunk, or whether the distance is genuinely too large |
| Retrieve top-20 instead of top-5. Does the right chunk appear? | Isolates ranking problems from vocabulary mismatch problems |
| Test with a verbose, document-register version of the query | Diagnoses query-document asymmetry |
| Inspect chunk boundaries around the expected answer | Reveals boundary failures immediately |
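The first two tests are easy to script. The sketch below assumes you already have the query vector and a dictionary of chunk vectors; the `diagnose` helper is a name I made up for this example, and the 30 two-dimensional chunk vectors are synthetic, arranged so that similarity to the query falls steadily from chunk-0 to chunk-29.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def diagnose(query_vec, chunk_vecs, expected_id, ks=(5, 20)):
    """Report the expected chunk's similarity, its rank, and whether it makes each top-k."""
    scores = {cid: cosine(query_vec, vec) for cid, vec in chunk_vecs.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    rank = ranking.index(expected_id) + 1
    return {
        "similarity": scores[expected_id],
        "rank": rank,
        "in_top_k": {k: rank <= k for k in ks},
    }

# Synthetic data: each chunk is rotated slightly further away from the query.
query_vec = [1.0, 0.0]
chunk_vecs = {f"chunk-{i}": [math.cos(i * 0.05), math.sin(i * 0.05)] for i in range(30)}

report = diagnose(query_vec, chunk_vecs, expected_id="chunk-7")
print(report["rank"], report["in_top_k"])  # 8 {5: False, 20: True}
```

A chunk that shows up in the top-20 but not the top-5 points to a ranking problem that a reranker may be able to fix. A chunk that is missing from the top-20 with a low raw similarity points back to vocabulary mismatch or chunking.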

What good looks like

A RAG system with low semantic distance failure rates has:

  • Asymmetric embedding models aligned to retrieval tasks, not general similarity
  • Semantic or proposition-level chunking rather than fixed-size splits
  • Query rewriting or HyDE in the retrieval pipeline
  • A reranker (cross-encoder, not bi-encoder) as a second-pass filter
  • Metadata filtering to constrain retrieval before semantic search runs
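The last two items on that list fit together as a filter-then-rank pipeline, sketched below. Everything here is a placeholder: the `doc_type` field is a hypothetical metadata key, word overlap stands in for a bi-encoder, and an exact-phrase check stands in for a cross-encoder. The point is the order of operations, not the scoring.

```python
def retrieve(query, chunks, doc_type, first_pass_score, rerank_score, candidates=20, top_k=5):
    """Metadata filter, then a cheap first pass, then a more careful second pass."""
    # 1. Constrain the pool with metadata before any semantic search runs.
    pool = [c for c in chunks if c["doc_type"] == doc_type]
    # 2. First pass: cheap scoring over the whole pool, keep a generous shortlist.
    shortlist = sorted(pool, key=lambda c: first_pass_score(query, c["text"]), reverse=True)[:candidates]
    # 3. Second pass: rerank only the shortlist with the more expensive scorer.
    return sorted(shortlist, key=lambda c: rerank_score(query, c["text"]), reverse=True)[:top_k]

def word_overlap(query, text):
    """Toy stand-in for bi-encoder similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def phrase_match(query, text):
    """Toy stand-in for a cross-encoder that reads query and text together."""
    return 1 if query.lower() in text.lower() else 0

chunks = [
    {"text": "dispute resolution procedure for policy holders", "doc_type": "policy"},
    {"text": "dispute dispute dispute marketing newsletter", "doc_type": "marketing"},
    {"text": "premium calculation and dispute notes", "doc_type": "policy"},
]
results = retrieve("dispute resolution", chunks, "policy", word_overlap, phrase_match, top_k=1)
```

The shortlist size matters more than it looks: the reranker can only reorder what the first pass hands it, so a correct chunk that misses the shortlist stays missing.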

Retrieval accuracy is the ceiling on your RAG system's quality. The best LLM in the world cannot compensate for a retriever that consistently fetches the wrong context. Closing semantic distance is not a prompt engineering problem—it is an architecture problem, and it needs to be treated as one.

Related reading: LLM cost explosion traps (RAG chunking is trap #8), LLM cost architecture, and LLM model drift (distributional drift compounds retrieval failures).
