Hybrid retrieval, plainly explained

by David Bunting, Founder

Dense retrieval is for meaning

An embedding model turns text into a vector: a list of numbers that captures, roughly, what the text means. Texts with similar meanings end up close together in that vector space. Dense retrieval finds the chunks whose embeddings are closest to the embedding of the query.

This is excellent when the user’s wording and the document’s wording don’t match but the meaning does. A question about “reimbursement for travel” finds a policy about “expense claims for journeys” because the meaning is the same even though no content words are shared. It’s the part of retrieval that handles paraphrase and synonyms, and the part that makes a retrieval demo feel magical.
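To make "close together" concrete, here is a minimal sketch of dense search using cosine similarity. The three-dimensional vectors are hand-written stand-ins, not the output of any real embedding model, and the function names are ours for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written 3-dimensional "embeddings". A real model
# would produce these vectors, typically with hundreds of dimensions.
chunk_vectors = {
    "expense claims for journeys": [0.9, 0.1, 0.2],
    "office opening hours": [0.1, 0.9, 0.1],
    "password reset steps": [0.2, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for "reimbursement for travel"

def dense_search(query_vec, vectors, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in vectors.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

top_chunks = dense_search(query_vector, chunk_vectors)
print(top_chunks[0])
```

With these toy numbers the travel-expenses chunk ranks first even though the query string is never compared to the chunk text. A production system would swap the linear scan for an approximate nearest-neighbour index, but the ranking idea is the same.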

Sparse retrieval is for exact words

BM25 — a sparse keyword-matching algorithm older than most of the people building “AI products” this year — ranks documents by how often the query’s exact terms appear, weighted by how rare those terms are. No embeddings, no semantic guesswork. Just an inverted index and some maths.

This sounds like a step backwards until you remember what dense retrieval is bad at: identifiers (“policy GR-4421”), legal citations, product SKUs, version numbers, surnames. In short, anything where the user means one specific thing and any substitution is wrong. A vector embedding of “GR-4421” tends to sit near other alphanumeric identifiers in the index, so a neighbouring identifier can outrank the right one. BM25 matches the exact token and finds the right one immediately.
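The scoring idea fits in a few lines. The sketch below uses one common variant of the BM25 formula with conventional defaults (k1=1.5, b=0.75); the whitespace tokenizer, the sample documents, and the linear scan in place of a real inverted index are simplifying assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; identifiers like GR-4421 stay whole."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates and is normalised by document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Hypothetical sample documents; only the first mentions GR-4421.
docs = [
    "policy gr-4421 covers travel reimbursement",
    "policy gr-4412 covers equipment purchases",
    "policy hr-0031 covers parental leave",
]
scores = bm25_scores("policy GR-4421", docs)
best = docs[scores.index(max(scores))]
print(best)
```

Every document contains "policy", so that term contributes little. The identifier appears in one document only, so it carries most of the score and the near-identical "gr-4412" gets no credit for being similar.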

The re-ranker is where bad answers get filtered

Run both retrievers, take the top N from each, and you have a candidate set of maybe fifty chunks. Most of those chunks are still wrong — near-miss matches, related-topic passages, that one boilerplate paragraph that turns up in every document. A re-ranker is a cheap second-pass model that scores each candidate against the original query and returns the ones that actually answer it.

Skipping the re-rank is the single biggest reason “RAG demos” collapse the moment they leave a curated test set. The top-k vector neighbours look great on three hand-picked questions. On real production traffic, half of them are noise — and without the re-rank, that noise is what the LLM sees.
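The merge and re-rank steps can be sketched as follows. Everything here is illustrative: the Chunk type, the function names, and especially overlap_score, which is a word-overlap placeholder standing in for a trained re-ranking model. The source field travels with each chunk so attribution is never lost along the way.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source: str  # kept on the chunk so attribution survives every stage

def merge_candidates(dense_hits, sparse_hits, top_n=25):
    """Union of the top N from each retriever, de-duplicated by chunk_id."""
    merged = {}
    for chunk in dense_hits[:top_n] + sparse_hits[:top_n]:
        merged.setdefault(chunk.chunk_id, chunk)
    return list(merged.values())

def rerank(query, candidates, score_fn, keep=3):
    """Score each candidate against the query and keep the best few."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c.text), reverse=True)
    return ranked[:keep]

def overlap_score(query, text):
    """Placeholder scorer: fraction of query words found in the text.
    A real re-ranker would be a trained model; this only stands in for one."""
    query_terms = set(query.lower().split())
    return len(query_terms & set(text.lower().split())) / len(query_terms)

# Hypothetical retriever outputs; c1 is returned by both retrievers.
dense_hits = [
    Chunk("c1", "expense claims for journeys are paid monthly", "travel-policy.pdf"),
    Chunk("c2", "this document is confidential", "handbook.pdf"),
]
sparse_hits = [
    Chunk("c3", "policy gr-4421 covers travel reimbursement", "gr-4421.pdf"),
    Chunk("c1", "expense claims for journeys are paid monthly", "travel-policy.pdf"),
]

candidates = merge_candidates(dense_hits, sparse_hits)
results = rerank("travel reimbursement policy gr-4421", candidates, overlap_score, keep=2)
print(results[0].source)
```

Passing the scorer in as a parameter keeps the pipeline shape independent of whichever re-ranking model is eventually used.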

Hybrid retrieval is, in the end, relatively simple. Dense plus sparse plus a re-ranker on top. Source attribution kept intact through all three stages. That’s the part of Laminae that works quietly, and the part that keeps working the day after the demo ends.
