The most expensive way to use AI is to drag your PDF in every time

April 10, 2026

by David Bunting, Founder

The default pattern

Somebody on your team has a question about a policy document. They open their AI client, drag the PDF in, ask the question, get the answer. The interaction takes a minute. It works. Tomorrow, someone else has a different question about the same document — they do the same thing. So does the third person, and the fourth.

This is the path of least resistance and it’s how most teams actually use AI today. It also has two problems that don’t show up until the bill arrives or the answer goes sideways.

Problem one: you’re paying to embed the same text every time

Before an AI model can reason over a document, the document has to be turned into a representation the model can work with — chunks of text, converted into embeddings, held in working memory for the duration of the conversation. That conversion costs tokens. It is metered and billed, not free.

Every time the same PDF gets dragged into a new conversation, that conversion happens again. Same text, same vectors, same dollars. Multiply that across a policy library or a claims handbook, with an entire customer-success team using AI a dozen times a day, and you’re paying to convert the same content into the same vectors thousands of times a month.

None of this work has to be redone. Once a document is embedded, the embeddings stay reusable for as long as the document and the embedding model stay the same, because they describe the document, not the question being asked of it. A knowledge layer that caches them server-side turns that recurring bill into a one-time cost.
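To make the caching idea concrete, here is a minimal sketch, assuming embeddings are keyed by a hash of the chunk text. The names `EmbeddingCache` and `toy_embed` are hypothetical, and `toy_embed` only stands in for whatever metered embedding call a real system would make; this is not a description of any particular product's internals.

```python
import hashlib


class EmbeddingCache:
    """Caches embeddings keyed by a SHA-256 hash of the chunk text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # the metered call we want to avoid repeating
        self._store = {}           # hash of chunk text -> embedding
        self.misses = 0            # how many times we actually paid to embed

    def get(self, chunk):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(chunk)
            self.misses += 1
        return self._store[key]


def toy_embed(text):
    # Hypothetical stand-in for a real embedding model: returns a tiny
    # deterministic vector so the example runs without any external service.
    return [len(text), sum(map(ord, text)) % 997]


cache = EmbeddingCache(toy_embed)
chunks = ["Refunds are issued within 30 days.", "Claims require a receipt."]

# Simulate a thousand conversations that all touch the same two chunks.
for _ in range(1000):
    for chunk in chunks:
        cache.get(chunk)

print(cache.misses)  # 2
```

Keying on a content hash rather than a filename means an edited document gets re-embedded automatically, while an unchanged one never does. A real cache would also need to include the embedding model's identity in the key.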

Problem two: large documents quietly fail in chat

Every AI model has a context window: a hard ceiling on how much text it can hold in memory at once. The ceiling has been getting bigger every year, and people have started to assume it doesn’t matter. It still matters.

Drop a 200-page contract into a chat. The client refuses, truncates silently, or paginates through the document in passes. In the last two cases, parts of the document fall out of working memory before the model finishes thinking. The answer that comes back looks confident, because the model is always confident, but it’s grounded in whatever fragment happened to be in context at the moment of generation. The bits the model couldn’t see get filled in by a probability engine.

The fix is to stop sending the whole document. If retrieval can find the three paragraphs that actually answer the question, the model only needs those three paragraphs in context — the rest of the budget goes to reasoning. Answer quality improves at the same time the token bill drops.
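A rough sketch of that retrieval step, under loud assumptions: a production system would compare stored embedding vectors, whereas this example swaps in plain word-count vectors so it runs with nothing but the standard library. The function names and the sample contract chunks are invented for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    # Stand-in for an embedding: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=3):
    # Rank every chunk against the question and keep only the best k.
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Termination: either party may terminate with 60 days written notice.",
    "Payment is due within 30 days of invoice.",
    "Governing law: this agreement is governed by the laws of Ireland.",
    "Confidential information must not be disclosed to third parties.",
]

best = top_k("How much notice is needed to terminate?", chunks, k=1)
print(best[0])
```

Only the chunks returned by `top_k` would be placed in the model's context; everything else in the contract stays out of the prompt.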

What it looks like when you stop doing this

A knowledge layer that sits between your documents and your AI client cleans both problems up at once. Documents get embedded once, on ingest, and the embeddings persist. When a question arrives, only the matching chunks travel into the model’s context. The same document can answer ten thousand questions and the embedding cost is paid exactly once.

This is what Laminae does. Buckets manage the storage and the embeddings; MCP exposes them on an open standard so any AI client connects without a custom integration. Your team keeps using the AI client they already like. The repeated upload pattern just quietly stops being the way work gets done.
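For a feel of the arithmetic, here is a back-of-the-envelope sketch. Every number in it is an illustrative assumption rather than a measurement or anyone's pricing, and it deliberately ignores the tokens in the questions and answers themselves, as well as any price difference between embedding tokens and prompt tokens.

```python
def tokens_reupload(doc_tokens, questions):
    # The whole document is processed again for every question.
    return doc_tokens * questions


def tokens_with_layer(doc_tokens, questions, chunk_tokens, k):
    # The document is processed once; each question carries only k chunks.
    return doc_tokens + questions * chunk_tokens * k


# Illustrative assumptions only.
doc_tokens = 100_000   # one long document
questions = 10_000     # questions asked of it over its lifetime

reupload = tokens_reupload(doc_tokens, questions)
layered = tokens_with_layer(doc_tokens, questions, chunk_tokens=300, k=3)

print(reupload)  # 1000000000
print(layered)   # 9100000
```

Under these assumed figures the gap is roughly two orders of magnitude, and it widens as documents get longer or get asked about more often.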

The cost story isn’t the only argument for building a knowledge layer — provenance and auditability are arguably bigger ones — but it may be the easiest one to explain to whoever signs off on the AI budget.
