Most agent projects I see talk about the model first and the docs second. We ended up doing the opposite, and it’s the only reason our 8B-parameter agent actually works in production.
We’re building a bounded-domain data agent (Oracle Forge) that handles multi-DB routing, join-key correction, and unstructured text across PostgreSQL and MongoDB. Early on, we realized that "retrieving and hoping" with RAG was a recipe for silent failures.
Instead of moving to a 70B model or fine-tuning, we decided to make our Knowledge Base (KB) testable.
The Experiment:
We wrote 21 KB documents covering schemas, join rules, and domain terms. To verify them, we ran a "Unit Test" for each doc:
Load only one doc into a fresh Llama 3.1 8B session (via Groq).
Ask a verification question that requires specific extraction from that doc.
If the model misses more than 30% of the keywords, the doc fails.
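The three-step test above can be sketched roughly like this. The model call is deliberately left out (plug in your Groq / OpenAI-compatible client); the keyword-scoring logic is the part shown, and the example answer and keywords are made up for illustration:

```python
# Minimal sketch of the per-doc "injection test": load one doc into a
# fresh session, ask its verification question, then score the answer
# against the doc's expected keywords.

def keyword_recall(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the model's answer."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords)

def doc_passes(answer: str, expected_keywords: list[str],
               max_miss: float = 0.30) -> bool:
    """Fail the doc if the model misses more than 30% of its keywords."""
    return keyword_recall(answer, expected_keywords) >= 1.0 - max_miss

# Illustrative check (answer and keyword list are invented):
answer = "Route sales queries to PostgreSQL and join on customer_id."
assert doc_passes(answer, ["PostgreSQL", "customer_id", "sales"])
```

Missing 1 keyword out of 3 is a ~33% miss rate, so it fails under the 30% threshold; that is the intended strictness.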
The Result:
Our first pass was a disaster (~60% pass rate). After 13 iterations of refactoring the documentation itself, we hit 21/21 (100% pass rate) on the 8B model.
What actually moved the needle (The "Context Engineering" Patterns):
Tables > Prose: We converted every paragraph of "data" into Markdown tables. The attention mechanism in 8B models seems to "see" tabular data significantly better than buried prose.
Information Front-Loading: We moved the "Action Path" (how to do it) to the first 30% of the doc and pushed the "Why it matters" to the bottom.
Embedded Q&A: We baked a verification question/answer pair into the end of every doc. It primes the model on what it’s supposed to extract.
Keyword Redundancy: If the model needs to output a specific string, that string appears in the header, the body, and the footer.
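One way to enforce these four patterns mechanically is a small doc linter. This is a sketch under assumptions: the `## Action Path` heading, the `Q:`/`A:` markers, and the three-occurrence redundancy rule are my illustrative conventions, not Oracle Forge's actual doc format.

```python
# Hypothetical lint for the four patterns: front-loaded action path,
# at least one Markdown table, an embedded Q&A pair, and keyword
# redundancy (each required keyword appears at least 3 times).

def lint_kb_doc(doc: str, required_keywords: list[str]) -> list[str]:
    problems = []
    lines = doc.splitlines()
    # Front-loading: the action section must start in the first 30% of lines.
    head = lines[: max(1, int(len(lines) * 0.30))]
    if not any("## Action Path" in line for line in head):
        problems.append("action path not in first 30% of doc")
    # Tables > prose: require at least one Markdown table separator row.
    if not any(set(line.strip()) <= set("|-: ") and "|" in line and "-" in line
               for line in lines):
        problems.append("no Markdown table found")
    # Embedded Q&A pair somewhere in the doc (expected at the end).
    if "Q:" not in doc or "A:" not in doc:
        problems.append("missing embedded Q&A pair")
    # Keyword redundancy: header + body + footer => at least 3 occurrences.
    for kw in required_keywords:
        if doc.count(kw) < 3:
            problems.append(f"keyword '{kw}' appears fewer than 3 times")
    return problems

# A toy doc (invented content) that satisfies all four checks:
good_doc = """## Action Path: join orders on customer_id

| table  | join key    |
|--------|-------------|
| orders | customer_id |

Why it matters: silent join failures corrupt downstream reports.

Q: Which key joins orders?
A: customer_id
"""
assert lint_kb_doc(good_doc, ["customer_id"]) == []
```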
My Takeaway:
A Knowledge Base is not "documentation." It is part of the runtime. If an 8B model cannot extract the right answer from a doc when it’s the only thing in context, your system is fragile before it even touches a database.
For bounded agents where you know the schemas and failure modes upfront, "Injection Testing" your docs beats "Retrieving and Hoping" every time.
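Treating KB edits like code changes then just means re-running the whole suite on every edit. A rough sketch, where the case entries and the stubbed model are invented for illustration (in practice each call would be a fresh single-doc session):

```python
# Hypothetical regression runner over all KB docs: each case bundles a
# doc id, its verification question, and the keywords the answer must
# contain. A doc edit that breaks extraction now fails like a unit test.

def run_suite(cases, ask_model, max_miss=0.30):
    failures = []
    for doc_id, question, keywords in cases:
        answer = ask_model(doc_id, question).lower()
        missed = [kw for kw in keywords if kw.lower() not in answer]
        if len(missed) / len(keywords) > max_miss:
            failures.append((doc_id, missed))
    print(f"{len(cases) - len(failures)}/{len(cases)} docs passed")
    return failures

# Stubbed model for illustration -- replace with a real single-doc call.
def fake_ask_model(doc_id, question):
    return {"join_rules.md": "Join orders to customers on customer_id.",
            "schemas.md": "Orders live in MongoDB."}[doc_id]

cases = [("join_rules.md", "What key joins orders to customers?",
          ["customer_id"]),
         ("schemas.md", "Which DB holds orders?",
          ["PostgreSQL", "orders"])]
failures = run_suite(cases, fake_ask_model)  # schemas.md misses "PostgreSQL"
```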
I'm curious how the r/LocalLlama community handles this:
Do you unit-test your KB documents before wiring up the agent?
Do you treat KB edits like code changes with regression tests?
Or do you only discover doc quality problems after the model starts hallucinating in production?