Parent-Child Document Architecture in RAG: Why Flat Chunking Isn’t Enough

This is Part 4 of a 5-part series on building a production-grade RAG system.

There’s a fundamental tension in RAG chunking: small chunks retrieve better, large chunks read better.

Small chunks are precise. A 100-word window that describes exactly when to add garlic to a veggie burger recipe will rank highly for “when do I add garlic.” But if you feed that 100-word chunk to an LLM, it has almost no surrounding context — no recipe title, no prior steps, no serving suggestions.

Large chunks give the LLM rich context. But they embed poorly, because a single 768-dim vector has to represent too many different concepts at once. A 2000-character recipe chunk will retrieve at moderate similarity for almost everything food-related, and at excellent similarity for nothing specific.

The solution is parent-child architecture: index small children for precise retrieval, but return large parents for generation.

The Two Levels of Chunking

This pipeline uses two distinct chunking stages. Both are covered in depth in Part 2; here’s how they map to the parent-child hierarchy.

Parent chunks are created with fixed character chunking:

parent_chunks = read_documents_in_chunks(
    documents,
    chunk_size=2000,
    chunk_overlap=400,
)

A 2000-character chunk of a recipe file typically covers one major section: all the ingredients, or the complete instructions, or the storage and reheating guide. These chunks are large enough to give an LLM meaningful context.
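The read_documents_in_chunks helper itself lives in Part 2. As a rough sketch of what fixed character chunking with overlap does — assuming each document is a dict carrying its ID, name, and raw text, which may differ from the actual loader — it could look like:

```python
def read_documents_in_chunks(documents, chunk_size=2000, chunk_overlap=400):
    """Yield fixed-size character chunks with overlap from each document.

    Assumes each document is a dict with "document_id", "document_name",
    and "text" keys -- adjust to match the loader from Part 2.
    """
    step = chunk_size - chunk_overlap  # how far each chunk's start advances
    for doc in documents:
        text = doc["text"]
        chunk_number = 0
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size]
            if not piece.strip():
                continue  # skip whitespace-only tails
            yield {
                "document_id": doc["document_id"],
                "document_name": doc["document_name"],
                "chunk_number": chunk_number,
                "chunk": piece,
            }
            chunk_number += 1
```

With chunk_size=2000 and chunk_overlap=400, consecutive parents share a 400-character seam, so a section boundary that falls near a chunk edge still appears intact in at least one parent.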

Child chunks are created inside each parent via sliding windows + semantic merging:

# Split parent chunk into word-level windows
sw = sliding_windows(chunk["chunk"].split(), window_size=200, overlap=40)
# Embed all windows via Ollama nomic-embed-text
embeddings = await get_embeddings(tuple(sw))
# Merge semantically adjacent windows into children
merged_chunks = semantic_merge(sw, list(embeddings), threshold=0.60)

The merged children are typically 150–400 words each — small enough for 768-dim dense vectors to carry precise meaning, large enough to not be mere sentence fragments.
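Both helpers are covered in Part 2. A simplified sketch — assuming get_embeddings returns one vector per window, and using cosine similarity between adjacent windows as the merge criterion — might look like:

```python
import numpy as np

def sliding_windows(words, window_size=200, overlap=40):
    """Slice a word list into overlapping windows, returned as strings."""
    step = window_size - overlap
    return [
        " ".join(words[i:i + window_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

def semantic_merge(windows, embeddings, threshold=0.60):
    """Greedily merge adjacent windows whose embeddings are similar.

    Sketch only: overlapping words are kept verbatim on merge; the real
    Part 2 helper may deduplicate the overlap region.
    """
    def cos(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    merged = [windows[0]]
    for i in range(1, len(windows)):
        if cos(embeddings[i - 1], embeddings[i]) >= threshold:
            merged[-1] = merged[-1] + " " + windows[i]  # same topic: extend child
        else:
            merged.append(windows[i])  # topic shift: start a new child
    return merged
```

The threshold of 0.60 is the knob: raise it and you get more, smaller children; lower it and neighboring windows collapse into fewer, larger ones.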

The ID Scheme as a Hierarchy

The Pinecone vector IDs encode the entire ancestry:

doc{document_id}_p{parent_chunk_number}_c{child_chunk_number}

For example:

doc19_p0_c0  →  Document 19, Parent chunk 0, Child chunk 0
doc19_p1_c0  →  Document 19, Parent chunk 1, Child chunk 0
doc6_p0_c0   →  Document 6,  Parent chunk 0, Child chunk 0

When a query returns doc19_p0_c0, you immediately know it came from document 19 (veggie_burgers.txt), parent chunk 0, and is the first child of that parent. You can reconstruct the full parent by re-reading document_id=19, chunk_number=0 from the original file — or by fetching all Pinecone vectors matching doc19_p0_*.
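Because the scheme is regular, the ancestry can be recovered from any vector ID with a small parser. This is not part of the pipeline above — it's a sketch of how the retrieval side might decode IDs and build the parent prefix:

```python
import re

# Matches IDs of the form doc{document_id}_p{parent_id}_c{child_id}
ID_PATTERN = re.compile(r"^doc(?P<document_id>\d+)_p(?P<parent_id>\d+)_c(?P<child_id>\d+)$")

def parse_vector_id(vector_id):
    """Split a vector ID like 'doc19_p0_c0' into its ancestry components."""
    m = ID_PATTERN.match(vector_id)
    if m is None:
        raise ValueError(f"not a parent-child vector ID: {vector_id!r}")
    return {k: int(v) for k, v in m.groupdict().items()}

def parent_prefix(document_id, parent_id):
    """ID prefix matching every child of one parent chunk, e.g. 'doc19_p0_c'."""
    return f"doc{document_id}_p{parent_id}_c"
```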

This is how the system surfaces the right recipe file name in retrieval results:

Source: knowledge\veggie_burgers.txt
Hybrid Score: 0.4528
Preview: # Homemade Black Bean Veggie Burgers ## Ingredients...

The source in metadata is set at parent-chunk creation time and inherited by all children, so every retrieved vector traces back to a named document.

The Full Indexing Pipeline

Here is the complete process_and_index_documents function that implements two-level chunking and indexes everything into Pinecone:

async def process_and_index_documents(documents, num_docs_to_process=1):
    """Process documents and index them in Pinecone with hybrid search capabilities."""

    parent_chunks = read_documents_in_chunks(documents, chunk_size=2000, chunk_overlap=400)

    for chunk in parent_chunks:
        print(f"Processing document {chunk['document_id']}, chunk {chunk['chunk_number']}")

        # Create sliding windows from parent chunk
        sw = sliding_windows(chunk["chunk"].split(), window_size=200, overlap=40)
        print(f"Created {len(sw)} sliding windows")

        # Embed windows locally via Ollama for merge decisions
        embeddings = await get_embeddings(tuple(sw))

        # Merge semantically adjacent windows into children
        merged_chunks = semantic_merge(sw, list(embeddings), threshold=0.60)
        print(f"Merged into {len(merged_chunks)} semantic chunks")

        # Embed final child chunks for indexing
        merged_embeddings = await get_embeddings(tuple(merged_chunks))

        # Build Pinecone vectors with full ancestry metadata
        pinecone_vectors = []
        for chunk_number, child_chunk in enumerate(merged_embeddings):
            sparse_values = splade.encode_documents([child_chunk["chunk"]])[0]

            vector = {
                "id": f"doc{chunk['document_id']}_p{chunk['chunk_number']}_c{chunk_number}",
                "values": child_chunk["embedding"].vector,  # 768-dim from Ollama
                "sparse_values": sparse_values,  # SPLADE sparse vector
                "metadata": {
                    "chunk": child_chunk["chunk"],
                    "source": chunk["document_name"],  # from parent
                    "child_id": chunk_number,
                    "parent_id": chunk["chunk_number"],  # parent's position
                    "document_id": chunk["document_id"],
                },
            }
            pinecone_vectors.append(vector)

        index.upsert(namespace=index_name, vectors=pinecone_vectors)
        print(f"Indexed {len(pinecone_vectors)} vectors to Pinecone")

All embedding calls go through Ollama locally — the double embedding pass (once for merge decisions, once for the final children) costs no API tokens and is bounded only by your machine’s throughput. The Ollama client setup is in Part 1.

What Gets Indexed vs What Gets Returned

                      Child Chunks        Parent Chunks
Indexed in Pinecone   ✅ Yes              ❌ No
Used for retrieval    ✅ Yes              ❌ No
Stored as context     ✅ In metadata      Available via source file
Vector size           768 dims (Ollama)   N/A
Text size             ~150–400 words      ~2000 characters
Purpose               Precise matching    Rich generation context

When the agent calls rag_search (see Part 5), it receives the child chunk text directly from metadata["chunk"]. The child text is small enough to be precise, and because it was semantically merged, it's coherent enough to be meaningful on its own.

For use cases requiring even richer context — multi-step reasoning over a full recipe — you could extend the pipeline to re-fetch the parent by querying Pinecone with the parent’s ID pattern doc{n}_p{n}_*, or by re-reading the source file at the known parent_id offset.
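A sketch of that parent re-fetch, assuming the serverless Pinecone client — whose index.list(prefix=...) pages through matching IDs; exact method names and response shapes vary across client versions:

```python
def rebuild_parent_text(child_records):
    """Join child chunk texts back into parent text, ordered by child_id.

    child_records: iterable of metadata dicts as stored at indexing time.
    Note: because children come from overlapping windows, the joined text
    may repeat the overlap; re-reading the source file gives the exact parent.
    """
    ordered = sorted(child_records, key=lambda m: m["child_id"])
    return " ".join(m["chunk"] for m in ordered)

def fetch_parent(index, document_id, parent_id, namespace):
    """Fetch every child of one parent from Pinecone and rebuild its text."""
    prefix = f"doc{document_id}_p{parent_id}_c"
    # list(prefix=...) yields pages of matching vector IDs (serverless client)
    ids = [vid for page in index.list(prefix=prefix, namespace=namespace) for vid in page]
    fetched = index.fetch(ids=ids, namespace=namespace)
    return rebuild_parent_text(v.metadata for v in fetched.vectors.values())
```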

The 144-Vector Index

After running the full pipeline over 20 recipe documents, the Pinecone index contains 144 vectors:

total_vector_count: 144
namespaces: {'recipe-index-ollama-nomic-embed': {'vector_count': 144}}
dimension: 768
metric: dotproduct

That’s an average of 7.2 child vectors per document. For a 20-recipe corpus, this is small. The architecture scales linearly: add 1000 more documents and you get ~7200 more precisely indexed children, each traceable back to its named source, all generated by local Ollama calls with no rate limits.

Parent-Child in Action: The Veggie Burger Query

For the query "How to make veggie burgers?", all four returned chunks come from the same document (veggie_burgers.txt) but from different parent sections:

doc19_p0_c0  →  Parent 0: Ingredients + Instructions              (Hybrid: 0.4528)
doc19_p1_c0  →  Parent 1: Ingredient Flexibility + Texture Guide  (Hybrid: 0.3835)
doc19_p2_c0  →  Parent 2: Serving + Storage + Meal Prep           (Hybrid: 0.3710)
doc19_p3_c0  →  Parent 3: Timing + Nutrition + FAQ                (Hybrid: 0.3482)

Each parent chunk covers a different section of the document. The children retrieve precisely — chunk 0 matches best because it contains the recipe core — but all four give the agent distinct, non-overlapping context.

Without parent-child architecture, a flat chunker might produce a single 2000-character blob covering all these sections. That blob would embed at moderate similarity for most food queries and excellent similarity for none.

What’s Next

Part 5 closes out the series on the agent side: how rag_search consumes these child chunks and feeds them to the model.

Parent-Child Document Architecture in RAG: Why Flat Chunking Isn’t Enough was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
