Am I going about this RAG Perplexity-on-crack Jarvis project the wrong way?

First real LLM project for me, probably same endgame as half the people here: personal Jarvis. But the reason I'm actually building it is bigger than that.

I'm a dad, and the more I mess with commercial LLMs the more worried I get that we're nearing the end of actually source-able information. Misinformation has been rough forever, but I already only really trust a small handful of outlets (AP, Reuters, a couple others), and the idea of some company baking their own agenda into the next model and deciding what counts as true for my kids does not sit right with me.

Started small. Daily digest that only pulls from sources I trust so I stop doom scrolling. Worked better than I expected.

Then I got ambitious. Extended it into a full RAG chatbot, basically Perplexity on crack but pulling only from a corpus I personally curated. Every answer cites back to what I put in, shows a confidence score, lists blind spots, and flags claims the corpus actually contradicts. It's at 2M+ chunks across 14 collections and ~67 download sources now, so it's real. Which is also why the scope problem is getting painful.
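For concreteness, here's roughly the shape of an answer payload with those fields. This is a hypothetical sketch, not my actual schema; all the names (`Citation`, `Answer`, `source_id`, etc.) are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    source_id: str  # which curated document the claim traces back to
    chunk_id: str   # the exact chunk that supports it
    quote: str      # verbatim supporting text

@dataclass
class Answer:
    text: str
    citations: list[Citation]  # every claim cites back into the corpus
    confidence: float          # 0..1, from retrieval scores + cross-source agreement
    blind_spots: list[str]     # topics the corpus just doesn't cover
    contradicted: list[str]    # claims the corpus actively pushes back on
```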

Rigs

  • Unraid box
  • AMD RX 7900 XT 20GB
  • MacBook Pro M3 Max 36GB, retired from the inference role. A 7900 XT was beating it on tok/s for every model I cared about. Unified memory sounds great until you realize the memory bandwidth isn't being used by the thing you want to run

Stack

  • Qdrant for vectors
  • llama-swap + llama.cpp Vulkan on Unraid. Moved off Ollama after catching the same model pass 5/5 JSON extractions on llama.cpp while Ollama failed them. Backend mattered more than the model
  • Interactive chat: qwen3.6 Q3_K_S, ~108 tok/s, 262K ctx
  • Bulk extraction: qwen3.6 IQ3_XXS, ~112 tok/s. Different quants won different benchmarks, so I route by content type. Swap is under a second
  • Embeddings: Qwen3-Embedding-4B Q8, Matryoshka-truncated to 1024d
  • GTE-ModernBERT reranker on CPU
  • Claude Sonnet for the synthesis pass, Opus only for deep mode
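For anyone who hasn't done Matryoshka truncation: because the embedder was trained so that prefixes of the vector are themselves usable embeddings, you just slice the first 1024 dims and L2-renormalize so cosine/dot similarity still behaves. A minimal sketch, assuming the embedder hands back its full native-dimension vector:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dim: int = 1024) -> np.ndarray:
    """Keep the first `dim` dims of a Matryoshka-trained embedding,
    then L2-renormalize so similarity scores stay meaningful."""
    v = vec[:dim].astype(np.float32)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Halves the Qdrant storage and speeds up search vs storing the full vector, at a small retrieval-quality cost.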

Where I'm stuck

Measured production throughput: ~13,500 chunks/hr on the 4B embedder. For the full 7M English Wikipedia pages:

  • Top 2M by pageview rank, dense ingest: ~8 months
  • Tail 5M (~80M chunks): 22 to 36 months on an elastic duty cycle

So I'm staring down 2.5 to 3.5 years for full local Wikipedia. That's already assuming the tail runs background-only.
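The back-of-envelope for those numbers, assuming ~730 hours per month (the duty-cycle values here are my inference from the 22-36 month range, not measured):

```python
RATE = 13_500           # measured chunks/hr on the 4B embedder
HOURS_PER_MONTH = 730

def months(chunks: int, duty: float = 1.0) -> float:
    """Wall-clock months to embed `chunks` at RATE, running `duty` fraction of the time."""
    return chunks / (RATE * duty) / HOURS_PER_MONTH

# Tail 5M pages, ~80M chunks:
# months(80_000_000)            -> ~8.1 months if it ran flat out
# months(80_000_000, duty=0.3)  -> ~27 months background-only, i.e. the 22-36 range
```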

Already tried:

  • 0.6B embedder for the 2x bump. Got 1.91x raw. Quality dropped past my retrieval gate. Rejected
  • Parallel batching (-np 2) on the 0.6B. Got 1.03 to 1.23x over the 4B pipeline. Below my pre-committed 1.4x floor. Rejected
  • Vulkan has no multi-GPU tensor-split, so adding a second AMD card wouldn't give me a unified VRAM pool anyway
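The pre-committed gate behind those reject decisions looks like this. The 1.4x floor is the real number from above; the recall metric and drop tolerance are placeholders for whatever your retrieval gate actually measures:

```python
def accept_embedder(speedup: float, recall_at_10: float, baseline_recall: float,
                    speed_floor: float = 1.4, max_recall_drop: float = 0.02) -> bool:
    """A faster embedder only ships if it clears the speed floor AND
    stays within the retrieval-quality tolerance. Committing to the
    thresholds before benchmarking keeps you honest."""
    fast_enough = speedup >= speed_floor
    good_enough = (baseline_recall - recall_at_10) <= max_recall_drop
    return fast_enough and good_enough

# 0.6B at 1.91x: fast enough, but quality fell past the gate -> rejected
# -np 2 at 1.03-1.23x: below the floor regardless of quality -> rejected
```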

Staying on the 7900 XT; the budget isn't there for hardware moves yet. Maybe eventually I can get a 256GB Mac Studio if they release one and prices aren't too absurd. Trying to figure out what's left on the table in software.

Questions:

  1. Anyone actually chewed through a full ZIM Wikipedia ingest on consumer hardware? Wall clock and embedder? I know there's pre-embedded Wikipedia sets on HF, but none of them carry the extraction layers my pipeline builds on top (claims, entities, contextual headers, provenance), so I'm stuck running it myself.
  2. Any reason not to run 0.6B on the tail 5M and 4B on the top 2M and just accept the quality tier?
  3. Anyone squeezing more out of a single 7900 XT for batch embedding than I am? Already on llama.cpp Vulkan, flash attention off, KV cache quant off (segfaults)
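On question 2, the merge logic for a two-tier corpus could be as simple as this: keep top-2M and tail-5M in separate collections, query both, and discount the 0.6B tier so the dense 4B tier wins near-ties. Everything here is a sketch; the `penalty` knob is invented and would need tuning against your eval set:

```python
def merge_tiers(hits_4b: list, hits_06b: list, penalty: float = 0.9, k: int = 10) -> list:
    """Merge (score, doc) hits from two quality tiers into one ranked list,
    tagging each result with its tier so answers can disclose provenance."""
    scored = [(score, doc, "top-2M") for score, doc in hits_4b]
    scored += [(score * penalty, doc, "tail-5M") for score, doc in hits_06b]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]
```

Since your reranker runs anyway, it would also paper over some of the 0.6B tier's retrieval noise before synthesis sees it.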
submitted by /u/vick2djax