LDR maintainer here. Thanks to the strong support of the r/LocalLLaMA community, LDR has come a long way. I haven't reported in a while because I didn't feel we were ready for another prominent post in one of the leading outlets of local LLM research. I think the LDR community is finally there again, so it's time to report.
Setup
- RTX 3090, 24GB
- Ollama backend (qwen3.6:27b)
- LDR's `langgraph_agent` strategy — LangChain `create_agent()` with tool-calling, parallel subtopic decomposition, up to 50 iterations (minimal sketch below)
- LLM grader: qwen3.6:27b self-graded (I have used Opus to review examples and it generally only underestimates accuracy)
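For context, here's a minimal sketch of the kind of `create_agent()` tool-calling loop the strategy builds on. This is not LDR's actual code: the `web_search` stub, the model string, and the prompt are placeholder assumptions; wire in your own search backend.

```python
# Minimal sketch of a LangChain create_agent() tool-calling loop. The
# web_search stub, model string, and prompt are placeholders, not LDR's code.
from langchain.agents import create_agent
from langchain_core.tools import tool

@tool
def web_search(query: str) -> str:
    """Search the web and return concatenated result snippets for `query`."""
    return "...search results..."  # placeholder: plug in your search backend here

agent = create_agent(
    model="ollama:qwen3.6:27b",  # local model served by Ollama, as in the setup above
    tools=[web_search],
    system_prompt="Decompose the question into subtopics, search each, then answer.",
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Who discovered element 118?"}]}
)
print(result["messages"][-1].content)
```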
Benchmarks (fully local LLM with web search)
| Model | SimpleQA | xbench-DeepSearch |
|---|---|---|
| Qwen3.6-27B | 95.7% (287/300) | 77.0% (77/100) |
| Qwen3.5-9B | 91.0% (182/200) | 59.0% (59/100) |
| gpt-oss-20B | 85.3% (295/346) | – |
Sample sizes are small, but the benchmarks were not rerun and cherry-picked, and the consistency across rows suggests this is unlikely to be just chance (quick sanity check below). Full leaderboard: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks
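For the sanity check, here is the back-of-envelope math: 95% Wilson score intervals for the fractions in the table. This is my own arithmetic for illustration, not part of the leaderboard methodology.

```python
# 95% Wilson score intervals for the benchmark fractions above -- a rough
# sanity check on sampling error, not the leaderboard's methodology.
from math import sqrt

def wilson_95(successes: int, n: int) -> tuple[float, float]:
    z = 1.96  # two-sided 95%
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for name, k, n in [("Qwen3.6-27B", 287, 300),
                   ("Qwen3.5-9B", 182, 200),
                   ("gpt-oss-20B", 295, 346)]:
    lo, hi = wilson_95(k, n)
    print(f"{name}: {k}/{n} -> 95% CI [{lo:.1%}, {hi:.1%}]")
```

Even the lower bound for the 27B SimpleQA run stays above 92%, which is why I don't think this is just sampling luck.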
Important framing — these are agent + search scores, not closed-book
That said, these are similar benchmark results to Perplexity Deep Research (93.9%), Tavily (93.3%), etc. [Tavily forces the LLM to answer only from retrieved docs (a pure retrieval test); Perplexity Deep Research is an end-to-end agent and discloses no grader or sample size.]
Even if our results were only 90%, it would already be a great success. I can also confirm from daily use that these results feel consistent with what I see on the random queries I run for everyday questions.
Caveats:
- SimpleQA contamination risk on newer base models is real
- LLM-judge noise + sampling error
- xbench-DeepSearch is in Chinese, which is an advantage for the Chinese Qwen models
- No BrowseComp / GAIA numbers yet - I also don't believe we do well on those yet; I'll have to run them to verify the current state
The thing that surprised me:
Results seem to track tool-calling quality more than raw model size for local deep research. The langgraph_agent strategy hammers the model with multi-iteration tool calls, parallel subagent decomposition, and structured output — exactly the axis where the newer Qwen generations have improved most. Hypothesis only; if anyone wants to design an ablation we'd love the data (rough sketch below).
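The ablation design I have in mind is simple: same question set, same model, agent+search vs. closed-book. A rough sketch, where `run_ldr_agent` and `run_closed_book` are hypothetical placeholders and exact-match grading stands in for the LLM judge:

```python
# Rough ablation harness sketch. run_ldr_agent / run_closed_book are
# hypothetical placeholders; exact match stands in for the LLM grader.
import json

def run_ldr_agent(model: str, question: str) -> str:
    raise NotImplementedError  # wire up LDR's langgraph_agent strategy here

def run_closed_book(model: str, question: str) -> str:
    raise NotImplementedError  # plain single-turn completion, no tools

def accuracy(model: str, runner, dataset: list[dict]) -> float:
    hits = sum(
        runner(model, row["question"]).strip().lower() == row["answer"].strip().lower()
        for row in dataset
    )
    return hits / len(dataset)

with open("simpleqa_sample.json") as f:  # [{"question": ..., "answer": ...}, ...]
    questions = json.load(f)

for model in ["qwen3.6:27b", "qwen3.5:9b"]:
    for label, runner in [("agent+search", run_ldr_agent), ("closed-book", run_closed_book)]:
        print(model, label, f"{accuracy(model, runner, questions):.1%}")
```

If tool-calling quality is the driver, the agent-vs-closed-book gap should grow faster across model generations than the closed-book score itself.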
Some cool LDR features that I want to additionally highlight:
- Journal Quality System (shipped in v1.6.0) - academic source grading using OpenAlex and DOAJ; I haven't seen this anywhere else in the open-source deep-research space (sketch of the idea after this list)
- Per-user SQLCipher AES-256 DB (PBKDF2-HMAC-SHA512, 256k iterations) — admins can't read your data at rest. No password recovery; we don't hold the keys (see the sketch after this list).
- Zero telemetry: no analytics, no tracking
- Cosign-signed Docker images with SLSA provenance + SBOMs.
- MIT licensed; everything is open source
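On the Journal Quality System: here's a sketch of the underlying idea, grading a journal from OpenAlex metadata plus its DOAJ listing. The letter-grade thresholds and the `grade_journal` helper are invented for illustration; LDR's actual grading is more involved.

```python
# Sketch of journal grading from OpenAlex's public REST API. The letter-grade
# thresholds and this helper are invented for illustration, not LDR's code.
import requests

def grade_journal(issn: str) -> str:
    resp = requests.get(f"https://api.openalex.org/sources/issn:{issn}", timeout=10)
    resp.raise_for_status()
    src = resp.json()
    impact = src["summary_stats"]["2yr_mean_citedness"]  # roughly a 2-year impact factor
    in_doaj = src.get("is_in_doaj", False)               # listed in DOAJ?
    if impact >= 5:
        return "A"
    if impact >= 1 or in_doaj:
        return "B"
    return "C"

print(grade_journal("2041-1723"))  # Nature Communications
```

And on the encrypted per-user DB: a minimal sketch of opening a SQLCipher database with the stated KDF parameters. Assumptions: the `sqlcipher3` package and the passphrase handling here are illustrative, not LDR's actual code.

```python
# Minimal sketch: open a SQLCipher DB with PBKDF2-HMAC-SHA512 at 256k
# iterations (SQLCipher 4 defaults, pinned explicitly). Not LDR's code.
import sqlcipher3  # e.g. pip install sqlcipher3-binary

def open_user_db(path: str, passphrase: str):
    conn = sqlcipher3.connect(path)
    # PRAGMAs don't support parameter binding, so quote the passphrase manually.
    quoted = passphrase.replace("'", "''")
    conn.execute(f"PRAGMA key = '{quoted}'")
    conn.execute("PRAGMA cipher_kdf_algorithm = PBKDF2_HMAC_SHA512")
    conn.execute("PRAGMA kdf_iter = 256000")
    # The first real query fails if the passphrase is wrong; there is no recovery path.
    conn.execute("SELECT count(*) FROM sqlite_master")
    return conn
```

This is why "no password recovery" is literal: the key is derived from the user's passphrase at open time and never stored server-side.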
Repo: https://github.com/LearningCircuit/local-deep-research
Happy to share strategy configs and to help reproduce the Qwen runs.
Thanks to all the academic and open-source foundational work that made this repo possible.