We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local

LDR maintainer here. Thanks to the strong support of the r/LocalLLaMA community, LDR has come very far. I haven't reported in a while because I didn't feel ready for another prominent post in one of the leading outlets of local LLM research.

But I think the LDR community is finally there again, and it is time to report.

Setup

  • RTX 3090, 24GB
  • Ollama backend (qwen3.6:27b)
  • LDR's langgraph_agent strategy — LangChain create_agent() with tool-calling, parallel subtopic decomposition, up to 50 iterations
  • LLM grader: qwen3.6:27b self-graded (I used Opus to review examples, and it generally only underestimates accuracy)
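To make the strategy concrete: below is a hypothetical minimal sketch of the tool-calling loop pattern that the langgraph_agent strategy uses. LDR's real implementation goes through LangChain's create_agent() against a live Ollama backend; here the model and the search tool are stubs (`fake_model`, `search` are illustrative names, not LDR's API), so the structure of the loop is visible without any server.

```python
# Hypothetical minimal sketch of the agent loop: the model repeatedly emits
# tool calls (here a stubbed web-search tool) until it produces a final
# answer or hits the iteration cap.
from typing import Callable

MAX_ITERATIONS = 50  # LDR caps the loop at 50 iterations

def search(query: str) -> str:
    """Stub web-search tool; a real run would hit an actual search backend."""
    return f"results for: {query}"

def run_agent(model: Callable[[list[dict]], dict], question: str) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_ITERATIONS):
        reply = model(messages)
        if reply.get("tool_call"):   # model wants more evidence
            observation = search(reply["tool_call"])
            messages.append({"role": "tool", "content": observation})
        else:                        # model is done
            return reply["answer"]
    return "max iterations reached"

# Stub model: search once, then answer from the observation.
def fake_model(messages: list[dict]) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": messages[0]["content"]}
    return {"answer": messages[-1]["content"]}

print(run_agent(fake_model, "capital of France"))
# prints "results for: capital of France"
```

The real strategy additionally decomposes the question into subtopics and runs subagents in parallel, but each subagent follows this same call-tool-observe-repeat shape.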

Benchmarks (fully local LLM with web search)

| Model | SimpleQA | xbench-DeepSearch |
|---|---|---|
| Qwen3.6-27B | 95.7% (287/300) | 77.0% (77/100) |
| Qwen3.5-9B | 91.2% (182/200) | 59.0% (59/100) |
| gpt-oss-20B | 85.4% (295/346) | n/a |

The sample sizes are small, but the benchmarks were run only once (not cherry-picked from multiple reruns), and the other rows show this is unlikely to be just chance. Full leaderboard: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks
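Since sampling error comes up in the caveats below, it can be quantified directly from the counts in the table. A short sketch using the standard Wilson score interval (my own addition, not part of LDR's benchmark tooling): for 287/300, the 95% interval is roughly 92.7% to 97.5%.

```python
# 95% Wilson score intervals for the reported benchmark accuracies.
# Counts are taken from the table above (287/300, 182/200, 295/346).
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

for name, k, n in [("Qwen3.6-27B", 287, 300),
                   ("Qwen3.5-9B", 182, 200),
                   ("gpt-oss-20B", 295, 346)]:
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k/n:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

Even at the lower end of the interval, the 27B result stays above the 90% threshold mentioned below.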

Important framing — these are agent + search scores, not closed-book

Note, though, that these are similar results to Perplexity Deep Research (93.9%), Tavily (93.3%), etc. [Tavily forces the LLM to answer only from retrieved docs (a pure retrieval test); Perplexity Deep Research is an end-to-end agent and discloses no grader or sample size.]

Even if our results were only 90%, it would already be a great success.

I can also confirm from daily use that these results feel consistent with the performance I see on the random queries I run for everyday questions.

Caveats:

  • SimpleQA contamination risk on newer base models is real
  • LLM-judge noise + sampling error
  • xbench-DeepSearch is in Chinese, which is an advantage for the Chinese Qwen models
  • No BrowseComp / GAIA numbers yet. I also don't believe we are good at those benchmarks yet; I will have to run them to verify the current state

The thing that surprised me:

Results seem to track tool-calling quality more than raw size for local deep research. The langgraph_agent strategy hammers the model with multi-iteration tool calls, parallel subagent decomposition, and structured output — exactly the axis where the newer Qwen generations have improved most. Hypothesis only; if anyone wants to design an ablation we'd love the data.

Some cool LDR features that I want to additionally highlight:

  • Journal Quality System (shipped in v1.6.0): academic source grading using OpenAlex and DOAJ. I haven't seen this anywhere else in the open-source deep-research space.
  • Per-user SQLCipher AES-256 DB (PBKDF2-HMAC-SHA512, 256k iterations) — admins can't read your data at rest. No password recovery; we don't hold the keys.
  • Zero telemetry: no analytics, no tracking.
  • Cosign-signed Docker images with SLSA provenance + SBOMs.
  • MIT licensed; everything is open source.
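The per-user encryption point above is worth unpacking, because it also explains the "no password recovery" bit. A sketch of the key-derivation step (parameter names and salt handling here are illustrative; see LDR's source for the exact scheme): PBKDF2-HMAC-SHA512 with 256,000 iterations turns the user's password into a 256-bit key for SQLCipher's AES-256.

```python
# Sketch of password-to-key derivation as described above:
# PBKDF2-HMAC-SHA512, 256k iterations, 256-bit output key.
import hashlib
import os

ITERATIONS = 256_000

def derive_db_key(password: str, salt: bytes) -> bytes:
    """Derive a 256-bit SQLCipher key from a user password."""
    return hashlib.pbkdf2_hmac("sha512", password.encode(), salt,
                               ITERATIONS, dklen=32)  # 32 bytes = 256 bits

salt = os.urandom(16)  # a per-user salt would be stored alongside the DB
key = derive_db_key("correct horse battery staple", salt)
print(len(key))  # 32
```

Because the server only ever sees the derived key (and never stores the password), an admin cannot decrypt the database at rest, and a forgotten password genuinely cannot be recovered.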

Repo: https://github.com/LearningCircuit/local-deep-research

Happy to share strategy configs and help reproduce the Qwen runs.

Thanks to all the academic and other open source foundational work that made this repo possible.

submitted by /u/ComplexIt
