I no longer need a cloud LLM to do quick web research

This might be super old news to some people, but I only just recently started using local models, since they've only now reached my quality bar. I just want to share the setup I have for local web searching and scraping.

I use Qwen3.5:27B-Q3_K_M on an RTX 4090 with a context length of ~200,000 tokens. I get ~40 tok/s and use about 22 GB of VRAM.
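For anyone replicating this, a setup like that would typically be launched with llama.cpp's `llama-server`. A sketch of what the invocation might look like (the model path and exact flag values are my assumptions, not from the post):

```shell
# Hypothetical launch command -- adjust the model path for your machine.
# -c sets the context window, -ngl offloads all layers to the GPU,
# --jinja enables the chat template handling needed for tool calling.
llama-server -m ./Qwen3.5-27B-Q3_K_M.gguf -c 200000 -ngl 99 --jinja --port 8080
```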

I use it through the llama.cpp Web UI with MCP tools enabled. Here are the tools I've given it for web search and scraping:

""" webmcp - MCP server for web scraping and content extraction """ import asyncio import json import logging import os import re import time from contextlib import contextmanager from datetime import datetime, timezone from pathlib import Path from typing import Any import httpx from ddgs import DDGS from markdownify import markdownify as md from mcp.server.fastmcp import FastMCP from mcp.server.transport_security import TransportSecuritySettings from playwright.async_api import async_playwright from readability import Document as ReadabilityDocument from starlette.middleware.cors import CORSMiddleware # ============================================================================ # Configuration # ============================================================================ logger = logging.getLogger(__name__) TOOL_CALL_LOG_PATH = os.path.join( os.path.dirname(os.path.abspath(__file__)), "tool_calls.log.json" ) LLM_URL = os.environ.get("LLM_URL", "") LLM_MODEL = os.environ.get("LLM_MODEL", "") if not LLM_URL or not LLM_MODEL: raise ValueError("LLM_URL and LLM_MODEL environment variables are required") # ============================================================================ # Content Processing # ============================================================================ def _html_to_clean(html: str) -> str: """Convert HTML to clean markdown, collapsing excessive whitespace.""" text = md( html, heading_style="ATX", strip=["img", "script", "style", "nav", "footer", "header"] ) text = re.sub(r"\n{3,}", "\n\n", text) text = re.sub(r"[^\S\n]+", " ", text) return text.strip() async def _fetch_one(browser: Any, url: str, timeout_ms: int = 0) -> tuple[str, str]: page = await browser.new_page() await page.set_extra_http_headers({ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" }) try: await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms) await page.wait_for_timeout(2000) html = await page.content() finally: await 
page.close() doc = ReadabilityDocument(html) title = doc.title() clean_text = _html_to_clean(doc.summary()) if len(clean_text) < 50: clean_text = _html_to_clean(html) return title, clean_text async def _fetch_pages(urls: list[str]) -> list[tuple[str, str, str | None]]: async with async_playwright() as p: browser = await p.chromium.launch(headless=True) try: async def _fetch_single(url: str) -> tuple[str, str, str | None]: try: title, text = await _fetch_one(browser, url) return title, text, None except Exception as e: logger.error(f"Failed to fetch {url}: {e}") return "", "", str(e) results = await asyncio.gather(*[_fetch_single(u) for u in urls]) finally: await browser.close() return results async def _fetch_page_light(url: str) -> tuple[str, str]: async with httpx.AsyncClient( timeout=30, follow_redirects=True, verify=False ) as client: resp = await client.get( url, headers={"User-Agent": "Mozilla/5.0"} ) resp.raise_for_status() html = resp.text doc = ReadabilityDocument(html) title = doc.title() clean_text = _html_to_clean(doc.summary()) if len(clean_text) < 50: clean_text = _html_to_clean(html) return title, clean_text async def _llm_extract(content: str, prompt: str | None, schema: dict | None) -> str: system_msg = ( "You are a data extraction assistant. " "Extract the requested information from the provided web page content. " "Be precise and only return the extracted data. Be as detailed as possible " "without including extra information. Do not skimp. " "NEVER return an empty result. If you cannot find the requested data, " "you MUST explain why." 
) if schema: system_msg += f"\n\nReturn the data as JSON matching this schema:\n{json.dumps(schema, indent=2)}" user_msg = content if prompt: user_msg += f"\n\n---\nExtraction request: {prompt}" async with httpx.AsyncClient(timeout=120) as client: resp = await client.post( f"{LLM_URL}/v1/chat/completions", json={ "model": LLM_MODEL, "messages": [ {"role": "system", "content": system_msg}, {"role": "user", "content": user_msg}, ], "temperature": 0.1, "chat_template_kwargs": {"enable_thinking": False}, }, ) resp.raise_for_status() result = resp.json() return result["choices"][0]["message"]["content"] async def _search_ddg(query: str, limit: int) -> list[dict]: results = DDGS().text(query, max_results=limit) return [ { "title": r.get("title", ""), "url": r.get("href", ""), "description": r.get("body", ""), } for r in results ] 

I used Opus 4.6 to code these tools, modeled on Firecrawl's. The search ends up being essentially free: aside from the DuckDuckGo queries themselves, no external APIs are hit, so I can do as much AI research as I want with my electricity bill as the only limit. I have my extract tool hitting a separate 9B variant of Qwen3.5 on another rig with a 1080 Ti, but you can obviously point it at whatever you want.
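The extraction endpoint is chosen purely through the `LLM_URL` and `LLM_MODEL` environment variables the script reads at startup, so pointing it at a second box looks something like this (the host, port, and model name here are made up):

```shell
# Hypothetical address and model name -- substitute your own rig's.
export LLM_URL="http://192.168.1.42:8080"   # llama.cpp server on the second machine
export LLM_MODEL="qwen3.5-9b"
python webmcp.py
```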

These tools are good, but on their own they still produced mostly misinformation, with little effort put into verification or follow-up research. I have always liked the way Claude searches the web, so I had Opus 4.6 write a system prompt based on its own instructions and tendencies, and it immediately improved the quality and accuracy of the results enormously. It's now roughly on the level of Opus 4.6 (in my experience), with the one caveat that it sometimes leaves things out because it doesn't do enough research to cover enough ground. Here is the prompt I use:

```
You are a friendly assistant.

=== CRITICAL: DATE AWARENESS ===

Before your FIRST search in any conversation, call get_current_date. This is
mandatory — do not skip it.

The date returned by get_current_date is the real, actual current date. You may
encounter search results with dates that feel "in the future" relative to your
training data. This is expected and normal. These results are real.

Do not:
- Flag current-year dates as errors or typos
- Say "this date appears incorrect" or "this seems to be from the future"
- Assume articles dated after your training cutoff are fake or simulated
- "Correct" accurate dates to older ones

If a search result is dated 2026 and get_current_date confirms it is 2026, the
result is current — trust it.

=== RESEARCH METHODOLOGY ===

Follow this workflow for every research query. Do not skip steps.

STEP 1: ESTABLISH DATE
- Call get_current_date if you haven't already this session.

STEP 2: SEARCH BROADLY FIRST
- Run your initial search.
- Read the results. Note what claims are being made and by whom.
- DO NOT form conclusions yet.

STEP 3: VERIFY AND FILL GAPS
- If the story involves someone making a statement or response, search
  specifically for that statement.
- If multiple people or entities are named, search for each one.
- If a quote is circulating, search for its original source.
- Extract full article content when headlines alone are ambiguous.

MINIMUM EXTRACTION RULE: If you use the extract tool once, you must use it at
least one more time on a different source.

STEP 4: SYNTHESIZE
- Only now form your answer.
- If sources conflict, say so.
- If you could not find evidence, say that explicitly.

=== TRUST HIERARCHY ===

TIER 1 — HIGH TRUST:
- Major outlets (AP, Reuters, NYT, BBC, etc.)
- Official statements
- Multiple independent confirmations

TIER 2 — MODERATE TRUST:
- Single-source reporting
- Social media posts
- Regional outlets

TIER 3 — LOW TRUST:
- Viral screenshots
- Parody accounts
- Unverified quotes
- Aggregators
- Forums

=== COMMON FAILURE MODES — AVOID THESE ===

1. CONFIDENT DENIAL WITHOUT EVIDENCE
2. "CORRECTING" ACCURATE INFORMATION
3. PREMATURE CONCLUSIONS
4. DATE SKEPTICISM
5. OVER-HEDGING
6. TREATING VIRAL CONTENT AS CONFIRMED

=== GENERAL REASONING PRINCIPLES ===

- Think before pattern-matching
- "I don't know" is valid
- Distinguish source vs reasoning
- Update when contradicted
- Precision > fluency
- Match confidence to evidence
- Don't over-structure answers
- Separate facts from opinions
- Names/numbers/dates must be correct
- Answer the actual question

=== RESPONSE FORMAT ===

- Lead with strongest facts
- Separate confirmed vs unverified
- State disagreements clearly
- Attribute sources
- Note debunking when relevant
- No "as an AI" disclaimers

=== SELF-CHECK BEFORE RESPONDING ===

- Did I call get_current_date?
- Did I verify negative claims?
- Am I contradicting multiple sources?
- Did I validate dates?
- Did I trace quotes?
- Would this hold up if tested?
```
submitted by /u/BitPsychological2767