Posting some empirical measurements that might be useful to others working on RAG / agentic systems.
Setup: 100 URLs across 5 categories (news, ecommerce, docs, social, SaaS marketing), 20 each. Two extractors run in parallel per URL: (a) naive HTML-to-text, representing what most agents currently consume; (b) structural extraction using semantic HTML tags, text density per DOM subtree, and link density. Token counts are from tiktoken's cl100k_base encoding.
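For readers unfamiliar with the text-density / link-density idea, here is a minimal stdlib-only sketch. It is not the code in the repo: the `Node`, `DensityParser`, `score`, and `main_content` names, the candidate tag set, and the scoring formula (characters per element, discounted by link density) are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style", "noscript"}            # text in here is never content
CANDIDATES = {"article", "main", "section", "div", "body"}  # assumed container set


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text_chars = 0   # visible text characters in this subtree
        self.link_chars = 0   # of those, characters inside <a> elements
        self.tag_count = 1    # elements in this subtree, including itself


class DensityParser(HTMLParser):
    """Builds a light DOM tree and rolls text/link counts up to ancestors."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Tolerate unclosed children: pop until the matching element closes.
        if any(n.tag == tag for n in self.stack[1:]):
            while self._pop().tag != tag:
                pass

    def handle_data(self, data):
        open_tags = {n.tag for n in self.stack}
        if open_tags & SKIP:
            return
        n = len(data.strip())
        self.stack[-1].text_chars += n
        if "a" in open_tags:
            self.stack[-1].link_chars += n

    def _pop(self):
        node = self.stack.pop()
        parent = self.stack[-1]
        parent.text_chars += node.text_chars
        parent.link_chars += node.link_chars
        parent.tag_count += node.tag_count
        return node

    def close(self):
        super().close()
        while len(self.stack) > 1:   # roll up anything left unclosed
            self._pop()


def link_density(node):
    return node.link_chars / node.text_chars if node.text_chars else 0.0


def score(node):
    # Text density (chars per element), discounted by the share that is link text.
    return (node.text_chars / node.tag_count) * (1.0 - link_density(node))


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def main_content(html):
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    candidates = [n for n in iter_nodes(parser.root) if n.tag in CANDIDATES]
    return max(candidates, key=score, default=None)


SAMPLE = """<html><body>
<nav><a href="/">Home</a><a href="/about">About</a><a href="/contact">Contact</a></nav>
<article><h1>Token budgets</h1>
<p>Most of a typical page is navigation, scripts, and repeated chrome.</p>
<p>The article body is a small, dense subtree with few links.</p></article>
<footer><a href="/privacy">Privacy</a><a href="/terms">Terms</a></footer>
</body></html>"""

best = main_content(SAMPLE)
print(best.tag, round(score(best), 1))
```

On this sample the navigation and footer subtrees are pure link text, so they score zero, while the `<article>` subtree wins on characters per element. Real pages need more than this (class/id hints, sibling merging, fallback rules), which is where the Readability / Trafilatura lineage comes in.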
Results: 83/100 pages were accessible (the other 17 returned 403 to non-browser User-Agents). Mean token reduction across the 83: 71.5%. Distribution by category:
- News: 65.5% (n=18, σ similar to mean)
- E-commerce: 62.5% (n=12, 8 sites bot-blocked)
- Docs: 46.3% (n=18)
- SaaS: 45.9% (n=20)
- Social: 30.7% (n=15, dragged by Reddit serving near-empty pages)

Validation via LLM-as-judge (qwen2.5:7b, local, free):
- Content Preservation Score: 77.7 / 100 mean
- Answer Quality Delta on category-relevant questions: 26 sentinel-better / 31 ties / 26 baseline-better
The tied AQD distribution is the more honest finding: heuristic extraction doesn't reliably improve answer quality, but it doesn't degrade it either, while consuming 71.5% fewer tokens. In other words, equivalent quality at ~28.5% of the token cost.
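A note on aggregation, since it affects how to read the token numbers: a mean of per-page reductions and a reduction pooled over total tokens can differ a lot when a few pages are very large. The sketch below shows both, plus the n-weighted mean of the category figures listed above. The `pages` token counts are made up for illustration and are not benchmark data; the function names are hypothetical. The n-weighted category mean comes out to roughly 49.9%, which is below the 71.5% headline, so the headline is presumably a pooled figure. That is an inference from the published numbers, and the per-URL CSV is the place to check it.

```python
def reduction(baseline_tokens, extracted_tokens):
    """Fraction of tokens removed relative to the naive baseline."""
    return 1.0 - extracted_tokens / baseline_tokens


def mean_of_ratios(pages):
    """Average the per-page reductions; every page counts equally."""
    return sum(reduction(b, e) for b, e in pages) / len(pages)


def pooled_reduction(pages):
    """Reduction over summed tokens; large pages dominate."""
    total_baseline = sum(b for b, _ in pages)
    total_extracted = sum(e for _, e in pages)
    return reduction(total_baseline, total_extracted)


def weighted_mean(rows):
    """n-weighted mean of (mean, n) pairs."""
    return sum(m * n for m, n in rows) / sum(n for _, n in rows)


# Illustrative (baseline, extracted) token counts, NOT from the benchmark.
pages = [(20000, 2000), (1000, 800), (1500, 1200)]

# Category means and sample sizes as reported in this post.
categories = {
    "News": (65.5, 18),
    "E-commerce": (62.5, 12),
    "Docs": (46.3, 18),
    "SaaS": (45.9, 20),
    "Social": (30.7, 15),
}

print(f"mean of per-page reductions: {mean_of_ratios(pages):.1%}")
print(f"pooled reduction:            {pooled_reduction(pages):.1%}")
print(f"n-weighted category mean:    {weighted_mean(categories.values()):.1f}%")
```

With the illustrative pages, one large page with a 90% reduction pulls the pooled figure far above the per-page mean, which is the kind of gap that can separate a headline number from the category breakdown.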
One side finding worth flagging: When I ran the same measurement as a session-level A/B inside Claude Code (Anthropic's CLI), token costs were near-identical with and without my tool. The per-model breakdown from /cost showed that Claude Code routes WebFetch through Haiku as an internal compression step before passing to the main model. This is undocumented. Implication: if you're benchmarking RAG/extraction tools using Claude Code as the harness, your numbers reflect Anthropic's compression layer plus your tool, not your tool alone. Worth knowing.
Repo (code, methodology, per-URL CSV): https://github.com/iOptimizeThings/sentinel
The extraction algorithm itself is not novel; it draws on the Mozilla Readability / Trafilatura lineage. The contributions here are (1) a reproducible measurement methodology against a curated benchmark set, (2) a structured output format optimized for agent consumption rather than human reading, and (3) LLM-as-judge validation of semantic preservation.
Open to feedback on the methodology, especially the AQD setup, which is the weakest part: a single category-level question per page is coarse.