I've spent my entire summer building the ultimate web extraction layer for my AI agent.
I built a custom proxy rotator. I set up headless Playwright instances. I wrote hundreds of lines of fragile regex to strip out HTML tags and inline CSS just so my vector database wouldn't choke on the garbage data.
I was so proud of it... until I realized how completely unmaintainable it is. Every time a target site updates its UI, my parser breaks, and my proxies keep getting banned.
Tell me I'm not the only one who wasted months reinventing the wheel. What off-the-shelf tools are you guys using to just pass a URL and get clean JSON/Markdown back?