I thought I was being smart building an AI competitor analysis tool.
I hooked up Puppeteer to scrape pricing pages, but I didn't realize the target sites had updated their bot protection. My scraper got stuck in an infinite Cloudflare Turnstile captcha loop.
Instead of crashing, my script just kept feeding the bot-challenge HTML back into Claude/OpenAI to "parse the pricing data." It ran all night, burning millions of tokens on literal garbage HTML. Woke up to a catastrophic Stripe receipt.
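In hindsight, a guard like this would probably have stopped the bleeding. It's only a rough sketch: the marker strings are my guess at what a Cloudflare challenge page usually contains, the 4-characters-per-token estimate is a crude heuristic, and `safe_parse`, `TokenBudget`, and `call_llm` are hypothetical names, not part of any real SDK.

```python
# Assumed markers of a Cloudflare/Turnstile challenge page (not exhaustive).
CHALLENGE_MARKERS = (
    "challenges.cloudflare.com",
    "cf-turnstile",
    "just a moment...",
)


class BudgetExceeded(RuntimeError):
    """Raised when the run would go over its token budget."""


def looks_like_challenge(html: str) -> bool:
    """Return True if the HTML contains any assumed challenge marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


class TokenBudget:
    """Hard cap on estimated tokens sent to the model in one run."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, text: str) -> int:
        # Crude estimate: roughly 4 characters per token (assumption).
        estimate = len(text) // 4 + 1
        if self.used + estimate > self.max_tokens:
            raise BudgetExceeded(
                f"would use {self.used + estimate} of {self.max_tokens} tokens"
            )
        self.used += estimate
        return estimate


def safe_parse(html: str, budget: TokenBudget, call_llm):
    """Skip challenge pages, enforce the budget, then call the model.

    `call_llm` is whatever function actually sends text to the model.
    """
    if looks_like_challenge(html):
        return None  # never pay to "parse" a captcha page
    budget.charge(html)
    return call_llm(html)
```

The idea is that the script fails loudly (or skips the page) before any tokens are spent, instead of looping all night. A cap on retries per URL would be the obvious next thing to add.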
I am never managing headless browsers again. How are you guys safely extracting clean text from modern sites without risking a token-burn like this? Please tell me there’s an API that just handles this safely.