not exaggerating. 2 months of work. custom proxy rotator, playwright instances, regex hell to strip HTML into something my vector DB wouldn't choke on.
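for anyone curious what i mean by "regex hell": this is roughly the shape of the fix i eventually landed on for the stripping part. a toy sketch using python's stdlib parser instead of regex — names are made up, not my actual pipeline:

```python
# toy html-to-text extractor built on the stdlib parser instead of regex.
# hypothetical sketch, not production code.

from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    p = TextOnly()
    p.feed(html)
    return " ".join(p.parts)

print(html_to_text("<div><script>x=1</script><p>hello <b>world</b></p></div>"))
# -> "hello world"
```

still not bulletproof (malformed markup, iframes, etc.) but at least it doesn't fall over on nesting the way my regexes did.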
site updated their CSS class names. entire pipeline collapsed.
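to make the failure mode concrete: everything was keyed to one selector, so one CSS rename was fatal. a toy sketch of the fallback-chain pattern i wish i'd used from day one — every name here is hypothetical, just illustrating the idea:

```python
# hypothetical fallback-chain extractor: try strategies in priority
# order and degrade instead of collapsing when the page layout changes.

import re

def by_css_class(html):
    # brittle: keyed to a specific class name, dies on a CSS rename
    m = re.search(r'class="price"[^>]*>([^<]+)<', html)
    return m.group(1) if m else None

def by_data_attr(html):
    # sturdier: data-* attributes tend to outlive styling classes
    m = re.search(r'data-price="([^"]+)"', html)
    return m.group(1) if m else None

def extract(html, strategies):
    for fn in strategies:
        value = fn(html)
        if value is not None:
            return value
    raise ValueError("all strategies failed; layout changed again")

old_page = '<span class="price" data-price="9.99">9.99</span>'
new_page = '<span class="p-x7" data-price="9.99">9.99</span>'  # CSS renamed

print(extract(old_page, [by_css_class, by_data_attr]))  # 9.99, via class
print(extract(new_page, [by_css_class, by_data_attr]))  # 9.99, via data attr
```

doesn't eliminate the maintenance, just turns "silent collapse" into "one strategy failing while the others hold" — which at least you can alert on.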
i sat there staring at broken JSON for an hour before i fully processed that this is just... always going to happen. there's no version of this where i maintain a custom scraper long term and it doesn't become a part time job.
how are people building AI products on top of web data without this being a constant maintenance nightmare? genuinely asking. is there an extraction layer that handles the fragility so i don't have to?