not exaggerating. 2 months of work. custom proxy rotator, playwright instances, regex hell to strip HTML into something my vector DB wouldn't choke on.
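for anyone curious what i mean by "regex hell": this is roughly the shape of the fix i eventually landed on for the stripping part. a toy sketch using python's stdlib parser instead of regex — names are made up, not my actual pipeline:

```python
# toy html-to-text extractor built on the stdlib parser instead of regex.
# hypothetical sketch, not production code.

from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    p = TextOnly()
    p.feed(html)
    return " ".join(p.parts)

print(html_to_text("<div><script>x=1</script><p>hello <b>world</b></p></div>"))
# -> "hello world"
```

still not bulletproof (malformed markup, iframes, etc.) but at least it doesn't fall over on nesting the way my regexes did.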
site updated their CSS class names. entire pipeline collapsed.
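to make the failure mode concrete: everything was keyed to one selector, so one CSS rename was fatal. a toy sketch of the fallback-chain pattern i wish i'd used from day one — every name here is hypothetical, just illustrating the idea:

```python
# hypothetical fallback-chain extractor: try strategies in priority
# order and degrade instead of collapsing when the page layout changes.

import re

def by_css_class(html):
    # brittle: keyed to a specific class name, dies on a CSS rename
    m = re.search(r'class="price"[^>]*>([^<]+)<', html)
    return m.group(1) if m else None

def by_data_attr(html):
    # sturdier: data-* attributes tend to outlive styling classes
    m = re.search(r'data-price="([^"]+)"', html)
    return m.group(1) if m else None

def extract(html, strategies):
    for fn in strategies:
        value = fn(html)
        if value is not None:
            return value
    raise ValueError("all strategies failed; layout changed again")

old_page = '<span class="price" data-price="9.99">9.99</span>'
new_page = '<span class="p-x7" data-price="9.99">9.99</span>'  # CSS renamed

print(extract(old_page, [by_css_class, by_data_attr]))  # 9.99, via class
print(extract(new_page, [by_css_class, by_data_attr]))  # 9.99, via data attr
```

doesn't eliminate the maintenance, just turns "silent collapse" into "one strategy failing while the others hold" — which at least you can alert on.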
i sat there staring at broken JSON for an hour before i fully processed that this is just... always going to happen. there's no version of this where i maintain a custom scraper long term and it doesn't become a part time job.
how are people building AI products on top of web data without this being a constant maintenance nightmare? genuinely asking. is there an extraction layer that handles the fragility so i don't have to?