I am currently building a RAG pipeline that needs to process a massive volume of messy legacy data, including outdated reports, poorly formatted emails, various PDFs, mobile phone photos, and more. While the retrieval and generation components are functioning smoothly, I’ve hit a major bottleneck during the data preparation phase, specifically around data anonymization and schema mapping. We managed to cobble together a small internal tool for anonymization that works quite well; however, I’m completely stuck on extracting standard fields from these "spaghetti-code-like" raw inputs and mapping them to a schema.
My current approach uses the open-source library Unstructured together with gpt-4o to convert text content into JSON. The problem is that open-source parsers often struggle with complex document layouts (especially tables). Conversely, relying on gpt-4o at scale purely for data formatting is prohibitively expensive.
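To make the formatting step concrete, here is a stripped-down sketch of the kind of thing it does. Everything named here is a placeholder for illustration: `ReportRecord` is a made-up three-field schema, `map_to_schema` is a hypothetical helper, and `call_llm` stands in for whatever function wraps the gpt-4o call. It assumes the text has already come out of the parser, and it is not our actual code or any library's API.

```python
import json
from dataclasses import dataclass, fields
from typing import Callable, Optional


@dataclass
class ReportRecord:
    # Hypothetical target schema, for illustration only.
    title: str
    author: str
    date: str


PROMPT = (
    "Extract the fields {names} from the text below and reply with a single "
    "JSON object and nothing else.\n\n{text}"
)


def map_to_schema(text: str, call_llm: Callable[[str], str]) -> Optional[ReportRecord]:
    """Ask the model for JSON and accept it only if it fits the schema.

    `call_llm` is an assumed callable: prompt string in, raw reply string out.
    Returns None when the reply is not valid JSON or a required field is
    missing or not a string, so bad records can be retried or reviewed.
    """
    names = [f.name for f in fields(ReportRecord)]
    raw = call_llm(PROMPT.format(names=names, text=text))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(not isinstance(data.get(name), str) for name in names):
        return None
    # Keys outside the schema are dropped rather than treated as errors.
    return ReportRecord(**{name: data[name] for name in names})
```

Every document goes through one model call in this shape, which is where the cost adds up.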
Rather than continuing to vent about my own project, I’d much prefer to learn how the rest of you handle this specific stage of the workflow. For those of you currently running production-grade or mid-scale RAG systems:
What are the biggest data processing challenges you are currently facing? (Is it parsing diverse document layouts, anonymizing PII, or forcing unstructured text to fit into rigid data schemas?)
How is your tech stack set up to get the best results? Do you rely on APIs from data tools like Unstructured.io or LlamaParse, or do you primarily depend on custom, internally developed scripts?
Processing Cycle: If someone handed your team a massive pile of raw, messy text data today, how long would it realistically take you to get it into a state ready for use by AI? My manager keeps hounding me for a timeline, so I’d love to get a sense of what the average turnaround time looks like for everyone else.
I’m really looking forward to hearing about your workflows or any magic tools you’ve discovered that save you time.