/u/True_City6143 - Provide.ai

Best approach for OCR → cleaning → Hindi–English translation pipeline? [R]

/u/True_City6143 / April 14, 2026

I’m working with a large CSV dataset (~200k rows) containing OCR-extracted text from parliamentary documents. The data includes a mix of Hindi (Devanagari) and English, often within the same sentence, along with OCR noise (broken words, symbols, format…
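
For illustration, here is a minimal sketch of one possible first cleaning step for text like this: normalize the Unicode, collapse stray whitespace, then split each row into Devanagari and Latin runs so that only the Hindi parts would need to go to a translator. The function names `clean_ocr_text` and `split_by_script` are hypothetical, the script detection is a simple per-token heuristic based on the Devanagari block (U+0900 to U+097F), and nothing here is taken from the original post beyond the problem description.

```python
import re
import unicodedata

# Devanagari Unicode block; assumption: any token containing one of these is Hindi.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def clean_ocr_text(text: str) -> str:
    """Hypothetical minimal cleaner: NFC-normalize, collapse whitespace, drop control chars."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)
    # Remove only control characters (Cc); format characters such as ZWJ/ZWNJ
    # are kept because they can be meaningful in Devanagari text.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    return text.strip()


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Group whitespace-separated tokens into runs tagged 'hi', 'en', or 'other'."""
    runs: list[tuple[str, str]] = []
    for token in text.split():
        if DEVANAGARI.search(token):
            script = "hi"
        elif LATIN.search(token):
            script = "en"
        else:
            script = "other"  # digits, punctuation, leftover symbols
        if runs and (script == "other" or runs[-1][0] == script):
            # Neutral tokens attach to the preceding run; same-script tokens extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + " " + token)
        else:
            runs.append((script, token))
    return runs


if __name__ == "__main__":
    row = "माननीय अध्यक्ष  the Bill was\npassed 2019 में"
    for script, chunk in split_by_script(clean_ocr_text(row)):
        print(script, "|", chunk)
```

Token-level splitting is a deliberate simplification: it will mislabel tokens where OCR has fused both scripts into one word, so a character-level pass or a language-identification model may be needed for noisier rows.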
