Best approach for OCR → cleaning → Hindi–English translation pipeline? [R]

I’m working with a large CSV dataset (~200k rows) containing OCR-extracted text from parliamentary documents. The data includes a mix of Hindi (Devanagari) and English, often within the same sentence, along with OCR noise (broken words, symbols, formatting artifacts).

I’m looking for suggestions on:

  • Effective strategies for cleaning noisy OCR text (especially multilingual)
  • Handling mixed Hindi–English content within the same sentence
  • Reliable approaches for language detection at scale
  • Best practices for translating Hindi → English in such datasets
  • Model or pipeline recommendations (rule-based vs transformer-based vs hybrid)

Any guidance on building a robust pipeline for this kind of data would be really helpful.
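
To make the mixed-script problem concrete, here is a minimal sketch of a rule-based first pass, not a definitive implementation. It normalizes a row, then groups whitespace-separated tokens into Devanagari and Latin runs by Unicode range, so that only the Hindi runs would need to go to a translation model. The function names `normalize`, `token_script`, and `split_by_script` are hypothetical, and treating every Devanagari token as Hindi is an assumption (other languages share the script).

```python
import re
import unicodedata

# Devanagari Unicode block: U+0900 to U+097F.
# Assumption: any token containing these characters is tagged "hi".
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def normalize(text):
    """NFC-normalize, collapse whitespace runs, and drop control characters."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)
    # Only category Cc is removed; format characters (Cf) such as ZWJ/ZWNJ
    # are kept because they can be meaningful in Devanagari text.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    return text.strip()


def token_script(token):
    """Tag one token as 'hi', 'en', 'mixed', or 'other' by its characters."""
    has_deva = bool(DEVANAGARI.search(token))
    has_latin = bool(LATIN.search(token))
    if has_deva and has_latin:
        return "mixed"  # often a sign of OCR damage or a fused word
    if has_deva:
        return "hi"
    if has_latin:
        return "en"
    return "other"  # digits, punctuation, stray symbols


def split_by_script(text):
    """Group consecutive tokens that share a script tag into segments."""
    segments = []
    for token in normalize(text).split(" "):
        if not token:
            continue
        tag = token_script(token)
        # Attach digits and punctuation to the preceding segment, if any.
        if tag == "other" and segments:
            tag = segments[-1][0]
        if segments and segments[-1][0] == tag:
            segments[-1] = (tag, segments[-1][1] + " " + token)
        else:
            segments.append((tag, token))
    return segments
```

For example, `split_by_script("The Minister ने कहा कि budget 2020 पास हुआ")` yields four segments tagged `en`, `hi`, `en`, `hi`, with `2020` attached to the preceding English run. Tokens tagged `mixed` are worth inspecting separately, since a single token containing both scripts is more likely OCR noise than genuine code-switching.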

submitted by /u/True_City6143