cs.CL

Wiki Dumps to Training Corpora: South Slavic Case

arXiv:2604.25384v2 Announce Type: replace
Abstract: This paper presents a pipeline that transforms raw Wikimedia dumps into high-quality text corpora for seven South Slavic languages. The work is divided into two major phases. The first involves e…