Wiki Dumps to Training Corpora: South Slavic Case
arXiv:2604.25384v1 Announce Type: new
Abstract: This paper presents a methodology for transforming raw Wikimedia dumps into high-quality text corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extract…