MachineLearning

I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]

For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013. Here's what it ended up being: 103.1 billi…