/u/OwnerByDane - Provide.ai

I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]

/u/OwnerByDane / May 1, 2026

For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013. Here's what it ended up being: 103.1 billi…
