Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
arXiv:2506.01732v2 Announce Type: replace
Abstract: Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These datasets often contain trillions of tokens, including large portions of copyrighted or proprietary…