C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. It was used to train the T5 text-to-text Transformer models.
The dataset can be downloaded in a pre-processed form from allennlp.
Paper | Code | Results | Date | Stars |
---|