Conference Proceedings
CC-News-En: A Large English News Corpus
Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R Trippas, J Shane Culpepper, Alistair Moffat
Proceedings of the 29th ACM International Conference on Information & Knowledge Management | ACM | Published: 2020
Abstract
We describe a static, open-access news corpus using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, form..
Grants
Awarded by Australian Research Council