This trove of information contains roughly 20,000 newsgroup postings spread almost evenly across 20 topics. It has become one of the standard benchmark data sets for text classification. The data set was originally collected by Ken Lang. To be clear, this is not an exhaustive archive of Usenet, but a sample drawn from 20 newsgroups.
For example, one widely distributed version of the data set (18,828 documents) does not include any cross-posts. To get the most from the set, you’ll need to unpack the archive, which stores each posting as a plain text file inside a directory named after its newsgroup. Thankfully, it’s easier than you might think.
Standard tools that ship with Linux, such as tar, can unpack the archive for you. That said, this is not a large data set by modern standards, so an ordinary laptop is enough and no dedicated server is needed. Once you’ve got the files in place, you’re ready to analyze your data. Just be sure to evaluate your results on held-out documents that were not used for training.
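Once the files are unpacked, reading them into memory takes only a few lines. The sketch below assumes the layout of one directory per newsgroup with one file per posting; the function name load_newsgroups and the choice of latin-1 decoding are illustrative assumptions, not part of the data set's own tooling.

```python
from pathlib import Path


def load_newsgroups(root):
    """Read a directory tree with one sub-directory per newsgroup.

    Returns (texts, labels), where labels[i] is the name of the
    directory that texts[i] was read from.
    Hypothetical helper: the name and return shape are illustrative.
    """
    texts, labels = [], []
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue  # skip stray files next to the newsgroup folders
        for msg_path in sorted(group_dir.iterdir()):
            if not msg_path.is_file():
                continue
            # Assumption: old Usenet posts are not reliably UTF-8.
            # latin-1 maps every byte to a character, so decoding
            # never raises, at the cost of occasional odd characters.
            texts.append(msg_path.read_text(encoding="latin-1"))
            labels.append(group_dir.name)
    return texts, labels
```

Calling load_newsgroups on the unpacked directory should return two parallel lists, which is the shape most classification and clustering libraries expect. Sorting the directory entries keeps the order reproducible between runs.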
Of course, the 20 Newsgroups dataset isn’t just for text classification. The postings can also be used for text clustering and ad hoc analysis. Keep in mind that the collection contains only the text of newsgroup messages: there are no tweets, photos, or videos in it.
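Interestingly, the most commonly used version of the dataset (“bydate”) comes in two parts: the train set and the test set, split by posting date so that the test messages are later than the training ones. In this version, duplicates have been removed from both parts, not only from the test set. As a result, a model evaluated on the test set is less likely to be scored on messages it has already seen.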
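If you build your own split, or mix versions of the data, it is worth checking that no test message also appears in the train set. The following is a minimal sketch of one way to do that, assuming each message is a string whose headers end at the first blank line; the names body_fingerprint and find_leaks are hypothetical, and hashing a whitespace-normalized body is only one of several reasonable notions of "duplicate".

```python
import hashlib


def body_fingerprint(message):
    """Hash the message body, ignoring headers, case, and spacing."""
    # Headers end at the first blank line; if there is none, treat
    # the whole message as the body.
    _, sep, body = message.partition("\n\n")
    if not sep:
        body = message
    normalized = " ".join(body.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_leaks(train_texts, test_texts):
    """Return indices of test messages whose body also occurs in train."""
    seen = {body_fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts)
            if body_fingerprint(t) in seen]
```

An empty result from find_leaks(train_texts, test_texts) suggests the split is clean under this definition. Near-duplicates, such as replies that quote an earlier post in full, would not be caught and need a fuzzier comparison.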