This trove of information has been culled from roughly 20,000 newsgroup postings, spread nearly evenly across 20 topics. It has become one of the most widely used benchmark data sets for text classification. The data set was originally collected by Ken Lang.

The widely used "bydate" version of the data set does not include any cross-posts. It is distributed as a gzipped tar archive, so to get the most from the set you'll need to unpack it first.

On Linux or another Unix-like system, the standard tar utility will do the job for you. Despite the number of posts, this is a small data set by modern standards, and an ordinary laptop has plenty of room for it. Once you've got the files in place, you're ready to analyze your data.
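
As a rough sketch of that first step, the snippet below reads an unpacked copy into memory. It assumes the files sit in one directory per newsgroup with one plain-text file per message; check your own copy before relying on that. The function name load_newsgroups is made up for this illustration, and the latin-1 encoding is an assumption chosen because it can decode any byte sequence without raising an error.

```python
from pathlib import Path


def load_newsgroups(root):
    """Read a tree laid out as root/<newsgroup>/<message-file>.

    Returns (texts, labels), where each label is the name of the
    directory the message was found in. The layout is an assumption
    about how the archive was unpacked.
    """
    texts, labels = [], []
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue  # skip stray files sitting next to the group directories
        for msg_path in sorted(group_dir.iterdir()):
            if msg_path.is_file():
                # latin-1 maps every byte to a character, so decoding never fails.
                texts.append(msg_path.read_text(encoding="latin-1"))
                labels.append(group_dir.name)
    return texts, labels
```

Calling load_newsgroups on the directory you unpacked gives you two parallel lists, one of message texts and one of newsgroup names, which is the shape most classification tools expect.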

Of course, the 20 Newsgroups dataset isn't just for text classification. Its Usenet posts can also be used for text clustering and ad hoc analysis.
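
Clustering needs some way to say how alike two posts are. One common choice, shown here purely as an illustration and not as anything the dataset prescribes, is cosine similarity over word counts. The helper names term_counts and cosine_similarity are invented for this sketch, and the tokenizer is deliberately crude: it keeps runs of the letters a to z and nothing else.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    # A Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two posts that share no words score 0.0, and a post compared with itself scores 1.0 up to floating-point rounding. A real clustering run would usually add stop-word removal and TF-IDF weighting on top of this.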


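Interestingly, the "bydate" version of the dataset comes in two parts: the train set and the test set, split by posting date so that the test posts are later than the train posts. Duplicates have been removed from both parts. As a result, the dataset offers a clean collection of documents for checking how a model handles posts it has not seen.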
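
If you want to check for overlap between the two parts yourself, a simple approach is to fingerprint each document and look for matches. The sketch below is one hedged way to do that; fingerprint and shared_documents are hypothetical names, and the check only catches texts that are identical once case and whitespace are normalized. A cross-post whose headers differ would slip through.

```python
import hashlib


def fingerprint(text):
    """Hash a message after lowercasing it and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shared_documents(train_texts, test_texts):
    """Return the indices of test documents that also appear in train."""
    seen = {fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts) if fingerprint(t) in seen]
```

An empty list from shared_documents means no test document matches a train document under this normalization.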


