Text Clustering With the 20News Dataset

20 Newsgroups is a data set of roughly 20,000 newsgroup posts collected by Ken Lang. A widely used version of it is split by date into train and test sets and contains 18,846 documents in total. That is a modest size by current standards, so most clustering experiments on it run comfortably on a single machine. It is also worth noting that this version does not include cross-posts.

Entries on topics

One of the best uses for the 20news data set is text clustering. It contains newsgroup posts on topics ranging from politics and religion to technology. Fortunately for us, the data set is available as a free download. It is distributed as a compressed tar archive, so Windows users may need an extra tool, or a library that handles the download and unpacking, to get it to work.
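As a rough illustration of text clustering on this kind of data, here is a minimal sketch that turns posts into TF-IDF vectors and groups them with k-means. It assumes scikit-learn is installed. The function names `cluster_texts` and `load_20news_texts` are made up for this example, and the choice of TF-IDF with k-means is just one common baseline, not a method prescribed by the data set.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_texts(texts, n_clusters, seed=0):
    """Return one cluster label per text, using TF-IDF features and k-means."""
    # Hypothetical helper for illustration; English stop words are dropped.
    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(texts)  # sparse matrix, one row per text
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features).tolist()


def load_20news_texts():
    """Fetch all posts through scikit-learn (downloads and caches on first call)."""
    from sklearn.datasets import fetch_20newsgroups

    # Stripping headers, footers, and quoted replies is an optional choice here;
    # it keeps the clustering from keying on e-mail boilerplate.
    posts = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
    return posts.data
```

Calling `cluster_texts(load_20news_texts(), n_clusters=20)` would assign every post to one of twenty clusters. Twenty is only a natural starting point because of the number of groups; the clusters found need not line up with the newsgroups themselves.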

The largest numbers

The 20news data set is a nice trove of data that can give you some interesting and surprising results. For example, post length varies widely: some posts are only a line or two, while others run to many pages. The posts are spread fairly evenly across the twenty groups rather than concentrated in a few highly trafficked ones. On the downside, the shortest entries carry very little text to work with. Using a filter to remove such entries may improve the performance of some classifiers.
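The filtering idea can be sketched in a few lines of plain Python. The helper names and the `min_words` threshold below are assumptions chosen for illustration; a sensible cutoff depends on the task and is best checked empirically.

```python
from collections import Counter


def drop_short_posts(texts, labels, min_words=20):
    """Keep only posts with at least `min_words` whitespace-separated tokens."""
    # `min_words=20` is an arbitrary illustrative default, not a recommended value.
    kept = [(t, y) for t, y in zip(texts, labels) if len(t.split()) >= min_words]
    kept_texts = [t for t, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_texts, kept_labels


def posts_per_group(labels):
    """Count how many posts carry each group label."""
    return Counter(labels)
```

Comparing `posts_per_group` before and after filtering shows whether the cutoff removes posts unevenly from some groups, which is worth knowing before evaluating a classifier or a clustering on the reduced set.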



A brief perusal of the available docs shows that it has been around for a while, dating back to the 1990s. The twenty group names follow Usenet naming conventions; a few examples are comp.graphics, rec.autos, sci.space, and talk.politics.misc.