Got a spare 142 Gigs?

Just a very quick note to point you in the direction of a fantastic resource that Paul (one of the other members of the engineering team) showed me the other day. It's part of the ICWSM 2009 Data Challenge and is a set of 44 million blog posts generated between August and October 2008. It's in XML and is only 142 gigs after you uncompress it. It could form the basis of a great test set, à la TREC, maybe?
