
The ClueWeb09 Dataset

CMU has compiled a dataset of 1 billion web pages and made it available to researchers via UPSNet (aka four 1.5TB SATA disks + shipping). At $790 + shipping, it’s a bit of a steal.

The real dream would be some organization (Amazon?) making this dataset accessible electronically, but I imagine the processing power to work on a 1-billion-page dataset would be fairly pricey. Maybe the 50-million-page subset (only $240) would be a representative enough sample and a little bit lighter on the CPU load.
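For a rough sense of how the two options compare, here's a back-of-the-envelope sketch using only the figures quoted above. Shipping is left out because no amount is given, the constant and function names are just mine for illustration, and the capacity figure is raw disk space, not the actual size of the dataset.

```python
# Figures quoted in the post; shipping is excluded because no amount is given.
FULL_PRICE_USD = 790
FULL_PAGES = 1_000_000_000
SAMPLE_PRICE_USD = 240
SAMPLE_PAGES = 50_000_000
DISKS = 4
DISK_CAPACITY_TB = 1.5


def usd_per_million_pages(price_usd, pages):
    """Return the price in dollars for each million pages."""
    return price_usd / (pages / 1_000_000)


full_rate = usd_per_million_pages(FULL_PRICE_USD, FULL_PAGES)
sample_rate = usd_per_million_pages(SAMPLE_PRICE_USD, SAMPLE_PAGES)

# Raw capacity of the shipped disks (an upper bound, not the dataset size).
total_capacity_tb = DISKS * DISK_CAPACITY_TB

print(f"Full set: ${full_rate:.2f} per million pages")
print(f"Sample:   ${sample_rate:.2f} per million pages")
print(f"Disks:    {total_capacity_tb:.1f} TB raw capacity")
```

So the sample is cheaper in total but costs noticeably more per page, which is worth knowing before assuming the smaller set is the better deal.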

Visit this Link


I'm M. Jackson Wilkinson, a technologist, designer, speaker, educator, and writer in San Francisco. I'm the CEO and Founder of WeSprout, which is coming soon. I'm from Philadelphia, went to Bowdoin College in Maine, root for the Phillies, and love to sing.
