CMU has compiled a dataset of 1 billion web pages and made it available to researchers via UPSNet (aka four 1.5TB SATA disks + shipping). At $790 + shipping, it’s a bit of a steal.
The real dream would be some organization (Amazon?) making this dataset accessible electronically, but I imagine the processing power needed to work on a 1-billion-page dataset would be fairly pricey. Maybe the 50-million-page subset (only $240) would be representative enough and a little bit lighter on the CPU load.
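For a rough sense of the numbers, here's a back-of-the-envelope sketch using only the figures quoted above. The per-page byte figure is just raw disk capacity divided by page count, so treat it as an upper bound rather than the dataset's actual size, and the function name is purely illustrative.

```python
# Figures quoted in the post; everything derived from them is a rough estimate.
FULL_PAGES = 1_000_000_000      # full dataset: 1 billion pages
FULL_PRICE_USD = 790            # excludes shipping
SAMPLE_PAGES = 50_000_000       # 50-million-page sample
SAMPLE_PRICE_USD = 240          # excludes shipping
DISKS = 4
DISK_CAPACITY_TB = 1.5          # assumes decimal terabytes (1 TB = 1e12 bytes)


def usd_per_million_pages(price_usd, pages):
    """Price per million pages (hypothetical helper, not part of any API)."""
    return price_usd / (pages / 1_000_000)


full_rate = usd_per_million_pages(FULL_PRICE_USD, FULL_PAGES)
sample_rate = usd_per_million_pages(SAMPLE_PRICE_USD, SAMPLE_PAGES)

# Upper bound only: assumes the disks are completely full, which they may not be.
max_bytes_per_page = DISKS * DISK_CAPACITY_TB * 1e12 / FULL_PAGES

# Share of the full dataset covered by the sample.
sample_fraction = SAMPLE_PAGES / FULL_PAGES

print(f"Full set: ${full_rate:.2f} per million pages")
print(f"Sample:   ${sample_rate:.2f} per million pages")
print(f"Sample covers {sample_fraction:.0%} of the full set")
print(f"At most {max_bytes_per_page:,.0f} bytes of disk per page")
```

So the sample costs roughly six times more per page than the full set, but it's only 5% of the pages to crunch, which is where the CPU savings would come from.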