Data to back the pie chart I need to make, which will illustrate how we progressively narrowed the pool of sites and URLs returned by Common Crawl for MRI text down to the final web domains and pages we worked with for our samples.
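
One possible way to turn stage-by-stage counts into pie chart slices is sketched below. It assumes the data is an ordered list of (label, remaining count) pairs, one per filtering stage. The function name `funnel_to_slices`, the stage labels, and the counts are all hypothetical placeholders, not the real filters or numbers.

```python
def funnel_to_slices(stages):
    """Turn ordered (label, remaining_count) stages into pie slices.

    Each slice is the number of items dropped at one filtering step,
    plus a final slice for whatever survived every filter.
    """
    if not stages:
        return []
    slices = []
    # Pair each stage with the one that follows it.
    for (_, before), (label, after) in zip(stages, stages[1:]):
        if after > before:
            raise ValueError(f"stage {label!r} grew from {before} to {after}")
        slices.append((f"removed by: {label}", before - after))
    slices.append(("kept after all filters", stages[-1][1]))
    return slices


# Placeholder counts for illustration only; these are NOT the real numbers.
example_stages = [
    ("Common Crawl results", 1000),
    ("hypothetical filter A", 400),
    ("hypothetical filter B", 150),
]
example_slices = funnel_to_slices(example_stages)
```

The slice counts always sum to the size of the starting pool, so the labels and counts can be handed to any plotting library's pie chart function as they are.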