- Timestamp: 2020-03-10T20:45:18+13:00
- File: 1 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt
r34007  r34011
   106     106  07. 11316 empty status unknown
   107     107
           108
           109
           110  ============
           111
           112
           113  1463 sites prepared for crawling
           114  1447 sites crawled (16 were autotranslated or otherwise irrelevant)
           115  1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
           116  619 sites not finished crawling
           117  1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
           118
           119
           120
           121  16 uncrawled irrelevant sites pruned away
           122  1 failed crawl of site (text dump missing)
           123  1446 crawled sites in MongoDB
           124
           125
           126  Graph title: Breakdown of the 1463 sites prepared for crawling
           127  * 16 uncrawled irrelevant sites pruned away
           128  * 1 sites failed to properly crawl (text dump missing)
           129  * 619 incompletely crawled sites
           130  * 827 completely crawled sites
           131
           132
           133  Graph title: Breakdown of the 1463 sites prepared for crawling
           134  * 16 uncrawled irrelevant sites pruned away
           135  * 1 sites failed to properly crawl (text dump missing)
           136  * 419 crawled sites with no text content
           137      - 150 crawled sites with 0-size dump.txt files [crawled sites with empty dump.txt files] See below.
           138      - 269 crawled sites where dump.txt had no text content
           139  * 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)
           140
           141
           142
           143
           144  # All the dump.txt files that are 0 bytes (no content):
           145  # https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
           146  wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
           147  150 150 2550
           148
           149
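The figures added in this changeset are derived from one another by subtraction, so they can be cross-checked with a few lines of shell arithmetic. The following is only a sketch: the variable names are invented for illustration and are not part of the changeset; the numbers are copied from the lines above.

```shell
#!/bin/sh
# Figures copied from the changeset; the variable names are hypothetical.
prepared=1463          # sites prepared for crawling
pruned=16              # uncrawled irrelevant sites pruned away
failed=1               # site whose dump.txt was missing
incomplete=619         # sites not finished crawling
with_text=1027         # sites whose dump.txt contained text content
zero_byte=150          # sites with a 0-byte dump.txt
no_text_nonempty=269   # sites whose non-empty dump.txt had no text content

# Derived counts, as stated in the notes.
crawled=$((prepared - pruned))
in_mongodb=$((crawled - failed))
complete=$((in_mongodb - incomplete))
no_text=$((in_mongodb - with_text))

echo "crawled=$crawled in_mongodb=$in_mongodb complete=$complete no_text=$no_text"

# Each graph's slices should add back up to the number of prepared sites.
[ $((pruned + failed + incomplete + complete)) -eq "$prepared" ] && echo "graph 1 ok"
[ $((pruned + failed + no_text + with_text)) -eq "$prepared" ] && echo "graph 2 ok"
[ $((zero_byte + no_text_nonempty)) -eq "$no_text" ] && echo "no-text split ok"
```

On the last command in the changeset: `wc` without options prints line, word and byte counts, so the first `150` in its output is the number of zero-byte dump.txt files; `wc -l` would print that count alone.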