Changeset 34011 for other-projects/maori-lang-detection/mongodb-data
Timestamp: 2020-03-10T20:45:18+13:00 (4 years ago)
Location: other-projects/maori-lang-detection/mongodb-data
Files: 6 added, 2 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data.txt
Changes from r34007 to r34011 (new content appended after line 488):

metadata _csh_ : ^@^@^@^@

------------

Would like to do something like:
wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED | grep -l text:start */dump.txt | wc

Would like to find how many and which of the unfinished websites had a dump.txt with no text content
AND how many of the completely crawled websites had a dump.txt with no text content.

--------------

wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt

wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt

# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
    150     150    2550

Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort):
wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt

=======
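The cross-tabulation wished for in the notes above (unfinished vs. completely crawled sites, with vs. without text content) could be sketched roughly as follows. This is only an illustration, not the command the author ended up using. It assumes that each site folder directly contains its dump.txt and that an unfinished crawl is marked by a file named UNFINISHED in that same folder; the notes only show `find . -name UNFINISHED`, so the exact location of that marker is a guess. The function names `classify_sites` and `summarise_sites` are made up for this sketch.

```shell
# Sketch: classify every crawled site by crawl state and text content.
# ASSUMPTION: UNFINISHED (if present) sits next to dump.txt in the site folder.
# The function bodies run in subshells, so their variables stay local.
classify_sites() (
    root="${1:-.}"
    for dir in "$root"/*/; do
        # Sites without a dump.txt are skipped, not counted.
        [ -f "${dir}dump.txt" ] || continue
        if [ -e "${dir}UNFINISHED" ]; then state=unfinished; else state=finished; fi
        # grep -q succeeds as soon as one line contains the marker.
        if grep -q "text:start" "${dir}dump.txt"; then content=text; else content=notext; fi
        # One tab-separated row per site: state, content, site folder name.
        printf '%s\t%s\t%s\n' "$state" "$content" "$(basename "$dir")"
    done
)

# Count the sites in each (state, content) combination.
summarise_sites() (
    classify_sites "${1:-.}" | cut -f1,2 | sort | uniq -c
)
```

Run from the crawled folder, `summarise_sites .` would print one count per combination, and `classify_sites . | grep -P '^unfinished\tnotext'` (GNU grep) would list which unfinished sites have no text content.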
other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt
Changes from r34007 to r34011 (new content appended after line 107):

07. 11316 empty status unknown

============

1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in MongoDB
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content

16 uncrawled irrelevant sites pruned away
1 failed crawl of site (text dump missing)
1446 crawled sites in MongoDB

Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 619 incompletely crawled sites
* 827 completely crawled sites

Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 419 crawled sites with no text content
  - 150 crawled sites with 0-size (empty) dump.txt files. See below.
  - 269 crawled sites where dump.txt was non-empty but had no text content
* 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)

# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
    150     150    2550
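The split of the no-text sites into empty and non-empty dump.txt files (150 and 269 in the notes above) could be recomputed with something like the sketch below. It assumes the same layout the commands above imply, one dump.txt directly inside each site folder, and the function name `count_no_text` is made up for illustration.

```shell
# Sketch: split the sites without text content into two groups:
#   empty            = dump.txt is 0 bytes
#   no_text_nonempty = dump.txt has content but no "text:start" marker
# The function body runs in a subshell, so its variables stay local.
count_no_text() (
    root="${1:-.}"
    empty=0
    nonempty=0
    for f in "$root"/*/dump.txt; do
        [ -f "$f" ] || continue          # unmatched glob or missing file
        if [ ! -s "$f" ]; then           # -s is true only for size > 0
            empty=$((empty + 1))
        elif ! grep -q "text:start" "$f"; then
            nonempty=$((nonempty + 1))
        fi
    done
    printf 'empty=%d no_text_nonempty=%d total_no_text=%d\n' \
        "$empty" "$nonempty" "$((empty + nonempty))"
)
```

Calling `count_no_text .` from the crawled folder prints the three counts on one line, which can then be compared against the 150 + 269 = 419 breakdown recorded in the notes.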