Changeset 33457

Timestamp:
05.09.2019 19:01:36
Author:
ak19
Message:

Got stage 1, the WARC to WET conversion, working after discovering and making the necessary adjustments to the online instructions. The instructions and code didn't work as-is; they were probably out of date.

Location:
gs3-extensions/maori-lang-detection/MoreReading
Files:
2 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33456 r33457

https://gist.github.com/Smerity/afe7430fdb4371015466
https://github.com/commoncrawl/commoncrawl/issues/11

WARC TO WET:
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017

...

Best,
Sebastian

=======================
Latest version of the index's schema:

...

tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.

--------------------------------------------
  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33456 r33457

- Machines connecting from outside, like analytics, must access node1 directly or use port forwarding to view the VM's servers on localhost. For example, on analytics you can then view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed on their respective ports at any of localhost|10.211.55.101|node1.
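
A port-forwarding sketch for that first case (the hostname node1 and the vagrant user here are assumptions taken from these notes, so adjust as needed):

# from analytics: forward local port 8088 to the Yarn UI on node1, then browse http://localhost:8088/ locally
ssh -L 8088:localhost:8088 vagrant@node1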

===========================================
        WARC TO WET
===========================================
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
Hi,

unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.

But it's easy to run the WET extractor on the WARC files, see:
  https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
  https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion

That's what you have to do:

# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# |   |-- CC-NEWS-20161001224340-00008.warc.gz
# |   |-- CC-NEWS-20161017145313-00000.warc.gz
# |   `-- ...
# |
# |-- wat/
# |
# `-- wet/

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn install

cd ..
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
   -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz

The folders wat/ and wet/ will then contain the exports.

Best,
Sebastian

---

1. So, following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/.
Then I moved all the downloaded *.warc.gz files into there.
Then I created wat and wet subfolders in there, alongside the warc folder.

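A rough sketch of those HDFS commands (the local download directory ~/warc-downloads is just an illustrative name, not necessarily the one actually used):

# create the warc/, wat/ and wet/ folders on HDFS
hdfs dfs -mkdir -p /user/vagrant/cc-mri-subset/warc
hdfs dfs -mkdir /user/vagrant/cc-mri-subset/wat
hdfs dfs -mkdir /user/vagrant/cc-mri-subset/wet
# upload the downloaded WARC files into the warc/ folder
hdfs dfs -put ~/warc-downloads/*.warc.gz /user/vagrant/cc-mri-subset/warc/
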
2. Next, I did the two git clone and mvn build operations above.
The first, ia-web-commons, compiled successfully (despite some test failures).

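(If those test failures ever abort the build, Maven's standard switch for skipping tests gets past them; it wasn't needed here:)

mvn install -DskipTests
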
3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST; the mvn package command failed:

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

The compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the JSONTokener being pulled in was a version of the class that was too old, whereas the necessary constructor is present in recent versions, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html

So I opened ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json artifact (see http://builds.archive.org/maven2/org/json/json/ for the <version>) into the pom.xml's <dependencies> element, based on how this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:

    <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20131018</version>
    </dependency>

Then I was able to run "mvn package" successfully.
(Maybe I could also have added a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
but I didn't want to go too far ahead in case there was other incompatibility.)

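(To see which org.json version actually ends up on the build classpath after this change, Maven's dependency tree can be inspected; this is a standard Maven goal, shown here as an optional sanity check rather than something from the original run:)

mvn dependency:tree -Dincludes=org.json:json
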
4. Next, I wanted to finally run the built executable to convert the warc files to wet files.

I had the warc files on the hadoop filesystem. However, the original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to were apparently written for warcs stored on the local filesystem, since they run the regular java command rather than a hadoop command. The regular java command did not work with the files being on the hadoop filesystem (attempt #1 below).

ATTEMPTS THAT DIDN'T WORK:
1. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
2. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz


I based the 2nd attempt, which uses a proper hadoop command, on https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
It produced lots of errors, and the output wet (and wat) .gz files were all corrupt, as gunzip could not run over them successfully (a quick integrity check for this is sketched after the job output below):

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
19/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
19/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
19/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
19/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
19/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
19/09/05 05:57:31 INFO mapreduce.Job:  map 0% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job:  map 10% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:46 INFO mapreduce.Job:  map 0% reduce 0%
19/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:57 INFO mapreduce.Job:  map 10% reduce 0%
19/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:58 INFO mapreduce.Job:  map 20% reduce 0%
19/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:58:06 INFO mapreduce.Job:  map 30% reduce 0%
19/09/05 05:58:08 INFO mapreduce.Job:  map 60% reduce 0%
19/09/05 05:58:09 INFO mapreduce.Job:  map 70% reduce 0%
19/09/05 05:58:10 INFO mapreduce.Job:  map 80% reduce 0%
19/09/05 05:58:12 INFO mapreduce.Job:  map 90% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job:  map 100% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
19/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1239360
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1430
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=30
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=10
        Launched map tasks=20
        Other local map tasks=10
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=208160
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=208160
        Total vcore-milliseconds taken by all map tasks=208160
        Total megabyte-milliseconds taken by all map tasks=213155840
    Map-Reduce Framework
        Map input records=10
        Map output records=0
        Input split bytes=1430
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=1461
        CPU time spent (ms)=2490
        Physical memory (bytes) snapshot=1564528640
        Virtual memory (bytes) snapshot=19642507264
        Total committed heap usage (bytes)=1126170624
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
vagrant@node1:~/ia-hadoop-tools$

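(A quick way to confirm that an output file is corrupt, without extracting it, is gzip's built-in integrity test. This is just a sanity check one could run, shown here with one of the output filenames from above:)

# copy one WET output file out of HDFS and test the gzip stream
hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
gzip -t /home/vagrant/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz && echo "gzip OK" || echo "corrupt"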

5. The error messages are all the same but not very informative:
   19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
   Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;

All the references I could find on Google indicated that the full version of the error message was that this method (com.google.common.io.ByteStreams.limit(...)) could not be located. That usually points to a Guava version mismatch: ByteStreams.limit(...) only exists in newer Guava releases, so if an older guava jar is picked up first on the tasks' classpath the method is simply missing.
The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3C7AD5ED91-F8BA-43C9-9030-049BAB360BD4@gmail.com%3E
revealed that guava.jar contains the com.google.common.io.ByteStreams class.

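(Aside: a targeted way to check whether a particular jar actually provides the missing class, rather than paging through the full jar listing as in the transcript below, is to grep the jar's table of contents:)

# list the jar's contents and look for the ByteStreams class
jar -tf /usr/share/java/guava.jar | grep ByteStreams
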
TO GET THE EXECUTABLE TO WORK:
I located guava.jar, found there were two identical copies on the filesystem but that neither was on the hadoop classpath yet, so I copied it into one of the hadoop classpath locations. Then I was able to run the executable successfully and at last produce meaningful WET files from the WARC input files:


vagrant@node1:~$ locate guava.jar
/usr/share/java/guava.jar
/usr/share/maven/lib/guava.jar
vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar | less
vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar | less
# both contained the ByteStreams class

vagrant@node1:~$ cd -
/home/vagrant/ia-hadoop-tools
vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
# None in the git project

vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# guava.jar not on hadoop classpath yet

vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
# no differences, identical

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
# hadoop classpath locations are not on the hdfs filesystem

vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# Copied guava.jar to somewhere on the existing hadoop classpath

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
# Successful run

vagrant@node1:~$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
vagrant@node1:~$ cd ..
vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
# Copied a WET output file from the hadoop filesystem to the local filesystem and inspected its contents. Works!
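
(For a quick look without copying files around first, a WET file can also be streamed straight out of HDFS; a small convenience using the same paths as above:)

# stream a WET file from HDFS, decompress on the fly and page through it
hdfs dfs -cat /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz | gunzip | less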

-----------------------------------