Changeset 33457


Timestamp:
2019-09-05T19:01:36+12:00
Author:
ak19
Message:

Got stage 1, the WARC to WET conversion, working after discovering and making the necessary adjustments to the online instructions. The instructions and code didn't work as-is; they were probably out of date.

Location:
gs3-extensions/maori-lang-detection/MoreReading
Files:
2 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

r33456 → r33457

https://gist.github.com/Smerity/afe7430fdb4371015466
https://github.com/commoncrawl/commoncrawl/issues/11
WARC TO WET:
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
[...]
Best,
Sebastian

=======================
Latest version of the index's schema:
[...]
tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.

--------------------------------------------
  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

r33456 → r33457

- Connecting machines, like analytics, must access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, you can view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.
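
For example, a port-forwarding setup from a connecting machine could look like this (a sketch; the IP is the one above, and direct ssh access as the vagrant user is an assumption):
# forward local port 8088 to the Yarn web UI inside the VM
ssh -L 8088:localhost:8088 [email protected]
# then browse http://localhost:8088/ on the connecting machine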

===========================================
        WARC TO WET
===========================================
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
Hi,

unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.

But it's easy to run the WET extractor on the WARC files, see:
  https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
  https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion

That's what you have to do:

# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# |   |-- CC-NEWS-20161001224340-00008.warc.gz
# |   |-- CC-NEWS-20161017145313-00000.warc.gz
# |   `-- ...
# |
# |-- wat/
# |
# `-- wet/

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn install

cd ..
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
   -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz

The folders wat/ and wet/ will then contain the exports.

Best,
Sebastian

---

1. So following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/
Then I moved all the downloaded *.warc.gz files into there,
and created wat and wet subfolders in there alongside the warc folder.
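
In commands, that setup was roughly as follows (a sketch reconstructed after the fact, not the exact history; ~/warc-downloads is a hypothetical name for wherever the files were saved locally):
# create the warc/wat/wet sibling folders on HDFS
hdfs dfs -mkdir -p /user/vagrant/cc-mri-subset/warc
hdfs dfs -mkdir /user/vagrant/cc-mri-subset/wat
hdfs dfs -mkdir /user/vagrant/cc-mri-subset/wet
# move the downloaded warc.gz files onto HDFS
hdfs dfs -put ~/warc-downloads/*.warc.gz /user/vagrant/cc-mri-subset/warc/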

2. Next, I did the 2 git clone and mvn compile operations above.
The first, ia-web-commons, compiled successfully (despite some test failures).

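(As an aside: the test failures didn't block the install here, but if they had, Maven's standard flag for skipping tests would have been the workaround, a sketch:)
# skip the failing tests during install (standard Maven/surefire flag)
mvn install -DskipTests
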
3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST, with the mvn package command failing:

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

The compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the version of the JSONTokener class being pulled in was too old; the necessary constructor is present in the most recent version, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html

So instead, I opened up ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json artifact (see http://builds.archive.org/maven2/org/json/json/ for <version>) into the pom.xml's <dependencies> element, based on how this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:

   <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20131018</version>
   </dependency>

Then I was able to run "mvn package" successfully.
(Maybe I could also have added a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
but I didn't want to go too far ahead in case there was other incompatibility.)

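A quick way to confirm the new dependency actually resolved is Maven's standard dependency:tree goal (a sketch):
# list the resolved dependency tree and check the org.json artifact is in it
mvn dependency:tree | grep "org.json:json"
# expect a line like: org.json:json:jar:20131018:compile
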
4. Next, I wanted to finally run the built executable to convert the warc files to wet files.

I had the warc files on the hadoop filesystem. However, the original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to were apparently for working with warcs stored on the local filesystem, as those instructions ran the regular java command rather than the hadoop command. The regular java command did not work with the files being on the hadoop filesystem (attempt #1 below).

ATTEMPTS THAT DIDN'T WORK:
1. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
2. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz


I based the 2nd attempt, which uses a proper hadoop command, on https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
It produced lots of errors, and the output wet (and wat) .gz files were all corrupt, as gunzip could not successfully run over them:

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
19/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
19/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
19/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
19/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
19/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
19/09/05 05:57:31 INFO mapreduce.Job:  map 0% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job:  map 10% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:46 INFO mapreduce.Job:  map 0% reduce 0%
19/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:57 INFO mapreduce.Job:  map 10% reduce 0%
19/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:58 INFO mapreduce.Job:  map 20% reduce 0%
19/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:58:06 INFO mapreduce.Job:  map 30% reduce 0%
19/09/05 05:58:08 INFO mapreduce.Job:  map 60% reduce 0%
19/09/05 05:58:09 INFO mapreduce.Job:  map 70% reduce 0%
19/09/05 05:58:10 INFO mapreduce.Job:  map 80% reduce 0%
19/09/05 05:58:12 INFO mapreduce.Job:  map 90% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job:  map 100% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
19/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1239360
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1430
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=30
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=10
        Launched map tasks=20
        Other local map tasks=10
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=208160
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=208160
        Total vcore-milliseconds taken by all map tasks=208160
        Total megabyte-milliseconds taken by all map tasks=213155840
    Map-Reduce Framework
        Map input records=10
        Map output records=0
        Input split bytes=1430
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=1461
        CPU time spent (ms)=2490
        Physical memory (bytes) snapshot=1564528640
        Virtual memory (bytes) snapshot=19642507264
        Total committed heap usage (bytes)=1126170624
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
vagrant@node1:~/ia-hadoop-tools$

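(For the record, the corruption can be confirmed in-place, without copying anything out of HDFS; a sketch using the wet/ path from above:)
# gunzip -t tests a gzip stream read from stdin and exits non-zero if corrupt
for f in $(hdfs dfs -ls /user/vagrant/cc-mri-subset/wet/ | awk '{print $NF}' | grep 'warc.wet.gz$'); do
    hdfs dfs -cat "$f" | gunzip -t && echo "OK: $f" || echo "CORRUPT: $f"
done
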
5. The error messages are all the same but not very informative:
   19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
   Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;

All the references I could find on google indicated that the full version of the error message is that this method, com.google.common.io.ByteStreams.limit(...), could not be located.
The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%[email protected]%3E
revealed that guava.jar contains the com.google.common.io.ByteStreams class.

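One way to check whether a given guava.jar actually provides the method named in that descriptor, limit(java.io.InputStream, long), is javap (a sketch; the jar path is the copy located in the transcript below):
# print ByteStreams' method signatures and look for limit(InputStream, long)
javap -classpath /usr/share/java/guava.jar com.google.common.io.ByteStreams | grep limit
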
TO GET THE EXECUTABLE TO WORK:
I located guava.jar and found there were 2 identical copies on the filesystem, but neither was on the hadoop classpath yet, so I copied one into an existing Hadoop classpath location. Then I was able to successfully run the executable and produce meaningful WET files at last from the WARC input files:

vagrant@node1:~$ locate guava.jar
/usr/share/java/guava.jar
/usr/share/maven/lib/guava.jar
vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar | less
vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar | less
# both contained the ByteStreams class

vagrant@node1:~$ cd -
/home/vagrant/ia-hadoop-tools
vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
# None in the git project

vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# guava.jar not on the hadoop classpath yet

vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
# no differences, identical

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
# hadoop classpath locations are not on the hdfs filesystem

vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# Copied guava.jar to somewhere on the existing hadoop classpath

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
# Successful run

vagrant@node1:~$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
vagrant@node1:~$ cd ..
vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
# Copied a WET output file from the hadoop filesystem to the local filesystem and inspected its contents. Works!
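
(A WET file can also be inspected straight out of HDFS, without the -get step; a sketch:)
# stream, decompress and page the first records of a WET file in-place
hdfs dfs -cat /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz | gunzip -c | head -50
# the file should start with a "WARC/1.0" warcinfo record, followed by
# WARC-Type: conversion records holding the extracted plain text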

-----------------------------------