----------------------------------------
INDEX: follow in sequence
----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
B. Create IAM role on Amazon AWS to use S3a
C. Configure Spark on your vagrant VM with the AWS authentication details
--- Script scripts/setup.sh is now automated to do the steps in D-F below and prints out the main instruction for G. ---
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
E. Setup cc-index-table git project
F. Setup warc-to-wet tools (git projects)
G. Getting and running our scripts
---
H. Autistici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps
I. Setting up Nutch v2 on its own Vagrant VM machine
J. Automated crawling with Nutch v2.3.1 and post-processing
K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
---
APPENDIX: Legend of mongodb-data folder's contents
APPENDIX: Reading data from hbase tables and backing up hbase
----------------------------------------

----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
----------------------------------------
Set up vagrant with hadoop and spark as follows:

1. Follow the instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark

   This will eventually create the following folder, which will contain the Vagrantfile:
       /home/<username>/vagrant-hadoop-hive-spark

2. If there are other vagrant VMs set up according to the same instructions on the same machine,
   then you need to change the forwarded ports (the 2nd column of ports) in the file "Vagrantfile".
   In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:

       config.vm.network "forwarded_port", guest: 8080, host: 8081
       config.vm.network "forwarded_port", guest: 8088, host: 8089
       config.vm.network "forwarded_port", guest: 9083, host: 9084
       config.vm.network "forwarded_port", guest: 4040, host: 4041
       config.vm.network "forwarded_port", guest: 18888, host: 18889
       config.vm.network "forwarded_port", guest: 16010, host: 16011

   Remember to visit the adjusted ports on the running VM.

3. The most useful vagrant commands:

       vagrant up        # start up the vagrant VM if not already running.
                         # May need to provide the VM's ID if there's more than one vagrant VM.
       vagrant ssh       # ssh into the sole vagrant VM, else may need to provide the vagrant VM's ID.
       vagrant halt      # shut down the vagrant VM. Provide the VM's ID if there's more than one vagrant VM.
       (vagrant destroy) # get rid of your vagrant VM. Useful if you've edited your Vagrantfile.

4. Inside the VM, the host folder /home/<username>/vagrant-hadoop-hive-spark will be shared and
   mounted as /vagrant. Remember, this is the folder containing the Vagrantfile.
   It's easy to use this shared folder to transfer files between the VM and the actual machine
   that hosts it.

5. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:

   Start up the vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.

   a. sudo apt-get -y install firefox

   b. sudo apt-get install emacs

   c. sudo apt-get install maven
      (or
       sudo apt update
       sudo apt install maven)

   Maven is needed for the commoncrawl github projects we'll be working with.

6. Although you can edit the Vagrantfile to have emacs and maven automatically installed when the
   vagrant VM is created, for firefox you're advised to install it manually as above.
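   If you do want emacs and maven provisioned automatically, a minimal sketch of what that Vagrantfile
   addition could look like is below (this assumes the standard shell provisioner; the repo's own
   provisioning scripts may be a better place for these installs):

       config.vm.provision "shell", inline: <<-SHELL
         apt-get update
         apt-get -y install emacs maven
       SHELL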
   To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
       vagrant ssh -- -Y
   [or "vagrant ssh -- -Y node1", if the VM's ID is node1]

READING ON Vagrant:
* Guide: https://www.vagrantup.com/intro/getting-started/index.html
* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
* vagrant reload = vagrant halt + vagrant up: https://www.vagrantup.com/docs/cli/reload.html
* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
      sudo apt-get -y install firefox
* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/

-------------------------------------------------
B. Create IAM role on Amazon AWS to use S3 (S3a)
-------------------------------------------------
CommonCrawl (CC) crawl data is stored on Amazon S3 and accessed via s3a, which has superseded both
the original s3 connector and its successor s3n.

In order to have access to CC crawl data, you need to create an IAM role (or user) on Dr Bainbridge's
Amazon AWS account and configure its profile for commoncrawl.

1. Log into Dr Bainbridge's Amazon AWS account
   - In the aws management console:
     davidb@waikato.ac.nz
     lab pwd, capital R and ! (maybe g)

2. Create a new "iam" role or user for the "commoncrawl(er)" profile.

3. You can create the commoncrawl profile while creating the user/role, by following the instructions
   at https://answers.dataiku.com/1734/common-crawl-s3 which states:

   "Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's
   a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the
   following policy to your IAM user"

       #### START POLICY IN JSON FORMAT ###
       {
           "Version": "2012-10-17",
           "Statement": [
               {
                   "Sid": "Stmt1503647467000",
                   "Effect": "Allow",
                   "Action": [
                       "s3:GetObject",
                       "s3:ListBucket"
                   ],
                   "Resource": [
                       "arn:aws:s3:::commoncrawl/*",
                       "arn:aws:s3:::commoncrawl"
                   ]
               }
           ]
       }
       #### END POLICY ###
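   As an alternative to attaching the policy through the web console, the same policy can be attached
   from the AWS CLI. This is only a sketch, assuming the AWS CLI is available and configured with
   credentials that can administer IAM, that the policy JSON above has been saved as
   commoncrawl-policy.json, and that the IAM user was named "commoncrawler":

       aws iam put-user-policy --user-name commoncrawler \
           --policy-name CommonCrawlReadAccess \
           --policy-document file://commoncrawl-policy.json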
--------------------------------------------------------------------------
C. Configure Spark on your vagrant VM with the AWS authentication details
--------------------------------------------------------------------------
Any Spark jobs run against the CommonCrawl data stored on Amazon s3a need to be able to authenticate
with the AWS IAM role you created above. In order to do this, you'll want to put the Amazon AWS
access key and secret key in the SPARK configuration properties file. (This is instead of configuring
these values in hadoop's core-site.xml, because in the latter case the authentication details don't
get copied across when distributed jobs are run on other computers in the cluster that also need to
know how to authenticate.)

1. Inside the vagrant vm:

       sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
       (sudo emacs $SPARK_HOME/conf/spark-defaults.conf)

2. Edit the spark properties conf file to contain these 3 new properties:

       spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
       spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
       spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE

   Instructions on which properties to set were taken from:
   - https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
   - https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

   [NOTE, inactive alternative: Instead of editing spark's config file to set these properties, they
   can also be set in the bash script that executes the commoncrawl Spark jobs:

       $SPARK_HOME/bin/spark-submit \
           ...
           --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
           --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
           --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
           ...

   But it's better not to hardcode authentication details into code, so I did it the first way.]

----------------------------------------------------------------------
NOTE: Script scripts/setup.sh is now automated to do the steps in D-F below
and prints out the main instruction for G.
----------------------------------------------------------------------
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
----------------------------------------------------------------------
The following 2 pages state that additional steps are necessary to get hadoop and spark to work with
AWS S3a:

- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B
below that got things working for me. However, I have both A and B below set up.

A. Check your maven installation for the necessary jars:

   1. Installing maven may already have got the specifically recommended version of AWS-Java-SDK
      (aws-java-sdk-1.7.4.jar) and the hadoop-aws jar matching the vagrant VM's hadoop version 2.7.6
      (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
      - /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
      - /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar

      The specifically recommended v1.7.4 from the instructions can be found off
      https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4
      at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

   2. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:

          $SPARK_HOME/bin/spark-submit \
              --jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
              --driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \

      However, the other Spark job in the script does not set --jars or --driver-class-path, despite
      also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary
      or whether they were just being ignored when provided.
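   A quick way to check what hadoop puts on its classpath, and whether the maven-cached jars are
   where the spark-submit flags above expect them, is sketched below (paths are just the ones quoted
   above; adjust as needed):

       # see which directories hadoop puts on its classpath
       hadoop classpath | tr ':' '\n'

       # check the maven-downloaded jars are present at the expected paths
       ls -l /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar \
             /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar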
B. Download the jar files and put them on the hadoop classpath:

   1. Download the jar files:
      - I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/
      - I downloaded the hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from
        https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6

   2. The easiest solution is to copy the 2 downloaded jars into a location on the hadoop classpath.

      a. The command that shows the paths present on the Hadoop CLASSPATH:
             hadoop classpath
         One of the paths this will list is /usr/local/hadoop-2.7.6/share/hadoop/common/

      b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar,
         to this location:

             sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
             sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.

         Any hadoop jobs run will now find these 2 jar files on the classpath.

      [NOTE, unused alternative: Instead of copying the 2 jar files into a system location, and
      assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's
      jar files onto the hadoop classpath from the bash script that runs the spark jobs. This had no
      effect for me and was commented out, and is another reason why I'm not sure if the 2 jar files
      were even necessary.

          #export LIBJARS=/home/vagrant/lib/*
          #export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
      ]

------------------------------------
E. Setup cc-index-table git project
------------------------------------
You need to be inside the vagrant VM.

1. Since you should already have installed maven, you can check out and compile the cc-index-table
   git project:

       git clone https://github.com/commoncrawl/cc-index-table.git

2. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and
   adding a dependency for hadoop-aws 2.7.6:

   Change:
       <spark.version>2.4.1</spark.version>
   To:
       <spark.version>2.3.0</spark.version>

   And add a dependency:
       <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-aws</artifactId>
         <version>2.7.6</version>
       </dependency>

3. Although cc-index-table will compile successfully after the above modifications, it will
   nevertheless throw an exception when it's eventually run. To fix that, edit the file
   "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:

   a. Set option(header) to false, since the csv file contains no header row, only data rows.

      Change:
          sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
                  .load(csvQueryResult);
      To:
          sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
                  .load(csvQueryResult);

   b. The 4 column names are then inferred as _c0 to _c3, not as url/warc_filename etc.

      Comment out:
          //JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
          //        .toJavaRDD();
      Replace with the default inferred column names:
          JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
                  .toJavaRDD();

      // TODO: link svn committed versions of the orig and modified CCIndexWarcExport.java here.

4. Now (re)compile cc-index-table with the above modifications:

       cd cc-index-table
       mvn package
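With the AWS jars in place (section D) and the credentials from section B, a quick end-to-end check
that hadoop can reach the commoncrawl bucket over s3a is sketched below. The keys are passed on the
command line only for this one-off test, and the bucket prefix shown (the columnar index location)
is an assumption:

    hadoop fs -D fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
              -D fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE \
              -D fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE \
              -ls s3a://commoncrawl/cc-index/table/cc-main/warc/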
-------------------------------
F. Setup warc-to-wet tools
-------------------------------
To convert WARC files to WET (.warc.wet) files, you need to check out, set up and compile a couple
more tools. These instructions are derived from those at
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

1. Grab and compile the 2 git projects for converting warc to wet:

       git clone https://github.com/commoncrawl/ia-web-commons
       cd ia-web-commons
       mvn install

       git clone https://github.com/commoncrawl/ia-hadoop-tools
       cd ia-hadoop-tools
       # can't compile this yet

2. Add the following into ia-hadoop-tools/pom.xml, at the top of the <dependencies> element, so that
   maven finds an appropriate version of the org.json package and its JSONTokener (version number
   found off https://mvnrepository.com/artifact/org.json/json):

       <dependency>
         <groupId>org.json</groupId>
         <artifactId>json</artifactId>
         <version>20131018</version>
       </dependency>

   [ UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the
   change in step 1 above:

   a. These further differences show up between the original version of the file in pom.xml.orig and
      the modified new pom.xml:

          ia-hadoop-tools>diff pom.xml.orig pom.xml
          <     <groupId>org.netpreserve.commons</groupId>
          <     <artifactId>webarchive-commons</artifactId>
          <     <version>1.1.1-SNAPSHOT</version>
          ---
          >     <groupId>org.commoncrawl</groupId>
          >     <artifactId>ia-web-commons</artifactId>
          >     <version>1.1.9-SNAPSHOT</version>

   b. I don't recollect changing or manually introducing any java files. I just followed the
      instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

      However, a "diff -rq" between the latest "ia-hadoop-tools" git project, checked out a month
      after the "ia-hadoop-tools.orig" checkout, shows the following differences in files which are
      not shown as recently modified in github itself in that same period:

          ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
          Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
          Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
          Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
   ]

3. Now you can compile ia-hadoop-tools:

       cd ia-hadoop-tools
       mvn package
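   As a quick sanity check of the build, you can confirm the jar-with-dependencies was produced and
   contains the WEATGenerator job used further below (the artifact name is taken from the run command
   in the next step; adjust if your build names it differently):

       ls ia-hadoop-tools/target/
       jar tf ia-hadoop-tools/target/ia-hadoop-tools-jar-with-dependencies.jar | grep WEATGenerator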
4. You can't run it until guava.jar is on the hadoop classpath. Locate a guava.jar and put it into an
   existing location checked by "hadoop classpath":

       locate guava.jar
       # found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
       diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
       # identical/no difference, so can use either
       sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
       # now guava.jar has been copied into a location on the hadoop classpath

   Having done the above, our bash script will now be able to convert WARC to WET files when it runs:

       $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar \
           WEATGenerator -strictMode -skipExisting batch-id-xyz \
           hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz

   Our script expects a specific folder structure: there should be a "warc" folder (containing the
   warc files), which is supplied as above, but also an empty "wet" and "wat" folder at the same
   level as the "warc" folder.

   When the job is running, you can visit the Spark Context at http://node1:4040/jobs/
   (http://node1:4041/jobs/ for me the first time, since I forwarded the vagrant VM's ports at +1;
   however, subsequent times it was on node1:4040/jobs.)

-----------------------------------
G. Getting and running our scripts
-----------------------------------
1. Grab our 1st bash script and put it into /home/vagrant/cc-index-table/src/script:

       cd cc-index-table/src/script
       wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_maori_WET_records_for_crawl.sh
       chmod u+x get_maori_WET_records_for_crawl.sh

   RUN AS:
       cd cc-index-table
       ./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
   where <crawl-timestamp> is of the form "CC-MAIN-YYYY-##" >= September 2019.
   (See the example invocation below.)

   OUTPUT:
   After hours of processing (leave it to run overnight), you should end up with:
       hdfs dfs -ls /user/vagrant/<crawl-timestamp>/
   In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/ that we want
   would have been copied into /vagrant/<crawl-timestamp>-wet-files/

   The script get_maori_WET_records_for_crawl.sh:
   - takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which
     content_languages were not indexed). The legitimate crawl timestamps are listed in the first
     column at http://index.commoncrawl.org/
   - runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records
   - runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv
     file's records, into zipped warc files
   - converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz
     (and warc.wat.gz) files
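   For instance, an invocation for a single crawl might look like the following (CC-MAIN-2019-43 is
   just an example timestamp; check that whichever one you use appears in the first column at
   http://index.commoncrawl.org/):

       cd cc-index-table
       ./src/script/get_maori_WET_records_for_crawl.sh CC-MAIN-2019-43

       # when it finishes, check the output locations:
       hdfs dfs -ls /user/vagrant/CC-MAIN-2019-43/wet/
       ls /vagrant/CC-MAIN-2019-43-wet-files/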
2. Grab our 2nd bash script and put it into the top level of cc-index-table (/home/vagrant/cc-index-table):

       cd cc-index-table
       wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh
       chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh

   RUN FROM THE cc-index-table DIRECTORY AS:
       (cd cc-index-table)
       ./get_Maori_WET_records_from_CCSep2018_on.sh

   This script just runs the 1st script, cc-index-table/src/script/get_maori_WET_records_for_crawl.sh
   (above), to process all listed common-crawls since September 2018. If any crawl fails, the script
   will terminate; else it runs against each common-crawl in sequence.

   NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/

   OUTPUT:
   After days of running, you will end up with:
       hdfs:///user/vagrant/<crawl-timestamp>/wet/
   for each crawl-timestamp listed in the script, which at present would have got copied into
       /vagrant/<crawl-timestamp>-wet-files/

   Each of these output wet folders can then be processed in turn by CCWETProcessor.java from
   http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-----------------------------------
H. Autistici crawl
-----------------------------------
Autistici's "crawl": a CLI to download web sites as WARCs, with basic features to avoid crawler traps.

Of the several pieces of software for site mirroring, Autistici's "crawl" seemed promising:
https://anarc.at/services/archive/web/
- CLI.
- Can download a website quite simply, though flags for additional settings are available.
- Coded to prevent common traps.
- Downloads a website as a WARC file.
- I now have the WARC to WET process working for the WARC file it produced for the usual test site
  (Dr Bainbridge's home page).

You need to have Go installed in order to install and run Autistici's crawl. Not a problem, because
I can do it on the remote machine (which also hosts the hdfs) where I have sudo powers.

INSTRUCTIONS

1. Install go 1.11 by following the instructions at
   https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f

2. Create the go environment:

       #!/bin/bash
       # environment vars for golang
       export GOROOT=/usr/local/go
       export GOPATH=$HOME/go
       export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

3. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear
   and don't work as-is at this stage. These steps work:

       cd $GOPATH
       mkdir bin
       mkdir src
       cd src

4. Since trying to "go install" the crawl url directly didn't work
   (https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
   and https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file):

       vagrant@node2:~/go/src$ mkdir -p git.autistici.org/ale
       cd git.autistici.org/ale
       git clone https://git.autistici.org/ale/crawl.git

   [Now you can run the install command in README.md:]
       cd $GOPATH/src
       go install git.autistici.org/ale/crawl/cmd/crawl

   Now we should have a $GOPATH/bin folder containing the "crawl" binary.

5. Run a crawl:

       cd $GOPATH/bin
       ./crawl https://www.cs.waikato.ac.nz/~davidb/

   which downloads the site and puts the warc file into the $GOPATH/bin folder.

   More options, including the output folder and a WARC filename pattern for huge sites (so that
   multiple warc files created for one site follow the same pattern), are all in the instructions in
   README.md.

6. To view the RAW contents of a WARC file (see
   https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives):

       zless <file.warc.gz>

   zless is already installed on the vagrant machine.
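   To compare the header layout of a crawl-produced WARC with a CommonCrawl one (relevant to the
   conversion issues discussed next), it can help to peek at just the first records. A simple way,
   assuming the output file is named crawl.warc.gz:

       zcat crawl.warc.gz | head -40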
-----------------------------------------------------------------------------------------------
How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
-----------------------------------------------------------------------------------------------
ISSUES CONVERTING WARC to WET:
---
WARC files produced by Autistici's crawl are of a somewhat different format to CommonCrawl WARCs:
- missing elements in the header
- different header elements
- different ordering (if that matters)

But WET is an official format, not CommonCrawl-specific, as indicated by
https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis:

    "WET (parsed text)
    WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by
    archived document. Each record retains the associated URL and timestamp. Common Crawl provides
    details on the format and Internet Archive provides documentation on usage, though they use
    different names for the format."

So it must be possible to get the WARC to WET conversion used for CommonCrawl data to work on
Autistici crawl's WARC files.

RESOLUTION:
---
I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools,
which we use for the WARC to WET processing of CommonCrawl data. These git projects (with
modifications for commoncrawl) are already on
http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.

The changed files are as follows:

1. patches/WATExtractorOutput.java
   put into ia-web-commons/src/main/java/org/archive/extract
   after renaming the existing file to .orig

   THEN RECOMPILE ia-web-commons with:
       mvn install

2. patches/GZRangeClient.java
   put into ia-hadoop-tools/src/main/java/org/archive/server
   after renaming the existing file to .orig

   THEN RECOMPILE ia-hadoop-tools with:
       mvn package

Make sure to first compile ia-web-commons, then ia-hadoop-tools.
(A sketch of these patching steps as shell commands follows below, after the diffs.)

The modifications made to the above 2 files are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java

   [diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]

   162,163c162,163
   <       targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
   <     } else {
   ---
   >       targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
   >     } else {

2. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java

   [diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]

   76,83c76,82
   <     "WARC/1.0\r\n" +
   <     "WARC-Type: warcinfo\r\n" +
   <     "WARC-Date: %s\r\n" +
   <     "WARC-Filename: %s\r\n" +
   <     "WARC-Record-ID: \r\n" +
   <     "Content-Type: application/warc-fields\r\n" +
   <     "Content-Length: %d\r\n\r\n";
   <
   ---
   >     "WARC/1.0\r\n" +
   >     "Content-Type: application/warc-fields\r\n" +
   >     "WARC-Type: warcinfo\r\n" +
   >     "WARC-Warcinfo-ID: \r\n" +
   >     "Content-Length: %d\r\n\r\n" +
   >     "WARC-Record-ID: \r\n" +
   >     "WARC-Date: %s\r\n";

   115,119c114,119
   <   private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
   <     "format: WARC File Format 1.0\r\n" +
   <     "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
   <     "publisher: Internet Archive\r\n" +
   <     "created: %s\r\n\r\n";
   ---
   >   private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
   >     "Format: WARC File Format 1.0\r\n" +
   >     "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
   >     // +
   >     //"publisher: Internet Archive\r\n" +
   >     //"created: %s\r\n\r\n";
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
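A sketch of applying the two patches, assuming the patches/ folder from the trac gitprojects area
sits alongside the two checked-out projects (adjust paths to wherever your copies actually live):

    # 1. ia-web-commons
    cd ia-web-commons
    mv src/main/java/org/archive/extract/WATExtractorOutput.java \
       src/main/java/org/archive/extract/WATExtractorOutput.orig
    cp ../patches/WATExtractorOutput.java src/main/java/org/archive/extract/
    mvn install

    # 2. ia-hadoop-tools (compile only after ia-web-commons has been installed)
    cd ../ia-hadoop-tools
    mv src/main/java/org/archive/server/GZRangeClient.java \
       src/main/java/org/archive/server/GZRangeClient.orig
    cp ../patches/GZRangeClient.java src/main/java/org/archive/server/
    mvn package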
3. To run WARC to WET, the warc file needs to live on hdfs in a "warc" folder, and there should be
   "wet" and "wat" folders at the same level.

   For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
   (the default location and filename, unless you pass flags to the crawl CLI to control these).

   a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET
      installed, recompiled with the above modifications.

   b. Now create the folder structure needed for warc-to-wet conversion:

          hdfs dfs -mkdir /user/vagrant/warctest
          hdfs dfs -mkdir /user/vagrant/warctest/warc
          hdfs dfs -mkdir /user/vagrant/warctest/wet
          hdfs dfs -mkdir /user/vagrant/warctest/wat

   c. Put crawl.warc.gz into the warc folder on hdfs:

          hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.

   d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools:

          cd ia-hadoop-tools
          WARC_FOLDER=/user/vagrant/warctest/warc
          $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar \
              WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz

      This is more meaningful when the WARC_FOLDER contains multiple *.warc.gz files, as the above
      will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input
      *.warc.gz file.

   e. Copy the generated wet files across from /user/vagrant/warctest/wet/:

          (cd /vagrant   or else   cd /home/vagrant)
          hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .

      or, when dealing with multiple input warc files, we'll have multiple wet files:

          hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz

   f. Now you can view the contents of the WET files to confirm they are what we want:

          gunzip crawl.warc.wet.gz
          zless crawl.warc.wet

      The wet file contents should look good now: the web pages as WET records without html tags.

----------------------------------------------------
I. Setting up Nutch v2 on its own Vagrant VM machine
----------------------------------------------------
1. Untar vagrant-for-nutch2.tar.gz

2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt

---
REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
---
We were able to get nutch v1 working on a regular machine.

From a few pages online, starting with https://stackoverflow.com/questions/33354460/nutch-clone-website,
it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2
introduced the -all flag to the "./bin/nutch fetch" command. And nutch v2 requires HBase, which
presupposes hadoop. Our vagrant VM for commoncrawl had an incompatible version of HBase, but that
version was needed for that VM's version of hadoop and spark.

So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2, which would have
the version of HBase it needed and a Hadoop version matching that. Compatible versions with nutch
2.3.1 are mentioned at
https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/

(Another option was MongoDB instead of HBase,
https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even
covered in the apache nutch 2 installation guide.)

---
Vagrant VM for Nutch2
---
This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark

However:
- It comes with the older versions of hadoop 2.5.2 and hbase 0.98.21, and no spark or hive or other
  packages.
- The VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101).
- Since not all packages are installed, fewer ports needed forwarding. And they're forwarded to
  port number + 2, so as not to conflict with any vagrant VM that used the original vagrant image's
  forwarded port numbers.
- scripts/common.sh uses an HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz"
  (the -hadoop2 suffix is specific to v0.98.21).
- And hbase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional
  -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link creation
  there needed to refer to a path of this form.

INSTRUCTIONS:

a. Mostly follow the "Getting Started" instructions at
   https://github.com/martinprobson/vagrant-hadoop-hive-spark

b. But after step 3, replace the github-cloned Vagrantfile, scripts and resources folders with their
   modified counterparts included in the tarball that can be downloaded by visiting
   http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz

c. Wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port
   numbers, use instead "node2", IP "10.211.55.102" and the forwarded port numbers in the customised
   Vagrantfile.

   If there's already a node2 / IP "10.211.55.102" vagrant VM set up, then adjust all the files in
   the git-cloned vagrant VM (already modified by the contents of this vagrant-for-nutch2 folder)
   as follows:
   - increment all occurrences of node2 and "10.211.55.102" to node3 and IP "10.211.55.103", if not
     already taken, and
   - in the Vagrantfile, increment the forwarded ports by another 2 or so from the highest port
     number values already in use by other vagrant VMs.

d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or
   "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 in the "Getting Started" section.

e. Inside the VM, install emacs, maven, firefox:

       sudo apt-get install emacs

       sudo apt update
       sudo apt install maven

       sudo apt-get -y install firefox

f. We set up nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/,
   as that version worked as per the nutch2 tutorial instructions with the configuration of specific
   versions of hadoop, hbase and gora for the vagrant VM described here.

   After untarring the nutch 2.3.1 source tarball:
   1. Move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
   2. Download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from
      http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf
      and put them into the apache-nutch-2.3.1/conf folder.
   3. Then continue following the nutch tutorial 2 instructions at
      https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial
      to set up nutch2 (and apache-solr, if needed, but I didn't install apache-solr for nutch v2).

      - nutch-site.xml has already been configured to do as much optimisation and speeding up of the
        crawling as we know about concerning nutch.
      - For each site that will be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live
        "regex-urlfilter.txt" file, and lines of regex filters will be appended to its end.
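Before starting any crawls on this VM, it can be worth checking that the Hadoop and HBase daemons are
actually up. This is only a rough check; the exact daemon names depend on how the VM's start-up
scripts launch things:

    jps
    # expect to see processes such as NameNode, DataNode, HMaster and HRegionServer listed

    # and confirm HBase responds:
    echo "list" | hbase shell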
------------------------------------------------------------------------
J. Automated crawling with Nutch v2.3.1 and post-processing
------------------------------------------------------------------------
1. When you're ready to start crawling with Nutch 2.3.1:
   - Copy the batchcrawl.sh file (from
     http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts)
     into the vagrant machine at top level. Make the script executable.
   - Copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by CCWETProcessor.java
     running over the common-crawl downloaded data where MRI was the primary language) and put it
     into the vagrant machine at top level.
   - Run batchcrawl.sh on a site or range of sites not yet crawled, e.g.
         ./batchcrawl.sh 00485-00500

2. When crawling is done, the above will have generated the "crawled" folder containing a subfolder
   for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled site folder will
   contain a dump.txt with the text output of the site's web pages.

   The "crawled" folder, with site subfolders each containing a dump.txt file, can be processed with
   NutchTextDumpProcessor.java.

------------------------------------------------------------------------
K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
------------------------------------------------------------------------
1. The crawled folder should contain all the batch crawls done with nutch (section J above).

2. Set up the mongodb connection properties in conf/config.properties.
   By default, the mongodb database name is configured to be ateacrawldata.

3. Create a mongodb database of the specified name, i.e. a database named "ateacrawldata" needs to
   be created, unless the default db name is changed.

4. Set up the environment and compile NutchTextDumpProcessor:

       cd maori-lang-detection/apache-opennlp-1.9.1
       export OPENNLP_HOME=`pwd`
       cd maori-lang-detection/src

       javac -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB.java

5. Pass the crawled folder to NutchTextDumpProcessor:

       java -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB /PATH/TO/crawled

6. It may take 1.5 hours or so to ingest the approximately 1450 crawled sites' data into mongodb.

7. Launch the Robo 3T MongoDB client (version 1.3 is the one we tested). Use it to connect to
   MongoDB's "ateacrawldata" database. Now you can run queries.

Here are most of the important MongoDB queries I ran, and their shorter answers.
# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
117496

# Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
361

# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# Obviously, the union of the above two will be identical to the numPagesContainingMRI count:
db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
868

# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI:true}).count()
7818

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
20371

# Number of sites with crawled web pages that have URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
670

# Number of websites outside NZ that contain /mi(/) OR http(s)://mi.*
# in any of their crawled webpage urls
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
656

# So there are 14 sites in NZ with URLs containing /mi(/) OR http(s)://mi.*
14

PROJECTION QUERIES:

# For all the sites that do not originate in NZ, list their country codes (geoLocationCountryCode
# field) and the urlContainsLangCodeInPath field
db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"nz"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
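These queries were run in Robo 3T, but they can equally be run non-interactively with the mongo shell
client, which is handy for re-generating counts later. A sketch, assuming the mongo shell is
installed on the machine running MongoDB and the default database name ateacrawldata was kept:

    mongo ateacrawldata --eval 'db.getCollection("Websites").find({numPagesInMRI: {$gt: 0}}).count()'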
AGGREGATION QUERIES:
The results of the important aggregate queries here can be found in the associated
mongodb-data/counts*.json files.

# count of country codes for all sites
db.Websites.aggregate([
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

# count of country codes for sites that have at least one page detected as MRI
db.Websites.aggregate([
    { $match: { numPagesInMRI: {$gt: 0} } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

# count of country codes for sites that have at least one page containing at least one sentence detected as MRI
db.Websites.aggregate([
    { $match: { numPagesContainingMRI: {$gt: 0} } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

# ATTEMPT TO FILTER OUT LIKELY AUTO-TRANSLATED SITES
# Get a count of all non-NZ (or .nz TLD) sites that don't have /mi(/) or http(s)://mi.*
# in the URL path of any crawled web pages of the site
db.getCollection('Websites').find({$and: [
    {numPagesContainingMRI: {$gt: 0}},
    {geoLocationCountryCode: {$ne: "NZ"}},
    {domain: {$not: /.nz$/}},
    {urlContainsLangCodeInPath: {$ne: true}}
]}).count()
220

# Aggregate: count by country codes of non-NZ related sites that
# don't have the language code in the URL path on any crawled pages of the site
db.Websites.aggregate([
    { $match: { $and: [
        {numPagesContainingMRI: {$gt: 0}},
        {geoLocationCountryCode: {$ne: "NZ"}},
        {domain: {$not: /.nz$/}},
        {urlContainsLangCodeInPath: {$ne: true}}
    ] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);

The above query contains "domain: { $addToSet: '$domain' }", which adds the list of matching domains
for each country code to the output of the aggregate result. This is useful as I'll be inspecting
these manually, to ensure they're not auto-translated and to further reduce the list if necessary.

For each resulting domain, I can then inspect that website's pages in the Webpages mongodb collection
for whether those pages are relevant or auto-translated, with a query of the following form. This
example works with the sample site URL https://www.lexilogos.com:

db.getCollection('Webpages').find({URL:/lexilogos\.com/, mriSentenceCount: {$gt: 0}})

In inspecting the Australian sites in the result list, I noticed one that should not be excluded from
the output: https://www.kiwiproperty.com. The TLD is not .nz, and the site originates in Australia,
not NZ, but it's still a site with NZ content. This will be an important consideration when
constructing some of the aggregate queries further below.
# Count of websites that have at least 1 page containing at least one sentence detected as MRI
# AND which have mi in the URL path:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
491

# The websites that have some MRI detected AND which are either in NZ or have an NZ TLD
# or (if they're from overseas) don't contain /mi or mi.* in the URL path:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count()
396

# Include Australia, to get the valid "kiwiproperty.com" website included in the result list:
db.getCollection('Websites').find({$and: [
    {numPagesContainingMRI: {$gt: 0}},
    {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
]}).count()
397

# aggregate results by a count of country codes
db.Websites.aggregate([
    { $match: { $and: [
        {numPagesContainingMRI: {$gt: 0}},
        {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
    ] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

# Just considering those sites outside NZ or without a .nz TLD:
db.Websites.aggregate([
    { $match: { $and: [
        {geoLocationCountryCode: {$ne: "NZ"}},
        {domain: {$not: /\.nz/}},
        {numPagesContainingMRI: {$gt: 0}},
        {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    ] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);

# counts by country code excluding NZ related sites
db.getCollection('Websites').find({$and: [
    {geoLocationCountryCode: {$ne: "NZ"}},
    {domain: {$not: /\.nz/}},
    {numPagesContainingMRI: {$gt: 0}},
    {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
]}).count()
221 websites

# But to produce the tentative non-product sites, we also want the aggregate for all NZ sites
# (from NZ or with a .nz tld):
db.getCollection('Websites').find({$and: [
    {numPagesContainingMRI: {$gt: 0}},
    {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
]}).count()
176

(The total is 221+176 = 397, which adds up.)

# Get the count (and domain listing) output put under a hardcoded _id of "nz":
db.Websites.aggregate([
    { $match: { $and: [
        {numPagesContainingMRI: {$gt: 0}},
        {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
    ] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "nz", count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);

# Manually inspected shortlist of the 221 non-NZ websites to weed out those that aren't MRI
# (weeding out those misdetected as MRI, autotranslated, or just containing placenames etc),
# then adding the 176 NZ sites on top:

MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY:
    NZ: 176
    US: 25
    AU: 3
    DE: 2
    DK: 2
    BG: 1
    CZ: 1
    ES: 1
    FR: 1
    IE: 1
    TOTAL: 213

Manually created a counts.json file for the above, named "6counts_nonProductSites1_manualShortlist.json".

--------------------------------------------------------
APPENDIX: Legend of mongodb-data folder's contents
--------------------------------------------------------
1. allCrawledSites: all sites from CommonCrawl where the content-language=MRI, which we then crawled
   with Nutch with depth=10. Some obvious auto-translated websites were skipped.
2. sitesWithPagesInMRI: those sites of point 1 above which contained one or more pages that openNLP
   detected as having MRI as the primary language.

3. sitesWithPagesContainingMRI.json: those sites of point 1 where one or more pages contained at
   least one "sentence" for which the primary language detected by OpenNLP was MRI.

4. tentativeNonProductSites: sites of point 3, excluding those non-NZ sites that had "mi.*" or "*/mi"
   in the URL path.

5. tentativeNonProductSites1: similar to point 4, but "NZ sites" in this set were not just those
   detected as originating in NZ (hosted on NZ servers?) but also any with a TLD of .nz, regardless
   of the site's country of origin.

6. nonProductSites1_manualShortlist: based on point 5, but with all the non-NZ sites manually
   inspected for any that were not actually sources of MRI content. For example, sites where the
   content was in a different language misdetected by openNLP (and commoncrawl's language detection)
   as MRI, any further sites that were autotranslated, or sites where the "MRI" detected content was
   photos captioned with NZ placenames constituting the "sentence(s)" detected as being MRI.

a. All .json files that contain the "counts_" prefix are the counts by country code for each of the
   above variants. The comments section at the top of each such *counts_*.json file usually contains
   the mongodb query used to generate the json content of the file.

b. All .json files that contain the "geojson-features_" and "multipoint_" prefixes for each of the
   above variants are generated by running org/greenstone/atea/CountryCodeCountsMapData.java on the
   *counts_*.json file.

   Run as:
       cd maori-lang-detection/src
       java -cp ".:../conf:../lib/*" org/greenstone/atea/CountryCodeCountsMapData ../mongodb-data/[1-6]counts*.json

   This will then generate the *multipoint_*.json and *geojson-features_*.json files for any of the
   above 1-6 variants of the input counts json file.

c. All .png files that contain the "map_" prefix for each of the above variants were screenshots of
   the map generated by http://geojson.tools/ for each *geojson-features_*.json file. GIMP was used
   to crop each screenshot to the area of interest.

--------------------------------------------------------
APPENDIX: Reading data from hbase tables and backing up hbase
--------------------------------------------------------
* Backing up the HBase database:
  https://blogs.msdn.microsoft.com/data_otaku/2016/12/21/working-with-the-hbase-import-and-export-utility/
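  A minimal sketch of what that looks like with HBase's bundled Export/Import map-reduce utilities
  (the table name and HDFS paths below are just examples; for Import, the target table must already
  exist with the same structure):

      # export one table to a folder on hdfs
      hbase org.apache.hadoop.hbase.mapreduce.Export '01066_webpage' /user/vagrant/hbase-backup/01066_webpage

      # copy the exported data out of hdfs for safekeeping
      hdfs dfs -get /user/vagrant/hbase-backup/01066_webpage /vagrant/hbase-backup/

      # later, restore into an existing table of the same structure
      hbase org.apache.hadoop.hbase.mapreduce.Import '01066_webpage' /user/vagrant/hbase-backup/01066_webpage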
* From an image at http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/ :
  to see the contents of a table, inside the hbase shell, type:
      scan 'tablename'
  e.g. scan '01066_webpage', and hit enter.

To list tables and see their "column families" (I don't yet understand what this is):

    hbase shell
    hbase(main):001:0> list
    hbase(main):002:0> describe '01066_webpage'
    Table 01066_webpage is ENABLED
    01066_webpage
    COLUMN FAMILIES DESCRIPTION
    {NAME => 'f', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    {NAME => 'h', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    {NAME => 'il', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    {NAME => 'mk', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    {NAME => 'mtdt', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    {NAME => 'ol', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    {NAME => 'p', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    {NAME => 's', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    8 row(s) in 0.1180 seconds

-----------------------EOF------------------------