Changeset 33809

Timestamp:
17.12.2019 19:53:17
Author:
ak19
Message:

Some more GS_README.TXT instructions. Have not put the mongodb queries in here yet; they're still in MoreReading/mongodb.txt, but the final queries that are useful will end up in this file later on.

Files:
1 modified

  • other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33618 r33809
     I. Setting up Nutch v2 on its own Vagrant VM machine
     J. Automated crawling with Nutch v2.3.1 and post-processing
    +K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
    +---
    +
    +APPENDIX: Reading data from hbase tables and backing up hbase
     
     ----------------------------------------
    …
     
     
    +------------------------------------------------------------------------
    +K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
    +------------------------------------------------------------------------
    +1. The crawled folder should contain all the batch crawls done with nutch (section J above).
    +
    +2. Set up the mongodb connection properties in conf/config.properties (see the example sketch after this diff).
    +By default, the mongodb database name is configured to be ateacrawldata.
    +
    +3. Create a mongodb database with the configured name. Unless the default db name is changed, this means creating a database named "ateacrawldata" (see the sketch after this diff).
    +
    +4. Set up the environment and compile NutchTextDumpToMongoDB:
    +   cd maori-lang-detection/apache-opennlp-1.9.1
    +   export OPENNLP_HOME=`pwd`
    +   cd maori-lang-detection/src
    +
    +   javac -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB.java
    +
    +5. Pass the crawled folder to NutchTextDumpToMongoDB:
    +   java -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB /PATH/TO/crawled
    +
    +6. It may take around 1.5 hours to ingest the data for the approximately 1450 crawled sites into mongodb.
    +
    +7. Launch the Robo 3T MongoDB client (version 1.3 is the one we tested) and use it to connect to MongoDB's "ateacrawldata" database.
    +Now you can run queries (example sketches follow this diff).
    +
     --------------------------------------------------------
    -K. Reading data from hbase tables and backing up hbase
    +APPENDIX: Reading data from hbase tables and backing up hbase
     --------------------------------------------------------
     
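
Example for step 2: a minimal sketch of what the mongodb entries in conf/config.properties might look like. The property keys below are assumptions for illustration (the actual keys are whatever the project's code reads); only the default database name, ateacrawldata, is confirmed by the README text above.

   # Hypothetical mongodb settings for conf/config.properties -- key names are placeholders
   mongodb.host=localhost
   mongodb.port=27017
   mongodb.user=
   mongodb.password=
   mongodb.dbname=ateacrawldata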
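
Example for step 3: one way to create the database from the mongo shell, assuming a MongoDB instance on the default host and port. MongoDB creates a database lazily on its first write, so switching to it and inserting any document materialises it; the collection name used here is a placeholder.

   // In the mongo shell (placeholder collection name):
   use ateacrawldata
   db.placeholder.insertOne({ created: true })   // any first insert creates the database
   show dbs                                      // ateacrawldata should now be listed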
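
Example for step 7: hedged query sketches to try once Robo 3T is connected. Per the commit message, the real queries still live in MoreReading/mongodb.txt, so the collection and field names below (Websites, numPagesInMRI) are placeholders only.

   // Placeholder collection and field names -- substitute the real ones
   db.Websites.count()                                 // how many site records were ingested
   db.Websites.find({ numPagesInMRI: { $gt: 0 } })     // sites with at least one page detected as Maori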