Changeset 33608

Show
Ignore:
Timestamp:
30.10.2019 23:02:26 (2 weeks ago)
Author:
ak19
Message:

1. New script to export from HBase so that we could in theory reimport into HBase. I've not tried the reimport out, but I followed instructions to export and I got a non-zero output file, so I am assuming it worked. 2. Committing today's new crawls in crawledNode4.tar. Each crawled site's folder inside it now includes a file called part-m-* that is the exported Hbase on that node VM. 3. Updated hdfs related GS_README.txt with instructions on viewing the contents of a table in HBase and a link on exporting/importing from HBase. 4. Minor changes like the tar files shouldn't be called tar.gz.

Location:
gs3-extensions/maori-lang-detection
Files:
1 added
3 modified

Legend:

Unmodified
Added
Removed
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33598 r33608  
    665665 
    666666 
     667-------------------------------------------------------- 
     668K. Reading data from hbase tables and backing up hbase 
     669-------------------------------------------------------- 
     670 
     671* Backing up HBase database: 
     672https://blogs.msdn.microsoft.com/data_otaku/2016/12/21/working-with-the-hbase-import-and-export-utility/ 
     673 
     674* From an image at http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/ 
     675to see the contents of a table, inside hbase shell, type: 
     676 
     677   scan 'tablename' 
     678 
     679e.g. scan '01066_webpage' and hit enter. 
     680 
     681 
     682To list tables and see their "column families" (I don't yet understand what this is): 
     683 
     684hbase shell 
     685hbase(main):001:0> list 
     686 
     687hbase(main):002:0> describe '01066_webpage' 
     688Table 01066_webpage is ENABLED                                                                                                                                                                               
     68901066_webpage                                                                                                                                                                                                
     690COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                  
     691{NAME => 'f', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCK 
     692CACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                             
     693{NAME => 'h', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCK 
     694CACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                             
     695{NAME => 'il', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOC 
     696KCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                            
     697{NAME => 'mk', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOC 
     698KCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                            
     699{NAME => 'mtdt', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BL 
     700OCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                          
     701{NAME => 'ol', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOC 
     702KCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                            
     703{NAME => 'p', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCK 
     704CACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                             
     705{NAME => 's', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCK 
     706CACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}                                                                                                                                             
     7078 row(s) in 0.1180 seconds 
     708 
    667709 
    668710-----------------------EOF------------------------ 
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh

    r33574 r33608  
    6464        echo "2. copy the regex-urlfilter file:" 2>&1 | tee -a ${siteDir}UNFINISHED 
    6565        echo "   cp $NUTCH_URLFILTER_TEMPLATE $NUTCH_URLFILTER_FILE" 2>&1 | tee -a ${siteDir}UNFINISHED 
    66         echo "3. Adjust # crawl iterations in old crawl command:\n$crawl_cmd" 2>&1 | tee -a ${siteDir}UNFINISHED 
     66        echo "3. Adjust # crawl iterations in old crawl command:" 2>&1 | tee -a ${siteDir}UNFINISHED 
     67        echo "   $crawl_cmd" 2>&1 | tee -a ${siteDir}UNFINISHED 
    6768    fi 
    6869     
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java

    r33587 r33608  
    8181 
    8282 
    83     // we'll be storing just those sentences in text that are in Māori.  
     83    // we'll be storing just those sentences in the text that are in Māori.  
    8484     
    8585    // OpenNLP language detection works best with a minimum of 2 sentences