source: gs2-extensions/ngramj/src/README.txt@25141

Last change on this file since 25141 was 25141, checked in by papitha, 12 years ago

NGRAMJ PERL MODULE ADDED /MAORI LANGUAGE GUESSING WORKING WELL!!

HOW TO ADD A NEW LANGUAGE (example: Maori language guessing)

Generating a Plain Text Corpus from Wikipedia

Step 1: Download the Wikipedia Extractors Toolkit. The first thing to do is download the toolkit below and extract it somewhere:

 wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
 tar zxvf wikipedia2text_rsm_mods.tgz
 cd wikipedia2text

Step 2: Download and Extract the Wikipedia Data Dump
 a) Go to http://download.wikimedia.org/.
 b) Click on "Database backup dumps" - http://dumps.wikimedia.org/backup-index.html
 c) Click on the wiki for your language - e.g. for Maori, click on miwiki.
 d) Download *-pages-articles.xml.bz2 - e.g. miwiki-20120218-pages-articles.xml.bz2

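Steps a)-d) above can also be scripted. This is only a sketch, assuming the 2012-02-18 Maori dump: substitute your language code and the date of the latest dump listed on the backup index page.

```shell
# Sketch only: lang and stamp are placeholders for your language code
# and the dump date shown on the backup index page.
lang=mi
stamp=20120218
dump="${lang}wiki-${stamp}-pages-articles.xml.bz2"
url="http://dumps.wikimedia.org/${lang}wiki/${stamp}/${dump}"
wget --tries=1 --timeout=30 "$url" || echo "download failed: $url" >&2
[ -f "$dump" ] && bunzip2 "$dump"
```
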
Step 3: Extract Article Data from the Wikipedia Data

 a) You now have one big XML file containing all of the Wikipedia articles. The next step is to extract the individual articles and strip out everything else.
 b) Create a directory for your output and run xmldump2files.py against the .xml file you obtained in the previous step:
 mkdir out
 ./xmldump2files.py miwiki-20120218-pages-articles.xml out

Note - This step will take a few hours, depending on your hardware.

Step 4: Parse the Article Wiki Markup into XML
 The next step is to take the extracted articles and parse the Wikimedia markup into an XML form that we can later recover the plain text from. A shell script is generated for each core; each script executes the Wikimedia-to-XML command on part of the file set.

 a) To list the files these shell scripts will process:
 find out -type f | grep '\.txt$' >mi.files
 b) To split mi.files into several .sh files (one per core):
 java -jar sleep.jar into8.sl mi.files
 c) You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl:
 cat >launch.sh
 ./files0.sh &
 ./files1.sh &
 ./files2.sh &
 ...
 ./files15.sh &
 ^D
 d) Next, launch these shell scripts:
 ./launch.sh

 Note- 1) The command these scripts run for each file carries the comment "Converts Wikipedia articles in wiki format into an XML format." It can segfault or go into an apparently infinite loop, and the PHP processes really do freeze or crash. Watching top and killing errant processes by hand works, but it is time-consuming and makes the run take longer than it should. Instead, use a script that kills any php process that has run for more than two minutes. To launch it: java -jar sleep.jar watchthem.sl
 2) The .txt files are converted to .xml files at this stage. A quick check: look in your out directory and confirm that the .txt files have matching .xml files. If you don't find them, watch for error messages; it is best to remove the files that trigger the errors.
 3) Otherwise, just let the program run.
 4) Expect this step to take several more hours, depending on your hardware.

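The watchthem.sl script itself is not reproduced in this README, but the two-minute kill rule from note 1 can be sketched in plain shell. This is an approximation with a made-up function name (kill_stale_php), not the actual Sleep script:

```shell
# Approximation of the watchthem.sl idea: kill any php process that
# has been running for more than two minutes.
kill_stale_php() {
    for pid in $(pgrep php); do
        # etimes = elapsed seconds since the process started
        secs=$(ps -o etimes= -p "$pid")
        [ "${secs:-0}" -gt 120 ] && kill "$pid"
    done
    return 0
}
# Run it periodically while the conversion scripts are active, e.g.:
# while true; do kill_stale_php; sleep 30; done
```
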
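The quick check in note 2 can be automated. A sketch (check_missing_xml is a hypothetical helper, not part of the toolkit) that lists any extracted .txt article with no matching .xml file:

```shell
# Hypothetical helper: report .txt files under the given directory
# that lack a matching .xml file, i.e. articles whose conversion
# failed or has not finished yet.
check_missing_xml() {
    find "$1" -type f -name '*.txt' | while read -r txt; do
        [ -f "${txt%.txt}.xml" ] || echo "missing XML for: $txt"
    done
}
# e.g.: check_missing_xml out
```

Any file it reports is a candidate for removal, per note 2.
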
Step 5: Extract Plain Text from the Articles
 To extract the article plain text from the XML files, run:
 ./wikiextract.py out maori_plaintext.txt

 Note - This command creates a file called maori_plaintext.txt containing the entire plain text content of the Maori Wikipedia. Expect it to take a few hours, depending on your hardware.

READY TO ADD A NEW LANGUAGE

Step 6: The first step is to create a raw language profile. You can do this with the cngram.jar file, using the corpus from Step 5 as input:
 $ java -jar cngram.jar -create mi_big maori_plaintext.txt
 new profile 'mi_big.ngp' was created.

 This will create an mi_big.ngp file.

Step 7: Save the script below as sortit.sl

 %grams = ohash();
 setMissPolicy(%grams, { return @(); });

 $handle = openf(@ARGV[0]);
 $banner = readln($handle);
 readln($handle); # consume the ngram_count value

 while $text (readln($handle)) {
    ($gram, $count) = split(' ', $text);

    if (strlen($gram) <= 2 || $count > 20000) {
       push(%grams[strlen($gram)], @($gram, $count));
    }
 }
 closef($handle);

 sub sortTuple {
    return $2[1] <=> $1[1];
 }

 println($banner);

 printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
 printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
 printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
 printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));

Step 8: Run the script:

 $ java -jar lib/sleep.jar sortit.sl mi_big.ngp >mi.ngp

 1) Copy mi.ngp into src/de/spieleck/app/cngram/.
 2) Edit src/de/spieleck/app/cngram/profiles.lst to contain the mi resource.
 3) Type ant in the top-level directory of the source code to rebuild cngram.jar. Then you're ready to test:
 4) $ java -jar cngram.jar -lang2 a.txt
    You should see a message like:
    speed: mi:0.863 ro:0.005 it:0.009 bg:0.000 |9.9E-2 |0.0E0 dt=1933