source: gs2-extensions/ngramj/src/wiki/wikipedia2text/CHANGES

Last change on this file was 25141, checked in by papitha, 12 years ago

NGRAMJ PERL MODULE ADDED /MAORI LANGUAGE GUESSING WORKING WELL!!

This package is originally from:

http://evanjones.ca/software/wikipedia2text.html

I've modified it to make it suitable for extracting plaintext from an entire Wikipedia corpus.

My mods are:

- Included sleep.jar to run nifty Sleep scripts. You'll need Java 1.4.2+ for this interpreter to work.
-- Added into8.sl to create 16 shell scripts (and a launch.sh) that convert article markup into XML in a way that takes advantage of multiple cores
-- Added watchthem.sl to kill PHP processes that have run for more than two minutes
-- Added makecorpus.sl to split the plaintext file into 768 separate text files

- Modified wikiextract.py to process each file in a try/except block. This way, if one file causes the process to barf, the remaining files are still processed.
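
The per-file error handling described for wikiextract.py can be sketched roughly as below. This is only an illustration of the pattern, not the actual code in wikiextract.py: the names extract_text and process_all are hypothetical, and the stand-in extraction step simply reads the file.

```python
import sys
import traceback
from pathlib import Path


def extract_text(path):
    """Hypothetical stand-in for the real per-file extraction step."""
    return Path(path).read_text(encoding="utf-8")


def process_all(paths, extract=extract_text):
    """Run extract() on every path; log and skip any file that raises."""
    succeeded, failed = [], []
    for path in paths:
        try:
            extract(path)
            succeeded.append(path)
        except Exception:
            # Report the failure on stderr, then carry on with the next file
            # so that one bad article does not stop the whole run.
            print(f"skipping {path}:", file=sys.stderr)
            traceback.print_exc()
            failed.append(path)
    return succeeded, failed
```

Returning the list of failed paths is a design choice made for this sketch; it makes it easy to retry or inspect the problem files after a long run.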

See my blog for instructions on how to use this:

http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/

Contact:

Raphael Mudge ([email protected])

This code is released under the BSD license.
http://www.opensource.org/licenses/bsd-license.php