Last change on this file was 25141, checked in by papitha, 12 years ago:
NGRAMJ PERL MODULE ADDED /MAORI LANGUAGE GUESSING WORKING WELL!!
File size: 1.0 KB

This package is originally from:

http://evanjones.ca/software/wikipedia2text.html

I've modified it to make it suitable for extracting plaintext from an entire Wikipedia corpus.

My mods are:

- Included sleep.jar to run nifty Sleep scripts. You'll need Java 1.4.2+ for this interpreter to work.
-- Added into8.sl to create 16 shell scripts (and a launch.sh) to convert article markup into XML in a way that takes advantage of multiple cores
-- Added watchthem.sl to kill PHP processes that have run for more than two minutes
-- Added makecorpus.sl to split plaintext file into 768 separate text files

- Modified wikiextract.py to process each file in a try/except block. This way, if one file causes the process to barf, it doesn't stop the whole run.
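The per-file error handling described for wikiextract.py can be sketched roughly as below. This is not the actual code from wikiextract.py: the names process_all and convert are hypothetical, and convert stands in for whatever function turns one input file into text.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def process_all(
    paths: Iterable[Path], convert: Callable[[Path], str]
) -> Tuple[List[str], List[Path]]:
    """Run convert on every path; return the results and the paths that failed.

    convert is a hypothetical stand-in for the real per-file extraction step.
    """
    results: List[str] = []
    failed: List[Path] = []
    for path in paths:
        try:
            results.append(convert(path))
        except Exception:
            # Log the traceback and carry on with the next file
            # instead of letting one bad file abort the whole run.
            logging.exception("skipping %s", path)
            failed.append(path)
    return results, failed
```

Returning the failed paths makes it easy to inspect or retry the problem files once the rest of the corpus has been processed.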
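For readers without a Sleep interpreter handy, the idea behind makecorpus.sl (splitting one plaintext file into 768 text files) might look something like the Python sketch below. How makecorpus.sl actually divides the text is not stated here, so the round-robin split by line, the function names, and the partNNN.txt file naming are all assumptions.

```python
from pathlib import Path
from typing import List, Sequence


def split_round_robin(lines: Sequence[str], n_parts: int) -> List[List[str]]:
    """Deal lines out to n_parts buckets in turn (assumed splitting strategy)."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts: List[List[str]] = [[] for _ in range(n_parts)]
    for index, line in enumerate(lines):
        parts[index % n_parts].append(line)
    return parts


def write_parts(source: Path, out_dir: Path, n_parts: int = 768) -> None:
    """Write the buckets to out_dir as part000.txt, part001.txt, and so on.

    The whole source file is read into memory, which keeps the sketch short
    but may be impractical for a full Wikipedia dump.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    with source.open(encoding="utf-8") as handle:
        lines = handle.readlines()
    for index, part in enumerate(split_round_robin(lines, n_parts)):
        target = out_dir / f"part{index:03d}.txt"
        target.write_text("".join(part), encoding="utf-8")
```

Calling write_parts(Path("corpus.txt"), Path("corpus")) would produce 768 files under corpus/, where corpus.txt is a placeholder name for the extracted plaintext.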
See my blog for instructions on how to use this:

http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/

Contact:

Raphael Mudge ([email protected])

This code is released under the BSD license.
http://www.opensource.org/licenses/bsd-license.php