CHANGES

Last change on this file was 25141, checked in by papitha, 12 years ago
NGRAMJ PERL MODULE ADDED /MAORI LANGUAGE GUESSING WORKING WELL!!

This package is originally from:

http://evanjones.ca/software/wikipedia2text.html

I've modified it to make it suitable for extracting plaintext from an entire Wikipedia corpus.

My mods are:

- Included sleep.jar to run nifty Sleep scripts. You'll need Java 1.4.2+ for this interpreter to work.
-- Added into8.sl to create 16 shell scripts (and a launch.sh) that convert article markup into XML in a way that takes advantage of multiple cores
-- Added watchthem.sl to kill PHP processes that have run for more than two minutes
-- Added makecorpus.sl to split the plaintext file into 768 separate text files

- Modified wikiextract.py to process each file in a try/except block. This way, if one file causes the process to barf, the rest of the run keeps going
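
For anyone curious what the per-file error handling looks like, here is a minimal sketch of the idea. It is not the actual code in wikiextract.py: the names extract_text and process_all are hypothetical, the "empty article" check is only a stand-in for whatever can go wrong during extraction, and the sketch assumes Python 3.

```python
import sys
from pathlib import Path


def extract_text(path):
    """Hypothetical stand-in for the real per-article extraction step."""
    text = Path(path).read_text(encoding="utf-8")
    if not text.strip():
        # Placeholder failure condition, just so there is something to catch.
        raise ValueError("empty article")
    return text.strip()


def process_all(paths, extract=extract_text):
    """Run `extract` on every path; record failures instead of aborting."""
    results = {}
    failures = []
    for path in paths:
        try:
            results[str(path)] = extract(path)
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            failures.append((str(path), exc))
            print(f"skipping {path}: {exc}", file=sys.stderr)
    return results, failures
```

Calling process_all(list_of_files) returns the extracted text keyed by file name, plus a list of (file, exception) pairs for the files that were skipped, so you can go back and look at the bad ones after the run.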

See my blog for instructions on how to use this:

http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/

Contact:

Raphael Mudge ([email protected])

This code is released under the BSD license.
http://www.opensource.org/licenses/bsd-license.php
