Kea -- Automatic Keyphrase Extraction

Copyright 1998-1999 by Gordon Paynter and Eibe Frank
Contact [email protected] or [email protected]

 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.


***************
0. Introduction
***************

Kea is a program for extracting keyphrases from text and HTML files.
The Kea algorithm is described in these papers:
 * Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin,
   and Craig G. Nevill-Manning (1999) "KEA: Practical Automatic
   Keyphrase Extraction."
 * Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and
   Craig G. Nevill-Manning (1999) "Domain-Specific Keyphrase Extraction."
These papers, and others, and our Kea implementation, are available from
the technology section of the New Zealand Digital Library web site at
 http://www.nzdl.org/

Kea was mostly implemented by Gordon Paynter ([email protected])
and Eibe Frank ([email protected]). Craig Nevill-Manning
and Carl Gutwin have worked on earlier versions; there's even
a chance that some of their semi-colons are still in service.
Please contact Gordon about the general implementation or Eibe about
the Java side of things.

This document describes the current Kea implementation. It is divided
into these sections:
 0. This introduction
 1. Version History
 2. System requirements
 3. Extracting keyphrases
 4. Using models
 5. Making models
 6. The Kea files
 7. Advanced Kea options


******************
1. Version History
******************

There were many pre-1.0 versions of Kea; they are mostly forgotten.

Version 1.0 of Kea was the version used in the paper by Witten et al.
described above. It was distributed to very few people.

Version 1.1 of Kea is the first "public" version, and has been available
at http://www.nzdl.org/Kea since March 1999.


**********************
2. System requirements
**********************

Kea runs under Unix. We have been running it under both Linux and Solaris.
Kea is implemented in Perl and Java (with the exception of the stemmer).

You must have Perl (Version 5 or greater) and Java (Version 1.1.6 or
greater) installed to run Kea. The main Kea program, called Kea,
has a variable called "$java_command" that contains the command
Kea will use to run Java. You'll have to make sure this is set
correctly for your system (I can't be bothered doing it for you).

To be honest, you'll probably need some ability with Perl and Java to
make Kea work.

Kea uses a GPL version of the Lovins stemmer that was written in C.
This distribution includes a compiled version for Linux. If you're
using Solaris or some other Unix, you will have to recompile it for
that platform. The source code is in the Iterated-Lovins-stemmer
directory. The README file in that directory will tell you how to
compile the stemmer. The program "stemmer" must be in the main directory.
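
If you want to check these requirements before a first run, something
like the following sketch may help. It is not part of Kea: the function
name missing_tools and the variable KEA_DIR are made up for this
example, and the check only looks for commands on your PATH, not for
their versions. (lynx is only needed for HTML and CSTR input.)

```shell
#!/bin/sh
# Hypothetical helper, not part of Kea: print every command from the
# argument list that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Kea needs perl and java; lynx is only needed for HTML and CSTR files.
missing=$(missing_tools perl java lynx)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
fi

# The compiled stemmer must sit in the main Kea directory.
# KEA_DIR is an assumed variable name; it defaults to the current directory.
KEA_DIR=${KEA_DIR:-.}
[ -x "$KEA_DIR/stemmer" ] || echo "No executable stemmer in $KEA_DIR"
```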

(If you know of a GPL Java or Perl version of the Iterated Lovins
stemmer, do let me know.)


************************
3. Extracting keyphrases
************************

The Kea program is used to extract keyphrases from files.
It is a Perl script, and is used like this:
 Kea [options] <text-or-html-or-cstr-files>

For example, if you have a text file called myfile.text, you could
extract keyphrases from it with this command:
 Kea myfile.text

Kea's output will be stored in a new file called myfile.kea
that looks something like this:
 protein       protein    0.8135395543417774
 amino acid    amin ac    0.543230038502526
 Nutrition     nutrit     0.15095707184225382
 assay         as         0.15095707184225382

The first column contains keyphrases Kea has extracted from the file.
The second column contains stemmed versions of the keyphrases.
The third column is an estimate of the probability that the phrase
would be chosen by the author as a keyphrase for this document. (See
Witten et al. for an explanation.)
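
If you want to post-process a .kea file, a short awk filter is usually
enough. The sketch below is not part of Kea; the function name
kea_above is made up, and it assumes the three columns are separated
by tabs (which would explain how phrases like "amino acid" can contain
spaces). Check one of your own .kea files before relying on it.

```shell
# Hypothetical helper: print the keyphrases in a .kea file whose
# probability (third column) is at least the given threshold.
# Assumes the three columns are separated by tabs.
kea_above() {
    awk -F '\t' -v min="$1" '$3 + 0 >= min + 0 { print $1 }' "$2"
}

# Build a sample file from the example output shown above.
sample=$(mktemp)
printf 'protein\tprotein\t0.8135395543417774\n'   >  "$sample"
printf 'amino acid\tamin ac\t0.543230038502526\n' >> "$sample"
printf 'Nutrition\tnutrit\t0.15095707184225382\n' >> "$sample"
printf 'assay\tas\t0.15095707184225382\n'         >> "$sample"

kea_above 0.5 "$sample"    # prints "protein" and "amino acid"
rm -f "$sample"
```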

Kea has several options. The most important is -N, which is
used to output a specific number of keyphrases. For example, suppose
you have a directory called public_html that contains a bunch of HTML
files, and you want to extract 15 phrases from each. Use the command:
 Kea -N 15 public_html/*.html

Kea works with three types of input file, which it tells apart by
their extensions:
 Text files have the extension .txt or .text
 HTML files have the extension .html or .htm
 CSTR files have the extension .cstr
CSTR files are those from the CSTR collection of the NZDL, and you
will probably never see them. If you want Kea to work with HTML or
CSTR files, you will need to have the lynx web browser installed
(we use version 2.5).
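
The extension rule above can be written down as a shell case statement.
This is only an illustration (kea_input_type is a made-up name, and we
have not checked how Kea itself treats upper-case extensions or files
with other extensions), but it is handy for sorting out a mixed
directory before running Kea on it.

```shell
# Hypothetical helper: report which of the three input types a file
# belongs to, judging by its extension only.
kea_input_type() {
    case "$1" in
        *.txt|*.text) echo text ;;
        *.html|*.htm) echo html ;;
        *.cstr)       echo cstr ;;
        *)            echo unknown ;;   # not one of the listed types
    esac
}

kea_input_type myfile.text     # text
kea_input_type index.htm       # html
kea_input_type notes.doc       # unknown
```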


***************
4. Using models
***************

Kea extracts phrases from text files based on a "model" of
the way authors choose keyphrases. The model is based on a set of
"training documents" that have author-assigned keyphrases.

The default model for Kea is the "aliweb" model, which is based on
90 web pages from the aliweb web site. If you use a different model
to extract phrases from a document, it might choose different phrases.
See Witten et al. for details.

You can download other models from the Kea download page, or you can
make your own. For example, you can download the CSTR model. This
model performs very well on Computer Science Technical Reports, but
less well on other collections. It consists of four files:
 cstr.stopwords   A list of stopwords used in text processing.
 cstr.df          The document-frequencies of some phrases in the CSTR.
 cstr.model       The Naive-Bayes model used in classification.
 cstr.kf          The keyphrase-frequencies of some phrases in the CSTR.
(Note: the CSTR model consists of all these files, not just cstr.model)

If you want to use the CSTR model to extract 10 keyphrases from a file
called myCSdocument.text, use the command:
 Kea -N 10 -C cstr myCSdocument.text
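
Since a model is really four files, it is easy to end up with an
incomplete one after a download. The following sketch (the function
name missing_model_files is made up, and nothing like it ships with
Kea) lists whichever of the four files are absent for a given model
name.

```shell
# Hypothetical helper: list the files of model "$1" that are missing
# from directory "$2" (default: current directory).
missing_model_files() {
    dir=${2:-.}
    for ext in stopwords df model kf; do
        [ -f "$dir/$1.$ext" ] || echo "$1.$ext"
    done
}

# Demonstration with a scratch directory holding an incomplete model.
demo=$(mktemp -d)
touch "$demo/cstr.stopwords" "$demo/cstr.df" "$demo/cstr.model"
missing_model_files cstr "$demo"    # prints "cstr.kf"
rm -rf "$demo"
```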


****************
5. Making models
****************

This section explains how to create a model that you can later use
to extract keyphrases. You might want to do this for a specialised
collection, like we did with the CSTR.

To build a model, you will need some training data. Read Witten et al.
(1999) to get an idea of the amount of training data you will need.
(We recommend about 50 documents, but fewer will work if you don't
have that many.)

Your training data should be placed in a single directory.
The training data consists of a set of text files (called *.txt)
and author keyword files (called *.key). For every .txt file there
should be a .key file. For example, if one of your text files is
Witten99.txt, there should be a corresponding keyword file called
Witten99.key. The .txt file should contain the document in plain
text form. The .key file should be a text file containing each
of the author-assigned keywords for that file, one per line.
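
Before building a model it is worth checking that every .txt file
really does have its .key file. A sketch of such a check follows;
txt_without_key is a made-up name and the script is not part of Kea.

```shell
# Hypothetical helper: print each .txt file in directory "$1" that
# has no matching .key file.
txt_without_key() {
    for txt in "$1"/*.txt; do
        [ -e "$txt" ] || continue          # directory has no .txt files
        [ -f "${txt%.txt}.key" ] || echo "$txt"
    done
}

# Demonstration: Witten99 is complete, Frank99 lacks its .key file.
demo=$(mktemp -d)
touch "$demo/Witten99.txt" "$demo/Witten99.key" "$demo/Frank99.txt"
txt_without_key "$demo"     # prints the path of Frank99.txt
rm -rf "$demo"
```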

We have put a couple of training datasets that we have used
on the Kea downloads web page, if you want an example.

Let's assume your training data is in a directory called Green.
We're going to use your training data to build a model called green;
this model will consist of four files:
 green.stopwords, green.df, green.model, green.kf.

First, create a "stopwords" file for your collection. The
stopwords are a list of words that never occur at the start
or end of a keyphrase. Read Witten et al. for more detail.
They are placed in a text file, one per line, in lowercase.
Kea comes with a stopwords file called aliweb.stopwords.
We will use it in our model:
 cp aliweb.stopwords green.stopwords
You can add new stopwords for specialised collections if you
need to (see cstr.stopwords for an example).
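
If you do add your own stopwords, one way to keep the file in the
expected shape (one word per line, in lowercase) is sketched below.
merge_stopwords is a made-up name, not a Kea tool, and the sorting is
only there to remove duplicates; as far as we know Kea does not need
the list to be sorted.

```shell
# Hypothetical helper: merge stopword lists into one lowercase,
# duplicate-free list with one word per line.
merge_stopwords() {
    cat "$@" | tr '[:upper:]' '[:lower:]' | sed '/^[[:space:]]*$/d' |
        LC_ALL=C sort -u
}

# Demonstration with two small scratch lists.
base=$(mktemp); extra=$(mktemp)
printf 'the\nof\nand\n' > "$base"
printf 'Report\nof\n'   > "$extra"
merge_stopwords "$base" "$extra"    # and, of, report, the (one per line)
rm -f "$base" "$extra"
```

With real files this would be used along the lines of
 merge_stopwords aliweb.stopwords my-extra-words.txt > green.stopwords
where my-extra-words.txt is a file you have written yourself.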

We will now create a model file (green.model) and a document
frequency file (green.df).

You will need to convert all the text files to "clauses" files
with the command:
 prepare-clauses-all-txt-files.pl Green
This will create a clauses file for every text file: for example,
if you have a Witten99.txt file, Witten99.clauses will be created.

Next, you need to create an "arff" file (green.arff) and, as a
side effect, the document frequency file (green.df).
The arff file isn't part of the model; it is the input file
needed by the machine learning scheme to create the Naive-Bayes
model. Use the command:
 k4.pl -f green.df -S green.stopwords Green green.arff
This command (called k4.pl for historical reasons) uses the
training files in the directory Green (specifically, *.clauses
and *.key) to create green.arff.
It uses green.stopwords for its stopword file, and green.df as its
document-frequency file. Since green.df doesn't exist when you
start, it will create green.df for you as it works. (If you ever
repeat this command, you should delete green.df first.)

Now you need to create a Naive-Bayes model (green.model) from
the arff file you just built (green.arff).
You'll need a bit of Java knowledge here. Make sure "./jaws.jar"
is on your Java classpath, and type:
 java KEP -t green.arff -m green.model
This will use green.arff as training data to create the
Naive-Bayes model, which is saved in green.model.

The final part of the model is *optional* - the keyphrase
frequency file, called green.kf. It lists all the author
keyphrases in the training data, with the number of
times each occurs as a keyphrase. It is optional,
but it does improve performance on *specialised* collections,
so if you're extracting keyphrases for a specialised
collection for a "real" purpose, then you should use one if
you can. See Frank et al. for more details.
Each line of the file should have a stemmed phrase, followed
by a tab, followed by the number of times the phrase is a
keyphrase - see cstr.kf or aliweb.kf for an example.
You can make a file like this with a command like
 cat Green/*.key | stemmer | count-lines.pl > green.kf
To do this you will need the stemmer and count-lines.pl
script provided with Kea.
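
To make the file format concrete, here is a sketch of the counting
step on its own. It is a stand-in written for this README, not the
count-lines.pl script itself (whose exact behaviour we are only
assuming from the format described above), and it does no stemming:
it expects phrases that have already been through the stemmer.

```shell
# Hypothetical stand-in for the counting step: read phrases on stdin,
# one per line, and print "phrase<TAB>count" for each distinct phrase.
count_phrases() {
    awk '{ n[$0]++ } END { for (p in n) printf "%s\t%d\n", p, n[p] }' |
        LC_ALL=C sort
}

# Three already-stemmed keyphrases, one of them occurring twice.
printf 'amin ac\nprotein\namin ac\n' | count_phrases
# amin ac<TAB>2
# protein<TAB>1
```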

The model is now complete.

To use it, put the green.df, green.model, green.stopwords, and
(if you have one) green.kf in the Kea directory. You can extract
keyphrases like this:
 Kea -N 10 -C green myfile.txt
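
The whole of this section can be collected into one small script. The
sketch below is not shipped with Kea; the names build_model, step and
RUN are made up. By default it only prints the commands from this
section so you can read them over, and it runs them only if you set
RUN=1. It assumes that the Kea scripts and the stemmer are on your
PATH, that ./jaws.jar is on your Java classpath, and that the directory
and model names contain no spaces.

```shell
# Print a command, or execute it when RUN=1 (dry run by default).
step() {
    if [ "${RUN:-0}" = 1 ]; then sh -c "$1"; else echo "$1"; fi
}

# Hypothetical wrapper around the steps of this section:
# "$1" is the training directory, "$2" the name of the new model.
build_model() {
    dir=$1 name=$2
    step "cp aliweb.stopwords $name.stopwords" &&
    step "prepare-clauses-all-txt-files.pl $dir" &&
    step "rm -f $name.df" &&     # k4.pl must start without an old .df
    step "k4.pl -f $name.df -S $name.stopwords $dir $name.arff" &&
    step "java KEP -t $name.arff -m $name.model" &&
    step "cat $dir/*.key | stemmer | count-lines.pl > $name.kf"
}

build_model Green green      # prints the six commands for the example
```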


****************
6. The Kea files
****************

Here's a description of what the various Kea program files do.

README:       This file.

Kea:          Extracts keyphrases from text based on a model.

*.model:      Naive-Bayes model object stored as a file
*.kf:         Keyphrase-frequency file
*.df:         Document-frequency file (aka a global-frequency file)
*.stopwords:  Stopwords file

stemmer:      Program for stemming words with the Iterated Lovins stemmer
Iterated-Lovins-stemmer:
    Directory containing code for the stemmer. Some of the files are
    copyright 1994 Linh Huynh, GNU General Public License. The others
    are simply wrappers I have written myself.

KEP.java:     Java code for creating & using a Naive-Bayes model
KEP.class:    Compiled version of KEP.java
jaws.jar:     Java archive of the WEKA Java machine learning code.
    Copyright Eibe Frank & Len Trigg, GNU General Public License.

kea-tidy-key-file.pl:
    Convert a .key or .kea file into a "clean" format.
kea-choose-best-phrase.pl:
    Find the "best" unstemmed version of a keyphrase
    that appears in a file in many forms.
prepare-clauses.pl:
    Perl script that converts a text file to a clauses file.
prepare-clauses-all-txt-files.pl:
    Applies prepare-clauses.pl to an entire directory.
cstr-to-text.pl:
    Converts cstr files to text; requires lynx.
count-lines.pl:
    Counts the lines in a file.


***********************
7. Advanced Kea options
***********************

Here is a complete list of the options to Kea. The last
four (-F, -K, -M, and -S) have been superseded by the -C option,
but still work; it's possible they are good for something.

 -d       Debug mode. Working files are left in /tmp
 -t       Output TF.IDF for each phrase. Used by Kniles.
 -N n     Output n keyphrases (if possible).
 -E ext   Output files have extension ".ext" (default is ".kea")
 -C x     Use model based on corpus x.
          Defaults to "aliweb" web page corpus.

 -F df    Use document-frequency file "df".
          Defaults to x.df where x is set by the -C argument.
 -K kf    Use keyphrase-frequency file "kf".
          Defaults to x.kf where x is set by the -C argument.
 -M mf    Use model file "mf".
          Defaults to x.model where x is set by the -C argument.
 -S sf    Use stopword file "sf".
          Defaults to x.stopwords where x is set by the -C argument.
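
The way the four file options fall back on -C can be shown with a few
lines of shell. This is an illustration of the rule as described above,
not code taken from Kea: resolve_model_files is a made-up name, big.df
is a made-up file, and we are assuming that the order of the options on
the command line does not matter.

```shell
# Hypothetical illustration: parse -C/-F/-K/-M/-S and print the four
# model files that would be used, applying the defaults listed above.
resolve_model_files() {
    corpus=aliweb df= kf= mf= sf=
    OPTIND=1
    while getopts 'C:F:K:M:S:' opt; do
        case $opt in
            C) corpus=$OPTARG ;;
            F) df=$OPTARG ;;
            K) kf=$OPTARG ;;
            M) mf=$OPTARG ;;
            S) sf=$OPTARG ;;
            *) return 1 ;;
        esac
    done
    echo "df=${df:-$corpus.df}"
    echo "kf=${kf:-$corpus.kf}"
    echo "model=${mf:-$corpus.model}"
    echo "stopwords=${sf:-$corpus.stopwords}"
}

resolve_model_files -C cstr -F big.df
# df=big.df
# kf=cstr.kf
# model=cstr.model
# stopwords=cstr.stopwords
```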