1 | Kea -- Automatic Keyphrase Extraction
|
---|
2 |
|
---|
3 | Copyright 1998-1999 by Gordon Paynter and Eibe Frank
|
---|
4 | Contact [email protected] or [email protected]
|
---|
5 |
|
---|
6 | * This program is free software; you can redistribute it and/or modify
|
---|
7 | * it under the terms of the GNU General Public License as published by
|
---|
8 | * the Free Software Foundation; either version 2 of the License, or
|
---|
9 | * (at your option) any later version.
|
---|
10 | *
|
---|
11 | * This program is distributed in the hope that it will be useful,
|
---|
12 | * but WITHOUT ANY WARRANTY; without even the implied warranty of
|
---|
13 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
---|
14 | * GNU General Public License for more details.
|
---|
15 | *
|
---|
16 | * You should have received a copy of the GNU General Public License
|
---|
17 | * along with this program; if not, write to the Free Software
|
---|
18 | * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
---|
19 |
|
---|
20 |
|
---|
21 | ***************
|
---|
22 | 0. Introduction
|
---|
23 | ***************
|
---|
24 |
|
---|
25 | Kea is a program for extracting keyphrases from text and html files.
|
---|
26 | The Kea algorithm is described in these papers:
|
---|
27 | * Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin,
|
---|
28 | and Craig G. Nevill-Manning (1999) "KEA: Practical Automatic
|
---|
29 | Keyphrase Extraction."
|
---|
30 | * Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and
|
---|
31 | Craig G. Nevill-Manning (1999) "Domain-Specific Keyphrase Extraction."
|
---|
32 | These papers, and others, and our Kea implementation, are available from
|
---|
33 | the technology section of the New Zealand Digital Library web site at
|
---|
34 | http://www.nzdl.org/
|
---|
35 |
|
---|
36 | Kea was mostly implemented by Gordon Paynter ([email protected])
|
---|
37 | and Eibe Frank ([email protected]). Craig Nevill-Manning
|
---|
38 | and Carl Gutwin have worked on earlier versions; there's even
|
---|
39 | a chance that some of their semi-colons are still in service.
|
---|
40 | Please contact Gordon about the general implementation or Eibe about
|
---|
41 | the java side of things.
|
---|
42 |
|
---|
43 | This document describes the current Kea implementation. It is divided
|
---|
44 | into these sections:
|
---|
45 | 0. This introduction
|
---|
46 | 1. Version History
|
---|
47 | 2. System requirements
|
---|
48 | 3. Extracting keyphrases
|
---|
49 | 4. Using models
|
---|
50 | 5. Making models
|
---|
51 | 6. The Kea files
|
---|
52 | 7. Advanced Kea options
|
---|
53 |
|
---|
54 |
|
---|
55 | ******************
|
---|
56 | 1. Version History
|
---|
57 | ******************
|
---|
58 |
|
---|
59 | There were many pre-1.0 versions of Kea; they are mostly forgotten.
|
---|
60 |
|
---|
61 | Version 1.0 of Kea was the version used in the paper by Witten et al.
|
---|
62 | described above. It was distributed to very few people.
|
---|
63 |
|
---|
64 | Version 1.1 of Kea is the first "public" version, and is available at
|
---|
65 | http://www.nzdl.org/Kea from March 1999.
|
---|
66 |
|
---|
67 |
|
---|
68 | **********************
|
---|
69 | 2. System requirements
|
---|
70 | **********************
|
---|
71 |
|
---|
72 | Kea runs under Unix. We have run it on both Linux and Solaris.
|
---|
73 | Kea is implemented in Perl and Java (with the exception of the stemmer).
|
---|
74 |
|
---|
75 | You must have Perl (Version 5 or greater) and Java (Version 1.1.6 or
|
---|
76 | greater) installed to run Kea. The main Kea program, called Kea,
|
---|
77 | has a variable called "$java_command" that contains the command
|
---|
78 | Kea will use to run java. You'll have to make sure this is set
|
---|
79 | correctly for your system (I can't be bothered doing it for you).
|
---|
80 |
|
---|
81 | To be honest, you'll probably need some ability with Perl and Java to
|
---|
82 | make Kea work.
|
---|
83 |
|
---|
84 | Kea uses a GPL version of the Lovins stemmer that was written in C.
|
---|
85 | This distribution includes a compiled version for LINUX. If you're
|
---|
86 | using Solaris or some other Unix, you will have to recompile it for
|
---|
87 | that platform. The source code is in the Iterated-Lovins-stemmer
|
---|
88 | directory. The README file in that directory will tell you how to
|
---|
89 | compile the stemmer. The program "stemmer" must be in the main directory.
|
---|
90 |
|
---|
91 | (If you know of a GPL Java or Perl version of the Iterated Lovins
|
---|
92 | stemmer, do let me know.)
|
---|
93 |
|
---|
94 |
|
---|
95 | ************************
|
---|
96 | 3. Extracting keyphrases
|
---|
97 | ************************
|
---|
98 |
|
---|
99 | The Kea program is used to extract keyphrases from files.
|
---|
100 | It is a perl script, and is used like this:
|
---|
101 | Kea [options] <text-or-html-or-cstr-files>
|
---|
102 |
|
---|
103 | For example, if you have a text file called myfile.text, you could
|
---|
104 | extract keyphrases from it with this command:
|
---|
105 | Kea myfile.text
|
---|
106 |
|
---|
107 | Kea's output will be stored in a new file called myfile.kea
|
---|
108 | that looks something like this:
|
---|
109 | protein protein 0.8135395543417774
|
---|
110 | amino acid amin ac 0.543230038502526
|
---|
111 | Nutrition nutrit 0.15095707184225382
|
---|
112 | assay as 0.15095707184225382
|
---|
113 |
|
---|
114 | The first column contains keyphrases Kea has extracted from the file.
|
---|
115 | The second column contains stemmed versions of the keyphrases.
|
---|
116 | The third column is an estimate of the probability that the phrase
|
---|
117 | would be chosen by the author as a keyword for this paper. (See
|
---|
118 | Witten et al. for an explanation.)
|
---|
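The three-column output is easy to post-process with standard Unix tools. The function below is a minimal sketch, not part of Kea: it prints the phrases in a .kea file whose probability is at least a given threshold. It assumes the columns are separated by tabs (phrases can contain spaces, so some non-space separator is needed); the function name kea_top_phrases is hypothetical.

```shell
# Sketch only: list phrases in a .kea file scoring at or above a threshold.
# Assumption: columns are tab-separated (phrase, stemmed phrase, probability).
kea_top_phrases() {
    # $1 = .kea file, $2 = minimum probability (for example 0.5)
    awk -F '\t' -v min="$2" 'NF >= 3 && $3 + 0 >= min + 0 { print $1 }' "$1"
}
```

For example, "kea_top_phrases myfile.kea 0.5" would print only the phrases Kea rates at 0.5 or higher.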
119 |
|
---|
120 | Kea has several options. The most important is -N, which is
|
---|
121 | used to output a specific number of keyphrases. For example, suppose
|
---|
122 | you have a directory called public_html that contains a bunch of html
|
---|
123 | files, and you want to extract 15 phrases from each. Use the command:
|
---|
124 | Kea -N 15 public_html/*.html
|
---|
125 |
|
---|
126 | Kea recognises three types of input file by their extensions.
|
---|
127 | Text files have the extension .txt or .text
|
---|
128 | HTML files have the extension .html or .htm
|
---|
129 | CSTR files have the extension .cstr
|
---|
130 | CSTR files are those from the CSTR collection of the NZDL, and you
|
---|
131 | will probably never see them. If you want Kea to work with HTML or
|
---|
132 | CSTR files, you will need to have the lynx web browser installed
|
---|
133 | (we use version 2.5).
|
---|
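The extension rules above can be written down as a small shell function, which may help when preparing a mixed directory of files. This is an illustrative sketch of the documented mapping, not the code Kea itself uses; the name kea_input_type is hypothetical.

```shell
# Sketch only: report how Kea would classify a file, going by the
# extension rules listed above.
kea_input_type() {
    # $1 = file name
    case "$1" in
        *.txt|*.text) echo text ;;
        *.html|*.htm) echo html ;;     # needs lynx
        *.cstr)       echo cstr ;;     # needs lynx
        *)            echo unknown ;;  # not a documented input type
    esac
}
```

Files reported as html or cstr are the ones that need lynx installed.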
134 |
|
---|
135 |
|
---|
136 | ***************
|
---|
137 | 4. Using models
|
---|
138 | ***************
|
---|
139 |
|
---|
140 | Kea extracts phrases from text files based on a "model" of
|
---|
141 | the way authors choose keyphrases. The model is based on a set of
|
---|
142 | "training documents" that have author-assigned keyphrases.
|
---|
143 |
|
---|
144 | The default model for Kea is the "aliweb" model, which is based on
|
---|
145 | 90 web pages from the aliweb web site. If you use a different model
|
---|
146 | to extract phrases from a document, it might choose different phrases.
|
---|
147 | See Witten et al. for details.
|
---|
148 |
|
---|
149 | You can download other models from the Kea download page, or you can
|
---|
150 | make your own. For example, you can download the CSTR model. This
|
---|
151 | model performs very well on Computer Science Technical Reports, but
|
---|
152 | less well on other collections. It consists of four files:
|
---|
153 | cstr.stopwords A list of stopwords used in text processing.
|
---|
154 | cstr.df The document-frequencies of some phrases in the CSTR.
|
---|
155 | cstr.model The Naive-Bayes model used in classification.
|
---|
156 | cstr.kf The keyphrase-frequencies of some phrases in the CSTR.
|
---|
157 | (Note: the CSTR model consists of all these files, not just cstr.model)
|
---|
158 |
|
---|
159 | If you want to use the CSTR model to extract 10 keyphrases from a file
|
---|
160 | called myCSdocument.text, use the command:
|
---|
161 | Kea -N 10 -C cstr myCSdocument.text
|
---|
162 |
|
---|
163 |
|
---|
164 | ****************
|
---|
165 | 5. Making models
|
---|
166 | ****************
|
---|
167 |
|
---|
168 | This section explains how to create a model that you can later use
|
---|
169 | to extract keyphrases. You might want to do this for a specialised
|
---|
170 | collection, like we did with the CSTR.
|
---|
171 |
|
---|
172 | To build a model, you will need some training data. Read Witten et al.
|
---|
173 | (1999) to get an idea of the amount of training data you will need.
|
---|
174 | (We recommend about 50 documents, but fewer will work if you don't
|
---|
175 | have that many.)
|
---|
176 |
|
---|
177 | Your training data should be placed in a single directory.
|
---|
178 | The training data consists of a set of text files (called *.txt)
|
---|
179 | and author keyword files (called *.key). For every .txt there
|
---|
180 | should be a .key file. For example if one of your text files is
|
---|
181 | Witten99.txt, there should be a corresponding keyword file called
|
---|
182 | Witten99.key. The .txt file should contain the document in plain
|
---|
183 | text form. The .key file should be a text file containing each
|
---|
184 | of the author-assigned keywords for that file, one per line.
|
---|
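Before building a model it is worth checking that every .txt file really has its .key partner. The function below is a small sketch for doing that check; it is not shipped with Kea, and the name kea_missing_keys is hypothetical.

```shell
# Sketch only: print every .txt file in a training directory that has
# no matching .key file.
kea_missing_keys() {
    # $1 = training directory
    for txt in "$1"/*.txt; do
        [ -e "$txt" ] || continue      # directory contains no .txt files
        key="${txt%.txt}.key"          # Witten99.txt -> Witten99.key
        [ -e "$key" ] || echo "$txt"
    done
}
```

If "kea_missing_keys Green" prints nothing, every text file in Green has a keyword file.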
185 |
|
---|
186 | We have put a couple of training datasets that we have used
|
---|
187 | on the Kea downloads web page, if you want an example.
|
---|
188 |
|
---|
189 | Let's assume your training data is in a directory called Green.
|
---|
190 | We're going to use your training data to build a model called green;
|
---|
191 | this model will consist of four files:
|
---|
192 | green.stopwords, green.df, green.model, green.kf.
|
---|
193 |
|
---|
194 | First, create a "stopwords" file for your collection. The
|
---|
195 | stopwords are a list of words that never occur at the start
|
---|
196 | or end of a keyphrase. Read Witten et al. for more detail.
|
---|
197 | They are placed in a text file, one per line, in lowercase.
|
---|
198 | Kea comes with a stopwords file called aliweb.stopwords.
|
---|
199 | We will use it in our model:
|
---|
200 | cp aliweb.stopwords green.stopwords
|
---|
201 | You can add new stopwords for specialised collections if you
|
---|
202 | need to (see cstr.stopwords for an example).
|
---|
203 |
|
---|
204 | We will now create a model file (green.model) and a document
|
---|
205 | frequency file (green.df).
|
---|
206 |
|
---|
207 | You will need to convert all the text files to "clauses" files
|
---|
208 | with the command:
|
---|
209 | prepare-clauses-all-txt-files.pl Green
|
---|
210 | This will create a clauses file for every text file: for example,
|
---|
211 | if you have a Witten99.txt file, Witten99.clauses will be created.
|
---|
212 |
|
---|
213 | Next, you need to create an "arff" file (green.arff) and, as a
|
---|
214 | side effect, the document frequency file (green.df).
|
---|
215 | The arff file isn't part of the model; it is the input file
|
---|
216 | needed by the machine learning scheme to create the Naive-Bayes
|
---|
217 | model. Use the command:
|
---|
218 | k4.pl -f green.df -S green.stopwords Green green.arff
|
---|
219 | This command (called k4.pl for historical reasons) uses the
|
---|
220 | training files in the directory Green (specifically, *.clauses
|
---|
221 | and *.key) to create green.arff.
|
---|
222 | It uses green.stopwords for its stopword file, and green.df as its
|
---|
223 | document-frequency file. Since green.df doesn't exist when you
|
---|
224 | start, it will create green.df for you as it works. (If you ever
|
---|
225 | repeat this command, you should delete green.df first.)
|
---|
226 |
|
---|
227 | Now you need to create a Naive-Bayes model (green.model) from
|
---|
228 | the arff file you just built (green.arff).
|
---|
229 | You'll need a bit of java knowledge here. Make sure "./jaws.jar"
|
---|
230 | is on your java classpath, and type:
|
---|
231 | java KEP -t green.arff -m green.model
|
---|
232 | This will use green.arff as training data to create the
|
---|
233 | Naive-Bayes model, which is saved in green.model.
|
---|
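The steps so far (stopwords, clauses, arff file, and Naive-Bayes model) can be collected into one shell function. This is a sketch under the same assumptions as the commands above: it is run from the Kea directory, the scripts can be invoked by name, and jaws.jar is already on your java classpath. The function name kea_build_model and the RUN variable are hypothetical conveniences, not part of Kea.

```shell
# Sketch only: repeat the model-building commands from this section.
# Set RUN=echo to print the commands instead of running them.
kea_build_model() {
    # $1 = training directory (for example Green)
    # $2 = model name (for example green)
    run="${RUN:-}"
    $run cp aliweb.stopwords "$2.stopwords" &&
    $run rm -f "$2.df" &&    # a stale .df file must go before k4.pl runs
    $run prepare-clauses-all-txt-files.pl "$1" &&
    $run k4.pl -f "$2.df" -S "$2.stopwords" "$1" "$2.arff" &&
    $run java KEP -t "$2.arff" -m "$2.model"
}
```

Running "RUN=echo kea_build_model Green green" prints the five commands without executing them, which is a cheap way to check the file names first.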
234 |
|
---|
235 | The final part of the model is *optional* - the keyphrase
|
---|
236 | frequency file, called green.kf. It lists all the author
|
---|
237 | keyphrases in the training data, with the number of
|
---|
238 | times each occurs as a keyphrase. It is optional,
|
---|
239 | but it does improve performance on *specialised* collections,
|
---|
240 | so if you're extracting keyphrases for a specialised
|
---|
241 | collection for a "real" purpose, then you should use one if
|
---|
242 | you can. See Frank et al. for more details.
|
---|
243 | Each line of the file should have a stemmed phrase, followed
|
---|
244 | by a tab, followed by the number of times the phrase is a
|
---|
245 | keyphrase - see cstr.kf or aliweb.kf for an example.
|
---|
246 | You can make a file like this with a command like
|
---|
247 | cat Green/*.key | stemmer | count-lines.pl > green.kf
|
---|
248 | To do this you will need the stemmer and count-lines.pl
|
---|
249 | script provided with Kea.
|
---|
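If you want to see what the counting step amounts to, the following sketch produces the same line format (phrase, tab, count) using only sort, uniq, and awk. It is a stand-in written for illustration, not the count-lines.pl script itself, whose exact behaviour may differ. It expects phrases that have already been stemmed on standard input; the name kf_count is hypothetical.

```shell
# Sketch only: turn a stream of (already stemmed) phrases, one per line,
# into "phrase<TAB>count" lines.
kf_count() {
    sort | uniq -c |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 "\t" n }'
}
```

Under those assumptions, "cat Green/*.key | stemmer | kf_count > green.kf" would give a file in the format described above.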
250 |
|
---|
251 | The model is now complete.
|
---|
252 |
|
---|
253 | To use it, put the green.df, green.model, green.stopwords, and
|
---|
254 | (if you have one) green.kf in the Kea directory. You can extract
|
---|
255 | keyphrases like this:
|
---|
256 | Kea -N 10 -C green myfile.txt
|
---|
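Because a model is several files rather than one, a quick check before running Kea can save some confusion. The function below is a sketch that looks for the three required files and notes whether the optional .kf file is present; it is not part of Kea, and the name kea_check_model is hypothetical.

```shell
# Sketch only: check that the files making up a model are present.
kea_check_model() {
    # $1 = model name (default aliweb), $2 = directory (default .)
    corpus="${1:-aliweb}"
    dir="${2:-.}"
    status=0
    for ext in df model stopwords; do
        if [ ! -e "$dir/$corpus.$ext" ]; then
            echo "missing: $dir/$corpus.$ext"
            status=1
        fi
    done
    # The keyphrase-frequency file is optional.
    [ -e "$dir/$corpus.kf" ] || echo "note: no $corpus.kf (optional)"
    return "$status"
}
```

"kea_check_model green" returns a non-zero status if green.df, green.model, or green.stopwords is missing from the current directory.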
257 |
|
---|
258 |
|
---|
259 | ****************
|
---|
260 | 6. The Kea files
|
---|
261 | ****************
|
---|
262 |
|
---|
263 | Here's a description of what the various Kea program files do.
|
---|
264 |
|
---|
265 | README: This file.
|
---|
266 |
|
---|
267 | Kea: Extracts keyphrases from text based on a model
|
---|
268 |
|
---|
269 | *.model: Naive-Bayes model object stored as a file
|
---|
270 | *.kf: Keyphrase-frequency file
|
---|
271 | *.df: Document-frequency file (aka a global-frequency file)
|
---|
272 | *.stopwords: Stopwords file
|
---|
273 |
|
---|
274 | stemmer: Program for stemming words with the Iterated Lovins stemmer
|
---|
275 | Iterated-Lovins-stemmer:
|
---|
276 | Directory containing code for stemmer. Some of the files are
|
---|
277 | copyright 1994 Linh Huynh, Gnu Public License. The others
|
---|
278 | are simply wrappers I have written myself.
|
---|
279 |
|
---|
280 | KEP.java: Java code for creating & using a Naive-Bayes model
|
---|
281 | KEP.class: Compiled version of KEP.java
|
---|
282 | jaws.jar: Java archive of the WEKA java machine learning code.
|
---|
283 | Copyright Eibe Frank & Len Trigg, Gnu Public License.
|
---|
284 |
|
---|
285 | kea-tidy-key-file.pl:
|
---|
286 | Convert a .key or .kea file into a "clean" format.
|
---|
287 | kea-choose-best-phrase.pl:
|
---|
288 | Find the "best" unstemmed version of a keyphrase
|
---|
289 | that appears in a file in many forms.
|
---|
290 | prepare-clauses.pl:
|
---|
291 | Perl script that converts a text file to a clauses file.
|
---|
292 | prepare-clauses-all-txt-files.pl:
|
---|
293 | Applies prepare-clauses.pl to an entire directory.
|
---|
294 | cstr-to-text.pl:
|
---|
295 | Converts cstr files to text; requires lynx.
|
---|
296 | count-lines.pl:
|
---|
297 | Counts the lines in a file.
|
---|
298 |
|
---|
299 |
|
---|
300 | ***********************
|
---|
301 | 7. Advanced Kea options
|
---|
302 | ***********************
|
---|
303 |
|
---|
304 | Here is a complete list of the options to Kea. The last
|
---|
305 | four (-F, -K, -M, and -S) have been superseded by the -C option,
|
---|
306 | but still work; it's possible they are good for something.
|
---|
307 |
|
---|
308 | -d Debug mode. Working files are left in /tmp
|
---|
309 | -t Output TF.IDF for each phrase. Used by Kniles.
|
---|
310 | -N n Output n keyphrases (if possible).
|
---|
311 | -E ext Output files have extension ".ext" (default is ".kea")
|
---|
312 | -C x Use model based on corpus x.
|
---|
313 | Defaults to "aliweb" web page corpus.
|
---|
314 |
|
---|
315 | -F df Use document-frequency file "df".
|
---|
316 | Defaults to x.df where x is set by the -C argument.
|
---|
317 | -K kf Use keyphrase-frequency file "kf".
|
---|
318 | Defaults to x.kf where x is set by the -C argument.
|
---|
319 | -M mf Use model file "mf".
|
---|
320 | Defaults to x.model where x is set by the -C argument.
|
---|
321 | -S sf Use stopword file "sf".
|
---|
322 | Defaults to x.stopwords where x is set by the -C argument.
|
---|
323 |
|
---|
324 |
|
---|