| 1 | Kea -- Automatic Keyphrase Extraction
|
---|
| 2 |
|
---|
| 3 | Copyright 1998-1999 by Gordon Paynter and Eibe Frank
|
---|
| 4 | Contact [email protected] or [email protected]
|
---|
| 5 |
|
---|
| 6 | * This program is free software; you can redistribute it and/or modify
|
---|
| 7 | * it under the terms of the GNU General Public License as published by
|
---|
| 8 | * the Free Software Foundation; either version 2 of the License, or
|
---|
| 9 | * (at your option) any later version.
|
---|
| 10 | *
|
---|
| 11 | * This program is distributed in the hope that it will be useful,
|
---|
| 12 | * but WITHOUT ANY WARRANTY; without even the implied warranty of
|
---|
| 13 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
---|
| 14 | * GNU General Public License for more details.
|
---|
| 15 | *
|
---|
| 16 | * You should have received a copy of the GNU General Public License
|
---|
| 17 | * along with this program; if not, write to the Free Software
|
---|
| 18 | * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
---|
| 19 |
|
---|
| 20 |
|
---|
| 21 | ***************
|
---|
| 22 | 0. Introduction
|
---|
| 23 | ***************
|
---|
| 24 |
|
---|
| 25 | Kea is a program for extracting keyphrases from text and html files.
|
---|
| 26 | The Kea algorithm is described in these papers:
|
---|
| 27 | * Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin,
|
---|
| 28 | and Craig G. Nevill-Manning (1999) "KEA: Practical Automatic
|
---|
| 29 | Keyphrase Extraction."
|
---|
| 30 | * Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and
|
---|
| 31 | Craig G. Nevill-Manning (1999) "Domain-Specific Keyphrase Extraction."
|
---|
| 32 | These papers, and others, and our Kea implementation, are available from
|
---|
| 33 | the technology section of the New Zealand Digital Library web site at
|
---|
| 34 | http://www.nzdl.org/
|
---|
| 35 |
|
---|
| 36 | Kea was mostly implemented by Gordon Paynter ([email protected])
|
---|
| 37 | and Eibe Frank ([email protected]). Craig Nevill-Manning
|
---|
| 38 | and Carl Gutwin have worked on earlier versions; there's even
|
---|
| 39 | a chance that some of their semicolons are still in service.
|
---|
| 40 | Please contact Gordon about the general implementation or Eibe about
|
---|
| 41 | the Java side of things.
|
---|
| 42 |
|
---|
| 43 | This document describes the current Kea implementation. It is divided
|
---|
| 44 | into these sections:
|
---|
| 45 | 0. This introduction
|
---|
| 46 | 1. Version History
|
---|
| 47 | 2. System requirements
|
---|
| 48 | 3. Extracting keyphrases
|
---|
| 49 | 4. Using models
|
---|
| 50 | 5. Making models
|
---|
| 51 | 6. The Kea files
|
---|
| 52 | 7. Advanced Kea options
|
---|
| 53 |
|
---|
| 54 |
|
---|
| 55 | ******************
|
---|
| 56 | 1. Version History
|
---|
| 57 | ******************
|
---|
| 58 |
|
---|
| 59 | There were many pre-1.0 versions of Kea; they are mostly forgotten.
|
---|
| 60 |
|
---|
| 61 | Version 1.0 of Kea was the version used in the paper by Witten et al.
|
---|
| 62 | described above. It was distributed to very few people.
|
---|
| 63 |
|
---|
| 64 | Version 1.1 of Kea is the first "public" version, and is available at
|
---|
| 65 | http://www.nzdl.org/Kea from March 1999.
|
---|
| 66 |
|
---|
| 67 |
|
---|
| 68 | **********************
|
---|
| 69 | 2. System requirements
|
---|
| 70 | **********************
|
---|
| 71 |
|
---|
| 72 | Kea runs under Unix. We have been running it in both Linux and Solaris.
|
---|
| 73 | Kea is implemented in Perl and Java (with the exception of the stemmer).
|
---|
| 74 |
|
---|
| 75 | You must have Perl (Version 5 or greater) and Java (Version 1.1.6 or
|
---|
| 76 | greater) installed to run Kea. The main Kea program, called Kea,
|
---|
| 77 | has a variable called "$java_command" that contains the command
|
---|
| 78 | Kea will use to run java. You'll have to make sure this is set
|
---|
| 79 | correctly for your system (I can't be bothered doing it for you).
|
---|
| 80 |
|
---|
| 81 | To be honest, you'll probably need some ability with Perl and Java to
|
---|
| 82 | make Kea work.
|
---|
| 83 |
|
---|
| 84 | Kea uses a GPL version of the Lovins stemmer that was written in C.
|
---|
| 85 | This distribution includes a compiled version for Linux. If you're
|
---|
| 86 | using Solaris or some other Unix, you will have to recompile it for
|
---|
| 87 | that platform. The source code is in the Iterated-Lovins-stemmer
|
---|
| 88 | directory. The README file in that directory will tell you how to
|
---|
| 89 | compile the stemmer. The program "stemmer" must be in the main directory.
|
---|
| 90 |
|
---|
| 91 | (If you know of a GPL Java or Perl version of the Iterated Lovins
|
---|
| 92 | stemmer, do let me know.)
|
---|
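The requirements above can be checked before a first run. Here is a
minimal shell sketch; it is not part of the Kea distribution, and the
function name missing_commands is made up for this example. It prints
each named program that is not on your PATH.

```shell
# Print each named command that cannot be found on the PATH.
# missing_commands is an illustrative helper, not a Kea script.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}
```

For example, "missing_commands perl java lynx" prints nothing if all
three are installed. You would still need to check by hand that the
"stemmer" program is in the main directory and that $java_command is
set correctly.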
| 93 |
|
---|
| 94 |
|
---|
| 95 | ************************
|
---|
| 96 | 3. Extracting keyphrases
|
---|
| 97 | ************************
|
---|
| 98 |
|
---|
| 99 | The Kea program is used to extract keyphrases from files.
|
---|
| 100 | It is a Perl script, and is used like this:
|
---|
| 101 | Kea [options] <text-or-html-or-cstr-files>
|
---|
| 102 |
|
---|
| 103 | For example, if you have a text file called myfile.text, you could
|
---|
| 104 | extract keyphrases from it with this command:
|
---|
| 105 | Kea myfile.text
|
---|
| 106 |
|
---|
| 107 | Kea's output will be stored in a new file called myfile.kea
|
---|
| 108 | that looks something like this:
|
---|
| 109 | protein protein 0.8135395543417774
|
---|
| 110 | amino acid amin ac 0.543230038502526
|
---|
| 111 | Nutrition nutrit 0.15095707184225382
|
---|
| 112 | assay as 0.15095707184225382
|
---|
| 113 |
|
---|
| 114 | The first column contains keyphrases Kea has extracted from the file.
|
---|
| 115 | The second column contains stemmed versions of the keyphrases.
|
---|
| 116 | The third column is an estimate of the probability that the phrase
|
---|
| 117 | would be chosen by the author as a keyword for this paper. (See
|
---|
| 118 | Witten et al. for an explanation).
|
---|
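If you want to post-process a .kea file, the three columns can be read
with standard tools. The sketch below assumes the columns are separated
by tab characters (phrases can contain spaces, so a tab separator seems
likely, but check your own output). The function name kea_phrases_above
is made up for this example.

```shell
# Print the keyphrases (first column) of a .kea file whose probability
# (third column) is at least MIN.
# Assumption: the columns are separated by tab characters.
# Usage: kea_phrases_above FILE MIN
kea_phrases_above() {
    awk -F '\t' -v min="$2" '($3 + 0) >= (min + 0) { print $1 }' "$1"
}
```

For example, "kea_phrases_above myfile.kea 0.5" would list only the
phrases Kea rated at 0.5 or higher.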
| 119 |
|
---|
| 120 | Kea has several options. The most important is -N, which is
|
---|
| 121 | used to output a specific number of keyphrases. For example, suppose
|
---|
| 122 | you have a directory called public_html that contains a bunch of html
|
---|
| 123 | files, and you want to extract 15 phrases from each. Use the command:
|
---|
| 124 | Kea -N 15 public_html/*.html
|
---|
| 125 |
|
---|
| 126 | Kea works with three types of input file based on extensions.
|
---|
| 127 | Text files have the extension .txt or .text
|
---|
| 128 | HTML files have the extension .html or .htm
|
---|
| 129 | CSTR files have the extension .cstr
|
---|
| 130 | CSTR files are those from the CSTR collection of the NZDL, and you
|
---|
| 131 | will probably never see them. If you want Kea to work with HTML or
|
---|
| 132 | CSTR files, you will need to have the lynx web browser installed
|
---|
| 133 | (we use version 2.5).
|
---|
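The extension rule above can be sketched as a shell function. This is
only an illustration of the rule as described here, not the code Kea
itself uses; the name kea_input_type is made up for this example.

```shell
# Report which of Kea's three input types a file name maps to,
# going by its extension alone (as described above).
kea_input_type() {
    case "$1" in
        *.txt|*.text) echo text ;;
        *.html|*.htm) echo html ;;
        *.cstr)       echo cstr ;;
        *)            echo unknown ;;
    esac
}
```

A script could use this to warn about files that match none of the
three types, or to check that lynx is installed only when HTML or CSTR
files are present.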
| 134 |
|
---|
| 135 |
|
---|
| 136 | ***************
|
---|
| 137 | 4. Using models
|
---|
| 138 | ***************
|
---|
| 139 |
|
---|
| 140 | Kea extracts phrases from text files based on a "model" of
|
---|
| 141 | the way authors choose keyphrases. The model is based on a set of
|
---|
| 142 | "training documents" that have author-assigned keyphrases.
|
---|
| 143 |
|
---|
| 144 | The default model for Kea is the "aliweb" model, which is based on
|
---|
| 145 | 90 web pages from the aliweb web site. If you use a different model
|
---|
| 146 | to extract phrases from a document, it might choose different phrases.
|
---|
| 147 | See Witten et al. for details.
|
---|
| 148 |
|
---|
| 149 | You can download other models from the Kea download page, or you can
|
---|
| 150 | make your own. For example, you can download the CSTR model. This
|
---|
| 151 | model performs very well on Computer Science Technical Reports, but
|
---|
| 152 | less well on other collections. It consists of four files:
|
---|
| 153 | cstr.stopwords A list of stopwords used in text processing.
|
---|
| 154 | cstr.df The document-frequencies of some phrases in the CSTR.
|
---|
| 155 | cstr.model The Naive-Bayes model used in classification.
|
---|
| 156 | cstr.kf The keyphrase-frequencies of some phrases in the CSTR.
|
---|
| 157 | (Note: the CSTR model consists of all these files, not just cstr.model)
|
---|
| 158 |
|
---|
| 159 | If you want to use the CSTR model to extract 10 keyphrases from a file
|
---|
| 160 | called myCSdocument.text, use the command:
|
---|
| 161 | Kea -N 10 -C cstr myCSdocument.text
|
---|
| 162 |
|
---|
| 163 |
|
---|
| 164 | ****************
|
---|
| 165 | 5. Making models
|
---|
| 166 | ****************
|
---|
| 167 |
|
---|
| 168 | This section explains how to create a model that you can later use
|
---|
| 169 | to extract keyphrases. You might want to do this for a specialised
|
---|
| 170 | collection, like we did with the CSTR.
|
---|
| 171 |
|
---|
| 172 | To build a model, you will need some training data. Read Witten et al.
|
---|
| 173 | (1999) to get an idea of the amount of training data you will need.
|
---|
| 174 | (We recommend about 50 documents, but fewer will work if you don't
|
---|
| 175 | have that many.)
|
---|
| 176 |
|
---|
| 177 | Your training data should be placed in a single directory.
|
---|
| 178 | The training data consists of a set of text files (called *.txt)
|
---|
| 179 | and author keyword files (called *.key). For every .txt there
|
---|
| 180 | should be a .key file. For example if one of your text files is
|
---|
| 181 | Witten99.txt, there should be a corresponding keyword file called
|
---|
| 182 | Witten99.key. The .txt file should contain the document in plain
|
---|
| 183 | text form. The .key file should be a text file containing each
|
---|
| 184 | of the author-assigned keywords for that file, one per line.
|
---|
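Before building a model it may be worth checking that every .txt file
really has its .key file. Here is a small shell sketch of that check;
the function name missing_key_files is made up for this example and the
script is not part of Kea.

```shell
# List every .txt file in DIR that has no matching .key file.
# Usage: missing_key_files DIR
missing_key_files() {
    dir=$1
    for txt in "$dir"/*.txt; do
        [ -e "$txt" ] || continue        # DIR contains no .txt files
        key="${txt%.txt}.key"            # Witten99.txt -> Witten99.key
        [ -e "$key" ] || printf '%s\n' "$txt"
    done
}
```

If the function prints nothing, each text file in the directory has a
keyword file with the same base name.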
| 185 |
|
---|
| 186 | We have put a couple of training datasets that we have used
|
---|
| 187 | on the Kea downloads web page, if you want an example.
|
---|
| 188 |
|
---|
| 189 | Let's assume your training data is in a directory called Green.
|
---|
| 190 | We're going to use your training data to build a model called green;
|
---|
| 191 | this model will consist of four files:
|
---|
| 192 | green.stopwords, green.df, green.model, green.kf.
|
---|
| 193 |
|
---|
| 194 | First, create a "stopwords" file for your collection. The
|
---|
| 195 | stopwords are a list of words that never occur at the start
|
---|
| 196 | or end of a keyphrase. Read Witten et al. for more detail.
|
---|
| 197 | They are placed in a text file, one per line, in lowercase.
|
---|
| 198 | Kea comes with a stopwords file called aliweb.stopwords.
|
---|
| 199 | We will use it in our model:
|
---|
| 200 | cp aliweb.stopwords green.stopwords
|
---|
| 201 | You can add new stopwords for specialised collections if you
|
---|
| 202 | need to (see cstr.stopwords for an example).
|
---|
| 203 |
|
---|
| 204 | We will now create a model file (green.model) and a document
|
---|
| 205 | frequency file (green.df).
|
---|
| 206 |
|
---|
| 207 | You will need to convert all the text files to "clauses" files
|
---|
| 208 | with the command:
|
---|
| 209 | prepare-clauses-all-txt-files.pl Green
|
---|
| 210 | This will create a clauses file for every text file: for example,
|
---|
| 211 | if you have a Witten99.txt file, Witten99.clauses will be created.
|
---|
| 212 |
|
---|
| 213 | Next, you need to create an "arff" file (green.arff) and, as a
|
---|
| 214 | side effect, the document frequency file (green.df).
|
---|
| 215 | The arff file isn't part of the model; it is the input file
|
---|
| 216 | needed by the machine learning scheme to create the Naive-Bayes
|
---|
| 217 | model. Use the command:
|
---|
| 218 | k4.pl -f green.df -S green.stopwords Green green.arff
|
---|
| 219 | This command (called k4.pl for historical reasons) uses the
|
---|
| 220 | training files in the directory Green (specifically, *.clauses
|
---|
| 221 | and *.key) to create green.arff.
|
---|
| 222 | It uses green.stopwords for its stopword file, and green.df as its
|
---|
| 223 | document-frequency file. Since green.df doesn't exist when you
|
---|
| 224 | start, it will create green.df for you as it works. (If you ever
|
---|
| 225 | repeat this command, you should delete green.df first.)
|
---|
| 226 |
|
---|
| 227 | Now you need to create a Naive-Bayes model (green.model) from
|
---|
| 228 | the arff file you just built (green.arff).
|
---|
| 229 | You'll need a bit of Java knowledge here. Make sure "./jaws.jar"
|
---|
| 230 | is on your Java classpath, and type:
|
---|
| 231 | java KEP -t green.arff -m green.model
|
---|
| 232 | This will use green.arff as training data to create the
|
---|
| 233 | Naive-Bayes model, which is saved in green.model.
|
---|
| 234 |
|
---|
| 235 | The final part of the model is *optional* - the keyphrase
|
---|
| 236 | frequency file, called green.kf. It lists all the author
|
---|
| 237 | keyphrases in the training data, with the number of
|
---|
| 238 | times each occurs as a keyphrase. It is optional,
|
---|
| 239 | but it does improve performance on *specialised* collections,
|
---|
| 240 | so if you're extracting keyphrases for a specialised
|
---|
| 241 | collection for a "real" purpose, then you should use one if
|
---|
| 242 | you can. See Frank et al. for more details.
|
---|
| 243 | Each line of the file should have a stemmed phrase, followed
|
---|
| 244 | by a tab, followed by the number of times the phrase is a
|
---|
| 245 | keyphrase - see cstr.kf or aliweb.kf for an example.
|
---|
| 246 | You can make a file like this with a command like
|
---|
| 247 | cat Green/*.key | stemmer | count-lines.pl > green.kf
|
---|
| 248 | To do this you will need the stemmer and count-lines.pl
|
---|
| 249 | script provided with Kea.
|
---|
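To show the file format, here is a rough stand-in for the counting step
built from standard Unix tools. It rests on an assumption: that
count-lines.pl counts how often each distinct input line occurs, which
is what the format described above implies. The function name
count_phrases is made up for this example, and the sketch does no
stemming, so the stemmer still has to run before it.

```shell
# Read phrases on standard input, one per line, and print each distinct
# phrase followed by a tab and the number of times it occurred.
# Assumption: this mirrors what count-lines.pl produces.
count_phrases() {
    sort | uniq -c |
        awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); printf "%s\t%d\n", $0, n }'
}
```

Given already-stemmed phrases, this prints lines such as "amin ac",
a tab, and a count, which is the layout described for the .kf file.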
| 250 |
|
---|
| 251 | The model is now complete.
|
---|
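The steps in this section can be collected in one place. The sketch
below only prints the commands given above for a chosen training
directory and model name; it does not run them. The function name
kea_build_commands is made up for this example. The "rm -f" line is
there because the .df file should be deleted before k4.pl is repeated,
and the java step still assumes jaws.jar is on your classpath.

```shell
# Print, without running them, the commands from this section that
# build model NAME from the training directory DIR.
# Usage: kea_build_commands DIR NAME
kea_build_commands() {
    dir=$1
    name=$2
    cat <<EOF
cp aliweb.stopwords $name.stopwords
prepare-clauses-all-txt-files.pl $dir
rm -f $name.df
k4.pl -f $name.df -S $name.stopwords $dir $name.arff
java KEP -t $name.arff -m $name.model
cat $dir/*.key | stemmer | count-lines.pl > $name.kf
EOF
}
```

For example, "kea_build_commands Green green" prints the six commands
used in this section, which you can read through and then run one at a
time.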
| 252 |
|
---|
| 253 | To use it, put the green.df, green.model, green.stopwords, and
|
---|
| 254 | (if you have one) green.kf in the Kea directory. You can extract
|
---|
| 255 | keyphrases like this:
|
---|
| 256 | Kea -N 10 -C green myfile.txt
|
---|
| 257 |
|
---|
| 258 |
|
---|
| 259 | ****************
|
---|
| 260 | 6. The Kea files
|
---|
| 261 | ****************
|
---|
| 262 |
|
---|
| 263 | Here's a description of what the various Kea program files do.
|
---|
| 264 |
|
---|
| 265 | README: This file.
|
---|
| 266 |
|
---|
| 267 | Kea: Extracts keyphrases from text based on a model
|
---|
| 268 |
|
---|
| 269 | *.model: Naive-Bayes model object stored as a file
|
---|
| 270 | *.kf: Keyphrase-frequency file
|
---|
| 271 | *.df: Document-frequency file (aka a global-frequency file)
|
---|
| 272 | *.stopwords: Stopwords file
|
---|
| 273 |
|
---|
| 274 | stemmer: Program for stemming words with the Iterated Lovins stemmer
|
---|
| 275 | Iterated-Lovins-stemmer:
|
---|
| 276 | Directory containing code for stemmer. Some of the files are
|
---|
| 277 | copyright 1994 Linh Huynh, Gnu Public License. The others
|
---|
| 278 | are simply wrappers I have written myself.
|
---|
| 279 |
|
---|
| 280 | KEP.java: Java code for creating & using a Naive-Bayes model
|
---|
| 281 | KEP.class: Compiled version of KEP.java
|
---|
| 282 | jaws.jar: Java archive of the WEKA Java machine learning code.
|
---|
| 283 | Copyright Eibe Frank & Len Trigg, Gnu Public License.
|
---|
| 284 |
|
---|
| 285 | kea-tidy-key-file.pl:
|
---|
| 286 | Convert a .key or .kea file into a "clean" format.
|
---|
| 287 | kea-choose-best-phrase.pl:
|
---|
| 288 | Find the "best" unstemmed version of a keyphrase
|
---|
| 289 | that appears in a file in many forms.
|
---|
| 290 | prepare-clauses.pl:
|
---|
| 291 | Perl script that converts a text file to a clauses file.
|
---|
| 292 | prepare-clauses-all-txt-files.pl:
|
---|
| 293 | Applies prepare-clauses.pl to an entire directory.
|
---|
| 294 | cstr-to-text.pl:
|
---|
| 295 | Converts cstr files to text; requires lynx.
|
---|
| 296 | count-lines.pl:
|
---|
| 297 | Counts the number of times each line occurs in its input.
|
---|
| 298 |
|
---|
| 299 |
|
---|
| 300 | ***********************
|
---|
| 301 | 7. Advanced Kea options
|
---|
| 302 | ***********************
|
---|
| 303 |
|
---|
| 304 | Here is a complete list of the options to Kea. The last
|
---|
| 305 | four (-F, -K, -M, and -S) have been superseded by the -C option,
|
---|
| 306 | but still work; it's possible they are good for something.
|
---|
| 307 |
|
---|
| 308 | -d Debug mode. Working files are left in /tmp
|
---|
| 309 | -t Output TF.IDF for each phrase. Used by Kniles.
|
---|
| 310 | -N n Output n keyphrases (if possible).
|
---|
| 311 | -E ext Output files have extension ".ext" (default is ".kea")
|
---|
| 312 | -C x Use model based on corpus x.
|
---|
| 313 | Defaults to "aliweb" web page corpus.
|
---|
| 314 |
|
---|
| 315 | -F df Use document-frequency file "df".
|
---|
| 316 | Defaults to x.df where x is set by the -C argument.
|
---|
| 317 | -K kf Use keyphrase-frequency file "kf".
|
---|
| 318 | Defaults to x.kf where x is set by the -C argument.
|
---|
| 319 | -M mf Use model file "mf".
|
---|
| 320 | Defaults to x.model where x is set by the -C argument.
|
---|
| 321 | -S sf Use stopword file "sf".
|
---|
| 322 | Defaults to x.stopwords where x is set by the -C argument.
|
---|
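The relationship between -C and the four per-file options can be
sketched as follows. This is an illustration of the defaults listed
above, not Kea's own option handling; it assumes that -F follows the
same x.df pattern as the other three options. The function name
kea_model_files is made up for this example.

```shell
# Print the four file names that make up the model for corpus X.
# X defaults to "aliweb", the default corpus named above.
# Usage: kea_model_files [X]
kea_model_files() {
    corpus=${1:-aliweb}
    for ext in df kf model stopwords; do
        printf '%s.%s\n' "$corpus" "$ext"
    done
}
```

So "Kea -C cstr" should be looking for cstr.df, cstr.kf, cstr.model and
cstr.stopwords, and each of -F, -K, -M and -S overrides just one of
those names.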
| 323 |
|
---|
| 324 |
|
---|