source: trunk/gsdl/packages/kea/kea-3.0/README@ 8815

Last change on this file since 8815 was 8815, checked in by mdewsnip, 19 years ago

Kea 3.0, as downloaded from http://www.nzdl.org/kea but with CSTR_abstracts_test, CSTR_abstracts_train, Chinese_test, and Chinese_train directories removed.

=====================================================================

                               ======
                               README
                               ======

                               KEA 3.0
                            18 March 2004

          Java Programs for Automatic Keyphrase Extraction

              Copyright (C) 2000, 2001, 2004 Eibe Frank

                       email: [email protected]

=====================================================================

Contents:
---------

1. Installation

2. Getting started

   - Building a keyphrase extraction model
   - Extracting keyphrases
   - Important comment

3. Examples

4. Other documentation

5. Copyright

----------------------------------------------------------------------

NOTE:
-----

This distribution includes a cut-down version of WEKA, the GPL'ed
machine learning workbench available from

  http://www.cs.waikato.ac.nz/ml/weka.

----------------------------------------------------------------------

1. Installation:
----------------

KEA is implemented as a set of Java classes (located in the same
directory as this README file). To run KEA you need to tell the Java
Virtual Machine where to look for KEA classes. One possible way of
doing this is to add the directory that contains this README file to
the CLASSPATH environment variable that is used by the Java Virtual
Machine.

Under Linux you would do the following:

a) Set KEAHOME to be the directory which contains this README.

b) Add $KEAHOME to your CLASSPATH environment variable.

The on-line documentation (generated from the source code) is located
in the doc directory. You might want to do the following to have the
documentation handy in your web browser:

c) Bookmark $KEAHOME/doc/packages.html in your web browser.
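
In a Bourne-style shell, steps a) and b) might look like the sketch
below. The install location $HOME/kea-3.0 is only an assumption made
for illustration; substitute the directory that actually contains this
README.

```shell
# Assumed install location (hypothetical); point this at the directory
# that contains this README.
KEAHOME="$HOME/kea-3.0"
export KEAHOME

# Append KEAHOME to CLASSPATH, avoiding a leading ':' when CLASSPATH
# was previously empty or unset.
CLASSPATH="${CLASSPATH:+$CLASSPATH:}$KEAHOME"
export CLASSPATH
```

Putting these lines into your shell start-up file (for example
~/.profile) keeps the setting across sessions.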

----------------------------------------------------------------------

2. Getting started:
-------------------

Building a keyphrase extraction model
=====================================

To extract keyphrases for new documents, you first need to build a KEA
keyphrase extraction model from a set of documents (preferably from
the same domain) for which you have author-assigned keyphrases. To
this end you have to go through the following steps:

a) Create a directory, called, for example, "training_documents",
   containing the documents that you want to use for training the
   keyphrase extractor.

b) Rename the document files in that directory so that they end with
   the suffix ".txt".

c) Delete the author-assigned keyphrases from those documents
   and put them into separate ".key" files. For example, if
   your document file is called doc1.txt, move the keyphrases
   into a new file called "doc1.key". It is important that
   you put each keyphrase on a separate line in the .key file!

d) Build the keyphrase extraction model by running the
   KEAModelBuilder:

   java KEAModelBuilder -l <name_of_directory> -m <name_of_model>

   This will use the documents in <name_of_directory> to build a
   keyphrase extraction model and save it in <name_of_model>.
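
Steps a) to c) can be sketched in shell as follows. Everything in the
sketch is made up for illustration: the scratch directory, the file
names, the sample text, and the assumption that the author keyphrases
already sit in a separate comma-separated ".keywords" file. Adapt it
to however your own documents are stored.

```shell
# Everything here is illustrative: the file names and sample text are
# made up, and a scratch directory stands in for your real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A raw document without the ".txt" suffix, plus its author keyphrases
# as a single comma-separated line (hypothetical input format).
mkdir raw_docs
printf 'Sample body text of the first document.\n' > raw_docs/doc1
printf 'keyphrase extraction, machine learning\n' > raw_docs/doc1.keywords

# Steps a) and b): copy each document into training_documents with a
# ".txt" suffix.
mkdir training_documents
for f in raw_docs/*; do
    case "$f" in *.keywords) continue ;; esac
    cp "$f" "training_documents/$(basename "$f").txt"
done

# Step c): write one keyphrase per line into the matching ".key" file,
# trimming blanks around each phrase.
for k in raw_docs/*.keywords; do
    name=$(basename "$k" .keywords)
    tr ',' '\n' < "$k" | sed -e 's/^ *//' -e 's/ *$//' \
        > "training_documents/$name.key"
done

# Step d) would then be (not run here; "my_model" is a placeholder):
#   java KEAModelBuilder -l training_documents -m my_model
```

Afterwards training_documents holds doc1.txt and doc1.key, with one
keyphrase per line in the ".key" file.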

KEAModelBuilder has a few other options that you can view if you run
KEAModelBuilder without any arguments. Here is a list of all the
options:

-l <directory name>
    Specifies name of directory.
-m <model name>
    Specifies name of model.
-e <encoding>
    Specifies encoding.
-d
    Turns debugging mode on.
-k
    Use keyphrase frequency statistic.
-p
    Disallow internal periods.
-x <length>
    Sets the maximum phrase length (default: 3).
-y <length>
    Sets the minimum phrase length (default: 1).
-o <number>
    The minimum number of times a phrase needs to occur
    (default: 2).
-s <name of stopwords class>
    Sets the list of stopwords to use (default: StopwordsEnglish).
-t <name of stemmer class>
    Set the stemmer to use (default: IteratedLovinsStemmer).
-n
    Do not check for proper nouns.

The -e option allows you to specify a different character encoding
supported by Java. For example, to extract keyphrases from Chinese
documents encoded using GBK, you would specify "-e GBK" as an
argument.

The -d option generates some output that shows the progress of the
model builder.

If -k is set, the keyphrase frequency attribute is used in the
model. For more info on this, have a look at the paper on
"Domain-specific keyphrase extraction" listed below. Using this option
improves accuracy if the domain of the documents for which you want to
extract keyphrases is the same as the domain of the training
documents. In other words, if you want to extract keyphrases from
papers on radiology, and your training documents are about radiology,
you should use this option.

If -p is set, KEA does not consider phrases with internal periods as
candidate keyphrases. It is important to use this if a full stop is
not always followed by white space in the documents.
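
To judge whether -p is worth setting, a rough check along the
following lines may help. The helper name find_internal_periods is
hypothetical and not part of KEA; it simply lists the ".txt" files in
which some period is followed directly by a non-whitespace character.

```shell
# List the ".txt" files in a directory that contain a period followed
# directly by a non-whitespace character (for example "end.Next").
# Hypothetical helper; "$1" is the document directory.
find_internal_periods() {
    grep -l '\.[^[:space:]]' "$1"/*.txt
}
```

For example, "find_internal_periods training_documents" prints the
affected files. The check also flags decimal numbers such as 3.14 and
abbreviations such as "e.g.", so treat the result as an indicator
rather than a verdict.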

Using -s and -t you can set different classes for stopword detection
and stemming respectively (for languages other than English).

Using -n you turn KEA's heuristic for detecting proper nouns off. This
is important for languages like German, where all nouns start with an
uppercase letter, not just proper nouns.

Extracting keyphrases
=====================

To extract keyphrases for some documents, put them into an empty
directory. Then rename them so that they end with the suffix ".txt".

If you've previously built a keyphrase extraction model you can now
extract keyphrases for these documents using:

java KEAKeyphraseExtractor -l <name_of_directory> -m <name_of_model>

This will create a ".key" file for each document in the
directory. Each file will contain five extracted keyphrases for the
corresponding document.

If a ".key" file is already present it won't be overwritten. Instead,
the keyphrases present in that file will be used to evaluate the
extraction model. The stemmed extracted phrases are compared to the
stemmed versions of the phrases in the ".key"
file. KEAKeyphraseExtractor reports the number of hits among the total
number of extracted phrases for those documents that have associated
".key" files in the directory.
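
The idea behind the hit count can be illustrated with a small shell
helper. This is a simplified sketch and not KEA's own evaluation: it
compares whole lines while ignoring case and does no stemming, so it
can report fewer hits than KEA would. The name count_hits and the
assumption that the two phrase lists are available as separate files
are made up for illustration.

```shell
# Count how many extracted phrases also appear, as whole lines and
# ignoring case, in the author-assigned list. Unlike KEA, this does
# no stemming, so it can undercount. Hypothetical helper:
#   $1 = file with author-assigned phrases (one per line)
#   $2 = file with extracted phrases (one per line)
count_hits() {
    hits=$(grep -Fxic -f "$1" "$2")
    total=$(grep -c '' "$2")
    echo "$hits hits out of $total extracted phrases"
}
```

It would be called as "count_hits author.key extracted.key", where
both file names are placeholders.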

KEAKeyphraseExtractor has a few options. Here they are:

-l <directory name>
    Specifies name of directory.
-m <model name>
    Specifies name of model.
-e <encoding>
    Specifies encoding.
-n <number>
    Specifies number of phrases to be output (default: 5).
-d
    Turns debugging mode on.
-a
    Also write stemmed phrase and score into ".key" file.

Important comment
-----------------

To get good results, it is important that the input text for KEA is as
"clean" as possible. That means HTML tags etc. in the input documents
need to be deleted before the model is built and before keyphrases are
extracted from new documents.
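
As a starting point for such cleaning, a crude tag stripper can be
written with sed. The helper name strip_tags is hypothetical and the
approach is deliberately simple: it does not handle tags that span
several lines, the contents of script or style elements, or character
entities such as "&amp;".

```shell
# Crude tag stripper: removes anything that looks like <...> on a
# single line. Reads stdin, writes stdout. Hypothetical helper.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}
```

A typical call would be "strip_tags < page.html > page.txt", with both
file names being placeholders. For heavily marked-up input a proper
HTML-to-text converter is likely to give cleaner results.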

----------------------------------------------------------------------

3. Examples:
------------

The directory contains two example collections, each split up into a
train and test directory. Note that these collections are only
included to show how the system can be applied to actual documents.
Due to the lack of data, the accuracy isn't very good on either
example collection.

Collection A
------------

A collection of abstracts taken from computer science technical
reports:

  CSTR_abstracts_train
  CSTR_abstracts_test

To build a model from the training data, try:

  java KEAModelBuilder -l CSTR_abstracts_train -m CSTR_abstracts_model

To evaluate that model on the test data, try:

  java KEAKeyphraseExtractor -l CSTR_abstracts_test -m CSTR_abstracts_model

Collection B
------------

A small collection of Chinese documents (in GBK encoding):

  Chinese_train
  Chinese_test

To build a model from the training data, try:

  java KEAModelBuilder -l Chinese_train -m Chinese_model -e GBK

To evaluate that model on the test data, try:

  java KEAKeyphraseExtractor -l Chinese_test -m Chinese_model -e GBK

----------------------------------------------------------------------

4. Other documentation:
-----------------------

There are several papers on the KEA algorithm, listed below. Note that
this implementation differs slightly from the version described in the
papers, mainly in the pre-processing step (i.e. in the way candidate
keyphrases are generated). For more info on the new method please
consult the online documentation.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning
C.G. (2000) "KEA: Practical automatic keyphrase extraction." Working
Paper 00/5, Department of Computer Science, The University of Waikato.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning
C.G. (1999) "KEA: Practical automatic keyphrase extraction." Proc. DL
'99, pp. 254-256. (Poster presentation.)

Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning
C.G. (1999) "Domain-specific keyphrase extraction." Proc. Sixteenth
International Joint Conference on Artificial Intelligence, Morgan
Kaufmann Publishers, San Francisco, CA, pp. 668-673.

-----------------------------------------------------------------------

5. Copyright:
-------------

KEA is distributed under the GNU General Public License. Please read
the file COPYING.

-----------------------------------------------------------------------