=====================================================================

======
README
======

KEA 3.0
18 March 2004

Java Programs for Automatic Keyphrase Extraction

Copyright (C) 2000, 2001, 2004 Eibe Frank

email: [email protected]

=====================================================================

Contents:
---------

1. Installation

2. Getting started

   - Building a keyphrase extraction model
   - Extracting keyphrases
   - Important comment

3. Examples

4. Other documentation

5. Copyright

----------------------------------------------------------------------

NOTE:
-----

This distribution includes a cut-down version of WEKA, the GPL'ed
machine learning workbench available from

  http://www.cs.waikato.ac.nz/ml/weka.

----------------------------------------------------------------------

1. Installation:
----------------

KEA is implemented as a set of Java classes (located in the same
directory as this README file). To run KEA you need to tell the Java
Virtual Machine where to look for KEA classes. One possible way of
doing this is to add the directory that contains this README file to
the CLASSPATH environment variable that is used by the Java Virtual
Machine.

Under Linux you would do the following:

a) Set KEAHOME to be the directory which contains this README.

b) Add $KEAHOME to your CLASSPATH environment variable.

The on-line documentation (generated from the source code) is located
in the doc directory. You might want to do the following to have the
documentation handy in your web browser:

c) Bookmark $KEAHOME/doc/packages.html in your web browser.
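
As a rough illustration of steps a) and b), the following sketch
sets both variables in a Bourne-style shell. The path /opt/kea-3.0
is only an assumed install location; substitute the directory that
actually holds this README.

```shell
# Assumed install location (hypothetical); use the directory that
# contains this README on your system.
KEAHOME=/opt/kea-3.0
export KEAHOME

# Append KEAHOME to CLASSPATH, keeping any entries that are already set.
CLASSPATH="${CLASSPATH:+$CLASSPATH:}$KEAHOME"
export CLASSPATH

echo "$CLASSPATH"
```

Putting these lines into your shell start-up file (for example
~/.profile) makes the setting permanent.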

----------------------------------------------------------------------

2. Getting started:
-------------------

Building a keyphrase extraction model
=====================================

To extract keyphrases for new documents, you first need to build a KEA
keyphrase extraction model from a set of documents (preferably from
the same domain) for which you have author-assigned keyphrases. To
this end you have to go through the following steps:

a) Create a directory, called, for example, "training_documents",
   containing the documents that you want to use for training the
   keyphrase extractor.

b) Rename the document files in that directory so that they end with
   the suffix ".txt".

c) Delete the author-assigned keyphrases from those documents
   and put them into separate ".key" files. For example, if
   your document file is called doc1.txt, move the keyphrases
   into a new file called "doc1.key". It is important that
   you put each keyphrase on a separate line in the .key file!

d) Build the keyphrase extraction model by running the
   KEAModelBuilder:

     java KEAModelBuilder -l <name_of_directory> -m <name_of_model>

   This will use the documents in <name_of_directory> to build a
   keyphrase extraction model and save it in <name_of_model>.
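
Steps a) to c) can be sketched in the shell roughly as follows. The
directory "raw", the file doc1.text, and the sample keyphrases are
made up for illustration; the sketch works in a temporary directory
so that it does not touch real data.

```shell
# Scratch area with one hypothetical source document.
work=$(mktemp -d)
mkdir -p "$work/raw" "$work/training_documents"
printf 'Some text about keyphrase extraction.\n' > "$work/raw/doc1.text"

# Steps a) and b): copy every document into the training directory,
# replacing whatever suffix it had with ".txt".
for f in "$work"/raw/*; do
    base=$(basename "$f")
    cp "$f" "$work/training_documents/${base%.*}.txt"
done

# Step c): write the author-assigned keyphrases, one per line, into
# the matching .key file (sample phrases, not real data).
printf '%s\n' 'keyphrase extraction' 'machine learning' \
    > "$work/training_documents/doc1.key"

ls "$work/training_documents"
```

Step d) would then be run with -l pointing at the training_documents
directory created above.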

KEAModelBuilder has a few other options that you can view if you run
KEAModelBuilder without any arguments. Here is a list of all the
options:

  -l <directory name>
        Specifies name of directory.
  -m <model name>
        Specifies name of model.
  -e <encoding>
        Specifies encoding.
  -d
        Turns debugging mode on.
  -k
        Use keyphrase frequency statistic.
  -p
        Disallow internal periods.
  -x <length>
        Sets the maximum phrase length (default: 3).
  -y <length>
        Sets the minimum phrase length (default: 1).
  -o <number>
        The minimum number of times a phrase needs to occur
        (default: 2).
  -s <name of stopwords class>
        Sets the list of stopwords to use (default: StopwordsEnglish).
  -t <name of stemmer class>
        Sets the stemmer to use (default: IteratedLovinsStemmer).
  -n
        Do not check for proper nouns.

The -e option allows you to specify a different character encoding
supported by Java. For example, to extract keyphrases from Chinese
documents encoded using GBK, you would specify "-e GBK" as an
argument.

The -d option generates some output that shows the progress of the
model builder.

If -k is set, the keyphrase frequency attribute is used in the
model. For more info on this, have a look at the paper on
"Domain-specific keyphrase extraction" listed below. Using this option
improves accuracy if the domain of the documents for which you want to
extract keyphrases is the same as the domain of the training
documents. In other words, if you want to extract keyphrases from
papers on radiology, and your training documents are about radiology,
you should use this option.

If -p is set, KEA does not consider phrases with internal periods as
candidate keyphrases. It is important to use this if a full stop is
not always followed by white space in the documents.

Using -s and -t you can set different classes for stopword detection
and stemming respectively (for languages other than English).

Using -n you turn KEA's heuristic for detecting proper nouns off. This
is important for languages like German, where all nouns start with an
uppercase letter, not just proper nouns.

Extracting keyphrases
=====================

To extract keyphrases for some documents, put them into an empty
directory. Then rename them so that they end with the suffix ".txt".

If you've previously built a keyphrase extraction model you can now
extract keyphrases for these documents using:

  java KEAKeyphraseExtractor -l <name_of_directory> -m <name_of_model>

This will create a ".key" file for each document in the
directory. By default, each file will contain five extracted
keyphrases for the corresponding document.

If a ".key" file is already present it won't be overwritten. Instead,
the keyphrases present in that file will be used to evaluate the
extraction model. The stemmed extracted phrases are compared to the
stemmed versions of the phrases in the ".key"
file. KEAKeyphraseExtractor reports the number of hits among the total
number of extracted phrases for those documents that have associated
".key" files in the directory.
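
To make the "hits among extracted phrases" figure concrete, here is
a small shell sketch that counts matches between two phrase lists.
It is not KEA's own evaluation code: KEA compares stemmed phrases,
whereas this sketch only ignores case, so it can undercount. The
file names author.key and extracted.key and their contents are
invented for the example.

```shell
# Scratch area with two hypothetical phrase lists, one phrase per line.
work=$(mktemp -d)
printf '%s\n' 'machine learning' 'keyphrase extraction' 'text mining' \
    > "$work/author.key"
printf '%s\n' 'Keyphrase extraction' 'neural networks' 'machine learning' \
    'java' 'weka' > "$work/extracted.key"

# A hit is an extracted phrase that equals an author-assigned phrase
# as a whole line (-x), taken literally (-F), ignoring case (-i).
hits=$(grep -c -i -x -F -f "$work/author.key" "$work/extracted.key")
total=$(wc -l < "$work/extracted.key" | tr -d ' ')

echo "$hits hits among $total extracted phrases"
```

With the sample lists above, two of the five extracted phrases match
an author-assigned phrase.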

KEAKeyphraseExtractor has a few options. Here they are:

  -l <directory name>
        Specifies name of directory.
  -m <model name>
        Specifies name of model.
  -e <encoding>
        Specifies encoding.
  -n <number>
        Specifies number of phrases to be output (default: 5).
  -d
        Turns debugging mode on.
  -a
        Also write stemmed phrase and score into ".key" file.

Important comment
=================

To get good results, it is important that the input text for KEA is as
"clean" as possible. That means HTML tags etc. in the input documents
need to be deleted before the model is built and before keyphrases are
extracted from new documents.
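
One simple way to do such cleaning, shown here only as a sketch, is
to strip tags with sed before placing the text in the input
directory. The file names page.html and page.txt are invented for
the example. The pattern removes only tags that open and close on
the same line; it does not handle tags that span lines, script or
style contents, or character entities such as &amp;.

```shell
# Scratch area with a hypothetical HTML document.
work=$(mktemp -d)
printf '<html><body><p>Keyphrase <b>extraction</b> with KEA.</p></body></html>\n' \
    > "$work/page.html"

# Remove everything between "<" and the next ">" on each line.
sed -e 's/<[^>]*>//g' "$work/page.html" > "$work/page.txt"

cat "$work/page.txt"
```

For documents with more complex markup a dedicated HTML-to-text
converter is likely to give cleaner input.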

----------------------------------------------------------------------

3. Examples:
------------

The directory contains two example collections, each split up into a
train and test directory. Note that these collections are only
included to show how the system can be applied to actual documents.
Due to the lack of data, the accuracy isn't very good on either
example collection.

Collection A
------------

A collection of abstracts taken from computer science technical
reports:

  CSTR_abstracts_train
  CSTR_abstracts_test

To build a model from the training data, try:

  java KEAModelBuilder -l CSTR_abstracts_train -m CSTR_abstracts_model

To evaluate that model on the test data, try:

  java KEAKeyphraseExtractor -l CSTR_abstracts_test -m CSTR_abstracts_model

Collection B
------------

A small collection of Chinese documents (in GBK encoding):

  Journals_train
  Journals_test

To build a model from the training data, try:

  java KEAModelBuilder -l Chinese_train -m Chinese_model -e GBK

To evaluate that model on the test data, try:

  java KEAKeyphraseExtractor -l Chinese_test -m Chinese_model -e GBK

----------------------------------------------------------------------

4. Other documentation:
-----------------------

There are several papers on the KEA algorithm, listed below. Note that
this implementation differs slightly from the version described in the
papers, mainly in the pre-processing step (i.e. in the way candidate
keyphrases are generated). For more info on the new method please
consult the online documentation.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning
C.G. (2000) "KEA: Practical automatic keyphrase extraction." Working
Paper 00/5, Department of Computer Science, The University of Waikato.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning
C.G. (1999) "KEA: Practical automatic keyphrase extraction." Proc. DL
'99, pp. 254-256. (Poster presentation.)

Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning
C.G. (1999) "Domain-specific keyphrase extraction." Proc. Sixteenth
International Joint Conference on Artificial Intelligence, Morgan
Kaufmann Publishers, San Francisco, CA, pp. 668-673.

----------------------------------------------------------------------

5. Copyright:
-------------

KEA is distributed under the GNU General Public License. Please read
the file COPYING.

----------------------------------------------------------------------