source: trunk/gsdl/packages/kea/kea-3.0/README@ 8815

Last change on this file since 8815 was 8815, checked in by mdewsnip, 19 years ago
Kea 3.0, as downloaded from http://www.nzdl.org/kea but with CSTR_abstracts_test, CSTR_abstracts_train, Chinese_test, and Chinese_train directories removed.
Property svn:keywords set to `Author Date Id Revision`
File size: 9.4 KB

=====================================================================

======
README
======

KEA 3.0
18 March 2004

Java Programs for Automatic Keyphrase Extraction

Copyright (C) 2000, 2001, 2004 Eibe Frank

email: [email protected]

=====================================================================

Contents:
---------

1. Installation

2. Getting started

   - Building a keyphrase extraction model
   - Extracting keyphrases
   - Important comment

3. Examples

4. Other documentation

5. Copyright

----------------------------------------------------------------------

NOTE:
-----

This distribution includes a cut-down version of WEKA, the GPL'ed
machine learning workbench available from

  http://www.cs.waikato.ac.nz/ml/weka.

----------------------------------------------------------------------

1. Installation:
----------------

KEA is implemented as a set of Java classes (located in the same
directory as this README file). To run KEA you need to tell the Java
Virtual Machine where to look for the KEA classes. One way of doing
this is to add the directory that contains this README file to the
CLASSPATH environment variable that is used by the Java Virtual
Machine.

Under Linux you would do the following:

a) Set KEAHOME to be the directory which contains this README.

b) Add $KEAHOME to your CLASSPATH environment variable.

The on-line documentation (generated from the source code) is located
in the doc directory. You might want to do the following to have the
documentation handy in your web browser:

c) Bookmark $KEAHOME/doc/packages.html in your web browser.
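
Steps a) and b) can be sketched in a Bourne-style shell as follows.
This is only an illustration: the location $HOME/kea-3.0 is an assumed
install path, not something the distribution prescribes, so substitute
the directory that actually contains this README.

```shell
# Assumption: KEA was unpacked into $HOME/kea-3.0. Adjust as needed.
KEAHOME="${KEAHOME:-$HOME/kea-3.0}"
export KEAHOME

# Append KEAHOME to CLASSPATH, keeping any entries that are already set.
if [ -n "${CLASSPATH:-}" ]; then
    CLASSPATH="$CLASSPATH:$KEAHOME"
else
    CLASSPATH="$KEAHOME"
fi
export CLASSPATH
```

Placing these lines in your shell startup file (for example
~/.profile) should make the setting available in every new session.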

----------------------------------------------------------------------

2. Getting started:
-------------------

Building a keyphrase extraction model
=====================================

To extract keyphrases for new documents, you first need to build a KEA
keyphrase extraction model from a set of documents (preferably from
the same domain) for which you have author-assigned keyphrases. To
this end you have to go through the following steps:

a) Create a directory, called, for example, "training_documents",
   containing the documents that you want to use for training the
   keyphrase extractor.

b) Rename the document files in that directory so that they end with
   the suffix ".txt".

c) Delete the author-assigned keyphrases from those documents
   and put them into separate ".key" files. For example, if
   your document file is called doc1.txt, move the keyphrases
   into a new file called "doc1.key". It is important that
   you put each keyphrase on a separate line in the .key file!

d) Build the keyphrase extraction model by running the
   KEAModelBuilder:

     java KEAModelBuilder -l <name_of_directory> -m <name_of_model>

   This will use the documents in <name_of_directory> to build a
   keyphrase extraction model and save it in <name_of_model>.
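
Before running step d) it can help to check that steps b) and c) are
complete. The shell function below is a hypothetical helper, not part
of KEA: it gives every file in a directory the ".txt" suffix unless it
already ends in ".txt" or ".key", and then lists the documents that
have no matching ".key" file. Moving the keyphrases out of the
documents themselves still has to be done by hand.

```shell
# prepare_kea_dir DIR
# Hypothetical helper (not part of KEA). Renames documents in DIR so that
# they end in ".txt" and reports documents without a ".key" file.
prepare_kea_dir() {
    dir=$1

    # Step b): add the ".txt" suffix where it is missing.
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.txt|*.key) ;;               # already named as KEA expects
            *) mv -- "$f" "$f.txt" ;;     # e.g. doc1 becomes doc1.txt
        esac
    done

    # Step c): every document needs a ".key" file with the same base name.
    for doc in "$dir"/*.txt; do
        [ -f "$doc" ] || continue
        key="${doc%.txt}.key"
        [ -f "$key" ] || echo "missing keyphrase file: $key"
    done
}
```

For example, "prepare_kea_dir training_documents" renames the files in
place. Any "missing keyphrase file" line points at a document for
which step c) has not been carried out yet.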

KEAModelBuilder has a few other options that you can view if you run
KEAModelBuilder without any arguments. Here is a list of all the
options:

  -l <directory name>
      Specifies name of directory.
  -m <model name>
      Specifies name of model.
  -e <encoding>
      Specifies encoding.
  -d
      Turns debugging mode on.
  -k
      Use keyphrase frequency statistic.
  -p
      Disallow internal periods.
  -x <length>
      Sets the maximum phrase length (default: 3).
  -y <length>
      Sets the minimum phrase length (default: 1).
  -o <number>
      The minimum number of times a phrase needs to occur
      (default: 2).
  -s <name of stopwords class>
      Sets the list of stopwords to use (default: StopwordsEnglish).
  -t <name of stemmer class>
      Set the stemmer to use (default: IteratedLovinsStemmer).
  -n
      Do not check for proper nouns.

The -e option allows you to specify a different character encoding
supported by Java. For example, to extract keyphrases from Chinese
documents encoded using GBK, you would specify "-e GBK" as an
argument.

The -d option generates some output that shows the progress of the
model builder.

If -k is set, the keyphrase frequency attribute is used in the
model. For more info on this, have a look at the paper on
"Domain-specific keyphrase extraction" listed below. Using this option
improves accuracy if the domain of the documents for which you want to
extract keyphrases is the same as the domain of the training
documents. In other words, if you want to extract keyphrases from
papers on radiology, and your training documents are about radiology,
you should use this option.

If -p is set, KEA does not consider phrases with internal periods as
candidate keyphrases. It is important to use this if a full stop is
not always followed by white space in the documents.
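
To judge whether -p is likely to matter for a collection, a rough
check is to look for full stops that are immediately followed by a
letter or digit. The function below is a hypothetical helper, not part
of KEA, and the pattern is only a heuristic; how KEA itself decides
what counts as an internal period may differ.

```shell
# has_internal_periods FILE
# Hypothetical helper (not part of KEA). Succeeds if FILE contains a
# period directly followed by a letter or digit, e.g. "end.Next" or "3.5".
has_internal_periods() {
    grep -q '\.[[:alnum:]]' "$1"
}
```

For example, running it over each ".txt" file in the training
directory shows which documents contain such periods.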

Using -s and -t you can set different classes for stopword detection
and stemming respectively (for languages other than English).

Using -n you turn KEA's heuristic for detecting proper nouns off. This
is important for languages like German, where all nouns start with an
uppercase letter, not just proper nouns.

Extracting keyphrases
=====================

To extract keyphrases for some documents, put them into an empty
directory. Then rename them so that they end with the suffix ".txt".

If you've previously built a keyphrase extraction model you can now
extract keyphrases for these documents using:

  java KEAKeyphraseExtractor -l <name_of_directory> -m <name_of_model>

This will create a ".key" file for each document in the
directory. Each file will contain the extracted keyphrases for the
corresponding document (five by default).

If a ".key" file is already present it won't be overwritten. Instead,
the keyphrases present in that file will be used to evaluate the
extraction model. The stemmed extracted phrases are compared to the
stemmed versions of the phrases in the ".key"
file. KEAKeyphraseExtractor reports the number of hits among the total
number of extracted phrases for those documents that have associated
".key" files in the directory.
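
The idea of counting hits can be illustrated with a small shell
function. Note that this is a deliberate simplification and not what
KEAKeyphraseExtractor does internally: it compares whole lines while
ignoring case, with no stemming, so its counts can be lower than the
ones KEA reports. The function name and the two-file layout are
assumptions made for the example.

```shell
# count_hits EXTRACTED GOLD
# Hypothetical helper (not part of KEA). Prints how many lines of the
# EXTRACTED file also occur, ignoring case, as whole lines in GOLD.
# No stemming is applied, unlike KEA's own evaluation.
count_hits() {
    # -F fixed strings, -x whole-line match, -i ignore case, -c count.
    # grep exits with status 1 when the count is 0, hence "|| true".
    grep -F -x -i -c -f "$2" "$1" || true
}
```

One way to obtain the two files is to run the extractor on a copy of
the test directory that has no ".key" files, and then compare each
generated ".key" file (written without -a) with the author-assigned
one.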

KEAKeyphraseExtractor has a few options. Here they are:

  -l <directory name>
      Specifies name of directory.
  -m <model name>
      Specifies name of model.
  -e <encoding>
      Specifies encoding.
  -n <number>
      Specifies number of phrases to be output (default: 5).
  -d
      Turns debugging mode on.
  -a
      Also write stemmed phrase and score into ".key" file.

Important comment
=================

To get good results, it is important that the input text for KEA is as
"clean" as possible. That means HTML tags etc. in the input documents
need to be deleted before the model is built and before keyphrases are
extracted from new documents.
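
As a minimal sketch of such cleaning, the shell function below removes
anything that looks like a tag. It is a hypothetical helper, not part
of KEA, and it is crude: tags that span several lines, character
entities such as "&amp;", and script or style content are left alone,
so real HTML usually calls for a proper converter.

```shell
# strip_tags
# Hypothetical helper (not part of KEA). Copies stdin to stdout with
# every <...> sequence that sits on a single line removed.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}
```

For example, "strip_tags < page.html > training_documents/page.txt"
writes a cleaned copy of page.html (both file names are placeholders).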

----------------------------------------------------------------------

3. Examples:
------------

The directory contains two example collections, each split up into a
train and test directory. Note that these collections are only
included to show how the system can be applied to actual documents.
Due to the lack of data, the accuracy isn't very good on either
example collection.

Collection A
------------

A collection of abstracts taken from computer science technical
reports:

  CSTR_abstracts_train
  CSTR_abstracts_test

To build a model from the training data, try:

  java KEAModelBuilder -l CSTR_abstracts_train -m CSTR_abstracts_model

To evaluate that model on the test data, try:

  java KEAKeyphraseExtractor -l CSTR_abstracts_test -m CSTR_abstracts_model

Collection B
------------

A small collection of Chinese documents (in GBK encoding):

  Chinese_train
  Chinese_test

To build a model from the training data, try:

  java KEAModelBuilder -l Chinese_train -m Chinese_model -e GBK

To evaluate that model on the test data, try:

  java KEAKeyphraseExtractor -l Chinese_test -m Chinese_model -e GBK

----------------------------------------------------------------------

4. Other documentation:
-----------------------

There are several papers on the KEA algorithm, listed below. Note that
this implementation differs slightly from the version described in the
papers, mainly in the pre-processing step (i.e. in the way candidate
keyphrases are generated). For more info on the new method please
consult the online documentation.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning
C.G. (2000) "KEA: Practical automatic keyphrase extraction." Working
Paper 00/5, Department of Computer Science, The University of Waikato.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning
C.G. (1999) "KEA: Practical automatic keyphrase extraction." Proc. DL
'99, pp. 254-256. (Poster presentation.)

Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning
C.G. (1999) "Domain-specific keyphrase extraction." Proc. Sixteenth
International Joint Conference on Artificial Intelligence, Morgan
Kaufmann Publishers, San Francisco, CA, pp. 668-673.

----------------------------------------------------------------------

5. Copyright:
-------------

KEA is distributed under the GNU General Public License. Please read
the file COPYING.

----------------------------------------------------------------------