source: tags/gsdl-2_30d-distribution/gsdl/perllib/Kea-1.1.4/README.txt@ 2308

Last change on this file since 2308 was 1972, checked in by jmt14, 23 years ago
* empty log message *
Property svn:keywords set to `Author Date Id Revision`
File size: 13.0 KB

1	Kea -- Automatic Keyphrase Extraction
2
3	Copyright 1998-1999 by Gordon Paynter and Eibe Frank
4	Contact [email protected] or [email protected]
5
6	* This program is free software; you can redistribute it and/or modify
7	* it under the terms of the GNU General Public License as published by
8	* the Free Software Foundation; either version 2 of the License, or
9	* (at your option) any later version.
10	*
11	* This program is distributed in the hope that it will be useful,
12	* but WITHOUT ANY WARRANTY; without even the implied warranty of
13	* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14	* GNU General Public License for more details.
15	*
16	* You should have received a copy of the GNU General Public License
17	* along with this program; if not, write to the Free Software
18	* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
19
20
21	***************
22	0. Introduction
23	***************
24
25	Kea is a program for extracting keyphrases from text and html files.
26	The Kea algorithm is described in these papers:
27	* Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin,
28	and Craig G. Nevill-Manning (1999) "KEA: Practical Automatic
29	Keyphrase Extraction."
30	* Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and
31	Craig G. Nevill-Manning (1999) "Domain-Specific Keyphrase Extraction."
32	These papers, and others, and our Kea implementation, are available from
33	the technology section of the New Zealand Digital Library web site at
34	http://www.nzdl.org/
35
36	Kea was mostly implemented by Gordon Paynter ([email protected])
37	and Eibe Frank ([email protected]). Craig Nevill-Manning
38	and Carl Gutwin have worked on earlier versions; there's even
39	a chance that some of their semi-colons are still in service.
40	Please contact Gordon about the general implementation or Eibe about
41	the Java side of things.
42
43	This document describes the current Kea implementation. It is divided
44	into these sections:
45	0. This introduction
46	1. Version History
47	2. System requirements
48	3. Extracting keyphrases
49	4. Using models
50	5. Making models
51	6. The Kea files
52	7. Advanced Kea options
53
54
55	******************
56	1. Version History
57	******************
58
59	There were many pre-1.0 versions of Kea; they are mostly forgotten.
60
61	Version 1.0 of Kea was the version used in the paper by Witten et al.
62	described above. It was distributed to very few people.
63
64	Version 1.1 of Kea is the first "public" version, and is available at
65	http://www.nzdl.org/Kea from March 1999.
66
67
68	**********************
69	2. System requirements
70	**********************
71
72	Kea runs under Unix. We have been running it in both Linux and Solaris.
73	Kea is implemented in Perl and Java (with the exception of the stemmer).
74
75	You must have Perl (Version 5 or greater) and Java (Version 1.1.6 or
76	greater) installed to run Kea. The main Kea program, called Kea,
77	has a variable called "$java_command" that contains the command
78	Kea will use to run java. You'll have to make sure this is set
79	correctly for your system (I can't be bothered doing it for you).
80
81	To be honest, you'll probably need some ability with Perl and Java to
82	make Kea work.
83
84	Kea uses a GPL version of the Lovins stemmer that was written in C.
85	This distribution includes a compiled version for LINUX. If you're
86	using Solaris or some other Unix, you will have to recompile it for
87	that platform. The source code is in the Iterated-Lovins-stemmer
88	directory. The README file in that directory will tell you how to
89	compile the stemmer. The program "stemmer" must be in the main directory.
90
91	(If you know of a GPL Java or Perl version of the Iterated Lovins
92	stemmer, do let me know.)
93
94
95	************************
96	3. Extracting keyphrases
97	************************
98
99	The Kea program is used to extract keyphrases from files.
100	It is a Perl script, and is used like this:
101	Kea [options] <text-or-html-or-cstr-files>
102
103	For example, if you have a text file called myfile.text, you could
104	extract keyphrases from it with this command:
105	Kea myfile.text
106
107	Kea's output will be stored in a new file called myfile.kea
108	that looks something like this:
109	protein protein 0.8135395543417774
110	amino acid amin ac 0.543230038502526
111	Nutrition nutrit 0.15095707184225382
112	assay as 0.15095707184225382
113
114	The first column contains keyphrases Kea has extracted from the file.
115	The second column contains stemmed versions of the keyphrases.
116	The third column is an estimate of the probability that the phrase
117	would be chosen by the author as a keyword for this paper. (See
118	Witten et al. for an explanation).
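
As a rough illustration of reading this output, here is a small shell
sketch that prints the phrases whose probability is at least some
threshold. It assumes the three columns are separated by tabs, which
this README does not actually state. The function name kea_top_phrases
is made up for the example and is not part of Kea.

```shell
# Print the keyphrases in a Kea output file whose probability is at
# least a given threshold.
# Assumption: columns are tab-separated (phrase, stemmed phrase, probability).
# kea_top_phrases is a hypothetical helper, not part of the Kea distribution.
kea_top_phrases() {
    # $1 = .kea file, $2 = minimum probability
    awk -F '\t' -v min="$2" '$3 + 0 >= min + 0 { print $1 }' "$1"
}
```

For example, "kea_top_phrases myfile.kea 0.5" would list only the
phrases that Kea rates at 0.5 or above.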
119
120	Kea has several options. The most important is -N, which is
121	used to output a specific number of keyphrases. For example, suppose
122	you have a directory called public_html that contains a bunch of html
123	files, and you want to extract 15 phrases from each. Use the command:
124	Kea -N 15 public_html/*.html
125
126	Kea works with three types of input file based on extensions.
127	Text files have the extension .txt or .text
128	HTML files have the extension .html or .htm
129	CSTR files have the extension .cstr
130	CSTR files are those from the CSTR collection of the NZDL, and you
131	will probably never see them. If you want Kea to work with HTML or
132	CSTR files, you will need to have the lynx web browser installed
133	(we use version 2.5).
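
The extension rules above can be sketched in shell as follows. This
only mirrors the table in this README; it is not taken from the Kea
script itself. The function names are invented for the example, and
the "unknown" result for other extensions is an assumption, since the
README does not say what Kea does with them.

```shell
# Classify an input file by extension, following the README's table.
# kea_input_type and kea_needs_lynx are hypothetical helpers, not part of Kea.
kea_input_type() {
    case "$1" in
        *.txt|*.text) echo text ;;
        *.html|*.htm) echo html ;;
        *.cstr)       echo cstr ;;
        *)            echo unknown ;;   # assumption: README does not cover this case
    esac
}

# Succeed (exit status 0) if the file type needs the lynx browser.
kea_needs_lynx() {
    case "$(kea_input_type "$1")" in
        html|cstr) return 0 ;;
        *)         return 1 ;;
    esac
}
```

A wrapper script could call kea_needs_lynx on each file and warn
early if lynx is not installed.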
134
135
136	***************
137	4. Using models
138	***************
139
140	Kea extracts phrases from text files based on a "model" of
141	the way authors choose keyphrases. The model is based on a set of
142	"training documents" that have author-assigned keyphrases.
143
144	The default model for Kea is the "aliweb" model, which is based on
145	90 web pages from the aliweb web site. If you use a different model
146	to extract phrases from a document, it might choose different phrases.
147	See Witten et al. for details.
148
149	You can download other models from the Kea download page, or you can
150	make your own. For example, you can download the CSTR model. This
151	model performs very well on Computer Science Technical Reports, but
152	less well on other collections. It consists of four files:
153	cstr.stopwords A list of stopwords used in text processing.
154	cstr.df The document-frequencies of some phrases in the CSTR.
155	cstr.model The Naive-Bayes model used in classification.
156	cstr.kf The keyphrase-frequencies of some phrases in the CSTR.
157	(Note: the CSTR model consists of all these files, not just cstr.model)
158
159	If you want to use the CSTR model to extract 10 keyphrases from a file
160	called myCSdocument.text, use the command:
161	Kea -N 10 -C cstr myCSdocument.text
162
163
164	****************
165	5. Making models
166	****************
167
168	This section explains how to create a model that you can later use
169	to extract keyphrases. You might want to do this for a specialised
170	collection, like we did with the CSTR.
171
172	To build a model, you will need some training data. Read Witten et al.
173	(1999) to get an idea of the amount of training data you will need.
174	(We recommend about 50 documents, but fewer will work if you don't
175	have that many.)
176
177	Your training data should be placed in a single directory.
178	The training data consists of a set of text files (called *.txt)
179	and author keyword files (called *.key). For every .txt there
180	should be a .key file. For example if one of your text files is
181	Witten99.txt, there should be a corresponding keyword file called
182	Witten99.key. The .txt file should contain the document in plain
183	text form. The .key file should be a text file containing each
184	of the author-assigned keywords for that file, one per line.
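
Before building a model it may be worth checking that the training
directory really has a .key file for every .txt file. A minimal
sketch, using only standard shell features (missing_key_files is a
name invented for this example, not a Kea script):

```shell
# List the .txt files in a training directory that have no matching .key file.
# missing_key_files is a hypothetical helper, not part of Kea.
missing_key_files() {
    # $1 = training directory
    for txt in "$1"/*.txt; do
        [ -e "$txt" ] || continue        # the directory contains no .txt files
        key="${txt%.txt}.key"            # Witten99.txt -> Witten99.key
        [ -f "$key" ] || echo "$txt"
    done
}
```

If the function prints nothing, every text file has its keyword file.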
185
186	We have put a couple of training datasets that we have used
187	on the Kea downloads web page, if you want an example.
188
189	Let's assume your training data is in a directory called Green.
190	We're going to use your training data to build a model called green;
191	this model will consist of four files:
192	green.stopwords, green.df, green.model, green.kf.
193
194	First, create a "stopwords" file for your collection. The
195	stopwords are a list of words that never occur at the start
196	or end of a keyphrase. Read Witten et al. for more detail.
197	They are placed in a text file, one per line, in lowercase.
198	Kea comes with a stopwords file called aliweb.stopwords.
199	We will use it in our model:
200	cp aliweb.stopwords green.stopwords
201	You can add new stopwords for specialised collections if you
202	need to (see cstr.stopwords for an example).
203
204	We will now create a model file (green.model) and a document
205	frequency file (green.df).
206
207	You will need to convert all the text files to "clauses" files
208	with the command:
209	prepare-clauses-all-txt-files.pl Green
210	This will create a clauses file for every text file: for example,
211	if you have a Witten99.txt file, Witten99.clauses will be created.
212
213	Next, you need to create an "arff" file (green.arff) and, as a
214	side effect, the document frequency file (green.df).
215	The arff file isn't part of the model; it is the input file
216	needed by the machine learning scheme to create the Naive-Bayes
217	model. Use the command:
218	k4.pl -f green.df -S green.stopwords Green green.arff
219	This command (called k4.pl for historical reasons) uses the
220	training files in the directory Green (specifically, *.clauses
221	and *.key) to create green.arff.
222	It uses green.stopwords for its stopword file, and green.df as its
223	document-frequency file. Since green.df doesn't exist when you
224	start, it will create green.df for you as it works. (If you ever
225	repeat this command, you should delete green.df first.)
226
227	Now you need to create a Naive-Bayes model (green.model) from
228	the arff file you just built (green.arff).
229	You'll need a bit of java knowledge here. Make sure "./jaws.jar"
230	is on your java classpath, and type:
231	java KEP -t green.arff -m green.model
232	This will use green.arff as training data to create the
233	Naive-Bayes model, which is saved in green.model.
234
235	The final part of the model is optional - the keyphrase
236	frequency file, called green.kf. It lists all the author
237	keyphrases in the training data, with the number of
238	times each occurs as a keyphrase. It is optional,
239	but it does improve performance on specialised collections,
240	so if you're extracting keyphrases for a specialised
241	collection for a "real" purpose, then you should use one if
242	you can. See Frank et al. for more details.
243	Each line of the file should have a stemmed phrase, followed
244	by a tab, followed by the number of times the phrase is a
245	keyphrase - see cstr.kf or aliweb.kf for an example.
246	You can make a file like this with a command like
247	cat Green/*.key | stemmer | count-lines.pl > green.kf
248	To do this you will need the stemmer and count-lines.pl
249	script provided with Kea.
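
To make the file format concrete, here is a sketch that tallies
already-stemmed phrases into "phrase<TAB>count" lines using standard
Unix tools. It is only a stand-in for illustration: it assumes that
the counting step simply tallies identical lines, which this README
does not spell out, and it does not do any stemming. The name
count_phrases is invented for the example.

```shell
# Read stemmed phrases on standard input, one per line, and write
# "phrase<TAB>count" lines in the layout described for .kf files.
# count_phrases is a hypothetical stand-in, not Kea's count-lines.pl.
count_phrases() {
    sort | uniq -c | awk '{
        n = $1                               # the count that uniq -c put first
        sub(/^[ \t]*[0-9]+[ \t]+/, "")       # strip the count, keep the phrase
        printf "%s\t%d\n", $0, n
    }'
}
```

Compare the result with cstr.kf or aliweb.kf before relying on it.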
250
251	The model is now complete.
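
The steps in this section can be collected in one place. The sketch
below only prints the commands, in order, so that they can be
reviewed before anything is run; it does not execute them. The
commands themselves are the ones given above. The function name
kea_model_steps is invented for the example, and the "rm -f" line
reflects the advice to delete the .df file before repeating k4.pl.

```shell
# Print the README's model-building commands for a training directory
# and a model name. Nothing is executed.
# kea_model_steps is a hypothetical helper, not part of Kea.
# Reminder from the README: ./jaws.jar must be on the Java classpath,
# and the "stemmer" program must be in the main Kea directory.
kea_model_steps() {
    dir=$1     # training directory, e.g. Green
    name=$2    # model name, e.g. green
    cat <<EOF
cp aliweb.stopwords $name.stopwords
prepare-clauses-all-txt-files.pl $dir
rm -f $name.df
k4.pl -f $name.df -S $name.stopwords $dir $name.arff
java KEP -t $name.arff -m $name.model
cat $dir/*.key | stemmer | count-lines.pl > $name.kf
EOF
}
```

Running "kea_model_steps Green green" prints the six commands used
in the walk-through above.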
252
253	To use it, put the green.df, green.model, green.stopwords, and
254	(if you have one) green.kf in the Kea directory. You can extract
255	keyphrases like this:
256	Kea -N 10 -C green myfile.txt
257
258
259	****************
260	6. The Kea files
261	****************
262
263	Here's a description of what the various Kea program files do.
264
265	README: This file.
266
267	Kea: Extracts keyphrases from text based on a model
268
269	*.model: Naive-Bayes model object stored as a file
270	*.kf: Keyphrase-frequency file
271	*.df: Document-frequency file (aka a global-frequency file)
272	*.stopwords: Stopwords file
273
274	stemmer: Program for stemming words with the Iterated Lovins stemmer
275	Iterated-Lovins-stemmer:
276	Directory containing code for stemmer. Some of the files are
277	copyright 1994 Linh Huynh, Gnu Public License. The others
278	are simply wrappers I have written myself.
279
280	KEP.java: Java code for creating & using a Naive-Bayes model
281	KEP.class: Compiled version of KEP.java
282	jaws.jar: Java archive of the WEKA Java machine learning code.
283	Copyright Eibe Frank & Len Trigg, Gnu Public License.
284
285	kea-tidy-key-file.pl:
286	Convert a .key or .kea file into a "clean" format.
287	kea-choose-best-phrase.pl:
288	Find the "best" unstemmed version of a keyphrase
289	that appears in a file in many forms.
290	prepare-clauses.pl:
291	Perl script that converts a text file to a clauses file.
292	prepare-clauses-all-txt-files.pl:
293	Applies prepare-clauses.pl to an entire directory.
294	cstr-to-text.pl:
295	Converts cstr files to text; requires lynx.
296	count-lines.pl:
297	Counts the lines in a file.
298
299
300	***********************
301	7. Advanced Kea options
302	***********************
303
304	Here is a complete list of the options to Kea. The last
305	four (-F, -K, -M, and -S) have been superseded by the -C option,
306	but still work; it's possible they are good for something.
307
308	-d Debug mode. Working files are left in /tmp
309	-t Output TF.IDF for each phrase. Used by Kniles.
310	-N n Output n keyphrases (if possible).
311	-E ext Output files have extension ".ext" (default is ".kea")
312	-C x Use model based on corpus x.
313	Defaults to "aliweb" web page corpus.
314
315	-F df Use document-frequency file "df".
316	Defaults to x.df where x is set by the -C argument.
317	-K kf Use keyphrase-frequency file "kf".
318	Defaults to x.kf where x is set by the -C argument.
319	-M mf Use model file "mf".
320	Defaults to x.model where x is set by the -C argument.
321	-S sf Use stopword file "sf".
322	Defaults to x.stopwords where x is set by the -C argument.
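
To show how -C relates to the four older options, this sketch spells
out the individual options that a given corpus name stands for,
going by the defaults listed above. It assumes the -F default follows
the same x.df pattern as the other three, and it says nothing about
where Kea looks for the files. The function name is invented for the
example.

```shell
# Print the -F/-K/-M/-S options implied by a corpus name, using the
# defaults listed in this README ("aliweb" when no name is given).
# kea_explicit_options is a hypothetical helper, not part of Kea.
kea_explicit_options() {
    x=${1:-aliweb}
    echo "-F $x.df -K $x.kf -M $x.model -S $x.stopwords"
}
```

So "kea_explicit_options cstr" prints the long-hand options that
"-C cstr" stands for.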
323
324
