source: trunk/gsdl/perllib/Kea-1.1.4/README.txt@ 2308

Last change on this file since 2308 was 1972, checked in by jmt14, 23 years ago
Kea -- Automatic Keyphrase Extraction

Copyright 1998-1999 by Gordon Paynter and Eibe Frank
Contact [email protected] or [email protected]

* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.


***************
0. Introduction
***************

Kea is a program for extracting keyphrases from text and HTML files.
The Kea algorithm is described in these papers:
  * Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin,
    and Craig G. Nevill-Manning (1999) "KEA: Practical Automatic
    Keyphrase Extraction."
  * Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and
    Craig G. Nevill-Manning (1999) "Domain-Specific Keyphrase Extraction."
These papers, and others, and our Kea implementation, are available from
the technology section of the New Zealand Digital Library web site at
http://www.nzdl.org/

Kea was mostly implemented by Gordon Paynter ([email protected])
and Eibe Frank ([email protected]). Craig Nevill-Manning
and Carl Gutwin worked on earlier versions; there's even
a chance that some of their semicolons are still in service.
Please contact Gordon about the general implementation or Eibe about
the Java side of things.

This document describes the current Kea implementation. It is divided
into these sections:
  0. This introduction
  1. Version History
  2. System requirements
  3. Extracting keyphrases
  4. Using models
  5. Making models
  6. The Kea files
  7. Advanced Kea options


******************
1. Version History
******************

There were many pre-1.0 versions of Kea; they are mostly forgotten.

Version 1.0 of Kea was the version used in the paper by Witten et al.
described above. It was distributed to very few people.

Version 1.1 of Kea is the first "public" version, and has been available
at http://www.nzdl.org/Kea since March 1999.


**********************
2. System requirements
**********************

Kea runs under Unix. We have been running it on both Linux and Solaris.
Kea is implemented in Perl and Java (with the exception of the stemmer,
which is written in C).

You must have Perl (Version 5 or greater) and Java (Version 1.1.6 or
greater) installed to run Kea. The main Kea program, called Kea,
has a variable called "$java_command" that contains the command
Kea will use to run Java. You'll have to make sure this is set
correctly for your system (I can't be bothered doing it for you).

To be honest, you'll probably need some ability with Perl and Java to
make Kea work.

Kea uses a GPL version of the Lovins stemmer that was written in C.
This distribution includes a compiled version for Linux. If you're
using Solaris or some other Unix, you will have to recompile it for
that platform. The source code is in the Iterated-Lovins-stemmer
directory. The README file in that directory will tell you how to
compile the stemmer. The program "stemmer" must be in the main directory.

(If you know of a GPL Java or Perl version of the Iterated Lovins
stemmer, do let me know.)


************************
3. Extracting keyphrases
************************

The Kea program is used to extract keyphrases from files.
It is a Perl script, and is used like this:
  Kea [options] <text-or-html-or-cstr-files>

For example, if you have a text file called myfile.text, you could
extract keyphrases from it with this command:
  Kea myfile.text

Kea's output will be stored in a new file called myfile.kea
that looks something like this:
  protein       protein    0.8135395543417774
  amino acid    amin ac    0.543230038502526
  Nutrition     nutrit     0.15095707184225382
  assay         as         0.15095707184225382

The first column contains keyphrases Kea has extracted from the file.
The second column contains stemmed versions of the keyphrases.
The third column is an estimate of the probability that the phrase
would be chosen by the author as a keyphrase for this document. (See
Witten et al. for an explanation.)
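
If you want to post-process a .kea file, the three columns described
above can be read with a few lines of code. The following Python sketch
is not part of Kea. It assumes the columns are separated by tab
characters (phrases can contain spaces, so whitespace splitting would
not work); check a real .kea file before relying on that. The names
Keyphrase and parse_kea_output are made up for this example.

```python
from typing import NamedTuple


class Keyphrase(NamedTuple):
    phrase: str         # keyphrase as extracted from the document
    stem: str           # stemmed version of the keyphrase
    probability: float  # estimated probability of being an author keyphrase


def parse_kea_output(text: str) -> list[Keyphrase]:
    """Parse the contents of a .kea file into Keyphrase records.

    Assumption: columns are tab-separated. Blank lines are skipped.
    """
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue
        phrase, stem, probability = line.split("\t")
        results.append(Keyphrase(phrase, stem, float(probability)))
    return results


# Two lines from the sample output above, written with explicit tabs.
sample = (
    "protein\tprotein\t0.8135395543417774\n"
    "amino acid\tamin ac\t0.543230038502526\n"
)
for kp in parse_kea_output(sample):
    print(f"{kp.probability:.3f}  {kp.phrase}")
```

A line that does not have exactly three tab-separated fields raises a
ValueError, which is probably what you want while checking the format.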

Kea has several options. The most important is -N, which is
used to output a specific number of keyphrases. For example, suppose
you have a directory called public_html that contains a bunch of html
files, and you want to extract 15 phrases from each. Use the command:
  Kea -N 15 public_html/*.html

Kea works with three types of input file based on extensions.
  Text files have the extension .txt or .text
  HTML files have the extension .html or .htm
  CSTR files have the extension .cstr
CSTR files are those from the CSTR collection of the NZDL, and you
will probably never see them. If you want Kea to work with HTML or
CSTR files, you will need to have the lynx web browser installed
(we use version 2.5).
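
The extension rules above can be written down as a small lookup table,
which is handy if you want to filter a directory before handing files
to Kea. This Python sketch is only an illustration of the rules as
stated here, not Kea's own code. The function names are hypothetical,
and it assumes extensions are matched exactly as written (lowercase).

```python
import os

# Extension-to-type table taken from the list above.
INPUT_TYPES = {
    ".txt": "text",
    ".text": "text",
    ".html": "html",
    ".htm": "html",
    ".cstr": "cstr",
}


def input_type(filename):
    """Return "text", "html" or "cstr", or None for an unrecognised extension."""
    extension = os.path.splitext(filename)[1]
    return INPUT_TYPES.get(extension)


def needs_lynx(filename):
    """HTML and CSTR input requires the lynx browser, per the text above."""
    return input_type(filename) in ("html", "cstr")
```

For example, input_type("myfile.text") gives "text", and
needs_lynx("public_html/index.html") is true.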


***************
4. Using models
***************

Kea extracts phrases from text files based on a "model" of
the way authors choose keyphrases. The model is based on a set of
"training documents" that have author-assigned keyphrases.

The default model for Kea is the "aliweb" model, which is based on
90 web pages from the aliweb web site. If you use a different model
to extract phrases from a document, it might choose different phrases.
See Witten et al. for details.

You can download other models from the Kea download page, or you can
make your own. For example, you can download the CSTR model. This
model performs very well on Computer Science Technical Reports, but
less well on other collections. It consists of four files:
  cstr.stopwords  A list of stopwords used in text processing.
  cstr.df         The document-frequencies of some phrases in the CSTR.
  cstr.model      The Naive-Bayes model used in classification.
  cstr.kf         The keyphrase-frequencies of some phrases in the CSTR.
(Note: the CSTR model consists of all these files, not just cstr.model)

If you want to use the CSTR model to extract 10 keyphrases from a file
called myCSdocument.text, use the command:
  Kea -N 10 -C cstr myCSdocument.text


****************
5. Making models
****************

This section explains how to create a model that you can later use
to extract keyphrases. You might want to do this for a specialised
collection, like we did with the CSTR.

To build a model, you will need some training data. Read Witten et al.
(1999) to get an idea of the amount of training data you will need.
(We recommend about 50 documents, but fewer will work if you don't
have that many.)

Your training data should be placed in a single directory.
The training data consists of a set of text files (called *.txt)
and author keyword files (called *.key). For every .txt there
should be a .key file. For example, if one of your text files is
Witten99.txt, there should be a corresponding keyword file called
Witten99.key. The .txt file should contain the document in plain
text form. The .key file should be a text file containing each
of the author-assigned keywords for that file, one per line.
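
Before building a model it is worth checking that every .txt file
really does have a matching .key file. The Python sketch below is not
part of Kea; the function names are invented for this example, and it
only looks at file names in one directory, as described above.

```python
from pathlib import Path


def unpaired_training_files(directory):
    """Return (txt_without_key, key_without_txt) as sorted lists of base names."""
    folder = Path(directory)
    txt_names = {p.stem for p in folder.glob("*.txt")}
    key_names = {p.stem for p in folder.glob("*.key")}
    return sorted(txt_names - key_names), sorted(key_names - txt_names)


def read_keyphrases(key_file):
    """Read a .key file: one author-assigned keyphrase per line."""
    lines = Path(key_file).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Calling unpaired_training_files on your training directory should
return two empty lists when the data is complete. Any names in the
first list are documents with no author keyphrases.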

We have put a couple of training datasets that we have used
on the Kea downloads web page, if you want an example.

Let's assume your training data is in a directory called Green.
We're going to use your training data to build a model called green;
this model will consist of four files:
  green.stopwords, green.df, green.model, green.kf.

First, create a "stopwords" file for your collection. The
stopwords are a list of words that never occur at the start
or end of a keyphrase. Read Witten et al. for more detail.
They are placed in a text file, one per line, in lowercase.
Kea comes with a stopwords file called aliweb.stopwords.
We will use it in our model:
  cp aliweb.stopwords green.stopwords
You can add new stopwords for specialised collections if you
need to (see cstr.stopwords for an example).

We will now create a model file (green.model) and a document
frequency file (green.df).

You will need to convert all the text files to "clauses" files
with the command:
  prepare-clauses-all-txt-files.pl Green
This will create a clauses file for every text file: for example,
if you have a Witten99.txt file, Witten99.clauses will be created.

Next, you need to create an "arff" file (green.arff) and, as a
side effect, the document frequency file (green.df).
The arff file isn't part of the model; it is the input file
needed by the machine learning scheme to create the Naive-Bayes
model. Use the command:
  k4.pl -f green.df -S green.stopwords Green green.arff
This command (called k4.pl for historical reasons) uses the
training files in the directory Green (specifically, *.clauses
and *.key) to create green.arff.
It uses green.stopwords for its stopword file, and green.df as its
document-frequency file. Since green.df doesn't exist when you
start, it will create green.df for you as it works. (If you ever
repeat this command, you should delete green.df first.)

Now you need to create a Naive-Bayes model (green.model) from
the arff file you just built (green.arff).
You'll need a bit of Java knowledge here. Make sure "./jaws.jar"
is on your Java classpath, and type:
  java KEP -t green.arff -m green.model
This will use green.arff as training data to create the
Naive-Bayes model, which is saved in green.model.

The final part of the model is optional - the keyphrase
frequency file, called green.kf. It lists all the author
keyphrases in the training data, with the number of
times each occurs as a keyphrase. It is optional,
but it does improve performance on specialised collections,
so if you're extracting keyphrases for a specialised
collection for a "real" purpose, then you should use one if
you can. See Frank et al. for more details.
Each line of the file should have a stemmed phrase, followed
by a tab, followed by the number of times the phrase is a
keyphrase - see cstr.kf or aliweb.kf for an example.
You can make a file like this with a command like
  cat Green/*.key | stemmer | count-lines.pl > green.kf
To do this you will need the stemmer and count-lines.pl
script provided with Kea.
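
To make the .kf format concrete, here is a rough Python sketch of the
counting step only. It is not the count-lines.pl script, whose exact
behaviour and output order are not documented here; the sketch simply
assumes that each distinct stemmed phrase is counted once per
occurrence. It does not do any stemming, so its input must already
have been through the stemmer. The function names are hypothetical,
and sorting the output alphabetically is an arbitrary choice.

```python
from collections import Counter


def keyphrase_frequencies(stemmed_phrases):
    """Count how often each stemmed phrase occurs as a keyphrase."""
    counts = Counter()
    for phrase in stemmed_phrases:
        phrase = phrase.strip()
        if phrase:  # ignore blank lines
            counts[phrase] += 1
    return counts


def format_kf(counts):
    """Render counts as .kf lines: stemmed phrase, a tab, then the count."""
    return "".join(
        f"{phrase}\t{count}\n" for phrase, count in sorted(counts.items())
    )
```

Feeding it the stemmed lines of all your .key files and writing the
result of format_kf to green.kf gives a file in the layout described
above; compare it with cstr.kf or aliweb.kf before using it.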

The model is now complete.

To use it, put green.df, green.model, green.stopwords, and
(if you have one) green.kf in the Kea directory. You can extract
keyphrases like this:
  Kea -N 10 -C green myfile.txt
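
If you build models often, the required steps of this section can be
collected in one place. The Python sketch below only assembles the
commands given above as argument lists; it does not run them, and the
function name and its parameters are invented for this example. It
leaves out the optional keyphrase-frequency file and does not delete
an existing .df file, which you should do by hand before repeating
the k4.pl step.

```python
def model_build_commands(training_dir, name, base_stopwords="aliweb.stopwords"):
    """Return the model-building commands from this section as argument lists."""
    stopwords = f"{name}.stopwords"
    df = f"{name}.df"
    arff = f"{name}.arff"
    model = f"{name}.model"
    return [
        # Start from the stopwords file that comes with Kea.
        ["cp", base_stopwords, stopwords],
        # Convert every .txt file in the directory to a .clauses file.
        ["prepare-clauses-all-txt-files.pl", training_dir],
        # Build the arff file and, as a side effect, the .df file.
        ["k4.pl", "-f", df, "-S", stopwords, training_dir, arff],
        # Train the Naive-Bayes model (jaws.jar must be on the classpath).
        ["java", "KEP", "-t", arff, "-m", model],
    ]


for command in model_build_commands("Green", "green"):
    print(" ".join(command))
```

Each list could be passed to subprocess.run from the Kea directory,
but printing the commands first and checking them against this README
is the safer way to start.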


****************
6. The Kea files
****************

Here's a description of what the various Kea program files do.

README: This file.

Kea: Extracts keyphrases from text based on a model.

*.model: Naive-Bayes model object stored as a file
*.kf: Keyphrase-frequency file
*.df: Document-frequency file (aka a global-frequency file)
*.stopwords: Stopwords file

stemmer: Program for stemming words with the Iterated Lovins stemmer
Iterated-Lovins-stemmer:
  Directory containing code for stemmer. Some of the files are
  copyright 1994 Linh Huynh, GNU General Public License. The others
  are simply wrappers I have written myself.

KEP.java: Java code for creating & using a Naive-Bayes model
KEP.class: Compiled version of KEP.java
jaws.jar: Java archive of the WEKA Java machine learning code.
  Copyright Eibe Frank & Len Trigg, GNU General Public License.

kea-tidy-key-file.pl:
  Convert a .key or .kea file into a "clean" format.
kea-choose-best-phrase.pl:
  Find the "best" unstemmed version of a keyphrase
  that appears in a file in many forms.
prepare-clauses.pl:
  Perl script that converts a text file to a clauses file.
prepare-clauses-all-txt-files.pl:
  Applies prepare-clauses.pl to an entire directory.
cstr-to-text.pl:
  Converts cstr files to text; requires lynx.
count-lines.pl:
  Counts the lines in a file.


***********************
7. Advanced Kea options
***********************

Here is a complete list of the options to Kea. The last
four (-F, -K, -M, and -S) have been superseded by the -C option,
but still work; it's possible they are good for something.

  -d      Debug mode. Working files are left in /tmp
  -t      Output TF.IDF for each phrase. Used by Kniles.
  -N n    Output n keyphrases (if possible).
  -E ext  Output files have extension ".ext" (default is ".kea")
  -C x    Use model based on corpus x.
          Defaults to "aliweb" web page corpus.

  -F df   Use document-frequency file "df".
          Defaults to x.df where x is set by the -C argument.
  -K kf   Use keyphrase-frequency file "kf".
          Defaults to x.kf where x is set by the -C argument.
  -M mf   Use model file "mf".
          Defaults to x.model where x is set by the -C argument.
  -S sf   Use stopword file "sf".
          Defaults to x.stopwords where x is set by the -C argument.
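
The way -C interacts with -F, -K, -M and -S can be summarised in a few
lines. This Python sketch mirrors the defaults listed above; it is not
taken from the Kea script, and the function name is hypothetical. It
assumes that -F follows the same x.df pattern as the other three
options.

```python
def resolve_model_files(corpus="aliweb", df=None, kf=None, model=None, stopwords=None):
    """Work out which model files the documented defaults select.

    corpus corresponds to -C; df, kf, model and stopwords correspond
    to -F, -K, -M and -S and override the corpus-based default.
    """
    return {
        "df": df or f"{corpus}.df",
        "kf": kf or f"{corpus}.kf",
        "model": model or f"{corpus}.model",
        "stopwords": stopwords or f"{corpus}.stopwords",
    }
```

With no arguments every entry starts with "aliweb". With corpus set to
"cstr" and kf set to "my.kf", only the keyphrase-frequency entry
departs from the cstr.* names.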