<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Title">Clustering with finite data from semi-parametric mixture distributions</Metadata>
<Metadata name="gsdlsourcefilename">import/cluster.ps</Metadata>
<Metadata name="gsdlconvertedfilename">tmp/1378708752/cluster.text</Metadata>
<Metadata name="OrigSource">cluster.text</Metadata>
<Metadata name="Source">cluster.ps</Metadata>
<Metadata name="SourceFile">cluster.ps</Metadata>
<Metadata name="Plugin">PostScriptPlugin</Metadata>
<Metadata name="FileSize">94721</Metadata>
<Metadata name="FilenameRoot">cluster</Metadata>
<Metadata name="FileFormat">PS</Metadata>
<Metadata name="srcicon">_iconps_</Metadata>
<Metadata name="srclink_file">doc.ps</Metadata>
<Metadata name="srclinkFile">doc.ps</Metadata>
<Metadata name="dc.Creator">Yong Wang</Metadata>
<Metadata name="dc.Creator">Ian H. Witten</Metadata>
<Metadata name="dc.Title">Clustering with finite data from semi-parametric mixture distributions</Metadata>
<Metadata name="Identifier">HASH015936f516ed4b1d7b050af9</Metadata>
<Metadata name="lastmodified">1378708193</Metadata>
<Metadata name="lastmodifieddate">20130909</Metadata>
<Metadata name="oailastmodified">1378708753</Metadata>
<Metadata name="oailastmodifieddate">20130909</Metadata>
<Metadata name="assocfilepath">HASH0159.dir</Metadata>
<Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
</Description>
<Content><pre>

Clustering with finite data from semi-parametric mixture distributions

Yong Wang                                 Ian H. Witten
Computer Science Department               Computer Science Department
University of Waikato, New Zealand        University of Waikato, New Zealand
Email: [email protected]                   Email: [email protected]

Abstract

Existing clustering methods for the semi-parametric mixture distribution
perform well as the volume of data increases. However, they all suffer from
a serious drawback in finite-data situations: small outlying groups of data
points can be completely ignored in the clusters that are produced, no matter
how far away they lie from the major clusters. This can result in unbounded
loss if the loss function is sensitive to the distance between clusters.

This paper proposes a new distance-based clustering method that overcomes
the problem by avoiding global constraints. Experimental results illustrate
its superiority to existing methods when small clusters are present in finite
data sets; they also suggest that it is more accurate and stable than other
methods even when there are no small clusters.

1 Introduction

A common practical problem is to fit an underlying statistical distribution
to a sample. In some applications, this involves estimating the parameters
of a single distribution function--e.g. the mean and variance of a normal
distribution. In others, an appropriate mixture of elementary distributions
must be found--e.g. a set of normal distributions, each with its own mean
and variance. Among many kinds of mixture distribution, one in particular is
attracting increasing research attention because it has many practical
applications: the semi-parametric mixture distribution.

A semi-parametric mixture distribution is one whose cumulative distribution
function (CDF) has the form

    F_G(x) = \int_\Theta F(x; \theta) \, dG(\theta),                    (1)

where $\theta \in \Theta$, the parameter space, and $x \in X$, the sample
space. This gives the CDF of the mixture distribution $F_G(x)$ in terms of
two more elementary distributions: the component distribution $F(x; \theta)$,
which is given, and the mixing distribution $G(\theta)$, which is unknown.
The former has a single unknown parameter $\theta$, while the latter gives a
CDF for $\theta$. For example, $F(x; \theta)$ might be the normal
distribution with mean $\theta$ and unit variance, where $\theta$ is a
random variable distributed according to $G(\theta)$.

The problem that we will address is the estimation of $G(\theta)$ from
sampled data that are independent and identically distributed according to
the unknown distribution $F_G(x)$. Once $G(\theta)$ has been obtained, it is
a straightforward matter to obtain the mixture distribution.

The CDF $G(\theta)$ can be either continuous or discrete. In the latter
case, $G(\theta)$ is composed of a number of mass points, say,
$\theta_1, \ldots, \theta_k$ with masses $w_1, \ldots, w_k$ respectively,
satisfying $\sum_{i=1}^{k} w_i = 1$. Then (1) can be re-written as

    F_G(x) = \sum_{i=1}^{k} w_i F(x; \theta_i),                         (2)

each mass point providing a component, or cluster, in the mixture with the
corresponding weight. If the number of components $k$ is finite and known a
priori, the mixture distribution is called finite; otherwise it is treated
as countably infinite. The qualifier "countably" is necessary to distinguish
this case from the situation with continuous $G(\theta)$, which is also
infinite.

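To make the discrete form (2) concrete, here is a minimal sketch (ours, not
part of the paper) that evaluates $F_G(x)$ for a discrete mixing
distribution with unit-variance normal components; the support points and
weights are purely illustrative.

# Sketch (not from the paper): the mixture CDF of equation (2) for a discrete
# mixing distribution G with mass points theta_i and weights w_i, using
# normal components F(x; theta) with mean theta and unit variance.
import numpy as np
from scipy.stats import norm

def mixture_cdf(x, support, weights):
    """F_G(x) = sum_i w_i * F(x; theta_i) with N(theta_i, 1) components."""
    x = np.asarray(x, dtype=float)
    return sum(w * norm.cdf(x, loc=t) for t, w in zip(support, weights))

# Illustrative two-component mixture: 90% of the mass near 0, 10% near 100.
print(mixture_cdf([0.0, 50.0, 100.0], support=[0.0, 100.0], weights=[0.9, 0.1]))
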
We will focus on the estimation of arbitrary mixing distributions, i.e.,
$G(\theta)$ is any general probability distribution--finite, countably
infinite or continuous. A few methods for tackling this problem can be found
in the literature. However, as we shall see, they all suffer from a serious
drawback in finite-data situations: small outlying groups of data points can
be completely ignored in the clusters that are produced.

This phenomenon seems to have been overlooked, presumably for three reasons:
small amounts of data may be assumed to represent a small loss; a few data
points can easily be dismissed as outliers; and in the limit the problem
evaporates because most estimators possess the property of strong
consistency--which means that, almost surely, they converge weakly to any
given $G(\theta)$ as the sample size approaches infinity. However, often
these reasons are inappropriate: the loss function may be sensitive to the
distance between clusters; the small number of outlying data points may
actually represent small clusters; and any practical clustering situation
will necessarily involve finite data.

This paper proposes a new method, based on the idea of local fitting, that
successfully solves the problem. The experimental results presented below
illustrate its superiority to existing methods when small clusters are
present in finite data sets. Moreover, they also suggest that it is more
accurate and stable than other methods even when there are no small
clusters. Existing clustering methods for semi-parametric mixture
distributions are briefly reviewed in the next section. Section 3 identifies
a common problem from which these current methods suffer. Then we present
the new solution in Section 4, and in Section 5 we describe experiments that
illustrate the problem that has been identified and show how the new method
overcomes it.

2 Clustering methods

The general problem of inferring mixture models is treated extensively and
in considerable depth in books by Titterington et al. (1985), McLachlan and
Basford (1988) and Lindsay (1995). For semi-parametric mixture distributions
there are three basic approaches: minimum distance, maximum likelihood, and
Bayesian. We briefly introduce the first approach, which is the one adopted
in the paper, review the other two to show why they are not suitable for
arbitrary mixtures, and then return to the chosen approach and review the
minimum distance estimators for arbitrary semi-parametric mixture
distributions that have been described in the literature.

The idea of the minimum distance method is to define some measure of the
goodness of the clustering and optimize this by suitable choice of a mixing
distribution $G_n(\theta)$ for a sample of size $n$. We generally want the
estimator to be strongly consistent as $n \to \infty$, in the sense defined
above, for arbitrary mixing distributions. We also generally want to take
advantage of the special structure of semi-parametric mixtures to come up
with an efficient algorithmic solution.

The maximum likelihood approach maximizes the likelihood (or equivalently
the log-likelihood) of the data by suitable choice of $G_n(\theta)$. It can
in fact be viewed as a minimum distance method that uses the
Kullback-Leibler distance (Titterington et al., 1985). This approach has
been widely used for estimating finite mixtures, particularly when the
number of clusters is fairly small, and it is generally accepted that it is
more accurate than other methods. However, it has not been used to estimate
arbitrary semi-parametric mixtures, presumably because of its high
computational cost. Its speed drops dramatically as the number of parameters
that must be determined increases, which makes it computationally infeasible
for arbitrary mixtures, since each data point might represent a component of
the final distribution with its own parameters.

Bayesian methods assume prior knowledge, often given by some kind of
heuristic, to determine a suitable a priori probability density function.
They are often used to determine the number of components in the final
distribution--particularly when outliers are present. Like the maximum
likelihood approach they are computationally expensive, for they use the
same computational techniques.

We now review existing minimum distance estimators for arbitrary
semi-parametric mixture distributions. We begin with some notation. Let
$x_1, \ldots, x_n$ be a sample chosen according to the mixture distribution,
and suppose (without loss of generality) that the sequence is ordered so
that $x_1 \le x_2 \le \cdots \le x_n$. Let $G_n(\theta)$ be a discrete
estimator of the underlying mixing distribution with a set of support points
at $\{\theta_{nj}; j = 1, \ldots, k_n\}$. Each $\theta_{nj}$ provides a
component of the final clustering with weight $w_{nj} \ge 0$, where
$\sum_{j=1}^{k_n} w_{nj} = 1$. Given the support points, obtaining
$G_n(\theta)$ is equivalent to computing the weight vector
$w_n = (w_{n1}, w_{n2}, \ldots, w_{nk_n})'$. Denote by $F_{G_n}(x)$ the
estimated mixture CDF with respect to $G_n(\theta)$.

Two minimum distance estimators were proposed in the late 1960s. Choi and
Bulgren (1968) used

    \frac{1}{n} \sum_{i=1}^{n} [F_{G_n}(x_i) - i/n]^2                   (3)

as the distance measure. Minimizing this quantity with respect to $G_n$
yields a strongly consistent estimator. A slight improvement is obtained by
using the Cramér-von Mises statistic

    \frac{1}{n} \sum_{i=1}^{n} [F_{G_n}(x_i) - (i - 1/2)/n]^2 + 1/(12n^2),  (4)

which essentially replaces $i/n$ in (3) with $(i - 1/2)/n$ without affecting
the asymptotic result. As might be expected, this reduces the bias for
small-sample cases, as was demonstrated empirically by Macdonald (1971) in a
note on Choi and Bulgren's paper.

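As an illustration of how such criteria are evaluated, the following sketch
(ours, not from the paper) computes the distances (3) and (4) for a
candidate discrete estimator $G_n$ with unit-variance normal components; all
function names are our own.

# Sketch (ours): the Choi-Bulgren distance (3) and the Cramer-von Mises
# statistic (4) for a candidate discrete G_n with N(theta, 1) components.
import numpy as np
from scipy.stats import norm

def estimated_cdf(x, support, weights):
    return sum(w * norm.cdf(np.asarray(x, float), loc=t)
               for t, w in zip(support, weights))

def choi_bulgren(x, support, weights):
    x = np.sort(np.asarray(x, float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = estimated_cdf(x, support, weights)
    return np.mean((F - i / n) ** 2)                            # equation (3)

def cramer_von_mises(x, support, weights):
    x = np.sort(np.asarray(x, float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = estimated_cdf(x, support, weights)
    return np.mean((F - (i - 0.5) / n) ** 2) + 1 / (12 * n**2)  # equation (4)
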
At about the same time, Deely and Kruse (1968) used the sup-norm associated
with the Kolmogorov-Smirnov test. The minimization is over

    \sup_{1 \le i \le n} \{ |F_{G_n}(x_i) - (i-1)/n|, |F_{G_n}(x_i) - i/n| \},  (5)

and this leads to a linear programming problem. Deely and Kruse also
established the strong consistency of their estimator $G_n$. Ten years
later, this approach was extended by Blum and Susarla (1977) by using any
sequence $\{f_n\}$ of functions which satisfies $\sup |f_n - f_G| \to 0$
a.s. as $n \to \infty$. Each $f_n$ can, for example, be obtained by a
kernel-based density estimator. Blum and Susarla approximated the function
$f_n$ by the overall mixture pdf $f_{G_n}$, and established the strong
consistency of the estimator $G_n$ under weak conditions.

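To illustrate the linear programming formulation of the sup-norm criterion
(5), here is a sketch of ours (not from the paper): the weight vector and
the sup-norm value are the LP variables, and scipy.optimize.linprog stands
in for whatever solver the original authors used.

# Sketch (ours): the Deely-Kruse sup-norm fit (5) as a linear program over
# z = (w, t), where t bounds the sup-norm, with N(theta, 1) components.
import numpy as np
from scipy.optimize import linprog
from scipy.stats import norm

def deely_kruse_fit(x, support):
    x = np.sort(np.asarray(x, float))
    n, k = len(x), len(support)
    A = np.array([[norm.cdf(xi, loc=t) for t in support] for xi in x])
    lo = (np.arange(1, n + 1) - 1.0) / n          # (i - 1)/n
    hi = np.arange(1, n + 1) / n                  # i/n
    ones = np.ones((n, 1))
    # |A w - lo| <= t and |A w - hi| <= t, rewritten as A_ub z <= b_ub.
    A_ub = np.vstack([np.hstack([A, -ones]), np.hstack([-A, -ones]),
                      np.hstack([A, -ones]), np.hstack([-A, -ones])])
    b_ub = np.concatenate([lo, -lo, hi, -hi])
    c = np.zeros(k + 1); c[-1] = 1.0              # minimize t
    A_eq = np.append(np.ones(k), 0.0).reshape(1, -1)   # sum(w) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (k + 1))
    return res.x[:k], res.x[-1]                   # weights and achieved sup-norm
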
For reasons of simplicity and generality, we will denote the approximation
between two mathematical entities of the same type by $\approx$, which
implies the minimization with respect to an estimator of a distance measure
between the entities on either side. The types of entity involved in this
paper include vector, function and measure, and we use the same symbol
$\approx$ for each.

In the work reviewed above, two kinds of estimator are used: CDF-based (Choi
and Bulgren, Macdonald, and Deely and Kruse) and pdf-based (Blum and
Susarla). CDF-based estimators involve approximating an empirical
distribution with an estimated one $F_{G_n}$. We write this as

    F_{G_n} \approx F_n,                                                (6)

where $F_n$ is the Kolmogorov empirical CDF--or indeed any empirical CDF
that converges to it. Pdf-based estimators involve the approximation between
probability density functions:

    f_{G_n} \approx f_n,                                                (7)

where $f_{G_n}$ is the estimated mixture pdf and $f_n$ is the empirical pdf
described above.

The entities involved in (6) and (7) are functions. When the approximation
is computed, however, it is computed between vectors that represent the
functions. These vectors contain the function values at a particular set of
points, which we call "fitting points." In the work reviewed above, the
fitting points are chosen to be the data points themselves.

3 The problem of minority clusters

Although they perform well asymptotically, all the minimum distance methods
described above suffer from the finite-sample problem discussed earlier:
they can neglect small groups of outlying data points no matter how far they
lie from the dominant data points. The underlying reason is that the
objective function to be minimized is defined globally rather than locally.
A global approach means that the value of the estimated probability density
function at a particular place will be influenced by all data points, no
matter how far away they are. This can cause small groups of data points to
be ignored even if they are a long way from the dominant part of the data
sample. From a probabilistic point of view, however, there is no reason to
subsume distant groups within the major clusters just because they are
relatively small.

The ultimate effect of suppressing distant minority clusters depends on how
the clustering is applied. If the application's loss function depends on the
distance between clusters, the result may prove disastrous because there is
no limit to how far away these outlying groups may be. One might argue that
small groups of points can easily be explained away as outliers, because the
effect will become less important as the number of data points increases--and
it will disappear in the limit of infinite data. However, in a finite-data
situation--and all practical applications necessarily involve finite
data--the "outliers" may equally well represent small minority clusters.
Furthermore, outlying data points are not really treated as outliers by
these methods--whether or not they are discarded is merely an artifact of
the global fitting calculation. When clustering, the final mixture
distribution should take all data points into account--including outlying
clusters if any exist. If practical applications demand that small outlying
clusters are suppressed, this should be done in a separate stage.

In distance-based clustering, each data point has a far-reaching effect
because of two global constraints. One is the use of the cumulative
distribution function; the other is the normalization constraint
$\sum_{j=1}^{k_n} w_{nj} = 1$. These constraints may sacrifice a small
number of data points--at any distance--for a better overall fit to the data
as a whole. Choi and Bulgren (1968), the Cramér-von Mises statistic
(Macdonald, 1971), and Deely and Kruse (1968) all enforce both the CDF and
the normalization constraints. Blum and Susarla (1977) drop the CDF, but
still enforce the normalization constraint. The result is that these
clustering methods are only appropriate for finite mixtures without small
clusters, where the risk of suppressing clusters is low.

This paper addresses the general problem of arbitrary mixtures. Of course,
the minority cluster problem exists for all types of mixture--including
finite mixtures. Even here, the maximum likelihood and Bayesian approaches
do not solve the problem, because they both introduce a global normalization
constraint.

4 Solving the minority cluster problem

Now that the source of the problem has been identified, the solution is
clear, at least in principle: drop both the approximation of CDFs, as Blum
and Susarla (1977) do, and the normalization constraint--no matter how
seductive it may seem.

Let $G'_n$ be a discrete function with masses $\{w_{nj}\}$ at
$\{\theta_{nj}\}$; note that we do not require the $w_{nj}$ to sum to one.
Since the new method operates in terms of measures rather than distribution
functions, the notion of approximation is altered to use intervals rather
than points. Using the formulation described in Section 2, we have

    P_{G'_n} \approx P_n,                                               (8)

where $P_{G'_n}$ is the estimated measure and $P_n$ is the empirical
measure. The intervals over which the approximation takes place are called
"fitting intervals." Since (8) is not subject to the normalization
constraint, $G'_n$ is not a CDF and $P_{G'_n}$ is not a probability measure.
However, $G'_n$ can easily be converted into a CDF estimator by normalizing
it after equation (8) has been solved.

To define the estimation procedure fully, we need to determine (a) the set
of support points, (b) the set of fitting intervals, (c) the empirical
measure, and (d) the distance measure. Here we discuss these in an intuitive
manner; Wang and Witten (1999) show how to determine them in a way that
guarantees a strongly consistent estimator.

Support points. The support points are usually suggested by the data points
in the sample. For example, if the component distribution $F(x; \theta)$ is
the normal distribution with mean $\theta$ and unit variance, each data
point can be taken as a support point. In fact, the support points are more
accurately described as potential support points, because their associated
weights may become zero after solving (8)--and, in practice, many often do.

Fitting intervals. The fitting intervals are also suggested by the data
points. In the normal distribution example, each data point $x_i$ can
provide one interval, such as $[x_i - 3\sigma, x_i]$, or two, such as
$[x_i - 3\sigma, x_i]$ and $[x_i, x_i + 3\sigma]$, or more. There is no
problem if the fitting intervals overlap. Their length should not be so
large that points can exert an influence on the clustering at an unduly
remote place, nor so small that the empirical measure is inaccurate. The
experiments reported below use intervals of a few standard deviations around
each data point, and, as we will see, this works well.

---|
392 | Empirical measure. The empirical measure can be the probability measure determined
|
---|
393 | by the Kolmogorov empirical CDF, or any measure that converges to it. The
|
---|
394 | fitting intervals discussed above can be open, closed, or semi-open. This
|
---|
395 | will affect the empirical measure if data points are used as interval boundaries,
|
---|
396 | although it does not change the values of the estimated measure because the
|
---|
397 | corresponding distribution is continuous. In smallsample situations, bias
|
---|
398 | can be reduced by careful attention to this detail--as Macdonald (1971) discusses
|
---|
399 | with respect to Choi and Bulgren's (1968) method.
|
---|
400 |
|
---|
Distance measure. The choice of distance measure determines what kind of
mathematical programming problem must be solved. For example, a quadratic
distance will give rise to a least squares problem under linear constraints,
whereas the sup-norm gives rise to a linear programming problem that can be
solved using the simplex method. These two measures have efficient solutions
that are globally optimal.

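For instance, with a quadratic distance the unnormalized fit (8) reduces to
a non-negative least squares problem. The sketch below is ours: the support
points, fitting intervals and empirical measure follow the intuitive choices
above (it omits the half-point boundary rule used in Section 5), and it uses
scipy.optimize.nnls, a port of the Lawson and Hanson routine mentioned
later, as the solver.

# Sketch (ours) of the new estimator (8) with a quadratic distance and
# N(theta, 1) components: the weights solve a non-negative least squares
# problem with no sum-to-one constraint.
import numpy as np
from scipy.optimize import nnls
from scipy.stats import norm

def fit_unnormalized_weights(x, half_width=3.0):
    x = np.sort(np.asarray(x, float))
    support = x                        # every data point is a potential support point
    intervals = [(xi - half_width, xi) for xi in x] + \
                [(xi, xi + half_width) for xi in x]
    # A[r, j] = probability assigned by component N(theta_j, 1) to interval r.
    A = np.array([[norm.cdf(b, loc=t) - norm.cdf(a, loc=t) for t in support]
                  for (a, b) in intervals])
    # Empirical measure: fraction of the data points falling in each interval.
    P = np.array([np.mean((x >= a) & (x <= b)) for (a, b) in intervals])
    w, _ = nnls(A, P)                  # w >= 0; normalize afterwards if a CDF is wanted
    return support, w
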
It is worth pointing out that abandoning the global constraints associated
with both CDFs and normalization brings with it a computational advantage.
In vector form, we write $P_{G'_n} = A_{G'_n} w_n$, where $w_n$ is the
(unnormalized) weight vector and each element of the matrix $A_{G'_n}$ is
the probability value of a component distribution over a fitting interval.
Then, provided the support points corresponding to $w'_n$ and $w''_n$ lie
outside each other's sphere of influence as determined by the component
distributions $F(x; \theta)$, the estimation procedure becomes

    \begin{pmatrix} A'_{G'_n} & 0 \\ 0 & A''_{G'_n} \end{pmatrix}
    \begin{pmatrix} w'_n \\ w''_n \end{pmatrix}
    \approx
    \begin{pmatrix} P'_n \\ P''_n \end{pmatrix},                        (9)

subject to $w'_n \ge 0$ and $w''_n \ge 0$. This is the same as combining the
solutions of two sub-equations, $A'_{G'_n} w'_n \approx P'_n$ subject to
$w'_n \ge 0$, and $A''_{G'_n} w''_n \approx P''_n$ subject to
$w''_n \ge 0$. If the relevant support points continue to lie outside each
other's sphere of influence, the sub-equations can be further partitioned.
This implies that when data points are sufficiently far apart, the mixing
distribution $G$ can be estimated by grouping data points in different
regions. Moreover, the solution in each region can be normalized separately
before the regions are combined, which yields a better estimate of the
mixing distribution.

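The following sketch (ours) illustrates one way this regional decomposition
might be carried out: split the sorted data at large gaps, fit each region
with any per-region solver (for instance the non-negative least squares fit
sketched earlier), and renormalize each regional solution by its share of
the data before combining. The gap threshold and the renormalization rule
are our assumptions, not prescriptions from the paper.

# Sketch (ours) of the decomposition behind (9): solve well-separated regions
# independently and normalize each regional solution before combining.
import numpy as np

def split_regions(x, gap=10.0):
    """Split the sorted data wherever consecutive points are more than `gap` apart."""
    x = np.sort(np.asarray(x, float))
    cuts = np.where(np.diff(x) > gap)[0] + 1
    return np.split(x, cuts)

def fit_by_regions(x, fit_fn, gap=10.0):
    """fit_fn(region) -> (support, weights); weights are renormalized per region."""
    x = np.asarray(x, float)
    supports, weights = [], []
    for region in split_regions(x, gap):
        s, w = fit_fn(region)
        if w.sum() > 0:
            w = w / w.sum() * (len(region) / len(x))   # regional normalization
        supports.extend(s)
        weights.extend(w)
    return np.array(supports), np.array(weights)
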
If the normalization constraint $\sum_{j=1}^{k_n} w_{nj} = 1$ is retained
when estimating the mixing distribution, the estimation procedure becomes

    P_{G_n} \approx P_n,                                                (10)

where the estimator $G_n$ is a discrete CDF on $\Theta$. This constraint is
necessary for the left-hand side of (10) to be a probability measure.
Although he did not develop an operational estimation scheme, Barbe (1998)
suggested exploiting the fact that the empirical probability measure is
approximated by the estimated probability measure--but he retained the
normalization constraint. As noted above, relaxing the constraint loosens
the throttling effect that large clusters exert on small groups of outliers;
our experimental results show that the estimator obtained from (10), which
retains the constraint, still suffers from the drawback noted earlier.

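For completeness, one crude way to keep the sum-to-one constraint of (10)
inside a non-negative least squares solver is to append a heavily weighted
penalty row; this is our stand-in for an equality-constrained solver such as
Lawson and Hanson's lsei, and only an approximation of it.

# Sketch (ours): approximately enforce sum(w) = 1 within nnls by adding a
# penalty row; the penalty weight is an arbitrary large constant.
import numpy as np
from scipy.optimize import nnls

def fit_normalized_weights(A, P, penalty=1e6):
    k = A.shape[1]
    A_aug = np.vstack([A, penalty * np.ones((1, k))])
    P_aug = np.append(P, penalty * 1.0)
    w, _ = nnls(A_aug, P_aug)
    return w
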
Both estimators, $G_n$ obtained from (10) and $G'_n$ from (8), have been
shown to be strongly consistent under weak conditions similar to those used
by others (Wang & Witten, 1999). Of course, the weak convergence of $G'_n$
is in the sense of general functions, not CDFs. The strong consistency of
$G'_n$ immediately implies the strong consistency of the CDF estimator
obtained by normalizing $G'_n$.

5 Experimental validation

We have conducted experiments to illustrate the failure of existing methods
to detect small outlying clusters, and the improvement achieved by the new
scheme. The results also suggest that the new method is more accurate and
stable than the others.

When comparing clustering methods, it is not always easy to evaluate the
clusters obtained. To finesse this problem we consider simple artificial
situations in which the proper outcome is clear. Some practical applications
of clusters do provide objective evaluation functions; however, these are
beyond the scope of this paper.

The methods used are Choi and Bulgren (1968) (denoted choi), Macdonald's
application of the Cramér-von Mises statistic (cramér), the new method with
the normalization constraint (test), and the new method without that
constraint (new). In each case, equations involving non-negativity and/or
linear equality constraints are solved as quadratic programming problems
using the elegant and efficient procedures nnls and lsei provided by Lawson
and Hanson (1974). All four methods have the same computational time
complexity.

We set the sample size $n$ to 100 throughout the experiments. The data
points are artificially generated from a mixture of two clusters: $n_1$
points from $N(0, 1)$ and $n_2$ points from $N(100, 1)$. The values of $n_1$
and $n_2$ are in the ratios 99:1, 97:3, 93:7, 80:20 and 50:50.

Every data point is taken as a potential support point in all four methods:
thus the number of potential components in the clustering is 100. For test
and new, fitting intervals need to be determined. In the experiments, each
data point $x_i$ provides the two fitting intervals $[x_i - 3, x_i]$ and
$[x_i, x_i + 3]$. Any data point located on the boundary of an interval is
counted as half a point when determining the empirical measure over that
interval.

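A minimal sketch of this setup (ours; the random seed and helper names are
arbitrary) generates the two-cluster sample and computes the empirical
measure over the fitting intervals with the half-point boundary rule:

# Sketch (ours) of the experimental setup: n = 100 points from a mixture of
# N(0, 1) and N(100, 1), with two fitting intervals per data point. A point
# lying exactly on an interval boundary counts as half a point.
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed, for reproducibility only

def generate_sample(n1, n2):
    return np.sort(np.concatenate([rng.normal(0.0, 1.0, n1),
                                   rng.normal(100.0, 1.0, n2)]))

def empirical_measure(x, intervals):
    n = len(x)
    P = []
    for a, b in intervals:
        inside = np.sum((x > a) & (x < b))
        boundary = np.sum(x == a) + np.sum(x == b)
        P.append((inside + 0.5 * boundary) / n)
    return np.array(P)

x = generate_sample(97, 3)
intervals = [(xi - 3.0, xi) for xi in x] + [(xi, xi + 3.0) for xi in x]
P = empirical_measure(x, intervals)
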
These choices are admittedly crude, and further improvements in the accuracy
and speed of test and new are possible that take advantage of the
flexibility provided by (10) and (8). For example, accuracy will likely
increase with more--and more carefully chosen--support points and fitting
intervals. The fact that the method performs well even with crudely chosen
support points and fitting intervals testifies to its robustness.

Our primary interest in this experiment is the weights of the clusters that
are found. To cast the results in terms of the underlying models, we use the
cluster weights to estimate values for $n_1$ and $n_2$. Of course, the
results often do not contain exactly two clusters--but because the
underlying cluster centres, 0 and 100, are well separated compared to their
standard deviation of 1, it is highly unlikely that any data points from one
cluster will fall anywhere near the other. Thus we use a threshold of 50 to
divide the clusters into two groups: those near 0 and those near 100. The
final cluster weights are normalized, and the weights for the first group
are summed to obtain an estimate $\hat{n}_1$ of $n_1$, while those for the
second group are summed to give an estimate $\hat{n}_2$ of $n_2$.

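The bookkeeping just described might look like the following sketch (ours;
the scaling by the sample size n = 100 is our reading of the text above):

# Sketch (ours): estimate n1 and n2 from the fitted support points and
# weights by normalizing, splitting at the threshold 50, and scaling by n.
import numpy as np

def estimate_counts(support, weights, n=100, threshold=50.0):
    support = np.asarray(support, float)
    weights = np.asarray(weights, float)
    weights = weights / weights.sum()          # final normalization
    n1_hat = n * weights[support < threshold].sum()
    n2_hat = n * weights[support >= threshold].sum()
    return n1_hat, n2_hat
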
Table 1 shows results for each of the four methods. Each cell represents one
hundred separate experimental runs. Three figures are recorded. At the top
is the number of times the method failed to detect the smaller cluster, that
is, the number of times $\hat{n}_2 = 0$. In the middle are the average
values for $\hat{n}_1$ and $\hat{n}_2$. At the bottom is the standard
deviation of $\hat{n}_1$ and $\hat{n}_2$ (which are equal). These three
figures can be thought of as measures of reliability, accuracy and stability
respectively.

The top figures in Table 1 show clearly that only new is always reliable in
the sense that it never fails to detect the smaller cluster. The other
methods fail mostly when $n_2 = 1$; their failure rate gradually decreases
as $n_2$ grows. The center figures show that, under all conditions, new
gives a more accurate estimate of the correct values of $n_1$ and $n_2$ than
the other methods. As expected, cramér shows a noticeable improvement over
choi, but it is very minor. The test method has lower failure rates and
produces estimates that are more accurate and far more stable (indicated by
the bottom figures) than those for choi and cramér--presumably because it is
less constrained. Of the four methods, new is clearly and consistently the
winner in terms of all three measures: reliability, accuracy and stability.

                    n1 = 99    n1 = 97    n1 = 93    n1 = 80    n1 = 50
                    n2 =  1    n2 =  3    n2 =  7    n2 = 20    n2 = 50

  choi    Failures       86         42          4          0          0
          ^n1/^n2   99.9/0.1   99.2/0.8   95.8/4.2  82.0/18.0  50.6/49.4
          SD(^n1)       0.36       0.98       1.71       1.77       1.30

  cramér  Failures       80         31          1          0          0
          ^n1/^n2   99.8/0.2   98.6/1.4   95.1/4.9  81.6/18.4  49.7/50.3
          SD(^n1)       0.50       1.13       1.89       1.80       1.31

  test    Failures       52          5          0          0          0
          ^n1/^n2   99.8/0.2   98.2/1.8   94.1/5.9  80.8/19.2  50.1/49.9
          SD(^n1)       0.32       0.83       0.87       0.78       0.55

  new     Failures        0          0          0          0          0
          ^n1/^n2   99.0/1.0   96.9/3.1   92.8/7.2  79.9/20.1  50.1/49.9
          SD(^n1)       0.01       0.16       0.19       0.34       0.41

  Table 1: Experimental results for detecting small clusters

The results of the new method can be further improved. If the decomposed
form (9) is used instead of (8), and the solutions of the sub-equations are
normalized before combining them--which is feasible because the two
underlying clusters are so distant from each other--the correct values are
obtained for $\hat{n}_1$ and $\hat{n}_2$ in virtually every trial.

6 Conclusions

We have identified a shortcoming of existing clustering methods for
arbitrary semi-parametric mixture distributions: they fail to detect very
small clusters reliably. This is a significant weakness when the minority
clusters are far from the dominant ones and the loss function takes account
of the distance of misclustered points.

We have described a new clustering method for arbitrary semi-parametric
mixture distributions, and shown experimentally that it overcomes the
problem. Furthermore, the experiments suggest that the new estimator is more
accurate and more stable than existing ones.

References

Barbe, P. (1998). Statistical analysis of mixtures and the empirical
probability measure. Acta Applicandae Mathematicae, 50(3), 253-340.

Blum, J. R. & Susarla, V. (1977). Estimation of a mixing distribution
function. Ann. Probab., 5, 200-209.

Choi, K. & Bulgren, W. B. (1968). An estimation procedure for mixtures of
distributions. J. R. Statist. Soc. B, 30, 444-460.

Deely, J. J. & Kruse, R. L. (1968). Construction of sequences estimating
the mixing distribution. Ann. Math. Statist., 39, 286-288.

Lawson, C. L. & Hanson, R. J. (1974). Solving Least Squares Problems.
Prentice-Hall, Inc.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications,
Volume 5 of NSF-CBMS Regional Conference Series in Probability and
Statistics. Institute of Mathematical Statistics: Hayward, CA.

Macdonald, P. D. M. (1971). Comment on a paper by Choi and Bulgren.
J. R. Statist. Soc. B, 33, 326-329.

McLachlan, G. & Basford, K. (1988). Mixture Models: Inference and
Applications to Clustering. Marcel Dekker, New York.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical
Analysis of Finite Mixture Distributions. John Wiley & Sons.

Wang, Y. & Witten, I. H. (1999). The estimation of mixing distributions by
approximating empirical measures. Technical Report (in preparation), Dept.
of Computer Science, University of Waikato, New Zealand.
</pre></Content>
</Section>
</Archive>