source: other-projects/nightly-tasks/diffcol/trunk/model-collect/wordpdfb/archives/HASH015936f5.dir/doc.xml@ 27726

Last change on this file since 27726 was 27726, checked in by ak19, 11 years ago

Word-PDF-Basic model collection

File size: 28.3 KB
1<?xml version="1.0" encoding="utf-8" standalone="no"?>
2<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
3<Archive>
4<Section>
5 <Description>
6 <Metadata name="gsdldoctype">indexed_doc</Metadata>
7 <Metadata name="Language">en</Metadata>
8 <Metadata name="Encoding">utf8</Metadata>
9 <Metadata name="Title">Clustering with finite data from semi-parametric mixture distributions</Metadata>
10 <Metadata name="gsdlsourcefilename">import/cluster.ps</Metadata>
11 <Metadata name="gsdlconvertedfilename">tmp/1372402380/cluster.text</Metadata>
12 <Metadata name="OrigSource">cluster.text</Metadata>
13 <Metadata name="Source">cluster.ps</Metadata>
14 <Metadata name="SourceFile">cluster.ps</Metadata>
15 <Metadata name="Plugin">PostScriptPlugin</Metadata>
16 <Metadata name="FileSize">94721</Metadata>
17 <Metadata name="FilenameRoot">cluster</Metadata>
18 <Metadata name="FileFormat">PS</Metadata>
19 <Metadata name="srcicon">_iconps_</Metadata>
20 <Metadata name="srclink_file">doc.ps</Metadata>
21 <Metadata name="srclinkFile">doc.ps</Metadata>
22 <Metadata name="dc.Creator">Yong Wang</Metadata>
23 <Metadata name="dc.Creator">Ian H. Witten</Metadata>
24 <Metadata name="Identifier">HASH015936f516ed4b1d7b050af9</Metadata>
25 <Metadata name="lastmodified">1372400870</Metadata>
26 <Metadata name="lastmodifieddate">20130628</Metadata>
27 <Metadata name="oailastmodified">1372402380</Metadata>
28 <Metadata name="oailastmodifieddate">20130628</Metadata>
29 <Metadata name="assocfilepath">HASH015936f5.dir</Metadata>
30 <Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
31 </Description>
32 <Content>&lt;pre&gt;
33
34
Clustering with finite data from semi-parametric mixture distributions

Yong Wang and Ian H. Witten
Computer Science Department, University of Waikato, New Zealand

Email: [email protected]   Email: [email protected]
41
Abstract

Existing clustering methods for the semi-parametric mixture distribution perform well
as the volume of data increases. However, they all suffer from a serious drawback in
finite-data situations: small outlying groups of data points can be completely ignored
in the clusters that are produced, no matter how far away they lie from the major
clusters. This can result in unbounded loss if the loss function is sensitive to the
distance between clusters.

This paper proposes a new distance-based clustering method that overcomes the problem
by avoiding global constraints. Experimental results illustrate its superiority to
existing methods when small clusters are present in finite data sets; they also suggest
that it is more accurate and stable than other methods even when there are no small
clusters.
54
1 Introduction

A common practical problem is to fit an underlying statistical distribution to a
sample. In some applications, this involves estimating the parameters of a single
distribution function--e.g. the mean and variance of a normal distribution. In others,
an appropriate mixture of elementary distributions must be found--e.g. a set of normal
distributions, each with its own mean and variance. Among many kinds of mixture
distribution, one in particular is attracting increasing research attention because it
has many practical applications: the semi-parametric mixture distribution.
63
A semi-parametric mixture distribution is one whose cumulative distribution function
(CDF) has the form

F_G(x) = ∫_Θ F(x; θ) dG(θ),    (1)

where θ ∈ Θ, the parameter space, and x ∈ X, the sample space. This gives the CDF of
the mixture distribution F_G(x) in terms of two more elementary distributions: the
component distribution F(x; θ), which is given, and the mixing distribution G(θ), which
is unknown. The former has a single unknown parameter θ, while the latter gives a CDF
for θ. For example, F(x; θ) might be the normal distribution with mean θ and unit
variance, where θ is a random variable distributed according to G(θ).
78
The problem that we will address is the estimation of G(θ) from sampled data that are
independent and identically distributed according to the unknown distribution F_G(x).
Once G(θ) has been obtained, it is a straightforward matter to obtain the mixture
distribution.
83
The CDF G(θ) can be either continuous or discrete. In the latter case, G(θ) is composed
of a number of mass points, say, θ_1, ..., θ_k with masses w_1, ..., w_k respectively,
satisfying Σ_{i=1}^{k} w_i = 1. Then (1) can be re-written as

F_G(x) = Σ_{i=1}^{k} w_i F(x; θ_i),    (2)

each mass point providing a component, or cluster, in the mixture with the
corresponding weight. If the number of components k is finite and known a priori, the
mixture distribution is called finite; otherwise it is treated as countably infinite.
The qualifier &quot;countably&quot; is necessary to distinguish this case from the situation with
continuous G(θ), which is also infinite.
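
To make (2) concrete, the following sketch evaluates the CDF of a discrete mixture
whose components are normal with unit variance. It is an illustration only: the helper
name mixture_cdf and the use of numpy and scipy are assumptions, not part of the paper.

# Minimal sketch of equation (2) for unit-variance normal components.
import numpy as np
from scipy.stats import norm

def mixture_cdf(x, support, weights):
    # F_G(x) = sum_i w_i F(x; theta_i), where F(x; theta) is the N(theta, 1) CDF.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    support = np.asarray(support, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.array([np.sum(weights * norm.cdf(xi - support)) for xi in x])

# Example: two mass points at 0 and 100 with weights 0.99 and 0.01.
print(mixture_cdf([0.0, 50.0, 100.0], support=[0.0, 100.0], weights=[0.99, 0.01]))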
102
We will focus on the estimation of arbitrary mixing distributions, i.e., G(θ) is any
general probability distribution--finite, countably infinite or continuous. A few
methods for tackling this problem can be found in the literature. However, as we shall
see, they all suffer from a serious drawback in finite-data situations: small outlying
groups of data points can be completely ignored in the clusters that are produced.
109
This phenomenon seems to have been overlooked, presumably for three reasons: small
amounts of data may be assumed to represent a small loss; a few data points can easily
be dismissed as outliers; and in the limit the problem evaporates because most
estimators possess the property of strong consistency--which means that, almost surely,
they converge weakly to any given G(θ) as the sample size approaches infinity. However,
often these reasons are inappropriate: the loss function may be sensitive to the
distance between clusters; the small number of outlying data points may actually
represent small clusters; and any practical clustering situation will necessarily
involve finite data.
123
124This paper proposes a new method, based on the idea of local fitting, that
125successfully solves the problem. The experimental results presented below
126illustrate its superiority to existing methods when small clusters are present
127in finite data sets. Moreover, they also suggest that it is more accurate
128and stable than other methods even when there are no small clusters. Existing
129clustering methods for semi-parametric mixture distributions are briefly
130reviewed in the next section. Section 3 identifies a common problem from
131which these current methods suffer. Then we present the new solution, and
132in Section 5 we describe experiments that illustrate the problem that has
133been identified and show how the new method overcomes it.
134
2 Clustering methods

The general problem of inferring mixture models is treated extensively and in
considerable depth in books by Titterington et al. (1985), McLachlan and Basford (1988)
and Lindsay (1995). For semi-parametric mixture distributions there are three basic
approaches: minimum distance, maximum likelihood, and Bayesian. We briefly introduce
the first approach, which is the one adopted in the paper, review the other two to show
why they are not suitable for arbitrary mixtures, and then return to the chosen
approach and review the minimum distance estimators for arbitrary semi-parametric
mixture distributions that have been described in the literature.
144
The idea of the minimum distance method is to define some measure of the goodness of
the clustering and optimize this by suitable choice of a mixing distribution G_n(θ) for
a sample of size n. We generally want the estimator to be strongly consistent as
n → ∞, in the sense defined above, for arbitrary mixing distributions. We also
generally want to take advantage of the special structure of semi-parametric mixtures
to come up with an efficient algorithmic solution.
152
The maximum likelihood approach maximizes the likelihood (or equivalently the
log-likelihood) of the data by suitable choice of G_n(θ). It can in fact be viewed as a
minimum distance method that uses the Kullback-Leibler distance (Titterington et al.,
1985). This approach has been widely used for estimating finite mixtures, particularly
when the number of clusters is fairly small, and it is generally accepted that it is
more accurate than other methods. However, it has not been used to estimate arbitrary
semi-parametric mixtures, presumably because of its high computational cost. Its speed
drops dramatically as the number of parameters that must be determined increases, which
makes it computationally infeasible for arbitrary mixtures, since each data point might
represent a component of the final distribution with its own parameters.
166
167Bayesian methods assume prior knowledge, often given by some kind of heuristic,
168to determine a suitable a priori probability density function. They are often
169used to determine the number of components in the final distribution--particularly
170when outliers are present. Like the maximum likelihood approach they are
171computationally expensive, for they use the same computational techniques.
172
We now review existing minimum distance estimators for arbitrary semi-parametric
mixture distributions. We begin with some notation. Let x_1, ..., x_n be a sample
chosen according to the mixture distribution, and suppose (without loss of generality)
that the sequence is ordered so that x_1 ≤ x_2 ≤ ... ≤ x_n. Let G_n(θ) be a discrete
estimator of the underlying mixing distribution with a set of support points at
{θ_{nj}, j = 1, ..., k_n}. Each θ_{nj} provides a component of the final clustering
with weight w_{nj} ≥ 0, where Σ_{j=1}^{k_n} w_{nj} = 1. Given the support points,
obtaining G_n(θ) is equivalent to computing the weight vector
w_n = (w_{n1}, w_{n2}, ..., w_{nk_n})'. Denote by F_{G_n}(x) the estimated mixture CDF
with respect to G_n(θ).
193
Two minimum distance estimators were proposed in the late 1960s. Choi and Bulgren
(1968) used

(1/n) Σ_{i=1}^{n} [F_{G_n}(x_i) − i/n]^2    (3)

as the distance measure. Minimizing this quantity with respect to G_n yields a strongly
consistent estimator. A slight improvement is obtained by using the Cramér-von Mises
statistic

(1/n) Σ_{i=1}^{n} [F_{G_n}(x_i) − (i − 1/2)/n]^2 + 1/(12n^2),    (4)

which essentially replaces i/n in (3) with (i − 1/2)/n without affecting the asymptotic
result. As might be expected, this reduces the bias for small-sample cases, as was
demonstrated empirically by Macdonald (1971) in a note on Choi and Bulgren's paper.
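
For illustration, the objectives (3) and (4) can be evaluated for a candidate G_n as
follows. This is a hedged sketch, not the authors' implementation; it assumes sorted
data and the hypothetical mixture_cdf helper introduced earlier.

# Sketch of the distance measures (3) and (4) for a candidate G_n.
import numpy as np

def choi_bulgren(x_sorted, support, weights):
    # Equation (3): squared deviation from the empirical CDF values i/n.
    n = len(x_sorted)
    i = np.arange(1, n + 1)
    return np.mean((mixture_cdf(x_sorted, support, weights) - i / n) ** 2)

def cramer_von_mises(x_sorted, support, weights):
    # Equation (4): use (i - 1/2)/n and add the constant term 1/(12 n^2).
    n = len(x_sorted)
    i = np.arange(1, n + 1)
    resid = mixture_cdf(x_sorted, support, weights) - (i - 0.5) / n
    return np.mean(resid ** 2) + 1.0 / (12.0 * n ** 2)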
231
At about the same time, Deely and Kruse (1968) used the sup-norm associated with the
Kolmogorov-Smirnov test. The minimization is over

sup_{1≤i≤n} max{ |F_{G_n}(x_i) − (i − 1)/n|, |F_{G_n}(x_i) − i/n| },    (5)

and this leads to a linear programming problem. Deely and Kruse also established the
strong consistency of their estimator G_n. Ten years later, this approach was extended
by Blum and Susarla (1977) by using any sequence {f_n} of functions which satisfies
sup |f_n − f_G| → 0 a.s. as n → ∞. Each f_n can, for example, be obtained by a
kernel-based density estimator. Blum and Susarla approximated the function f_n by the
overall mixture pdf f_{G_n}, and established the strong consistency of the estimator
G_n under weak conditions.
249
For reasons of simplicity and generality, we will denote the approximation between two
mathematical entities of the same type by ≈, which implies minimization, with respect
to an estimator, of a distance measure between the entities on either side. The types
of entity involved in this paper include vectors, functions and measures, and we use
the same symbol ≈ for each.
255
In the work reviewed above, two kinds of estimator are used: CDF-based (Choi and
Bulgren, Macdonald, and Deely and Kruse) and pdf-based (Blum and Susarla). CDF-based
estimators involve approximating an empirical distribution with an estimated one
F_{G_n}. We write this as

F_{G_n} ≈ F_n,    (6)

where F_n is the Kolmogorov empirical CDF--or indeed any empirical CDF that converges
to it. Pdf-based estimators involve the approximation between probability density
functions:

f_{G_n} ≈ f_n,    (7)

where f_{G_n} is the estimated mixture pdf and f_n is the empirical pdf described
above.
278
279The entities involved in (6) and (7) are functions. When the approximation
280is computed, however, it is computed between vectors that represent the functions.
281These vectors contain the function values at a particular set of points,
282which we call &quot;fitting points.&quot; In the work reviewed above, the fitting points
283are chosen to be the data points themselves.
284
2853 The problem of minority clusters
286
287Although they perform well asymptotically, all the minimum distance methods
288described above suffer from the finite-sample problem discussed earlier:
289they can neglect small groups of outlying data points no matter how far they
290lie from the dominant data points. The underlying reason is that the objective
291function to be minimized is defined globally rather than locally. A global
292approach means that the value of the estimated probability density function
293at a particular place will be influenced by all data points, no matter how
294far away they are. This can cause small groups of data points to be ignored
295even if they are a long way from the dominant part of the data sample. From
296a probabilistic point of view, however, there is no reason to subsume distant
297groups within the major clusters just because they are relatively small.
298
299The ultimate effect of suppressing distant minority clusters depends on how
300the clustering is applied. If the application's loss function depends on
301the distance between clusters, the result may prove disastrous because there
302is no limit to how far away these outlying groups may be. One might argue
303that small groups of points can easily be explained away as outliers, because
304the effect will become less important as the number of data points increases--and
305it will disappear in the limit of infinite data. However, in a finite-data
306situation--and all practical applications necessarily involve finite data--the
307&quot;outliers&quot; may equally well represent small minority clusters. Furthermore,
308outlying data points are not really treated as outliers by these methods--whether
309or not they are discarded is merely an artifact of the global fitting calculation.
310When clustering, the final mixture distribution should take all data points
311into account--including outlying clusters if any exist. If practical applications
312demand that small outlying clusters are suppressed, this should be done in
313a separate stage.
314
In distance-based clustering, each data point has a far-reaching effect because of two
global constraints. One is the use of the cumulative distribution function; the other
is the normalization constraint Σ_{j=1}^{k_n} w_{nj} = 1. These constraints may
sacrifice a small number of data points--at any distance--for a better overall fit to
the data as a whole. Choi and Bulgren (1968), the Cramér-von Mises statistic
(Macdonald, 1971), and Deely and Kruse (1968) all enforce both the CDF and the
normalization constraints. Blum and Susarla (1977) drop the CDF, but still enforce the
normalization constraint. The result is that these clustering methods are only
appropriate for finite mixtures without small clusters, where the risk of suppressing
clusters is low.
330
331This paper addresses the general problem of arbitrary mixtures. Of course,
332the minority cluster problem exists for all types of mixture--including finite
333mixtures. Even here, the maximum likelihood and Bayesian approaches do not
334solve the problem, because they both introduce a global normalization constraint.
335
4 Solving the minority cluster problem
339
Now that the source of the problem has been identified, the solution is clear, at least
in principle: drop both the approximation of CDFs, as Blum and Susarla (1977) do, and
the normalization constraint--no matter how seductive it may seem.

Let G'_n be a discrete function with masses {w_{nj}} at {θ_{nj}}; note that we do not
require the w_{nj} to sum to one. Since the new method operates in terms of measures
rather than distribution functions, the notion of approximation is altered to use
intervals rather than points. Using the formulation described in Section 2, we have

P_{G'_n} ≈ P_n,    (8)

where P_{G'_n} is the estimated measure and P_n is the empirical measure. The intervals
over which the approximation takes place are called &quot;fitting intervals.&quot; Since (8) is
not subject to the normalization constraint, G'_n is not a CDF and P_{G'_n} is not a
probability measure. However, G'_n can easily be converted into a CDF estimator by
normalizing it after equation (8) has been solved.
366
367To define the estimation procedure fully, we need to determine (a) the set
368of support points, (b) the set of fitting intervals, (c) the empirical measure,
369and (d) the distance measure. Here we discuss these in an intuitive manner;
370Wang and Witten (1999) show how to determine them in a way that guarantees
371a strongly consistent estimator.
372
Support points. The support points are usually suggested by the data points in the
sample. For example, if the component distribution F(x; θ) is the normal distribution
with mean θ and unit variance, each data point can be taken as a support point. In
fact, the support points are more accurately described as potential support points,
because their associated weights may become zero after solving (8)--and, in practice,
many often do.
379
Fitting intervals. The fitting intervals are also suggested by the data points. In the
normal distribution example, each data point x_i can provide one interval, such as
[x_i − 3σ, x_i], or two, such as [x_i − 3σ, x_i] and [x_i, x_i + 3σ], or more. There is
no problem if the fitting intervals overlap. Their length should not be so large that
points can exert an influence on the clustering at an unduly remote place, nor so small
that the empirical measure is inaccurate. The experiments reported below use intervals
of a few standard deviations around each data point, and, as we will see, this works
well.
390
Empirical measure. The empirical measure can be the probability measure determined by
the Kolmogorov empirical CDF, or any measure that converges to it. The fitting
intervals discussed above can be open, closed, or semi-open. This will affect the
empirical measure if data points are used as interval boundaries, although it does not
change the values of the estimated measure because the corresponding distribution is
continuous. In small-sample situations, bias can be reduced by careful attention to
this detail--as Macdonald (1971) discusses with respect to Choi and Bulgren's (1968)
method.
399
400Distance measure. The choice of distance measure determines what kind of
401mathematical programming problem must be solved. For example, a quadratic
402distance will give rise to a least squares problem under linear constraints,
403whereas the sup-norm gives rise to a linear programming problem that can
404be solved using the simplex method. These two measures have efficient solutions
405that are globally optimal.
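
As a concrete illustration of the quadratic-distance case, the sketch below assembles
the interval probabilities under each candidate component and solves the resulting
non-negative least squares problem. It is an assumption-laden sketch, not the paper's
code: scipy.optimize.nnls stands in for the Lawson and Hanson routine cited in Section
5, the components are unit-variance normals, and the interval and support choices
follow the experiments described there.

# Sketch of solving (8) with a quadratic distance: minimize the squared
# deviation of A w from the empirical measure P subject to w ≥ 0.
import numpy as np
from scipy.stats import norm
from scipy.optimize import nnls

def estimate_weights(x, half_width=3.0):
    x = np.sort(np.asarray(x, dtype=float))
    support = x                                   # every data point is a potential support point
    # Two fitting intervals per data point: [x_i - h, x_i] and [x_i, x_i + h].
    left = np.concatenate([x - half_width, x])
    right = np.concatenate([x, x + half_width])
    # A[m, j] = probability a N(theta_j, 1) component assigns to interval m.
    A = norm.cdf(right[:, None] - support[None, :]) - norm.cdf(left[:, None] - support[None, :])
    # Empirical measure of each interval; boundary points count as half.
    inside = (x[None, :] > left[:, None]) * (right[:, None] > x[None, :])
    on_edge = (x[None, :] == left[:, None]) + (x[None, :] == right[:, None])
    P = (inside.sum(axis=1) + 0.5 * on_edge.sum(axis=1)) / len(x)
    w, _ = nnls(A, P)                             # unnormalized weights, as in (8)
    return support, w / w.sum()                   # normalize afterwards to obtain a CDF estimator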
406
It is worth pointing out that abandoning the global constraints associated with both
CDFs and normalization brings with it a computational advantage. In vector form, we
write P_{G'_n} = A_{G'_n} w_n, where w_n is the (unnormalized) weight vector and each
element of the matrix A_{G'_n} is the probability value of a component distribution
over a fitting interval. Then, provided the support points corresponding to w'_n and
w''_n lie outside each other's sphere of influence as determined by the component
distributions F(x; θ), the estimation procedure becomes

( A'_{G'_n}      0       )  ( w'_n  )       ( P'_n  )
(     0      A''_{G'_n}  )  ( w''_n )   ≈   ( P''_n ),    (9)

subject to w'_n ≥ 0 and w''_n ≥ 0. This is the same as combining the solutions of two
sub-equations, A'_n w'_n ≈ P'_n subject to w'_n ≥ 0, and A''_n w''_n ≈ P''_n subject to
w''_n ≥ 0. If the relevant support points continue to lie outside each other's sphere
of influence, the sub-equations can be further partitioned. This implies that when data
points are sufficiently far apart, the mixing distribution G can be estimated by
grouping data points in different regions. Moreover, the solutions for the individual
regions can be normalized separately before they are combined, which yields a better
estimate of the mixing distribution.
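
One possible reading of the decomposition (9) in code, assuming the illustrative
estimate_weights sketch above: split the sample wherever consecutive points are far
apart, solve and normalize each block separately, and recombine the blocks in
proportion to their share of the data. The gap threshold used here is an arbitrary
assumption.

# Sketch of the block-diagonal shortcut in (9).
import numpy as np

def estimate_by_regions(x, gap=10.0):
    x = np.sort(np.asarray(x, dtype=float))
    # Start a new region wherever consecutive points are more than gap apart.
    breaks = np.where(np.diff(x) > gap)[0] + 1
    regions = np.split(x, breaks)
    support, weights = [], []
    for r in regions:
        s, w = estimate_weights(r)                # solve one sub-equation of (9)
        support.append(s)
        weights.append(w * len(r) / len(x))       # weight each region by its share of the data
    return np.concatenate(support), np.concatenate(weights)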
443
If the normalization constraint Σ_{j=1}^{k_n} w_{nj} = 1 is retained when estimating
the mixing distribution, the estimation procedure becomes

P_{G_n} ≈ P_n,    (10)

where the estimator G_n is a discrete CDF on Θ. This constraint is necessary for the
left-hand side of (10) to be a probability measure. Although he did not develop an
operational estimation scheme, Barbe (1998) suggested exploiting the fact that the
empirical probability measure is approximated by the estimated probability measure--but
he retained the normalization constraint. As noted above, relaxing the constraint
loosens the throttling effect of large clusters on small groups of outliers, and our
experimental results show that the estimator obtained from (10), which retains the
constraint, still suffers from the drawback noted earlier.
464
Both estimators, G_n obtained from (10) and G'_n from (8), have been shown to be
strongly consistent under weak conditions similar to those used by others (Wang &amp;
Witten, 1999). Of course, the weak convergence of G'_n is in the sense of general
functions, not CDFs. The strong consistency of G'_n immediately implies the strong
consistency of the CDF estimator obtained by normalizing G'_n.
471
5 Experimental validation

We have conducted experiments to illustrate the failure of existing methods to detect
small outlying clusters, and the improvement achieved by the new scheme. The results
also suggest that the new method is more accurate and stable than the others.
476
477When comparing clustering methods, it is not always easy to evaluate the
478clusters obtained. To finesse this problem we consider simple artificial
479situations in which the proper outcome is clear. Some practical applications
480of clusters do provide objective evaluation functions; however, these are
481beyond the scope of this paper.
482
The methods used are Choi and Bulgren (1968) (denoted choi), Macdonald's application of
the Cramér-von Mises statistic (cramér), the new method with the normalization
constraint (test), and the new method without that constraint (new). In each case,
equations involving non-negativity and/or linear equality constraints are solved as
quadratic programming problems using the elegant and efficient procedures nnls and lsei
provided by Lawson and Hanson (1974). All four methods have the same computational time
complexity.
490
We set the sample size n to 100 throughout the experiments. The data points are
artificially generated from a mixture of two clusters: n1 points from N(0, 1) and n2
points from N(100, 1). The values of n1 and n2 are in the ratios 99:1, 97:3, 93:7,
80:20 and 50:50.
495
Every data point is taken as a potential support point in all four methods: thus the
number of potential components in the clustering is 100. For test and new, fitting
intervals need to be determined. In the experiments, each data point x_i provides the
two fitting intervals [x_i − 3, x_i] and [x_i, x_i + 3]. Any data point located on the
boundary of an interval is counted as half a point when determining the empirical
measure over that interval.
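
For concreteness, one run of the 99:1 configuration could be generated and fitted as
follows, reusing the illustrative estimate_weights sketch given earlier; this is not
the original experimental code.

# Sketch of a single experimental run for the 99:1 case.
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 99, 1
x = np.concatenate([rng.normal(0.0, 1.0, n1), rng.normal(100.0, 1.0, n2)])
# Each x_i contributes the fitting intervals [x_i - 3, x_i] and [x_i, x_i + 3].
support, w = estimate_weights(x, half_width=3.0)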
502
These choices are admittedly crude, and further improvements in the accuracy and speed
of test and new are possible that take advantage of the flexibility provided by (10)
and (8). For example, accuracy will likely increase with more--and more carefully
chosen--support points and fitting intervals. The fact that the new method performs
well even with crudely chosen support points and fitting intervals testifies to its
robustness.
509
Our primary interest in this experiment is the weights of the clusters that are found.
To cast the results in terms of the underlying models, we use the cluster weights to
estimate values for n1 and n2. Of course, the results often do not contain exactly two
clusters--but because the underlying cluster centres, 0 and 100, are well separated
compared to their standard deviation of 1, it is highly unlikely that any data points
from one cluster will fall anywhere near the other. Thus we use a threshold of 50 to
divide the clusters into two groups: those near 0 and those near 100. The final cluster
weights are normalized, and the weights for the first group are summed to obtain an
estimate n̂1 of n1, while those for the second group are summed to give an estimate n̂2
of n2.
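
The conversion from normalized cluster weights to the estimates n̂1 and n̂2 can be
sketched as follows, continuing the illustrative run above and using the threshold of
50 just described.

# Sketch: sum the normalized weights on each side of the threshold.
import numpy as np

n = len(x)
group2 = support > 50.0                  # clusters near 100
n2_hat = n * np.sum(w[group2])
n1_hat = n * np.sum(w[~group2])          # clusters near 0
print(n1_hat, n2_hat)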
521
Table 1 shows results for each of the four methods. Each cell represents one hundred
separate experimental runs. Three figures are recorded. At the top is the number of
times the method failed to detect the smaller cluster, that is, the number of times
n̂2 = 0. In the middle are the average values for n̂1 and n̂2. At the bottom is the
standard deviation of n̂1 and n̂2 (which are equal). These three figures can be thought
of as measures of reliability, accuracy and stability respectively.
529
The top figures in Table 1 show clearly that only new is always reliable in the sense
that it never fails to detect the smaller cluster. The other methods fail mostly when
n2 = 1; their failure rate gradually decreases as n2 grows. The center figures show
that, under all conditions, new gives a more accurate estimate of the correct values of
n1 and n2 than the other methods. As expected, cramér shows a noticeable improvement
over choi, but it is very minor. The test method has lower failure rates and produces
estimates that are more accurate and far more stable (indicated by the bottom figures)
than those for choi and cramér--presumably because it is less constrained. Of the four
methods, new is clearly and consistently the winner in terms of all three measures:
reliability, accuracy and stability.

                     n1 = 99    n1 = 97    n1 = 93    n1 = 80    n1 = 50
                     n2 = 1     n2 = 3     n2 = 7     n2 = 20    n2 = 50

choi     Failures        86         42          4          0          0
         n̂1/n̂2     99.9/0.1   99.2/0.8   95.8/4.2  82.0/18.0  50.6/49.4
         SD(n̂1)        0.36       0.98       1.71       1.77       1.30

cramér   Failures        80         31          1          0          0
         n̂1/n̂2     99.8/0.2   98.6/1.4   95.1/4.9  81.6/18.4  49.7/50.3
         SD(n̂1)        0.50       1.13       1.89       1.80       1.31

test     Failures        52          5          0          0          0
         n̂1/n̂2     99.8/0.2   98.2/1.8   94.1/5.9  80.8/19.2  50.1/49.9
         SD(n̂1)        0.32       0.83       0.87       0.78       0.55

new      Failures         0          0          0          0          0
         n̂1/n̂2     99.0/1.0   96.9/3.1   92.8/7.2  79.9/20.1  50.1/49.9
         SD(n̂1)        0.01       0.16       0.19       0.34       0.41

Table 1: Experimental results for detecting small clusters
558
The results of the new method can be further improved. If the decomposed form (9) is
used instead of (8), and the solutions of the sub-equations are normalized before
combining them--which is feasible because the two underlying clusters are so distant
from each other--the correct values are obtained for n̂1 and n̂2 in virtually every
trial.
564
6 Conclusions

We have identified a shortcoming of existing clustering methods for arbitrary
semi-parametric mixture distributions: they fail to detect very small clusters
reliably. This is a significant weakness when the minority clusters are far from the
dominant ones and the loss function takes account of the distance of misclustered
points.
570
571We have described a new clustering method for arbitrary semi-parametric mixture
572distributions, and shown experimentally that it overcomes the problem. Furthermore,
573the experiments suggest that the new estimator is more accurate and more
574stable than existing ones.
575
References

Barbe, P. (1998). Statistical analysis of mixtures and the empirical probability
measure. Acta Applicandae Mathematicae, 50(3), 253-340.

Blum, J. R. &amp; Susarla, V. (1977). Estimation of a mixing distribution function.
Ann. Probab., 5, 200-209.

Choi, K. &amp; Bulgren, W. B. (1968). An estimation procedure for mixtures of
distributions. J. R. Statist. Soc. B, 30, 444-460.

Deely, J. J. &amp; Kruse, R. L. (1968). Construction of sequences estimating the mixing
distribution. Ann. Math. Statist., 39, 286-288.

Lawson, C. L. &amp; Hanson, R. J. (1974). Solving Least Squares Problems. Prentice-Hall,
Inc.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications, Volume 5 of
NSF-CBMS Regional Conference Series in Probability and Statistics. Institute for
Mathematical Statistics: Hayward, CA.

Macdonald, P. D. M. (1971). Comment on a paper by Choi and Bulgren. J. R. Statist.
Soc. B, 33, 326-329.

McLachlan, G. &amp; Basford, K. (1988). Mixture Models: Inference and Applications to
Clustering. Marcel Dekker, New York.

Titterington, D. M., Smith, A. F. M. &amp; Makov, U. E. (1985). Statistical Analysis of
Finite Mixture Distributions. John Wiley &amp; Sons.

Wang, Y. &amp; Witten, I. H. (1999). The estimation of mixing distributions by
approximating empirical measures. Technical Report (in preparation), Dept. of Computer
Science, University of Waikato, New Zealand.
616&lt;/pre&gt;</Content>
617</Section>
618</Archive>