source: other-projects/nightly-tasks/diffcol/trunk/gs3-model-collect/Word-PDF-Formatting/archives/HASH0159.dir/doc.xml@ 28241

Last change on this file since 28241 was 28241, checked in by ak19, 11 years ago

Rebuilt the GS3 model collection after the changeover to using placeholders for standard GS path prefixes in the two archiveinf gdb files.

File size: 28.4 KB
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
 <Description>
 <Metadata name="gsdldoctype">indexed_doc</Metadata>
 <Metadata name="Language">en</Metadata>
 <Metadata name="Encoding">utf8</Metadata>
 <Metadata name="Title">Clustering with finite data from semi-parametric mixture distributions</Metadata>
 <Metadata name="gsdlsourcefilename">import/cluster.ps</Metadata>
 <Metadata name="gsdlconvertedfilename">tmp/1378708752/cluster.text</Metadata>
 <Metadata name="OrigSource">cluster.text</Metadata>
 <Metadata name="Source">cluster.ps</Metadata>
 <Metadata name="SourceFile">cluster.ps</Metadata>
 <Metadata name="Plugin">PostScriptPlugin</Metadata>
 <Metadata name="FileSize">94721</Metadata>
 <Metadata name="FilenameRoot">cluster</Metadata>
 <Metadata name="FileFormat">PS</Metadata>
 <Metadata name="srcicon">_iconps_</Metadata>
 <Metadata name="srclink_file">doc.ps</Metadata>
 <Metadata name="srclinkFile">doc.ps</Metadata>
 <Metadata name="dc.Creator">Yong Wang</Metadata>
 <Metadata name="dc.Creator">Ian H. Witten</Metadata>
 <Metadata name="dc.Title">Clustering with finite data from semi-parametric mixture distributions</Metadata>
 <Metadata name="Identifier">HASH015936f516ed4b1d7b050af9</Metadata>
 <Metadata name="lastmodified">1378708193</Metadata>
 <Metadata name="lastmodifieddate">20130909</Metadata>
 <Metadata name="oailastmodified">1378708753</Metadata>
 <Metadata name="oailastmodifieddate">20130909</Metadata>
 <Metadata name="assocfilepath">HASH0159.dir</Metadata>
 <Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
 </Description>
 <Content>&lt;pre&gt;

Clustering with finite data from semi-parametric mixture distributions

Yong Wang                                  Ian H. Witten
Computer Science Department                Computer Science Department
University of Waikato, New Zealand         University of Waikato, New Zealand
Email: [email protected]              Email: [email protected]

Abstract

Existing clustering methods for the semi-parametric mixture distribution
perform well as the volume of data increases. However, they all suffer from
a serious drawback in finite-data situations: small outlying groups of data
points can be completely ignored in the clusters that are produced, no matter
how far away they lie from the major clusters. This can result in unbounded
loss if the loss function is sensitive to the distance between clusters.

This paper proposes a new distance-based clustering method that overcomes
the problem by avoiding global constraints. Experimental results illustrate
its superiority to existing methods when small clusters are present in finite
data sets; they also suggest that it is more accurate and stable than other
methods even when there are no small clusters.

1 Introduction

A common practical problem is to fit an underlying statistical distribution
to a sample. In some applications, this involves estimating the parameters
of a single distribution function--e.g. the mean and variance of a normal
distribution. In others, an appropriate mixture of elementary distributions
must be found--e.g. a set of normal distributions, each with its own mean
and variance. Among many kinds of mixture distribution, one in particular is
attracting increasing research attention because it has many practical
applications: the semi-parametric mixture distribution.

A semi-parametric mixture distribution is one whose cumulative distribution
function (CDF) has the form

    F_G(x) = \int_\Theta F(x; \theta) \, dG(\theta),                      (1)

where \theta \in \Theta, the parameter space, and x \in X, the sample space.
This gives the CDF of the mixture distribution F_G(x) in terms of two more
elementary distributions: the component distribution F(x; \theta), which is
given, and the mixing distribution G(\theta), which is unknown. The former
has a single unknown parameter \theta, while the latter gives a CDF for
\theta. For example, F(x; \theta) might be the normal distribution with mean
\theta and unit variance, where \theta is a random variable distributed
according to G(\theta).

The problem that we will address is the estimation of G(\theta) from sampled
data that are independent and identically distributed according to the
unknown distribution F_G(x). Once G(\theta) has been obtained, it is a
straightforward matter to obtain the mixture distribution.

The CDF G(\theta) can be either continuous or discrete. In the latter case,
G(\theta) is composed of a number of mass points, say \theta_1, \ldots,
\theta_k with masses w_1, \ldots, w_k respectively, satisfying
\sum_{i=1}^{k} w_i = 1. Then (1) can be rewritten as

    F_G(x) = \sum_{i=1}^{k} w_i F(x; \theta_i),                           (2)

each mass point providing a component, or cluster, in the mixture with the
corresponding weight. If the number of components k is finite and known a
priori, the mixture distribution is called finite; otherwise it is treated
as countably infinite. The qualifier &quot;countably&quot; is necessary to
distinguish this case from the situation with continuous G(\theta), which is
also infinite.
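
The following minimal sketch illustrates equation (2) in code, using normal
components with unit variance as in the paper's running example; the helper
name mixture_cdf and the example mass points and weights are illustrative
assumptions, not part of the paper.

    # Sketch of equation (2): the CDF of a discrete mixture is a weighted
    # sum of component CDFs evaluated at x.
    import numpy as np
    from scipy.stats import norm

    def mixture_cdf(x, mass_points, weights):
        """F_G(x) = sum_i w_i F(x; theta_i), with N(theta_i, 1) components."""
        x = np.atleast_1d(np.asarray(x, dtype=float))
        theta = np.asarray(mass_points, dtype=float)
        # One row per evaluation point, one column per mass point theta_i.
        components = norm.cdf(x[:, None] - theta[None, :])
        return components @ np.asarray(weights, dtype=float)

    # Example: two clusters with weights 0.8 and 0.2.
    print(mixture_cdf([0.0, 5.0], mass_points=[0.0, 5.0], weights=[0.8, 0.2]))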

We will focus on the estimation of arbitrary mixing distributions, i.e.,
G(\theta) is any general probability distribution--finite, countably infinite
or continuous. A few methods for tackling this problem can be found in the
literature. However, as we shall see, they all suffer from a serious drawback
in finite-data situations: small outlying groups of data points can be
completely ignored in the clusters that are produced.

This phenomenon seems to have been overlooked, presumably for three reasons:
small amounts of data may be assumed to represent a small loss; a few data
points can easily be dismissed as outliers; and in the limit the problem
evaporates because most estimators possess the property of strong
consistency--which means that, almost surely, they converge weakly to any
given G(\theta) as the sample size approaches infinity. However, often these
reasons are inappropriate: the loss function may be sensitive to the distance
between clusters; the small number of outlying data points may actually
represent small clusters; and any practical clustering situation will
necessarily involve finite data.

This paper proposes a new method, based on the idea of local fitting, that
successfully solves the problem. The experimental results presented below
illustrate its superiority to existing methods when small clusters are
present in finite data sets. Moreover, they also suggest that it is more
accurate and stable than other methods even when there are no small clusters.
Existing clustering methods for semi-parametric mixture distributions are
briefly reviewed in the next section. Section 3 identifies a common problem
from which these current methods suffer. Then we present the new solution,
and in Section 5 we describe experiments that illustrate the problem that
has been identified and show how the new method overcomes it.

2 Clustering methods

The general problem of inferring mixture models is treated extensively and
in considerable depth in books by Titterington et al. (1985), McLachlan and
Basford (1988) and Lindsay (1995). For semi-parametric mixture distributions
there are three basic approaches: minimum distance, maximum likelihood, and
Bayesian. We briefly introduce the first approach, which is the one adopted
in the paper, review the other two to show why they are not suitable for
arbitrary mixtures, and then return to the chosen approach and review the
minimum distance estimators for arbitrary semi-parametric mixture
distributions that have been described in the literature.

The idea of the minimum distance method is to define some measure of the
goodness of the clustering and optimize this by suitable choice of a mixing
distribution G_n(\theta) for a sample of size n. We generally want the
estimator to be strongly consistent as n \to \infty, in the sense defined
above, for arbitrary mixing distributions. We also generally want to take
advantage of the special structure of semi-parametric mixtures to come up
with an efficient algorithmic solution.

The maximum likelihood approach maximizes the likelihood (or equivalently
the log-likelihood) of the data by suitable choice of G_n(\theta). It can in
fact be viewed as a minimum distance method that uses the Kullback-Leibler
distance (Titterington et al., 1985). This approach has been widely used for
estimating finite mixtures, particularly when the number of clusters is
fairly small, and it is generally accepted that it is more accurate than
other methods. However, it has not been used to estimate arbitrary
semi-parametric mixtures, presumably because of its high computational cost.
Its speed drops dramatically as the number of parameters that must be
determined increases, which makes it computationally infeasible for arbitrary
mixtures, since each data point might represent a component of the final
distribution with its own parameters.

Bayesian methods assume prior knowledge, often given by some kind of
heuristic, to determine a suitable a priori probability density function.
They are often used to determine the number of components in the final
distribution--particularly when outliers are present. Like the maximum
likelihood approach they are computationally expensive, for they use the
same computational techniques.

We now review existing minimum distance estimators for arbitrary
semi-parametric mixture distributions. We begin with some notation. Let
x_1, \ldots, x_n be a sample chosen according to the mixture distribution,
and suppose (without loss of generality) that the sequence is ordered so
that x_1 \le x_2 \le \cdots \le x_n. Let G_n(\theta) be a discrete estimator
of the underlying mixing distribution with a set of support points at
\{\theta_{nj}; j = 1, \ldots, k_n\}. Each \theta_{nj} provides a component
of the final clustering with weight w_{nj} \ge 0, where
\sum_{j=1}^{k_n} w_{nj} = 1. Given the support points, obtaining G_n(\theta)
is equivalent to computing the weight vector
w_n = (w_{n1}, w_{n2}, \ldots, w_{nk_n})'. Denote by F_{G_n}(x) the estimated
mixture CDF with respect to G_n(\theta).

Two minimum distance estimators were proposed in the late 1960s. Choi and
Bulgren (1968) used

    \frac{1}{n} \sum_{i=1}^{n} [F_{G_n}(x_i) - i/n]^2                     (3)

as the distance measure. Minimizing this quantity with respect to G_n yields
a strongly consistent estimator. A slight improvement is obtained by using
the Cramér-von Mises statistic

    \frac{1}{n} \sum_{i=1}^{n} [F_{G_n}(x_i) - (i - 1/2)/n]^2 + 1/(12n^2), (4)

which essentially replaces i/n in (3) with (i - 1/2)/n without affecting the
asymptotic result. As might be expected, this reduces the bias for
small-sample cases, as was demonstrated empirically by Macdonald (1971) in a
note on Choi and Bulgren's paper.
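
As an illustration, the distance measures (3) and (4) can be computed for a
candidate estimate given as mass points and weights, reusing the mixture_cdf
helper sketched above; the function names and the use of NumPy here are
assumptions made for the sketch, not part of the paper.

    # Sketch of the Choi-Bulgren distance (3) and the Cramér-von Mises
    # statistic (4) for a candidate mixing distribution G_n.
    import numpy as np

    def choi_bulgren_distance(sample, mass_points, weights):
        x = np.sort(np.asarray(sample, dtype=float))
        n = len(x)
        i = np.arange(1, n + 1)
        return np.mean((mixture_cdf(x, mass_points, weights) - i / n) ** 2)

    def cramer_von_mises_distance(sample, mass_points, weights):
        x = np.sort(np.asarray(sample, dtype=float))
        n = len(x)
        i = np.arange(1, n + 1)
        fitted = mixture_cdf(x, mass_points, weights)
        return np.mean((fitted - (i - 0.5) / n) ** 2) + 1.0 / (12 * n ** 2)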

At about the same time, Deely and Kruse (1968) used the sup-norm associated
with the Kolmogorov-Smirnov test. The minimization is over

    \sup_{1 \le i \le n} \{ |F_{G_n}(x_i) - (i-1)/n|, |F_{G_n}(x_i) - i/n| \},  (5)

and this leads to a linear programming problem. Deely and Kruse also
established the strong consistency of their estimator G_n. Ten years later,
this approach was extended by Blum and Susarla (1977) by using any sequence
\{f_n\} of functions which satisfies \sup |f_n - f_G| \to 0 a.s. as
n \to \infty. Each f_n can, for example, be obtained by a kernel-based
density estimator. Blum and Susarla approximated the function f_n by the
overall mixture pdf f_{G_n}, and established the strong consistency of the
estimator G_n under weak conditions.
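
To make the connection to linear programming concrete, the following hedged
sketch minimizes the sup-norm (5) over the weights with scipy.optimize.linprog,
again assuming N(\theta, 1) components; casting the problem this way, and the
helper name deely_kruse_weights, are my own illustration rather than the
original implementation.

    # Minimize t subject to |F_{G_n}(x_i) - (i-1)/n| <= t and
    # |F_{G_n}(x_i) - i/n| <= t, with the weights summing to one.
    import numpy as np
    from scipy.optimize import linprog
    from scipy.stats import norm

    def deely_kruse_weights(sample, support_points):
        x = np.sort(np.asarray(sample, dtype=float))
        theta = np.asarray(support_points, dtype=float)
        n, k = len(x), len(theta)
        F = norm.cdf(x[:, None] - theta[None, :])        # F[i, j] = F(x_i; theta_j)
        targets = np.concatenate([np.arange(n) / n,      # (i-1)/n
                                  np.arange(1, n + 1) / n])  # i/n
        Fs = np.vstack([F, F])
        ones = np.ones((2 * n, 1))
        # Each |F w - target| <= t becomes two linear inequalities.
        A_ub = np.block([[Fs, -ones], [-Fs, -ones]])
        b_ub = np.concatenate([targets, -targets])
        A_eq = np.concatenate([np.ones(k), [0.0]])[None, :]  # weights sum to 1
        c = np.concatenate([np.zeros(k), [1.0]])             # minimize t
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (k + 1))
        return res.x[:k]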

For reasons of simplicity and generality, we will denote the approximation
between two mathematical entities of the same type by \cong, which implies
the minimization with respect to an estimator of a distance measure between
the entities on either side. The types of entity involved in this paper
include vector, function and measure, and we use the same symbol \cong for
each.

In the work reviewed above, two kinds of estimator are used: CDF-based (Choi
and Bulgren, Macdonald, and Deely and Kruse) and pdf-based (Blum and
Susarla). CDF-based estimators involve approximating an empirical
distribution with an estimated one F_{G_n}. We write this as

    F_{G_n} \cong F_n,                                                    (6)

where F_n is the Kolmogorov empirical CDF--or indeed any empirical CDF that
converges to it. Pdf-based estimators involve the approximation between
probability density functions:

    f_{G_n} \cong f_n,                                                    (7)

where f_{G_n} is the estimated mixture pdf and f_n is the empirical pdf
described above.

The entities involved in (6) and (7) are functions. When the approximation
is computed, however, it is computed between vectors that represent the
functions. These vectors contain the function values at a particular set of
points, which we call &quot;fitting points.&quot; In the work reviewed above,
the fitting points are chosen to be the data points themselves.

3 The problem of minority clusters

Although they perform well asymptotically, all the minimum distance methods
described above suffer from the finite-sample problem discussed earlier:
they can neglect small groups of outlying data points no matter how far they
lie from the dominant data points. The underlying reason is that the
objective function to be minimized is defined globally rather than locally.
A global approach means that the value of the estimated probability density
function at a particular place will be influenced by all data points, no
matter how far away they are. This can cause small groups of data points to
be ignored even if they are a long way from the dominant part of the data
sample. From a probabilistic point of view, however, there is no reason to
subsume distant groups within the major clusters just because they are
relatively small.

The ultimate effect of suppressing distant minority clusters depends on how
the clustering is applied. If the application's loss function depends on the
distance between clusters, the result may prove disastrous because there is
no limit to how far away these outlying groups may be. One might argue that
small groups of points can easily be explained away as outliers, because the
effect will become less important as the number of data points increases--and
it will disappear in the limit of infinite data. However, in a finite-data
situation--and all practical applications necessarily involve finite
data--the &quot;outliers&quot; may equally well represent small minority
clusters. Furthermore, outlying data points are not really treated as
outliers by these methods--whether or not they are discarded is merely an
artifact of the global fitting calculation. When clustering, the final
mixture distribution should take all data points into account--including
outlying clusters if any exist. If practical applications demand that small
outlying clusters are suppressed, this should be done in a separate stage.

In distance-based clustering, each data point has a far-reaching effect
because of two global constraints. One is the use of the cumulative
distribution function; the other is the normalization constraint
\sum_{j=1}^{k_n} w_{nj} = 1. These constraints may sacrifice a small number
of data points--at any distance--for a better overall fit to the data as a
whole. Choi and Bulgren (1968), the Cramér-von Mises statistic (Macdonald,
1971), and Deely and Kruse (1968) all enforce both the CDF and the
normalization constraints. Blum and Susarla (1977) drop the CDF, but still
enforce the normalization constraint. The result is that these clustering
methods are only appropriate for finite mixtures without small clusters,
where the risk of suppressing clusters is low.

This paper addresses the general problem of arbitrary mixtures. Of course,
the minority cluster problem exists for all types of mixture--including
finite mixtures. Even here, the maximum likelihood and Bayesian approaches
do not solve the problem, because they both introduce a global normalization
constraint.

4 Solving the minority cluster problem

Now that the source of the problem has been identified, the solution is
clear, at least in principle: drop both the approximation of CDFs, as Blum
and Susarla (1977) do, and the normalization constraint--no matter how
seductive it may seem.

Let G'_n be a discrete function with masses \{w_{nj}\} at \{\theta_{nj}\};
note that we do not require the w_{nj} to sum to one. Since the new method
operates in terms of measures rather than distribution functions, the notion
of approximation is altered to use intervals rather than points. Using the
formulation described in Section 2, we have

    P_{G'_n} \cong P_n,                                                   (8)

where P_{G'_n} is the estimated measure and P_n is the empirical measure.
The intervals over which the approximation takes place are called
&quot;fitting intervals.&quot; Since (8) is not subject to the normalization
constraint, G'_n is not a CDF and P_{G'_n} is not a probability measure.
However, G'_n can be easily converted into a CDF estimator by normalizing it
after equation (8) has been solved.

To define the estimation procedure fully, we need to determine (a) the set
of support points, (b) the set of fitting intervals, (c) the empirical
measure, and (d) the distance measure. Here we discuss these in an intuitive
manner; Wang and Witten (1999) show how to determine them in a way that
guarantees a strongly consistent estimator.

Support points. The support points are usually suggested by the data points
in the sample. For example, if the component distribution F(x; \theta) is
the normal distribution with mean \theta and unit variance, each data point
can be taken as a support point. In fact, the support points are more
accurately described as potential support points, because their associated
weights may become zero after solving (8)--and, in practice, many often do.

Fitting intervals. The fitting intervals are also suggested by the data
points. In the normal distribution example, each data point x_i can provide
one interval, such as [x_i - 3\sigma, x_i], or two, such as
[x_i - 3\sigma, x_i] and [x_i, x_i + 3\sigma], or more. There is no problem
if the fitting intervals overlap. Their length should not be so large that
points can exert an influence on the clustering at an unduly remote place,
nor so small that the empirical measure is inaccurate. The experiments
reported below use intervals of a few standard deviations around each data
point, and, as we will see, this works well.

Empirical measure. The empirical measure can be the probability measure
determined by the Kolmogorov empirical CDF, or any measure that converges to
it. The fitting intervals discussed above can be open, closed, or semi-open.
This will affect the empirical measure if data points are used as interval
boundaries, although it does not change the values of the estimated measure
because the corresponding distribution is continuous. In small-sample
situations, bias can be reduced by careful attention to this detail--as
Macdonald (1971) discusses with respect to Choi and Bulgren's (1968) method.

Distance measure. The choice of distance measure determines what kind of
mathematical programming problem must be solved. For example, a quadratic
distance will give rise to a least squares problem under linear constraints,
whereas the sup-norm gives rise to a linear programming problem that can be
solved using the simplex method. These two measures have efficient solutions
that are globally optimal.
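
As a concrete illustration of the quadratic-distance case, the sketch below
builds the matrix of component probabilities over the fitting intervals and
solves the resulting non-negative least squares problem with
scipy.optimize.nnls, which implements the Lawson and Hanson algorithm cited
later in the paper. Normal components with unit variance, the helper name
estimate_weights, and the half-weight treatment of boundary points are
assumptions made for the sketch.

    # Sketch of (8) with a quadratic distance: A w ~= P subject to w >= 0,
    # where A[m, j] is the probability a component at theta_j assigns to
    # fitting interval m and P[m] is the empirical measure of that interval.
    import numpy as np
    from scipy.optimize import nnls
    from scipy.stats import norm

    def estimate_weights(sample, support_points, intervals):
        x = np.asarray(sample, dtype=float)
        theta = np.asarray(support_points, dtype=float)
        lo = np.array([a for a, b in intervals], dtype=float)
        hi = np.array([b for a, b in intervals], dtype=float)
        A = norm.cdf(hi[:, None] - theta[None, :]) - \
            norm.cdf(lo[:, None] - theta[None, :])
        # Empirical measure: boundary points are counted as half a point.
        inside = (x[None, :] > lo[:, None]) & (x[None, :] < hi[:, None])
        boundary = (x[None, :] == lo[:, None]) | (x[None, :] == hi[:, None])
        P = (inside.sum(axis=1) + 0.5 * boundary.sum(axis=1)) / len(x)
        weights, _ = nnls(A, P)   # unnormalized weights of G'_n
        return weights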

It is worth pointing out that abandoning the global constraints associated
with both CDFs and normalization brings with it a computational advantage.
In vector form, we write P_{G'_n} = A_{G'_n} w_n, where w_n is the
(unnormalized) weight vector and each element of the matrix A_{G'_n} is the
probability value of a component distribution over a fitting interval. Then,
provided the support points corresponding to w'_n and w''_n lie outside each
other's sphere of influence as determined by the component distributions
F(x; \theta), the estimation procedure becomes

    \begin{pmatrix} A'_{G'_n} & 0 \\ 0 & A''_{G'_n} \end{pmatrix}
    \begin{pmatrix} w'_n \\ w''_n \end{pmatrix} \cong
    \begin{pmatrix} P'_n \\ P''_n \end{pmatrix},                          (9)

subject to w'_n \ge 0 and w''_n \ge 0. This is the same as combining the
solutions of the two sub-equations A'_n w'_n \cong P'_n subject to
w'_n \ge 0, and A''_n w''_n \cong P''_n subject to w''_n \ge 0. If the
relevant support points continue to lie outside each other's sphere of
influence, the sub-equations can be further partitioned. This implies that
when data points are sufficiently far apart, the mixing distribution G can
be estimated by grouping data points in different regions. Moreover, the
solution in each region can be normalized separately before the solutions
are combined, which yields a better estimation of the mixing distribution.
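
A short sketch of this decomposition follows, under the assumption that
well-separated regions can be found by a simple gap threshold and that the
per-region solutions are recombined in proportion to the number of points in
each region; both choices are illustrative, not prescribed by the paper, and
the sketch reuses the estimate_weights helper given above.

    # Solve (9) block by block: estimate weights within each well-separated
    # region, normalize them there, then combine the regional solutions.
    import numpy as np

    def estimate_by_regions(sample, gap=10.0, halfwidth=3.0):
        x = np.sort(np.asarray(sample, dtype=float))
        regions = np.split(x, np.where(np.diff(x) > gap)[0] + 1)
        support, weights = [], []
        for region in regions:
            intervals = [(xi - halfwidth, xi) for xi in region] + \
                        [(xi, xi + halfwidth) for xi in region]
            w = estimate_weights(region, region, intervals)
            w = w / w.sum() * (len(region) / len(x))  # normalize per region
            support.extend(region)
            weights.extend(w)
        return np.array(support), np.array(weights)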

If the normalization constraint \sum_{j=1}^{k_n} w_{nj} = 1 is retained when
estimating the mixing distribution, the estimation procedure becomes

    P_{G_n} \cong P_n,                                                    (10)

where the estimator G_n is a discrete CDF on \Theta. This constraint is
necessary for the left-hand side of (10) to be a probability measure.
Although he did not develop an operational estimation scheme, Barbe (1998)
suggested exploiting the fact that the empirical probability measure is
approximated by the estimated probability measure--but he retained the
normalization constraint. As noted above, relaxing the constraint loosens
the throttling effect that large clusters exert on small groups of outliers,
and our experimental results show that the estimator obtained with the
constraint retained suffers from the drawback noted earlier.

Both estimators, G_n obtained from (10) and G'_n from (8), have been shown
to be strongly consistent under weak conditions similar to those used by
others (Wang &amp; Witten, 1999). Of course, the weak convergence of G'_n is
in the sense of general functions, not CDFs. The strong consistency of G'_n
immediately implies the strong consistency of the CDF estimator obtained by
normalizing G'_n.

5 Experimental validation

We have conducted experiments to illustrate the failure of existing methods
to detect small outlying clusters, and the improvement achieved by the new
scheme. The results also suggest that the new method is more accurate and
stable than the others.

When comparing clustering methods, it is not always easy to evaluate the
clusters obtained. To finesse this problem we consider simple artificial
situations in which the proper outcome is clear. Some practical applications
of clusters do provide objective evaluation functions; however, these are
beyond the scope of this paper.

The methods used are Choi and Bulgren (1968) (denoted choi), Macdonald's
application of the Cramér-von Mises statistic (cramér), the new method with
the normalization constraint (test), and the new method without that
constraint (new). In each case, equations involving non-negativity and/or
linear equality constraints are solved as quadratic programming problems
using the elegant and efficient procedures nnls and lsei provided by Lawson
and Hanson (1974). All four methods have the same computational time
complexity.

We set the sample size n to 100 throughout the experiments. The data points
are artificially generated from a mixture of two clusters: n_1 points from
N(0, 1) and n_2 points from N(100, 1). The values of n_1 and n_2 are in the
ratios 99:1, 97:3, 93:7, 80:20 and 50:50.

Every data point is taken as a potential support point in all four methods:
thus the number of potential components in the clustering is 100. For test
and new, fitting intervals need to be determined. In the experiments, each
data point x_i provides the two fitting intervals [x_i - 3, x_i] and
[x_i, x_i + 3]. Any data point located on the boundary of an interval is
counted as half a point when determining the empirical measure over that
interval.
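
The following sketch reproduces this setup for a single trial, reusing the
estimate_weights function sketched earlier; the function name run_trial and
the fixed random seed are illustrative assumptions.

    # One trial of the experiment: n = 100 points from a two-component
    # mixture, every data point a potential support point, and fitting
    # intervals [x_i - 3, x_i] and [x_i, x_i + 3].
    import numpy as np

    def run_trial(n1, n2, seed=0):
        rng = np.random.default_rng(seed)
        x = np.concatenate([rng.normal(0.0, 1.0, n1),
                            rng.normal(100.0, 1.0, n2)])
        intervals = [(xi - 3.0, xi) for xi in x] + [(xi, xi + 3.0) for xi in x]
        weights = estimate_weights(x, support_points=x, intervals=intervals)
        return x, weights

    x, weights = run_trial(n1=93, n2=7)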

These choices are admittedly crude, and further improvements in the accuracy
and speed of test and new are possible that take advantage of the
flexibility provided by (10) and (8). For example, accuracy will likely
increase with more--and more carefully chosen--support points and fitting
intervals. The fact that it performs well even with crudely chosen support
points and fitting intervals testifies to the robustness of the method.

Our primary interest in this experiment is the weights of the clusters that
are found. To cast the results in terms of the underlying models, we use the
cluster weights to estimate values for n_1 and n_2. Of course, the results
often do not contain exactly two clusters--but because the underlying
cluster centres, 0 and 100, are well separated compared to their standard
deviation of 1, it is highly unlikely that any data points from one cluster
will fall anywhere near the other. Thus we use a threshold of 50 to divide
the clusters into two groups: those near 0 and those near 100. The final
cluster weights are normalized, and the weights for the first group are
summed to obtain an estimate \hat{n}_1 of n_1, while those for the second
group are summed to give an estimate \hat{n}_2 of n_2.
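
In code, this evaluation step might look like the sketch below, which
normalizes the weights, splits the support points at the threshold of 50,
and scales by the sample size so that the estimates are comparable with n_1
and n_2; the helper name estimate_counts is an illustrative assumption, and
x and weights come from the trial sketched above.

    # Turn the estimated cluster weights into estimates of n_1 and n_2.
    import numpy as np

    def estimate_counts(support_points, weights, n=100, threshold=50.0):
        theta = np.asarray(support_points, dtype=float)
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                      # normalize the final weights
        n1_hat = n * w[theta < threshold].sum()
        n2_hat = n * w[theta >= threshold].sum()
        return n1_hat, n2_hat

    n1_hat, n2_hat = estimate_counts(x, weights)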

Table 1 shows results for each of the four methods. Each cell represents one
hundred separate experimental runs. Three figures are recorded. At the top
is the number of times the method failed to detect the smaller cluster, that
is, the number of times \hat{n}_2 = 0. In the middle are the average values
for \hat{n}_1 and \hat{n}_2. At the bottom is the standard deviation of
\hat{n}_1 and \hat{n}_2 (which are equal). These three figures can be
thought of as measures of reliability, accuracy and stability respectively.

The top figures in Table 1 show clearly that only new is always reliable in
the sense that it never fails to detect the smaller cluster. The other
methods fail mostly when n_2 = 1; their failure rate gradually decreases as
n_2 grows. The center figures show that, under all conditions, new gives a
more accurate estimate of the correct values of n_1 and n_2 than the other
methods. As expected, cramér shows a noticeable improvement over choi, but
it is very minor. The test method has lower failure rates and produces
estimates that are more accurate and far more stable (indicated by the
bottom figures) than those for choi and cramér--presumably because it is
less constrained. Of the four methods, new is clearly and consistently the
winner in terms of all three measures: reliability, accuracy and stability.

                               n_1 = 99   n_1 = 97   n_1 = 93   n_1 = 80    n_1 = 50
                               n_2 = 1    n_2 = 3    n_2 = 7    n_2 = 20    n_2 = 50

  choi    Failures             86         42         4          0           0
          \hat{n}_1/\hat{n}_2  99.9/0.1   99.2/0.8   95.8/4.2   82.0/18.0   50.6/49.4
          SD(\hat{n}_1)        0.36       0.98       1.71       1.77        1.30

  cramér  Failures             80         31         1          0           0
          \hat{n}_1/\hat{n}_2  99.8/0.2   98.6/1.4   95.1/4.9   81.6/18.4   49.7/50.3
          SD(\hat{n}_1)        0.50       1.13       1.89       1.80        1.31

  test    Failures             52         5          0          0           0
          \hat{n}_1/\hat{n}_2  99.8/0.2   98.2/1.8   94.1/5.9   80.8/19.2   50.1/49.9
          SD(\hat{n}_1)        0.32       0.83       0.87       0.78        0.55

  new     Failures             0          0          0          0           0
          \hat{n}_1/\hat{n}_2  99.0/1.0   96.9/3.1   92.8/7.2   79.9/20.1   50.1/49.9
          SD(\hat{n}_1)        0.01       0.16       0.19       0.34        0.41

Table 1: Experimental results for detecting small clusters

The results of the new method can be further improved. If the decomposed
form (9) is used instead of (8), and the solutions of the sub-equations are
normalized before combining them--which is feasible because the two
underlying clusters are so distant from each other--the correct values are
obtained for \hat{n}_1 and \hat{n}_2 in virtually every trial.

6 Conclusions

We have identified a shortcoming of existing clustering methods for
arbitrary semi-parametric mixture distributions: they fail to detect very
small clusters reliably. This is a significant weakness when the minority
clusters are far from the dominant ones and the loss function takes account
of the distance of misclustered points.

We have described a new clustering method for arbitrary semi-parametric
mixture distributions, and shown experimentally that it overcomes the
problem. Furthermore, the experiments suggest that the new estimator is more
accurate and more stable than existing ones.

References

Barbe, P. (1998). Statistical analysis of mixtures and the empirical
probability measure. Acta Applicandae Mathematicae, 50(3), 253-340.

Blum, J. R. &amp; Susarla, V. (1977). Estimation of a mixing distribution
function. Ann. Probab., 5, 200-209.

Choi, K. &amp; Bulgren, W. B. (1968). An estimation procedure for mixtures
of distributions. J. R. Statist. Soc. B, 30, 444-460.

Deely, J. J. &amp; Kruse, R. L. (1968). Construction of sequences estimating
the mixing distribution. Ann. Math. Statist., 39, 286-288.

Lawson, C. L. &amp; Hanson, R. J. (1974). Solving Least Squares Problems.
Prentice-Hall, Inc.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications,
Volume 5 of NSF-CBMS Regional Conference Series in Probability and
Statistics. Institute for Mathematical Statistics: Hayward, CA.

Macdonald, P. D. M. (1971). Comment on a paper by Choi and Bulgren. J. R.
Statist. Soc. B, 33, 326-329.

McLachlan, G. &amp; Basford, K. (1988). Mixture Models: Inference and
Applications to Clustering. Marcel Dekker, New York.

Titterington, D. M., Smith, A. F. M. &amp; Makov, U. E. (1985). Statistical
Analysis of Finite Mixture Distributions. John Wiley &amp; Sons.

Wang, Y. &amp; Witten, I. H. (1999). The estimation of mixing distributions
by approximating empirical measures. Technical Report (in preparation),
Dept. of Computer Science, University of Waikato, New Zealand.
&lt;/pre&gt;</Content>
</Section>
</Archive>