<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Title">Clustering with finite data from semi-parametric mixture distributions</Metadata>
<Metadata name="gsdlsourcefilename">import/cluster.ps</Metadata>
<Metadata name="gsdlconvertedfilename">tmp/1378708752/cluster.text</Metadata>
<Metadata name="OrigSource">cluster.text</Metadata>
<Metadata name="Source">cluster.ps</Metadata>
<Metadata name="SourceFile">cluster.ps</Metadata>
<Metadata name="Plugin">PostScriptPlugin</Metadata>
<Metadata name="FileSize">94721</Metadata>
<Metadata name="FilenameRoot">cluster</Metadata>
<Metadata name="FileFormat">PS</Metadata>
<Metadata name="srcicon">_iconps_</Metadata>
<Metadata name="srclink_file">doc.ps</Metadata>
<Metadata name="srclinkFile">doc.ps</Metadata>
<Metadata name="dc.Creator">Yong Wang</Metadata>
<Metadata name="dc.Creator">Ian H. Witten</Metadata>
<Metadata name="dc.Title">Clustering with finite data from semi-parametric mixture distributions</Metadata>
<Metadata name="Identifier">HASH015936f516ed4b1d7b050af9</Metadata>
<Metadata name="lastmodified">1378708193</Metadata>
<Metadata name="lastmodifieddate">20130909</Metadata>
<Metadata name="oailastmodified">1378708753</Metadata>
<Metadata name="oailastmodifieddate">20130909</Metadata>
<Metadata name="assocfilepath">HASH0159.dir</Metadata>
<Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
</Description>
<Content><pre>

Clustering with finite data from semi-parametric mixture distributions

Yong Wang                                 Ian H. Witten
Computer Science Department               Computer Science Department
University of Waikato, New Zealand        University of Waikato, New Zealand
Email: [email protected]                   Email: [email protected]

Abstract

Existing clustering methods for the semi-parametric mixture distribution
perform well as the volume of data increases. However, they all suffer from
a serious drawback in finite-data situations: small outlying groups of data
points can be completely ignored in the clusters that are produced, no matter
how far away they lie from the major clusters. This can result in unbounded
loss if the loss function is sensitive to the distance between clusters.

This paper proposes a new distance-based clustering method that overcomes
the problem by avoiding global constraints. Experimental results illustrate
its superiority to existing methods when small clusters are present in finite
data sets; they also suggest that it is more accurate and stable than other
methods even when there are no small clusters.

1 Introduction

A common practical problem is to fit an underlying statistical distribution
to a sample. In some applications, this involves estimating the parameters
of a single distribution function--e.g. the mean and variance of a normal
distribution. In others, an appropriate mixture of elementary distributions
must be found--e.g. a set of normal distributions, each with its own mean
and variance. Among many kinds of mixture distribution, one in particular is
attracting increasing research attention because it has many practical
applications: the semi-parametric mixture distribution.

A semi-parametric mixture distribution is one whose cumulative distribution
function (CDF) has the form

    F_G(x) = \int_\Theta F(x; \theta) \, dG(\theta),                    (1)

where $\theta \in \Theta$, the parameter space, and $x \in X$, the sample
space. This gives the CDF of the mixture distribution $F_G(x)$ in terms of
two more elementary distributions: the component distribution $F(x; \theta)$,
which is given, and the mixing distribution $G(\theta)$, which is unknown.
The former has a single unknown parameter $\theta$, while the latter gives a
CDF for $\theta$. For example, $F(x; \theta)$ might be the normal
distribution with mean $\theta$ and unit variance, where $\theta$ is a
random variable distributed according to $G(\theta)$.

The problem that we will address is the estimation of $G(\theta)$ from
sampled data that are independent and identically distributed according to
the unknown distribution $F_G(x)$. Once $G(\theta)$ has been obtained, it is
a straightforward matter to obtain the mixture distribution.

The CDF $G(\theta)$ can be either continuous or discrete. In the latter
case, $G(\theta)$ is composed of a number of mass points, say,
$\theta_1, \ldots, \theta_k$ with masses $w_1, \ldots, w_k$ respectively,
satisfying $\sum_{i=1}^{k} w_i = 1$. Then (1) can be re-written as

    F_G(x) = \sum_{i=1}^{k} w_i F(x; \theta_i),                         (2)

each mass point providing a component, or cluster, in the mixture with the
corresponding weight. If the number of components $k$ is finite and known a
priori, the mixture distribution is called finite; otherwise it is treated
as countably infinite. The qualifier "countably" is necessary to distinguish
this case from the situation with continuous $G(\theta)$, which is also
infinite.

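To make the discrete form (2) concrete, here is a minimal sketch (ours, not
part of the paper) that evaluates $F_G(x)$ for a discrete mixing
distribution with unit-variance normal components; the support points and
weights are purely illustrative.

# Sketch (not from the paper): the mixture CDF of equation (2) for a discrete
# mixing distribution G with mass points theta_i and weights w_i, using
# normal components F(x; theta) with mean theta and unit variance.
import numpy as np
from scipy.stats import norm

def mixture_cdf(x, support, weights):
    """F_G(x) = sum_i w_i * F(x; theta_i) with N(theta_i, 1) components."""
    x = np.asarray(x, dtype=float)
    return sum(w * norm.cdf(x, loc=t) for t, w in zip(support, weights))

# Illustrative two-component mixture: 90% of the mass near 0, 10% near 100.
print(mixture_cdf([0.0, 50.0, 100.0], support=[0.0, 100.0], weights=[0.9, 0.1]))
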
We will focus on the estimation of arbitrary mixing distributions, i.e.,
$G(\theta)$ is any general probability distribution--finite, countably
infinite or continuous. A few methods for tackling this problem can be found
in the literature. However, as we shall see, they all suffer from a serious
drawback in finite-data situations: small outlying groups of data points can
be completely ignored in the clusters that are produced.

This phenomenon seems to have been overlooked, presumably for three reasons:
small amounts of data may be assumed to represent a small loss; a few data
points can easily be dismissed as outliers; and in the limit the problem
evaporates because most estimators possess the property of strong
consistency--which means that, almost surely, they converge weakly to any
given $G(\theta)$ as the sample size approaches infinity. However, often
these reasons are inappropriate: the loss function may be sensitive to the
distance between clusters; the small number of outlying data points may
actually represent small clusters; and any practical clustering situation
will necessarily involve finite data.

This paper proposes a new method, based on the idea of local fitting, that
successfully solves the problem. The experimental results presented below
illustrate its superiority to existing methods when small clusters are
present in finite data sets. Moreover, they also suggest that it is more
accurate and stable than other methods even when there are no small
clusters. Existing clustering methods for semi-parametric mixture
distributions are briefly reviewed in the next section. Section 3 identifies
a common problem from which these current methods suffer. Then we present
the new solution in Section 4, and in Section 5 we describe experiments that
illustrate the problem that has been identified and show how the new method
overcomes it.

2 Clustering methods

The general problem of inferring mixture models is treated extensively and
in considerable depth in books by Titterington et al. (1985), McLachlan and
Basford (1988) and Lindsay (1995). For semi-parametric mixture distributions
there are three basic approaches: minimum distance, maximum likelihood, and
Bayesian. We briefly introduce the first approach, which is the one adopted
in the paper, review the other two to show why they are not suitable for
arbitrary mixtures, and then return to the chosen approach and review the
minimum distance estimators for arbitrary semi-parametric mixture
distributions that have been described in the literature.

The idea of the minimum distance method is to define some measure of the
goodness of the clustering and optimize this by suitable choice of a mixing
distribution $G_n(\theta)$ for a sample of size $n$. We generally want the
estimator to be strongly consistent as $n \to \infty$, in the sense defined
above, for arbitrary mixing distributions. We also generally want to take
advantage of the special structure of semi-parametric mixtures to come up
with an efficient algorithmic solution.

The maximum likelihood approach maximizes the likelihood (or equivalently
the log-likelihood) of the data by suitable choice of $G_n(\theta)$. It can
in fact be viewed as a minimum distance method that uses the
Kullback-Leibler distance (Titterington et al., 1985). This approach has
been widely used for estimating finite mixtures, particularly when the
number of clusters is fairly small, and it is generally accepted that it is
more accurate than other methods. However, it has not been used to estimate
arbitrary semi-parametric mixtures, presumably because of its high
computational cost. Its speed drops dramatically as the number of parameters
that must be determined increases, which makes it computationally infeasible
for arbitrary mixtures, since each data point might represent a component of
the final distribution with its own parameters.

Bayesian methods assume prior knowledge, often given by some kind of
heuristic, to determine a suitable a priori probability density function.
They are often used to determine the number of components in the final
distribution--particularly when outliers are present. Like the maximum
likelihood approach they are computationally expensive, for they use the
same computational techniques.

We now review existing minimum distance estimators for arbitrary
semi-parametric mixture distributions. We begin with some notation. Let
$x_1, \ldots, x_n$ be a sample chosen according to the mixture distribution,
and suppose (without loss of generality) that the sequence is ordered so
that $x_1 \le x_2 \le \cdots \le x_n$. Let $G_n(\theta)$ be a discrete
estimator of the underlying mixing distribution with a set of support points
at $\{\theta_{nj}; j = 1, \ldots, k_n\}$. Each $\theta_{nj}$ provides a
component of the final clustering with weight $w_{nj} \ge 0$, where
$\sum_{j=1}^{k_n} w_{nj} = 1$. Given the support points, obtaining
$G_n(\theta)$ is equivalent to computing the weight vector
$w_n = (w_{n1}, w_{n2}, \ldots, w_{nk_n})'$. Denote by $F_{G_n}(x)$ the
estimated mixture CDF with respect to $G_n(\theta)$.

Two minimum distance estimators were proposed in the late 1960s. Choi and
Bulgren (1968) used

    \frac{1}{n} \sum_{i=1}^{n} [F_{G_n}(x_i) - i/n]^2                   (3)

as the distance measure. Minimizing this quantity with respect to $G_n$
yields a strongly consistent estimator. A slight improvement is obtained by
using the Cramér-von Mises statistic

    \frac{1}{n} \sum_{i=1}^{n} [F_{G_n}(x_i) - (i - 1/2)/n]^2 + 1/(12n^2),  (4)

which essentially replaces $i/n$ in (3) with $(i - 1/2)/n$ without affecting
the asymptotic result. As might be expected, this reduces the bias for
small-sample cases, as was demonstrated empirically by Macdonald (1971) in a
note on Choi and Bulgren's paper.

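As an illustration of how such criteria are evaluated, the following sketch
(ours, not from the paper) computes the distances (3) and (4) for a
candidate discrete estimator $G_n$ with unit-variance normal components; all
function names are our own.

# Sketch (ours): the Choi-Bulgren distance (3) and the Cramer-von Mises
# statistic (4) for a candidate discrete G_n with N(theta, 1) components.
import numpy as np
from scipy.stats import norm

def estimated_cdf(x, support, weights):
    return sum(w * norm.cdf(np.asarray(x, float), loc=t)
               for t, w in zip(support, weights))

def choi_bulgren(x, support, weights):
    x = np.sort(np.asarray(x, float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = estimated_cdf(x, support, weights)
    return np.mean((F - i / n) ** 2)                            # equation (3)

def cramer_von_mises(x, support, weights):
    x = np.sort(np.asarray(x, float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = estimated_cdf(x, support, weights)
    return np.mean((F - (i - 0.5) / n) ** 2) + 1 / (12 * n**2)  # equation (4)
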
At about the same time, Deely and Kruse (1968) used the sup-norm associated
with the Kolmogorov-Smirnov test. The minimization is over

    \sup_{1 \le i \le n} \{ |F_{G_n}(x_i) - (i-1)/n|, |F_{G_n}(x_i) - i/n| \},  (5)

and this leads to a linear programming problem. Deely and Kruse also
established the strong consistency of their estimator $G_n$. Ten years
later, this approach was extended by Blum and Susarla (1977) by using any
sequence $\{f_n\}$ of functions which satisfies $\sup |f_n - f_G| \to 0$
a.s. as $n \to \infty$. Each $f_n$ can, for example, be obtained by a
kernel-based density estimator. Blum and Susarla approximated the function
$f_n$ by the overall mixture pdf $f_{G_n}$, and established the strong
consistency of the estimator $G_n$ under weak conditions.

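To illustrate the linear programming formulation of the sup-norm criterion
(5), here is a sketch of ours (not from the paper): the weight vector and
the sup-norm value are the LP variables, and scipy.optimize.linprog stands
in for whatever solver the original authors used.

# Sketch (ours): the Deely-Kruse sup-norm fit (5) as a linear program over
# z = (w, t), where t bounds the sup-norm, with N(theta, 1) components.
import numpy as np
from scipy.optimize import linprog
from scipy.stats import norm

def deely_kruse_fit(x, support):
    x = np.sort(np.asarray(x, float))
    n, k = len(x), len(support)
    A = np.array([[norm.cdf(xi, loc=t) for t in support] for xi in x])
    lo = (np.arange(1, n + 1) - 1.0) / n          # (i - 1)/n
    hi = np.arange(1, n + 1) / n                  # i/n
    ones = np.ones((n, 1))
    # |A w - lo| <= t and |A w - hi| <= t, rewritten as A_ub z <= b_ub.
    A_ub = np.vstack([np.hstack([A, -ones]), np.hstack([-A, -ones]),
                      np.hstack([A, -ones]), np.hstack([-A, -ones])])
    b_ub = np.concatenate([lo, -lo, hi, -hi])
    c = np.zeros(k + 1); c[-1] = 1.0              # minimize t
    A_eq = np.append(np.ones(k), 0.0).reshape(1, -1)   # sum(w) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (k + 1))
    return res.x[:k], res.x[-1]                   # weights and achieved sup-norm
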
For reasons of simplicity and generality, we will denote the approximation
between two mathematical entities of the same type by $\approx$, which
implies the minimization with respect to an estimator of a distance measure
between the entities on either side. The types of entity involved in this
paper include vector, function and measure, and we use the same symbol
$\approx$ for each.

In the work reviewed above, two kinds of estimator are used: CDF-based (Choi
and Bulgren, Macdonald, and Deely and Kruse) and pdf-based (Blum and
Susarla). CDF-based estimators involve approximating an empirical
distribution with an estimated one $F_{G_n}$. We write this as

    F_{G_n} \approx F_n,                                                (6)

where $F_n$ is the Kolmogorov empirical CDF--or indeed any empirical CDF
that converges to it. Pdf-based estimators involve the approximation between
probability density functions:

    f_{G_n} \approx f_n,                                                (7)

where $f_{G_n}$ is the estimated mixture pdf and $f_n$ is the empirical pdf
described above.

The entities involved in (6) and (7) are functions. When the approximation
is computed, however, it is computed between vectors that represent the
functions. These vectors contain the function values at a particular set of
points, which we call "fitting points." In the work reviewed above, the
fitting points are chosen to be the data points themselves.

3 The problem of minority clusters

Although they perform well asymptotically, all the minimum distance methods
described above suffer from the finite-sample problem discussed earlier:
they can neglect small groups of outlying data points no matter how far they
lie from the dominant data points. The underlying reason is that the
objective function to be minimized is defined globally rather than locally.
A global approach means that the value of the estimated probability density
function at a particular place will be influenced by all data points, no
matter how far away they are. This can cause small groups of data points to
be ignored even if they are a long way from the dominant part of the data
sample. From a probabilistic point of view, however, there is no reason to
subsume distant groups within the major clusters just because they are
relatively small.

The ultimate effect of suppressing distant minority clusters depends on how
the clustering is applied. If the application's loss function depends on the
distance between clusters, the result may prove disastrous because there is
no limit to how far away these outlying groups may be. One might argue that
small groups of points can easily be explained away as outliers, because the
effect will become less important as the number of data points increases--and
it will disappear in the limit of infinite data. However, in a finite-data
situation--and all practical applications necessarily involve finite
data--the "outliers" may equally well represent small minority clusters.
Furthermore, outlying data points are not really treated as outliers by
these methods--whether or not they are discarded is merely an artifact of
the global fitting calculation. When clustering, the final mixture
distribution should take all data points into account--including outlying
clusters if any exist. If practical applications demand that small outlying
clusters are suppressed, this should be done in a separate stage.

In distance-based clustering, each data point has a far-reaching effect
because of two global constraints. One is the use of the cumulative
distribution function; the other is the normalization constraint
$\sum_{j=1}^{k_n} w_{nj} = 1$. These constraints may sacrifice a small
number of data points--at any distance--for a better overall fit to the data
as a whole. Choi and Bulgren (1968), the Cramér-von Mises statistic
(Macdonald, 1971), and Deely and Kruse (1968) all enforce both the CDF and
the normalization constraints. Blum and Susarla (1977) drop the CDF, but
still enforce the normalization constraint. The result is that these
clustering methods are only appropriate for finite mixtures without small
clusters, where the risk of suppressing clusters is low.

This paper addresses the general problem of arbitrary mixtures. Of course,
the minority cluster problem exists for all types of mixture--including
finite mixtures. Even here, the maximum likelihood and Bayesian approaches
do not solve the problem, because they both introduce a global normalization
constraint.

4 Solving the minority cluster problem

Now that the source of the problem has been identified, the solution is
clear, at least in principle: drop both the approximation of CDFs, as Blum
and Susarla (1977) do, and the normalization constraint--no matter how
seductive it may seem.

Let $G'_n$ be a discrete function with masses $\{w_{nj}\}$ at
$\{\theta_{nj}\}$; note that we do not require the $w_{nj}$ to sum to one.
Since the new method operates in terms of measures rather than distribution
functions, the notion of approximation is altered to use intervals rather
than points. Using the formulation described in Section 2, we have

    P_{G'_n} \approx P_n,                                               (8)

where $P_{G'_n}$ is the estimated measure and $P_n$ is the empirical
measure. The intervals over which the approximation takes place are called
"fitting intervals." Since (8) is not subject to the normalization
constraint, $G'_n$ is not a CDF and $P_{G'_n}$ is not a probability measure.
However, $G'_n$ can easily be converted into a CDF estimator by normalizing
it after equation (8) has been solved.

To define the estimation procedure fully, we need to determine (a) the set
of support points, (b) the set of fitting intervals, (c) the empirical
measure, and (d) the distance measure. Here we discuss these in an intuitive
manner; Wang and Witten (1999) show how to determine them in a way that
guarantees a strongly consistent estimator.

Support points. The support points are usually suggested by the data points
in the sample. For example, if the component distribution $F(x; \theta)$ is
the normal distribution with mean $\theta$ and unit variance, each data
point can be taken as a support point. In fact, the support points are more
accurately described as potential support points, because their associated
weights may become zero after solving (8)--and, in practice, many often do.

Fitting intervals. The fitting intervals are also suggested by the data
points. In the normal distribution example, each data point $x_i$ can
provide one interval, such as $[x_i - 3\sigma, x_i]$, or two, such as
$[x_i - 3\sigma, x_i]$ and $[x_i, x_i + 3\sigma]$, or more. There is no
problem if the fitting intervals overlap. Their length should not be so
large that points can exert an influence on the clustering at an unduly
remote place, nor so small that the empirical measure is inaccurate. The
experiments reported below use intervals of a few standard deviations around
each data point, and, as we will see, this works well.

---|
392 | Empirical measure. The empirical measure can be the probability measure determined
|
---|
393 | by the Kolmogorov empirical CDF, or any measure that converges to it. The
|
---|
394 | fitting intervals discussed above can be open, closed, or semi-open. This
|
---|
395 | will affect the empirical measure if data points are used as interval boundaries,
|
---|
396 | although it does not change the values of the estimated measure because the
|
---|
397 | corresponding distribution is continuous. In smallsample situations, bias
|
---|
398 | can be reduced by careful attention to this detail--as Macdonald (1971) discusses
|
---|
399 | with respect to Choi and Bulgren's (1968) method.
|
---|
400 |
|
---|
Distance measure. The choice of distance measure determines what kind of
mathematical programming problem must be solved. For example, a quadratic
distance will give rise to a least squares problem under linear constraints,
whereas the sup-norm gives rise to a linear programming problem that can be
solved using the simplex method. These two measures have efficient solutions
that are globally optimal.

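For instance, with a quadratic distance the unnormalized fit (8) reduces to
a non-negative least squares problem. The sketch below is ours: the support
points, fitting intervals and empirical measure follow the intuitive choices
above (it omits the half-point boundary rule used in Section 5), and it uses
scipy.optimize.nnls, a port of the Lawson and Hanson routine mentioned
later, as the solver.

# Sketch (ours) of the new estimator (8) with a quadratic distance and
# N(theta, 1) components: the weights solve a non-negative least squares
# problem with no sum-to-one constraint.
import numpy as np
from scipy.optimize import nnls
from scipy.stats import norm

def fit_unnormalized_weights(x, half_width=3.0):
    x = np.sort(np.asarray(x, float))
    support = x                        # every data point is a potential support point
    intervals = [(xi - half_width, xi) for xi in x] + \
                [(xi, xi + half_width) for xi in x]
    # A[r, j] = probability assigned by component N(theta_j, 1) to interval r.
    A = np.array([[norm.cdf(b, loc=t) - norm.cdf(a, loc=t) for t in support]
                  for (a, b) in intervals])
    # Empirical measure: fraction of the data points falling in each interval.
    P = np.array([np.mean((x >= a) & (x <= b)) for (a, b) in intervals])
    w, _ = nnls(A, P)                  # w >= 0; normalize afterwards if a CDF is wanted
    return support, w
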
It is worth pointing out that abandoning the global constraints associated
with both CDFs and normalization brings with it a computational advantage.
In vector form, we write $P_{G'_n} = A_{G'_n} w_n$, where $w_n$ is the
(unnormalized) weight vector and each element of the matrix $A_{G'_n}$ is
the probability value of a component distribution over a fitting interval.
Then, provided the support points corresponding to $w'_n$ and $w''_n$ lie
outside each other's sphere of influence as determined by the component
distributions $F(x; \theta)$, the estimation procedure becomes

    \begin{pmatrix} A'_{G'_n} & 0 \\ 0 & A''_{G'_n} \end{pmatrix}
    \begin{pmatrix} w'_n \\ w''_n \end{pmatrix}
    \approx
    \begin{pmatrix} P'_n \\ P''_n \end{pmatrix},                        (9)

subject to $w'_n \ge 0$ and $w''_n \ge 0$. This is the same as combining the
solutions of two sub-equations, $A'_{G'_n} w'_n \approx P'_n$ subject to
$w'_n \ge 0$, and $A''_{G'_n} w''_n \approx P''_n$ subject to
$w''_n \ge 0$. If the relevant support points continue to lie outside each
other's sphere of influence, the sub-equations can be further partitioned.
This implies that when data points are sufficiently far apart, the mixing
distribution $G$ can be estimated by grouping data points in different
regions. Moreover, the solution in each region can be normalized separately
before the regions are combined, which yields a better estimate of the
mixing distribution.

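The following sketch (ours) illustrates one way this regional decomposition
might be carried out: split the sorted data at large gaps, fit each region
with any per-region solver (for instance the non-negative least squares fit
sketched earlier), and renormalize each regional solution by its share of
the data before combining. The gap threshold and the renormalization rule
are our assumptions, not prescriptions from the paper.

# Sketch (ours) of the decomposition behind (9): solve well-separated regions
# independently and normalize each regional solution before combining.
import numpy as np

def split_regions(x, gap=10.0):
    """Split the sorted data wherever consecutive points are more than `gap` apart."""
    x = np.sort(np.asarray(x, float))
    cuts = np.where(np.diff(x) > gap)[0] + 1
    return np.split(x, cuts)

def fit_by_regions(x, fit_fn, gap=10.0):
    """fit_fn(region) -> (support, weights); weights are renormalized per region."""
    x = np.asarray(x, float)
    supports, weights = [], []
    for region in split_regions(x, gap):
        s, w = fit_fn(region)
        if w.sum() > 0:
            w = w / w.sum() * (len(region) / len(x))   # regional normalization
        supports.extend(s)
        weights.extend(w)
    return np.array(supports), np.array(weights)
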
If the normalization constraint $\sum_{j=1}^{k_n} w_{nj} = 1$ is retained
when estimating the mixing distribution, the estimation procedure becomes

    P_{G_n} \approx P_n,                                                (10)

where the estimator $G_n$ is a discrete CDF on $\Theta$. This constraint is
necessary for the left-hand side of (10) to be a probability measure.
Although he did not develop an operational estimation scheme, Barbe (1998)
suggested exploiting the fact that the empirical probability measure is
approximated by the estimated probability measure--but he retained the
normalization constraint. As noted above, relaxing the constraint loosens
the throttling effect that large clusters exert on small groups of outliers;
our experimental results show that the estimator obtained from (10), which
retains the constraint, still suffers from the drawback noted earlier.

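For completeness, one crude way to keep the sum-to-one constraint of (10)
inside a non-negative least squares solver is to append a heavily weighted
penalty row; this is our stand-in for an equality-constrained solver such as
Lawson and Hanson's lsei, and only an approximation of it.

# Sketch (ours): approximately enforce sum(w) = 1 within nnls by adding a
# penalty row; the penalty weight is an arbitrary large constant.
import numpy as np
from scipy.optimize import nnls

def fit_normalized_weights(A, P, penalty=1e6):
    k = A.shape[1]
    A_aug = np.vstack([A, penalty * np.ones((1, k))])
    P_aug = np.append(P, penalty * 1.0)
    w, _ = nnls(A_aug, P_aug)
    return w
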
Both estimators, $G_n$ obtained from (10) and $G'_n$ from (8), have been
shown to be strongly consistent under weak conditions similar to those used
by others (Wang & Witten, 1999). Of course, the weak convergence of $G'_n$
is in the sense of general functions, not CDFs. The strong consistency of
$G'_n$ immediately implies the strong consistency of the CDF estimator
obtained by normalizing $G'_n$.

5 Experimental validation

We have conducted experiments to illustrate the failure of existing methods
to detect small outlying clusters, and the improvement achieved by the new
scheme. The results also suggest that the new method is more accurate and
stable than the others.

When comparing clustering methods, it is not always easy to evaluate the
clusters obtained. To finesse this problem we consider simple artificial
situations in which the proper outcome is clear. Some practical applications
of clusters do provide objective evaluation functions; however, these are
beyond the scope of this paper.

The methods used are Choi and Bulgren (1968) (denoted choi), Macdonald's
application of the Cramér-von Mises statistic (cramér), the new method with
the normalization constraint (test), and the new method without that
constraint (new). In each case, equations involving non-negativity and/or
linear equality constraints are solved as quadratic programming problems
using the elegant and efficient procedures nnls and lsei provided by Lawson
and Hanson (1974). All four methods have the same computational time
complexity.

We set the sample size $n$ to 100 throughout the experiments. The data
points are artificially generated from a mixture of two clusters: $n_1$
points from $N(0, 1)$ and $n_2$ points from $N(100, 1)$. The values of $n_1$
and $n_2$ are in the ratios 99:1, 97:3, 93:7, 80:20 and 50:50.

Every data point is taken as a potential support point in all four methods:
thus the number of potential components in the clustering is 100. For test
and new, fitting intervals need to be determined. In the experiments, each
data point $x_i$ provides the two fitting intervals $[x_i - 3, x_i]$ and
$[x_i, x_i + 3]$. Any data point located on the boundary of an interval is
counted as half a point when determining the empirical measure over that
interval.

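A minimal sketch of this setup (ours; the random seed and helper names are
arbitrary) generates the two-cluster sample and computes the empirical
measure over the fitting intervals with the half-point boundary rule:

# Sketch (ours) of the experimental setup: n = 100 points from a mixture of
# N(0, 1) and N(100, 1), with two fitting intervals per data point. A point
# lying exactly on an interval boundary counts as half a point.
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed, for reproducibility only

def generate_sample(n1, n2):
    return np.sort(np.concatenate([rng.normal(0.0, 1.0, n1),
                                   rng.normal(100.0, 1.0, n2)]))

def empirical_measure(x, intervals):
    n = len(x)
    P = []
    for a, b in intervals:
        inside = np.sum((x > a) & (x < b))
        boundary = np.sum(x == a) + np.sum(x == b)
        P.append((inside + 0.5 * boundary) / n)
    return np.array(P)

x = generate_sample(97, 3)
intervals = [(xi - 3.0, xi) for xi in x] + [(xi, xi + 3.0) for xi in x]
P = empirical_measure(x, intervals)
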
These choices are admittedly crude, and further improvements in the accuracy
and speed of test and new are possible that take advantage of the
flexibility provided by (10) and (8). For example, accuracy will likely
increase with more--and more carefully chosen--support points and fitting
intervals. The fact that the method performs well even with crudely chosen
support points and fitting intervals testifies to its robustness.

Our primary interest in this experiment is the weights of the clusters that
are found. To cast the results in terms of the underlying models, we use the
cluster weights to estimate values for $n_1$ and $n_2$. Of course, the
results often do not contain exactly two clusters--but because the
underlying cluster centres, 0 and 100, are well separated compared to their
standard deviation of 1, it is highly unlikely that any data points from one
cluster will fall anywhere near the other. Thus we use a threshold of 50 to
divide the clusters into two groups: those near 0 and those near 100. The
final cluster weights are normalized, and the weights for the first group
are summed to obtain an estimate $\hat{n}_1$ of $n_1$, while those for the
second group are summed to give an estimate $\hat{n}_2$ of $n_2$.

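The bookkeeping just described might look like the following sketch (ours;
the scaling by the sample size n = 100 is our reading of the text above):

# Sketch (ours): estimate n1 and n2 from the fitted support points and
# weights by normalizing, splitting at the threshold 50, and scaling by n.
import numpy as np

def estimate_counts(support, weights, n=100, threshold=50.0):
    support = np.asarray(support, float)
    weights = np.asarray(weights, float)
    weights = weights / weights.sum()          # final normalization
    n1_hat = n * weights[support < threshold].sum()
    n2_hat = n * weights[support >= threshold].sum()
    return n1_hat, n2_hat
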
Table 1 shows results for each of the four methods. Each cell represents one
hundred separate experimental runs. Three figures are recorded. At the top
is the number of times the method failed to detect the smaller cluster, that
is, the number of times $\hat{n}_2 = 0$. In the middle are the average
values for $\hat{n}_1$ and $\hat{n}_2$. At the bottom is the standard
deviation of $\hat{n}_1$ and $\hat{n}_2$ (which are equal). These three
figures can be thought of as measures of reliability, accuracy and stability
respectively.

The top figures in Table 1 show clearly that only new is always reliable in
the sense that it never fails to detect the smaller cluster. The other
methods fail mostly when $n_2 = 1$; their failure rate gradually decreases
as $n_2$ grows. The center figures show that, under all conditions, new
gives a more accurate estimate of the correct values of $n_1$ and $n_2$ than
the other methods. As expected, cramér shows a noticeable improvement over
choi, but it is very minor. The test method has lower failure rates and
produces estimates that are more accurate and far more stable (indicated by
the bottom figures) than those for choi and cramér--presumably because it is
less constrained. Of the four methods, new is clearly and consistently the
winner in terms of all three measures: reliability, accuracy and stability.

                    n1 = 99    n1 = 97    n1 = 93    n1 = 80    n1 = 50
                    n2 =  1    n2 =  3    n2 =  7    n2 = 20    n2 = 50

  choi    Failures       86         42          4          0          0
          ^n1/^n2   99.9/0.1   99.2/0.8   95.8/4.2  82.0/18.0  50.6/49.4
          SD(^n1)       0.36       0.98       1.71       1.77       1.30

  cramér  Failures       80         31          1          0          0
          ^n1/^n2   99.8/0.2   98.6/1.4   95.1/4.9  81.6/18.4  49.7/50.3
          SD(^n1)       0.50       1.13       1.89       1.80       1.31

  test    Failures       52          5          0          0          0
          ^n1/^n2   99.8/0.2   98.2/1.8   94.1/5.9  80.8/19.2  50.1/49.9
          SD(^n1)       0.32       0.83       0.87       0.78       0.55

  new     Failures        0          0          0          0          0
          ^n1/^n2   99.0/1.0   96.9/3.1   92.8/7.2  79.9/20.1  50.1/49.9
          SD(^n1)       0.01       0.16       0.19       0.34       0.41

  Table 1: Experimental results for detecting small clusters

The results of the new method can be further improved. If the decomposed
form (9) is used instead of (8), and the solutions of the sub-equations are
normalized before combining them--which is feasible because the two
underlying clusters are so distant from each other--the correct values are
obtained for $\hat{n}_1$ and $\hat{n}_2$ in virtually every trial.

6 Conclusions

We have identified a shortcoming of existing clustering methods for
arbitrary semi-parametric mixture distributions: they fail to detect very
small clusters reliably. This is a significant weakness when the minority
clusters are far from the dominant ones and the loss function takes account
of the distance of misclustered points.

We have described a new clustering method for arbitrary semi-parametric
mixture distributions, and shown experimentally that it overcomes the
problem. Furthermore, the experiments suggest that the new estimator is more
accurate and more stable than existing ones.

References

Barbe, P. (1998). Statistical analysis of mixtures and the empirical
probability measure. Acta Applicandae Mathematicae, 50(3), 253-340.

Blum, J. R. & Susarla, V. (1977). Estimation of a mixing distribution
function. Ann. Probab., 5, 200-209.

Choi, K. & Bulgren, W. B. (1968). An estimation procedure for mixtures of
distributions. J. R. Statist. Soc. B, 30, 444-460.

Deely, J. J. & Kruse, R. L. (1968). Construction of sequences estimating
the mixing distribution. Ann. Math. Statist., 39, 286-288.

Lawson, C. L. & Hanson, R. J. (1974). Solving Least Squares Problems.
Prentice-Hall, Inc.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications,
Volume 5 of NSF-CBMS Regional Conference Series in Probability and
Statistics. Institute of Mathematical Statistics: Hayward, CA.

Macdonald, P. D. M. (1971). Comment on a paper by Choi and Bulgren.
J. R. Statist. Soc. B, 33, 326-329.

McLachlan, G. & Basford, K. (1988). Mixture Models: Inference and
Applications to Clustering. Marcel Dekker, New York.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical
Analysis of Finite Mixture Distributions. John Wiley & Sons.

Wang, Y. & Witten, I. H. (1999). The estimation of mixing distributions by
approximating empirical measures. Technical Report (in preparation), Dept.
of Computer Science, University of Waikato, New Zealand.
</pre></Content>
</Section>
</Archive>