Edit: The answer to this question is discussed in detail in: Sum in Spark gone bad
In Compute Cost of Kmeans, we saw how one can compute the cost of their KMeans model. I was wondering whether we can also compute the unbalanced factor.
If there is no such functionality provided by Spark, is there any easy way to implement it?
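For context, here is a minimal sketch of the computeCost() call referenced above, assuming the RDD-based pyspark.mllib API; the toy data and k=2 are made up for illustration:

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-cost")
# toy 2-D points forming two obvious groups
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [8.0, 9.0], [9.0, 8.0]])

model = KMeans.train(data, k=2, maxIterations=10)
print(model.computeCost(data))  # sum of squared distances to the closest center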
I was not able to find a reference for the unbalanced factor, but it should be similar to Yael's ivec_unbalanced_factor (my comments):
// @hist: the number of points assigned to each cluster
// @n: the number of clusters
double ivec_unbalanced_factor(const int *hist, long n) {
    int vw;
    double tot = 0, uf = 0;

    for (vw = 0; vw < n; vw++) {
        tot += hist[vw];
        uf += hist[vw] * (double) hist[vw];
    }

    uf = uf * n / (tot * tot);

    return uf;
}
which I found here.
So the idea is that tot (for total) will be equal to the total number of points assigned to clusters (i.e. equal to the size of our dataset), while uf (for unbalanced factor) accumulates the sum of the squared per-cluster counts. Finally, uf = uf * n / (tot * tot); normalizes that sum by the number of clusters and the squared total, so a perfectly balanced assignment (every cluster gets tot / n points) yields a factor of 1.
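Following that logic, here is a minimal sketch of how one might port it to PySpark, assuming a pyspark.mllib KMeansModel named model and the training RDD named data (e.g. the ones from the sketch above); this is not an official Spark API, just a straightforward translation of the C code:

def unbalanced_factor(model, data):
    n = len(model.clusterCenters)                    # number of clusters
    counts = model.predict(data).countByValue()      # {cluster id: #points}, i.e. hist
    tot = float(sum(counts.values()))                # total number of points
    uf = sum(c * float(c) for c in counts.values())  # sum of squared counts
    return uf * n / (tot * tot)                      # 1.0 means perfectly balanced

print(unbalanced_factor(model, data))

Note that empty clusters simply never show up in counts; since a zero count contributes nothing to the sum, the result matches the C version either way.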