First of all, I am new to R (I started yesterday).
I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set I need to find the j in the second with the minimum distance.

My idea is simple: for each i, let dist[j] be the distance between i and j; then I only need which.min(dist) to find what I am looking for.
Each point is an array of 64 doubles, so:
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
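In case it helps to have something runnable, inputs of the same shape can be faked like this (random placeholders, not my real data):

# Random placeholders with the same shapes as my real inputs, only for timing tests.
n <- 3823
K <- 10
data    <- matrix(rnorm(n * 64), nrow = n, ncol = 64)
centers <- matrix(rnorm(K * 64), nrow = K, ncol = 64)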
I have tried with
d <- numeric(K)   # distances from point i to each center
S <- integer(n)   # index of the closest center for each point
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}
which is extremely slow (with n = 200 it takes more than 40 seconds!). The fastest solution that I wrote is
distance <- function(point, group) {
  # Stack `point` on top of the rows of `group`, let dist() compute all pairwise
  # distances, and keep only the distances from `point` to each row of `group`.
  return(dist(t(array(c(point, t(group)), dim = c(ncol(group), 1 + nrow(group)))))[1:nrow(group)])
}
for (i in 1:n) {
  d <- distance(data[i,], centers)
  S[i] <- which.min(d)
}
Even if it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is much faster than the other one (can anyone explain why?), but it is still not fast enough for what I need, because it will not be used only once. Also, the distance code is very ugly. I tried to replace it with
distance <- function(point, group) {
  return(dist(rbind(point, group))[1:nrow(group)])
}
but this seems to be about twice as slow. I also tried to use dist for each pair, but that is also slower.
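By "dist for each pair" I mean roughly the following (a sketch of the idea, not necessarily the exact code I timed):

# One dist() call per (point, center) pair -- a 2-row matrix yields a single distance.
d <- numeric(K)
S <- integer(n)
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- dist(rbind(data[i,], centers[j,]))
  }
  S[i] <- which.min(d)
}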
I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?
PS: I need this to implement k-means by hand (and I do have to do it myself; it is part of an assignment). I believe I will only need the Euclidean distance, but I am not yet sure, so I would prefer to have code where the distance computation can be replaced easily, for example along the lines of the sketch below. stats::kmeans does all the computation in less than one second.
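Something like this is the kind of interface I have in mind (just an illustration; euclidean, closest_center and dist_fun are names I made up for the example):

# Illustrative only: the distance function is a parameter, so swapping the
# Euclidean distance for something else only means passing a different function.
euclidean <- function(a, b) sqrt(sum((a - b)^2))

closest_center <- function(x, centers, dist_fun = euclidean) {
  d <- apply(centers, 1, function(ctr) dist_fun(x, ctr))  # distance to every center
  which.min(d)                                            # index of the nearest one
}

S <- apply(data, 1, closest_center, centers = centers)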