Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
557 views
in Technique[技术] by (71.8m points)

numpy - Why python broadcasting in the example below is slower than a simple loop?

I have an array of vectors and compute the norm of their diffs vs the first one. When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?

import numpy as np

def norm_loop(M, v):
  n = M.shape[0]
  d = np.zeros(n)
  for i in range(n):
    d[i] = np.sum((M[i] - v)**2)
  return d

def norm_bcast(M, v):
  n = M.shape[0]
  d = np.zeros(n)
  d = np.sum((M - v)**2, axis=1)
  return d

M = np.random.random_sample((1000, 10000))
v = M[0]

%timeit norm_loop(M, v) 
25.9 ms

%timeit norm_bcast(M, v)
38.5 ms

I have Python 3.6.3 and Numpy 1.14.2

To run the example in google colab: https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Memory access.

First off, the broadcast version can be simplified to

def norm_bcast(M, v):
     return np.sum((M - v)**2, axis=1)

This still runs slightly slower than the looped version. Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?

As I said, it comes down to memory access.

In the broadcast version every element of M is subtracted from v. By the time the last row of M is processed the results of processing the first row have been evicted from cache, so for the second step these differences are again loaded into cache memory and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to acomodate all of the data.

In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.

Lastly, it is possible to avoid this with some array operations by using einsum. This function allows mixing matrix multiplications and summations. First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it. The answer may also be slightly different due to rounding errors. In this case it can be written as

def norm_einsum(M, v):
    tmp = M-v
    return np.einsum('ij,ij->i', tmp, tmp)

This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation. This gives a slight improvement:

%timeit norm_bcast(M, v)
30.1 ms ± 116 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit norm_loop(M, v)
25.1 ms ± 37.3 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit norm_einsum(M, v)
21.7 ms ± 65.3 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...