
Ahh, I wish they had included a speed comparison with numpy.average

I know, that's not the point and average was only picked as a simple example, but still...



Agree. Applying a plain Python for loop to a NumPy array to do simple math is just pure nonsense.

Just tested how it would go without the compilation nonsense:

```
import numpy as np

a = np.random.random(int(1e6))

%timeit np.average(a)
%timeit np.average(a[::16])
```

And my result is that no matter how non-contiguous the slice is in memory (here I take every 16th element like they did, and I also tested strides of 2, 4, 8, and 16), we are doing fewer operations, so it always ends up faster. By contrast, their compiled SIMD code is 10-20X slower in the non-contiguous case.
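
Roughly, the stride sweep looks like this (a sketch using the plain `timeit` module so it runs outside IPython; the strides are the ones mentioned above):

```
import timeit
import numpy as np

a = np.random.random(int(1e6))

# Time the full contiguous average, then strided slices of it.
# Each strided slice touches fewer elements, so it should come out
# faster despite being non-contiguous in memory.
print("full:", timeit.timeit(lambda: np.average(a), number=100))
for step in (2, 4, 8, 16):
    t = timeit.timeit(lambda: np.average(a[::step]), number=100)
    print(f"stride {step}: {t}")
```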

And for a larger array that is 16X the size of the contiguous one, where we only take 1/16 of its elements, the result is about 10X slower, as shown in the article. But I suspect that's purely because you now have a 16X larger array to load from memory, which is slow by nature.

```
b = np.random.random(int(16e6))

%timeit np.average(b[::16])
```

Which leads me to conclude that people should use NumPy the right way. It is really hard to beat pure NumPy speed.


But that's precisely what makes this a good exercise: you can see how far you are able to close the gap between the naive looping implementation and the optimized array implementation.


> np.average

But that's not the function in the article. The article implements `(a + b) / 2`.

And, on my system, a simple `return (arr1 + arr2) / 2` takes 1.2 ms, while `average_arrays_4` takes 0.74 ms.
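
For what it's worth, here is roughly how the plain NumPy side of that comparison can be timed (a sketch; the array sizes here are an assumption, and the article's compiled `average_arrays_4` is not reproduced):

```
import timeit
import numpy as np

# Assumed sizes for illustration; only the plain NumPy baseline is timed,
# not the article's compiled average_arrays_4.
arr1 = np.random.random(int(1e6))
arr2 = np.random.random(int(1e6))

t = timeit.timeit(lambda: (arr1 + arr2) / 2, number=100)
print(f"(arr1 + arr2) / 2: {t / 100 * 1e3:.3f} ms per call")
```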


A few years ago I tried to beat the C/C++ compiler on speed with manual SIMD instructions vs. pure C/C++. It didn't work out…

I can only imagine that this is already baked into NumPy by now.


You usually have to unroll your loops for it to help (unless compilers have gotten smarter about data dependencies)



