
Ahh, I wish they had included a speed comparison with numpy.average

I know, that's not the point and average was only picked as a simple example, but still...



Agree. Applying a plain Python for loop to a NumPy array to do simple math is just pure nonsense.

Just tested how it would go without the compilation nonsense:

```
import numpy as np

a = np.random.random(int(1e6))

%timeit np.average(a)
%timeit np.average(a[::16])
```

And my result is that no matter how non-contiguous the slice is in memory (here I take every 16th element like they did, and I also tested strides of 2, 4, 8, and 16), we are doing fewer operations, so it always ends up faster. By contrast, their compiled SIMD code is 10-20X slower in the non-contiguous case.
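
Roughly, the stride sweep looks like this (a sketch using the plain `timeit` module so it runs outside IPython; the strides are the ones mentioned above):

```
import timeit
import numpy as np

a = np.random.random(int(1e6))

# Time the full contiguous average, then strided slices of it.
# Each strided slice touches fewer elements, so it should come out
# faster despite being non-contiguous in memory.
print("full:", timeit.timeit(lambda: np.average(a), number=100))
for step in (2, 4, 8, 16):
    t = timeit.timeit(lambda: np.average(a[::step]), number=100)
    print(f"stride {step}: {t}")
```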

And for a larger array that is 16X the size of the contiguous one, where we only take 1/16 of its elements, the result is about 10X slower, as shown in the article. But I suspect that's purely because you now have a 16X larger array to load from memory, which is slow by nature.

```
b = np.random.random(int(16e6))

%timeit np.average(b[::16])
```

Which leads me to conclude that people should use NumPy the right way. It is really hard to beat pure NumPy speed.


But that's precisely what makes this a good exercise: you can see how far you are able to close the gap between the naive looping implementation and the optimized array implementation.


> np.average

But that's not the function in the article. The article implements `(a + b) / 2`.

And, on my system, a simple `return (arr1 + arr2) / 2` takes 1.2 ms, while `average_arrays_4` takes 0.74 ms.
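
For what it's worth, here is roughly how the plain NumPy side of that comparison can be timed (a sketch; the array sizes here are an assumption, and the article's compiled `average_arrays_4` is not reproduced):

```
import timeit
import numpy as np

# Assumed sizes for illustration; only the plain NumPy baseline is timed,
# not the article's compiled average_arrays_4.
arr1 = np.random.random(int(1e6))
arr2 = np.random.random(int(1e6))

t = timeit.timeit(lambda: (arr1 + arr2) / 2, number=100)
print(f"(arr1 + arr2) / 2: {t / 100 * 1e3:.3f} ms per call")
```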


A few years ago I tried to beat the C/C++ compiler on speed with manual SIMD instructions vs. pure C/C++. It didn't work out…

I can only imagine that this is already baked into NumPy by now.


You usually have to unroll your loops for it to help (unless compilers have gotten smarter about data dependencies)



