I remember this turning up in a google interview back in the day. The interview was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. Was fun to just shortcut things by knowing the answer that time.
Yeah, this was a google interview question for me too. I didn't know the algorithm and floundered around trying to solve the problem. I came up with the 1/n and k/n selection strategy but still didn't get the job lol. I think the guy who interviewed me was just killing time until lunch.
I like the visualizations in this article, really good explanation.
I didn't know about the algorithm until after I got hired there. It's actually really useful in a number of contexts, but my favorite was using it to find optimal split points for sharding lexicographically sorted string keys for mapping. Often you will have a sorted table, but the underlying distribution of keys isn't known, so uniform sharding will often cause imbalances where some mappers end up doing far more work than others. I don't know if there is a convenient open source class to do this.
Interesting idea, hadn’t that about that way to apply it.
I knew it from before my interview from a turbo pascal program I had seen that sampled dat tape backups of patient records from a hospital system. These samples were used for studies. That was a textbook example of it’s utility.
I guess the question in my mind is: would you expect a smart person who did not previously know this problem (or really much random sampling at all) to come up with the algorithm on the fly in an interview? And if the person had seen it before and memorized the answer, does that provide any signal of their ability to code?
My gut instinct is no. I certainly don't think I'd be able to derive this algorithm from first principles in a 60 minute whiteboarding interview, and I worked at Google for 4 years.
They wanted to see your analytical thinking skills at work. To pass you only needed to be sensible. You didn’t fail the interview if you couldn’t invent reservoir sampling!
This also got me past one interview. I came up with k/n but now I think it's better to just generate a random float in [0, 1] and keep the k largest ones