In this new post to his blog Joshua Thijssen describes something that can help when processing large amounts of data (like, in his example, the text of a book) to search through the information and find if a certain piece of data is in the set - a bloom filter.
Most of my co-workers never really heard of bloom filters, and I’m continuously need to explain what they are, what their purpose is and why it’s a better solution than other ones. So let’s do an introduction on bloom filters. [...] Bloom filters have the property of being exceptionally fast AND exceptionally small compared to other structures but it comes with a price: it MIGHT be possible that our bloom filter thinks that an element is inside our set, when it really isn’t. Luckily, the reverse is not possible: when a bloom filter says something is NOT in the set, you are 100% sure that it isn’t part of the set.
He explains how the filter works, noting how it's better for memory consumption and how it's possible for it to give a "maybe" response instead of ab absolute "yes" or "no". He also points out a PHP extension, bloomy that takes the hard work out of it for you.