Imagine an unending stream of names:
Britney, Brad, Angelina, Britney, Pamela, Jessica, Jessica, Britney, Clay, Brad, Britney, Britney, Pamela, Clay, Brad….
Your job is to keep a running tally of the number of unique names. (The snippet above, with 15 names altogether, has 6 unique names.)
There’s a straightforward method of solution: Keep a record of all the unique names seen so far, then match each name received against all those already on record. If a new name matches one in your database, ignore it. If there is no match, add the new name to the database and increment the count of unique names.
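In code, this brute-force method is only a few lines. Here’s a minimal sketch in Python, with a hash set standing in for the database of names:

```python
# Brute-force unique counting: remember every distinct name seen so far.
def count_unique(stream):
    seen = set()  # the "database" -- grows without bound in the worst case
    count = 0
    for name in stream:
        if name not in seen:
            seen.add(name)
            count += 1
    return count

names = ("Britney Brad Angelina Britney Pamela Jessica Jessica "
         "Britney Clay Brad Britney Britney Pamela Clay Brad").split()
print(count_unique(names))  # 6
```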
The trouble with this scheme is that the memory requirement is potentially unbounded. You’ll need to store one copy of each unique name. In the worst case (where no name is ever repeated), your database has to be capable of storing the entire stream. If that stream consists of, say, hundreds of millions of queries submitted to an Internet search engine—with more pouring in every second—you may have a hard time keeping up.
Tasks like this one are the subject of my new “Computing Science” column, titled “The Britney Spears Problem.” It’s now available online, and paper-and-ink magazines should be reaching subscribers and newsstands soon. (Note: American Scientist has just revamped its web site. A lot of effort has gone into this migration, but it’s still a work in progress, so please be patient. Links to the old site are not yet being redirected to the corresponding page of the new site. If you discover anything else that’s not working or not to your taste, I’d be grateful if you’d drop me a note or leave a comment below. I’ll pass suggestions along.)
But back to Britney and her friends.
The basic assumption in the study of stream algorithms is that you get only one look at the stream. You are standing on a bridge, watching the river flow by below. When something interesting floats along in the current, you can choose to grapple it out and store it on shore. But if you let it float past, you never get another chance at it. And you have only a fixed amount of storage space for things you choose to save. There’s also a time constraint: You have to be finished with one object before the next comes along. (In some cases the time constraint can be relaxed to an averaged or “amortized” bound.)
With a fixed and finite amount of storage and a stream of infinite variety, there’s no way to count the unique elements exactly. In my American Scientist column I describe some approximation methods. Here I want to briefly mention another approach that I didn’t have room to include in the column. It’s called a Bloom filter, after Burton H. Bloom, who described the idea in 1970. (See Bloom, Burton H. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7):422–426.)
The term “filter” is a little misleading in this context; what Bloom defined is really a specialized memory. The underlying data structure is an array of m bits, all initially set to zero. Also needed are k different “hash” functions, each of which can be applied to any item being stored in the memory. The hash functions map a data element to a number between 0 and m–1; this number can then be interpreted as an index into the array of bits. (An ideal hash function is both perfectly random and perfectly deterministic—a hard combination to achieve in practice. It’s random in the sense that any data item can hash to any of the m possible values with equal probability. It’s deterministic in the sense that the same data item will always hash to the same value.)
There are two operations defined on the Bloom filter:
- To store a data element, set each of the bits specified by the k hash functions to 1. (If any of the bits are already set to 1, leave them as they are.)
- To find out if a data element is already present, calculate the k hash functions for that element and examine the corresponding bits. If any of the bits is 0, the element cannot possibly be in the memory; if all k bits are 1, the element is presumed to be present.
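Here’s a minimal sketch of the whole scheme in Python. Deriving the k indexes by salting a single cryptographic hash is my own convenient stand-in for k independent ideal hash functions, and a serious implementation would pack the bits rather than keep a list:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m  # the array of m bits, all initially 0

    def _indexes(self, item):
        # Salt one hash function k ways -- a stand-in for k ideal hashes,
        # each mapping the item to an index between 0 and m-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1  # already-set bits simply stay set

    def might_contain(self, item):
        # False means definitely absent; True means presumed present.
        return all(self.bits[i] == 1 for i in self._indexes(item))
```

Counting unique names is then a loop that stays within m bits of memory, whatever the length of the stream:

```python
bf, tally = BloomFilter(m=1000, k=3), 0
for name in ["Britney", "Brad", "Angelina", "Britney", "Pamela"]:
    if not bf.might_contain(name):
        bf.add(name)
        tally += 1
print(tally)  # 4 unique names, barring false positives
```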
For the name-list problem, the Bloom filter offers a way to check every name as it comes along to see whether or not it has appeared before. The checking can be done in a fixed amount of space and a fixed amount of time per name. Problem solved, no? Not quite. Although the Bloom filter can never produce a false negative (claiming that a name is not present when it really is), there is a risk of false positives (reporting that a name has been seen before when in fact it is novel). For example, with k=3 and m=10, suppose the name Britney hashes to the bits {0,3,6} and Brad to {3,4,9}. Now, with those two names already stored, Angelina comes along and hashes to {4,6,9}. All three of those bits are already set, and so the Bloom filter will falsely report that Angelina has been encountered previously.
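That little collision is easy to reproduce. In the sketch below the index sets are hard-wired to the values in the example, as stand-ins for real hash outputs:

```python
# The worked example above, with the hash outputs hard-wired (m=10, k=3).
bits = [0] * 10
for indexes in ({0, 3, 6},   # Britney
                {3, 4, 9}):  # Brad
    for i in indexes:
        bits[i] = 1
# Angelina hashes (in this contrived example) to {4, 6, 9}:
print(all(bits[i] == 1 for i in {4, 6, 9}))  # True -- a false positive
```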
The probability of a false positive obviously depends on the proportion of bits that are set to 1 in the memory array. You might want to try working out just how this probability grows as a function of m, k and n (the number of items stored). Or take the easy way out and read Andrei Broder and Michael Mitzenmacher’s survey. I also recommend the Wikipedia article on Bloom filters.
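For those who want to check their algebra numerically, the standard approximation (the one derived in the survey just mentioned) is simple to compute:

```python
from math import exp

def false_positive_rate(m, k, n):
    # After n insertions, each bit is still 0 with probability
    # (1 - 1/m)**(k*n), roughly exp(-k*n/m); a novel item reads k
    # bits and finds them all set with probability about:
    return (1 - exp(-k * n / m)) ** k

print(false_positive_rate(m=1000, k=3, n=100))  # ~0.017
```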
The Bloom filter strikes me as one of those ideas that seems thunderingly obvious after you’ve read about it. But it’s really very elegant and clever.
Does anyone know who Burton H. Bloom is or was? In 1963 he wrote an MIT AI memo (No. 47) on heuristics in chess. The ACM Guide to Computing Literature lists just one other publication, from 1969. By then he was working for the Computer Usage Company in Newton Upper Falls, Mass., which apparently closed up shop in 1986. I’m not the first (see here and here) to go looking for him.