pt.tumba.spell
Class BloomFilter

java.lang.Object
  extended by pt.tumba.spell.BloomFilter
All Implemented Interfaces:
java.lang.Cloneable

public class BloomFilter
extends java.lang.Object
implements java.lang.Cloneable

Implementation of a Bloom Filter data structure, an elegant alternative to the lookup hash table.

Bloom filters allow you to perform membership tests in just a fraction of the memory you'd need to store a full list of keys. As you might suspect, the savings in space comes at a price: you run an adjustable risk of false positives, and you can't remove a key from a filter once you've added it in. But in the many cases where those constraints are acceptable, a Bloom filter can make a useful tool.

Bloom filters are named after Burton Bloom, who first described them in a 1970 paper entitled Space/time trade-offs in hash coding with allowable errors. In those days of limited memory, Bloom filters were prized primarily for their compactness; in fact, one of their earliest applications was in spell checkers.

A Bloom filter consists of two components: a set of k hash functions and a bit vector of a given length. We choose the length of the bit vector and the number of hash functions depending on how many keys we want to add to the set and how high an error rate we are willing to put up with.

All of the hash functions in a Bloom filter are configured so that their range matches the length of the bit vector. For example, if a vector is 200 bits long, the hash functions return a value between 1 and 200. It's important to use high-quality hash functions in the filter to guarantee that output is equally distributed over all possible values -- "hot spots" in a hash function would increase our false-positive rate.

To enter a key into a Bloom filter, we run it through each one of the k hash functions and treat the result as an offset into the bit vector, turning on whatever bit we find at that position. If the bit is already set, we leave it on. There's no mechanism for turning bits off in a Bloom filter.

Checking to see whether a key already exists in a filter is exactly analogous to adding a new key. We run the key through our set of hash functions, and then check to see whether the bits at those offsets are all turned on. If any of the bits is off, we know for certain the key is not in the filter. If all of the bits are on, we know the key is probably there.

As you might expect, the false-positive rate depends on the bit vector length and the number of keys stored in the filter. The roomier the bit vector, the smaller the probability that all k bits we check will be on, unless the key actually exists in the filter. The relationship between the number of hash functions and the false-positive rate is more subtle. If you use too few hash functions, there won't be enough discrimination between keys; but if you use too many, the filter will be very dense, increasing the probability of collisions. You can calculate the false-positive rate for any filter using the formula:

c = ( 1 - e(-kn/m) )k

Where c is the false positive rate, k is the number of hash functions, n is the number of keys in the filter, and m is the length of the filter in bits.

When using Bloom filters, we very frequently have a desired false-positive rate in mind and we are also likely to have a rough idea of how many keys we want to add to the filter. We need some way of finding out how large a bit vector is to make sure the false-positive rate never exceeds our limit. The following equation will give us vector length from the error rate and number of keys:

m = -kn / ( ln( 1 - c ^ 1/k ) )

You'll notice another free variable here: k, the number of hash functions. However, it's possible to use calculus to find a minimum for k. You can also find lookup tables for various combinations of error rate, filter size, and number of hash functions at Bloom Filters -- the math.

This implementation uses the hashCode() method supplied for all Java objects, which produces a 32-bit signed int number. For example, in String Objects, the hashcode is usually computed by adding up the character values with an prime multiplier (31, in the case of JDK 1.4).

Alternatively, this class can also use an implementation of a hash function based on Rabin fingerprints, which can efficiently produce a 32-bit hash value for a sequence of bytes. It does so by considering strings of bytes as large polynomials with coefficients of 0 and 1 and then reducing them modulo some irreducible polynomial of degree 32. The result is a hash function with very satisfactory properties. In addition the polynomial operations are fast in hardware, and even in this Java implementation the speed is reasonable.

The implementation is derived from the paper "Some applications of Rabin's fingerprinting method" by Andrei Broder. See http://server3.pa-x.dec.com/SRC/publications/src-papers.html for a full citation and the paper in PDF format.

Included in this class are additional methods that can compute the Rabin hash value for any serializable Object, String, File, or resource denoted by URL.

As for the multiple hash functions for the Bloom Filter, these are based on the module of the initial value multiplied by a list of distinct values.

Author:
Bruno Martins
See Also:
Object.hashCode(), Map

Field Summary
private  byte[] buffer
          A buffer for the Rabin fingerprinting algorithm.
private  boolean[] keys
          The bit vector for the Bloom Filter.
private  int numFunctions
          The number of hash functions.
private static int P_DEGREE
          The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm.
private static int POLYNOMIAL
          The 32 bits of this integer represent the coefficients of the degree 32 irreducible polynomial over GF(2); that is, every coefficient is 0 or 1.
private static int READ_BUFFER_SIZE
          The size of the buffer for the Rabin fingerprinting algorithm.
private static int[] table32
          Internal values for the Rabin fingerprinting algorithm.
private static int[] table40
          Internal values for the Rabin fingerprinting algorithm.
private static int[] table48
          Internal values for the Rabin fingerprinting algorithm.
private static int[] table54
          Internal values for the Rabin fingerprinting algorithm.
private  boolean useRabin
          Use Rabin's fingerprinting algorithm ( default is true ).
private static int X_P_DEGREE
          The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm.
 
Constructor Summary
BloomFilter()
          Constructs an empty BloomFilter with the default number of hash functions (10) and the default length for the bit vector (1000).
BloomFilter(int numKeys)
          Constructs an empty BloomFilter with the default number of hash functions (10) and a given length for the bit vector.
BloomFilter(int numKeys, double errorRate)
          Constructs an empty BloomFilter with a given length for the bit vector, guarenteeing a maximum error rate.
BloomFilter(int numKeys, int numHashFunctions)
          Constructs an empty BloomFilter with a given number of hash functions and a given length for the bit vector.
BloomFilter(java.lang.String filter)
          Constructs a Bloom Filter from a string representation.
 
Method Summary
 java.lang.Object clone()
          Returns a copy of this Bloom Filter instance.
private  int getHash(int fnum, int original)
          Internal method for producing the hash value for a given function number.
 int hashRabin(byte[] arr)
          Return the Rabin hash value of an array of bytes.
private  int hashRabin(byte[] arr, int offset, int length, int ws)
          Return the Rabin hash value of an array of bytes.
 int hashRabin(char[] arr)
          Return the Rabin hash value of an array of chars.
 int hashRabin(java.io.File f)
          Computes the Rabin hash value of the contents of a File.
 int hashRabin(java.io.InputStream is)
          Computes the Rabin hash value of the data from an InputStream.
 int hashRabin(int[] arr)
          Returns the Rabin hash value of an array of integers.
 int hashRabin(java.lang.Object obj)
          Computes the Rabin hash value of a given Object.
 int hashRabin(java.io.Serializable obj)
          Computes the Rabin hash value of a given serializable Object.
 int hashRabin(java.lang.String s)
          Computes the Rabin hash value of a String.
 int hashRabin(java.net.URL url)
          Computes the Rabin hash value of the contents of a Web document, specified by an URL.
 boolean hasKey(java.lang.Object obj)
          Returns true if this Bloom Filter contains the specified key.
 void put(java.lang.Object obj)
          Adds the specified key in this Bloom Filter.
 java.lang.String toString()
          Returns a string representation of this Bloom Filter.
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

buffer

private byte[] buffer
A buffer for the Rabin fingerprinting algorithm.


POLYNOMIAL

private static int POLYNOMIAL
The 32 bits of this integer represent the coefficients of the degree 32 irreducible polynomial over GF(2); that is, every coefficient is 0 or 1. However, a degree 32 polynomial has 33 coefficients; the term of degree 32 is assumed to have a coefficient of 1. Therefore, the high-order bit of the int is the degree 31 term's coefficient, and the low-order bit is the constant coefficient.

For example the integer 0x00000803, in binary, is:

00000000 00000000 00001000 00000011

Therefore it correponds to the polynomial:

x32 + x11 + x + 1


table32

private static int[] table32
Internal values for the Rabin fingerprinting algorithm.


table40

private static int[] table40
Internal values for the Rabin fingerprinting algorithm.


table48

private static int[] table48
Internal values for the Rabin fingerprinting algorithm.


table54

private static int[] table54
Internal values for the Rabin fingerprinting algorithm.


P_DEGREE

private static int P_DEGREE
The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm.


READ_BUFFER_SIZE

private static int READ_BUFFER_SIZE
The size of the buffer for the Rabin fingerprinting algorithm.


X_P_DEGREE

private static int X_P_DEGREE
The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm.


keys

private boolean[] keys
The bit vector for the Bloom Filter.


useRabin

private boolean useRabin
Use Rabin's fingerprinting algorithm ( default is true ).


numFunctions

private int numFunctions
The number of hash functions.

Constructor Detail

BloomFilter

public BloomFilter()
Constructs an empty BloomFilter with the default number of hash functions (10) and the default length for the bit vector (1000).


BloomFilter

public BloomFilter(java.lang.String filter)
Constructs a Bloom Filter from a string representation.

See Also:
toString()

BloomFilter

public BloomFilter(int numKeys,
                   double errorRate)
Constructs an empty BloomFilter with a given length for the bit vector, guarenteeing a maximum error rate.

Parameters:
errorRate - The maximum error rate (false positives) for the Bloom Filter.

BloomFilter

public BloomFilter(int numKeys)
Constructs an empty BloomFilter with the default number of hash functions (10) and a given length for the bit vector.

Parameters:
numKeys - The length of the bit vector.

BloomFilter

public BloomFilter(int numKeys,
                   int numHashFunctions)
Constructs an empty BloomFilter with a given number of hash functions and a given length for the bit vector.

Parameters:
numKeys - The length of the bit vector.
numHashFunctions - The number of hash functions.
Method Detail

getHash

private int getHash(int fnum,
                    int original)
Internal method for producing the hash value for a given function number.

Parameters:
fnum - The number of the hash function.
original - The original value for the hash of the object.
Returns:
Returns the hash code value for the given function number.
See Also:
Object.hashCode()

hasKey

public boolean hasKey(java.lang.Object obj)
Returns true if this Bloom Filter contains the specified key.

Parameters:
obj - The key whose presence in this Bloom Filter is to be tested.
Returns:
true if this Bloom Filter contains a mapping for the specified key.

put

public void put(java.lang.Object obj)
Adds the specified key in this Bloom Filter.

Parameters:
obj - The key to be added to this Bloom Filter.

toString

public java.lang.String toString()
Returns a string representation of this Bloom Filter. The string representation consists of an integer specifying the number of hash Functions, an integer specifying the length of the bit vector, and a sequence of 0s and 1s specifying the bit vector. These 3 fields are separated by the character ":". This implementation creates an empty string buffer, and iterates over the bit vector, appending the value of each bit in turn. A string is obtained from the stringbuffer, and returned.

Overrides:
toString in class java.lang.Object
Returns:
A string representation of this Bloom Filter.

clone

public java.lang.Object clone()
Returns a copy of this Bloom Filter instance.

Overrides:
clone in class java.lang.Object
See Also:
Object.clone()

hashRabin

public int hashRabin(byte[] arr)
Return the Rabin hash value of an array of bytes.

Parameters:
arr - An array of bytes.
Returns:
The Rabin hash value for the array of bytes.

hashRabin

private int hashRabin(byte[] arr,
                      int offset,
                      int length,
                      int ws)
Return the Rabin hash value of an array of bytes.

Parameters:
arr - An array of bytes.
offset - Index of the first byte of the array to hash.
length - Number of bytes to hash.
ws - ??
Returns:
The Rabin hash value for the array of bytes.

hashRabin

public int hashRabin(char[] arr)
Return the Rabin hash value of an array of chars.

Parameters:
arr - An array of chars.
Returns:
The Rabin hash value for the array of chars.

hashRabin

public int hashRabin(java.io.File f)
              throws java.io.FileNotFoundException,
                     java.io.IOException
Computes the Rabin hash value of the contents of a File.

Parameters:
f - A File.
Returns:
The Rabin hash value for the contents of the File.
Throws:
java.io.FileNotFoundException - If the file cannot be found.
java.io.IOException - If an error occurs while reading the file.

hashRabin

public int hashRabin(java.io.InputStream is)
              throws java.io.IOException
Computes the Rabin hash value of the data from an InputStream.

Parameters:
is - An InputStream.
Returns:
The Rabin hash value for the contents read from the InputStream.
Throws:
java.io.IOException - if an error occurs while reading from the InputStream.

hashRabin

public int hashRabin(int[] arr)
Returns the Rabin hash value of an array of integers. This method is the most efficient of all the hash methods, so it should be used when possible.

Parameters:
arr - An array of integers.
Returns:
int The Rabin hash value for the array of integers.

hashRabin

public int hashRabin(java.lang.Object obj)
              throws java.io.IOException
Computes the Rabin hash value of a given Object.

Parameters:
obj - An Object.
Returns:
The Rabin hash value for the Object.
Throws:
java.io.IOException - If Object serialization fails.

hashRabin

public int hashRabin(java.io.Serializable obj)
              throws java.io.IOException
Computes the Rabin hash value of a given serializable Object.

Parameters:
obj - An Object.
Returns:
The Rabin hash value for the Object.
Throws:
java.io.IOException - If serialization fails.

hashRabin

public int hashRabin(java.lang.String s)
Computes the Rabin hash value of a String.

Parameters:
s - A String.
Returns:
The Rabin hash value for the String.

hashRabin

public int hashRabin(java.net.URL url)
              throws java.io.IOException
Computes the Rabin hash value of the contents of a Web document, specified by an URL.

Parameters:
url - The URL of the document to be hashed.
Returns:
The Rabin hash value for the document.
Throws:
java.io.IOException - If an error occurs while reading the document.