

PREV CLASS NEXT CLASS  FRAMES NO FRAMES  
SUMMARY: NESTED  FIELD  CONSTR  METHOD  DETAIL: FIELD  CONSTR  METHOD 
java.lang.Object pt.tumba.spell.BloomFilter
public class BloomFilter
Implementation of a Bloom Filter data structure, an elegant alternative to the lookup hash table.
Bloom filters allow you to perform membership tests in just a fraction of the memory you'd need to store a full list of keys. As you might suspect, the savings in space comes at a price: you run an adjustable risk of false positives, and you can't remove a key from a filter once you've added it in. But in the many cases where those constraints are acceptable, a Bloom filter can make a useful tool.
Bloom filters are named after Burton Bloom, who first described them in a 1970 paper entitled Space/time tradeoffs in hash coding with allowable errors. In those days of limited memory, Bloom filters were prized primarily for their compactness; in fact, one of their earliest applications was in spell checkers.
A Bloom filter consists of two components: a set of k
hash functions and a bit vector of
a given length. We choose the length of the bit vector and the number of hash functions
depending on how many keys we want to add to the set and how high an error rate we are
willing to put up with.
All of the hash functions in a Bloom filter are configured so that their range matches the length of the bit vector. For example, if a vector is 200 bits long, the hash functions return a value between 1 and 200. It's important to use highquality hash functions in the filter to guarantee that output is equally distributed over all possible values  "hot spots" in a hash function would increase our falsepositive rate.
To enter a key into a Bloom filter, we run it through each one of the k hash functions and treat the result as an offset into the bit vector, turning on whatever bit we find at that position. If the bit is already set, we leave it on. There's no mechanism for turning bits off in a Bloom filter.
Checking to see whether a key already exists in a filter is exactly analogous to adding a new key. We run the key through our set of hash functions, and then check to see whether the bits at those offsets are all turned on. If any of the bits is off, we know for certain the key is not in the filter. If all of the bits are on, we know the key is probably there.
As you might expect, the falsepositive rate depends on the bit vector length and the number of keys stored in the filter. The roomier the bit vector, the smaller the probability that all k bits we check will be on, unless the key actually exists in the filter. The relationship between the number of hash functions and the falsepositive rate is more subtle. If you use too few hash functions, there won't be enough discrimination between keys; but if you use too many, the filter will be very dense, increasing the probability of collisions. You can calculate the falsepositive rate for any filter using the formula:
c = ( 1  e(kn/m) )k
Where c is the false positive rate, k is the number of hash functions, n is the number of keys in the filter, and m is the length of the filter in bits.
When using Bloom filters, we very frequently have a desired falsepositive rate in mind and we are also likely to have a rough idea of how many keys we want to add to the filter. We need some way of finding out how large a bit vector is to make sure the falsepositive rate never exceeds our limit. The following equation will give us vector length from the error rate and number of keys:
m = kn / ( ln( 1  c ^ 1/k ) )
You'll notice another free variable here: k, the number of hash functions. However, it's possible to use calculus to find a minimum for k. You can also find lookup tables for various combinations of error rate, filter size, and number of hash functions at Bloom Filters  the math.
This implementation uses the hashCode()
method supplied for all Java objects, which
produces a 32bit signed int number. For example, in String
Objects, the hashcode is usually
computed by adding up the character values with an prime multiplier (31, in the case of JDK 1.4).
Alternatively, this class can also use an implementation of a hash function based on Rabin fingerprints, which can efficiently produce a 32bit hash value for a sequence of bytes. It does so by considering strings of bytes as large polynomials with coefficients of 0 and 1 and then reducing them modulo some irreducible polynomial of degree 32. The result is a hash function with very satisfactory properties. In addition the polynomial operations are fast in hardware, and even in this Java implementation the speed is reasonable.
The implementation is derived from the paper "Some applications of Rabin's fingerprinting method" by Andrei Broder. See http://server3.pax.dec.com/SRC/publications/srcpapers.html for a full citation and the paper in PDF format.
Included in this class are additional methods that can compute the Rabin hash value
for any serializable Object
, String
, File
, or resource denoted by URL
.
As for the multiple hash functions for the Bloom Filter, these are based on the module of the initial value multiplied by a list of distinct values.
Object.hashCode()
,
Map
Field Summary  

private byte[] 
buffer
A buffer for the Rabin fingerprinting algorithm. 
private boolean[] 
keys
The bit vector for the Bloom Filter. 
private int 
numFunctions
The number of hash functions. 
private static int 
P_DEGREE
The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm. 
private static int 
POLYNOMIAL
The 32 bits of this integer represent the coefficients of the degree 32 irreducible polynomial over GF(2); that is, every coefficient is 0 or 1. 
private static int 
READ_BUFFER_SIZE
The size of the buffer for the Rabin fingerprinting algorithm. 
private static int[] 
table32
Internal values for the Rabin fingerprinting algorithm. 
private static int[] 
table40
Internal values for the Rabin fingerprinting algorithm. 
private static int[] 
table48
Internal values for the Rabin fingerprinting algorithm. 
private static int[] 
table54
Internal values for the Rabin fingerprinting algorithm. 
private boolean 
useRabin
Use Rabin's fingerprinting algorithm ( default is true ). 
private static int 
X_P_DEGREE
The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm. 
Constructor Summary  

BloomFilter()
Constructs an empty BloomFilter with the default number of hash functions (10) and the default length for the bit vector (1000). 

BloomFilter(int numKeys)
Constructs an empty BloomFilter with the default number of hash functions (10) and a given length for the bit vector. 

BloomFilter(int numKeys,
double errorRate)
Constructs an empty BloomFilter with a given length for the bit vector, guarenteeing a maximum error rate. 

BloomFilter(int numKeys,
int numHashFunctions)
Constructs an empty BloomFilter with a given number of hash functions and a given length for the bit vector. 

BloomFilter(java.lang.String filter)
Constructs a Bloom Filter from a string representation. 
Method Summary  

java.lang.Object 
clone()
Returns a copy of this Bloom Filter instance. 
private int 
getHash(int fnum,
int original)
Internal method for producing the hash value for a given function number. 
int 
hashRabin(byte[] arr)
Return the Rabin hash value of an array of bytes. 
private int 
hashRabin(byte[] arr,
int offset,
int length,
int ws)
Return the Rabin hash value of an array of bytes. 
int 
hashRabin(char[] arr)
Return the Rabin hash value of an array of chars. 
int 
hashRabin(java.io.File f)
Computes the Rabin hash value of the contents of a File . 
int 
hashRabin(java.io.InputStream is)
Computes the Rabin hash value of the data from an InputStream . 
int 
hashRabin(int[] arr)
Returns the Rabin hash value of an array of integers. 
int 
hashRabin(java.lang.Object obj)
Computes the Rabin hash value of a given Object. 
int 
hashRabin(java.io.Serializable obj)
Computes the Rabin hash value of a given serializable Object. 
int 
hashRabin(java.lang.String s)
Computes the Rabin hash value of a String. 
int 
hashRabin(java.net.URL url)
Computes the Rabin hash value of the contents of a Web document, specified by an URL. 
boolean 
hasKey(java.lang.Object obj)
Returns true if this Bloom Filter contains the specified key. 
void 
put(java.lang.Object obj)
Adds the specified key in this Bloom Filter. 
java.lang.String 
toString()
Returns a string representation of this Bloom Filter. 
Methods inherited from class java.lang.Object 

equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait 
Field Detail 

private byte[] buffer
private static int POLYNOMIAL
int
is the degree 31 term's coefficient, and the loworder
bit is the constant coefficient. For example the integer 0x00000803, in binary, is:
00000000 00000000 00001000 00000011
Therefore it correponds to the polynomial:
x^{32} + x^{11} + x + 1
private static int[] table32
private static int[] table40
private static int[] table48
private static int[] table54
private static int P_DEGREE
private static int READ_BUFFER_SIZE
private static int X_P_DEGREE
private boolean[] keys
private boolean useRabin
private int numFunctions
Constructor Detail 

public BloomFilter()
public BloomFilter(java.lang.String filter)
toString()
public BloomFilter(int numKeys, double errorRate)
errorRate
 The maximum error rate (false positives) for the Bloom Filter.public BloomFilter(int numKeys)
numKeys
 The length of the bit vector.public BloomFilter(int numKeys, int numHashFunctions)
numKeys
 The length of the bit vector.numHashFunctions
 The number of hash functions.Method Detail 

private int getHash(int fnum, int original)
fnum
 The number of the hash function.original
 The original value for the hash of the object.
Object.hashCode()
public boolean hasKey(java.lang.Object obj)
obj
 The key whose presence in this Bloom Filter is to be tested.
public void put(java.lang.Object obj)
obj
 The key to be added to this Bloom Filter.public java.lang.String toString()
toString
in class java.lang.Object
public java.lang.Object clone()
clone
in class java.lang.Object
Object.clone()
public int hashRabin(byte[] arr)
arr
 An array of bytes.
private int hashRabin(byte[] arr, int offset, int length, int ws)
arr
 An array of bytes.offset
 Index of the first byte of the array to hash.length
 Number of bytes to hash.ws
 ??
public int hashRabin(char[] arr)
arr
 An array of chars.
public int hashRabin(java.io.File f) throws java.io.FileNotFoundException, java.io.IOException
File
.
f
 A File
.
java.io.FileNotFoundException
 If the file cannot be found.
java.io.IOException
 If an error occurs while reading the file.public int hashRabin(java.io.InputStream is) throws java.io.IOException
InputStream
.
is
 An InputStream.
java.io.IOException
 if an error occurs while reading from the InputStream.public int hashRabin(int[] arr)
arr
 An array of integers.
public int hashRabin(java.lang.Object obj) throws java.io.IOException
obj
 An Object.
java.io.IOException
 If Object serialization fails.public int hashRabin(java.io.Serializable obj) throws java.io.IOException
obj
 An Object.
java.io.IOException
 If serialization fails.public int hashRabin(java.lang.String s)
s
 A String
.
public int hashRabin(java.net.URL url) throws java.io.IOException
url
 The URL of the document to be hashed.
java.io.IOException
 If an error occurs while reading the document.


PREV CLASS NEXT CLASS  FRAMES NO FRAMES  
SUMMARY: NESTED  FIELD  CONSTR  METHOD  DETAIL: FIELD  CONSTR  METHOD 