|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object pt.tumba.spell.BloomFilter
public class BloomFilter
Implementation of a Bloom Filter data structure, an elegant alternative to the lookup hash table.
Bloom filters allow you to perform membership tests in just a fraction of the memory you'd need to store a full list of keys. As you might suspect, the savings in space comes at a price: you run an adjustable risk of false positives, and you can't remove a key from a filter once you've added it in. But in the many cases where those constraints are acceptable, a Bloom filter can make a useful tool.
Bloom filters are named after Burton Bloom, who first described them in a 1970 paper entitled Space/time trade-offs in hash coding with allowable errors. In those days of limited memory, Bloom filters were prized primarily for their compactness; in fact, one of their earliest applications was in spell checkers.
A Bloom filter consists of two components: a set of k
hash functions and a bit vector of
a given length. We choose the length of the bit vector and the number of hash functions
depending on how many keys we want to add to the set and how high an error rate we are
willing to put up with.
All of the hash functions in a Bloom filter are configured so that their range matches the length of the bit vector. For example, if a vector is 200 bits long, the hash functions return a value between 1 and 200. It's important to use high-quality hash functions in the filter to guarantee that output is equally distributed over all possible values -- "hot spots" in a hash function would increase our false-positive rate.
To enter a key into a Bloom filter, we run it through each one of the k hash functions and treat the result as an offset into the bit vector, turning on whatever bit we find at that position. If the bit is already set, we leave it on. There's no mechanism for turning bits off in a Bloom filter.
Checking to see whether a key already exists in a filter is exactly analogous to adding a new key. We run the key through our set of hash functions, and then check to see whether the bits at those offsets are all turned on. If any of the bits is off, we know for certain the key is not in the filter. If all of the bits are on, we know the key is probably there.
As you might expect, the false-positive rate depends on the bit vector length and the number of keys stored in the filter. The roomier the bit vector, the smaller the probability that all k bits we check will be on, unless the key actually exists in the filter. The relationship between the number of hash functions and the false-positive rate is more subtle. If you use too few hash functions, there won't be enough discrimination between keys; but if you use too many, the filter will be very dense, increasing the probability of collisions. You can calculate the false-positive rate for any filter using the formula:
c = ( 1 - e(-kn/m) )k
Where c is the false positive rate, k is the number of hash functions, n is the number of keys in the filter, and m is the length of the filter in bits.
When using Bloom filters, we very frequently have a desired false-positive rate in mind and we are also likely to have a rough idea of how many keys we want to add to the filter. We need some way of finding out how large a bit vector is to make sure the false-positive rate never exceeds our limit. The following equation will give us vector length from the error rate and number of keys:
m = -kn / ( ln( 1 - c ^ 1/k ) )
You'll notice another free variable here: k, the number of hash functions. However, it's possible to use calculus to find a minimum for k. You can also find lookup tables for various combinations of error rate, filter size, and number of hash functions at Bloom Filters -- the math.
This implementation uses the hashCode()
method supplied for all Java objects, which
produces a 32-bit signed int number. For example, in String
Objects, the hashcode is usually
computed by adding up the character values with an prime multiplier (31, in the case of JDK 1.4).
Alternatively, this class can also use an implementation of a hash function based on Rabin fingerprints, which can efficiently produce a 32-bit hash value for a sequence of bytes. It does so by considering strings of bytes as large polynomials with coefficients of 0 and 1 and then reducing them modulo some irreducible polynomial of degree 32. The result is a hash function with very satisfactory properties. In addition the polynomial operations are fast in hardware, and even in this Java implementation the speed is reasonable.
The implementation is derived from the paper "Some applications of Rabin's fingerprinting method" by Andrei Broder. See http://server3.pa-x.dec.com/SRC/publications/src-papers.html for a full citation and the paper in PDF format.
Included in this class are additional methods that can compute the Rabin hash value
for any serializable Object
, String
, File
, or resource denoted by URL
.
As for the multiple hash functions for the Bloom Filter, these are based on the module of the initial value multiplied by a list of distinct values.
Object.hashCode()
,
Map
Field Summary | |
---|---|
private byte[] |
buffer
A buffer for the Rabin fingerprinting algorithm. |
private boolean[] |
keys
The bit vector for the Bloom Filter. |
private int |
numFunctions
The number of hash functions. |
private static int |
P_DEGREE
The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm. |
private static int |
POLYNOMIAL
The 32 bits of this integer represent the coefficients of the degree 32 irreducible polynomial over GF(2); that is, every coefficient is 0 or 1. |
private static int |
READ_BUFFER_SIZE
The size of the buffer for the Rabin fingerprinting algorithm. |
private static int[] |
table32
Internal values for the Rabin fingerprinting algorithm. |
private static int[] |
table40
Internal values for the Rabin fingerprinting algorithm. |
private static int[] |
table48
Internal values for the Rabin fingerprinting algorithm. |
private static int[] |
table54
Internal values for the Rabin fingerprinting algorithm. |
private boolean |
useRabin
Use Rabin's fingerprinting algorithm ( default is true ). |
private static int |
X_P_DEGREE
The degree for the irreducible polynomial used by the Rabin fingerprinting algorithm. |
Constructor Summary | |
---|---|
BloomFilter()
Constructs an empty BloomFilter with the default number of hash functions (10) and the default length for the bit vector (1000). |
|
BloomFilter(int numKeys)
Constructs an empty BloomFilter with the default number of hash functions (10) and a given length for the bit vector. |
|
BloomFilter(int numKeys,
double errorRate)
Constructs an empty BloomFilter with a given length for the bit vector, guarenteeing a maximum error rate. |
|
BloomFilter(int numKeys,
int numHashFunctions)
Constructs an empty BloomFilter with a given number of hash functions and a given length for the bit vector. |
|
BloomFilter(java.lang.String filter)
Constructs a Bloom Filter from a string representation. |
Method Summary | |
---|---|
java.lang.Object |
clone()
Returns a copy of this Bloom Filter instance. |
private int |
getHash(int fnum,
int original)
Internal method for producing the hash value for a given function number. |
int |
hashRabin(byte[] arr)
Return the Rabin hash value of an array of bytes. |
private int |
hashRabin(byte[] arr,
int offset,
int length,
int ws)
Return the Rabin hash value of an array of bytes. |
int |
hashRabin(char[] arr)
Return the Rabin hash value of an array of chars. |
int |
hashRabin(java.io.File f)
Computes the Rabin hash value of the contents of a File . |
int |
hashRabin(java.io.InputStream is)
Computes the Rabin hash value of the data from an InputStream . |
int |
hashRabin(int[] arr)
Returns the Rabin hash value of an array of integers. |
int |
hashRabin(java.lang.Object obj)
Computes the Rabin hash value of a given Object. |
int |
hashRabin(java.io.Serializable obj)
Computes the Rabin hash value of a given serializable Object. |
int |
hashRabin(java.lang.String s)
Computes the Rabin hash value of a String. |
int |
hashRabin(java.net.URL url)
Computes the Rabin hash value of the contents of a Web document, specified by an URL. |
boolean |
hasKey(java.lang.Object obj)
Returns true if this Bloom Filter contains the specified key. |
void |
put(java.lang.Object obj)
Adds the specified key in this Bloom Filter. |
java.lang.String |
toString()
Returns a string representation of this Bloom Filter. |
Methods inherited from class java.lang.Object |
---|
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
private byte[] buffer
private static int POLYNOMIAL
int
is the degree 31 term's coefficient, and the low-order
bit is the constant coefficient. For example the integer 0x00000803, in binary, is:
00000000 00000000 00001000 00000011
Therefore it correponds to the polynomial:
x32 + x11 + x + 1
private static int[] table32
private static int[] table40
private static int[] table48
private static int[] table54
private static int P_DEGREE
private static int READ_BUFFER_SIZE
private static int X_P_DEGREE
private boolean[] keys
private boolean useRabin
private int numFunctions
Constructor Detail |
---|
public BloomFilter()
public BloomFilter(java.lang.String filter)
toString()
public BloomFilter(int numKeys, double errorRate)
errorRate
- The maximum error rate (false positives) for the Bloom Filter.public BloomFilter(int numKeys)
numKeys
- The length of the bit vector.public BloomFilter(int numKeys, int numHashFunctions)
numKeys
- The length of the bit vector.numHashFunctions
- The number of hash functions.Method Detail |
---|
private int getHash(int fnum, int original)
fnum
- The number of the hash function.original
- The original value for the hash of the object.
Object.hashCode()
public boolean hasKey(java.lang.Object obj)
obj
- The key whose presence in this Bloom Filter is to be tested.
public void put(java.lang.Object obj)
obj
- The key to be added to this Bloom Filter.public java.lang.String toString()
toString
in class java.lang.Object
public java.lang.Object clone()
clone
in class java.lang.Object
Object.clone()
public int hashRabin(byte[] arr)
arr
- An array of bytes.
private int hashRabin(byte[] arr, int offset, int length, int ws)
arr
- An array of bytes.offset
- Index of the first byte of the array to hash.length
- Number of bytes to hash.ws
- ??
public int hashRabin(char[] arr)
arr
- An array of chars.
public int hashRabin(java.io.File f) throws java.io.FileNotFoundException, java.io.IOException
File
.
f
- A File
.
java.io.FileNotFoundException
- If the file cannot be found.
java.io.IOException
- If an error occurs while reading the file.public int hashRabin(java.io.InputStream is) throws java.io.IOException
InputStream
.
is
- An InputStream.
java.io.IOException
- if an error occurs while reading from the InputStream.public int hashRabin(int[] arr)
arr
- An array of integers.
public int hashRabin(java.lang.Object obj) throws java.io.IOException
obj
- An Object.
java.io.IOException
- If Object serialization fails.public int hashRabin(java.io.Serializable obj) throws java.io.IOException
obj
- An Object.
java.io.IOException
- If serialization fails.public int hashRabin(java.lang.String s)
s
- A String
.
public int hashRabin(java.net.URL url) throws java.io.IOException
url
- The URL of the document to be hashed.
java.io.IOException
- If an error occurs while reading the document.
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |