pt.tumba.spell
Class TernarySearchTrie

java.lang.Object
  extended by pt.tumba.spell.TernarySearchTrie

public class TernarySearchTrie
extends java.lang.Object

Implementation of a Ternary Search Trie, a data structure for storing String objects that combines the compact size of a binary search tree with the speed of a digital search trie, and is therefore ideal for practical use in sorting and searching data.

This data structure is faster than hashing for many typical search problems, and supports a broader range of useful problems and operations. Ternary searches are faster than hashing and more powerful, too.

The theory of ternary search trees was described at a symposium in 1997 (see "Fast Algorithms for Sorting and Searching Strings," by J.L. Bentley and R. Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997). Algorithms in C, Third Edition, by Robert Sedgewick (Addison-Wesley, 1998) provides yet another view of ternary search trees.

Author:
Bruno Martins

Nested Class Summary
protected  class TernarySearchTrie.TSTNode
          An inner class of Ternary Search Trie that represents a node in the trie.
 
Field Summary
private  int defaultNumReturnValues
          The default number of values returned by the matchAlmost method.
private  int matchAlmostDiff
          the number of differences allowed in a call to the matchAlmostKey method.
private  TernarySearchTrie.TSTNode rootNode
          The base node in the trie.
 
Constructor Summary
TernarySearchTrie()
          Constructs an empty Ternary Search Trie.
TernarySearchTrie(java.io.File file)
          Constructs a Ternary Search Trie and loads data from a File into the Trie.
TernarySearchTrie(java.io.File file, boolean compression)
          Constructs a Ternary Search Trie and loads data from a File into the Trie.
 
Method Summary
private static int compareCharsAlphabetically(int cCompare2, int cRef)
          Compares characters by alfabetical order.
private  void deleteNode(TernarySearchTrie.TSTNode nodeToDelete)
          Deletes the node passed in as an argument.
private  TernarySearchTrie.TSTNode deleteNodeRecursion(TernarySearchTrie.TSTNode currentNode)
          Recursivelly visits each node to be deleted.
 java.lang.Object get(java.lang.String key)
          Retrieve the object indexed by a key.
 java.lang.Integer getAndIncrement(java.lang.String key)
          Retrieve the Integer indexed by key, increment it by one unit and store the new Integer.
protected  java.lang.String getKey(TernarySearchTrie.TSTNode node)
          Returns the key that indexes the node argument.
 TernarySearchTrie.TSTNode getNode(java.lang.String key)
          Returns the node indexed by key, or null if that node doesn't exist.
protected  TernarySearchTrie.TSTNode getNode(java.lang.String key2, TernarySearchTrie.TSTNode startNode)
          Returns the node indexed by key, or null if that node doesn't exist.
protected  TernarySearchTrie.TSTNode getOrCreateNode(java.lang.String key)
          Returns the node indexed by key, creating that node if it doesn't exist, and creating any required intermediate nodes if they don't exist.
 java.util.List matchAlmost(java.lang.String key)
          Returns a List of keys that almost match the argument key.
protected  java.util.List matchAlmost(java.lang.String key, int numReturnValues)
          Returns a List of keys that almost match the argument key.
private  java.util.List matchAlmostRecursion(TernarySearchTrie.TSTNode currentNode, int charIndex, int d, java.lang.String matchAlmostKey, int matchAlmostNumReturnValues, java.util.List matchAlmostResult2, boolean upTo)
          Recursivelly vists the nodes in order to find the ones that almost match a given key.
 java.util.List matchPrefix(java.lang.String prefix)
          Returns an alphabetical List of all keys in the trie that begin with a given prefix.
 java.util.List matchPrefix(java.lang.String prefix, int numReturnValues)
          Returns an alphabetical List of all keys in the trie that begin with a given prefix.
 int numDataNodes()
          Returns the number of nodes in the trie that have non-null data.
protected  int numDataNodes(TernarySearchTrie.TSTNode startingNode)
          Returns the number of nodes in the subtrie below and including the starting node.
 int numNodes()
          Returns the total number of nodes in the trie.
protected  int numNodes(TernarySearchTrie.TSTNode startingNode)
          Returns the total number of nodes in the subtrie below and including the starting Node.
 void put(java.lang.String key, java.lang.Object value)
          Stores a value in the trie.
private  int recursiveNodeCalculator(TernarySearchTrie.TSTNode currentNode, boolean checkData, int numNodes2)
          Recursivelly visists each node to calculate the number of nodes.
 void remove(java.lang.String key)
          Removes the value indexed by key.
 void setMatchAlmostDiff(int diff)
          Sets the number of characters by which words can differ from target word when calling the matchAlmost method.
 void setNumReturnValues(int num)
          Sets the default maximum number of values returned from the matchPrefix and matchAlmost methods.
protected  java.util.List sortKeys(TernarySearchTrie.TSTNode startNode, int numReturnValues)
          Returns keys sorted in alphabetical order.
private  java.util.List sortKeysRecursion(TernarySearchTrie.TSTNode currentNode, int sortKeysNumReturnValues, java.util.List sortKeysResult2)
          Returns keys sorted in alphabetical order.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

defaultNumReturnValues

private int defaultNumReturnValues
The default number of values returned by the matchAlmost method.


matchAlmostDiff

private int matchAlmostDiff
the number of differences allowed in a call to the matchAlmostKey method.


rootNode

private TernarySearchTrie.TSTNode rootNode
The base node in the trie.

Constructor Detail

TernarySearchTrie

public TernarySearchTrie()
Constructs an empty Ternary Search Trie.


TernarySearchTrie

public TernarySearchTrie(java.io.File file)
                  throws java.io.IOException
Constructs a Ternary Search Trie and loads data from a File into the Trie. The file is a normal text document, where each line is of the form word : integer.

Parameters:
file - The File with the data to load into the Trie.
Throws:
java.io.IOException - A problem occured while reading the data.

TernarySearchTrie

public TernarySearchTrie(java.io.File file,
                         boolean compression)
                  throws java.io.IOException
Constructs a Ternary Search Trie and loads data from a File into the Trie. The file is a normal text document, where each line is of the form " word : integer".

Parameters:
file - The File with the data to load into the Trie.
compression - If true, the file is compressed with the GZIP algorithm, and if false, the file is a normal text document.
Throws:
java.io.IOException - A problem occured while reading the data.
Method Detail

compareCharsAlphabetically

private static int compareCharsAlphabetically(int cCompare2,
                                              int cRef)
Compares characters by alfabetical order.

Parameters:
cCompare2 - The first char in the comparison.
cRef - The second char in the comparison.
Returns:
A negative number, 0 or a positive number if the second char is less, equal or greater.

deleteNode

private void deleteNode(TernarySearchTrie.TSTNode nodeToDelete)
Deletes the node passed in as an argument. If this node has non-null data, then both the node and the data will be deleted. It also deletes any other nodes in the trie that are no longer needed after the deletion of the node.

Parameters:
nodeToDelete - The node to delete.

deleteNodeRecursion

private TernarySearchTrie.TSTNode deleteNodeRecursion(TernarySearchTrie.TSTNode currentNode)
Recursivelly visits each node to be deleted. To delete a node, first set its data to null, then pass it into this method, then pass the node returned by this method into this method (make sure you don't delete the data of any of the nodes returned from this method!) and continue in this fashion until the node returned by this method is null. The TSTNode instance returned by this method will be next node to be operated on by deleteNodeRecursion (This emulates recursive method call while avoiding the JVM overhead normally associated with a recursive method.)

Parameters:
currentNode - The node to delete.
Returns:
The next node to be called in deleteNodeRecursion.

get

public java.lang.Object get(java.lang.String key)
Retrieve the object indexed by a key.

Parameters:
key - A String index.
Returns:
The object retrieved from the Ternary Search Trie.

getAndIncrement

public java.lang.Integer getAndIncrement(java.lang.String key)
Retrieve the Integer indexed by key, increment it by one unit and store the new Integer.

Parameters:
key - A String index.
Returns:
The integer retrieved from the Ternary Search Trie.

getKey

protected java.lang.String getKey(TernarySearchTrie.TSTNode node)
Returns the key that indexes the node argument.

Parameters:
node - The node whose index is to be calculated.
Returns:
The String that indexes the node argument.

getNode

public TernarySearchTrie.TSTNode getNode(java.lang.String key)
Returns the node indexed by key, or null if that node doesn't exist. Search begins at root node.

Parameters:
key - A String that indexes the node that is returned.
Returns:
The node object indexed by key. This object is an instance of an inner class named TernarySearchTrie.TSTNode.

getNode

protected TernarySearchTrie.TSTNode getNode(java.lang.String key2,
                                            TernarySearchTrie.TSTNode startNode)
Returns the node indexed by key, or null if that node doesn't exist. The search begins at root node.

Parameters:
key2 - A String that indexes the node that is returned.
startNode - The top node defining the subtrie to be searched.
Returns:
The node object indexed by key. This object is an instance of an inner class named TernarySearchTrie.TSTNode.

getOrCreateNode

protected TernarySearchTrie.TSTNode getOrCreateNode(java.lang.String key)
                                             throws java.lang.NullPointerException,
                                                    java.lang.IllegalArgumentException
Returns the node indexed by key, creating that node if it doesn't exist, and creating any required intermediate nodes if they don't exist.

Parameters:
key - A String that indexes the node that is returned.
Returns:
The node object indexed by key. This object is an instance of an inner class named TernarySearchTrie.TSTNode.
Throws:
java.lang.NullPointerException - If the key is null.
java.lang.IllegalArgumentException - If the key is an empty String.

matchAlmost

public java.util.List matchAlmost(java.lang.String key)
Returns a List of keys that almost match the argument key. Keys returned will have exactly diff characters that do not match the target key, where diff is equal to the last value passed in as an argument to the setMatchAlmostDiff method.

If the matchAlmost method is called before the setMatchAlmostDiff method has been called for the first time, then diff = 0.

Parameters:
key - The target key.
Returns:
A List with the results.

matchAlmost

protected java.util.List matchAlmost(java.lang.String key,
                                     int numReturnValues)
Returns a List of keys that almost match the argument key. Keys returned will have exactly diff characters that do not match the target key, where diff is equal to the last value passed in as an argument to the setMatchAlmostDiff method.

If the matchAlmost method is called before the setMatchAlmostDiff method has been called for the first time, then diff = 0.

Parameters:
key - The target key.
numReturnValues - The maximum number of values returned by this method.
Returns:
A List with the results

matchAlmostRecursion

private java.util.List matchAlmostRecursion(TernarySearchTrie.TSTNode currentNode,
                                            int charIndex,
                                            int d,
                                            java.lang.String matchAlmostKey,
                                            int matchAlmostNumReturnValues,
                                            java.util.List matchAlmostResult2,
                                            boolean upTo)
Recursivelly vists the nodes in order to find the ones that almost match a given key.

Parameters:
currentNode - The current node.
charIndex - The current char.
d - The number of differences so far.
matchAlmostNumReturnValues - The maximum number of values in the result List.
matchAlmostResult2 - The results so far.
upTo - If true all keys having up to and including matchAlmostDiff mismatched letters will be included in the result (including a key that is exactly the same as the target string) otherwise keys will be included in the result only if they have exactly matchAlmostDiff number of mismatched letters.
matchAlmostKey - The key being searched.
Returns:
A List with the results.

matchPrefix

public java.util.List matchPrefix(java.lang.String prefix)
Returns an alphabetical List of all keys in the trie that begin with a given prefix. Only keys for nodes having non-null data are included in the List.

Parameters:
prefix - Each key returned from this method will begin with the characters in prefix.
Returns:
A List with the results.

matchPrefix

public java.util.List matchPrefix(java.lang.String prefix,
                                  int numReturnValues)
Returns an alphabetical List of all keys in the trie that begin with a given prefix. Only keys for nodes having non-null data are included in the List.

Parameters:
prefix - Each key returned from this method will begin with the characters in prefix.
numReturnValues - The maximum number of values returned from this method.
Returns:
A List with the results

numDataNodes

public int numDataNodes()
Returns the number of nodes in the trie that have non-null data.

Returns:
The number of nodes in the trie that have non-null data.

numDataNodes

protected int numDataNodes(TernarySearchTrie.TSTNode startingNode)
Returns the number of nodes in the subtrie below and including the starting node. The method counts only nodes that have non-null data.

Parameters:
startingNode - The top node of the subtrie. the node that defines the subtrie.
Returns:
The total number of nodes in the subtrie.

numNodes

public int numNodes()
Returns the total number of nodes in the trie. The method counts nodes whether or not they have data.

Returns:
The total number of nodes in the trie.

numNodes

protected int numNodes(TernarySearchTrie.TSTNode startingNode)
Returns the total number of nodes in the subtrie below and including the starting Node. The method counts nodes whether or not they have data.

Parameters:
startingNode - The top node of the subtrie. The node that defines the subtrie.
Returns:
The total number of nodes in the subtrie.

put

public void put(java.lang.String key,
                java.lang.Object value)
Stores a value in the trie. The value may be retrieved using the key.

Parameters:
key - A String that indexes the object to be stored.
value - The object to be stored in the Trie.

recursiveNodeCalculator

private int recursiveNodeCalculator(TernarySearchTrie.TSTNode currentNode,
                                    boolean checkData,
                                    int numNodes2)
Recursivelly visists each node to calculate the number of nodes.

Parameters:
currentNode - The current node.
checkData - If true we check the data to be different of null.
numNodes2 - The number of nodes so far.
Returns:
The number of nodes accounted.

remove

public void remove(java.lang.String key)
Removes the value indexed by key. Also removes all nodes that are rendered unnecessary by the removal of this data.

Parameters:
key - A string that indexes the object to be removed from the Trie.

setMatchAlmostDiff

public void setMatchAlmostDiff(int diff)
Sets the number of characters by which words can differ from target word when calling the matchAlmost method.

Arguments less than 0 will set the char difference to 0, and arguments greater than 3 will set the char difference to 3.

Parameters:
diff - The number of characters by which words can differ from target word.

setNumReturnValues

public void setNumReturnValues(int num)
Sets the default maximum number of values returned from the matchPrefix and matchAlmost methods.

The value should be set this to -1 to get an unlimited number of return values. note that the methods mentioned above provide overloaded versions that allow you to specify the maximum number of return values, in which case this value is temporarily overridden.

Parameters:
num - The number of values that will be returned when calling the methods above.

sortKeys

protected java.util.List sortKeys(TernarySearchTrie.TSTNode startNode,
                                  int numReturnValues)
Returns keys sorted in alphabetical order. This includes the start Node and all nodes connected to the start Node.

The number of keys returned is limited to numReturnValues. To get a list that isn't limited in size, set numReturnValues to -1.

Parameters:
startNode - The top node defining the subtrie to be searched.
numReturnValues - The maximum number of values returned from this method.
Returns:
A List with the results.

sortKeysRecursion

private java.util.List sortKeysRecursion(TernarySearchTrie.TSTNode currentNode,
                                         int sortKeysNumReturnValues,
                                         java.util.List sortKeysResult2)
Returns keys sorted in alphabetical order. This includes the current Node and all nodes connected to the current Node.

Sorted keys will be appended to the end of the resulting List. The result may be empty when this method is invoked, but may not be null.

Parameters:
currentNode - The current node.
sortKeysNumReturnValues - The maximum number of values in the result.
sortKeysResult2 - The results so far.
Returns:
A List with the results.