pt.tumba.spell
Class LevenshteinDistance

java.lang.Object
  extended by pt.tumba.spell.LevenshteinDistance

public final class LevenshteinDistance
extends java.lang.Object

This classs holds the methods to compute a modified Levenshtein distance.

Levenshtein distance (LD) is a measure of the similarity between two String objects, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example,

If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed and the strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are. The measure is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance. TODO: The spelling checker currently doesn't use this class. The code also needs to be optimized.


Field Summary
private  int[][] matrix
          A reusable int matrix, used to compute the Levenshtein distance.
private  int maxM
          The maximum used number of columns in the matrix.
private  int maxN
          The maximum used number of rows in the matrix.
 
Constructor Summary
LevenshteinDistance()
          Sole constructor for LevenshteinDistance.
 
Method Summary
private  int[][] getMatrix(int n, int m)
          Reuses the same matrix, making sure that there is enough room in it.
 int levenshteinDistance(char[] s, char[] t, char[] lowercaseS, char[] lowercaseT, boolean useCaseDifference)
          This method returns a modified Levenshtein distance for any given two String objects.
 double levenshteinDistance(java.lang.String s, java.lang.String t)
          The Levenshtein distance for any given two String objects.
private static int minimum(int a, int b, int c)
          Return the minimum of three values.
 double modifiedLevenshteinDistance(java.lang.String s, java.lang.String t)
          A modified Levenshtein distance for any given two String objects.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

matrix

private int[][] matrix
A reusable int matrix, used to compute the Levenshtein distance.


maxN

private int maxN
The maximum used number of rows in the matrix.


maxM

private int maxM
The maximum used number of columns in the matrix.

Constructor Detail

LevenshteinDistance

public LevenshteinDistance()
Sole constructor for LevenshteinDistance.

Method Detail

levenshteinDistance

public double levenshteinDistance(java.lang.String s,
                                  java.lang.String t)
The Levenshtein distance for any given two String objects.

Parameters:
s - Source String.
t - Target String.
Returns:
The Levenshtein distance between the two Stringobjects.

modifiedLevenshteinDistance

public double modifiedLevenshteinDistance(java.lang.String s,
                                          java.lang.String t)
A modified Levenshtein distance for any given two String objects. It is noted whether there is any case difference, an extra .5 is added to the distance. This slighly penalizes for case differences, rather than giving a severe penalization when each character case difference is counted as a full transformation.

Parameters:
s - Source String.
t - Target String.
Returns:
The modified Levenshtein distance between the two String objects.

levenshteinDistance

public int levenshteinDistance(char[] s,
                               char[] t,
                               char[] lowercaseS,
                               char[] lowercaseT,
                               boolean useCaseDifference)
This method returns a modified Levenshtein distance for any given two String objects. It is noted whether there is any case difference, an extra .5 is added to the distance. This slighly penalizes for case differences, rather than giving a severe penalization when each character case difference is counted as a full transformation.

The char arrays are being passed in to speed things up. Also, the lowercase char arrays are being passed in to speed things up. It was observed that in a typical invocation of this method, this method gets called over and over again, with one of the strings being held constant, and one of them changing. By being able to pass in the char arrays, I can convert the string that remains constant only once to a char array, and lowercase it only once. I can do array indexing rather than method invocation which should speed things up a bit as well.

To futher the efficiencies, this method returns an int of the distance that is the distance multiplied by 1000. Thus, an edit distance of 2 multiplied by 1000 = 2000;

Parameters:
s - Source char array
t - Target char array
lowercaseS - LowerCase version of the source char array
lowercaseT - LowerCase version of the target char array
useCaseDifference - If true, than case differences result in an extra .5 added to the distance
Returns:
The modified Levenshtein distance multiplied by 1000.

minimum

private static int minimum(int a,
                           int b,
                           int c)
Return the minimum of three values.

Parameters:
a - Value 1
b - Value 2
c - Value 3
Returns:
Return the minimum between the three values.

getMatrix

private int[][] getMatrix(int n,
                          int m)
Reuses the same matrix, making sure that there is enough room in it.

Parameters:
n - The number of rows.
m - The number of columns.
Returns:
int[][] A matrix.