|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object pt.tumba.spell.LevenshteinDistance
public final class LevenshteinDistance
This classs holds the methods to compute a modified Levenshtein distance.
Levenshtein distance (LD) is a measure of the similarity between two String
objects,
which we will refer to as the source string (s) and the target string (t). The distance is
the number of deletions, insertions, or substitutions required to transform s into t.
For example,
If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed and the strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are. The measure is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance. TODO: The spelling checker currently doesn't use this class. The code also needs to be optimized.
Field Summary | |
---|---|
private int[][] |
matrix
A reusable int matrix, used to compute the Levenshtein distance. |
private int |
maxM
The maximum used number of columns in the matrix. |
private int |
maxN
The maximum used number of rows in the matrix. |
Constructor Summary | |
---|---|
LevenshteinDistance()
Sole constructor for LevenshteinDistance. |
Method Summary | |
---|---|
private int[][] |
getMatrix(int n,
int m)
Reuses the same matrix, making sure that there is enough room in it. |
int |
levenshteinDistance(char[] s,
char[] t,
char[] lowercaseS,
char[] lowercaseT,
boolean useCaseDifference)
This method returns a modified Levenshtein distance for any given two String objects. |
double |
levenshteinDistance(java.lang.String s,
java.lang.String t)
The Levenshtein distance for any given two String objects. |
private static int |
minimum(int a,
int b,
int c)
Return the minimum of three values. |
double |
modifiedLevenshteinDistance(java.lang.String s,
java.lang.String t)
A modified Levenshtein distance for any given two String objects. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private int[][] matrix
private int maxN
private int maxM
Constructor Detail |
---|
public LevenshteinDistance()
Method Detail |
---|
public double levenshteinDistance(java.lang.String s, java.lang.String t)
String
objects.
s
- Source String
.t
- Target String
.
String
objects.public double modifiedLevenshteinDistance(java.lang.String s, java.lang.String t)
String
objects.
It is noted whether there is any case difference, an extra .5 is added to the distance.
This slighly penalizes for case differences, rather than giving a severe penalization
when each character case difference is counted as a full transformation.
s
- Source String
.t
- Target String
.
String
objects.public int levenshteinDistance(char[] s, char[] t, char[] lowercaseS, char[] lowercaseT, boolean useCaseDifference)
String
objects.
It is noted whether there is any case difference, an extra .5 is added to the distance. This
slighly penalizes for case differences, rather than giving a severe penalization
when each character case difference is counted as a full transformation.
The char arrays are being passed in to speed things up. Also, the lowercase char arrays are being passed in to speed things up. It was observed that in a typical invocation of this method, this method gets called over and over again, with one of the strings being held constant, and one of them changing. By being able to pass in the char arrays, I can convert the string that remains constant only once to a char array, and lowercase it only once. I can do array indexing rather than method invocation which should speed things up a bit as well.
To futher the efficiencies, this method returns an int of the distance that is the distance multiplied by 1000. Thus, an edit distance of 2 multiplied by 1000 = 2000;
s
- Source char arrayt
- Target char arraylowercaseS
- LowerCase version of the source char arraylowercaseT
- LowerCase version of the target char arrayuseCaseDifference
- If true, than case differences result in an extra .5 added to the distance
private static int minimum(int a, int b, int c)
a
- Value 1b
- Value 2c
- Value 3
private int[][] getMatrix(int n, int m)
n
- The number of rows.m
- The number of columns.
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |