pt.tumba.spell
Class SpellChecker

java.lang.Object
  extended by pt.tumba.spell.SpellChecker

public class SpellChecker
extends java.lang.Object

The main class of the spell checking package.

Author:
Bruno Martins

Field Summary
private  CommonMisspellings commonErrors
          A dictionary of common misspellings.
private  TernarySearchTrie dictionary
          The main dictionary for the spelling checker.
private  boolean useBigrams
          Use bigrams for context-dependent spelling correction.
 
Constructor Summary
SpellChecker()
           
 
Method Summary
 java.lang.String findMostSimilar(java.lang.String key)
          Takes a word and returns the most similar word from the dictionary, using Levenshtein distance, phonetic similarity, keyboard proximity and other heuristics to measure similarity.
 java.lang.String findMostSimilar(java.lang.String key, boolean useFrequency)
          Takes a word and returns the most similar word from the dictionary, using Levenshtein distance, phonetic similarity, keyboard proximity and other heuristics to measure similarity, optionally applying the relative frequency method.
 java.util.List findMostSimilarList(java.lang.String key)
          Takes a word and returns a List of similar words from the dictionary, ranked by Levenshtein distance.
 SpellChecker getInstance()
          Deprecated. TODO: Remove this method and check dependencies with other code.
private static java.lang.String heuristicsPortuguese(java.lang.String str)
          Phonetic heuristics for the Portuguese language, taking as input a Portuguese word and replacing letters and groups of letters that correspond to a specific "sound" with a canonical representation.
 void initialize(java.lang.String path)
          Reads the dictionary into memory.
 void initialize(java.lang.String path1, java.lang.String path2)
          Reads the dictionary and a dictionary of common misspellings into memory.
 void initialize(java.lang.String path1, java.lang.String path2, java.lang.String path3)
          Reads the dictionary, a dictionary of common misspellings and a dictionary of correct spellings into memory.
static void main(java.lang.String[] args)
          Main method.
 java.lang.String spellCheck(java.lang.String s)
          Checks spelling errors in terms from a given String.
 java.lang.String spellCheckQuery(java.lang.String s)
          Checks spelling errors in terms for a search engine query, ignoring commands to the search system.
 java.lang.String spellCheckTeX(java.lang.String s)
          Checks spelling errors in terms from a TeX document.
 java.lang.String spellCheckWord(java.lang.String word)
          Checks whether a word is correctly spelled, producing as output a string with the word plus SGML tags indicating whether it is correctly spelled.
 java.lang.String spellCheckXML(java.lang.String s)
          Checks spelling errors in terms from an XML document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

dictionary

private TernarySearchTrie dictionary
The main dictionary for the spelling checker.
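
The TernarySearchTrie type is not documented on this page. As a rough illustration of the data structure its name suggests, the following is a minimal ternary search trie with insertion and exact lookup. The class TernarySearchTrieSketch and its methods are hypothetical names introduced here; this is a sketch, not the package's implementation.

```java
class TernarySearchTrieSketch {
    // Each node holds one character and three links: smaller, equal (next char), larger.
    private static final class Node {
        final char ch;
        Node low, equal, high;
        boolean isWord;

        Node(char ch) {
            this.ch = ch;
        }
    }

    private Node root;

    // Adds a non-empty word to the trie; empty strings are ignored.
    void insert(String word) {
        if (word.isEmpty()) {
            return;
        }
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) {
            node = new Node(c);
        }
        if (c < node.ch) {
            node.low = insert(node.low, word, i);
        } else if (c > node.ch) {
            node.high = insert(node.high, word, i);
        } else if (i < word.length() - 1) {
            node.equal = insert(node.equal, word, i + 1);
        } else {
            node.isWord = true; // last character of the word
        }
        return node;
    }

    // Returns true only if the exact word was inserted before (prefixes do not count).
    boolean contains(String word) {
        if (word.isEmpty()) {
            return false;
        }
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.ch) {
                node = node.low;
            } else if (c > node.ch) {
                node = node.high;
            } else if (i == word.length() - 1) {
                return node.isWord;
            } else {
                node = node.equal;
                i++;
            }
        }
        return false;
    }
}
```

A trie of this kind stores shared prefixes once, which suits large word lists where lookups proceed one character at a time.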


commonErrors

private CommonMisspellings commonErrors
A dictionary of common misspellings.


useBigrams

private boolean useBigrams
Use bigrams for context-dependent spelling correction.

Constructor Detail

SpellChecker

public SpellChecker()
Method Detail

getInstance

public SpellChecker getInstance()
Deprecated. TODO: Remove this method and check dependencies with other code.

Returns an instance of this class. This method exists only for backward compatibility: in a previous version this class was a Singleton and the dictionary was stored in a static variable.

Returns:
An instance of SpellChecker.

heuristicsPortuguese

private static java.lang.String heuristicsPortuguese(java.lang.String str)
Phonetic heuristics for the Portuguese language, taking as input a Portuguese word and replacing letters and groups of letters that correspond to a specific "sound" with a canonical representation.

Parameters:
str - A String with a Portuguese word.
Returns:
A "normalized" representation of the Portuguese word, where groups of letters that have the same sound are represented in a canonical way.
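
The actual replacement rules of this private method are not listed here. The sketch below only illustrates the general idea of mapping spellings that share a sound onto one canonical symbol. The class PhoneticSketch, the method normalize and the four rules are assumptions chosen for illustration, not the rules the package applies.

```java
import java.util.Locale;

class PhoneticSketch {
    // Rewrites a few letter groups into one canonical symbol per sound.
    // Upper-case symbols are used so they cannot collide with ordinary lower-case letters.
    // The rule set is illustrative only (assumption), not the library's real rule set.
    static String normalize(String word) {
        String s = word.toLowerCase(Locale.ROOT);
        s = s.replace("\u00e7", "S"); // c-cedilla sounds like "ss"
        s = s.replace("ss", "S");
        s = s.replace("ch", "X");
        s = s.replace("ph", "f");     // archaic spelling of the "f" sound
        return s;
    }
}
```

With these rules, two spellings that differ only in how the same sound is written produce the same normalized form, so they can be compared as equal when measuring phonetic similarity.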

initialize

public void initialize(java.lang.String path)
                throws java.lang.Exception
Reads the dictionary into memory.

Parameters:
path - The file path of the dictionary.
Throws:
java.lang.Exception - if any problem occurs while reading the dictionary.

initialize

public void initialize(java.lang.String path1,
                       java.lang.String path2)
                throws java.lang.Exception
Reads the dictionary and a dictionary of common misspellings into memory.

Parameters:
path1 - The file path of the dictionary.
path2 - The file path of a dictionary of common misspellings.
Throws:
java.lang.Exception - if any problem occurs while reading the dictionaries.

initialize

public void initialize(java.lang.String path1,
                       java.lang.String path2,
                       java.lang.String path3)
                throws java.lang.Exception
Reads the dictionary, a dictionary of common misspellings and a dictionary of correct spellings into memory.

Parameters:
path1 - The file path of the dictionary.
path2 - The file path of a dictionary of common misspellings.
path3 - The file path of a dictionary of correct spellings.
Throws:
java.lang.Exception - if any problem occurs while reading the dictionaries.

findMostSimilar

public java.lang.String findMostSimilar(java.lang.String key)
Takes a word and returns the most similar word from the dictionary, using Levenshtein distance, phonetic similarity, keyboard proximity and other heuristics to measure similarity.

Parameters:
key - The word to check in the dictionary.
Returns:
The most similar word in the dictionary.
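
How the method weighs keyboard proximity is not specified on this page. One possible reading, sketched below, is to treat a substitution between adjacent keys as cheaper than one between distant keys. The class KeyboardProximitySketch, the simplified QWERTY grid (row stagger ignored) and the 0.5 cost are all assumptions made for illustration.

```java
class KeyboardProximitySketch {
    // Letter rows of a QWERTY layout, treated as an unstaggered grid (assumption).
    private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    // Returns {row, column} of a letter, or null if the character is not on the grid.
    static int[] locate(char c) {
        char lower = Character.toLowerCase(c);
        for (int r = 0; r < ROWS.length; r++) {
            int col = ROWS[r].indexOf(lower);
            if (col >= 0) {
                return new int[] {r, col};
            }
        }
        return null;
    }

    // Two distinct keys are neighbours if they are at most one row and one column apart.
    static boolean areNeighbours(char a, char b) {
        int[] pa = locate(a);
        int[] pb = locate(b);
        if (pa == null || pb == null) {
            return false;
        }
        boolean sameKey = pa[0] == pb[0] && pa[1] == pb[1];
        return !sameKey
                && Math.abs(pa[0] - pb[0]) <= 1
                && Math.abs(pa[1] - pb[1]) <= 1;
    }

    // Substitution cost for an edit-distance computation: a likely slip of the finger
    // (adjacent key) costs less than an arbitrary replacement. The weights are illustrative.
    static double substitutionCost(char typed, char intended) {
        if (typed == intended) {
            return 0.0;
        }
        return areNeighbours(typed, intended) ? 0.5 : 1.0;
    }
}
```

A cost function like this could replace the fixed substitution cost of 1 in a standard edit-distance computation, so that candidates reachable through plausible typing slips rank higher.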

findMostSimilar

public java.lang.String findMostSimilar(java.lang.String key,
                                        boolean useFrequency)
Takes a word and returns the most similar word from the dictionary, using Levenshtein distance, phonetic similarity, keyboard proximity and other heuristics to measure similarity, optionally applying the relative frequency method.

Parameters:
key - The word to check in the dictionary.
useFrequency - Use the relative frequency method.
Returns:
The most similar word in the dictionary.
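
The "relative frequency method" is not defined on this page. A common way to use word frequency in spelling correction, shown here purely as an assumption, is to break ties between equally distant candidates in favour of the more frequent word. The class FrequencyTieBreakSketch, the nested Candidate type and the method pick are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

class FrequencyTieBreakSketch {
    // A candidate correction with a precomputed edit distance and a corpus frequency.
    static final class Candidate {
        final String word;
        final int distance;
        final long frequency;

        Candidate(String word, int distance, long frequency) {
            this.word = word;
            this.distance = distance;
            this.frequency = frequency;
        }
    }

    // Picks the candidate with the smallest distance. When useFrequency is true,
    // ties are resolved by preferring the higher frequency; otherwise the first
    // candidate in list order wins. Returns null for an empty list.
    static String pick(List<Candidate> candidates, boolean useFrequency) {
        Comparator<Candidate> order = Comparator.comparingInt((Candidate c) -> c.distance);
        if (useFrequency) {
            order = order.thenComparing(
                    Comparator.comparingLong((Candidate c) -> c.frequency).reversed());
        }
        return candidates.stream().min(order).map(c -> c.word).orElse(null);
    }
}
```

Keeping distance as the primary key means frequency never overrides a closer match; it only decides between candidates that are otherwise equally good.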

findMostSimilarList

public java.util.List findMostSimilarList(java.lang.String key)
Takes a word and returns a List of similar words from the dictionary, ranked by Levenshtein distance.

Parameters:
key - The word to check in the dictionary.
Returns:
A List of similar words from the dictionary.
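
Ranking by Levenshtein distance can be sketched as follows. The example scans a plain word list rather than the package's trie, and the class SimilarListSketch, the method rank and the maxDistance cut-off are assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SimilarListSketch {
    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the words within maxDistance of the key and orders them closest first.
    // List.sort is stable, so equally distant words keep their input order.
    static List<String> rank(String key, List<String> words, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (String w : words) {
            if (levenshtein(key, w) <= maxDistance) {
                result.add(w);
            }
        }
        result.sort(Comparator.comparingInt((String w) -> levenshtein(key, w)));
        return result;
    }
}
```

For example, `SimilarListSketch.rank("casaa", java.util.Arrays.asList("carro", "casa", "casal", "gato"), 2)` keeps only "casa" and "casal", each one edit away from the key.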

spellCheckQuery

public java.lang.String spellCheckQuery(java.lang.String s)
Checks spelling errors in terms for a search engine query, ignoring commands to the search system.

Parameters:
s - A String with a search engine query.
Returns:
The String with spelling errors identified.
See Also:
spellCheckWord(String)
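
Which tokens count as "commands to the search system" is not stated on this page. The sketch below assumes, for illustration only, that field prefixes such as `site:` and the upper-case operators AND, OR and NOT are commands, and passes every other token to a word-level checker. The class QuerySketch and its methods are hypothetical.

```java
import java.util.function.UnaryOperator;

class QuerySketch {
    // Assumed definition of a search command: a field prefix or a Boolean operator.
    static boolean isCommand(String token) {
        return token.contains(":")
                || token.equals("AND")
                || token.equals("OR")
                || token.equals("NOT");
    }

    // Splits the query on whitespace, leaves commands untouched and applies
    // checkWord to every other token, joining the results with single spaces.
    static String checkQuery(String query, UnaryOperator<String> checkWord) {
        StringBuilder out = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty or blank query
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(isCommand(token) ? token : checkWord.apply(token));
        }
        return out.toString();
    }
}
```

The word-level checker is passed in as a function so that the same tokenizing logic could wrap any per-word routine, such as one that adds the SGML tags described for spellCheckWord(String).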

spellCheck

public java.lang.String spellCheck(java.lang.String s)
Checks spelling errors in terms from a given String.

Parameters:
s - A String.
Returns:
The String with spelling errors identified.
See Also:
spellCheckWord(String)

spellCheckTeX

public java.lang.String spellCheckTeX(java.lang.String s)
Checks spelling errors in terms from a TeX document.

Parameters:
s - A String with the TeX document.
Returns:
The String with spelling errors identified.
See Also:
spellCheckWord(String)

spellCheckXML

public java.lang.String spellCheckXML(java.lang.String s)
Checks spelling errors in terms from an XML document.

Parameters:
s - A String with the XML document.
Returns:
The String with spelling errors identified.
See Also:
spellCheckWord(String)

spellCheckWord

public java.lang.String spellCheckWord(java.lang.String word)
Checks whether a word is correctly spelled, producing as output a string with the word plus SGML tags indicating whether it is correctly spelled.

The possible SGML tags are:

<misspell> - The word was not found in the dictionary and no suggestion could be generated.
<plain> - The word is correctly spelled.
<suggestion> - The word was not found in the dictionary and a suggestion was generated.

Parameters:
word - The word to check.
Returns:
A String with the word provided as input (or an appropriate correction) surrounded by SGML tags indicating whether it is correctly spelled.
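
The three outcomes can be sketched as a simple decision. The exact output format is not shown on this page, so the `<tag>word</tag>` shape with a closing tag is an assumption, as are the class TagWordSketch, the method tag and the use of a Set and a Map in place of the package's dictionary and suggestion logic.

```java
import java.util.Map;
import java.util.Set;

class TagWordSketch {
    // Wraps the word (or its correction) in the tag that describes the outcome:
    //   known word           -> <plain>
    //   unknown, suggestion  -> <suggestion> around the suggested correction
    //   unknown, none found  -> <misspell>
    static String tag(String word, Set<String> dictionary, Map<String, String> suggestions) {
        if (dictionary.contains(word)) {
            return "<plain>" + word + "</plain>";
        }
        String suggestion = suggestions.get(word);
        if (suggestion != null) {
            return "<suggestion>" + suggestion + "</suggestion>";
        }
        return "<misspell>" + word + "</misspell>";
    }
}
```

In this sketch the suggestion lookup is a fixed table; in practice the correction would come from a similarity search over the dictionary.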

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Main method.

Parameters:
args - The command line input, tokenized.
Throws:
java.lang.Exception