pt.tumba.spell
Class DefaultWordFinder

java.lang.Object
  extended by pt.tumba.spell.DefaultWordFinder
Direct Known Subclasses:
TeXWordFinder, XMLWordFinder

public class DefaultWordFinder
extends java.lang.Object

A word finder for normal text documents, which searches text for sequences of words and text blocks. This class also defines common methods and behaviour for the various word finding subclasses.

Author:
Bruno Martins
See Also:
StringTokenizer, BreakIterator, TeXWordFinder, XMLWordFinder

Field Summary
protected  int currentSegmentPos
          The index of the current segment in the input text.
protected  java.lang.String currentWord
          A string with the current word for the word finder.
protected  int currentWordPos
          The index of the current word in the input text.
protected  int nextSegmentPos
          The index of the next segment in the input text.
protected  java.lang.String nextWord
          A string with the word next to the current one.
protected  int nextWordPos
          The index of the next word in the input text.
protected  java.text.BreakIterator sentenceIterator
          An iterator over the input text.
protected  boolean solveHardCases
          Whether to solve the tokenization hard cases.
protected  boolean startsSentence
          A boolean flag indicating if the current word marks the beginning of a sentence.
protected  java.lang.String text
          The input text.
 
Constructor Summary
DefaultWordFinder()
          Constructor for DefaultWordFinder.
DefaultWordFinder(java.lang.String inText)
          Constructor for DefaultWordFinder.
 
Method Summary
 java.lang.String current()
          Returns the current word in the text.
 java.lang.String currentSegment()
          Returns the current text segment from the input.
private static int getNextWordEnd(java.lang.String text, int startPos)
          Returns the position in the string after the end of the next word.
 java.lang.String getText()
          Returns the text associated with this DefaultWordFinder.
 boolean hasNext()
          Tests if there are more words available from the text.
protected  int ignore(int index, char startIgnore)
          Ignore all characters from the text after the first occurrence of a given character.
protected  int ignore(int index, java.lang.Character startIgnore, java.lang.Character endIgnore)
          Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.
protected  int ignore(int index, char startIgnore, char endIgnore)
          Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.
protected  int ignore(int index, java.lang.String startIgnore, java.lang.String endIgnore)
          Ignore all characters from the text between the first occurrence of a given String and the next occurrence of another given String.
protected static boolean isWordChar(char c)
          Checks if a given character is alphanumeric.
protected static boolean isWordChar(java.lang.String text, int posn)
          Checks if the character at a given position in a String is part of a word.
 java.lang.String lookAhead()
          Returns the next word without advancing the tokenizer, checking whether the character separating the two words is a space.
 java.lang.String next()
          This method scans the text from the end of the last word, and returns a String corresponding to the next word.
 java.lang.String nextSegment()
          Returns the next text segment from the input.
 void replace(java.lang.String newWord)
          Replaces the current word in the text.
 void replaceBigram(java.lang.String newBigram)
          Replaces the current bigram (the current word and the next one, as returned by lookAhead()) in the text.
 void replaceSegment(java.lang.String newSegment)
          Replaces the current text segment.
 void setText(java.lang.String newText)
          Changes the text associated with this DefaultWordFinder.
private static java.lang.String solveHardCases(java.lang.String text)
          Resolves the hard tokenization cases which involve splitting the original word into two words (e.g. doesn't -> "does not").
static java.lang.String[] splitSegments(java.lang.String text)
          Splits a given String into an array with its constituent text segments.
static java.lang.String[] splitWords(java.lang.String text)
          Splits a given String into an array with its constituent words.
 boolean startsSentence()
          Checks if the current word marks the beginning of a sentence.
 java.lang.String toString()
          Produces a string representation of this word finder by returning the associated text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

currentWord

protected java.lang.String currentWord
A string with the current word for the word finder.


nextWord

protected java.lang.String nextWord
A string with the word next to the current one.


currentWordPos

protected int currentWordPos
The index of the current word in the input text.


nextWordPos

protected int nextWordPos
The index of the next word in the input text.


currentSegmentPos

protected int currentSegmentPos
The index of the current segment in the input text.


nextSegmentPos

protected int nextSegmentPos
The index of the next segment in the input text.


startsSentence

protected boolean startsSentence
A boolean flag indicating if the current word marks the beginning of a sentence.


text

protected java.lang.String text
The input text.


solveHardCases

protected boolean solveHardCases
Whether to solve the tokenization hard cases.


sentenceIterator

protected java.text.BreakIterator sentenceIterator
An iterator over the input text.

See Also:
BreakIterator
Constructor Detail

DefaultWordFinder

public DefaultWordFinder(java.lang.String inText)
Constructor for DefaultWordFinder.

Parameters:
inText - A String with the input text to tokenize.

DefaultWordFinder

public DefaultWordFinder()
Constructor for DefaultWordFinder.

Method Detail

currentSegment

public java.lang.String currentSegment()
Returns the current text segment from the input. A segment is defined as the character sequence between the current position and the next non-alphanumeric character, considering also white spaces.

Returns:
A String with the current text segment.

nextSegment

public java.lang.String nextSegment()
Returns the next text segment from the input. A segment is defined as the character sequence between the current position and the next non-alphanumeric character, considering also white spaces. If there are no more segments to return, it returns null.

Returns:
A String with the next text segment.

replaceSegment

public void replaceSegment(java.lang.String newSegment)
Replaces the current text segment. After a call to this method, a call to currentSegment() returns the new text segment and a call to getText() returns the text supplied to this WordFinder with the current segment replaced.

Parameters:
newSegment - A String with the new text segment.

getText

public java.lang.String getText()
Returns the text associated with this DefaultWordFinder.

Returns:
A String with the text associated with this DefaultWordFinder.

setText

public void setText(java.lang.String newText)
Changes the text associated with this DefaultWordFinder.

Parameters:
newText - The new String with the input text to tokenize.

current

public java.lang.String current()
Returns the current word in the text.

Returns:
A String with the current word in the text.

hasNext

public boolean hasNext()
Tests if there are more words available from the text.

Returns:
true if and only if there is at least one word in the string after the current position, and false otherwise.

replace

public void replace(java.lang.String newWord)
Replaces the current word in the text. After a call to this method, a call to current() returns the new word and a call to getText() returns the text supplied to this WordFinder with the current word replaced.

Parameters:
newWord - A string with the replacement word.
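
The effect of replace(...) on the text can be pictured with plain string surgery. The sketch below is a JDK-only illustration, not the library's code: the class ReplaceSketch and both of its methods are hypothetical names, and the idea that later word positions shift by the length difference is an assumption about the bookkeeping involved.

```java
// Hypothetical illustration of in-place word replacement; not the library's code.
class ReplaceSketch {

    // Returns the text with the word of length wordLen at wordPos replaced by newWord.
    static String replaceAt(String text, int wordPos, int wordLen, String newWord) {
        return text.substring(0, wordPos) + newWord + text.substring(wordPos + wordLen);
    }

    // Amount by which positions after the replaced word move (assumed bookkeeping).
    static int shift(int wordLen, String newWord) {
        return newWord.length() - wordLen;
    }

    public static void main(String[] args) {
        System.out.println(replaceAt("teh cat", 0, 3, "the"));
    }
}
```

A replacement longer or shorter than the original word moves every later index, which is why a word finder would need to adjust something like nextWordPos after such a call.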

replaceBigram

public void replaceBigram(java.lang.String newBigram)
Replaces the current bigram (the current word and the next one, as returned by lookAhead()) in the text. After a call to this method, a call to current() returns the new bigram and a call to getText() returns the text supplied to this WordFinder with the current bigram replaced.

Parameters:
newBigram - A string with the replacement Bigram.

lookAhead

public java.lang.String lookAhead()
Returns the next word without advancing the tokenizer, checking whether the character separating the two words is a space. This is useful for getting bigrams from the text.

Returns:
The next word in the text, or null.
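
One way to read the separator check is sketched below using only the JDK. LookAheadSketch and its lookAhead(String, int) helper are hypothetical, and treating "empty space" as a single ' ' character is an assumption; the real method works on the finder's internal state rather than on an explicit index.

```java
// Hypothetical sketch of a look-ahead that only succeeds across a single space.
class LookAheadSketch {

    // currentEnd is the index just past the current word.
    // Returns the following word, or null if the separator is not a space
    // or no word follows it.
    static String lookAhead(String text, int currentEnd) {
        if (currentEnd + 1 >= text.length() || text.charAt(currentEnd) != ' ') {
            return null;
        }
        int start = currentEnd + 1;
        int end = start;
        while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
            end++;
        }
        return end > start ? text.substring(start, end) : null;
    }

    public static void main(String[] args) {
        System.out.println(lookAhead("New York, USA", 3));
    }
}
```

Under this reading, "New" followed by "York" forms a bigram candidate, while "York" followed by a comma does not.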

startsSentence

public boolean startsSentence()
Checks if the current word marks the beginning of a sentence.

Returns:
true if the current word marks the beginning of a sentence and false otherwise.
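
Since the class holds a java.text.BreakIterator named sentenceIterator, a sentence-start test can be sketched with the JDK alone. SentenceStartSketch and its startsSentence(String, int) helper are hypothetical, and whether the library performs the check this way is an assumption.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch: a word starts a sentence if a sentence boundary falls on its index.
class SentenceStartSketch {

    static boolean startsSentence(String text, int wordPos) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        // isBoundary reports whether wordPos is a sentence boundary in the text.
        return sentences.isBoundary(wordPos);
    }

    public static void main(String[] args) {
        String text = "It rains. We stay.";
        System.out.println(startsSentence(text, text.indexOf("We")));
    }
}
```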

toString

public java.lang.String toString()
Produces a string representation of this word finder by returning the associated text.

Overrides:
toString in class java.lang.Object

isWordChar

protected static boolean isWordChar(java.lang.String text,
                                    int posn)
Checks if the character at a given position in a String is part of a word. Special characters such as '.' or '-' are considered alphanumeric or not depending on the surrounding characters. TODO: recognize URLs, mail addresses and abbreviations.

Parameters:
text - The text String.
posn - The position for the character in the String.
Returns:
true if the character at the given position is alphanumeric and false otherwise.
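
The context-dependent treatment of '.' and '-' can be illustrated as follows. This is a minimal sketch under an assumed rule (the special character counts only when both neighbours are letters or digits); the documentation does not state the exact rule, and WordCharSketch is a hypothetical class, not the library's implementation.

```java
// Hypothetical sketch of a context-sensitive word-character test.
class WordCharSketch {

    // Letters and digits always count as word characters.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // Assumed rule: '.' and '-' count only when surrounded by letters or digits.
    static boolean isWordChar(String text, int posn) {
        char c = text.charAt(posn);
        if (isWordChar(c)) {
            return true;
        }
        if (c == '.' || c == '-') {
            boolean before = posn > 0 && isWordChar(text.charAt(posn - 1));
            boolean after = posn + 1 < text.length() && isWordChar(text.charAt(posn + 1));
            return before && after;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isWordChar("e-mail", 1)); // hyphen between two letters
        System.out.println(isWordChar("end.", 3));   // full stop at the end of the text
    }
}
```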

isWordChar

protected static boolean isWordChar(char c)
Checks if a given character is alphanumeric.

Parameters:
c - The char to check.
Returns:
true if the given character is alphanumeric and false otherwise.

ignore

protected int ignore(int index,
                     char startIgnore)
Ignore all characters from the text after the first occurrence of a given character.

Parameters:
index - A starting index for the text from where characters should be ignored.
startIgnore - The character that marks the beginning of the sequence to be ignored.
Returns:
the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied character does not occur in the text).

ignore

protected int ignore(int index,
                     char startIgnore,
                     char endIgnore)
Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.

Parameters:
index - A starting index for the text from where characters should be ignored.
startIgnore - The character that marks the beginning of the sequence to be ignored.
endIgnore - The character that marks the ending of the sequence to be ignored.
Returns:
the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied starting character does not occur in the text).

ignore

protected int ignore(int index,
                     java.lang.Character startIgnore,
                     java.lang.Character endIgnore)
Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.

Parameters:
index - A starting index for the text from where characters should be ignored.
startIgnore - The character that marks the beginning of the sequence to be ignored.
endIgnore - The character that marks the ending of the sequence to be ignored, or null if all the next characters from the text are to be ignored.
Returns:
the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied starting character does not occur in the text).

ignore

protected int ignore(int index,
                     java.lang.String startIgnore,
                     java.lang.String endIgnore)
Ignore all characters from the text between the first occurrence of a given String and the next occurrence of another given String.

Parameters:
index - A starting index for the text from where characters should be ignored.
startIgnore - The String that marks the beginning of the sequence to be ignored.
endIgnore - The String that marks the ending of the sequence to be ignored.
Returns:
the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied starting String does not occur in the text).
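
The start/end marker logic shared by the ignore(...) overloads can be sketched with String.indexOf. IgnoreSketch and ignoreSpan are hypothetical names; returning a second value (the index where scanning would resume) and ignoring to the end of the text when the end marker is missing are both assumptions, since the documented methods return only the start index.

```java
// Hypothetical sketch of locating a span to ignore between two markers.
class IgnoreSketch {

    // Returns {start, resume}: start is where the ignored span begins (-1 if the
    // start marker does not occur at or after index), resume is where scanning
    // would continue afterwards.
    static int[] ignoreSpan(String text, int index, String startIgnore, String endIgnore) {
        int start = text.indexOf(startIgnore, index);
        if (start < 0) {
            return new int[] { -1, index };
        }
        int end = text.indexOf(endIgnore, start + startIgnore.length());
        // Assumption: without an end marker, the rest of the text is ignored.
        int resume = end < 0 ? text.length() : end + endIgnore.length();
        return new int[] { start, resume };
    }

    public static void main(String[] args) {
        int[] span = ignoreSpan("a <!-- x --> b", 0, "<!--", "-->");
        System.out.println(span[0] + " to " + span[1]);
    }
}
```

A subclass such as XMLWordFinder could plausibly use this kind of call to skip markup, though that use is a guess from the class names.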

next

public java.lang.String next()
This method scans the text from the end of the last word, and returns a String corresponding to the next word. If there are no more words to return, it returns null.

Returns:
the next word.

getNextWordEnd

private static int getNextWordEnd(java.lang.String text,
                                  int startPos)
Returns the position in the string after the end of the next word. Note that this return value should not be used as an index into the string without checking first that it is in range, since it is possible for the value text.length() to be returned by this method.

Parameters:
text - A string with the text to check.
startPos - the starting position in the text to check.
Returns:
the index position in the string after the end of the next word.
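
The range caveat above is easy to reproduce. The following is a JDK-only sketch with hypothetical names (WordEndSketch, nextWordEnd, charAfterWord) and a simplified notion of a word as a run of letters and digits; it shows the returned position reaching text.length() and the guard needed before indexing.

```java
// Hypothetical sketch of a word-end scan and the range check it requires.
class WordEndSketch {

    // Returns the index just past the word starting at or after startPos.
    // The result equals text.length() when the word runs to the end of the text.
    static int nextWordEnd(String text, int startPos) {
        int i = startPos;
        // Skip separators before the word.
        while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        // Consume the word itself.
        while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        return i;
    }

    // Reads the character after the word only when the index is in range.
    static char charAfterWord(String text, int startPos) {
        int end = nextWordEnd(text, startPos);
        return end < text.length() ? text.charAt(end) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(nextWordEnd("hello world", 5));
    }
}
```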

splitWords

public static java.lang.String[] splitWords(java.lang.String text)
Splits a given String into an array with its constituent words.

Parameters:
text - A String.
Returns:
An array with the words extracted from the String.
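
For comparison, a similar split can be obtained from the JDK's word BreakIterator, which the class lists under See Also. This is a sketch with a hypothetical class name (SplitWordsSketch); it does not reproduce the library's tokenization rules, and keeping only tokens that contain a letter or digit is an assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical JDK-only word splitter; not the library's tokenizer.
class SplitWordsSketch {

    static String[] splitWords(String text) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Keep tokens with at least one letter or digit; drop spaces and punctuation.
            if (token.chars().anyMatch(Character::isLetterOrDigit)) {
                words.add(token);
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitWords("Hello, world!")));
    }
}
```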

splitSegments

public static java.lang.String[] splitSegments(java.lang.String text)
Splits a given String into an array with its constituent text segments.

Parameters:
text - A String.
Returns:
An array with the text segments extracted from the String.

solveHardCases

private static java.lang.String solveHardCases(java.lang.String text)
Resolves the hard tokenization cases which involve splitting the original word into two words (e.g. doesn't -> "does not"). TODO: Disambiguate some cases.

Parameters:
text - A string.
Returns:
The string with the hard cases solved.
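
A table lookup is one simple way to picture this expansion. In the sketch below only the "doesn't" entry comes from the documentation; the class HardCasesSketch, the second table entry and the exact-match lookup are illustrative assumptions, not the library's rules.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of expanding contractions into two words.
class HardCasesSketch {

    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("doesn't", "does not"); // example given in the documentation
        EXPANSIONS.put("isn't", "is not");     // assumed additional entry
    }

    // Returns the expanded form, or the word unchanged if it is not a known case.
    static String solveHardCases(String word) {
        return EXPANSIONS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(solveHardCases("doesn't"));
    }
}
```

Ambiguous contractions such as "he's" ("he is" or "he has") are the kind of case a plain table cannot settle, which may be what the TODO refers to.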