DefaultWordFinder

pt.tumba.spell
Class DefaultWordFinder

java.lang.Object
  pt.tumba.spell.DefaultWordFinder

Direct Known Subclasses:: TeXWordFinder, XMLWordFinder

public class DefaultWordFinder
extends java.lang.Object

A word finder for normal text documents, which searches text for sequences of words and text blocks. This class also defines common methods and behaviour for the various word-finding subclasses.

Author:: Bruno Martins
See Also:: StringTokenizer, BreakIterator, TeXWordFinder, XMLWordFinder

Field Summary
`protected int`	`currentSegmentPos` The index of the current segment in the input text.
`protected java.lang.String`	`currentWord` A string with the current word for the word finder.
`protected int`	`currentWordPos` The index of the current word in the input text.
`protected int`	`nextSegmentPos` The index of the next segment in the input text.
`protected java.lang.String`	`nextWord` A string with the word next to the current one.
`protected int`	`nextWordPos` The index of the next word in the input text.
`protected java.text.BreakIterator`	`sentenceIterator` An iterator over the input text.
`protected boolean`	`solveHardCases` Solve the tokenization hard cases.
`protected boolean`	`startsSentence` A boolean flag indicating if the current word marks the beginning of a sentence.
`protected java.lang.String`	`text` The input text.

Constructor Summary
`DefaultWordFinder()` Constructor for DefaultWordFinder.
`DefaultWordFinder(java.lang.String inText)` Constructor for DefaultWordFinder.

Method Summary
`java.lang.String`	`current()` Returns the current word in the text.
`java.lang.String`	`currentSegment()` Returns the current text segment from the input.
`private static int`	`getNextWordEnd(java.lang.String text, int startPos)` Returns the position in the string after the end of the next word.
`java.lang.String`	`getText()` Returns the text associated with this DefaultWordFinder.
`boolean`	`hasNext()` Tests if there are more words available from the text.
`protected int`	`ignore(int index, char startIgnore)` Ignore all characters from the text after the first occurrence of a given character.
`protected int`	`ignore(int index, java.lang.Character startIgnore, java.lang.Character endIgnore)` Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.
`protected int`	`ignore(int index, char startIgnore, char endIgnore)` Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.
`protected int`	`ignore(int index, java.lang.String startIgnore, java.lang.String endIgnore)` Ignore all characters from the text between the first occurrence of a given String and the next occurrence of another given String.
`protected static boolean`	`isWordChar(char c)` Checks if a given character is alphanumeric.
`protected static boolean`	`isWordChar(java.lang.String text, int posn)` Checks if the character at a given position in a String is part of a word.
`java.lang.String`	`lookAhead()` Returns the next word without advancing the tokenizer, checking whether the character separating the two words is a space.
`java.lang.String`	`next()` This method scans the text from the end of the last word, and returns a String corresponding to the next word.
`java.lang.String`	`nextSegment()` Returns the next text segment from the input.
`void`	`replace(java.lang.String newWord)` Replaces the current word in the text.
`void`	`replaceBigram(java.lang.String newBigram)` Replaces the current bigram (the current word and the next one, as returned by lookAhead()) in the text.
`void`	`replaceSegment(java.lang.String newSegment)` Replaces the current text segment.
`void`	`setText(java.lang.String newText)` Changes the text associated with this DefaultWordFinder.
`private static java.lang.String`	`solveHardCases(java.lang.String text)` Resolves the hard tokenization cases which involve splitting the original word into two words (e.g. doesn't -> "does not").
`static java.lang.String[]`	`splitSegments(java.lang.String text)` Splits a given String into an array with its constituent text segments.
`static java.lang.String[]`	`splitWords(java.lang.String text)` Splits a given String into an array with its constituent words.
`boolean`	`startsSentence()` Checks if the current word marks the beginning of a sentence.
`java.lang.String`	`toString()` Produces a string representation of this word finder by returning the associated text.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

currentWord

protected java.lang.String currentWord

A string with the current word for the word finder.

nextWord

protected java.lang.String nextWord

A string with the word next to the current one.

currentWordPos

protected int currentWordPos

The index of the current word in the input text.

nextWordPos

protected int nextWordPos

The index of the next word in the input text.

currentSegmentPos

protected int currentSegmentPos

The index of the current segment in the input text.

nextSegmentPos

protected int nextSegmentPos

The index of the next segment in the input text.

startsSentence

protected boolean startsSentence

A boolean flag indicating if the current word marks the beginning of a sentence.

text

protected java.lang.String text

The input text.

solveHardCases

protected boolean solveHardCases

Solve the tokenization hard cases.

sentenceIterator

protected java.text.BreakIterator sentenceIterator

An iterator over the input text.

See Also:: BreakIterator

Constructor Detail

DefaultWordFinder

public DefaultWordFinder(java.lang.String inText)

Constructor for DefaultWordFinder.

Parameters:: inText - A String with the input text to tokenize.

DefaultWordFinder

public DefaultWordFinder()

Constructor for DefaultWordFinder.

Method Detail

currentSegment

public java.lang.String currentSegment()

Returns the current text segment from the input. A segment is defined as the character sequence between the current position and the next non-alphanumeric character, considering also white spaces.

Returns:: A String with the current text segment.

nextSegment

public java.lang.String nextSegment()

Returns the next text segment from the input. A segment is defined as the character sequence between the current position and the next non-alphanumeric character, considering also white spaces. If there are no more segments to return, it returns null.

Returns:: A String with the next text segment.

replaceSegment

public void replaceSegment(java.lang.String newSegment)

Replaces the current text segment. After a call to this method, a call to currentSegment() returns the new text segment and a call to getText() returns the text supplied to this WordFinder with the current segment replaced.

Parameters:: newSegment - A String with the new text segment.

getText

public java.lang.String getText()

Returns the text associated with this DefaultWordFinder.

Returns:: A String with the text associated with this DefaultWordFinder.

setText

public void setText(java.lang.String newText)

Changes the text associated with this DefaultWordFinder.

Parameters:: newText - The new String with the input text to tokenize.

current

public java.lang.String current()

Returns the current word in the text.

Returns:: A String with the current word in the text.

hasNext

public boolean hasNext()

Tests if there are more words available from the text.

Returns:: true if and only if there is at least one word in the string after the current position, and false otherwise.

replace

public void replace(java.lang.String newWord)

Replaces the current word in the text. After a call to this method, a call to current() returns the new word and a call to getText() returns the text supplied to this WordFinder with the current word replaced.

Parameters:: newWord - A string with the replacement word.
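
The effect of replace on current() and getText() can be illustrated with a small stand-in class. This is only a sketch under assumptions: `ReplaceSketch` and its constructor are hypothetical and are not part of pt.tumba.spell; the fields merely mirror the roles of `text`, `currentWord` and `currentWordPos`.

```java
/** Hypothetical stand-in that tracks one current word inside a text. */
class ReplaceSketch {
    private String text;               // plays the role of the `text` field
    private String currentWord;        // plays the role of `currentWord`
    private final int currentWordPos;  // plays the role of `currentWordPos`

    ReplaceSketch(String text, int wordPos, int wordLength) {
        this.text = text;
        this.currentWordPos = wordPos;
        this.currentWord = text.substring(wordPos, wordPos + wordLength);
    }

    /** Splices newWord over the current word and makes it the current word. */
    void replace(String newWord) {
        int end = currentWordPos + currentWord.length();
        text = text.substring(0, currentWordPos) + newWord + text.substring(end);
        currentWord = newWord;
    }

    String current() { return currentWord; }

    String getText() { return text; }

    public static void main(String[] args) {
        ReplaceSketch finder = new ReplaceSketch("I have a speling error", 9, 7);
        finder.replace("spelling");
        System.out.println(finder.current());
        System.out.println(finder.getText());
    }
}
```

In a tokenizer that also stores a position for the next word, that position would presumably need to shift by the difference in length between the old and the new word.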

replaceBigram

public void replaceBigram(java.lang.String newBigram)

Replaces the current bigram (the current word and the next one, as returned by lookAhead()) in the text. After a call to this method, a call to current() returns the bigram and a call to getText() returns the text supplied to this WordFinder with the current bigram replaced.

Parameters:: newBigram - A string with the replacement Bigram.

lookAhead

public java.lang.String lookAhead()

Returns the next word without advancing the tokenizer, checking whether the character separating the two words is a space. This is useful for getting bigrams from the text.

Returns:: The next word in the text, or null.
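
The space check described above can be sketched as follows. This is an illustration, not the library's code: `LookAheadSketch` is a hypothetical name, the method takes the text and the end position of the current word as explicit arguments, and words are assumed to be runs of letters or digits.

```java
class LookAheadSketch {
    /**
     * Returns the word that starts right after a single space at
     * position currentEnd, or null if the separator is not a space
     * or no word follows. Assumption: words are runs of letters or digits.
     */
    static String lookAhead(String text, int currentEnd) {
        if (currentEnd >= text.length() || text.charAt(currentEnd) != ' ') {
            return null;
        }
        int start = currentEnd + 1;
        int end = start;
        while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
            end++;
        }
        return end > start ? text.substring(start, end) : null;
    }

    public static void main(String[] args) {
        String text = "New York, NY";
        System.out.println(lookAhead(text, 3)); // separator after "New" is a space
        System.out.println(lookAhead(text, 8)); // separator after "York" is a comma
    }
}
```

Under these assumptions "New" and "York" form a bigram candidate, while "York" and "NY" do not, because a comma separates them.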

startsSentence

public boolean startsSentence()

Checks if the current word marks the beginning of a sentence.

Returns:: true if the current word marks the beginning of a sentence and false otherwise.
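
One plausible way to derive such a flag from a java.text.BreakIterator, which is the type of the `sentenceIterator` field, is to test whether the word's index is a sentence boundary. The sketch below assumes exactly that; `SentenceStartSketch`, its method signature and the choice of Locale.ENGLISH are hypothetical and may differ from what DefaultWordFinder does internally.

```java
import java.text.BreakIterator;
import java.util.Locale;

class SentenceStartSketch {
    /**
     * Returns true if wordPos is a sentence boundary in text.
     * Assumption: a word "starts a sentence" when its index coincides
     * with a boundary reported by a sentence BreakIterator.
     */
    static boolean startsSentence(String text, int wordPos) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        return sentences.isBoundary(wordPos);
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you?";
        System.out.println(startsSentence(text, 0));   // index of "Hello"
        System.out.println(startsSentence(text, 6));   // index of "world"
        System.out.println(startsSentence(text, 13));  // index of "How"
    }
}
```

Note that isBoundary throws an IllegalArgumentException when the offset lies outside the text, so callers should pass only valid word positions.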

toString

public java.lang.String toString()

Produces a string representation of this word finder by returning the associated text.

Overrides:: toString in class java.lang.Object

isWordChar

protected static boolean isWordChar(java.lang.String text,
                                    int posn)

Checks if the character at a given position in a String is part of a word. Special characters such as '.' or '-' are considered alphanumeric or not depending on the surrounding characters. TODO: recognize URLs, mail addresses and abbreviations.

Parameters:: text - The text String.; posn - The position for the character in the String.
Returns:: true if the character at the given position is alphanumeric and false otherwise.
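
The context-dependent treatment of '.' and '-' might look like the sketch below. The exact rule used by the library is not stated in this page, so the rule here is an assumption: the special character counts as part of a word only when it has a letter or digit on both sides. `WordCharSketch` is a hypothetical name.

```java
class WordCharSketch {
    /** Plain check: letters and digits are word characters. */
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    /**
     * Context-aware check. Assumption: '.' and '-' are word characters
     * only when flanked on both sides by a letter or digit.
     */
    static boolean isWordChar(String text, int posn) {
        char c = text.charAt(posn);
        if (isWordChar(c)) {
            return true;
        }
        if (c == '.' || c == '-') {
            boolean before = posn > 0 && isWordChar(text.charAt(posn - 1));
            boolean after = posn + 1 < text.length() && isWordChar(text.charAt(posn + 1));
            return before && after;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isWordChar("well-known", 4)); // hyphen inside a word
        System.out.println(isWordChar("The end.", 7));   // sentence-final period
    }
}
```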

isWordChar

protected static boolean isWordChar(char c)

Checks if a given character is alphanumeric.

Parameters:: c - The char to check.
Returns:: true if the given character is alphanumeric and false otherwise.

ignore

protected int ignore(int index,
                     char startIgnore)

Ignore all characters from the text after the first occurrence of a given character.

Parameters:: index - A starting index for the text from where characters should be ignored.; startIgnore - The character that marks the beginning of the sequence to be ignored.
Returns:: the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied character does not occur in the text).

ignore

protected int ignore(int index,
                     char startIgnore,
                     char endIgnore)

Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.

Parameters:: index - A starting index for the text from where characters should be ignored.; startIgnore - The character that marks the beginning of the sequence to be ignored.; endIgnore - The character that marks the ending of the sequence to be ignored.
Returns:: the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied starting character does not occur in the text).

ignore

protected int ignore(int index,
                     java.lang.Character startIgnore,
                     java.lang.Character endIgnore)

Ignore all characters from the text between the first occurrence of a given character and the next occurrence of another given character.

Parameters:: index - A starting index for the text from where characters should be ignored.; startIgnore - The character that marks the beginning of the sequence to be ignored.; endIgnore - The character that marks the ending of the sequence to be ignored, or null if all the next characters from the text are to be ignored.
Returns:: the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied starting character does not occur in the text).

ignore

protected int ignore(int index,
                     java.lang.String startIgnore,
                     java.lang.String endIgnore)

Ignore all characters from the text between the first occurrence of a given String and the next occurrence of another given String.

Parameters:: index - A starting index for the text from where characters should be ignored.; startIgnore - The String that marks the beginning of the sequence to be ignored.; endIgnore - The String that marks the ending of the sequence to be ignored.
Returns:: the index in the text marking the beginning of the ignored sequence, or -1 if no sequence was ignored (the supplied starting String does not occur in the text).
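
The idea behind the String variant can be sketched with String.indexOf. This is not the library's implementation: `IgnoreSketch` and both method names are hypothetical, the text is passed explicitly, and treating a missing end marker as "ignore to the end of the text" is an assumption.

```java
class IgnoreSketch {
    /** Index where the ignored sequence begins, or -1 if startIgnore is absent. */
    static int ignoreStart(String text, int index, String startIgnore) {
        return text.indexOf(startIgnore, index);
    }

    /**
     * Index just past the ignored sequence: right after endIgnore if it
     * is found, otherwise the end of the text (an assumption).
     * Returns -1 if startIgnore does not occur, so nothing is ignored.
     */
    static int resumeAfter(String text, int index, String startIgnore, String endIgnore) {
        int start = text.indexOf(startIgnore, index);
        if (start < 0) {
            return -1;
        }
        int end = text.indexOf(endIgnore, start + startIgnore.length());
        return end < 0 ? text.length() : end + endIgnore.length();
    }

    public static void main(String[] args) {
        String text = "a <!-- note --> b";
        int start = ignoreStart(text, 0, "<!--");
        int resume = resumeAfter(text, 0, "<!--", "-->");
        // Text outside the ignored region:
        System.out.println(text.substring(0, start) + text.substring(resume));
    }
}
```

A subclass such as XMLWordFinder could plausibly use this kind of helper to skip markup, although this page does not say how the subclasses call it.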

next

public java.lang.String next()

This method scans the text from the end of the last word, and returns a String corresponding to the next word. If there are no more words to return, it returns null.

Returns:: the next word.
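
The hasNext()/next() contract, including the null result once the text is exhausted, can be illustrated with a minimal stand-in. `MiniWordFinder` is hypothetical and far simpler than DefaultWordFinder: it assumes words are runs of letters or digits and keeps no sentence or segment state.

```java
class MiniWordFinder {
    private final String text;
    private int pos = 0; // position just after the last returned word

    MiniWordFinder(String text) {
        this.text = text;
    }

    /** Index of the next letter or digit at or after from, or -1 if none. */
    private int nextWordStart(int from) {
        for (int i = from; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    boolean hasNext() {
        return nextWordStart(pos) >= 0;
    }

    /** Returns the next word, or null when no words remain. */
    String next() {
        int start = nextWordStart(pos);
        if (start < 0) {
            return null;
        }
        int end = start;
        while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
            end++;
        }
        pos = end;
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        MiniWordFinder finder = new MiniWordFinder("one, two");
        while (finder.hasNext()) {
            System.out.println(finder.next());
        }
    }
}
```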

getNextWordEnd

private static int getNextWordEnd(java.lang.String text,
                                  int startPos)

Returns the position in the string after the end of the next word. Note that this return value should not be used as an index into the string without checking first that it is in range, since it is possible for the value text.length() to be returned by this method.

Parameters:: text - A string with the text to check.; startPos - the starting position in the text to check.
Returns:: the index position in the string after the end of the next word.
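
The range caveat above is easy to see in a small sketch. The method is private in the library, so this is a guess at its shape rather than its code: `WordEndSketch` is a hypothetical name and words are assumed to be runs of letters or digits.

```java
class WordEndSketch {
    /**
     * Returns the position just after the end of the next word at or
     * after startPos. The result may equal text.length().
     */
    static int getNextWordEnd(String text, int startPos) {
        int i = startPos;
        // Skip any separators in front of the word.
        while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        // Consume the word itself.
        while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String text = "hello world";
        int end = getNextWordEnd(text, 5);
        // Check the range before indexing: end can be text.length().
        if (end < text.length()) {
            System.out.println("separator: " + text.charAt(end));
        } else {
            System.out.println("word ends at end of text");
        }
    }
}
```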

splitWords

public static java.lang.String[] splitWords(java.lang.String text)

Splits a given String into an array with its constituent words.

Parameters:: text - A String.
Returns:: An array with the words extracted from the String.
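
For comparison, a similar result can be obtained with java.text.BreakIterator, which this class lists under See Also. The sketch below is an independent illustration and not the library's splitWords: `SplitWordsSketch` is a hypothetical name, and keeping only pieces that begin with a letter or digit is an assumption about what counts as a word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SplitWordsSketch {
    static String[] splitWords(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<String> result = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            // Keep only pieces that begin with a letter or digit;
            // punctuation and whitespace pieces are dropped.
            if (Character.isLetterOrDigit(text.charAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String word : splitWords("Hello, world! 42")) {
            System.out.println(word);
        }
    }
}
```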

splitSegments

public static java.lang.String[] splitSegments(java.lang.String text)

Splits a given String into an array with its constituent text segments.

Parameters:: text - A String.
Returns:: An array with the text segments extracted from the String.

solveHardCases

private static java.lang.String solveHardCases(java.lang.String text)

Resolves the hard tokenization cases which involve splitting the original word into two words (e.g. doesn't -> "does not"). TODO: Disambiguate some cases.

Parameters:: text - A string.
Returns:: The string with the hard cases solved.
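
A table-driven expansion is one simple way to handle cases like the example above. The sketch is hypothetical: `HardCasesSketch` is not the library's code, only the doesn't entry comes from this page, and the second entry is added purely for illustration.

```java
class HardCasesSketch {
    // Contraction -> expansion. Only "doesn't" is taken from the
    // documentation; "isn't" is an illustrative assumption.
    private static final String[][] EXPANSIONS = {
        {"doesn't", "does not"},
        {"isn't", "is not"},
    };

    /** Replaces each known contraction in text with its two-word form. */
    static String solveHardCases(String text) {
        String result = text;
        for (String[] pair : EXPANSIONS) {
            result = result.replace(pair[0], pair[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(solveHardCases("It doesn't work"));
    }
}
```

The matching here is case-sensitive and purely textual, so a capitalised "Doesn't" is left untouched; ambiguous contractions, which the TODO above mentions, are not addressed.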
