java.lang.Object pt.tumba.spell.DefaultWordFinder
public class DefaultWordFinder
A word finder for normal text documents, which searches text for sequences of words and text blocks. This class also defines common methods and behaviour for the various word-finding subclasses.
See Also: StringTokenizer, BreakIterator, TeXWordFinder, XMLWordFinder
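To illustrate why a dedicated word finder is useful alongside the two standard classes referenced above, the following sketch contrasts java.util.StringTokenizer with java.text.BreakIterator on the same input. It does not use DefaultWordFinder itself; the class name TokenizerComparison and both method names are hypothetical and exist only for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of pt.tumba.spell.
class TokenizerComparison {

    // Splits on whitespace only, so punctuation stays attached to the tokens.
    static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Walks the word boundaries reported by BreakIterator and keeps only
    // the spans that begin with a letter or digit.
    static List<String> withBreakIterator(String text) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (Character.isLetterOrDigit(text.charAt(start))) {
                words.add(text.substring(start, end));
            }
        }
        return words;
    }
}
```

StringTokenizer leaves punctuation glued to each token, while BreakIterator reports punctuation and whitespace as separate spans that the caller has to filter out. A word finder for spell checking presumably needs the second behaviour plus position tracking, which is what the fields listed below suggest.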
Field Summary | |
---|---|
protected int | currentSegmentPos - The index of the current segment in the input text. |
protected java.lang.String | currentWord - A string with the current word for the word finder. |
protected int | currentWordPos - The index of the current word in the input text. |
protected int | nextSegmentPos - The index of the next segment in the input text. |
protected java.lang.String | nextWord - A string with the word that follows the current one. |
protected int | nextWordPos - The index of the next word in the input text. |
protected java.text.BreakIterator | sentenceIterator - An iterator over the input text. |
protected boolean | solveHardCases - Whether to solve the hard tokenization cases. |
protected boolean | startsSentence - A boolean flag indicating whether the current word marks the beginning of a sentence. |
protected java.lang.String | text - The input text. |
Constructor Summary | |
---|---|
DefaultWordFinder() | Constructor for DefaultWordFinder. |
DefaultWordFinder(java.lang.String inText) | Constructor for DefaultWordFinder. |
Method Summary | |
---|---|
java.lang.String | current() - Returns the current word in the text. |
java.lang.String | currentSegment() - Returns the current text segment from the input. |
private static int | getNextWordEnd(java.lang.String text, int startPos) - Returns the position in the string after the end of the next word. |
java.lang.String | getText() - Returns the text associated with this DefaultWordFinder. |
boolean | hasNext() - Tests if there are more words available from the text. |
protected int | ignore(int index, char startIgnore) - Ignores all characters from the text after the first occurrence of a given character. |
protected int | ignore(int index, java.lang.Character startIgnore, java.lang.Character endIgnore) - Ignores all characters from the text between the first occurrence of a given character and the next occurrence of another given character. |
protected int | ignore(int index, char startIgnore, char endIgnore) - Ignores all characters from the text between the first occurrence of a given character and the next occurrence of another given character. |
protected int | ignore(int index, java.lang.String startIgnore, java.lang.String endIgnore) - Ignores all characters from the text between the first occurrence of a given String and the next occurrence of another given String. |
protected static boolean | isWordChar(char c) - Checks if a given character is alphanumeric. |
protected static boolean | isWordChar(java.lang.String text, int posn) - Checks if the character at a given position in a String is part of a word. |
java.lang.String | lookAhead() - Returns the next word without advancing the tokenizer, checking if the character separating both words is an empty space. |
java.lang.String | next() - Scans the text from the end of the last word and returns a String corresponding to the next word. |
java.lang.String | nextSegment() - Returns the next text segment from the input. |
void | replace(java.lang.String newWord) - Replaces the current word in the text. |
void | replaceBigram(java.lang.String newBigram) - Replaces the current bigram (the current word and the next one, as returned by lookAhead) in the text. |
void | replaceSegment(java.lang.String newSegment) - Replaces the current text segment. |
void | setText(java.lang.String newText) - Changes the text associated with this DefaultWordFinder. |
private static java.lang.String | solveHardCases(java.lang.String text) - Resolves the hard tokenization cases which involve splitting the original word in two words (e.g. |
static java.lang.String[] | splitSegments(java.lang.String text) - Splits a given String into an array with its constituent text segments. |
static java.lang.String[] | splitWords(java.lang.String text) - Splits a given String into an array with its constituent words. |
boolean | startsSentence() - Checks if the current word marks the beginning of a sentence. |
java.lang.String | toString() - Produces a string representation of this word finder by returning the associated text. |
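The hasNext(), next(), current(), replace() and getText() methods summarized above suggest an iterate-and-correct protocol. The sketch below shows one way such a protocol could work; it is not the DefaultWordFinder implementation. MiniWordFinder and its private members are hypothetical, it treats only letters and digits as word characters, and unlike the real class it does not pre-fetch the next word.

```java
import java.util.NoSuchElementException;

// Hypothetical, simplified word finder; not part of pt.tumba.spell.
class MiniWordFinder {
    private final StringBuilder text;
    private String currentWord;      // last word returned by next()
    private int currentWordPos = -1; // index of currentWord in text
    private int scanPos = 0;         // where the search for the next word starts

    MiniWordFinder(String inText) {
        this.text = new StringBuilder(inText);
    }

    boolean hasNext() {
        return findWordStart(scanPos) < text.length();
    }

    String next() {
        int start = findWordStart(scanPos);
        if (start >= text.length()) {
            throw new NoSuchElementException("no more words");
        }
        int end = start;
        while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
            end++;
        }
        currentWord = text.substring(start, end);
        currentWordPos = start;
        scanPos = end;
        return currentWord;
    }

    String current() {
        return currentWord;
    }

    // Replaces the current word in place and moves the scan position to the
    // end of the replacement, so a length change does not skew later words.
    void replace(String newWord) {
        if (currentWord == null) {
            throw new IllegalStateException("call next() before replace()");
        }
        text.replace(currentWordPos, currentWordPos + currentWord.length(), newWord);
        scanPos = currentWordPos + newWord.length();
        currentWord = newWord;
    }

    String getText() {
        return text.toString();
    }

    // Returns the index of the first word character at or after 'from',
    // or text.length() if there is none.
    private int findWordStart(int from) {
        int i = from;
        while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

A spell checker would typically loop with while (finder.hasNext()), look up each word from next(), and call replace() when a correction is chosen. Tracking the word position is what makes the in-place replacement possible.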
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
protected java.lang.String currentWord
protected java.lang.String nextWord
protected int currentWordPos
protected int nextWordPos
protected int currentSegmentPos
protected int nextSegmentPos
protected boolean startsSentence
protected java.lang.String text
protected boolean solveHardCases
protected java.text.BreakIterator sentenceIterator
See Also: BreakIterator
Constructor Detail |
---|
public DefaultWordFinder(java.lang.String inText)
inText
- A String with the input text to tokenize.
public DefaultWordFinder()
Method Detail |
---|
public java.lang.String currentSegment()
public java.lang.String nextSegment()
public void replaceSegment(java.lang.String newSegment)
newSegment
- A String with the new text segment.
public java.lang.String getText()
public void setText(java.lang.String newText)
newText
- The new String with the input text to tokenize.
public java.lang.String current()
public boolean hasNext()
public void replace(java.lang.String newWord)
newWord
- A string with the replacement word.
public void replaceBigram(java.lang.String newBigram)
newBigram
- A string with the replacement bigram.
public java.lang.String lookAhead()
public boolean startsSentence()
public java.lang.String toString()
Overrides: toString in class java.lang.Object
protected static boolean isWordChar(java.lang.String text, int posn)
text
- The text String.
posn
- The position for the character in the String.
protected static boolean isWordChar(char c)
c
- The char to check.
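The two isWordChar overloads above can be sketched as follows. The char version follows the documented behaviour (alphanumeric check). The positional version is an assumption: the source only says it checks whether the character "is part of a word", so this sketch additionally accepts an apostrophe that sits between two alphanumeric characters. WordChars is a hypothetical class name.

```java
// Hypothetical helper class, not part of pt.tumba.spell.
class WordChars {

    // Documented behaviour: true for alphanumeric characters.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // ASSUMPTION: an apostrophe counts as a word character only when it is
    // surrounded by alphanumerics (as in "don't"). The real rule is not
    // stated in the documentation.
    static boolean isWordChar(String text, int posn) {
        char c = text.charAt(posn);
        if (isWordChar(c)) {
            return true;
        }
        if (c == '\'') {
            return posn > 0
                    && posn < text.length() - 1
                    && isWordChar(text.charAt(posn - 1))
                    && isWordChar(text.charAt(posn + 1));
        }
        return false;
    }
}
```

Looking at the neighbours of a position is the reason a second overload taking the whole String is useful: a single char cannot tell an apostrophe inside a contraction from an opening quote.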
protected int ignore(int index, char startIgnore)
index
- The index in the text from which characters should be ignored.
startIgnore
- The character that marks the beginning of the sequence to be ignored.
protected int ignore(int index, char startIgnore, char endIgnore)
index
- The index in the text from which characters should be ignored.
startIgnore
- The character that marks the beginning of the sequence to be ignored.
endIgnore
- The character that marks the end of the sequence to be ignored.
protected int ignore(int index, java.lang.Character startIgnore, java.lang.Character endIgnore)
index
- The index in the text from which characters should be ignored.
startIgnore
- The character that marks the beginning of the sequence to be ignored.
endIgnore
- The character that marks the end of the sequence to be ignored, or null if all the remaining characters in the text are to be ignored.
protected int ignore(int index, java.lang.String startIgnore, java.lang.String endIgnore)
index
- The index in the text from which characters should be ignored.
startIgnore
- The String that marks the beginning of the sequence to be ignored.
endIgnore
- The String that marks the end of the sequence to be ignored.
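The ignore overloads above all return an int, but the documentation does not say what that value means. The sketch below assumes one plausible convention: if the text at index begins with startIgnore, the method returns the index just past the matching endIgnore (or the end of the text when no end marker follows); otherwise it returns index unchanged. The class name IgnoreSketch and the extra text parameter are hypothetical; the real methods read the instance field text.

```java
// Hypothetical helper class, not part of pt.tumba.spell.
class IgnoreSketch {

    // ASSUMPTION: the return value is the index at which scanning should
    // resume after the ignored region.
    static int ignore(String text, int index, String startIgnore, String endIgnore) {
        if (!text.startsWith(startIgnore, index)) {
            return index; // nothing to ignore at this position
        }
        int end = text.indexOf(endIgnore, index + startIgnore.length());
        if (end < 0) {
            return text.length(); // unterminated region: ignore the rest
        }
        return end + endIgnore.length();
    }
}
```

Under this convention a subclass such as an XML or TeX word finder could call ignore repeatedly while scanning, for example with "<" and ">" to step over a tag, and continue tokenizing from the returned index.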
public java.lang.String next()
private static int getNextWordEnd(java.lang.String text, int startPos)
text.length()
to be returned by this method.
text
- A string with the text to check.
startPos
- The starting position in the text to check.
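A minimal sketch of the scan described for getNextWordEnd, assuming that "word" means a run of letters and digits and that any separators before the word are skipped first. Both points are assumptions, as is the class name WordEnd. When the word runs to the end of the input, the result equals text.length(), which appears to be what the note about text.length() refers to.

```java
// Hypothetical helper class, not part of pt.tumba.spell.
class WordEnd {

    // Returns the position just after the end of the next word found at or
    // after startPos. ASSUMPTION: words are runs of letters and digits.
    static int getNextWordEnd(String text, int startPos) {
        int i = startPos;
        // Skip separators that precede the next word.
        while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        // Consume the word itself.
        while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

Returning the exclusive end index means the caller can pass the result straight to String.substring as the end argument.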
public static java.lang.String[] splitWords(java.lang.String text)
text
- A String.
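As an illustration of what splitWords might produce, the following sketch collects every maximal run of Unicode letters and digits with a regular expression. This is a stand-in technique, not the method's actual implementation, and SplitWordsSketch is a hypothetical name. In particular, it does not handle the "hard cases" that solveHardCases addresses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of pt.tumba.spell.
class SplitWordsSketch {

    // One or more Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    static String[] splitWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words.toArray(new String[0]);
    }
}
```

Matching words rather than splitting on separators avoids the empty leading element that String.split returns when the input starts with punctuation or whitespace.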
public static java.lang.String[] splitSegments(java.lang.String text)
text
- A String.
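The class keeps a java.text.BreakIterator in its sentenceIterator field, so one reading is that text segments correspond to sentences. Under that assumption, which the documentation does not confirm, splitSegments could be sketched with BreakIterator.getSentenceInstance as below. SplitSegmentsSketch is a hypothetical name, and trimming the segments is a choice made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of pt.tumba.spell.
class SplitSegmentsSketch {

    // ASSUMPTION: a "segment" is a sentence as delimited by BreakIterator.
    static String[] splitSegments(String text) {
        List<String> segments = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Sentence spans include trailing whitespace; trim it away.
            String segment = text.substring(start, end).trim();
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        return segments.toArray(new String[0]);
    }
}
```

The same boundary positions could also drive a startsSentence-style check: a word whose index equals a sentence boundary begins a sentence.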
private static java.lang.String solveHardCases(java.lang.String text)
text
- A string.