Interface WordTokenizer

All Known Implementing Classes:
AbstractWordTokenizer, DocumentWordTokenizer, FileWordTokenizer, StringWordTokenizer

public interface WordTokenizer

An interface for objects that take a text-based medium as input and iterate through the words in the text stored in that medium. Examples of such media include Strings, Documents, Files and TextComponents.

After the object is instantiated, and before the first call to nextWord() is made, the following methods should throw a WordNotFoundException:
getCurrentWordEnd(), getCurrentWordPosition(), isNewSentence() and replaceWord().

A call to nextWord() when hasMoreWords() returns false should throw a WordNotFoundException.
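
The contract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: WhitespaceTokenizer is a hypothetical class that mirrors only three of the interface's method names, the nested WordNotFoundException is a stand-in assumed to be unchecked, and words are split on whitespace only. It is not one of the implementing classes listed above.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerContractSketch {

    // Stand-in for the library's exception (assumed to be unchecked here).
    static class WordNotFoundException extends RuntimeException {
        WordNotFoundException(String message) {
            super(message);
        }
    }

    // Hypothetical tokenizer that treats any run of non-whitespace as a word.
    static class WhitespaceTokenizer {
        private final String text;
        private int scan = 0;        // index where the search for the next word begins
        private int wordStart = -1;  // -1 means nextWord() has not been called yet

        WhitespaceTokenizer(String text) {
            this.text = text;
        }

        private int skipWhitespace(int from) {
            int i = from;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            return i;
        }

        public boolean hasMoreWords() {
            return skipWhitespace(scan) < text.length();
        }

        public String nextWord() {
            int begin = skipWhitespace(scan);
            if (begin >= text.length()) {
                // Contract: no words left, so signal it with the exception.
                throw new WordNotFoundException("No more words");
            }
            int end = begin;
            while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
                end++;
            }
            wordStart = begin;
            scan = end;
            return text.substring(begin, end);
        }

        public int getCurrentWordPosition() {
            if (wordStart < 0) {
                // Contract: there is no current word before the first nextWord().
                throw new WordNotFoundException("Current word has not yet been set");
            }
            return wordStart;
        }
    }

    // Typical consumer loop: test hasMoreWords() before each nextWord().
    static List<String> collect(String text) {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(text);
        List<String> out = new ArrayList<>();
        while (tokenizer.hasMoreWords()) {
            String word = tokenizer.nextWord();
            out.add(word + "@" + tokenizer.getCurrentWordPosition());
        }
        return out;
    }

    // Reports whether querying the position before any nextWord() call throws.
    static boolean throwsBeforeFirstWord(String text) {
        try {
            new WhitespaceTokenizer(text).getCurrentWordPosition();
            return false;
        } catch (WordNotFoundException e) {
            return true;
        }
    }
}
```

Guarding every nextWord() call with hasMoreWords(), as collect() does, keeps a caller from ever triggering the exception during normal iteration.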

Author:
Jason Height (jheight@chariot.net.au)
  • Method Summary

    Modifier and Type    Method                      Description
    String               getContext()                Returns the context text that is being tokenized (should include any changes that have been made).
    int                  getCurrentWordCount()       Returns the number of word tokens that have been processed thus far.
    int                  getCurrentWordEnd()         Returns an index representing the end location of the current word in the text.
    int                  getCurrentWordPosition()    Returns an index representing the start location of the current word in the text.
    boolean              hasMoreWords()              Indicates if there are more words left.
    boolean              isNewSentence()             Returns true if the current word is at the start of a sentence.
    String               nextWord()                  This returns the next word in the iteration.
    void                 replaceWord(String newWord) Replaces the current word token.
  • Method Details

    • getContext

      String getContext()
      Returns the context text that is being tokenized (should include any changes that have been made).
      Returns:
      the text being searched.
    • getCurrentWordCount

      int getCurrentWordCount()
      Returns the number of word tokens that have been processed thus far
      Returns:
      the number of words found so far.
    • getCurrentWordEnd

      int getCurrentWordEnd()
      Returns an index representing the end location of the current word in the text.
      Returns:
      index of the end of the current word in the text.
      Throws:
      WordNotFoundException - current word has not yet been set.
    • getCurrentWordPosition

      int getCurrentWordPosition()
      Returns an index representing the start location of the current word in the text.
      Returns:
      index of the start of the current word in the text.
      Throws:
      WordNotFoundException - current word has not yet been set.
    • isNewSentence

      boolean isNewSentence()
      Returns true if the current word is at the start of a sentence
      Returns:
      true if the current word starts a sentence.
      Throws:
      WordNotFoundException - current word has not yet been set.
    • hasMoreWords

      boolean hasMoreWords()
      Indicates if there are more words left
      Returns:
      true if more words can be found in the text.
    • nextWord

      String nextWord()
      This returns the next word in the iteration. Note that any implementation should return the current word, and then replace the current word with the next word found in the input text (if one exists).
      Returns:
      the next word in the iteration.
      Throws:
      WordNotFoundException - search string contains no more words.
    • replaceWord

      void replaceWord(String newWord)
      Replaces the current word token

      When a word is replaced, the WordTokenizer should reposition itself so that the words that were just inserted are not rechecked. This is not mandatory; some applications may not need it.

      Parameters:
      newWord - the string which should replace the current word.
      Throws:
      WordNotFoundException - current word has not yet been set.
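
      The repositioning described for replaceWord can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's implementation: BufferTokenizer is a hypothetical class backed by a StringBuilder, the nested WordNotFoundException is a stand-in assumed to be unchecked, and words are split on whitespace only.

```java
import java.util.List;

public class ReplaceWordSketch {

    // Stand-in for the library's exception (assumed to be unchecked here).
    static class WordNotFoundException extends RuntimeException {
        WordNotFoundException(String message) {
            super(message);
        }
    }

    // Hypothetical tokenizer over a mutable buffer, splitting on whitespace.
    static class BufferTokenizer {
        private final StringBuilder text;
        private int scan = 0;        // index where the search for the next word begins
        private int wordStart = -1;  // -1 means there is no current word yet
        private int wordEnd = -1;    // exclusive end of the current word (a choice made for this sketch)

        BufferTokenizer(String text) {
            this.text = new StringBuilder(text);
        }

        private int skipWhitespace(int from) {
            int i = from;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            return i;
        }

        public String getContext() {
            return text.toString();
        }

        public boolean hasMoreWords() {
            return skipWhitespace(scan) < text.length();
        }

        public String nextWord() {
            int begin = skipWhitespace(scan);
            if (begin >= text.length()) {
                throw new WordNotFoundException("No more words");
            }
            int end = begin;
            while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
                end++;
            }
            wordStart = begin;
            wordEnd = end;
            scan = end;
            return text.substring(begin, end);
        }

        public void replaceWord(String newWord) {
            if (wordStart < 0) {
                throw new WordNotFoundException("Current word has not yet been set");
            }
            text.replace(wordStart, wordEnd, newWord);
            wordEnd = wordStart + newWord.length();
            // Resume scanning after the inserted text so it is not rechecked.
            scan = wordEnd;
        }
    }

    // Replaces every occurrence of 'wrong' with 'right', recording each word visited.
    static String correct(String input, String wrong, String right, List<String> visited) {
        BufferTokenizer tokenizer = new BufferTokenizer(input);
        while (tokenizer.hasMoreWords()) {
            String word = tokenizer.nextWord();
            visited.add(word);
            if (word.equals(wrong)) {
                tokenizer.replaceWord(right);
            }
        }
        return tokenizer.getContext();
    }
}
```

      In this sketch, replacing "teh" with the two-word string "the cat" does not cause "the" or "cat" to be visited, because the scan index is moved past the replacement. An application that does want replacements rechecked could instead reset the scan index to the start of the replaced word.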