Class TokenizerImpl

  • All Implemented Interfaces:
    Tokenizer

    public class TokenizerImpl
    extends java.lang.Object
    implements Tokenizer
    Implements the Tokenizer interface. Breaks an input sequence of characters into a sequence of tokens.
    • Constructor Summary

      Constructors 
      Constructor Description
      TokenizerImpl()
      Constructs a Tokenizer.
      TokenizerImpl(java.io.Reader file)
      Creates a tokenizer that will return tokens read from the given reader.
      TokenizerImpl(java.lang.String string)
      Creates a tokenizer that will return tokens from the given string.
    • Method Summary

      Modifier and Type Method Description
      java.lang.String getErrorDescription()
      If hasErrors() returns true, returns a description of the error encountered; otherwise returns null.
      Token getNextToken()
      Returns the next token.
      boolean hasErrors()
      Returns true if there were errors while reading tokens.
      boolean hasMoreTokens()
      Returns true if there are more tokens, false otherwise.
      boolean isBreak()
      Determines if the current token should start a new sentence.
      void setInputReader(java.io.Reader reader)
      Sets the input reader.
      void setInputText(java.lang.String inputString)
      Sets the text to tokenize.
      void setPostpunctuationSymbols(java.lang.String symbols)
      Sets the postpunctuation symbols of this Tokenizer to the given symbols.
      void setPrepunctuationSymbols(java.lang.String symbols)
      Sets the prepunctuation symbols of this Tokenizer to the given symbols.
      void setSingleCharSymbols(java.lang.String symbols)
      Sets the single character symbols of this Tokenizer to the given symbols.
      void setWhitespaceSymbols(java.lang.String symbols)
      Sets the whitespace symbols of this Tokenizer to the given symbols.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • EOF

        public static final int EOF
        A constant indicating that the end of the stream has been read.
        See Also:
        Constant Field Values
      • DEFAULT_WHITESPACE_SYMBOLS

        public static final java.lang.String DEFAULT_WHITESPACE_SYMBOLS
        A string containing the default whitespace characters.
        See Also:
        Constant Field Values
      • DEFAULT_SINGLE_CHAR_SYMBOLS

        public static final java.lang.String DEFAULT_SINGLE_CHAR_SYMBOLS
        A string containing the default single characters.
        See Also:
        Constant Field Values
      • DEFAULT_PREPUNCTUATION_SYMBOLS

        public static final java.lang.String DEFAULT_PREPUNCTUATION_SYMBOLS
        A string containing the default pre-punctuation characters.
        See Also:
        Constant Field Values
      • DEFAULT_POSTPUNCTUATION_SYMBOLS

        public static final java.lang.String DEFAULT_POSTPUNCTUATION_SYMBOLS
        A string containing the default post-punctuation characters.
        See Also:
        Constant Field Values
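The four DEFAULT_* symbol constants above name character classes, but this page does not say how they are applied. One plausible reading, stated here as an assumption, is that prepunctuation is peeled off the front of a whitespace-delimited chunk and postpunctuation off the back, leaving the word in the middle. The sketch below illustrates that idea only. `PunctuationSketch`, `split`, and the `PRE` and `POST` strings are hypothetical names and values invented for the example; they are not part of TokenizerImpl and are not claimed to equal the class defaults.

```java
// Hypothetical helper for illustration; not part of TokenizerImpl.
class PunctuationSketch {
    // Illustrative symbol sets chosen for this sketch. They are NOT claimed
    // to match DEFAULT_PREPUNCTUATION_SYMBOLS or DEFAULT_POSTPUNCTUATION_SYMBOLS.
    static final String PRE = "\"'([{";
    static final String POST = "\"'.,:;!?)]}";

    // Splits one whitespace-delimited chunk into
    // {prepunctuation, word, postpunctuation}.
    static String[] split(String chunk) {
        int start = 0;
        // Advance past leading characters that belong to the PRE set.
        while (start < chunk.length() && PRE.indexOf(chunk.charAt(start)) >= 0) {
            start++;
        }
        int end = chunk.length();
        // Back up over trailing characters that belong to the POST set.
        while (end > start && POST.indexOf(chunk.charAt(end - 1)) >= 0) {
            end--;
        }
        return new String[] {
            chunk.substring(0, start),
            chunk.substring(start, end),
            chunk.substring(end)
        };
    }

    public static void main(String[] args) {
        // Prints: (|hello|),
        System.out.println(String.join("|", split("(hello),")));
    }
}
```

With the real class, the corresponding sets would be supplied through setPrepunctuationSymbols and setPostpunctuationSymbols rather than hard-coded.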
    • Constructor Detail

      • TokenizerImpl

        public TokenizerImpl()
        Constructs a Tokenizer.
      • TokenizerImpl

        public TokenizerImpl(java.lang.String string)
        Creates a tokenizer that will return tokens from the given string.
        Parameters:
        string - the string to tokenize
      • TokenizerImpl

        public TokenizerImpl(java.io.Reader file)
        Creates a tokenizer that will return tokens read from the given reader.
        Parameters:
        file - the reader to read the input from
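Because the third constructor takes a java.io.Reader, the input need not be a file on disk: any character stream works, including a StringReader. The sketch below shows, using only the standard library, how a Reader is typically consumed one character at a time until end of stream. `ReaderInputSketch`, `readAll`, and `END_OF_STREAM` are hypothetical names for this example. The value -1 is what `Reader.read()` returns at end of stream; whether the EOF constant of this class has the same value is not stated on this page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of TokenizerImpl.
class ReaderInputSketch {
    // java.io.Reader.read() returns -1 once the stream is exhausted.
    static final int END_OF_STREAM = -1;

    // Reads every character from the reader and returns them as a String.
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != END_OF_STREAM) {
                text.append((char) c);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to declare IOException.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a file-backed reader here.
        try (Reader reader = new StringReader("one two")) {
            System.out.println(readAll(reader)); // prints: one two
        }
    }
}
```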
    • Method Detail

      • setWhitespaceSymbols

        public void setWhitespaceSymbols(java.lang.String symbols)
        Sets the whitespace symbols of this Tokenizer to the given symbols.
        Specified by:
        setWhitespaceSymbols in interface Tokenizer
        Parameters:
        symbols - the whitespace symbols
      • setSingleCharSymbols

        public void setSingleCharSymbols(java.lang.String symbols)
        Sets the single character symbols of this Tokenizer to the given symbols.
        Specified by:
        setSingleCharSymbols in interface Tokenizer
        Parameters:
        symbols - the single character symbols
      • setPrepunctuationSymbols

        public void setPrepunctuationSymbols(java.lang.String symbols)
        Sets the prepunctuation symbols of this Tokenizer to the given symbols.
        Specified by:
        setPrepunctuationSymbols in interface Tokenizer
        Parameters:
        symbols - the prepunctuation symbols
      • setPostpunctuationSymbols

        public void setPostpunctuationSymbols(java.lang.String symbols)
        Sets the postpunctuation symbols of this Tokenizer to the given symbols.
        Specified by:
        setPostpunctuationSymbols in interface Tokenizer
        Parameters:
        symbols - the postpunctuation symbols
      • setInputText

        public void setInputText(java.lang.String inputString)
        Sets the text to tokenize.
        Specified by:
        setInputText in interface Tokenizer
        Parameters:
        inputString - the string to tokenize
      • setInputReader

        public void setInputReader(java.io.Reader reader)
        Sets the input reader.
        Specified by:
        setInputReader in interface Tokenizer
        Parameters:
        reader - the input source
      • getNextToken

        public Token getNextToken()
        Returns the next token.
        Specified by:
        getNextToken in interface Tokenizer
        Returns:
        the next token if one exists; null if there are no more tokens
      • hasMoreTokens

        public boolean hasMoreTokens()
        Returns true if there are more tokens, false otherwise.
        Specified by:
        hasMoreTokens in interface Tokenizer
        Returns:
        true if there are more tokens; false otherwise
      • hasErrors

        public boolean hasErrors()
        Returns true if there were errors while reading tokens.
        Specified by:
        hasErrors in interface Tokenizer
        Returns:
        true if there were errors; false otherwise
      • getErrorDescription

        public java.lang.String getErrorDescription()
        If hasErrors() returns true, returns a description of the error encountered; otherwise returns null.
        Specified by:
        getErrorDescription in interface Tokenizer
        Returns:
        a description of the last error that occurred, or null if there were no errors
      • isBreak

        public boolean isBreak()
        Determines if the current token should start a new sentence.
        Specified by:
        isBreak in interface Tokenizer
        Returns:
        true if a new sentence should be started
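The methods above suggest a simple consumption loop: call hasMoreTokens(), fetch with getNextToken(), guard against a null return, and consult hasErrors() and getErrorDescription() once the loop ends. The sketch below shows that pattern against a hypothetical stand-in so that it runs without the library. `SketchTokenizer` and `TokenLoopSketch` are invented for this example; the stand-in mirrors only the method names documented here, returns plain Strings because the Token class is not described in this section, and splits on whitespace purely as a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the documented method names.
// It is NOT the real TokenizerImpl.
class SketchTokenizer {
    private final String[] words;
    private int next = 0;

    SketchTokenizer(String text) {
        String trimmed = text.trim();
        // Whitespace splitting is an assumption made for this sketch only.
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    boolean hasMoreTokens() {
        return next < words.length;
    }

    // Returns null when exhausted, matching the documented contract.
    String getNextToken() {
        return hasMoreTokens() ? words[next++] : null;
    }

    // This toy implementation never fails, so it never reports an error.
    boolean hasErrors() {
        return false;
    }

    String getErrorDescription() {
        return null;
    }
}

class TokenLoopSketch {
    // Drains the tokenizer and checks for errors once the loop ends.
    static List<String> drain(SketchTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.getNextToken();
            if (token == null) {
                break; // defensive: the documented return value allows null
            }
            tokens.add(token);
        }
        if (tokenizer.hasErrors()) {
            throw new IllegalStateException(tokenizer.getErrorDescription());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [Hello, world, again]
        System.out.println(drain(new SketchTokenizer("Hello world again")));
    }
}
```

With the real class, the same loop shape would apply to a TokenizerImpl instance, with Token in place of String.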