Package pal.io

Class NexusTokenizer

java.lang.Object
pal.io.NexusTokenizer

public final class NexusTokenizer extends Object

Comments

A simple token pull-parser for the NEXUS file format as specified in:

Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: An extensible file format for systematic information. Systematic Biology, 46(4), pp. 590-621.

The parser breaks a NEXUS file into tokens, which are read one at a time. Tokens come in four basic types:

  • Punctuation: any of the punctuation characters (see constants)
  • Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
  • Word: any string of characters delimited by whitespace or punctuation
  • Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user specified new line character

The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).

Each read by the parser moves forward in the stream. At present there is no support for unreading tokens or for moving bidirectionally through the stream.

NB: in this implementation, the token #NEXUS is treated specially. When the parser reads it, it returns one token, '#NEXUS', not two ('#' and 'NEXUS'). Because of its special meaning, this token has its own token type (HEADER_TOKEN).
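
To make the token categories above concrete, here is a minimal, self-contained sketch of how a string might be split into them. It is not the pal.io implementation: the class name TokenSketch, the PUNCTUATION subset, and the exact-case match on '#NEXUS' are all assumptions made for illustration (the real punctuation set is given by the class constants), and it ignores comments and quoting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT the pal.io implementation. It only illustrates
// how text splits into word, punctuation, whitespace, newline and header tokens.
final class TokenSketch {
    // Assumed punctuation subset; the real set is defined by the class constants.
    static final String PUNCTUATION = "()[]{}/\\,;:=*'\"+-<>";

    static List<String> tokenize(String text, boolean readWhiteSpace) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                // "\r\n" is treated as a single newline token
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    tokens.add("\r\n");
                    i += 2;
                } else {
                    tokens.add(String.valueOf(c));
                    i++;
                }
            } else if (c == ' ' || c == '\t') {
                // a run of spaces/tabs forms one whitespace token
                int start = i;
                while (i < text.length() && (text.charAt(i) == ' ' || text.charAt(i) == '\t')) {
                    i++;
                }
                if (readWhiteSpace) {
                    tokens.add(text.substring(start, i));
                }
            } else if (text.startsWith("#NEXUS", i)) {
                // the header is returned whole, not as '#' followed by 'NEXUS'
                tokens.add("#NEXUS");
                i += 6;
            } else if (PUNCTUATION.indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                // a word runs until the next whitespace, newline or punctuation
                int start = i;
                while (i < text.length() && !isDelimiter(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }

    static boolean isDelimiter(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || PUNCTUATION.indexOf(c) >= 0;
    }
}
```

With readWhiteSpace set to false, the whitespace run between two words is consumed but never added to the result, which matches the option described above.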

Usage

import java.io.FileReader;
import java.io.PushbackReader;
import pal.io.NexusTokenizer;

NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
ntp.setReadWhiteSpace(false);                            // ignore whitespace
ntp.setIgnoreComments(true);                             // ignore comments
ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE);  // all tokens in uppercase

String nToken = ntp.readToken();
while (nToken != null) {
    System.out.println("Token: " + nToken);
    System.out.println("Col: " + ntp.getCol());
    System.out.println("Row: " + ntp.getRow());
    nToken = ntp.readToken();  // advance to the next token; null signals EOF
}
  • Field Details

  • Constructor Details

    • NexusTokenizer

      public NexusTokenizer(String file) throws IOException
      Constructs a NexusTokenizer that reads from the named file.
      Parameters:
      file - File name for the NEXUS file
      Throws:
      IOException - I/O errors
    • NexusTokenizer

      public NexusTokenizer(PushbackReader pr) throws IOException
      Constructs a NexusTokenizer that reads from the given PushbackReader.
      Parameters:
      pr - PushbackReader
      Throws:
      IOException - I/O errors
  • Method Details

    • readWhiteSpace

      public boolean readWhiteSpace()
      Get the flag indicating whether or not this parser object is reading (and returning) whitespace
      Returns:
      returns the readWS flag
    • convertNewLine

      public boolean convertNewLine()
      Gets the flag indicating whether this parser instance converts newline characters. According to the specification (see the reference in the class description above), a newline may be '\r', '\n' or '\r\n'. To provide uniformity, the parser can convert all of these into a single specified character. By default, this feature is off.
      Returns:
      returns the convertNL flag
    • setReadWhiteSpace

      public void setReadWhiteSpace(boolean b)
      Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').
      Parameters:
      b - flag value for readWS
    • setConvertNewLine

      public void setConvertNewLine(boolean b)
      Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default '\n' (if setNewLineChar() has not been called) or a user-specified newline character.
      Parameters:
      b - flag value for convertNL
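
The conversion rule described above can be sketched as a small pure function. This is a hypothetical helper (NewlineSketch.convert is not part of pal.io), written only to show the documented behaviour: newline tokens are replaced when the flag is on, and every other token passes through unchanged.

```java
// Hypothetical helper, not part of pal.io: mirrors the documented convertNL rule.
final class NewlineSketch {
    static String convert(String token, boolean convertNL, char newLineChar) {
        boolean isNewline = token.equals("\r") || token.equals("\n") || token.equals("\r\n");
        if (convertNL && isNewline) {
            // any of the three newline forms collapses to the one chosen character
            return String.valueOf(newLineChar);
        }
        return token; // flag off, or not a newline token: returned as read
    }
}
```

For example, NewlineSketch.convert("\r\n", true, '\n') yields "\n", while the same call with the flag set to false returns "\r\n" untouched.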
    • setIgnoreComments

      public void setIgnoreComments(boolean b)
      Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer will return the first token available after a comment.
      Parameters:
      b - flag value for ignoreComments
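
As a rough illustration of what ignoring '[...]' sections means, the sketch below removes bracketed comments from a string. It is not the tokenizer's own code: CommentSketch is a hypothetical name, the depth counter assumes comments may nest (the documentation above does not say whether the tokenizer supports that), and brackets inside quoted text are not handled.

```java
// Hypothetical sketch, not the pal.io implementation.
final class CommentSketch {
    // Returns the text with every [...] section removed.
    static String stripComments(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many '[' are currently open (nesting is an assumption)
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                out.append(c); // only characters outside comments are kept
            }
        }
        return out.toString();
    }
}
```

Note that the whitespace on either side of a removed comment is kept, so "BEGIN [note] DATA;" becomes "BEGIN  DATA;" with two spaces.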
    • setNewLineChar

      public void setNewLineChar(char nl)
      Sets the character that newline characters are converted into.
      Parameters:
      nl - Replacement newline character
    • getCol

      public int getCol()
      Gets the current column position of the cursor. Changed after each read.
      Returns:
      Column number (zero indexed)
    • getRow

      public int getRow()
      Gets the current row position of the cursor. Changed after each read.
      Returns:
      Row number (zero indexed)
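
One plausible way to keep zero-indexed row and column counters in step with the tokens read is sketched below. CursorSketch is a hypothetical class, and the rule it uses (a newline token increments the row and resets the column to 0; any other token advances the column by its length) is an assumption rather than documented behaviour of getCol() and getRow().

```java
// Hypothetical sketch of zero-indexed cursor tracking; not the pal.io code.
final class CursorSketch {
    private int row = 0;
    private int col = 0;

    // Call once per consumed token, including whitespace and newline tokens.
    void advance(String token) {
        if (token.equals("\r") || token.equals("\n") || token.equals("\r\n")) {
            row++;      // move to the next line
            col = 0;    // and back to its first column
        } else {
            col += token.length();
        }
    }

    int getRow() { return row; }

    int getCol() { return col; }
}
```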
    • getWordModification

      public int getWordModification()
      Gets the word modification flag currently in use
      Returns:
      Flag value for word modification
    • setWordModification

      public void setWordModification(int flag)
      Sets the flag value for word modification. Depending on the flag, the case of a token can be changed to lowercase or uppercase once it has been read from the stream. WORD_UNMODIFIED indicates that tokens are returned in the case in which they were read. This value can be set at any time between token reads; it then applies to the next token read. The default is WORD_UNMODIFIED.
      Parameters:
      flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
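
The effect of the three flags can be sketched as follows. The constant values here are placeholders chosen for the example (the real values are defined on NexusTokenizer and are not given above), and WordModSketch is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical sketch; the constant values are placeholders, not the real ones.
final class WordModSketch {
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    static String apply(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            default:
                return word; // WORD_UNMODIFIED: case left as read
        }
    }
}
```

Locale.ROOT is used so that the result does not depend on the default locale of the machine running the code.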
    • readToken

      public String readToken() throws IOException, NexusParseException
      Reads a token in from the underlying stream. Tokens are individual chunks read from the underlying stream. Each token is one of the four basic types:
      • Word: any string of characters delimited by whitespace or punctuation
      • Punctuation: any of the punctuation characters (see constants)
      • Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
      • Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user specified new line character
      Returns:
      returns a String token or null if EOF is reached (i.e. no more tokens to read)
      Throws:
      IOException - I/O errors
      NexusParseException - Parsing errors
    • getLastTokenType

      public int getLastTokenType()
      Determines the type of the last token read. After readToken() has been called, the type of the token it returned can be determined by calling getLastTokenType(). This returns one of six constants:
      • UNDEFINED_TOKEN : default before anything is read from the stream
      • WORD_TOKEN : word token was read
      • PUNCTUATION_TOKEN : punctuation token was read
      • NEWLINE_TOKEN : newline token was read
      • WHITESPACE_TOKEN : whitespace token was read (never returned unless whitespace is being returned)
      • HEADER_TOKEN : last token was the special word #NEXUS
      Returns:
      The type of the last token read.
    • seek

      public String seek(int tokenType) throws IOException, NexusParseException
      Seeks through the stream to find the next token of the specified type. The type value can be one of:
      • WORD_TOKEN
      • PUNCTUATION_TOKEN
      • NEWLINE_TOKEN
      • WHITESPACE_TOKEN
      • HEADER_TOKEN
      Returns:
      returns a String token or null if EOF is reached (i.e. no more tokens to read)
      Throws:
      IOException - I/O errors
      NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false
    • seek

      public String seek(String token) throws IOException, NexusParseException
      Seeks through the stream to find the token argument.
      Returns:
      returns a String token or null if token is not found (i.e. EOF is reached)
      Throws:
      IOException - I/O errors
      NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false
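
Conceptually, seeking is a loop of forward reads that stops at the first match or at end of input. The sketch below shows that loop for the String overload using a list-backed stand-in for the stream. SeekSketch is hypothetical, and the exact, case-sensitive comparison is an assumption; the documentation above does not say how the tokenizer compares tokens.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a forward-only token stream; not the pal.io code.
final class SeekSketch {
    private final Iterator<String> tokens;

    SeekSketch(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    // Stand-in for readToken(): null signals that no tokens remain.
    String readToken() {
        return tokens.hasNext() ? tokens.next() : null;
    }

    // Reads forward until a token equal to target is found.
    // Returns the token, or null if the input ends first.
    String seek(String target) {
        String t = readToken();
        while (t != null && !t.equals(target)) {
            t = readToken();
        }
        return t;
    }
}
```

Because reads only move forward, every token skipped during a seek is consumed; a read issued after a successful seek returns the token that follows the match.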
    • getLastReadToken

      public String getLastReadToken()
      Returns the last token read. Each call to readToken() stores the token it returns so that it can be retrieved again. Each subsequent readToken() call overwrites the stored token with the new one.
      Returns:
      return the last read token