Package nu.validator.htmlparser.impl
Class Tokenizer
java.lang.Object
nu.validator.htmlparser.impl.Tokenizer
- All Implemented Interfaces:
Locator
- Direct Known Subclasses:
ErrorReportingTokenizer
An implementation of
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html
This class implements the Locator interface. This is not an incidental
implementation detail: users of this class are encouraged to make use of the
Locator nature.
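Because the tokenizer is itself a Locator, code that reports positions can accept it wherever an org.xml.sax.Locator is expected. The following is a minimal sketch of that idea, not code from this library: the class LocatorNatureSketch and its describePosition helper are hypothetical names, and a LocatorImpl stands in for a Tokenizer instance so that the example runs with the JDK alone.

```java
import org.xml.sax.Locator;
import org.xml.sax.helpers.LocatorImpl;

class LocatorNatureSketch {
    /** Formats the position exposed by any SAX Locator (hypothetical helper). */
    static String describePosition(Locator locator) {
        String systemId = locator.getSystemId();
        return (systemId == null ? "(unknown)" : systemId)
                + ":" + locator.getLineNumber()
                + ":" + locator.getColumnNumber();
    }

    public static void main(String[] args) {
        // LocatorImpl is a stand-in; a Tokenizer could be passed the same way.
        LocatorImpl standIn = new LocatorImpl();
        standIn.setSystemId("http://example.com/doc.html");
        standIn.setLineNumber(12);
        standIn.setColumnNumber(7);
        System.out.println(describePosition(standIn));
        // prints "http://example.com/doc.html:12:7"
    }
}
```

Accepting the interface rather than the concrete class keeps such reporting code independent of the tokenizer itself.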
By default, the tokenizer may report data that XML 1.0 bans. The tokenizer
can be configured to treat these conditions as fatal or to coerce the infoset
to something that XML 1.0 allows.
- Version:
- $Id$
- Author:
- hsivonen
-
Field Summary
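Fields (Modifier and Type, Field, Description)
static final int AFTER_ATTRIBUTE_NAME
static final int AFTER_ATTRIBUTE_VALUE_QUOTED
static final int AFTER_DOCTYPE_NAME
static final int AFTER_DOCTYPE_PUBLIC_IDENTIFIER
static final int AFTER_DOCTYPE_PUBLIC_KEYWORD
static final int AFTER_DOCTYPE_SYSTEM_IDENTIFIER
static final int AFTER_DOCTYPE_SYSTEM_KEYWORD
protected LocatorImpl ampersandLocation
static final int ATTRIBUTE_NAME
static final int ATTRIBUTE_VALUE_DOUBLE_QUOTED
static final int ATTRIBUTE_VALUE_SINGLE_QUOTED
static final int ATTRIBUTE_VALUE_UNQUOTED
protected AttributeName attributeName - The current attribute name.
static final int BEFORE_ATTRIBUTE_NAME
static final int BEFORE_ATTRIBUTE_VALUE
static final int BEFORE_DOCTYPE_NAME
static final int BEFORE_DOCTYPE_PUBLIC_IDENTIFIER
static final int BEFORE_DOCTYPE_SYSTEM_IDENTIFIER
static final int BETWEEN_DOCTYPE_PUBLIC_AND_SYSTEM_IDENTIFIERS
static final int BOGUS_COMMENT
static final int BOGUS_COMMENT_HYPHEN
static final int BOGUS_DOCTYPE
static final int CDATA_RSQB
static final int CDATA_RSQB_RSQB
static final int CDATA_SECTION
static final int CDATA_START
static final int CHARACTER_REFERENCE_HILO_LOOKUP
static final int CHARACTER_REFERENCE_TAIL
static final int CLOSE_TAG_OPEN
static final int COMMENT
static final int COMMENT_END
static final int COMMENT_END_BANG
static final int COMMENT_END_DASH
static final int COMMENT_START
static final int COMMENT_START_DASH
protected boolean confident
static final int CONSUME_CHARACTER_REFERENCE
static final int CONSUME_NCR
protected int cstart
protected int currentBufferGlobalOffset
static final int DATA
static final int DECIMAL_NRC_LOOP
static final int DOCTYPE
static final int DOCTYPE_NAME
static final int DOCTYPE_PUBLIC_IDENTIFIER_DOUBLE_QUOTED
static final int DOCTYPE_PUBLIC_IDENTIFIER_SINGLE_QUOTED
static final int DOCTYPE_SYSTEM_IDENTIFIER_DOUBLE_QUOTED
static final int DOCTYPE_SYSTEM_IDENTIFIER_SINGLE_QUOTED
static final int DOCTYPE_UBLIC
static final int DOCTYPE_YSTEM
protected EncodingDeclarationHandler encodingDeclarationHandler
protected boolean endTag - true if tokenizing an end tag
protected ElementName endTagExpectation - The element whose end tag closes the current CDATA or RCDATA element.
protected ErrorHandler errorHandler - The error handler.
static final int HANDLE_NCR_VALUE
static final int HANDLE_NCR_VALUE_RECONSUME
static final int HEX_NCR_LOOP
protected boolean html4 - true when HTML4-specific additional errors are requested.
protected int index
protected boolean lastCR - Whether the previous char read was CR.
static final int MARKUP_DECLARATION_HYPHEN
static final int MARKUP_DECLARATION_OCTYPE
static final int MARKUP_DECLARATION_OPEN
static final int NON_DATA_END_TAG_NAME
static final int PLAINTEXT
static final int RAWTEXT
static final int RAWTEXT_RCDATA_LESS_THAN_SIGN
static final int RCDATA
static final int SCRIPT_DATA
static final int SCRIPT_DATA_DOUBLE_ESCAPE_END
static final int SCRIPT_DATA_DOUBLE_ESCAPE_START
static final int SCRIPT_DATA_DOUBLE_ESCAPED
static final int SCRIPT_DATA_DOUBLE_ESCAPED_DASH
static final int SCRIPT_DATA_DOUBLE_ESCAPED_DASH_DASH
static final int SCRIPT_DATA_DOUBLE_ESCAPED_LESS_THAN_SIGN
static final int SCRIPT_DATA_ESCAPE_START
static final int SCRIPT_DATA_ESCAPE_START_DASH
static final int SCRIPT_DATA_ESCAPED
static final int SCRIPT_DATA_ESCAPED_DASH
static final int SCRIPT_DATA_ESCAPED_DASH_DASH
static final int SCRIPT_DATA_ESCAPED_LESS_THAN_SIGN
static final int SCRIPT_DATA_LESS_THAN_SIGN
static final int SELF_CLOSING_START_TAG
protected int stateSave
static final int TAG_NAME
static final int TAG_OPEN
protected final TokenHandler tokenHandler - The token handler.
protected int value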
-
Constructor Summary
ConstructorsConstructorDescriptionTokenizer
(TokenHandler tokenHandler) The constructor.Tokenizer
(TokenHandler tokenHandler, boolean newAttributesEachTime) -
Method Summary
Modifier and TypeMethodDescriptionvoid
protected char
checkChar
(char[] buf, int pos) void
end()
void
eof()
void
Reports a Parse Error.protected void
errAstralNonCharacter
(int ch) protected void
protected void
errBadCharAfterLt
(char c) protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
errHtml4LtSlashInRcdata
(char folded) protected void
protected void
protected void
protected void
errLtGt()
protected void
protected void
protected void
protected void
protected void
protected char
errNcrControlChar
(char ch) protected void
errNcrCr()
protected void
protected char
errNcrNonCharacter
(char ch) protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
protected void
errQuoteBeforeAttributeName
(char c) protected void
protected void
void
errTreeBuilder
(String message) protected void
protected void
errUnquotedAttributeValOrNull
(char c) protected void
void
Reports a condition that would make the infoset incompatible with XML 1.0 as fatal.
protected void
flushChars
(char[] buf, int pos) Flushes coalesced character tokens.int
getCol()
Returns the col.int
int
getLine()
Returns the line.int
void
void
initLocation
(String newPublicId, String newSystemId) boolean
internalEncodingDeclaration
(String internalCharset) boolean
Returns the alreadyComplainedAboutNonAscii.boolean
boolean
Returns the mappingLangToXmlLang.boolean
Returns the nextCharOnNewLine.boolean
isPrevCR()
void
protected void
protected void
maybeErrSlashInEndTag
(boolean selfClosing) protected void
maybeWarnPrivateUse
(char ch) protected void
protected void
protected void
void
void
void
void
setCommentPolicy
(XmlViolationPolicy commentPolicy) Sets the commentPolicy.void
setContentNonXmlCharPolicy
(XmlViolationPolicy contentNonXmlCharPolicy) Sets the contentNonXmlCharPolicy.void
setContentSpacePolicy
(XmlViolationPolicy contentSpacePolicy) Sets the contentSpacePolicy.void
setEncodingDeclarationHandler
(EncodingDeclarationHandler encodingDeclarationHandler) Sets the encodingDeclarationHandler.void
Sets the error handler.void
setHtml4ModeCompatibleWithXhtml1Schemata
(boolean html4ModeCompatibleWithXhtml1Schemata) Sets the html4ModeCompatibleWithXhtml1Schemata.void
setInterner
(Interner interner) void
setLineNumber
(int line) For C++ use only.void
setMappingLangToXmlLang
(boolean mappingLangToXmlLang) Sets the mappingLangToXmlLang.void
setNamePolicy
(XmlViolationPolicy namePolicy) void
setStateAndEndTagExpectation
(int specialTokenizerState, String endTagExpectation) Sets the tokenizer state and the associated element name.void
setStateAndEndTagExpectation
(int specialTokenizerState, ElementName endTagExpectation) Sets the tokenizer state and the associated element name.void
setTransitionBaseOffset
(int offset) Sets an offset to be added to the position reported to TransitionHandler.
void
setXmlnsPolicy
(XmlViolationPolicy xmlnsPolicy) Sets the xmlnsPolicy.protected void
protected void
void
start()
protected void
protected String
The smaller buffer as a String.boolean
tokenizeBuffer
(UTF16Buffer buffer) protected int
transition
(int from, int to, boolean reconsume, int pos) void
Reports a warning.
-
Field Details
-
DATA
public static final int DATA
-
RCDATA
public static final int RCDATA
-
SCRIPT_DATA
public static final int SCRIPT_DATA
-
RAWTEXT
public static final int RAWTEXT
-
SCRIPT_DATA_ESCAPED
public static final int SCRIPT_DATA_ESCAPED
-
ATTRIBUTE_VALUE_DOUBLE_QUOTED
public static final int ATTRIBUTE_VALUE_DOUBLE_QUOTED
-
ATTRIBUTE_VALUE_SINGLE_QUOTED
public static final int ATTRIBUTE_VALUE_SINGLE_QUOTED
-
ATTRIBUTE_VALUE_UNQUOTED
public static final int ATTRIBUTE_VALUE_UNQUOTED
-
PLAINTEXT
public static final int PLAINTEXT
-
TAG_OPEN
public static final int TAG_OPEN
-
CLOSE_TAG_OPEN
public static final int CLOSE_TAG_OPEN
-
TAG_NAME
public static final int TAG_NAME
-
BEFORE_ATTRIBUTE_NAME
public static final int BEFORE_ATTRIBUTE_NAME
-
ATTRIBUTE_NAME
public static final int ATTRIBUTE_NAME
-
AFTER_ATTRIBUTE_NAME
public static final int AFTER_ATTRIBUTE_NAME
-
BEFORE_ATTRIBUTE_VALUE
public static final int BEFORE_ATTRIBUTE_VALUE
-
AFTER_ATTRIBUTE_VALUE_QUOTED
public static final int AFTER_ATTRIBUTE_VALUE_QUOTED
-
BOGUS_COMMENT
public static final int BOGUS_COMMENT
-
MARKUP_DECLARATION_OPEN
public static final int MARKUP_DECLARATION_OPEN
-
DOCTYPE
public static final int DOCTYPE
-
BEFORE_DOCTYPE_NAME
public static final int BEFORE_DOCTYPE_NAME
-
DOCTYPE_NAME
public static final int DOCTYPE_NAME
-
AFTER_DOCTYPE_NAME
public static final int AFTER_DOCTYPE_NAME
-
BEFORE_DOCTYPE_PUBLIC_IDENTIFIER
public static final int BEFORE_DOCTYPE_PUBLIC_IDENTIFIER
-
DOCTYPE_PUBLIC_IDENTIFIER_DOUBLE_QUOTED
public static final int DOCTYPE_PUBLIC_IDENTIFIER_DOUBLE_QUOTED
-
DOCTYPE_PUBLIC_IDENTIFIER_SINGLE_QUOTED
public static final int DOCTYPE_PUBLIC_IDENTIFIER_SINGLE_QUOTED
-
AFTER_DOCTYPE_PUBLIC_IDENTIFIER
public static final int AFTER_DOCTYPE_PUBLIC_IDENTIFIER
-
BEFORE_DOCTYPE_SYSTEM_IDENTIFIER
public static final int BEFORE_DOCTYPE_SYSTEM_IDENTIFIER
-
DOCTYPE_SYSTEM_IDENTIFIER_DOUBLE_QUOTED
public static final int DOCTYPE_SYSTEM_IDENTIFIER_DOUBLE_QUOTED
-
DOCTYPE_SYSTEM_IDENTIFIER_SINGLE_QUOTED
public static final int DOCTYPE_SYSTEM_IDENTIFIER_SINGLE_QUOTED
-
AFTER_DOCTYPE_SYSTEM_IDENTIFIER
public static final int AFTER_DOCTYPE_SYSTEM_IDENTIFIER
-
BOGUS_DOCTYPE
public static final int BOGUS_DOCTYPE
-
COMMENT_START
public static final int COMMENT_START
-
COMMENT_START_DASH
public static final int COMMENT_START_DASH
-
COMMENT
public static final int COMMENT
-
COMMENT_END_DASH
public static final int COMMENT_END_DASH
-
COMMENT_END
public static final int COMMENT_END
-
COMMENT_END_BANG
public static final int COMMENT_END_BANG
-
NON_DATA_END_TAG_NAME
public static final int NON_DATA_END_TAG_NAME
-
MARKUP_DECLARATION_HYPHEN
public static final int MARKUP_DECLARATION_HYPHEN
-
MARKUP_DECLARATION_OCTYPE
public static final int MARKUP_DECLARATION_OCTYPE
-
DOCTYPE_UBLIC
public static final int DOCTYPE_UBLIC
-
DOCTYPE_YSTEM
public static final int DOCTYPE_YSTEM
-
AFTER_DOCTYPE_PUBLIC_KEYWORD
public static final int AFTER_DOCTYPE_PUBLIC_KEYWORD
-
BETWEEN_DOCTYPE_PUBLIC_AND_SYSTEM_IDENTIFIERS
public static final int BETWEEN_DOCTYPE_PUBLIC_AND_SYSTEM_IDENTIFIERS
-
AFTER_DOCTYPE_SYSTEM_KEYWORD
public static final int AFTER_DOCTYPE_SYSTEM_KEYWORD
-
CONSUME_CHARACTER_REFERENCE
public static final int CONSUME_CHARACTER_REFERENCE
-
CONSUME_NCR
public static final int CONSUME_NCR
-
CHARACTER_REFERENCE_TAIL
public static final int CHARACTER_REFERENCE_TAIL
-
HEX_NCR_LOOP
public static final int HEX_NCR_LOOP
-
DECIMAL_NRC_LOOP
public static final int DECIMAL_NRC_LOOP
-
HANDLE_NCR_VALUE
public static final int HANDLE_NCR_VALUE
-
HANDLE_NCR_VALUE_RECONSUME
public static final int HANDLE_NCR_VALUE_RECONSUME
-
CHARACTER_REFERENCE_HILO_LOOKUP
public static final int CHARACTER_REFERENCE_HILO_LOOKUP
-
SELF_CLOSING_START_TAG
public static final int SELF_CLOSING_START_TAG
-
CDATA_START
public static final int CDATA_START
-
CDATA_SECTION
public static final int CDATA_SECTION
-
CDATA_RSQB
public static final int CDATA_RSQB
-
CDATA_RSQB_RSQB
public static final int CDATA_RSQB_RSQB
-
SCRIPT_DATA_LESS_THAN_SIGN
public static final int SCRIPT_DATA_LESS_THAN_SIGN
-
SCRIPT_DATA_ESCAPE_START
public static final int SCRIPT_DATA_ESCAPE_START
-
SCRIPT_DATA_ESCAPE_START_DASH
public static final int SCRIPT_DATA_ESCAPE_START_DASH
-
SCRIPT_DATA_ESCAPED_DASH
public static final int SCRIPT_DATA_ESCAPED_DASH
-
SCRIPT_DATA_ESCAPED_DASH_DASH
public static final int SCRIPT_DATA_ESCAPED_DASH_DASH
-
BOGUS_COMMENT_HYPHEN
public static final int BOGUS_COMMENT_HYPHEN
-
RAWTEXT_RCDATA_LESS_THAN_SIGN
public static final int RAWTEXT_RCDATA_LESS_THAN_SIGN
-
SCRIPT_DATA_ESCAPED_LESS_THAN_SIGN
public static final int SCRIPT_DATA_ESCAPED_LESS_THAN_SIGN
-
SCRIPT_DATA_DOUBLE_ESCAPE_START
public static final int SCRIPT_DATA_DOUBLE_ESCAPE_START
-
SCRIPT_DATA_DOUBLE_ESCAPED
public static final int SCRIPT_DATA_DOUBLE_ESCAPED
-
SCRIPT_DATA_DOUBLE_ESCAPED_LESS_THAN_SIGN
public static final int SCRIPT_DATA_DOUBLE_ESCAPED_LESS_THAN_SIGN
-
SCRIPT_DATA_DOUBLE_ESCAPED_DASH
public static final int SCRIPT_DATA_DOUBLE_ESCAPED_DASH
-
SCRIPT_DATA_DOUBLE_ESCAPED_DASH_DASH
public static final int SCRIPT_DATA_DOUBLE_ESCAPED_DASH_DASH
-
SCRIPT_DATA_DOUBLE_ESCAPE_END
public static final int SCRIPT_DATA_DOUBLE_ESCAPE_END
-
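tokenHandler
protected final TokenHandler tokenHandler
The token handler.
-
encodingDeclarationHandler
protected EncodingDeclarationHandler encodingDeclarationHandler
-
errorHandler
protected ErrorHandler errorHandler
The error handler.
-
lastCR
protected boolean lastCR
Whether the previous char read was CR.
-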
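A flag such as lastCR is useful because input arrives in buffers, and a CR LF pair can straddle a buffer boundary. The sketch below illustrates that idea with a hypothetical LineCounterSketch class; it is an assumption about why such a flag is kept, not a copy of the tokenizer's own line-tracking code.

```java
class LineCounterSketch {
    private boolean lastCR = false; // whether the previous char read was CR
    private int line = 1;

    /** Consumes buf[start..end) and updates the line count. */
    void consume(char[] buf, int start, int end) {
        for (int i = start; i < end; i++) {
            char c = buf[i];
            if (c == '\r') {
                line++;            // a CR always ends a line
                lastCR = true;
            } else if (c == '\n') {
                if (!lastCR) {
                    line++;        // a lone LF ends a line
                }
                // an LF directly after a CR belongs to the same line break
                lastCR = false;
            } else {
                lastCR = false;
            }
        }
    }

    int getLine() {
        return line;
    }

    public static void main(String[] args) {
        LineCounterSketch counter = new LineCounterSketch();
        char[] first = "a\r".toCharArray();      // chunk ends with CR
        char[] second = "\nb\nc".toCharArray();  // next chunk starts with LF
        counter.consume(first, 0, first.length);
        counter.consume(second, 0, second.length);
        System.out.println(counter.getLine()); // prints 3
    }
}
```

Without the flag surviving between calls, the split CR LF would be counted as two line breaks.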
stateSave
protected int stateSave -
index
protected int index -
value
protected int value -
cstart
protected int cstart -
endTagExpectation
protected ElementName endTagExpectation
The element whose end tag closes the current CDATA or RCDATA element.
-
endTag
protected boolean endTag
true if tokenizing an end tag
-
attributeName
protected AttributeName attributeName
The current attribute name.
-
html4
protected boolean html4
true when HTML4-specific additional errors are requested.
-
confident
protected boolean confident -
currentBufferGlobalOffset
protected int currentBufferGlobalOffset -
ampersandLocation
protected LocatorImpl ampersandLocation
-
-
Constructor Details
-
Tokenizer
-
Tokenizer
The constructor.
- Parameters:
tokenHandler - the handler for receiving tokens
-
-
Method Details
-
setInterner
-
initLocation
-
isMappingLangToXmlLang
public boolean isMappingLangToXmlLang()
Returns the mappingLangToXmlLang.
- Returns:
- the mappingLangToXmlLang
-
setMappingLangToXmlLang
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Sets the mappingLangToXmlLang.
- Parameters:
mappingLangToXmlLang - the mappingLangToXmlLang to set
-
setErrorHandler
Sets the error handler.
-
getErrorHandler
-
setCommentPolicy
Sets the commentPolicy.
- Parameters:
commentPolicy - the commentPolicy to set
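The policy setters choose between the behaviors named in the class description: reporting the data unchanged, treating the condition as fatal, or coercing the infoset. The sketch below illustrates those three outcomes for comment text containing two consecutive hyphens, which XML 1.0 does not allow inside a comment. Everything in it is hypothetical: the Policy enum is a local stand-in (its constant names are an assumption about the library's XmlViolationPolicy), and inserting spaces is only one possible coercion, not necessarily what the tokenizer does.

```java
class CommentPolicySketch {
    /** Hypothetical stand-in for the library's policy type. */
    enum Policy { ALLOW, FATAL, ALTER_INFOSET }

    /** Applies the policy to comment text that may contain "--". */
    static String applyCommentPolicy(String comment, Policy policy) {
        if (!comment.contains("--")) {
            return comment; // nothing that this sketch checks for
        }
        switch (policy) {
            case FATAL:
                throw new IllegalStateException(
                        "Consecutive hyphens in a comment are not allowed in XML 1.0.");
            case ALTER_INFOSET: {
                // Break up each run of hyphens by inserting a space between them.
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < comment.length(); i++) {
                    char c = comment.charAt(i);
                    if (c == '-' && sb.length() > 0 && sb.charAt(sb.length() - 1) == '-') {
                        sb.append(' ');
                    }
                    sb.append(c);
                }
                return sb.toString();
            }
            default:
                return comment; // ALLOW: report the data unchanged
        }
    }

    public static void main(String[] args) {
        System.out.println(applyCommentPolicy("a--b", Policy.ALTER_INFOSET)); // prints "a- -b"
        System.out.println(applyCommentPolicy("a--b", Policy.ALLOW));         // prints "a--b"
    }
}
```

The sketch only looks at consecutive hyphens; XML 1.0 places further restrictions on comments that it does not attempt to cover.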
-
setContentNonXmlCharPolicy
Sets the contentNonXmlCharPolicy.
- Parameters:
contentNonXmlCharPolicy - the contentNonXmlCharPolicy to set
-
setContentSpacePolicy
Sets the contentSpacePolicy.
- Parameters:
contentSpacePolicy - the contentSpacePolicy to set
-
setXmlnsPolicy
Sets the xmlnsPolicy.
- Parameters:
xmlnsPolicy - the xmlnsPolicy to set
-
setNamePolicy
-
setHtml4ModeCompatibleWithXhtml1Schemata
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Sets the html4ModeCompatibleWithXhtml1Schemata.
- Parameters:
html4ModeCompatibleWithXhtml1Schemata - the html4ModeCompatibleWithXhtml1Schemata to set
-
setStateAndEndTagExpectation
Sets the tokenizer state and the associated element name. This should only ever be used to put the tokenizer into one of the states that have a special end tag expectation.
- Parameters:
specialTokenizerState - the tokenizer state to set
endTagExpectation - the expected end tag for transitioning back to normal
-
setStateAndEndTagExpectation
Sets the tokenizer state and the associated element name. This should only ever be used to put the tokenizer into one of the states that have a special end tag expectation.
- Parameters:
specialTokenizerState - the tokenizer state to set
endTagExpectation - the expected end tag for transitioning back to normal
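A caller, typically a tree builder, has to decide which special state an element's content requires before calling this method. The sketch below shows one way to make that decision, following the element-to-state mapping of the HTML specification for RCDATA, RAWTEXT and script data. The SpecialStateSketch class and its SpecialState enum are hypothetical stand-ins for the int constants of the same names; the sketch leaves out PLAINTEXT and the scripting-dependent handling of noscript.

```java
import java.util.Locale;

class SpecialStateSketch {
    /** Hypothetical stand-ins for the Tokenizer constants of the same names. */
    enum SpecialState { RCDATA, RAWTEXT, SCRIPT_DATA, NONE }

    /** Picks the special tokenizer state for the content of the given element. */
    static SpecialState stateFor(String elementName) {
        switch (elementName.toLowerCase(Locale.ROOT)) {
            case "title":
            case "textarea":
                return SpecialState.RCDATA;
            case "style":
            case "xmp":
            case "iframe":
            case "noembed":
            case "noframes":
                return SpecialState.RAWTEXT;
            case "script":
                return SpecialState.SCRIPT_DATA;
            default:
                return SpecialState.NONE; // ordinary content, no special state
        }
    }

    public static void main(String[] args) {
        System.out.println(stateFor("title"));  // prints RCDATA
        System.out.println(stateFor("script")); // prints SCRIPT_DATA
        System.out.println(stateFor("div"));    // prints NONE
    }
}
```

With the real class, the chosen state and the element name would presumably be passed together, for example the RCDATA constant with "title", so that only the matching end tag returns the tokenizer to normal operation.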
-
setLineNumber
public void setLineNumber(int line)
For C++ use only.
-
getLineNumber
public int getLineNumber()
- Specified by:
getLineNumber in interface Locator
-
getColumnNumber
public int getColumnNumber()
- Specified by:
getColumnNumber in interface Locator
-
getPublicId
- Specified by:
getPublicId in interface Locator
-
getSystemId
- Specified by:
getSystemId in interface Locator
-
notifyAboutMetaBoundary
public void notifyAboutMetaBoundary() -
strBufToString
The smaller buffer as a String. Currently only used for error reporting.
C++ memory note: The return value must be released.
- Returns:
- the smaller buffer as a string
-
flushChars
Flushes coalesced character tokens.
- Parameters:
buf - TODO
pos - TODO
- Throws:
SAXException
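Coalescing here presumably means that the tokenizer remembers where a run of character data began and hands the whole run to the token handler in one call. The sketch below illustrates that pattern under stated assumptions: that the cstart field marks the start of the pending run, that buf is the current buffer and pos the position up to which characters are pending, and that Integer.MAX_VALUE serves as a "no pending run" marker. The FlushCharsSketch class and its received list are hypothetical and stand in for the real token handler callback.

```java
import java.util.ArrayList;
import java.util.List;

class FlushCharsSketch {
    /** Hypothetical receiver standing in for the token handler. */
    final List<String> received = new ArrayList<>();

    // Start of the pending run; MAX_VALUE means "no pending run" (assumption).
    int cstart = Integer.MAX_VALUE;

    /** Emits buf[cstart..pos) as one chunk, if there is anything pending. */
    void flushChars(char[] buf, int pos) {
        if (pos > cstart) {
            received.add(new String(buf, cstart, pos - cstart));
        }
        cstart = Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        FlushCharsSketch sketch = new FlushCharsSketch();
        char[] buf = "hello<b>".toCharArray();
        sketch.cstart = 0;         // character data starts at index 0
        sketch.flushChars(buf, 5); // a '<' was seen at index 5
        System.out.println(sketch.received); // prints [hello]
    }
}
```

Emitting one run per call avoids a handler invocation for every single character.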
-
fatal
Reports a condition that would make the infoset incompatible with XML 1.0 as fatal.
- Parameters:
message - the message
- Throws:
SAXException
SAXParseException
-
err
Reports a Parse Error.
- Parameters:
message - the message
- Throws:
SAXException
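Reported errors end up at the configured error handler. The sketch below shows a handler that collects messages together with their positions. It assumes that the ErrorHandler used by this class is org.xml.sax.ErrorHandler and that parse errors, warnings and fatal conditions arrive through error, warning and fatalError respectively; CollectingErrorHandler is a hypothetical name, and the demo constructs the exception by hand instead of running a tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.LocatorImpl;

class CollectingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        messages.add(level + " at " + e.getLineNumber() + ":" + e.getColumnNumber()
                + ": " + e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        record("fatal", e);
        throw e; // stop processing on fatal conditions
    }

    public static void main(String[] args) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        LocatorImpl where = new LocatorImpl();
        where.setLineNumber(3);
        where.setColumnNumber(5);
        handler.error(new SAXParseException("Bad character", where));
        System.out.println(handler.messages.get(0));
        // prints "error at 3:5: Bad character"
    }
}
```

Such a handler would be registered with setErrorHandler before tokenization starts.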
-
errTreeBuilder
- Throws:
SAXException
-
warn
Reports a warning.
- Parameters:
message - the message
- Throws:
SAXException
-
startErrorReporting
- Throws:
SAXException
-
start
- Throws:
SAXException
-
tokenizeBuffer
- Throws:
SAXException
-
transition
- Throws:
SAXException
-
silentCarriageReturn
protected void silentCarriageReturn() -
silentLineFeed
protected void silentLineFeed() -
eof
- Throws:
SAXException
-
checkChar
- Throws:
SAXException
-
isAlreadyComplainedAboutNonAscii
public boolean isAlreadyComplainedAboutNonAscii()
Returns the alreadyComplainedAboutNonAscii.
- Returns:
- the alreadyComplainedAboutNonAscii
-
internalEncodingDeclaration
- Throws:
SAXException
-
end
- Throws:
SAXException
-
requestSuspension
public void requestSuspension() -
becomeConfident
public void becomeConfident() -
isNextCharOnNewLine
public boolean isNextCharOnNewLine()
Returns the nextCharOnNewLine.
- Returns:
- the nextCharOnNewLine
-
isPrevCR
public boolean isPrevCR()
-
getLine
public int getLine()
Returns the line.
- Returns:
- the line
-
getCol
public int getCol()
Returns the col.
- Returns:
- the col
-
isInDataState
public boolean isInDataState() -
resetToDataState
public void resetToDataState() -
loadState
- Throws:
SAXException
-
initializeWithoutStarting
- Throws:
SAXException
-
errGarbageAfterLtSlash
- Throws:
SAXException
-
errLtSlashGt
- Throws:
SAXException
-
errWarnLtSlashInRcdata
- Throws:
SAXException
-
errHtml4LtSlashInRcdata
- Throws:
SAXException
-
errCharRefLacksSemicolon
- Throws:
SAXException
-
errNoDigitsInNCR
- Throws:
SAXException
-
errGtInSystemId
- Throws:
SAXException
-
errGtInPublicId
- Throws:
SAXException
-
errNamelessDoctype
- Throws:
SAXException
-
errConsecutiveHyphens
- Throws:
SAXException
-
errPrematureEndOfComment
- Throws:
SAXException
-
errBogusComment
- Throws:
SAXException
-
errUnquotedAttributeValOrNull
- Throws:
SAXException
-
errSlashNotFollowedByGt
- Throws:
SAXException
-
errHtml4XmlVoidSyntax
- Throws:
SAXException
-
errNoSpaceBetweenAttributes
- Throws:
SAXException
-
errHtml4NonNameInUnquotedAttribute
- Throws:
SAXException
-
errLtOrEqualsOrGraveInUnquotedAttributeOrNull
- Throws:
SAXException
-
errAttributeValueMissing
- Throws:
SAXException
-
errBadCharBeforeAttributeNameOrNull
- Throws:
SAXException
-
errEqualsSignBeforeAttributeName
- Throws:
SAXException
-
errBadCharAfterLt
- Throws:
SAXException
-
errLtGt
- Throws:
SAXException
-
errProcessingInstruction
- Throws:
SAXException
-
errUnescapedAmpersandInterpretedAsCharacterReference
- Throws:
SAXException
-
errNotSemicolonTerminated
- Throws:
SAXException
-
errNoNamedCharacterMatch
- Throws:
SAXException
-
errQuoteBeforeAttributeName
- Throws:
SAXException
-
errQuoteOrLtInAttributeNameOrNull
- Throws:
SAXException
-
errExpectedPublicId
- Throws:
SAXException
-
errBogusDoctype
- Throws:
SAXException
-
maybeWarnPrivateUseAstral
- Throws:
SAXException
-
maybeWarnPrivateUse
- Throws:
SAXException
-
maybeErrAttributesOnEndTag
- Throws:
SAXException
-
maybeErrSlashInEndTag
- Throws:
SAXException
-
errNcrNonCharacter
- Throws:
SAXException
-
errAstralNonCharacter
- Throws:
SAXException
-
errNcrSurrogate
- Throws:
SAXException
-
errNcrControlChar
- Throws:
SAXException
-
errNcrCr
- Throws:
SAXException
-
errNcrInC1Range
- Throws:
SAXException
-
errEofInPublicId
- Throws:
SAXException
-
errEofInComment
- Throws:
SAXException
-
errEofInDoctype
- Throws:
SAXException
-
errEofInAttributeValue
- Throws:
SAXException
-
errEofInAttributeName
- Throws:
SAXException
-
errEofWithoutGt
- Throws:
SAXException
-
errEofInTagName
- Throws:
SAXException
-
errEofInEndTag
- Throws:
SAXException
-
errEofAfterLt
- Throws:
SAXException
-
errNcrOutOfRange
- Throws:
SAXException
-
errNcrUnassigned
- Throws:
SAXException
-
errDuplicateAttribute
- Throws:
SAXException
-
errEofInSystemId
- Throws:
SAXException
-
errExpectedSystemId
- Throws:
SAXException
-
errMissingSpaceBeforeDoctypeName
- Throws:
SAXException
-
errHyphenHyphenBang
- Throws:
SAXException
-
errNcrControlChar
- Throws:
SAXException
-
errNcrZero
- Throws:
SAXException
-
errNoSpaceBetweenDoctypeSystemKeywordAndQuote
- Throws:
SAXException
-
errNoSpaceBetweenPublicAndSystemIds
- Throws:
SAXException
-
errNoSpaceBetweenDoctypePublicKeywordAndQuote
- Throws:
SAXException
-
noteAttributeWithoutValue
- Throws:
SAXException
-
noteUnquotedAttributeValue
- Throws:
SAXException
-
setEncodingDeclarationHandler
Sets the encodingDeclarationHandler.
- Parameters:
encodingDeclarationHandler - the encodingDeclarationHandler to set
-
setTransitionBaseOffset
public void setTransitionBaseOffset(int offset)
Sets an offset to be added to the position reported to TransitionHandler.
- Parameters:
offset - the offset
-