Class HtmlParser
- All Implemented Interfaces:
XMLReader
- Direct Known Subclasses:
InfosetCoercingHtmlParser
By default, when constructed without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW.
It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
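As a minimal sketch of how an application might observe errors that the streaming configuration escalates, the class below collects reports through the standard SAX ErrorHandler interface. CollectingErrorHandler and its method names are hypothetical and not part of this library. The sketch assumes that this parser, like other XMLReader implementations, passes fatal errors to the registered handler's fatalError method; only JDK types are used, so the streamability setter appears in a comment rather than in code.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: records parser reports instead of printing or rethrowing them. */
final class CollectingErrorHandler implements ErrorHandler {
    private final List<String> recoverable = new ArrayList<>();
    private final List<String> fatal = new ArrayList<>();

    /**
     * Registers this handler via the standard XMLReader method. With the parser
     * documented here, the caller would first have invoked
     * setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL) on the instance.
     */
    CollectingErrorHandler attachTo(XMLReader reader) {
        reader.setErrorHandler(this);
        return this;
    }

    @Override
    public void warning(SAXParseException e) {
        recoverable.add(describe(e));
    }

    @Override
    public void error(SAXParseException e) {
        recoverable.add(describe(e));
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Assumption: errors needing non-streamable recovery arrive here in streaming mode.
        fatal.add(describe(e));
    }

    List<String> recoverable() {
        return recoverable;
    }

    List<String> fatal() {
        return fatal;
    }

    /** Formats a report as "line:column message". */
    private static String describe(SAXParseException e) {
        return e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
    }
}
```

After a parse, a non-empty fatal() list would indicate that the document could not be processed without buffering, under the assumption stated above.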
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true).
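The following sketch shows one way to capture the doctype once reporting is enabled. DoctypeRecorder is a hypothetical class, not part of this library; it implements the standard org.xml.sax.ext.LexicalHandler interface and registers itself through the lexical-handler property URI listed under setProperty. The call to setReportingDoctype(true) is specific to this parser and is therefore shown only in a comment.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;

/** Hypothetical helper: remembers the doctype reported through LexicalHandler. */
final class DoctypeRecorder implements LexicalHandler {
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    private boolean seen;
    private String name;
    private String publicId;
    private String systemId;

    /**
     * Registers this recorder through the generic XMLReader property mechanism.
     * With the parser documented here, setReportingDoctype(true) must also be
     * called on the instance, otherwise startDTD is never invoked.
     *
     * @return false if the reader rejects the lexical-handler property
     */
    boolean installOn(XMLReader reader) {
        try {
            reader.setProperty(LEXICAL_HANDLER, this);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        this.seen = true;
        this.name = name;
        this.publicId = publicId;
        this.systemId = systemId;
    }

    // The remaining LexicalHandler callbacks are not needed for doctype capture.
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }
    @Override public void comment(char[] ch, int start, int length) { }

    boolean seen() { return seen; }
    String name() { return name; }
    String publicId() { return publicId; }
    String systemId() { return systemId; }
}
```

Calling setLexicalHandler directly on the parser would be an equivalent way to register the recorder; the property route is shown because it works through the plain XMLReader type.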
- Author:
- hsivonen
-
Constructor Summary
Constructor: Description
HtmlParser(): Instantiates the parser with a fatal XML violation policy.
HtmlParser(XmlViolationPolicy xmlPolicy): Instantiates the parser with a specific XML violation policy.
-
Method Summary
Modifier and Type, Method: Description
void addCharacterHandler(CharacterHandler characterHandler)
XmlViolationPolicy getBogusXmlnsPolicy(): Deprecated.
XmlViolationPolicy getCommentPolicy(): Returns the commentPolicy.
ContentHandler getContentHandler()
XmlViolationPolicy getContentNonXmlCharPolicy(): Returns the contentNonXmlCharPolicy.
XmlViolationPolicy getContentSpacePolicy(): Returns the contentSpacePolicy.
DoctypeExpectation getDoctypeExpectation(): Returns the doctype expectation.
Locator getDocumentLocator(): Returns the Locator during parse.
DocumentModeHandler getDocumentModeHandler(): Returns the document mode handler.
DTDHandler getDTDHandler()
EntityResolver getEntityResolver()
ErrorHandler getErrorHandler()
boolean getFeature(String name): Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.
Heuristics getHeuristics()
LexicalHandler getLexicalHandler(): Returns the lexicalHandler.
XmlViolationPolicy getNamePolicy(): The policy for non-NCName element and attribute names.
Object getProperty(String name): Allows XMLReader-level access to non-boolean valued getters.
XmlViolationPolicy getStreamabilityViolationPolicy(): Returns the streamabilityViolationPolicy.
XmlViolationPolicy getXmlnsPolicy(): Returns the xmlnsPolicy.
boolean isCheckingNormalization(): Indicates whether NFC normalization of source is being checked.
boolean isHtml4ModeCompatibleWithXhtml1Schemata(): Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
boolean isMappingLangToXmlLang(): Whether lang is mapped to xml:lang.
boolean isReportingDoctype(): Returns the reportingDoctype.
boolean isScriptingEnabled(): Whether the parser considers scripting to be enabled for noscript treatment.
void parse(String systemId)
void parse(InputSource input)
void parseFragment(InputSource input, String context): Parses a fragment.
void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy): Deprecated.
void setCheckingNormalization(boolean enable): Toggles the checking of the NFC normalization of source.
void setCommentPolicy(XmlViolationPolicy commentPolicy): Sets the policy for consecutive hyphens in comments.
void setContentHandler(ContentHandler handler)
void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy): Sets the policy for non-XML characters except white space.
void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy): Sets the policy for non-XML white space.
void setDoctypeExpectation(DoctypeExpectation doctypeExpectation): Sets the doctype expectation.
void setDocumentModeHandler(DocumentModeHandler documentModeHandler): Sets the document mode handler.
void setDTDHandler(DTDHandler handler)
void setEntityResolver(EntityResolver resolver)
void setErrorHandler(ErrorHandler handler)
void setErrorProfile(HashMap<String, String> errorProfileMap)
void setFeature(String name, boolean value): Sets a boolean feature without having to use non-XMLReader setters directly.
void setHeuristics(Heuristics heuristics): Sets the encoding sniffing heuristics.
void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata): Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
void setLexicalHandler(LexicalHandler handler): Sets the lexical handler.
void setMappingLangToXmlLang(boolean mappingLangToXmlLang): Whether lang is mapped to xml:lang.
void setNamePolicy(XmlViolationPolicy namePolicy): The policy for non-NCName element and attribute names.
void setProperty(String name, Object value): Sets a non-boolean property without having to use non-XMLReader setters directly.
void setReportingDoctype(boolean reportingDoctype)
void setScriptingEnabled(boolean scriptingEnabled): Sets whether the parser considers scripting to be enabled for noscript treatment.
void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy): Sets the streamabilityViolationPolicy.
void setTransitionHandler(TransitionHandler handler)
void setTreeBuilderErrorHandlerOverride: Deprecated. For Validator.nu internal use
void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy): Whether the xmlns attribute on the root element is passed through.
void setXmlPolicy(XmlViolationPolicy xmlPolicy): This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
-
Constructor Details
-
HtmlParser
public HtmlParser()
Instantiates the parser with a fatal XML violation policy.
-
HtmlParser
public HtmlParser(XmlViolationPolicy xmlPolicy)
Instantiates the parser with a specific XML violation policy.
- Parameters:
xmlPolicy
- the policy
-
-
Method Details
-
getContentHandler
- Specified by:
getContentHandler
in interface XMLReader
-
getDTDHandler
- Specified by:
getDTDHandler
in interface XMLReader
-
getEntityResolver
- Specified by:
getEntityResolver
in interface XMLReader
-
getErrorHandler
- Specified by:
getErrorHandler
in interface XMLReader
-
getFeature
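Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.
The features are mapped as follows, each URI followed by its value or the getter it corresponds to:
http://xml.org/sax/features/external-general-entities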
false
http://xml.org/sax/features/external-parameter-entities
false
http://xml.org/sax/features/is-standalone
true
http://xml.org/sax/features/lexical-handler/parameter-entities
false
http://xml.org/sax/features/namespaces
true
http://xml.org/sax/features/namespace-prefixes
false
http://xml.org/sax/features/resolve-dtd-uris
true
http://xml.org/sax/features/string-interning
false
http://xml.org/sax/features/unicode-normalization-checking
isCheckingNormalization
http://xml.org/sax/features/use-attributes2
false
http://xml.org/sax/features/use-locator2
false
http://xml.org/sax/features/use-entity-resolver2
false
http://xml.org/sax/features/validation
false
http://xml.org/sax/features/xmlns-uris
false
http://xml.org/sax/features/xml-1.1
false
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
isHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang
isMappingLangToXmlLang
http://validator.nu/features/scripting-enabled
isScriptingEnabled
- Specified by:
getFeature
in interface XMLReader
- Parameters:
name
- feature URI string
- Returns:
- a value per the list above
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
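As an illustration of reading these settings through the generic interface, the sketch below queries a set of feature URIs on any XMLReader and reports each as "true", "false", "not recognized" or "not supported". FeatureProbe and its method are hypothetical names introduced here; the two constants are taken from the list above. Nothing in the sketch depends on this parser's own getters, which is the point of the URI-based access.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: inspects boolean features through the XMLReader interface only. */
final class FeatureProbe {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String SCRIPTING_ENABLED = "http://validator.nu/features/scripting-enabled";

    private FeatureProbe() { }

    /** Returns the state of one feature as text, without propagating SAX exceptions. */
    static String describe(XMLReader reader, String featureUri) {
        try {
            return Boolean.toString(reader.getFeature(featureUri));
        } catch (SAXNotRecognizedException e) {
            return "not recognized";
        } catch (SAXNotSupportedException e) {
            return "not supported";
        }
    }

    /** Describes several features at once, preserving the order of the arguments. */
    static Map<String, String> snapshot(XMLReader reader, String... featureUris) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String uri : featureUris) {
            result.put(uri, describe(reader, uri));
        }
        return result;
    }
}
```

A reader other than this parser would typically answer "not recognized" for the validator.nu URIs, which makes the helper usable for telling the two apart at run time.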
-
getProperty
Allows XMLReader-level access to non-boolean valued getters.
The properties are mapped as follows:
http://xml.org/sax/properties/document-xml-version
"1.0"
http://xml.org/sax/properties/lexical-handler
getLexicalHandler
http://validator.nu/properties/content-space-policy
getContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy
getContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy
getCommentPolicy
http://validator.nu/properties/xmlns-policy
getXmlnsPolicy
http://validator.nu/properties/name-policy
getNamePolicy
http://validator.nu/properties/streamability-violation-policy
getStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler
getDocumentModeHandler
http://validator.nu/properties/doctype-expectation
getDoctypeExpectation
http://xml.org/sax/features/unicode-normalization-checking
- Specified by:
getProperty
in interface XMLReader
- Parameters:
name
- property URI string
- Returns:
- a value per the list above
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
-
parse
- Specified by:
parse
in interface XMLReader
- Throws:
IOException
SAXException
-
parseFragment
Parses a fragment.
- Parameters:
input
- the input to parse
context
- the name of the context element
- Throws:
IOException
SAXException
-
parse
- Specified by:
parse
in interface XMLReader
- Throws:
IOException
SAXException
-
setContentHandler
- Specified by:
setContentHandler
in interface XMLReader
-
setLexicalHandler
Sets the lexical handler.
- Parameters:
handler
- the handler.
-
setDTDHandler
- Specified by:
setDTDHandler
in interface XMLReader
-
setEntityResolver
- Specified by:
setEntityResolver
in interface XMLReader
-
setErrorHandler
- Specified by:
setErrorHandler
in interface XMLReader
-
setTransitionHandler
-
setTreeBuilderErrorHandlerOverride
Deprecated. For Validator.nu internal use
-
setFeature
public void setFeature(String name, boolean value) throws SAXNotRecognizedException, SAXNotSupportedException
Sets a boolean feature without having to use non-XMLReader setters directly.
The supported features are:
http://xml.org/sax/features/unicode-normalization-checking
setCheckingNormalization
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
setHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang
setMappingLangToXmlLang
http://validator.nu/features/scripting-enabled
setScriptingEnabled
- Specified by:
setFeature
in interface XMLReader
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
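The sketch below applies several boolean features in one pass and reports which URIs the reader rejected, which can be useful when the same configuration code has to run against both this parser and an ordinary XML parser. FeatureConfigurer and applyAll are hypothetical names, not part of this library; the feature URIs in the usage note come from the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: sets boolean features through the XMLReader interface only. */
final class FeatureConfigurer {
    private FeatureConfigurer() { }

    /**
     * Tries to set every feature in the map and keeps going when one is rejected.
     *
     * @return the URIs the reader did not recognize or did not support, in iteration order
     */
    static List<String> applyAll(XMLReader reader, Map<String, Boolean> features) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : features.entrySet()) {
            try {
                reader.setFeature(entry.getKey(), entry.getValue());
            } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
                rejected.add(entry.getKey());
            }
        }
        return rejected;
    }
}
```

For example, a map containing http://validator.nu/features/scripting-enabled mapped to true and http://validator.nu/features/mapping-lang-to-xml-lang mapped to false would correspond to calling setScriptingEnabled(true) and setMappingLangToXmlLang(false) on the parser.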
-
setProperty
public void setProperty(String name, Object value) throws SAXNotRecognizedException, SAXNotSupportedException
Sets a non-boolean property without having to use non-XMLReader setters directly.
The supported properties are:
http://xml.org/sax/properties/lexical-handler
setLexicalHandler
http://validator.nu/properties/content-space-policy
setContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy
setContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy
setCommentPolicy
http://validator.nu/properties/xmlns-policy
setXmlnsPolicy
http://validator.nu/properties/name-policy
setNamePolicy
http://validator.nu/properties/streamability-violation-policy
setStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler
setDocumentModeHandler
http://validator.nu/properties/doctype-expectation
setDoctypeExpectation
http://validator.nu/properties/xml-policy
setXmlPolicy
- Specified by:
setProperty
in interface XMLReader
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
-
isCheckingNormalization
public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.
- Returns:
true if NFC normalization of source is being checked.
-
setCheckingNormalization
public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.
- Parameters:
enable
- true to check normalization
-
setCommentPolicy
Sets the policy for consecutive hyphens in comments.
- Parameters:
commentPolicy
- the policy
-
setContentNonXmlCharPolicy
Sets the policy for non-XML characters except white space.
- Parameters:
contentNonXmlCharPolicy
- the policy
-
setContentSpacePolicy
Sets the policy for non-XML white space.
- Parameters:
contentSpacePolicy
- the policy
-
isScriptingEnabled
public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.
- Returns:
true if enabled
-
setScriptingEnabled
public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.
- Parameters:
scriptingEnabled
- true to enable
-
getDoctypeExpectation
Returns the doctype expectation.
- Returns:
- the doctypeExpectation
-
setDoctypeExpectation
Sets the doctype expectation.
- Parameters:
doctypeExpectation
- the doctypeExpectation to set
-
getDocumentModeHandler
Returns the document mode handler.
- Returns:
- the documentModeHandler
-
setDocumentModeHandler
Sets the document mode handler.
- Parameters:
documentModeHandler
- the documentModeHandler to set
-
getStreamabilityViolationPolicy
Returns the streamabilityViolationPolicy.
- Returns:
- the streamabilityViolationPolicy
-
setStreamabilityViolationPolicy
Sets the streamabilityViolationPolicy.
- Parameters:
streamabilityViolationPolicy
- the streamabilityViolationPolicy to set
-
setHtml4ModeCompatibleWithXhtml1Schemata
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Parameters:
html4ModeCompatibleWithXhtml1Schemata
-
getDocumentLocator
Returns the Locator during parse.
- Returns:
- the Locator
-
isHtml4ModeCompatibleWithXhtml1Schemata
public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Returns:
- the html4ModeCompatibleWithXhtml1Schemata
-
setMappingLangToXmlLang
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.
- Parameters:
mappingLangToXmlLang
-
isMappingLangToXmlLang
public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
- Returns:
- the mappingLangToXmlLang
-
setXmlnsPolicy
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)
- Parameters:
xmlnsPolicy
-
getXmlnsPolicy
Returns the xmlnsPolicy.
- Returns:
- the xmlnsPolicy
-
getLexicalHandler
Returns the lexicalHandler.
- Returns:
- the lexicalHandler
-
getCommentPolicy
Returns the commentPolicy.
- Returns:
- the commentPolicy
-
getContentNonXmlCharPolicy
Returns the contentNonXmlCharPolicy.
- Returns:
- the contentNonXmlCharPolicy
-
getContentSpacePolicy
Returns the contentSpacePolicy.
- Returns:
- the contentSpacePolicy
-
setReportingDoctype
public void setReportingDoctype(boolean reportingDoctype)
- Parameters:
reportingDoctype
-
isReportingDoctype
public boolean isReportingDoctype()
Returns the reportingDoctype.
- Returns:
- the reportingDoctype
-
setErrorProfile
- Parameters:
errorProfile
-
setNamePolicy
The policy for non-NCName element and attribute names.
- Parameters:
namePolicy
-
setHeuristics
Sets the encoding sniffing heuristics.
- Parameters:
heuristics
- the heuristics to set
-
getHeuristics
-
setXmlPolicy
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.
- Parameters:
xmlPolicy
-
getNamePolicy
The policy for non-NCName element and attribute names.
- Returns:
- the namePolicy
-
setBogusXmlnsPolicy
Deprecated. Does nothing.
-
getBogusXmlnsPolicy
Deprecated. Returns XmlViolationPolicy.ALTER_INFOSET.
- Returns:
XmlViolationPolicy.ALTER_INFOSET
-
addCharacterHandler
-