Class HtmlBuilder

java.lang.Object
nu.xom.Builder
nu.validator.htmlparser.xom.HtmlBuilder

public class HtmlBuilder extends nu.xom.Builder
This class implements an HTML5 parser that exposes data through the XOM interface.

By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. XML 1.0 infoset violations can instead be treated as fatal by setting the general XML violation policy to FATAL.
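The following is a minimal sketch of how the two policies named above might be compared. It assumes that XmlViolationPolicy lives in nu.validator.htmlparser.common and that the parser library and XOM are on the classpath; the class name ViolationPolicySketch, the sample markup, and the example.org base URI are illustrative only. The sample contains a comment with two consecutive hyphens, which XML 1.0 does not allow, so FATAL is expected to reject it while ALTER_INFOSET is expected to produce an adjusted tree.

```java
import java.io.IOException;

import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.xom.HtmlBuilder;
import nu.xom.Document;
import nu.xom.ParsingException;

public class ViolationPolicySketch {

    // Illustrative input: the comment contains "--", which XML 1.0 forbids.
    static final String HTML =
            "<!DOCTYPE html><title>t</title><!-- a -- b --><p>text</p>";

    /** Parses HTML with the given policy; returns null if the input is rejected. */
    static Document parse(XmlViolationPolicy policy) throws IOException {
        HtmlBuilder builder = new HtmlBuilder(policy);
        try {
            return builder.build(HTML, "http://example.org/");
        } catch (ParsingException e) {
            // Reached when the policy treats the violation as fatal.
            System.out.println(policy + " rejected the input: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        Document coerced = parse(XmlViolationPolicy.ALTER_INFOSET);
        if (coerced != null) {
            System.out.println(coerced.toXML());
        }
        parse(XmlViolationPolicy.FATAL);
    }
}
```

Passing the policy explicitly, rather than relying on the no-argument constructor, keeps the behavior visible at the call site.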

The doctype is not represented in the tree.

The document mode is represented via the Mode interface on the Document node if the node implements that interface (this depends on the node factory in use).

The form pointer is stored if the node factory supports storing it.

This package has its own node factory class because the official XOM node factory may return multiple nodes instead of one, which would break the assumptions of the DOM-oriented HTML5 parsing algorithm.
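As a rough usage sketch, parsing a string and walking the resulting XOM tree could look like the following. It assumes the parser library and XOM are on the classpath; the class name BasicParseSketch, the sample markup, and the base URI are made up for illustration. Elements are looked up in the XHTML namespace, which is where the HTML parsing algorithm places HTML elements.

```java
import java.io.IOException;

import nu.validator.htmlparser.xom.HtmlBuilder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.ParsingException;

public class BasicParseSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Parses an HTML string into a XOM Document using the default node factory. */
    static Document parse(String html) throws ParsingException, IOException {
        HtmlBuilder builder = new HtmlBuilder();
        // The second argument is the base URI (illustrative value).
        return builder.build(html, "http://example.org/doc.html");
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<!DOCTYPE html><title>Hello</title><p>Some <b>bold</b> text");

        Element root = doc.getRootElement();
        System.out.println(root.getLocalName());
        System.out.println(root.getNamespaceURI());

        // Per the description above, the doctype is not represented in the tree.
        System.out.println(doc.getDocType());

        // The parser supplies html, head and body even though the input omits them.
        Element body = root.getFirstChildElement("body", XHTML_NS);
        System.out.println(body.getValue());
    }
}
```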

Author:
hsivonen
  • Constructor Details

    • HtmlBuilder

      public HtmlBuilder()
      Constructor with default node factory and fatal XML violation policy.
    • HtmlBuilder

      public HtmlBuilder(SimpleNodeFactory nodeFactory)
      Constructor with given node factory and fatal XML violation policy.
      Parameters:
      nodeFactory - the factory
    • HtmlBuilder

      public HtmlBuilder(XmlViolationPolicy xmlPolicy)
      Constructor with default node factory and given XML violation policy.
      Parameters:
      xmlPolicy - the policy
    • HtmlBuilder

      public HtmlBuilder(SimpleNodeFactory nodeFactory, XmlViolationPolicy xmlPolicy)
      Constructor with given node factory and given XML violation policy.
      Parameters:
      nodeFactory - the factory
      xmlPolicy - the policy
  • Method Details

    • build

      public nu.xom.Document build(InputSource is) throws nu.xom.ParsingException, IOException
      Parse from SAX InputSource.
      Parameters:
      is - the InputSource
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
    • buildFragment

      public nu.xom.Nodes buildFragment(InputSource is, String context) throws IOException, nu.xom.ParsingException
      Parse a fragment from SAX InputSource.
      Parameters:
      is - the InputSource
      context - the name of the context element
      Returns:
      the fragment
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
    • build

      public nu.xom.Document build(File file) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
      Parse from File.
      Overrides:
      build in class nu.xom.Builder
      Parameters:
      file - the file
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
      nu.xom.ValidityException
      See Also:
      • Builder.build(java.io.File)
    • build

      public nu.xom.Document build(InputStream stream, String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
      Parse from InputStream.
      Overrides:
      build in class nu.xom.Builder
      Parameters:
      stream - the stream
      uri - the base URI
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
      nu.xom.ValidityException
      See Also:
      • Builder.build(java.io.InputStream, java.lang.String)
    • build

      public nu.xom.Document build(InputStream stream) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
      Parse from InputStream.
      Overrides:
      build in class nu.xom.Builder
      Parameters:
      stream - the stream
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
      nu.xom.ValidityException
      See Also:
      • Builder.build(java.io.InputStream)
    • build

      public nu.xom.Document build(Reader stream, String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
      Parse from Reader.
      Overrides:
      build in class nu.xom.Builder
      Parameters:
      stream - the reader
      uri - the base URI
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
      nu.xom.ValidityException
      See Also:
      • Builder.build(java.io.Reader, java.lang.String)
    • build

      public nu.xom.Document build(Reader stream) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
      Parse from Reader.
      Overrides:
      build in class nu.xom.Builder
      Parameters:
      stream - the reader
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
      nu.xom.ValidityException
      See Also:
      • Builder.build(java.io.Reader)
    • build

      public nu.xom.Document build(String content, String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
      Parse from String.
      Overrides:
      build in class nu.xom.Builder
      Parameters:
      content - the HTML source as string
      uri - the base URI
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
      nu.xom.ValidityException
      See Also:
      • Builder.build(java.lang.String, java.lang.String)
    • build

      public nu.xom.Document build(String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
      Parse from URI.
      Overrides:
      build in class nu.xom.Builder
      Parameters:
      uri - the URI of the document
      Returns:
      the document
      Throws:
      nu.xom.ParsingException - in case of an XML violation
      IOException - if IO goes wrong
      nu.xom.ValidityException
      See Also:
      • Builder.build(java.lang.String)
    • getSimpleNodeFactory

      public SimpleNodeFactory getSimpleNodeFactory()
      Gets the node factory.
    • setEntityResolver

      public void setEntityResolver(EntityResolver resolver)
    • setErrorHandler

      public void setErrorHandler(ErrorHandler handler)
    • setTransitionHander

      public void setTransitionHander(TransitionHandler handler)
    • isCheckingNormalization

      public boolean isCheckingNormalization()
      Indicates whether NFC normalization of source is being checked.
      Returns:
      true if NFC normalization of source is being checked.
      See Also:
      • nu.validator.htmlparser.impl.Tokenizer.isCheckingNormalization()
    • setCheckingNormalization

      public void setCheckingNormalization(boolean enable)
      Toggles the checking of the NFC normalization of source.
      Parameters:
      enable - true to check normalization
      See Also:
      • nu.validator.htmlparser.impl.Tokenizer.setCheckingNormalization(boolean)
    • setCommentPolicy

      public void setCommentPolicy(XmlViolationPolicy commentPolicy)
      Sets the policy for consecutive hyphens in comments.
      Parameters:
      commentPolicy - the policy
    • setContentNonXmlCharPolicy

      public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
      Sets the policy for non-XML characters except white space.
      Parameters:
      contentNonXmlCharPolicy - the policy
    • setContentSpacePolicy

      public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
      Sets the policy for non-XML white space.
      Parameters:
      contentSpacePolicy - the policy
    • isScriptingEnabled

      public boolean isScriptingEnabled()
      Whether the parser considers scripting to be enabled for noscript treatment.
      Returns:
      true if enabled
    • setScriptingEnabled

      public void setScriptingEnabled(boolean scriptingEnabled)
      Sets whether the parser considers scripting to be enabled for noscript treatment.
      Parameters:
      scriptingEnabled - true to enable
    • getDoctypeExpectation

      public DoctypeExpectation getDoctypeExpectation()
      Returns the doctype expectation.
      Returns:
      the doctypeExpectation
    • setDoctypeExpectation

      public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
      Sets the doctype expectation.
      Parameters:
      doctypeExpectation - the doctypeExpectation to set
    • getDocumentModeHandler

      public DocumentModeHandler getDocumentModeHandler()
      Returns the document mode handler.
      Returns:
      the documentModeHandler
    • setDocumentModeHandler

      public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
      Sets the document mode handler.
      Parameters:
      documentModeHandler - the documentModeHandler to set
    • getStreamabilityViolationPolicy

      public XmlViolationPolicy getStreamabilityViolationPolicy()
      Returns the streamabilityViolationPolicy.
      Returns:
      the streamabilityViolationPolicy
    • setStreamabilityViolationPolicy

      public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
      Sets the streamabilityViolationPolicy.
      Parameters:
      streamabilityViolationPolicy - the streamabilityViolationPolicy to set
    • setHtml4ModeCompatibleWithXhtml1Schemata

      public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
      Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
      Parameters:
      html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value
    • getDocumentLocator

      public Locator getDocumentLocator()
      Returns the Locator during parse.
      Returns:
      the Locator
    • isHtml4ModeCompatibleWithXhtml1Schemata

      public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
      Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
      Returns:
      the html4ModeCompatibleWithXhtml1Schemata
    • setMappingLangToXmlLang

      public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
      Sets whether lang is mapped to xml:lang.
      Parameters:
      mappingLangToXmlLang - true to map lang to xml:lang
    • isMappingLangToXmlLang

      public boolean isMappingLangToXmlLang()
      Whether lang is mapped to xml:lang.
      Returns:
      the mappingLangToXmlLang
    • setXmlnsPolicy

      public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
      Sets whether the xmlns attribute on the root element is passed through. (FATAL is not allowed.)
      Parameters:
      xmlnsPolicy - the policy
    • getXmlnsPolicy

      public XmlViolationPolicy getXmlnsPolicy()
      Returns the xmlnsPolicy.
      Returns:
      the xmlnsPolicy
    • getCommentPolicy

      public XmlViolationPolicy getCommentPolicy()
      Returns the commentPolicy.
      Returns:
      the commentPolicy
    • getContentNonXmlCharPolicy

      public XmlViolationPolicy getContentNonXmlCharPolicy()
      Returns the contentNonXmlCharPolicy.
      Returns:
      the contentNonXmlCharPolicy
    • getContentSpacePolicy

      public XmlViolationPolicy getContentSpacePolicy()
      Returns the contentSpacePolicy.
      Returns:
      the contentSpacePolicy
    • setReportingDoctype

      public void setReportingDoctype(boolean reportingDoctype)
      Parameters:
      reportingDoctype - true to report the doctype
    • isReportingDoctype

      public boolean isReportingDoctype()
      Returns the reportingDoctype.
      Returns:
      the reportingDoctype
    • setNamePolicy

      public void setNamePolicy(XmlViolationPolicy namePolicy)
      Sets the policy for non-NCName element and attribute names.
      Parameters:
      namePolicy - the policy
    • setHeuristics

      public void setHeuristics(Heuristics heuristics)
      Sets the encoding sniffing heuristics.
      Parameters:
      heuristics - the heuristics to set
      See Also:
      • nu.validator.htmlparser.impl.Tokenizer.setHeuristics(nu.validator.htmlparser.common.Heuristics)
    • getHeuristics

      public Heuristics getHeuristics()
    • setXmlPolicy

      public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
      A catch-all convenience method that sets the name, xmlns, content space, content non-XML char and comment policies in one call. It does not affect the streamability policy or doctype reporting.
      Parameters:
      xmlPolicy - the policy
    • getNamePolicy

      public XmlViolationPolicy getNamePolicy()
      The policy for non-NCName element and attribute names.
      Returns:
      the namePolicy
    • setBogusXmlnsPolicy

      public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
      Deprecated.
      Does nothing.
    • getBogusXmlnsPolicy

      public XmlViolationPolicy getBogusXmlnsPolicy()
      Deprecated.
      Returns XmlViolationPolicy.ALTER_INFOSET.
      Returns:
      XmlViolationPolicy.ALTER_INFOSET
    • addCharacterHandler

      public void addCharacterHandler(CharacterHandler characterHandler)
    • setIgnoringComments

      public void setIgnoringComments(boolean ignoreComments)
      Sets whether comment nodes appear in the tree.
      Parameters:
      ignoreComments - true to ignore comments
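
To illustrate how buildFragment and setIgnoringComments described above might be combined, here is a hedged sketch. It assumes the parser library and XOM are on the classpath; the class name FragmentSketch, the helper parseFragment, and the sample markup are hypothetical. The context argument names the element the fragment is parsed inside of, here a ul.

```java
import java.io.IOException;
import java.io.StringReader;

import org.xml.sax.InputSource;

import nu.validator.htmlparser.xom.HtmlBuilder;
import nu.xom.Nodes;
import nu.xom.ParsingException;

public class FragmentSketch {

    /** Parses an HTML fragment as if it were the content of the given context element. */
    static Nodes parseFragment(String html, String context, boolean ignoreComments)
            throws ParsingException, IOException {
        HtmlBuilder builder = new HtmlBuilder();
        // When true, comment nodes are not expected to appear in the result.
        builder.setIgnoringComments(ignoreComments);
        InputSource source = new InputSource(new StringReader(html));
        return builder.buildFragment(source, context);
    }

    public static void main(String[] args) throws Exception {
        Nodes nodes = parseFragment("<li>one</li><!-- note --><li>two</li>", "ul", true);
        // Print each top-level node of the fragment as XML.
        for (int i = 0; i < nodes.size(); i++) {
            System.out.println(nodes.get(i).toXML());
        }
    }
}
```

Supplying a character stream through the InputSource sidesteps encoding sniffing, which keeps the sketch independent of the heuristics setting.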