Class HTMLParser

  • Direct Known Subclasses:
    JsoupBasedHtmlParser

    public abstract class HTMLParser
    extends Object
    HTMLParser implementations parse HTML content to obtain URLs.
    • Constructor Detail

      • HTMLParser

        protected HTMLParser()
        Protected constructor to prevent instantiation except from within subclasses.
    • Method Detail

      • getParser

        public static final HTMLParser getParser()
      • getParser

        public static final HTMLParser getParser(String htmlParserClassName)
      • getEmbeddedResourceURLs

        public Iterator<URL> getEmbeddedResourceURLs(String userAgent,
                                                     byte[] html,
                                                     URL baseUrl,
                                                     String encoding)
                                              throws HTMLParseException
        Gets the URLs of all resources that a browser would download automatically after fetching the HTML content, such as images, stylesheets, JavaScript files, and applets.

        URLs should not appear twice in the returned iterator.

        Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

        Parameters:
        userAgent - User Agent
        html - HTML code
        baseUrl - Base URL from which the HTML code was obtained
        encoding - Charset
        Returns:
        an Iterator for the resource URLs
        Throws:
        HTMLParseException - when parsing the HTML fails
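
        The contract above (resolve against the base URL, return each URL only once) can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class EmbeddedUrlSketch, the method resolveUnique, and the decision to skip malformed references silently are all assumptions made for illustration, and the raw references are passed in directly rather than extracted from HTML.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the documented API.
final class EmbeddedUrlSketch {

    // Resolves each raw reference against the base and drops duplicates,
    // keeping the order in which references were first seen.
    static Iterator<URL> resolveUnique(URI base, List<String> rawRefs) {
        Set<String> seen = new HashSet<>();
        List<URL> result = new ArrayList<>();
        for (String ref : rawRefs) {
            try {
                URL resolved = base.resolve(ref).toURL();
                // Compare the string form: URL.equals may trigger name resolution.
                if (seen.add(resolved.toExternalForm())) {
                    result.add(resolved);
                }
            } catch (IllegalArgumentException | MalformedURLException e) {
                // Assumption of this sketch: malformed references are skipped.
            }
        }
        return result.iterator();
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/dir/page.html");
        Iterator<URL> it = resolveUnique(base, List.of("img/a.png", "/style.css", "img/a.png"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

        Deduplicating on the external form rather than on URL objects avoids the host lookups that URL.equals and URL.hashCode can perform.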
      • getEmbeddedResourceURLs

        public abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent,
                                                              byte[] html,
                                                              URL baseUrl,
                                                              URLCollection coll,
                                                              String encoding)
                                                       throws HTMLParseException
        Gets the URLs of all resources that a browser would download automatically after fetching the HTML content, such as images, stylesheets, JavaScript files, and applets.

        All URLs should be added to the Collection.

        Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

        N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.

        Parameters:
        userAgent - User Agent
        html - HTML code
        baseUrl - Base URL from which the HTML code was obtained
        coll - URLCollection
        encoding - Charset
        Returns:
        an Iterator for the resource URLs
        Throws:
        HTMLParseException - when parsing the HTML fails
      • getEmbeddedResourceURLs

        public Iterator<URL> getEmbeddedResourceURLs(String userAgent,
                                                     byte[] html,
                                                     URL baseUrl,
                                                     Collection<URLString> coll,
                                                     String encoding)
                                              throws HTMLParseException
        Gets the URLs of all resources that a browser would download automatically after fetching the HTML content, such as images, stylesheets, JavaScript files, and applets.

        N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.

        Parameters:
        userAgent - User Agent
        html - HTML code
        baseUrl - Base URL from which the HTML code was obtained
        coll - Collection - will contain URLString objects, not URLs
        encoding - Charset
        Returns:
        an Iterator for the resource URLs
        Throws:
        HTMLParseException - when parsing the HTML fails
      • isReusable

        protected boolean isReusable()
        Parsers should override this method if the parser class is reusable, in which case the class will be cached for the next getParser() call.
        Returns:
        true if the Parser is reusable
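
        As a rough illustration of how a reusability flag can drive caching, here is a minimal sketch. It does not reproduce the real getParser() logic: the types ParserCacheSketch, Parser, and SimpleParser, the Supplier-based factory, and the choice to key the cache by a name string are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-ins; none of these names come from the documented API.
final class ParserCacheSketch {

    interface Parser {
        boolean isReusable();
    }

    static final class SimpleParser implements Parser {
        private final boolean reusable;

        SimpleParser(boolean reusable) {
            this.reusable = reusable;
        }

        @Override
        public boolean isReusable() {
            return reusable;
        }
    }

    private static final Map<String, Parser> CACHE = new ConcurrentHashMap<>();

    // Returns a cached parser when one exists; otherwise creates one and
    // caches it only if it declares itself reusable.
    static Parser getParser(String name, Supplier<Parser> factory) {
        Parser cached = CACHE.get(name);
        if (cached != null) {
            return cached;
        }
        Parser fresh = factory.get();
        if (fresh.isReusable()) {
            Parser previous = CACHE.putIfAbsent(name, fresh);
            if (previous != null) {
                return previous; // another thread cached one first
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        Parser first = getParser("reusable", () -> new SimpleParser(true));
        Parser second = getParser("reusable", () -> new SimpleParser(true));
        System.out.println(first == second);
    }
}
```

        In this sketch a parser that returns false from isReusable() is created again on every call, which is the behaviour a parser holding per-document state would need.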
      • isEnableConditionalComments

        protected final boolean isEnableConditionalComments(Float ieVersion)
        Parameters:
        ieVersion - Float IE version
        Returns:
        true if the IE version is lower than 10
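
        The documented return value amounts to a simple version comparison, sketched below. The class name ConditionalCommentsSketch is hypothetical, and returning false for a null version (a non-IE user agent) is an assumption; the documentation above does not state how null is handled.

```java
// Hypothetical stand-in mirroring the documented check.
final class ConditionalCommentsSketch {

    // True only for a known IE version below 10.
    // Assumption: a null version (not IE) yields false.
    static boolean isEnableConditionalComments(Float ieVersion) {
        return ieVersion != null && ieVersion < 10.0f;
    }

    public static void main(String[] args) {
        System.out.println(isEnableConditionalComments(9.0f));
        System.out.println(isEnableConditionalComments(10.0f));
        System.out.println(isEnableConditionalComments(null));
    }
}
```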
      • extractIEVersion

        protected Float extractIEVersion(String userAgent)
        Parameters:
        userAgent - User Agent
        Returns:
        the version that follows "MSIE" in the User Agent, or null if the User Agent is not IE
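
        One way to read a version number that follows "MSIE" in a user agent string is a regular expression, as in the sketch below. The class IeVersionSketch and the exact pattern are assumptions made for illustration; the actual method may parse the string differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in; the pattern is an assumption of this sketch.
final class IeVersionSketch {

    // Matches "MSIE " followed by a version such as 9.0 or 10.
    private static final Pattern MSIE = Pattern.compile("MSIE (\\d+(?:\\.\\d+)?)");

    // Returns the version after "MSIE", or null when the token is absent.
    static Float extractIEVersion(String userAgent) {
        if (userAgent == null) {
            return null;
        }
        Matcher matcher = MSIE.matcher(userAgent);
        return matcher.find() ? Float.valueOf(matcher.group(1)) : null;
    }

    public static void main(String[] args) {
        String ie9 = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)";
        System.out.println(extractIEVersion(ie9));
        System.out.println(extractIEVersion("Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"));
    }
}
```

        A user agent that does not contain the "MSIE" token yields null in this sketch.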