Class JsoupBasedHtmlParser


  • public class JsoupBasedHtmlParser
    extends HTMLParser
    Parser based on JSOUP
    Since:
    2.10
    TODO: Factor out common code between LagartoBasedHtmlParser and this one (adapter pattern)
    • Constructor Detail

      • JsoupBasedHtmlParser

        public JsoupBasedHtmlParser()
    • Method Detail

      • getEmbeddedResourceURLs

        public Iterator<URL> getEmbeddedResourceURLs(String userAgent,
                                                     byte[] html,
                                                     URL baseUrl,
                                                     URLCollection coll,
                                                     String encoding)
                                              throws HTMLParseException
        Description copied from class: HTMLParser
        Gets the URLs of all resources that a browser would automatically download after fetching the HTML content: images, stylesheets, JavaScript files, applets, and so on.

        All URLs should be added to the Collection.

        Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

        N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.

        Specified by:
        getEmbeddedResourceURLs in class HTMLParser
        Parameters:
        userAgent - User Agent
        html - HTML code
        baseUrl - Base URL from which the HTML code was obtained
        coll - URLCollection
        encoding - Charset
        Returns:
        an Iterator for the resource URLs
        Throws:
        HTMLParseException - when parsing the html fails
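        The contract above (resolve each embedded reference against the base URL, collect the results, and report malformed entries by their raw string) can be illustrated with a minimal sketch. It is not the JMeter implementation: it uses only the JDK, skips the HTML parsing step entirely, and assumes the raw src/href values have already been extracted. The names EmbeddedResourceSketch, Result, parseBase and resolveAll are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of JMeter or jsoup.
final class EmbeddedResourceSketch {

    /** Holds the resolved URLs and the raw strings that could not be resolved. */
    static final class Result {
        final List<URL> resolved = new ArrayList<>();
        final List<String> malformed = new ArrayList<>();
    }

    /** Convenience wrapper that turns the checked exception into an unchecked one. */
    static URL parseBase(String base) {
        try {
            return new URL(base);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad base URL: " + base, e);
        }
    }

    /**
     * Resolves each raw reference against baseUrl.
     * Duplicates are dropped by comparing the external string form, which
     * avoids the host lookups that URL.equals() may perform.
     */
    static Result resolveAll(URL baseUrl, List<String> rawRefs) {
        Result result = new Result();
        Set<String> seen = new LinkedHashSet<>();
        for (String raw : rawRefs) {
            try {
                URL url = new URL(baseUrl, raw);
                if (seen.add(url.toExternalForm())) {
                    result.resolved.add(url);
                }
            } catch (MalformedURLException e) {
                // Report the offending string instead of failing the whole parse.
                result.malformed.add(raw);
            }
        }
        return result;
    }
}
```

        In this sketch a reference with an unknown protocol ends up in the malformed list, while relative and absolute paths are resolved against the page they came from. A real parser would additionally throw HTMLParseException when the document itself cannot be parsed.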
      • isReusable

        protected boolean isReusable()
        Description copied from class: HTMLParser
        Parsers should override this method if the parser class is reusable, in which case the class will be cached for the next getParser() call.
        Overrides:
        isReusable in class HTMLParser
        Returns:
        true if the Parser is reusable
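        How a reusability flag can drive caching is sketched below. This is an illustration under stated assumptions, not JMeter's actual getParser() code: it assumes that "cached" means the same instance is handed out again for the same parser name, and every name in it (ParserCacheSketch, SketchParser, ReusableParser, OneShotParser) is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical illustration of caching driven by isReusable(); not JMeter code.
final class ParserCacheSketch {

    /** Stand-in for a parser base class; parsers are not reusable by default. */
    abstract static class SketchParser {
        protected boolean isReusable() {
            return false;
        }
    }

    /** A parser that keeps no per-call state and so opts in to caching. */
    static final class ReusableParser extends SketchParser {
        @Override
        protected boolean isReusable() {
            return true;
        }
    }

    /** A parser that keeps the default and is created fresh each time. */
    static final class OneShotParser extends SketchParser {
    }

    private static final Map<String, SketchParser> CACHE = new ConcurrentHashMap<>();

    /** Returns a cached instance when one exists, otherwise creates one via the factory. */
    static SketchParser getParser(String name, Supplier<SketchParser> factory) {
        SketchParser cached = CACHE.get(name);
        if (cached != null) {
            return cached;
        }
        SketchParser parser = factory.get();
        if (parser.isReusable()) {
            // Keep whichever instance reached the cache first if two threads race.
            SketchParser previous = CACHE.putIfAbsent(name, parser);
            if (previous != null) {
                return previous;
            }
        }
        return parser;
    }
}
```

        With this arrangement, two lookups for a reusable parser yield the same object, while a non-reusable parser is constructed anew on every call.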