Package | Description |
---|---|
org.htmlparser | The basic API classes which will be used by most developers when working with the HTML Parser. |
org.htmlparser.beans | The beans package contains Java Beans using the HTML Parser. |
org.htmlparser.http | The http package is responsible for HTTP connections to servers. |
org.htmlparser.lexer | The lexer package is the base level I/O subsystem. |
org.htmlparser.lexerapplications.thumbelina | Extract the images behind thumbnail images. |
org.htmlparser.nodes | The nodes package has the concrete node implementations. |
org.htmlparser.parserapplications | Example applications. |
org.htmlparser.sax | The sax package implements a SAX (Simple API for XML) parser for HTML. |
org.htmlparser.scanners | The scanners package contains classes responsible for the tertiary identification of tags. |
org.htmlparser.tags | The tags package contains specific tags. |
org.htmlparser.util | Code that can be reused by many classes is located in this package. |

Modifier and Type | Method | Description |
---|---|---|
Remark | NodeFactory.createRemarkNode(Page page, int start, int end) | Create a new remark node. |
Text | NodeFactory.createStringNode(Page page, int start, int end) | Create a new text node. |
Tag | NodeFactory.createTagNode(Page page, int start, int end, java.util.Vector attributes) | Create a new tag node. |
void | Node.doSemanticAction() | Perform the meaning of this tag. |
NodeIterator | Parser.elements() | Returns an iterator (enumeration) over the HTML nodes. |
NodeList | Parser.extractAllNodesThatMatch(NodeFilter filter) | Extract all nodes matching the given filter. |
NodeList | Parser.parse(NodeFilter filter) | Parse the given resource, using the filter provided. |
void | Parser.postConnect(java.net.HttpURLConnection connection) | Called just after calling connect. |
void | Parser.preConnect(java.net.HttpURLConnection connection) | Called just prior to calling connect. |
void | Parser.setConnection(java.net.URLConnection connection) | Set the connection for this parser. |
void | Parser.setEncoding(java.lang.String encoding) | Set the encoding for the page this parser is reading from. |
void | Parser.setInputHTML(java.lang.String inputHTML) | Initializes the parser with the given input HTML String. |
void | Parser.setResource(java.lang.String resource) | Set the resource to parse: HTML text, a URL, or a file. |
void | Parser.setURL(java.lang.String url) | Set the URL for this parser. |
void | Parser.visitAllNodesWith(NodeVisitor visitor) | Apply the given visitor to the current page. |

Constructor | Description |
---|---|
Parser(java.lang.String resource) | Creates a Parser object with the location of the resource (URL or file). |
Parser(java.lang.String resource, ParserFeedback feedback) | Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in. |
Parser(java.net.URLConnection connection) | Construct a parser using the provided URLConnection. |
Parser(java.net.URLConnection connection, ParserFeedback fb) | Constructor for custom HTTP access. |

Modifier and Type | Method | Description |
---|---|---|
protected NodeList | FilterBean.applyFilters() | Apply each of the filters. |
protected java.net.URL[] | LinkBean.extractLinks() | Internal routine to extract all the links from the parser. |
protected java.lang.String | StringBean.extractStrings() | Extract the text from a page. |

Modifier and Type | Method | Description |
---|---|---|
java.net.URLConnection | ConnectionManager.openConnection(java.lang.String string) | Opens a connection based on a given string. |
java.net.URLConnection | ConnectionManager.openConnection(java.net.URL url) | Opens a connection using the given url. |
void | ConnectionMonitor.postConnect(java.net.HttpURLConnection connection) | Called just after calling connect. |
void | ConnectionMonitor.preConnect(java.net.HttpURLConnection connection) | Called just prior to calling connect. |

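
The two ConnectionMonitor callbacks listed above bracket the call to connect. The following is a minimal sketch of that hook pattern using only the standard library, not the HTML Parser classes. `MonitorSketch`, `Hooks`, `open`, `demo`, the `sketch/1.0` header value, and the `example.com` address are all hypothetical stand-ins, and the real interface may declare exceptions that this sketch omits.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class MonitorSketch {
    /** Hypothetical stand-in mirroring the two callbacks in the table. */
    interface Hooks {
        void preConnect(HttpURLConnection connection);
        void postConnect(HttpURLConnection connection);
    }

    /**
     * Creates a connection object and invokes the hooks around the point
     * where connect() would be called. The network call itself is left out
     * so that the sketch runs offline.
     */
    static HttpURLConnection open(String address, Hooks hooks) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
            hooks.preConnect(connection);   // last chance to set request headers
            // connection.connect();        // omitted: would need network access
            hooks.postConnect(connection);  // a monitor could inspect the response here
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the sketch with a hook that sets a header and logs both calls. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        HttpURLConnection connection = open("http://example.com/", new Hooks() {
            @Override public void preConnect(HttpURLConnection c) {
                c.setRequestProperty("User-Agent", "sketch/1.0");
                log.add("pre");
            }
            @Override public void postConnect(HttpURLConnection c) {
                log.add("post");
            }
        });
        // The header set in preConnect is visible on the returned connection.
        log.add(connection.getRequestProperty("User-Agent"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Setting request headers belongs in the pre-connect hook because `HttpURLConnection` rejects changes to request properties once the connection has been opened.
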
Modifier and Type | Method | Description |
---|---|---|
char | Page.getCharacter(Cursor cursor) | Read the character at the given cursor position. |
static void | Lexer.main(java.lang.String[] args) | Mainline for command line operation. |
protected Node | Lexer.makeRemark(int start, int end) | Create a remark node based on the current cursor and the one provided. |
protected Node | Lexer.makeString(int start, int end) | Create a string node based on the current cursor and the one provided. |
protected Node | Lexer.makeTag(int start, int end, java.util.Vector attributes) | Create a tag node based on the current cursor and the one provided. |
Node | Lexer.nextNode() | Get the next node from the source. |
Node | Lexer.nextNode(boolean quotesmart) | Get the next node from the source. |
Node | Lexer.parseCDATA() | Return CDATA as a text node. |
Node | Lexer.parseCDATA(boolean quotesmart) | Return CDATA as a text node. |
protected Node | Lexer.parseJsp(int start) | Parse a Java Server Page node. |
protected Node | Lexer.parsePI(int start) | Parse an XML processing instruction. |
protected Node | Lexer.parseRemark(int start, boolean quotesmart) | Parse a comment. |
protected Node | Lexer.parseString(int start, boolean quotesmart) | Parse a string node. |
protected Node | Lexer.parseTag(int start) | Parse a tag. |
protected void | Lexer.scanJIS(Cursor cursor) | Advance the cursor through a JIS escape sequence. |
void | Page.setConnection(java.net.URLConnection connection) | Set the URLConnection to be used by this page. |
void | InputStreamSource.setEncoding(java.lang.String character_set) | Begins reading from the source with the given character set. |
void | Page.setEncoding(java.lang.String character_set) | Begins reading from the source with the given character set. |
abstract void | Source.setEncoding(java.lang.String character_set) | Set the encoding to the given character set. |
void | StringSource.setEncoding(java.lang.String character_set) | Set the encoding to the given character set. |
void | Page.ungetCharacter(Cursor cursor) | Return a character. |

Constructor | Description |
---|---|
Lexer(java.net.URLConnection connection) | Creates a new instance of a Lexer. |
Page(java.net.URLConnection connection) | Construct a page reading from a URL connection. |

Modifier and Type | Method | Description |
---|---|---|
protected java.net.URL[][] | Thumbelina.extractImageLinks(Lexer lexer, java.net.URL docbase) | Get the links of an element of a document. |

Modifier and Type | Method | Description |
---|---|---|
void | AbstractNode.doSemanticAction() | Perform the meaning of this tag. |

Modifier and Type | Method | Description |
---|---|---|
java.lang.String | StringExtractor.extractStrings(boolean links) | Extract the text from a page. |
protected boolean | SiteCapturer.isHtml(java.lang.String link) | Returns true if the link contains text/html content. |
protected void | SiteCapturer.process(NodeFilter filter) | Process a single page. |

Modifier and Type | Method | Description |
---|---|---|
void | Feedback.error(java.lang.String message, ParserException e) | Error message. |

Modifier and Type | Method | Description |
---|---|---|
protected void | XMLReader.doSAX(Node node) | Process nodes recursively on the DocumentHandler. |

Modifier and Type | Method | Description |
---|---|---|
protected Tag | CompositeTagScanner.createVirtualEndTag(Tag tag, Lexer lexer, Page page, int position) | Creates an end tag with the same name as the given tag. |
static java.lang.String | ScriptDecoder.Decode(Page page, Cursor cursor) | Decode script encoded by the Microsoft obfuscator. |
protected void | CompositeTagScanner.finishTag(Tag tag, Lexer lexer) | Finish off a tag. |
Tag | CompositeTagScanner.scan(Tag tag, Lexer lexer, NodeList stack) | Collect the children. |
Tag | Scanner.scan(Tag tag, Lexer lexer, NodeList stack) | Scan the tag. |
Tag | ScriptScanner.scan(Tag tag, Lexer lexer, NodeList stack) | Scan for script. |
Tag | StyleScanner.scan(Tag tag, Lexer lexer, NodeList stack) | Scan for style definitions. |
Tag | TagScanner.scan(Tag tag, Lexer lexer, NodeList stack) | Scan the tag. |

Modifier and Type | Method | Description |
---|---|---|
void | BaseHrefTag.doSemanticAction() | Perform the meaning of this tag. |
void | MetaTag.doSemanticAction() | Perform the META tag semantic action. |

Modifier and Type | Class | Description |
---|---|---|
class | EncodingChangeException | The encoding has changed, invalidating already scanned characters. |

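
EncodingChangeException is described above as signalling that characters already scanned are no longer valid once the encoding changes. The sketch below uses only the standard library to illustrate why: the same bytes decode to different text under two character sets, so text read under the first assumption has to be read again. `EncodingSketch` and its `decode` helper are hypothetical names, not part of HTML Parser, and how the library itself recovers is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {
    /** Decodes raw page bytes with the given character set. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an accented final e, encoded as UTF-8.
        // The accented letter occupies two bytes in this encoding.
        byte[] page = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // A first pass that assumes ISO-8859-1 turns those two bytes
        // into two separate characters.
        String firstPass = decode(page, StandardCharsets.ISO_8859_1);

        // Once the real encoding is known, the bytes must be decoded again.
        String secondPass = decode(page, StandardCharsets.UTF_8);

        System.out.println(firstPass.length());   // prints 5
        System.out.println(secondPass.length());  // prints 4
    }
}
```
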
Modifier and Type | Method | Description |
---|---|---|
void | DefaultParserFeedback.error(java.lang.String message, ParserException exception) | Print an error message. |
static void | FeedbackManager.error(java.lang.String message, ParserException e) | |
void | ParserFeedback.error(java.lang.String message, ParserException e) | |

Modifier and Type | Method | Description |
---|---|---|
static Parser | ParserUtils.createParserParsingAnInputString(java.lang.String input) | Create a Parser object that takes a String as input (instead of a URL or a string representing the URL location). |
boolean | IteratorImpl.hasMoreNodes() | Check if more nodes are available. |
boolean | NodeIterator.hasMoreNodes() | Check if more nodes are available. |
Node | IteratorImpl.nextNode() | Get the next node. |
Node | NodeIterator.nextNode() | Get the next node. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.Class nodeType) | Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.Class nodeType, boolean recursive, boolean insideTag) | Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.String[] tags) | Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.String[] tags, boolean recursive, boolean insideTag) | Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, NodeFilter filter) | Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, NodeFilter filter, boolean recursive, boolean insideTag) | Split the input string into a string array, using the tags as delimiters. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.Class nodeType) | Trim all tags from the input string and return the input without the tags and their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.Class nodeType, boolean recursive, boolean insideTag) | Trim all tags from the input string and return the input without the tags and, optionally, their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.String[] tags) | Trim all tags from the input string and return the input without the tags and their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.String[] tags, boolean recursive, boolean insideTag) | Trim all tags from the input string and return the input without the tags and, optionally, their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, NodeFilter filter) | Trim all tags from the input string and return the input without the tags and their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, NodeFilter filter, boolean recursive, boolean insideTag) | Trim all tags from the input string and return the input without the tags and, optionally, their content. |
void | NodeList.visitAllNodesWith(NodeVisitor visitor) | Utility to apply a visitor to a node list. |

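
The `hasMoreNodes()` and `nextNode()` pair listed above suggests the usual check-then-fetch loop. A minimal sketch of that loop follows. It does not use the library: `IterationSketch`, `NodeSource`, `over`, and `collect` are hypothetical stand-ins, strings take the place of `Node` objects, and any exceptions the real methods may declare are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class IterationSketch {
    /** Hypothetical stand-in for the two iterator methods in the table. */
    interface NodeSource {
        boolean hasMoreNodes();
        String nextNode();   // the library returns Node; String keeps the sketch small
    }

    /** Array-backed source, loosely analogous to an iterator implementation. */
    static NodeSource over(String... nodes) {
        return new NodeSource() {
            private int index = 0;

            @Override public boolean hasMoreNodes() {
                return index < nodes.length;
            }

            @Override public String nextNode() {
                return nodes[index++];
            }
        };
    }

    /** Drains the source with the check-then-fetch loop. */
    static List<String> collect(NodeSource source) {
        List<String> result = new ArrayList<>();
        while (source.hasMoreNodes()) {
            result.add(source.nextNode());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(collect(over("<p>", "Hello", "</p>")));
    }
}
```

Calling `hasMoreNodes()` before every `nextNode()` keeps the loop safe on an empty source, where the body never runs.
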
HTML Parser is an open source library released under LGPL.