Package org.apache.pdfbox.util
Class PDFText2HTML
- java.lang.Object
  - org.apache.pdfbox.util.PDFStreamEngine
    - org.apache.pdfbox.util.PDFTextStripper
      - org.apache.pdfbox.util.PDFText2HTML
-
public class PDFText2HTML extends PDFTextStripper
Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.
- Author:
jjb - http://www.johnjbarton.com
-
Field Summary
-
Fields inherited from class org.apache.pdfbox.util.PDFTextStripper
charactersByArticle, document, output, outputEncoding, systemLineSeparator
-
-
Constructor Summary
Constructor, Description
PDFText2HTML(java.lang.String encoding)
- Constructor.
-
Method Summary
Modifier and Type, Method, Description
protected void endArticle()
- Write out the article separator.
void endDocument(PDDocument pdf)
- This method is available for subclasses of this class.
protected java.lang.String getTitle()
- This method will attempt to guess the title of the document using either the document properties or the first lines of text.
protected void startArticle(boolean isltr)
- Write out the article separator (div tag) with proper text direction information.
protected void writeHeader()
- Write the header to the output document.
protected void writePage()
- This will print the text of the processed page to "output".
protected void writeParagraphEnd()
- Writes the paragraph end "</p>" to the output.
protected void writeString(java.lang.String chars)
- Write a string to the output stream and escape some HTML characters.
protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions)
- Write a string to the output stream, maintain font state, and escape some HTML characters.
-
Methods inherited from class org.apache.pdfbox.util.PDFTextStripper
endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageSeparator, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getText, getWordSeparator, handleLineSeparation, inspectFontEncoding, isParagraphSeparation, matchListItemPattern, matchPattern, processPage, processPages, processTextPosition, resetEngine, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageSeparator, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageSeperator, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeText, writeWordSeparator
-
Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
-
Method Detail
-
writeHeader
protected void writeHeader() throws java.io.IOException
Write the header to the output document. The header also includes the tag that declares the character encoding.
- Throws:
java.io.IOException
- If there is a problem writing out the header to the document.
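As a rough illustration of a header that declares the character encoding, here is a minimal standalone sketch. The class `HtmlHeaderSketch`, its `header` method, and the exact markup are assumptions made for this example; they are not taken from PDFBox, whose actual output may differ.

```java
import java.nio.charset.Charset;

final class HtmlHeaderSketch {
    // Hypothetical helper, not part of PDFBox: builds a minimal HTML header
    // whose meta tag declares the character encoding.
    // The title is assumed to be HTML-escaped by the caller.
    static String header(String title, String encoding) {
        // Resolve the canonical charset name; throws for unknown encodings.
        String charset = Charset.forName(encoding).name();
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head>\n");
        sb.append("<title>").append(title).append("</title>\n");
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
          .append(charset).append("\">\n");
        sb.append("</head>\n<body>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(header("Example", "UTF-8"));
    }
}
```

Declaring the charset in the header lets a browser decode the body with the same encoding that was used to write it.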
-
writePage
protected void writePage() throws java.io.IOException
This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.
- Overrides:
writePage in class PDFTextStripper
- Throws:
java.io.IOException
- If there is an error writing the text.
-
endDocument
public void endDocument(PDDocument pdf) throws java.io.IOException
This method is available for subclasses of this class. It will be called after processing of the document finishes.
- Overrides:
endDocument in class PDFTextStripper
- Parameters:
pdf
- The PDF document that is being processed.
- Throws:
java.io.IOException
- If an IO error occurs.
-
getTitle
protected java.lang.String getTitle()
This method will attempt to guess the title of the document using either the document properties or the first lines of text.
- Returns:
The guessed title.
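The fallback described above (document properties first, then the first lines of text) can be sketched as follows. This is only an illustration: `TitleGuessSketch` and `guessTitle` are hypothetical names, and taking the first non-blank line is an assumption about what "the first lines of text" means, not the PDFBox logic.

```java
final class TitleGuessSketch {
    // Hypothetical helper, not part of PDFBox: prefer the metadata title,
    // otherwise fall back to the first non-blank line of the extracted text.
    static String guessTitle(String metadataTitle, String extractedText) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        if (extractedText != null) {
            for (String line : extractedText.split("\\r?\\n")) {
                if (!line.trim().isEmpty()) {
                    return line.trim();
                }
            }
        }
        // Nothing usable was found.
        return "";
    }

    public static void main(String[] args) {
        System.out.println(guessTitle(null, "\n\nFirst line\nSecond line"));
    }
}
```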
-
startArticle
protected void startArticle(boolean isltr) throws java.io.IOException
Write out the article separator (div tag) with proper text direction information.
- Overrides:
startArticle in class PDFTextStripper
- Parameters:
isltr
- true if the direction of the text is left to right
- Throws:
java.io.IOException
- If there is an error writing to the stream.
-
endArticle
protected void endArticle() throws java.io.IOException
Write out the article separator.
- Overrides:
endArticle in class PDFTextStripper
- Throws:
java.io.IOException
- If there is an error writing to the stream.
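To make the article separators concrete, the following sketch pairs an opening div that carries a text direction with its closing tag. The class name, the method bodies, and the use of an explicit `dir` attribute for both directions are assumptions for illustration; the markup PDFBox actually writes is not specified here.

```java
final class ArticleDivSketch {
    // Hypothetical stand-in for startArticle(boolean): opens a div whose
    // dir attribute reflects the text direction.
    static String startArticle(boolean isLtr) {
        return isLtr ? "<div dir=\"ltr\">" : "<div dir=\"rtl\">";
    }

    // Hypothetical stand-in for endArticle(): closes the div.
    static String endArticle() {
        return "</div>";
    }

    public static void main(String[] args) {
        System.out.println(startArticle(false) + "text" + endArticle());
    }
}
```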
-
writeString
protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions) throws java.io.IOException
Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
- Overrides:
writeString in class PDFTextStripper
- Parameters:
text
- The text to write to the stream.
textPositions
- The corresponding text positions.
- Throws:
java.io.IOException
- If there is an error writing to the stream.
-
writeString
protected void writeString(java.lang.String chars) throws java.io.IOException
Write a string to the output stream and escape some HTML characters.
- Overrides:
writeString in class PDFTextStripper
- Parameters:
chars
- String to be written to the stream
- Throws:
java.io.IOException
- If there is an error writing to the stream.
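The description does not list which characters are escaped. As a minimal sketch, the example below assumes the four characters most commonly escaped in HTML text (`&`, `<`, `>`, `"`); `HtmlEscapeSketch` and `escape` are hypothetical names and this is not the PDFBox implementation.

```java
final class HtmlEscapeSketch {
    // Hypothetical helper, not part of PDFBox: replaces a small, assumed set
    // of HTML-significant characters with entity references.
    static String escape(String chars) {
        StringBuilder sb = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & \"c\""));
    }
}
```

Escaping at write time keeps text extracted from the PDF from being interpreted as markup in the generated page.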
-
writeParagraphEnd
protected void writeParagraphEnd() throws java.io.IOException
Writes the paragraph end "</p>" to the output. Furthermore, it will also clear the font state. Write something (if defined) at the end of a paragraph.
- Overrides:
writeParagraphEnd in class PDFTextStripper
- Throws:
java.io.IOException
- If an I/O error occurs while writing to the output.
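One way to picture "clearing the font state" at the end of a paragraph is a stack of open formatting tags that is unwound before the paragraph is closed. The sketch below works under that assumption and further assumes the paragraph end is a closing `</p>` tag; `FontStateSketch` and its methods are hypothetical and do not reproduce PDFBox internals.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FontStateSketch {
    // Tags that are currently open, most recently opened first.
    private final Deque<String> openTags = new ArrayDeque<>();

    // Records a formatting tag (for example "b" or "i") and returns its opening markup.
    String open(String tag) {
        openTags.push(tag);
        return "<" + tag + ">";
    }

    // Closes every open tag, innermost first, and leaves the state empty.
    String clear() {
        StringBuilder sb = new StringBuilder();
        while (!openTags.isEmpty()) {
            sb.append("</").append(openTags.pop()).append(">");
        }
        return sb.toString();
    }

    // Closes any open formatting tags, then ends the paragraph.
    String paragraphEnd() {
        return clear() + "</p>";
    }

    public static void main(String[] args) {
        FontStateSketch state = new FontStateSketch();
        String html = "<p>" + state.open("b") + state.open("i") + "word" + state.paragraphEnd();
        System.out.println(html);
    }
}
```

Closing tags in reverse order of opening keeps the generated markup properly nested.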
-
-