public final class WordsFormattedTextExtractor extends TextExtractor implements IPageTextExtractor, IHighlightExtractor, ITextExtractorWithFormatter
Provides the formatted text extractor for text documents.
Supported formats:
.DOC | Microsoft Word Text document |
.DOT | Microsoft Word Text template |
.DOCX | Microsoft Office Open XML Text document |
.DOCM | Microsoft Word 2007 macro-enabled document |
.RTF | Rich Text Format text file |
.ODT | OpenDocument text |
.HTML (.XHTML, .HTM) | Hypertext Markup Language document |
.MHTML (.MHT) | Web Archive Single File |
Extracting text from a document:
// Create a formatted text extractor for text documents
WordsFormattedTextExtractor extractor = new WordsFormattedTextExtractor(stream);
// Extract the formatted text
System.out.println(extractor.extractAll());
Extracting text page by page:
// Create a formatted text extractor for text documents
WordsFormattedTextExtractor extractor = new WordsFormattedTextExtractor(stream);
// Iterate over the pages
for (int pageIndex = 0; pageIndex < extractor.getPageCount(); pageIndex++) {
    // Extract the formatted text from the page at index pageIndex
    System.out.println(extractor.extractPage(pageIndex));
}
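The page loop above can be wrapped in a small helper that collects every page into a list. The following is only a sketch: `PageSource`, `PageCollector`, and `fromStrings` are hypothetical names introduced for illustration, and `PageSource` merely mirrors the `getPageCount()` and `extractPage(int)` signatures listed on this page so that the example compiles without the library. With the real extractor, the same two calls would presumably be made on the extractor itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring the two page methods documented on this page.
// It is NOT part of the library; it only lets the sketch compile on its own.
interface PageSource {
    int getPageCount();
    String extractPage(int pageIndex);
}

final class PageCollector {
    private PageCollector() {}

    // Collects the text of every page, in order, using zero-based indexes.
    static List<String> collectPages(PageSource source) {
        int count = source.getPageCount(); // read once instead of on every iteration
        List<String> pages = new ArrayList<>(count);
        for (int pageIndex = 0; pageIndex < count; pageIndex++) {
            pages.add(source.extractPage(pageIndex));
        }
        return pages;
    }

    // In-memory page source used only for demonstration (hypothetical helper).
    static PageSource fromStrings(String... pages) {
        final List<String> copy = Arrays.asList(pages.clone());
        return new PageSource() {
            @Override public int getPageCount() { return copy.size(); }
            @Override public String extractPage(int pageIndex) { return copy.get(pageIndex); }
        };
    }

    public static void main(String[] args) {
        List<String> pages = collectPages(fromStrings("first page", "second page"));
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + i + ": " + pages.get(i));
        }
    }
}
```

Reading the page count once keeps the loop independent of how expensive `getPageCount()` is; whether that matters for the real extractor is not stated on this page.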
To set a formatter, use the DocumentFormatter property (setDocumentFormatter):
// Create a formatted text extractor for text documents
WordsFormattedTextExtractor extractor = new WordsFormattedTextExtractor(stream);
// Set a markdown formatter for formatting
extractor.setDocumentFormatter(new MarkdownDocumentFormatter()); // all the text will be formatted as Markdown
By default, text is formatted as plain text by PlainDocumentFormatter.
Constructor and Description |
---|
WordsFormattedTextExtractor(InputStream stream) Initializes a new instance of the WordsFormattedTextExtractor class. |
WordsFormattedTextExtractor(InputStream stream, LoadOptions loadOptions) Initializes a new instance of the WordsFormattedTextExtractor class. |
WordsFormattedTextExtractor(String fileName) Initializes a new instance of the WordsFormattedTextExtractor class. |
WordsFormattedTextExtractor(String fileName, LoadOptions loadOptions) Initializes a new instance of the WordsFormattedTextExtractor class. |
Modifier and Type | Method and Description |
---|---|
List<String> | extractHighlights(HighlightOptions... highlightOptions) Extracts highlights. |
String | extractPage(int pageIndex) Extracts all characters from the page with pageIndex and returns the data as a string. |
protected String | extractText() Extracts all characters from the current position to the end of the text extractor and returns them as one string. |
protected String | extractTextLine() Extracts a line of characters from the text extractor and returns the data as a string. |
DocumentFormatter | getDocumentFormatter() Gets a DocumentFormatter. |
int | getPageCount() Gets a total count of the pages. |
protected String | prepareLine() Returns a line of the text. |
void | reset() Resets the current document. |
void | setDocumentFormatter(DocumentFormatter value) Sets a DocumentFormatter. |
Methods inherited from class TextExtractor:
checkDisposed, close, dispose, dispose, extractAll, extractLine, getEncoding, getMediaType, getPassword, isDisposed, setEncoding, setMediaType
public WordsFormattedTextExtractor(String fileName)
Initializes a new instance of the WordsFormattedTextExtractor class.
Parameters:
fileName - The path to the file.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public WordsFormattedTextExtractor(String fileName, LoadOptions loadOptions)
Initializes a new instance of the WordsFormattedTextExtractor class.
Parameters:
fileName - The path to the file.
loadOptions - The options for loading the file.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public WordsFormattedTextExtractor(InputStream stream)
Initializes a new instance of the WordsFormattedTextExtractor class.
Parameters:
stream - The stream of the document.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public WordsFormattedTextExtractor(InputStream stream, LoadOptions loadOptions)
Initializes a new instance of the WordsFormattedTextExtractor class.
Parameters:
stream - The stream of the document.
loadOptions - The options for loading the file.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public DocumentFormatter getDocumentFormatter()
Gets a DocumentFormatter.
Specified by:
getDocumentFormatter in interface ITextExtractorWithFormatter
Returns:
An instance of the DocumentFormatter. The default is PlainDocumentFormatter.
By default, text is formatted as plain text by the PlainDocumentFormatter class. You can set any other formatter, or null to use the default formatter.
public void setDocumentFormatter(DocumentFormatter value)
Sets a DocumentFormatter.
Specified by:
setDocumentFormatter in interface ITextExtractorWithFormatter
Parameters:
value - An instance of the DocumentFormatter. The default is PlainDocumentFormatter.
By default, text is formatted as plain text by the PlainDocumentFormatter class. You can set any other formatter, or null to use the default formatter.
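The "null means default" rule described for the formatter can be pictured with a small setter that never stores null. This is a sketch of one plausible reading of that rule, not the library's implementation: `SketchFormatter`, `PlainSketchFormatter`, `MarkdownSketchFormatter`, and `FormatterHolder` are hypothetical stand-ins that do not exist in the library, and the `heading` method is invented purely to make the difference between formatters visible.

```java
// Hypothetical stand-ins; they are not the library's DocumentFormatter classes.
interface SketchFormatter {
    String heading(String text);
}

// Plays the role of a plain-text default formatter.
final class PlainSketchFormatter implements SketchFormatter {
    @Override public String heading(String text) { return text; }
}

// Plays the role of a Markdown formatter.
final class MarkdownSketchFormatter implements SketchFormatter {
    @Override public String heading(String text) { return "# " + text; }
}

final class FormatterHolder {
    // Starts with the plain default, as the documentation describes.
    private SketchFormatter formatter = new PlainSketchFormatter();

    SketchFormatter getFormatter() {
        return formatter;
    }

    // Passing null restores the plain default instead of storing null.
    void setFormatter(SketchFormatter value) {
        formatter = (value != null) ? value : new PlainSketchFormatter();
    }

    public static void main(String[] args) {
        FormatterHolder holder = new FormatterHolder();
        holder.setFormatter(new MarkdownSketchFormatter());
        System.out.println(holder.getFormatter().heading("Title"));
        holder.setFormatter(null); // falls back to the plain default
        System.out.println(holder.getFormatter().heading("Title"));
    }
}
```

Under this reading, callers of the getter never have to check for null, which is one reason a setter might normalize null to a default.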
public int getPageCount()
Gets the total number of pages.
Specified by:
getPageCount in interface IPageTextExtractor
public String extractPage(int pageIndex)
Extracts all characters from the page with pageIndex and returns the data as a string.
Specified by:
extractPage in interface IPageTextExtractor
Parameters:
pageIndex - The index of the page.
public void reset()
Resets the current document.
After a reset, the extractLine method returns the first line of the document.
Overrides:
reset in class TextExtractor
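The effect of a reset on line-by-line reading can be illustrated with a tiny in-memory cursor. This is a sketch under stated assumptions: `LineCursor` is a hypothetical stand-in rather than the library class, and returning null when no lines remain is an assumption of the sketch, since this page does not say what the real extractLine does at the end of the document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a line-oriented extractor; not the library class.
final class LineCursor {
    private final List<String> lines;
    private int position = 0;

    LineCursor(String... lines) {
        this.lines = Arrays.asList(lines.clone());
    }

    // Returns the next line, or null when no lines remain (an assumption of this sketch).
    String extractLine() {
        return position < lines.size() ? lines.get(position++) : null;
    }

    // Rewinds the cursor so the next extractLine() call returns the first line again.
    void reset() {
        position = 0;
    }

    public static void main(String[] args) {
        LineCursor cursor = new LineCursor("line one", "line two");
        System.out.println(cursor.extractLine());
        System.out.println(cursor.extractLine());
        cursor.reset();
        System.out.println(cursor.extractLine());
    }
}
```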
public List<String> extractHighlights(HighlightOptions... highlightOptions)
Extracts highlights.
Specified by:
extractHighlights in interface IHighlightExtractor
Parameters:
highlightOptions - A collection of HighlightOptions. Only Mode = FixedWidth is supported.
Throws:
UnsupportedOperationException - Mode is not FixedWidth.
protected String extractText()
Extracts all characters from the current position to the end of the text extractor and returns them as one string.
Overrides:
extractText in class TextExtractor
protected String extractTextLine()
Extracts a line of characters from the text extractor and returns the data as a string.
Overrides:
extractTextLine in class TextExtractor
protected String prepareLine()
Returns a line of the text.
Overrides:
prepareLine in class TextExtractor
Copyright © 2018. All rights reserved.