public class Latin1StringsParser extends AbstractParser
Modifier and Type | Field and Description |
---|---|
private static int | BUF_SIZE - The size of the internal buffers. |
private int | inPos - The current position in the input buffer. |
private byte[] | input - The input buffer. |
private int | inSize - The number of bytes in the input buffer. |
private static boolean[] | isChar - The valid ISO-8859-1 character map. |
private int | minSize - The minimum size of a character sequence to be extracted. |
private int | outPos - The current position in the output buffer. |
private byte[] | output - The output buffer. |
private static long | serialVersionUID |
private static java.util.Set<MediaType> | SUPPORTED_TYPES - The set of supported types. |
private int | tmpPos - The temporary position in the output buffer. |
private XHTMLContentHandler | xhtml - The output content handler. |
Constructor and Description |
---|
Latin1StringsParser() |
Modifier and Type | Method and Description |
---|---|
private void | doParse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context) - Makes a best-effort attempt to extract Latin1 strings encoded with ISO-8859-1, UTF-8 or UTF-16. |
private void | flushBuffer() - Flushes the internal output buffer to the content handler. |
private static boolean[] | getCharMap() - Populates the valid ISO-8859-1 character map. |
int | getMinSize() - Returns the minimum size of a character sequence to be extracted. |
java.util.Set<MediaType> | getSupportedTypes(ParseContext arg0) - Returns the set of media types supported by this parser when used with the given parse context. |
private static java.util.Set<MediaType> | getTypes() - Returns the set of supported types. |
private static boolean | isChar(byte c) - Tests if the byte is an ISO-8859-1 char. |
void | parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context) - Parses a document stream into a sequence of XHTML SAX events. |
void | setMinSize(int minSize) - Sets the minimum size of a character sequence to be extracted. |
Methods inherited from class AbstractParser: parse
private static final long serialVersionUID
private static final java.util.Set<MediaType> SUPPORTED_TYPES
private static final boolean[] isChar
private static int BUF_SIZE
private int minSize
private byte[] output
private byte[] input
private int tmpPos
private int outPos
private int inSize
private int inPos
private XHTMLContentHandler xhtml
public int getMinSize()
public void setMinSize(int minSize)
minSize - the minimum size of a character sequence
private static boolean[] getCharMap()
private static java.util.Set<MediaType> getTypes()
private static final boolean isChar(byte c)
c - the byte to test
private void flushBuffer() throws java.io.UnsupportedEncodingException, org.xml.sax.SAXException
java.io.UnsupportedEncodingException
org.xml.sax.SAXException
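The isChar map and the isChar(byte) test listed above suggest a 256-entry lookup table indexed by the unsigned byte value. The following is a minimal sketch of that idea, not the parser's actual code: the class name Latin1CharMapSketch is hypothetical, and the choice of which byte values count as valid (printable ASCII, tab, line feed, carriage return, and 0xA0 to 0xFF) is an assumption made for illustration.

```java
final class Latin1CharMapSketch {
    // Hypothetical lookup table: index is the unsigned byte value,
    // entry is true when the byte is treated as text.
    private static final boolean[] IS_CHAR = buildCharMap();

    // Fills the table once. The ranges below are an assumption,
    // not taken from the documented class.
    static boolean[] buildCharMap() {
        boolean[] map = new boolean[256];
        for (int c = 0; c < 256; c++) {
            boolean printableAscii = c >= 0x20 && c <= 0x7E;
            boolean whitespace = c == 0x09 || c == 0x0A || c == 0x0D;
            boolean upperLatin1 = c >= 0xA0; // non-breaking space through y-diaeresis
            map[c] = printableAscii || whitespace || upperLatin1;
        }
        return map;
    }

    // Java bytes are signed, so mask to 0..255 before indexing.
    static boolean isChar(byte b) {
        return IS_CHAR[b & 0xFF];
    }

    public static void main(String[] args) {
        System.out.println(isChar((byte) 'A'));
        System.out.println(isChar((byte) 0x00));
        System.out.println(isChar((byte) 0xE9));
    }
}
```

A precomputed table keeps the per-byte test to a single array read, which matters when every byte of a large binary file is checked.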
public java.util.Set<MediaType> getSupportedTypes(ParseContext arg0)
Parser
arg0 - parse context
public void parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context) throws java.io.IOException, org.xml.sax.SAXException
Parser
The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
java.io.IOException - if the document stream could not be read
org.xml.sax.SAXException - if the SAX events could not be processed
Parser.parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
private void doParse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context) throws java.io.IOException, org.xml.sax.SAXException
stream - the input stream
handler - the output content handler
metadata - the metadata of the file
context - the parsing context
java.io.IOException - if an I/O error occurs
org.xml.sax.SAXException - if a SAX error occurs
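The interplay of minSize, the input buffer, and the flush step can be illustrated with a simplified, standard-library-only sketch. It is not the doParse implementation: the class Latin1StringsSketch, its methods, the 4096-byte buffer, and the isText ranges are all hypothetical, and it handles single-byte ISO-8859-1 runs only, whereas doParse is documented to also cover UTF-8 and UTF-16. As with parse, the stream is consumed but left for the caller to close.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Latin1StringsSketch {
    // Assumed validity test: printable ASCII or 0xA0..0xFF.
    static boolean isText(int b) {
        return (b >= 0x20 && b <= 0x7E) || b >= 0xA0;
    }

    // Reads the stream in chunks and collects runs of text bytes
    // that are at least minSize bytes long. Does not close the stream.
    static List<String> extract(InputStream in, int minSize) throws IOException {
        List<String> found = new ArrayList<>();
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        byte[] buf = new byte[4096]; // hypothetical buffer size
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                int b = buf[i] & 0xFF;
                if (isText(b)) {
                    run.write(b);
                } else {
                    flush(run, minSize, found);
                }
            }
        }
        flush(run, minSize, found); // a run may end at end of stream
        return found;
    }

    // Convenience overload for in-memory data.
    static List<String> extract(byte[] data, int minSize) {
        try {
            return extract(new ByteArrayInputStream(data), minSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Emits the pending run if it is long enough, then clears it.
    private static void flush(ByteArrayOutputStream run, int minSize, List<String> out) {
        if (run.size() >= minSize) {
            out.add(new String(run.toByteArray(), StandardCharsets.ISO_8859_1));
        }
        run.reset();
    }
}
```

Because the pending run lives outside the read loop, a string that straddles two buffer reads is still reported as one sequence.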