public final class PatternTokenizer
extends org.apache.lucene.analysis.Tokenizer
group=-1 (the default) is equivalent to "split": the pattern matches the delimiters between tokens. In this case, the tokens are
the same as the output of the following call, with empty tokens removed:
String.split(java.lang.String)
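The split behaviour can be approximated with the JDK alone. The sketch below does not use the tokenizer; it calls Pattern.split(CharSequence), the precompiled counterpart of String.split(java.lang.String), and drops empty strings. The class name SplitModeSketch and the helper splitTokens are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SplitModeSketch {

    // Hypothetical helper: approximates group = -1 by splitting on the
    // pattern and discarding zero-length pieces.
    static List<String> splitTokens(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : pattern.split(input)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Leading, doubled, and trailing delimiters produce no tokens.
        System.out.println(splitTokens(Pattern.compile(","), ",aaa,,bbb,ccc,"));
        // prints [aaa, bbb, ccc]
    }
}
```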
Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
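The group selection in this example can be reproduced with java.util.regex directly. This is a minimal sketch of the matching logic only, not the tokenizer's implementation; GroupModeSketch and groupTokens are hypothetical names, and the sketch skips zero-length results on the assumption that this matches the behaviour stated for this Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupModeSketch {

    // Hypothetical helper: collects the requested capture group of every match.
    static List<String> groupTokens(Pattern pattern, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            // A group that did not participate is null; skip it and empty matches.
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern quoted = Pattern.compile("'([^']+)'");
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(quoted, 0, input)); // prints ['bbb', 'ccc']
        System.out.println(groupTokens(quoted, 1, input)); // prints [bbb, ccc]
    }
}
```

Group 0 is the whole match, so the quote marks are kept; group 1 is the parenthesized part, so they are dropped.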
NOTE: This Tokenizer does not output tokens that are of zero length.
See Also: java.util.regex.Pattern
Modifier and Type | Field and Description |
---|---|
(package private) char[] | buffer |
private int | group |
private int | index |
private java.util.regex.Matcher | matcher |
private org.apache.lucene.analysis.tokenattributes.OffsetAttribute | offsetAtt |
private java.util.regex.Pattern | pattern |
private java.lang.StringBuilder | str |
private org.apache.lucene.analysis.tokenattributes.CharTermAttribute | termAtt |
Constructor and Description |
---|
PatternTokenizer(java.io.Reader input, java.util.regex.Pattern pattern, int group) Creates a new PatternTokenizer returning tokens from group (-1 for split functionality). |
Modifier and Type | Method and Description |
---|---|
void | end() |
private void | fillBuffer(java.lang.StringBuilder sb, java.io.Reader input) |
boolean | incrementToken() |
void | reset(java.io.Reader input) |
Methods inherited from class org.apache.lucene.util.AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
private final org.apache.lucene.analysis.tokenattributes.CharTermAttribute termAtt
private final org.apache.lucene.analysis.tokenattributes.OffsetAttribute offsetAtt
private final java.lang.StringBuilder str
private int index
private final java.util.regex.Pattern pattern
private final int group
private final java.util.regex.Matcher matcher
final char[] buffer
public PatternTokenizer(java.io.Reader input, java.util.regex.Pattern pattern, int group) throws java.io.IOException
Throws: java.io.IOException
public boolean incrementToken() throws java.io.IOException
Specified by: incrementToken in class org.apache.lucene.analysis.TokenStream
Throws: java.io.IOException
public void end() throws java.io.IOException
Overrides: end in class org.apache.lucene.analysis.TokenStream
Throws: java.io.IOException
public void reset(java.io.Reader input) throws java.io.IOException
Overrides: reset in class org.apache.lucene.analysis.Tokenizer
Throws: java.io.IOException
private void fillBuffer(java.lang.StringBuilder sb, java.io.Reader input) throws java.io.IOException
Throws: java.io.IOException
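This page gives only the signature of fillBuffer. Judging from its parameters and the char[] buffer field, it plausibly drains the whole Reader into the StringBuilder so that the pattern can be matched against the full input. The sketch below shows one way to do that under this assumption; the class name FillBufferSketch, the buffer size of 8192, and the clearing of the StringBuilder before reading are all assumptions, not documented behaviour.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class FillBufferSketch {

    // Assumed size; the documentation does not state the buffer length.
    private final char[] buffer = new char[8192];

    // Reads everything from input into sb, reusing one char[] buffer.
    void fillBuffer(StringBuilder sb, Reader input) throws IOException {
        sb.setLength(0); // assumption: discard text left over from earlier input
        int len;
        while ((len = input.read(buffer)) != -1) {
            sb.append(buffer, 0, len);
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder("stale");
        new FillBufferSketch().fillBuffer(sb, new StringReader("aaa 'bbb' 'ccc'"));
        System.out.println(sb); // prints aaa 'bbb' 'ccc'
    }
}
```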