public class PatternTokenizerFactory extends BaseTokenizerFactory
Factory for PatternTokenizer.
This tokenizer uses regex pattern matching to construct distinct tokens
for the input stream. It takes two arguments: "pattern" and "group".
group=-1 (the default) is equivalent to "split". In this case, the tokens will
be equivalent to the output of String.split(java.lang.String), with empty
tokens removed.
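The split behavior described above can be approximated with plain java.util.regex. This is a minimal sketch, not the tokenizer's actual implementation: the class name SplitModeSketch, the helper splitTokens, and the sample regex and input are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitModeSketch {
    // Hypothetical helper: approximates group=-1 by splitting the input on
    // the pattern and dropping any empty pieces.
    static List<String> splitTokens(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(regex).split(input)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The doubled comma yields an empty piece, which is filtered out.
        System.out.println(splitTokens("\\s*,\\s*", "aaa, bbb,,ccc")); // prints [aaa, bbb, ccc]
    }
}
```

The explicit isEmpty() check matters because String.split on its own only discards trailing empty strings, not empty strings in the middle of the result.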
Using group >= 0 selects the matching group as the token. For example, if you have:
    pattern = \'([^\']+)\'
    group = 0
    input = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
NOTE: This Tokenizer does not output tokens that are of zero length.
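The group selection described above can be reproduced with a plain Matcher loop. This is a sketch under stated assumptions rather than the tokenizer's own code: the class GroupModeSketch and the helper groupTokens are hypothetical names, and only the pattern and input come from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupModeSketch {
    // Hypothetical helper: emits the requested capture group of every match,
    // skipping zero-length (or unmatched) groups as the NOTE above describes.
    static List<String> groupTokens(String regex, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String regex = "\\'([^\\']+)\\'"; // the pattern from the example, as a Java string literal
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(regex, 0, input)); // prints ['bbb', 'ccc']
        System.out.println(groupTokens(regex, 1, input)); // prints [bbb, ccc]
    }
}
```

Group 0 is the whole match, so the quote marks are kept; group 1 is the first parenthesized sub-expression, so they are dropped.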
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
  </analyzer>
</fieldType>
See Also: PatternTokenizer
Modifier and Type | Field and Description |
---|---|
protected int | group |
static java.lang.String | GROUP |
protected java.util.regex.Pattern | pattern |
static java.lang.String | PATTERN |
Inherited fields: log, args, luceneMatchVersion
Constructor and Description |
---|
PatternTokenizerFactory() |
Modifier and Type | Method and Description |
---|---|
org.apache.lucene.analysis.Tokenizer | create(java.io.Reader in) Split the input using configured pattern |
static java.util.List<org.apache.lucene.analysis.Token> | group(java.util.regex.Matcher matcher, java.lang.String input, int group) Deprecated. |
void | init(java.util.Map<java.lang.String,java.lang.String> args) Require a configured pattern |
static java.util.List<org.apache.lucene.analysis.Token> | split(java.util.regex.Matcher matcher, java.lang.String input) Deprecated. |
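Methods inherited from the base factory classes: assureMatchVersion, getArgs, getBoolean, getBoolean, getInt, getInt, getInt, getSnowballWordSet, getWordSet, warnDeprecated
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface TokenizerFactory: getArgs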
public static final java.lang.String PATTERN
public static final java.lang.String GROUP
protected java.util.regex.Pattern pattern
protected int group
public void init(java.util.Map<java.lang.String,java.lang.String> args)
Specified by: init in interface TokenizerFactory
Overrides: init in class BaseTokenStreamFactory
public org.apache.lucene.analysis.Tokenizer create(java.io.Reader in)
@Deprecated public static java.util.List<org.apache.lucene.analysis.Token> split(java.util.regex.Matcher matcher, java.lang.String input)
@Deprecated public static java.util.List<org.apache.lucene.analysis.Token> group(java.util.regex.Matcher matcher, java.lang.String input, int group)