Field analyzers are used both during ingestion, when a document is indexed, and at query time. Analyzers are only valid for <fieldType> declarations that specify the TextField class. Analyzers may be a single class or they may be composed of a series of zero or more CharFilter, one Tokenizer and zero or more TokenFilter classes.
Analyzers are specified by adding <analyzer> children to the <fieldType> element in the schema.xml config file. Field types typically use a single analyzer, but the type attribute (type="index" or type="query") can be used to specify distinct analyzers for indexing and for querying.
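As a rough illustration of the type attribute, the sketch below declares separate index-time and query-time analyzers. The field type name text_split and the decision to apply synonyms only at query time are assumptions made for this example, and a synonyms.txt file is assumed to exist in the conf directory.

```xml
<!-- Hypothetical field type: the name and the filter choices are illustrative only -->
<fieldType name="text_split" class="solr.TextField">
  <!-- Applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied when queries are analyzed -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms.txt is assumed to be present in the conf directory -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Using the same tokenizer and lowercasing on both sides keeps query terms comparable to the terms stored in the index.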
The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is the fully qualified Java class name of an existing Lucene analyzer.
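A minimal sketch of this form follows. The field type name text_ws is hypothetical, and the package shown for WhitespaceAnalyzer is the one used by the Lucene release bundled with Solr 1.4; other Lucene versions may place the class elsewhere.

```xml
<!-- Hypothetical field type that delegates all analysis to one Lucene analyzer class -->
<fieldType name="text_ws" class="solr.TextField">
  <!-- Package location assumes the Lucene version shipped with Solr 1.4 -->
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>
```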
For more configurable analysis, an analyzer chain can be created using an <analyzer> element with no class attribute, whose child elements name the CharFilter, Tokenizer and TokenFilter factory classes to use, in the order they should run, as in the following example:
<fieldType name="nametext" class="solr.TextField">
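A complete chain of this kind might look like the following sketch. The field type name text_basic and the particular tokenizer and filters are assumptions chosen for illustration, and stopwords.txt is assumed to exist in the conf directory.

```xml
<!-- Hypothetical analyzer chain: elements run in the order they are listed -->
<fieldType name="text_basic" class="solr.TextField">
  <analyzer>
    <!-- Exactly one tokenizer splits the text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Zero or more token filters then process each token in sequence -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords.txt is assumed to be present in the conf directory -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```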
A CharFilter pre-processes input characters. It can add, remove or change characters while preserving the original character offsets.
The following table provides an overview of some of the CharFilter factories available in Solr 1.4:
| CharFilter | Description |
| --- | --- |
| MappingCharFilterFactory | Applies mapping contained in a map to the character stream. The map contains pairings of String input to String output. |
| PatternReplaceCharFilterFactory | Applies a regular expression pattern to the string in the character stream, replacing matches with the specified replacement string. |
| HTMLStripCharFilterFactory | Strips HTML from the input stream and passes the result to either a CharFilter or a Tokenizer. This filter removes tags while keeping content. It also removes <script>, <style>, comments, and processing instructions. |
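As a sketch of how CharFilters might be declared, the fragment below strips HTML and then applies a character mapping before tokenizing. The field type name text_html is hypothetical, and the mapping file mapping-ISOLatin1Accent.txt is assumed to be present in the conf directory.

```xml
<!-- Hypothetical field type: charFilters run before the tokenizer, in listed order -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Remove markup while keeping the text content -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- The mapping file name is an assumption; it must exist in the conf directory -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```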
A Tokenizer breaks up a stream of text into tokens. It reads from a Reader and produces a TokenStream that carries various metadata, such as the locations at which each token occurs in the field.
The following table provides an overview of some of the Tokenizer factory classes included in Solr 1.4:
| Tokenizer | Description |
| --- | --- |
| StandardTokenizerFactory | Treats whitespace and punctuation as delimiters. |
| NGramTokenizerFactory | Generates n-gram tokens of sizes in the given range. |
| EdgeNGramTokenizerFactory | Generates edge n-gram tokens of sizes in the given range. |
| PatternTokenizerFactory | Uses a Java regular expression to break the text stream into tokens. |
| WhitespaceTokenizerFactory | Splits the text stream on whitespace, returning sequences of non-whitespace characters as tokens. |
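To make the regular-expression case concrete, here is a small sketch that splits a comma-separated value into tokens. The field type name text_csv and the pattern itself are assumptions chosen for the example.

```xml
<!-- Hypothetical field type for comma-separated values -->
<fieldType name="text_csv" class="solr.TextField">
  <analyzer>
    <!-- The pattern matches the delimiter: a comma followed by optional whitespace -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
  </analyzer>
</fieldType>
```

With this configuration the pattern describes what separates tokens rather than the tokens themselves.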
A TokenFilter consumes and produces TokenStreams. It looks at each token sequentially and decides whether to pass it along, replace it or discard it.
A TokenFilter may also do more complex analysis by buffering to look ahead and consider multiple tokens at once.
The following table provides an overview of some of the TokenFilter factory classes included in Solr 1.4:
| TokenFilter | Description |
| --- | --- |
| KeepWordFilterFactory | Discards all tokens except those that are listed in the given word list. Inverse of StopFilterFactory. |
| LengthFilterFactory | Passes tokens whose length falls within the min/max limit specified. |
| LowerCaseFilterFactory | Converts any uppercase letters in a token to lowercase. |
| PatternReplaceFilterFactory | Applies a regular expression to each token, and substitutes the given replacement string for matches. |
| PhoneticFilterFactory | Creates tokens using one of the phonetic encoding algorithms from the org.apache.commons.codec.language package. |
| PorterStemFilterFactory | An algorithmic stemmer that is not as accurate as a table-based stemmer, but faster and less complex. |
| ShingleFilterFactory | Constructs shingles (token n-grams) from the token stream. |
| StandardFilterFactory | Removes dots from acronyms and 's from the end of tokens. This class only works when used in conjunction with the StandardTokenizerFactory. |
| StopFilterFactory | Discards, or stops, analysis of tokens that are on the given stop words list. |
| SynonymFilterFactory | Each token is looked up in the list of synonyms and if a match is found, then the synonym is emitted in place of the token. |
| TrimFilterFactory | Trims leading and trailing whitespace from tokens. |
| WordDelimiterFilterFactory | Splits and recombines tokens at punctuation, case changes and numbers. Useful for indexing |
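The sketch below chains a few of these filters to show how order affects the result. The field type name text_stemmed and the length limits are assumptions made for illustration.

```xml
<!-- Hypothetical field type: each filter receives the output of the one before it -->
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase first so that later filters see normalized tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep only tokens between 2 and 50 characters long (limits are illustrative) -->
    <filter class="solr.LengthFilterFactory" min="2" max="50"/>
    <!-- Stem last, after unwanted tokens have been dropped -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Placing the stemmer at the end means the length check is applied to the original words rather than to their stems.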
Testing Your Analyzer
The Solr admin interface includes a handy page that lets you test your analysis against a field type. In your installation it is at http://[hostname]:8983/solr/admin/analysis.jsp.