Class Ferret::Analysis::Analyzer
In: ext/r_analysis.c
Parent: Object

Summary

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.

The default Analyzer just creates a LowerCaseTokenizer which converts all text to lowercase tokens. See LowerCaseTokenizer for more details.
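For intuition, here is a minimal pure-Ruby sketch of what a lowercase tokenizer does; this is an illustration of the behaviour only, not Ferret's C implementation (the class name here is made up). It splits the input into runs of letters and downcases each one:

```ruby
# Conceptual sketch only -- Ferret's LowerCaseTokenizer is implemented in C.
class SketchLowerCaseTokenizer
  def initialize(text)
    # Runs of alphabetic characters become tokens; everything else separates them.
    @tokens = text.scan(/[[:alpha:]]+/).map(&:downcase)
  end

  # Return the next token, or nil when the stream is exhausted.
  def next
    @tokens.shift
  end
end

stream = SketchLowerCaseTokenizer.new("One TWO, three!")
tokens = []
while (t = stream.next)
  tokens << t
end
p tokens  # ["one", "two", "three"]
```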

Example

To create your own custom Analyzer you simply need to implement a token_stream method which takes the field name and the data to be tokenized as parameters and returns a TokenStream. Most analyzers ignore the field name.

Here we'll create a stemming analyzer:

  class MyAnalyzer < Analyzer
    def token_stream(field, str)
      StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
    end
  end
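The example above composes its stream by wrapping: each filter holds an inner stream and transforms the tokens it pulls from it. The following toy plain-Ruby version illustrates that pattern; the class names are stand-ins invented for this sketch, not Ferret's API, and SuffixStripFilter is a crude substitute for a real stemmer:

```ruby
# Toy illustration of the tokenizer/filter wrapping pattern (plain Ruby,
# not Ferret's API). Each stage exposes the same interface: #next returns
# the next token, or nil when the stream is exhausted.
class WhitespaceTokenizer
  def initialize(text)
    @words = text.split
  end

  def next
    @words.shift
  end
end

class LowerCaseFilter
  def initialize(stream)
    @stream = stream
  end

  def next
    token = @stream.next
    token && token.downcase
  end
end

# Stands in for a real stemmer -- it merely trims a couple of suffixes.
class SuffixStripFilter
  def initialize(stream)
    @stream = stream
  end

  def next
    token = @stream.next
    token && token.sub(/(ing|s)\z/, "")
  end
end

stream = SuffixStripFilter.new(LowerCaseFilter.new(WhitespaceTokenizer.new("Dogs Bark")))
tokens = []
while (t = stream.next)
  tokens << t
end
p tokens  # ["dog", "bark"]
```

Because every stage answers the same #next message, filters can be stacked in any order and to any depth, which is exactly what makes the one-line stream construction in token_stream possible.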

Methods

new   token_stream  

Public Class methods

Create a new LetterAnalyzer which downcases tokens by default but can optionally leave case as is. Lowercasing will be done based on the current locale.

lower: set to false if you don't want the field's tokens to be downcased

Public Instance methods

Create a new TokenStream to tokenize input. The TokenStream created may also depend on the field_name, although this parameter is typically ignored.

field_name: name of the field to be tokenized
input: data from the field to be tokenized
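As an illustration of why the field name is passed at all, here is a hypothetical plain-Ruby analyzer that chooses a different tokenization per field. It returns plain arrays rather than real TokenStream objects, and every name in it is invented for this sketch:

```ruby
# Hypothetical sketch: an analyzer may want identifiers indexed verbatim
# while body text is downcased and split. Not Ferret's API -- a real
# token_stream must return a TokenStream, not an Array.
class PerFieldSketchAnalyzer
  def token_stream(field_name, input)
    if field_name == :id
      [input]                # leave identifiers untouched
    else
      input.downcase.split   # naive word-splitting for body text
    end
  end
end

analyzer = PerFieldSketchAnalyzer.new
p analyzer.token_stream(:id, "DOC-42")         # ["DOC-42"]
p analyzer.token_stream(:body, "Hello World")  # ["hello", "world"]
```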
