Ferret::Analysis::Analyzer < Object
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.
The default Analyzer just creates a LowerCaseTokenizer, which converts all text to lowercase tokens. See LowerCaseTokenizer for more details.
To create your own custom Analyzer, you simply need to implement a token_stream method which takes the field name and the data to be tokenized as parameters and returns a TokenStream. Most analyzers ignore the field name.
Here we'll create a stemming analyzer:
  class MyAnalyzer < Analyzer
    def token_stream(field, str)
      return StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
    end
  end
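A custom analyzer like this is typically handed to an index when it is created. The following is a minimal usage sketch, assuming Ferret's standard Ferret::Index::Index API and its :analyzer option:

  require 'ferret'
  include Ferret::Analysis

  # Every document added to this index is tokenized, lowercased and
  # stemmed according to the MyAnalyzer policy defined above.
  index = Ferret::Index::Index.new(:analyzer => MyAnalyzer.new)
  index << {:title => "Debating Debates"}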
Create a new LetterAnalyzer which downcases tokens by default but can optionally leave case as is. Lowercasing will be done based on the current locale.
lower: set to false if you don't want the field's tokens to be downcased
  static VALUE
  frb_letter_analyzer_init(int argc, VALUE *argv, VALUE self)
  {
      Analyzer *a;
      /* read the optional +lower+ argument (defaults to true) */
      GET_LOWER(true);
  #ifndef POSH_OS_WIN32
      /* make sure a locale is set so lowercasing is locale-aware */
      if (!frb_locale) frb_locale = setlocale(LC_CTYPE, "");
  #endif
      a = mb_letter_analyzer_new(lower);
      /* wrap the C analyzer struct in the Ruby object */
      Frt_Wrap_Struct(self, NULL, &frb_analyzer_free, a);
      object_add(a, self);
      return self;
  }
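From Ruby, this behaviour is exposed through the optional boolean argument to new. A short sketch, assuming the signature documented above:

  require 'ferret'
  include Ferret::Analysis

  # default: tokens are downcased based on the current locale
  lower_analyzer = LetterAnalyzer.new
  # pass false to leave each token's case as is
  exact_analyzer = LetterAnalyzer.new(false)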
Create a new TokenStream to tokenize input. The TokenStream created may also depend on the field_name, although this parameter is typically ignored.
field_name: name of the field to be tokenized
input: data from the field to be tokenized
  static VALUE
  frb_analyzer_token_stream(VALUE self, VALUE rfield, VALUE rstring)
  {
      /* NOTE: Any changes made to this method may also need to be applied to
       * frb_re_analyzer_token_stream */
      Analyzer *a;
      /* fetch the wrapped Analyzer struct from self */
      GET_A(a, self);
      /* ensure the input is a Ruby String */
      StringValue(rstring);
      return get_rb_ts_from_a(a, rfield, rstring);
  }
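A token stream can also be driven directly, which is handy for checking what an analyzer does to a piece of text. A sketch, assuming Ferret's TokenStream#next and Token#text accessors:

  require 'ferret'
  include Ferret::Analysis

  analyzer = LetterAnalyzer.new
  # the field name is accepted but ignored by the built-in analyzers
  ts = analyzer.token_stream(:title, "One TWO three")
  while token = ts.next
    puts token.text    # => "one", "two", "three"
  end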