Class | Ferret::Analysis::Token |
In: | ext/r_analysis.c |
Parent: | Object |
A Token is an occurrence of a term from the text of a field. It consists of a term's text and the start and end offsets of the term in the text of the field.
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.
text: | the term's text, which may have been modified by a Token Filter or Tokenizer from the text originally found in the document |
start: | the position of the first character corresponding to this token in the source text |
end: | one greater than the position of the last character corresponding to this token. Note that the difference between @end_offset and @start_offset may not be equal to @text.length(), as the term text may have been altered by a stemmer or some other filter. |
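A minimal sketch of how these attributes relate to the source text (the source string and offset values are made up for illustration, and the token is built with the constructor described below):

  require 'ferret'
  include Ferret::Analysis

  source = "The Quick Brown Fox"

  # A lowercasing analyzer might produce this token for "Quick".
  token = Token.new("quick", 4, 9, 1)

  token.text                                              #=> "quick"
  source.byteslice(token.start, token.end - token.start)  #=> "Quick", the original text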
Creates a new token setting the text, start and end offsets of the token and the position increment for the token.
The position increment is usually set to 1 but you can set it to other values as needed. For example, if you have a stop word filter you will be skipping tokens. Let's say you have the stop words "the" and "and" and you parse the title "The Old Man and the Sea". The terms "Old", "Man" and "Sea" will have the position increments 2, 1 and 3 respectively.
Another reason you might want to vary the position increment is if you are adding synonyms to the index. For example, let's say you have the synonym group "quick", "fast" and "speedy". When tokenizing the phrase "Next day speedy delivery", you'll add "speedy" first with a position increment of 1 and then "fast" and "quick" with position increments of 0, since they occupy the same position. Both cases are sketched after the parameter list below.
The offset values start and end should be byte offsets, not character offsets. This makes it easy to use the offsets to quickly access the token in the input string and to insert highlighting tags when necessary.
text: | the main text for the token. |
start: | the start offset of the token in bytes. |
end: | the end offset of the token in bytes. |
pos_inc: | the position increment of a token. See above. |
return: | a newly created and assigned Token object |
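As a rough sketch of the position increments described above (the byte offsets are hand-counted for the example phrases and are only illustrative):

  require 'ferret'
  include Ferret::Analysis

  # "The Old Man and the Sea" with the stop words "the" and "and" skipped.
  # Each increment records how far the term is ahead of the previous one,
  # so the skipped stop words leave gaps.
  old = Token.new("Old", 4, 7, 2)    # skipped "The"
  man = Token.new("Man", 8, 11, 1)
  sea = Token.new("Sea", 20, 23, 3)  # skipped "and the"

  # "Next day speedy delivery" with the synonyms "fast" and "quick" added
  # at the same position as "speedy" by giving them an increment of 0.
  speedy = Token.new("speedy", 9, 15, 1)
  fast   = Token.new("fast",   9, 15, 0)
  quick  = Token.new("quick",  9, 15, 0)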
Used to compare two tokens. Token is extended by Comparable so you can also use +<+, +>+, +<=+, +>=+ etc. to compare tokens.
Tokens are sorted by the position in the text at which they occur, i.e. the start offset. If two tokens have the same start offset (see pos_inc=), they are sorted by the end offset and then lexically by the token text.
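For example (a sketch with made-up tokens from the string "The Quick Brown Fox"):

  require 'ferret'
  include Ferret::Analysis

  fox   = Token.new("fox",   16, 19, 1)
  brown = Token.new("brown", 10, 15, 1)

  brown < fox                           #=> true, "brown" starts earlier
  [fox, brown].sort.map { |t| t.text }  #=> ["brown", "fox"]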
Set the position increment. This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.
The default value is 1.
Some common uses for this are:

* Set it to 0 to put multiple terms, such as synonyms, in the same position.
* Set it to a value greater than 1 when tokens, such as stop words, have been removed from the stream and you want phrase searches to reflect the gap.
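A short sketch of the second case above (the token values are made up for illustration):

  require 'ferret'
  include Ferret::Analysis

  token = Token.new("sea", 20, 23, 1)

  # A stop word filter removed "and the" before this term, so bump the
  # increment to preserve the gap for phrase searches.
  token.pos_inc = 3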