Posted on 04/02/2021 09:49:34
I think you do not understand how an analyzer works and what it does. You have to understand that, otherwise you will never get to the bottom of this.
- SynonymsAnalyzer
- Gets a string in
- Tokenizes it - splits it into terms. How depends on the tokenizer
- Applies filters - lowercase, stop words, other funny stuff
- Applies the synonym filter - replaces one term with another term. It does not replace part of the original string, or add to it, or anything like that.
Using the keyword tokenizer would get "Mongoose HW Commuter bike 2021" in and send "Mongoose HW Commuter bike 2021" out in step 2 - and that entire string becomes the term sent to the filters - lowercase: ok, stop word: no, and the term (the entire string) is not in your synonym list, so it will not get stemmed unless you add the entire term <synonym>Mongoose HW Commuter bike 2021</synonym> to the list.
- Using a keyword tokenizer will get 'Wood ply 12" x 10"' in and 'Wood ply 12" x 10"' out
- Using a whitespace tokenizer will get 'Wood ply 12" x 10"' in and 'wood', 'ply', '12"', 'x', '10"' out
- Using a standard tokenizer will get 'Wood ply 12" x 10"' in and 'wood', 'ply', '12', 'x', '10' out
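To make the three cases above concrete, here is a rough Python sketch. It is not Lucene code: the function names are made up for illustration, and the "standard" one is only a crude approximation that keeps runs of letters and digits. Lowercasing is modelled as a separate filter applied after tokenizing.

```python
import re

def keyword_tokenize(text):
    # Keyword tokenizer: the whole input becomes one single term.
    return [text]

def whitespace_tokenize(text):
    # Whitespace tokenizer: split on whitespace only, punctuation stays.
    return text.split()

def standard_like_tokenize(text):
    # Crude stand-in for a standard tokenizer (assumption): keep runs of
    # letters and digits, drop punctuation such as the inch mark.
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase_filter(tokens):
    # Lowercase filter: runs on the terms, after tokenizing.
    return [t.lower() for t in tokens]

text = 'Wood ply 12" x 10"'
for tokenize in (keyword_tokenize, whitespace_tokenize, standard_like_tokenize):
    print(tokenize.__name__, lowercase_filter(tokenize(text)))
```

Whatever each line prints is the exact form a synonym entry would have to take for that tokenizer.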
What comes out of the tokenizers is what you need to put in your synonym list. And there can be casing differences.
So these would be the synonyms needed to get a match in the 3 scenarios above, using different tokenizers on the same incoming string:
<group stem="12ply">
- Using a keyword tokenizer: <synonym>Wood ply 12" x 10"</synonym>
- Using a whitespace tokenizer: <synonym>12"</synonym>
- Using a standard tokenizer: <synonym>12</synonym>
When you use the synonyms analyzer, the original string comes in, i.e. Wood ply 12" x 10". Then it is passed to a tokenizer and broken into tokens, as in the 3 examples above.
Then it looks in the list of synonyms for a match - as in the synonyms list above - and returns the stem ('12ply') to replace the original token/term. So in example 2, '12"' becomes '12ply', the stem of that group: the term '12"' is removed and replaced with '12ply' before the terms are put in the index. Searching for 'ply 12"' passes that input string to the same synonyms analyzer with the same tokenizer, which splits it into 'ply' and '12"' and runs it through the synonyms, changing it to 'ply' and '12ply'. That is what is used for the query - and it will match.
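The index/search round trip for the whitespace case can be sketched like this in Python. It is a simplified model, not the real analyzer: `SYNONYMS` and `analyze` are hypothetical names, and it assumes a whitespace tokenizer, a lowercase filter and single-term synonyms only.

```python
# Hypothetical synonym table: synonym term -> stem of its group.
SYNONYMS = {'12"': "12ply"}

def analyze(text):
    # Whitespace tokenizer, then lowercase filter, then synonym filter.
    tokens = [t.lower() for t in text.split()]
    # Each term is either replaced by its stem or passed through unchanged.
    return [SYNONYMS.get(t, t) for t in tokens]

# Indexing and searching go through the very same analyze() function.
index_terms = set(analyze('Wood ply 12" x 10"'))
query_terms = analyze('ply 12"')

# The query matches when every query term is found in the index.
matches = all(t in index_terms for t in query_terms)
print(query_terms, matches)
```

Note that '12"' itself never reaches the index here; only the stem '12ply' does, on both sides.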
Now - adding multi-word entries to the synonyms list is a huge challenge, because they do not fit how Lucene works with the terms coming out of the tokenizers:
Consider this:
<group stem="12ply">
- <synonym>Wood ply 12" x 10"</synonym>
- <synonym>12 ply</synonym>
- <synonym>12 inch</synonym>
If the synonyms analyzer replaced '12 ply' with '12ply', you could not search for '12' or 'ply' alone any more, because they would have been removed before indexing.
So what happens here is that 'Board 12 ply' comes in, and 'board', '12ply', '12', 'ply' and '12 ply' come out. This is great! Now all of it can be searched.
But if your incoming string is 'Board 12ply' (now without a space), 'Board' and '12ply' come out of the tokenizer, '12ply' is not found in the synonyms list, and no changes to the token stream happen.
Now - having indexed 'Board 12ply', someone searches for '12 ply'. It goes into the tokenizer and becomes '12' and 'ply', which are then passed to the synonyms filter. That locates the synonym '12 ply' and returns (because it is 2 words) '12', 'ply' and '12 ply', and applies all of them to the query. But you do not have '12' and 'ply' in the index for this product.
It is the exact same analyzer that is applied to search strings and strings being indexed.
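Here is a small Python sketch of that mismatch. It is only a model of the behaviour described above, not the actual implementation: `PHRASES`, `expand` and `analyze` are hypothetical names, the expansion copies the 'Board 12 ply' output (stem, the single words, and the phrase), and it assumes every query term must be present in the index.

```python
# Hypothetical multi-word synonyms: tuple of terms -> stem of the group.
PHRASES = {("12", "ply"): "12ply"}

def expand(tokens):
    out, i = [], 0
    while i < len(tokens):
        for phrase, stem in PHRASES.items():
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                # Multi-word hit: emit the stem, the words, and the phrase.
                out += [stem, *phrase, " ".join(phrase)]
                i += n
                break
        else:
            # No phrase starts here: the term passes through unchanged.
            out.append(tokens[i])
            i += 1
    return out

def analyze(text):
    # Same analyzer for indexing and for searching.
    return expand([t.lower() for t in text.split()])

print(analyze("Board 12 ply"))  # phrase found, stream is expanded

index_terms = set(analyze("Board 12ply"))  # no phrase found, nothing added
query_terms = analyze("12 ply")
missing = [t for t in query_terms if t not in index_terms]
print(missing)  # query terms this product does not have in the index
```

The indexed side never saw a space, so it only holds 'board' and '12ply', while the query side asks for the separate words as well.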
So you have to wrap your head around these tokenizers and filters to understand.
But yes - I can get that to work - using the information above:
<group stem="hw">
<synonym>hello world</synonym>
</group>
<group stem="hello world">
<synonym>hw</synonym>
</group>
Capiche?