Hi,
Consider this Product Name: Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply
- I can get results searching by "sterile" - because it's getting me all results that have "sterile"
- I can get results searching by "non sterile" - finding "non" + "sterile"
- I can get results searching by "non-sterile" - because the "dash" is removed and it's finding "non" + "sterile"
- But some users search for "nonsterile" - and now I don't get results
- Synonyms don't work, because the tokenizers will do this
- Whitespace tokenizer: Non-Sterile Clinisorb Non-Woven Sponges 3" x 4-Ply
- I tried the following synonym group, but it did not work:
<group stem="Non-Woven">
<synonym>nonwoven</synonym>
<synonym>non-woven</synonym>
</group>
- All other tokenizers: non sterile clinisorb woven sponges 3 x 4-ply
so we never get non-sterile (lowercase) as a token that we could define a synonym for.
Converting the user's input from "nonsterile" to "non sterile" produces a nonexistent token (because of the space), so we also get no results.
We have 2 workarounds:
- Tell the customer to have a custom field where they add any keywords this product can be found by
But that's a lot of overhead and duplicate work for the customer to maintain, especially when we're trying to account for user typos and/or acronym-type situations
- Create an index builder extender to create a custom field that injects "nonsterile" when it finds "non-sterile" (and same for other situations like "non-woven")
But that means the customer can hardly maintain this on their own, which adds cost
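To illustrate the second workaround, here is a minimal Python sketch of the logic such an extender would need. It is not the platform's index builder API; `extra_keywords` and the regular expression are hypothetical names, and it assumes we only want to collapse letter-only hyphenated words:

```python
import re

# Assumption: only letter-hyphen-letter words (e.g. "Non-Sterile") are collapsed;
# values such as "4-Ply" are left alone because they start with a digit.
HYPHENATED = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")

def extra_keywords(product_name):
    """Return collapsed, lowercased variants of hyphenated words."""
    return [(left + right).lower() for left, right in HYPHENATED.findall(product_name)]

PRODUCT = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'

# Value that the hypothetical custom field would hold for this product.
print(extra_keywords(PRODUCT))  # ['nonsterile', 'nonwoven']
```

Even in this small form the rule lives in code, which is why the customer could not adjust it without a developer.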
Here are my results with the current options:
- Product Name: Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply
Focusing on Non-Sterile and Non-Woven:
- With the Whitespace Tokenizer you get: Non-Sterile Non-Woven
- With all other Tokenizers you get: non sterile woven
- With the Synonyms Analyzer using the CaseInsensitiveWhitespace Tokenizer you would get: non-sterile non-woven
This would allow us to create synonyms for:
- non-sterile = nonsterile
- non-woven = nonwoven
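As a rough illustration of what I'm asking for, here is a minimal Python sketch of a case-insensitive whitespace tokenizer combined with a synonym lookup. This is not how the platform or Lucene implements it; the function names and the `SYNONYMS` map are hypothetical and only model the behavior described above:

```python
def case_insensitive_whitespace_tokenize(text):
    """Split on whitespace only and lowercase each token (hyphens are kept)."""
    return [token.lower() for token in text.split()]

# Hypothetical synonym map: each variant points at the token that is indexed.
SYNONYMS = {
    "nonsterile": "non-sterile",
    "nonwoven": "non-woven",
}

def normalize_query(query):
    """Tokenize the query and replace known variants with the indexed token."""
    tokens = case_insensitive_whitespace_tokenize(query)
    return [SYNONYMS.get(token, token) for token in tokens]

def matches(query, product_name):
    """True when every normalized query token is among the product's tokens."""
    indexed = set(case_insensitive_whitespace_tokenize(product_name))
    return all(token in indexed for token in normalize_query(query))

PRODUCT = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'

print(matches("nonsterile", PRODUCT))
print(matches("NonWoven Sponges", PRODUCT))
```

In this sketch a query for just "sterile" would not match, because the hyphenated token is kept whole. So I see this as something used by the Synonyms Analyzer alongside the existing tokenizers, not as a replacement for them.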
So a new tokenizer would immensely help make Synonyms more useful for "synonym-like" situations.
As a parallel need, adding a CaseInsensitiveWhitespaceAnalyzer to the Field Types would also help, making them more flexible for some scenarios (where we would not even need synonyms). I've found this relatively simple to achieve, but it would be useful to have it built into the platform: https://stackoverflow.com/questions/4734095/lucene-insensitive-whitespace-analyzer
Best Regards,
Nuno Aguiar