CaseInsensitiveWhitespace Tokenizer for Synonyms Analyzer


Nuno Aguiar

Posted on 25/02/2021 13:12:52

Hi,

Consider this Product Name: Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply

I can get results searching by "sterile" - because it's getting me all results that have "sterile"
I can get results searching by "non sterile" - finding "non" + "sterile"
I can get results searching by "non-sterile" - because the "dash" is removed and it's finding "non" + "sterile"
But some users search for "nonsterile" - and now I don't get results
- Synonyms don't work, because the tokenizers will do this
  - Whitespace tokenizer: Non-Sterile Clinisorb Non-Woven Sponges 3" x 4-Ply
    - But this setup did not work
      <group stem="Non-Woven">
      <synonym>nonwoven</synonym>
      <synonym>non-woven</synonym>
      </group>
  - All other tokenizers: non sterile clinisorb woven sponges 3 x 4-ply
    so we'll never have non-sterile (lowercase) as a token that we could define a synonym for
    converting the user's input from "nonsterile" to "non sterile" results in a token that does not exist in the index (because of the space) - so we also get no results
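To illustrate the token mismatch described above, here is a minimal Python sketch. It is not Lucene or Dynamicweb code: `whitespace_tokens` and `split_and_lowercase_tokens` are hypothetical stand-ins that only approximate the two tokenizer families, and the second one will not reproduce any specific Lucene tokenizer exactly.

```python
import re

PRODUCT_NAME = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'


def whitespace_tokens(text):
    """Split on whitespace only; case and punctuation are kept as typed."""
    return text.split()


def split_and_lowercase_tokens(text):
    """Rough stand-in (assumption) for analyzers that lowercase and break on punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    print(whitespace_tokens(PRODUCT_NAME))
    print(split_and_lowercase_tokens(PRODUCT_NAME))
    # Neither token list contains the lowercase token "non-sterile",
    # so a synonym keyed on it has nothing to attach to.
    print("non-sterile" in whitespace_tokens(PRODUCT_NAME))
    print("non-sterile" in split_and_lowercase_tokens(PRODUCT_NAME))
```

In this sketch the first tokenizer keeps "Non-Sterile" with its capitals and the second splits it into "non" and "sterile", which is the gap the request is about.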

We have 2 workarounds:

- Tell the customer to maintain a custom field where they add any keywords this product can be found by
  - But that's a lot of overhead/duplicate work for the customer to maintain, especially when we're trying to account for user typos and/or acronym-type situations
- Create an index builder extender that adds a custom field and injects "nonsterile" when it finds "non-sterile" (and the same for other situations like "non-woven")
  - But that means the customer can hardly maintain this themselves, which adds cost

Here's my current results with the current options:

Product Name: Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply

Focusing on Non-Sterile and Non-Woven:
With the Whitespace Tokenizer you get: Non-Sterile Non-Woven
With all the other tokenizers you get: non sterile woven
With the Synonyms Analyzer using a CaseInsensitiveWhitespace Tokenizer you would get: non-sterile non-woven
This would allow us to create synonyms for:
- non-sterile = nonsterile
- non-woven = nonwoven

So a new tokenizer would immensely help make Synonyms more useful for "synonym-like" situations.
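A minimal Python sketch of the requested behaviour, for illustration only: the tokenizer lowercases but splits on whitespace alone, and each token is expanded with its synonyms at index time. The function names and the `SYNONYMS` dictionary are hypothetical and do not reflect how the Dynamicweb Synonyms Analyzer is implemented.

```python
PRODUCT_NAME = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'

# Hypothetical synonym table, keyed on the lowercase whitespace token.
SYNONYMS = {
    "non-sterile": ["nonsterile"],
    "non-woven": ["nonwoven"],
}


def case_insensitive_whitespace_tokens(text):
    """Lowercase the text, then split on whitespace only (dashes survive)."""
    return text.lower().split()


def index_terms(text, synonyms):
    """Return every token plus the synonyms registered for that token."""
    terms = set()
    for token in case_insensitive_whitespace_tokens(text):
        terms.add(token)
        terms.update(synonyms.get(token, []))
    return terms


def matches(query, text, synonyms):
    """A single-term query matches when its lowercase form is an indexed term."""
    return query.lower() in index_terms(text, synonyms)


if __name__ == "__main__":
    for query in ("nonsterile", "NonWoven", "Non-Sterile", "sterile"):
        print(query, matches(query, PRODUCT_NAME, SYNONYMS))
```

Note that in this sketch a search for "sterile" alone does not match this field, because the dash is no longer a token boundary. A second, normally analyzed field would still be needed for that case.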

As a parallel need/usage, it would also be helpful to have a CaseInsensitiveWhitespaceAnalyzer available in the Field Types, making them more flexible for some scenarios (where we would not even need synonyms). I've found this to be relatively simple to achieve, but it would be useful if it came built into the platform: https://stackoverflow.com/questions/4734095/lucene-insensitive-whitespace-analyzer

Best Regards,

Nuno Aguiar

Replies

Nicolai Pedersen

Posted on 25/02/2021 13:49:00

Hi Nuno

Index the value into 2 fields. One with the synonyms analyzer using whitespace tokenizer and one without the synonyms analyzer using normal analyze checkbox.

When searching for "non steril"

productnameWithSynonyms equals "non steril" - no result
OR
productnameJustAnalyzed equals "non steril" - will give result

When searching for "nonsteril"

productnameWithSynonyms equals "non-steril" - will give result
OR
productnameJustAnalyzed equals "non-steril" - no result
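The two-field approach can be sketched in Python as follows. This is an illustration under assumptions, not Dynamicweb code: the field names come from the post above, the synonyms field is modelled as a case-sensitive whitespace split with a hypothetical `SYNONYMS` table, and the analyzed field as a simple lowercase split on punctuation. The sketch uses the full spelling "sterile".

```python
import re

PRODUCT_NAME = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'

# Hypothetical synonym table; keys must match the case-sensitive tokens.
SYNONYMS = {"Non-Sterile": ["nonsterile"], "Non-Woven": ["nonwoven"]}


def synonyms_field_terms(text):
    """Whitespace split (case-sensitive), expanded with synonyms."""
    terms = set()
    for token in text.split():
        terms.add(token)
        terms.update(SYNONYMS.get(token, []))
    return terms


def analyzed_field_terms(text):
    """Stand-in (assumption) for a normally analyzed field."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, product_name):
    """Match when all query terms are found in either field (OR across fields)."""
    ws_terms = query.split()
    an_terms = re.findall(r"[a-z0-9]+", query.lower())
    in_synonyms = bool(ws_terms) and all(
        t in synonyms_field_terms(product_name) for t in ws_terms
    )
    in_analyzed = bool(an_terms) and all(
        t in analyzed_field_terms(product_name) for t in an_terms
    )
    return {
        "productnameWithSynonyms": in_synonyms,
        "productnameJustAnalyzed": in_analyzed,
        "hit": in_synonyms or in_analyzed,
    }


if __name__ == "__main__":
    for query in ("non sterile", "nonsterile", "NONSTERILE"):
        print(query, search(query, PRODUCT_NAME))
```

Because the synonyms field is case-sensitive in this sketch, "NONSTERILE" finds nothing in either field unless that exact casing is also listed as a synonym.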

Nuno Aguiar

Posted on 25/02/2021 18:21:52

Hi Nicolai,

I already have that and it works, but I believe you missed the point. It's when you search for "nonsteril" that it does not work (users would type it with no spaces or dashes).

Does that make sense?

Nuno Aguiar

Nicolai Pedersen

Posted on 25/02/2021 20:15:24

Yes, and that would work on the synonyms version of the field using the whitespace analyzer.

Nuno Aguiar

Posted on 26/02/2021 11:22:11

Hi Nicolai,

I figured out why it was not working despite all of my efforts...

(I was comparing with another field that only used the Whitespace Analyzer, and I was unsure what I was supposed to see in LukeNet with the Synonym Analyzer)

I used the example config files from https://doc.dynamicweb.com/documentation-9/indexing/other/synonyms, where the config file is named WithSynonym.config. The name should really be WithSynonyms.config. This should be fixed.

Looking at the Synonym Analyzer source code, it's really looking for a config and a synonyms XML file named after the Field Type. This was causing the config to fall back to (hardcoded) defaults, which is why I was never seeing the changes in LukeNet.
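A minimal Python sketch of why the misnamed file went unnoticed, assuming only what is described above: the lookup is by Field Type name and a missing file silently falls back to defaults. The `load_settings` function, the `DEFAULT_SETTINGS` values and the file contents are all hypothetical and are not the actual Dynamicweb implementation.

```python
import tempfile
from pathlib import Path

# Hypothetical hardcoded defaults used when no matching config file exists.
DEFAULT_SETTINGS = {"tokenizer": "default", "source": "hardcoded defaults"}


def load_settings(folder, field_type_name):
    """Look for '<field type name>.config'; fall back to defaults if it is missing."""
    config_path = Path(folder) / f"{field_type_name}.config"
    if not config_path.is_file():
        # No error is raised, so a misnamed file is easy to miss.
        return dict(DEFAULT_SETTINGS)
    return {"tokenizer": config_path.read_text().strip(), "source": config_path.name}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        misnamed = Path(folder) / "WithSynonym.config"
        misnamed.write_text("whitespace")
        print(load_settings(folder, "WithSynonyms"))  # falls back to defaults

        misnamed.rename(Path(folder) / "WithSynonyms.config")
        print(load_settings(folder, "WithSynonyms"))  # now the file is picked up
```

Checking which source the settings came from, as the "source" key does here, is one way to spot this kind of silent fallback.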

I also still feel there's a case to be made for a case-insensitive whitespace analyzer (at least the analyzer, not so much the tokenizer, since I can now make it work).

Thanks,

Nuno Aguiar

Posted on 13/04/2021 14:50:25

Hi Nicolai,

Attached is the reason why it's important that the Whitespace Tokenizer be case insensitive. Currently Biolegend needs to use the Whitespace tokenizer on Synonyms, but that forces us to set up synonyms for every possible casing a user might type, instead of just the lowercase version.

Hope it makes more sense now.

Best Regards,

Nuno Aguiar

Nicolai Pedersen

Posted on 13/04/2021 16:24:14

Hi Nuno

The latest Lucene package has a new whitespace analyzer that is case insensitive - DynamicwebCaseInsensitiveWhitespaceAnalyzer

Find it here: https://www.nuget.org/packages/Dynamicweb.Indexing.Lucene/

BR Nicolai

Nuno Aguiar

Posted on 13/04/2021 16:32:34

Hi Nicolai,

But my use case is using the Synonyms Analyzer with the Whitespace tokenizer. Unless I am missing something.

Nuno

Nicolai Pedersen

Posted on 13/04/2021 16:50:20

ok.

Try the attached provider. Devops#1687

BR Nicolai

Dynamicweb.Indexing.Lucene.zip

Nuno Aguiar

Posted on 13/04/2021 17:08:27

Looks like a charm. Thanks a lot
