CaseInsensitiveWhitespace Tokenizer for Synonyms Analyzer

Nuno Aguiar Dynamicweb Employee

Hi,

 

Consider this Product Name: Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply

  • I can get results searching by "sterile" - because it's getting me all results that have "sterile"
  • I can get results searching by "non sterile" - finding "non" + "sterile"
  • I can get results searching by "non-sterile" - because the "dash" is removed and it's finding "non" + "sterile"
  • But some users search for "nonsterile" - and now I don't get results
    • Synonyms don't work, because the tokenizers will do this
      • Whitespace tokenizer: Non-Sterile Clinisorb Non-Woven Sponges 3" x 4-Ply
        • But this setup did not work
            <group stem="Non-Woven">
              <synonym>nonwoven</synonym>
              <synonym>non-woven</synonym>
            </group>
      • All other tokenizers: non  sterile  clinisorb  woven  sponges  3  x  4-ply
        so we never get non-sterile (lowercase) as a token that we could define synonyms for
        Converting the user's input from "nonsterile" to "non sterile" yields a token that does not exist in the index (because of the space), so we also get no results
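The token mismatch described above can be reproduced with a small Python sketch. The two functions are rough stand-ins, not the actual Lucene tokenizers: `letter_lowercase_tokens` is a hypothetical approximation of the "all other tokenizers" case and splits more aggressively than the output listed above (for example, it also splits "4-Ply").

```python
import re

NAME = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'

def whitespace_tokens(text):
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()

def letter_lowercase_tokens(text):
    # Approximation (assumption): lowercase, then split on anything
    # that is not a letter or digit, so the dash separates tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

ws = whitespace_tokens(NAME)
other = letter_lowercase_tokens(NAME)

# Neither token stream contains the lowercase hyphenated form,
# so a synonym keyed on "non-sterile" has nothing to attach to.
has_lower_hyphenated = "non-sterile" in ws or "non-sterile" in other
```

Under these assumptions `has_lower_hyphenated` is false: the whitespace stream keeps "Non-Sterile" with its original casing, and the other stream only has "non" and "sterile".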

 

We have 2 workarounds:

  1. Tell the customer to have a custom field where they add any keywords this product can be found by
    But that's a lot of overhead/duplicate work for the customer to maintain, especially when we're trying to account for user typos and/or acronym-type situations
  2. Create an index builder extender to create a custom field that injects "nonsterile" when it finds "non-sterile" (and same for other situations like "non-woven")
    But that means the customer can hardly maintain this themselves, which adds cost
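To illustrate workaround 2, here is a minimal Python sketch of the logic such an extender might apply. It is not Dynamicweb's index builder extender API; `collapsed_keywords` is a hypothetical helper that only shows how the dash-less variants could be derived for a custom field.

```python
import re

# Matches words joined by one or more dashes, e.g. "Non-Sterile".
HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")

def collapsed_keywords(name):
    # For every hyphenated word, emit the lowercase form without dashes.
    return [m.group(0).replace("-", "").lower()
            for m in HYPHENATED.finditer(name)]

keywords = collapsed_keywords(
    'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'
)
```

The resulting keywords would be written to the extra field at index time. The rule itself lives in code, which is why the customer cannot easily adjust it.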

 

Here's my current results with the current options:

  • Product Name: Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply

    Focusing on Non-Sterile and Non-Woven
  • With the Whitespace Tokenizer you get: Non-Sterile   Non-Woven
  • With all other tokenizers you get: non   sterile   woven
  • With the Synonyms Analyzer using the CaseInsensitiveWhitespace Tokenizer you would get: non-sterile   non-woven
    This would allow us to create synonyms for:
    • non-sterile = nonsterile
    • non-woven = nonwoven
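A minimal sketch of the requested behaviour, assuming a tokenizer that lowercases and splits on whitespace only, with synonyms expanded at index time. The function names and the `SYNONYMS` table are illustrative and do not come from the Dynamicweb or Lucene APIs.

```python
NAME = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'

# Synonyms keyed on the lowercase hyphenated token (assumed layout).
SYNONYMS = {
    "non-sterile": ["nonsterile"],
    "non-woven": ["nonwoven"],
}

def ci_whitespace_tokens(text):
    # Case-insensitive whitespace tokenizer: lowercase, split on spaces.
    return text.lower().split()

def index_terms(text):
    # Index each token plus any synonyms registered for it.
    terms = set()
    for token in ci_whitespace_tokens(text):
        terms.add(token)
        terms.update(SYNONYMS.get(token, []))
    return terms

def matches(query, text):
    # Every query token must be present among the indexed terms.
    indexed = index_terms(text)
    return all(token in indexed for token in ci_whitespace_tokens(query))
```

With this setup "nonsterile", "NonSterile" and "non-sterile" all match the product name through a single lowercase synonym entry. "non sterile" does not match this field, because "non" and "sterile" are never separate tokens here.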

 

So a new tokenizer would immensely help make Synonyms more useful for "synonym-like" situations.

 

As a parallel need, it would also help to have a CaseInsensitiveWhitespaceAnalyzer available in the Field Types, making them more flexible for scenarios where we would not even need synonyms. I've found this relatively simple to achieve, but it would be useful to have it built into the platform: https://stackoverflow.com/questions/4734095/lucene-insensitive-whitespace-analyzer

 

 

Best Regards,

Nuno Aguiar


Replies

 
Nicolai Pedersen

Hi Nuno

Index the value into 2 fields: one with the synonyms analyzer using the whitespace tokenizer, and one without the synonyms analyzer using the normal Analyze checkbox.

When searching for "non steril"

  • productnameWithSynonyms equals "non steril" - no result
    OR
  • productnameJustAnalyzed equals "non steril" - will give result

When searching for "nonsteril"

  • productnameWithSynonyms equals "non-steril" - will give result
    OR
  • productnameJustAnalyzed equals "non-steril" - no result
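The two-field approach can be sketched as follows. This is a simplified Python model, not the Dynamicweb query engine: the analyzers are approximations, and the synonym table (keyed on the exact-case token, since the whitespace tokenizer preserves case) is an assumption.

```python
import re

NAME = 'Non-Sterile Clinisorb Non-Woven Sponges 3" x 3" 4-Ply'

# Assumed synonym group for the case-sensitive whitespace field.
SYNONYMS = {"Non-Sterile": ["nonsterile", "non-sterile"]}

def analyze_with_synonyms(text):
    # Whitespace tokenizer plus synonym expansion (case preserved).
    terms = []
    for token in text.split():
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))
    return terms

def analyze_plain(text):
    # Stand-in for the normally analyzed field: lowercase, split on
    # non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def field_matches(analyzer, query, text):
    indexed = set(analyzer(text))
    query_terms = analyzer(query)
    return bool(query_terms) and all(t in indexed for t in query_terms)

def search(query, text):
    # OR across the two fields, as in the expressions above.
    return (field_matches(analyze_with_synonyms, query, text)
            or field_matches(analyze_plain, query, text))
```

In this model "non sterile" is found through the plain field and "nonsterile" through the synonyms field. A differently cased query such as "Nonsterile" matches neither field, because the whitespace tokenizer keeps the query's casing.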
 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

I already have that and it works, but I believe you missed the point. It's when you search for "nonsteril" that it does not work (users would type it with no spaces or dashes).

 

Does that make sense?

 

Nuno Aguiar

 
Nicolai Pedersen

Yes, and that would work on the synonyms version of the field using the whitespace analyzer.

 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

 

I figured out why it was not working despite all of my efforts...

(I was comparing with another field that only used the Whitespace Analyzer, and I was unsure what I was supposed to see in LukeNet with the Synonym Analyzer)

 

I used the example config files from https://doc.dynamicweb.com/documentation-9/indexing/other/synonyms, where the config file is named WithSynonym.config; the name should really be WithSynonyms.config. This should be fixed.

 

Looking at the Synonym Analyzer source code, it's really looking for a config file and a synonyms XML file named after the Field Type. The mismatch was causing the config to fall back to (hardcoded) defaults, which is why I was never seeing the changes in LukeNet.
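The failure mode can be illustrated with a small Python sketch of a lookup by Field Type name with a silent fallback. This is not the actual Synonym Analyzer code; `load_analyzer_config` and the placeholder default are hypothetical and only model the behaviour described above.

```python
import tempfile
from pathlib import Path

# Placeholder only; the real hardcoded defaults are not shown in this thread.
HARDCODED_DEFAULTS = "<defaults/>"

def load_analyzer_config(config_dir, field_type_name):
    """Return (config_text, used_fallback) for the given Field Type."""
    candidate = Path(config_dir) / f"{field_type_name}.config"
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8"), False
    # No file named after the Field Type: fall back without any error.
    return HARDCODED_DEFAULTS, True

with tempfile.TemporaryDirectory() as tmp:
    # The file on disk is missing the trailing "s".
    (Path(tmp) / "WithSynonym.config").write_text("<custom/>", encoding="utf-8")
    text, used_fallback = load_analyzer_config(tmp, "WithSynonyms")
```

Here the custom file is never read, and nothing signals the mismatch, which is consistent with the configuration changes not showing up in LukeNet.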

 

I also still feel there's something to be said for a case-insensitive whitespace analyzer (at least the analyzer, not so much the tokenizer, since I can now make it work).

 

Thanks,

Nuno Aguiar

 

 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

Attached is the reason why it's important that the Whitespace Tokenizer be case insensitive. Currently Biolegend needs to use the Whitespace tokenizer on Synonyms, but that forces us to set up synonyms for every possible casing a user might type, instead of just the lowercase version.

 

Hope it makes more sense now.

 

Best Regards,

Nuno Aguiar

DuplicateSynonyms_accurate.gif
 
Nicolai Pedersen

Hi Nuno

The latest Lucene package has a new whitespace analyzer that is case insensitive - DynamicwebCaseInsensitiveWhitespaceAnalyzer

Find it here: https://www.nuget.org/packages/Dynamicweb.Indexing.Lucene/

BR Nicolai

 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

But my use case is using the Synonyms Analyzer with the Whitespace tokenizer. Unless I am missing something.

 

Nuno

 
Nicolai Pedersen

ok.

Try the attached provider. Devops#1687

BR Nicolai

 
Nuno Aguiar Dynamicweb Employee

Works like a charm. Thanks a lot