Synonym with a quote character

Nuno Aguiar

Posted on 02/02/2021 21:22:28

Hi,

I need some help getting Synonyms to work with some special characters. The use case is the medical/scientific business, which has some special needs.

Considering the term 10" (notice that it has the quote char)

I have the synonym config set to use the WhitespaceTokenizer, but it is not treating the string the same way as the WhitespaceAnalyzer. Attached you can see that the field with the synonym analyzer is not generating the terms as expected (that is, the way the WhitespaceAnalyzer does).

Is this expected? Is there another way to get both?

My next question is how to encode these special chars in the Synonyms XML, especially the quote char:

<?xml version="1.0" encoding="utf-8" ?>
<synonyms>
  <group stem="12"">
    <synonym>12x12</synonym>
  </group>
</synonyms>
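
On the encoding part of the question: in XML, a double quote inside a double-quoted attribute value is normally written as the predefined entity &quot;, so the stem above would read stem="12&quot;". Below is a minimal Python sketch of that escaping, round-tripped through a standard XML parser. The helper name escape_attr is hypothetical, and whether the quote then survives the tokenizer used by the synonym analyzer is a separate matter that this sketch does not cover.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_attr(value: str) -> str:
    """Escape text for a double-quoted XML attribute (hypothetical helper)."""
    # escape() handles &, < and >; the extra mapping adds the double quote.
    return escape(value, {'"': "&quot;"})


stem = escape_attr('12"')
xml_text = (
    '<?xml version="1.0" encoding="utf-8" ?>\n'
    "<synonyms>\n"
    f'  <group stem="{stem}">\n'
    "    <synonym>12x12</synonym>\n"
    "  </group>\n"
    "</synonyms>\n"
)

# Parse the document back to confirm it is well-formed and that the
# parser decodes the entity to the original quote character.
root = ET.fromstring(xml_text.encode("utf-8"))
parsed_stem = root.find("group").get("stem")

print(stem)
print(parsed_stem)
```

Inside element content, such as a <synonym> value, a literal " needs no escaping; only the attribute value does.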

I currently have 2 customers looking for this, both of them live, so if there's a quick way for me to get this solved, I'd highly appreciate it.

Best Regards,

Nuno Aguiar

Replies

Nicolai Pedersen

Posted on 03/02/2021 09:41:46

Lucene does not support it with its currently available analyzers. You have to do something in the browser, or create your own synonyms analyzer that uses a custom token filter that basically stops analyzing and understands what you want...

But '12"' analyzed becomes 12. And 12" x 10" analyzed becomes 12, x, 10 and will work with no synonyms.

It looks like you are trying to use synonyms for something it is not made for. And that is not easy.

A term is a term - a phrase split by a tokenizer, usually using whitespace and removing non-alpha characters. Synonyms change one term to another - meaning " is disregarded. You have phrases that you want to be treated as one term using human logic. A computer will have a hard time doing that.

So you would have to create a tokenizer that understands that logic - seems like you need a tokenizer that will use your synonyms list instead of using a synonyms filter on top of another tokenizer.

Nuno Aguiar

Posted on 03/02/2021 11:44:14

Hi Nicolai,

It looks like you are trying to use synonyms for something it is not made for. And that is not easy.

I understand why you'd think we're using Synonyms the wrong way, and the 2" may have been a poor choice of example. The reality is that we do have some industry terms with special chars that people relate to / search by differently (the same could be said for acronyms).

You have phrases that you want to be treated as one term using human logic. A computer will have a hard time doing that.

Yes, I understand that, and to be clear (especially for anyone reading this in the future), we're not using Synonyms for Description fields, only for Product Names (and potentially one or 2 custom text fields - not long text nor RTE).

The reason "Synonyms" are the perceived solution is that it is unrealistic to ask our customers to inject keywords on every product to account for those synonyms. Essentially, instead of setting it up once in WithSynonyms.xml, they would have to do it thousands of times, once per product.

So you would have to create a tokenizer that understands that logic - seems like you need a tokenizer that will use your synonyms list instead of using a synonyms filter on top of another tokenizer.

I agree. And the more I think about it and look at their base data, the more I need to delve into this. I do have a few options to consider (probably several in parallel):

At index time (IndexBuilderExtender) add terms based on some logic
- It adds considerable time to running the index depending on the logic
- Prevents customer from having to maintain some "synonyms" per product
Add a new tokenizer to use our synonym filter
- If I understood DW's source code we need a new Analyzer. I can use the Synonym Analyzer as inspiration

The hardest part is to make this simple enough for the customer. Hopefully they can have 1 or 2 places to handle this data.

Another challenge is addressing the contradictory expectations around free text search, when they want "Hello world" to match "HW" and also want to search for "Hello" and for "World" separately. It gets worse when they have "Hello World 2021" or "2021 Hello World" and want to match "HW", "Hello", "World" and "2021". So far I haven't been smart enough to crack that one (if that will ever be possible without a huge performance hit).

Nuno Aguiar

Nicolai Pedersen

Posted on 03/02/2021 15:00:32

Hi Nuno.

Your last example I have no issues with. See dumps.

'12"' going into the synonyms analyzer is tokenized before it goes into the filter and gets replaced - meaning it will be just '12'. The tokenizer removes everything that is not alphanumeric. So if you want to create a synonym for 'ABC!"#¤%&/(()', the stem is just ABC - the rest is removed by whatever token filter you use.

So something like this:

<group stem="12">
  <synonym>12 ply</synonym>
  <synonym>12-ply</synonym>
  <synonym>12x12</synonym>
</group>

It works fine for me. I can search for 12, 12-ply, 12 ply, 12x12 and find the same product. See next post with more dumps...

So - either you are overdoing this, or your setup is overly complicated or something.

Nicolai Pedersen

Posted on 03/02/2021 15:02:33

And searching for 12"

Nuno Aguiar

Posted on 04/02/2021 00:09:29

Hi Nicolai,

Thank you for looking into this with me. The main difference I see is that:

In your example the data on the product = "Hello World" (2 words)
In my example the data on the product = "HW" (1 word)

Here's a simplified scenario:

A brand called DawnMist (1 word) but people also search for "Dawn Mist"
Another example is toothbrush (1 word) and people would also search for "tooth brush" (2 words)

Can you get it to work with the example scenarios above?

Best Regards,

Nuno Aguiar

Posted on 04/02/2021 00:13:51

And please consider that the actual product name would be something like "Mongoose HW Commuter bike 2021".

This is what happens with the toothbrush example - https://dukal.com/products?Search=toothbrush

I tried to use the KeywordTokenizer to consider the entire "sentence" and then use ContainsExtended as the operator, with no success either.

Nicolai Pedersen

Posted on 04/02/2021 09:49:34

I think you do not understand how an analyzer works and what it does. You have to understand that, otherwise you will never get to the bottom of this.

SynonymsAnalyzer
1. Gets string in
2. Tokenizes it - split it into terms. How depends on the tokenizer
3. Add filters - lowercase, stopword, other funny stuff
4. Add synonym filter - replaces one term with another term. It does not replace a string in the original value, or add anything.

Using the keyword tokenizer would take "Mongoose HW Commuter bike 2021" in and send "Mongoose HW Commuter bike 2021" out in step 2, and that entire string becomes the term sent to the filters. Lowercase: ok. Stop word: no. The term (the entire string) is not in your synonym list, so it will not get stemmed unless you add the entire term <synonym>Mongoose HW Commuter bike 2021</synonym> to the list.

Using a keyword tokenizer will get 'Wood ply 12" x 10"' in and 'Wood ply 12" x 10"' out
Using a whitespace tokenizer will get 'Wood ply 12" x 10"' in and 'wood', 'ply', '12"', 'x', '10"' out
Using a standard tokenizer will get 'Wood ply 12" x 10"' in and 'wood', 'ply', '12', 'x', '10' out
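
The three cases above can be imitated with a few lines of Python. This is only a rough sketch, not Lucene code: the function names are made up for illustration, the "standard-like" tokenizer is approximated as "keep runs of ASCII letters and digits", and lowercasing is applied as a separate step to mirror the lowercase filter.

```python
import re


def keyword_tokenize(text: str) -> list[str]:
    # The whole input becomes one single term.
    return [text]


def whitespace_tokenize(text: str) -> list[str]:
    # Split on whitespace only; punctuation such as " stays attached.
    return text.split()


def standard_like_tokenize(text: str) -> list[str]:
    # Approximation (assumption): keep runs of ASCII letters and digits only.
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens: list[str]) -> list[str]:
    # Stand-in for a lowercase filter applied after tokenizing.
    return [t.lower() for t in tokens]


text = 'Wood ply 12" x 10"'
keyword_terms = keyword_tokenize(text)
whitespace_terms = lowercase(whitespace_tokenize(text))
standard_terms = lowercase(standard_like_tokenize(text))

print(keyword_terms)
print(whitespace_terms)
print(standard_terms)
```

Whatever list comes out for the tokenizer in use is the set of terms the synonym filter gets to see.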

What comes out of the tokenizers is what you need to put in your synonym list. And there can be casing differences.

So these would be the synonyms needed to get matches in the 3 scenarios above, using different tokenizers on the same incoming string:

Using a keyword tokenizer: <synonym>Wood ply 12" x 10"</synonym>
Using a whitespace tokenizer: <synonym>12"</synonym>
Using a standard tokenizer: <synonym>12</synonym>

When you use the synonyms analyzer the original string comes in, i.e. Wood ply 12" x 10". Then it is passed to a tokenizer and broken into tokens. As the 3 examples above.

Then it looks in the list of synonyms for a match, as in the synonyms list above, and returns the stem ('12ply') for any matching value to replace the original token/term. So in example 2, '12"' becomes '12ply', the stem of that group: the term '12"' is removed before terms are put in the index and replaced with '12ply'. Searching for 'ply 12"' then passes that input string to the same synonyms analyzer with the same tokenizer, which splits it into 'ply' and '12"' and runs them through the synonyms, changing them to 'ply' and '12ply'. That is what is used for the query, and it will match.
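
A small Python sketch of that flow, under stated assumptions: a whitespace tokenizer plus lowercasing, a plain dict standing in for the synonym list (mapping '12"' to the stem '12ply'), and a query that counts as a match when every query term is found among the indexed terms. The analyze function is hypothetical and only imitates the behaviour described here.

```python
def analyze(text: str, synonyms: dict[str, str]) -> list[str]:
    # Step 1: whitespace tokenizer, then lowercase filter.
    tokens = [t.lower() for t in text.split()]
    # Step 2: synonym filter, replacing a term with its stem when it is listed.
    return [synonyms.get(t, t) for t in tokens]


# Assumed synonym list: the synonym '12"' belongs to a group with stem '12ply'.
synonyms = {'12"': "12ply"}

# The same analyzer is applied to the indexed value and to the search string.
indexed_terms = analyze('Wood ply 12" x 10"', synonyms)
query_terms = analyze('ply 12"', synonyms)

# Assumption: the query matches when all of its terms exist in the index.
matches = all(term in indexed_terms for term in query_terms)

print(indexed_terms)
print(query_terms)
print(matches)
```

Because both sides go through the same replacement, the original '12"' never has to exist in the index for the search to hit.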

Now - adding dual words into the synonyms list is a huge challenge because it goes against how Lucene works with the terms coming out of the tokenizers:

Consider this:

<synonym>Wood ply 12" x 10"</synonym>
<synonym>12 ply</synonym>
<synonym>12 inch</synonym>

If the synonyms analyzer replaced '12 ply' with '12ply', you could not search for '12' or 'ply' alone any more because it would have been removed before indexing.

So what happens here is that 'Board 12 ply' comes in, and 'board', '12ply', '12', 'ply' and '12 ply' come out. This is great! Now all of it can be searched.

But if your incoming is 'Board 12ply' (now without a space), 'Board' '12ply' comes out of the tokenizer and '12ply' is not located in the synonyms list and no changes to the token stream happens.

Now - having indexed 'Board 12ply', someone searches for '12 ply'. It goes into the tokenizer and becomes '12' and 'ply', which are then passed to the synonyms filter. That locates the synonym '12 ply' and (because it is 2 words) returns '12', 'ply' and '12 ply', all of which are applied to the query. But you do not have '12' and 'ply' in the index for this product.

It is the exact same analyzer that is applied to search strings and strings being indexed.

So you have to wrap your head around these tokenizers and filters to understand.

But yes - I can get that to work - using the information above:

<group stem="hw">
  <synonym>hello world</synonym>
</group>
<group stem="hello world">
  <synonym>hw</synonym>
</group>

Capiche?

Nuno Aguiar

Posted on 09/02/2021 15:08:00

Hi Nicolai,

That does help clarify a few things, thank you, but there are just so many moving parts that I don't feel like I am grasping this yet.

I see your screendumps, but I can't get this to work on my end as expected. I am failing to see what I'm missing and/or it's the scalability scenario that just does not work for me yet. I'll reach out to you over email and see if we can schedule a session with you or someone on your team to help us look into some of our most complex issues so we know how to handle them.

Currently we have a few complex scenarios that we're trying to optimize (an old customer implementation that we had to adopt), and without understanding this we don't know how to do it.

Hope you or your team are up for the challenge.

Nuno Aguiar
