Searching for products with special characters in their name

Imar Spaanjaars

Posted on 29/11/2022 08:53:37

Hi there,

I know this topic has come up a number of times. However, some of the posts I found were very old so I am not sure what the current state and recommendations are.

I have products in the database that contain special characters like the ß, which represents two consecutive s's. For example: MEßRAD. When users search for this, they will search for MESSRAD instead. What can I do to have a search for MESSRAD return a product called MEßRAD? Are there built-in tools (like analyzers) for this? Or do I have to build my own? One option could be to store the plain text version in an additional field and use that in a freetext summary, but that feels a bit like a backup plan :-)

Thanks in advance!

Imar

Replies

Nicolai Pedersen

Posted on 29/11/2022 10:46:08

This post has been marked as an answer

Hi Imar

You probably need the German analyzer: https://lucenenet.apache.org/docs/3.0.3/de/dbc/class_lucene_1_1_net_1_1_analysis_1_1_de_1_1_german_analyzer.html

Attached is a version of the Lucene provider that exposes the German analyzer.

With that, you can create a field type that uses the German analyzer like below:

Dynamicweb.Indexing.Lucene.zip

Votes for this answer: 1

Nuno Aguiar

Posted on 29/11/2022 10:52:30

Hi Imar,

CaseInsensitive Whitespace Analyzer

I also used the Case Insensitive Whitespace analyzer in the past with positive results https://doc.dynamicweb.com/forum/ecommerce-standard-features/ecommerce-standard-features/lucene-query-with-special-characters. You'd have to create a new index field to store this data and then search on both the Product name and the new field with equal Boost values.

Snowball Analyzer

We've also used the standard Snowball analyzer. It's meant to support stemming too, which might be helpful in your case. Just like in the previous case, you'd point this to a new index field and query both "Product name" and this one equally.

Synonym Analyzer / Synonyms

There is a built-in tool for Synonyms https://doc.dynamicweb.com/documentation-9/repositories/other/synonyms which can be used for this. However, this means you'd need to feed it every word you want covered, which may not be too helpful.

Hope this helps you get going. When I have to go through this, I download the index files and validate them against Luke.Net. I would recommend you try all approaches at once (meaning creating multiple index fields with different analyzers) and then validate the results in Luke.Net.

A lesson learned is also to try with multiple product names, to get a good set of variations and ensure you are getting the right analyzer set up, as some could give you the same output for a particular term.

Let me know if I can help you more.

Best Regards,

Nuno Aguiar

Imar Spaanjaars

Posted on 29/11/2022 14:08:15

Thanks Nicolai. That seems like a custom version? Will this be included in the core product if it works for me? Or should I build a custom analyzer? The solution is quite old (9.10 I believe) and I'm trying to find a way to add this without redoing too much work.

@Nuno: all your solutions seem to suggest adding a new, normalized field, right? Was hoping to avoid that if I can.

Imar

Nuno Aguiar

Posted on 29/11/2022 14:26:12

Hi Imar

Yes, correct. Adding a field, and adding a new expression in the Query.

Or adding that field to the summary field afterwards (not sure if summary fields are processed after all other fields or if they follow a top-down approach for that to even be an option).

Nuno

Nicolai Pedersen

Posted on 29/11/2022 16:26:03

Hi Imar

Yes, I have added it to my local build of the lucene provider, and will commit it to you if it does what is expected.

BR Nicolai

Imar Spaanjaars

Posted on 29/11/2022 21:20:47

Nice! I dropped the dll in the solution, added a type using the GermanAnalyzer, and switched the ProuctName_Search field to my custom type and lo and behold: I got data on a double s search term for products with an Eszett:

I added the DLL to the 9.10 project and so far so good. I looked at its dependencies and they seem to target low versions of packages, so should I be good to go? Or is there a way I can extract the GermanAnalyzer from your DLL and put it in a custom one, minimizing version conflicts with standard DW functionality?

Thanks!

Imar

Adrian Ursu

Posted on 30/11/2022 09:17:21

Hi guys,

This looks promising and very useful and it does not seem to be very complicated to achieve it.

@Nicolai: maybe it would be good to add some sort of guide for guys like me (Lucene analyzers for Dummies) on the documentation site for incorporating language specific analyzers into DW.

Thank you,
Adrian

Nicolai Pedersen

Posted on 30/11/2022 09:47:36

Cool - added Devops#10996 that adds German, French, Greek, Dutch and Chinese analyzers.

Will be out with next version of this package.

Imar Spaanjaars

Posted on 30/11/2022 10:18:30

Great, thanks! And what about old solutions? I need to have this work on 9.10. It does, currently, but I am not sure if such a new package compared to the rest of the app will cause issues.

Also, I may come back one day for Spanish which also has a bunch of special characters.

Imar

Adrian Ursu

Posted on 30/11/2022 10:18:33

Hi Nicolai,

Would it be too much to ask to add the Romanian one as well?

Of course, if it's not already there.

Thank you,
Adrian

Nicolai Pedersen

Posted on 30/11/2022 11:14:34

@Adrian: I do not have a Romanian analyzer - but I have a Czech one which I just added. The only Romanian I can find is for the Java version of Lucene - https://tool.oschina.net/uploads/apidocs/lucene-3.6.0/org/apache/lucene/analysis/ro/RomanianAnalyzer.html

@Imar: The Dynamicweb Lucene package is also valid for 9.10 - no changes in dependencies to add these analyzers. Spanish is not available out of the box in the analysis.

You can see the available analyzers here: https://lucenenet.apache.org/docs/3.0.3/dd/de5/namespace_lucene_1_1_net_1_1_analysis.html

Imar Spaanjaars

Posted on 01/12/2022 15:48:54

Are these analyzers only needed for special / combined characters? Does the default index engine support characters like é and â or will those need to be analyzed separately?

Imar

Imar Spaanjaars

Posted on 06/12/2022 08:31:33

Bump. Would love to know if and how this would work for other languages like Danish, Spanish, German and Dutch.

Nicolai Pedersen

Posted on 06/12/2022 11:26:01

No - term suggestions (spellcheck) will find those versions for you if misspelled. But the analyzer does not handle it.

You will need an analyzer with the ASCIIFoldingFilter https://lucenenet.apache.org/docs/3.0.3/d2/d34/class_lucene_1_1_net_1_1_analysis_1_1_a_s_c_i_i_folding_filter.html

An example can be found here: https://gist.github.com/jmcd/4110064
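To illustrate what such a folding step does conceptually, here is a minimal Python sketch. It is not Lucene code and not how the ASCIIFoldingFilter is implemented; the function names and the small mapping table are assumptions made up for this illustration. It only shows the idea of mapping ß to ss and stripping accents from both the indexed text and the search term before comparing them.

```python
import unicodedata

# Characters without a Unicode decomposition need an explicit mapping.
# This small table is an illustrative assumption, not Lucene's folding table.
SPECIAL_FOLDS = {
    "ß": "ss",
    "ẞ": "SS",
    "æ": "ae",
    "Æ": "AE",
    "ø": "o",
    "Ø": "O",
}


def fold_to_ascii(text: str) -> str:
    """Replace accented and special characters with plain ASCII equivalents."""
    # Apply the explicit mappings first (ß -> ss, ø -> o, ...).
    replaced = "".join(SPECIAL_FOLDS.get(ch, ch) for ch in text)
    # Decompose accented letters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFKD", replaced)
    # ... and drop the combining marks so only the base letters remain.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(search_term: str, product_name: str) -> bool:
    """Compare two strings after folding and lowercasing both sides."""
    return fold_to_ascii(search_term).lower() == fold_to_ascii(product_name).lower()
```

With this, `matches("MESSRAD", "MEßRAD")` is true because both sides fold to the same lowercase string. The same normalization has to run at index time and at query time, which is why it belongs in the analyzer.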

Imar Spaanjaars

Posted on 07/12/2022 11:54:19

Sorry to keep coming back to this but I still need a little help. I have a solution with a bunch of different languages:

German
French
Danish
Spanish
Dutch

Each of these languages has special characters which we need to support. The ASCIIFoldingFilter looks really good and using it with a custom analyzer seems simple and straightforward enough. However, how would I use it together with the German analyzer? Having multiple analyzers on a field won't work nor would that make sense. Should I create separate types for each analyzer and then duplicate the fields for each analyzer type? Or maybe the ASCIIFoldingFilter also solves the Eszett in German? Or maybe I should create an AsciiGermanFrenchAndTheKitchenSink analyzer that does all of those in one? Or?

A little confused so more practical guidance would be much appreciated.

Thanks!

Imar

Nicolai Pedersen

Posted on 07/12/2022 15:18:59

Good question - I do not know. I think the issue is that you want different behavior on different documents in the same index. One way could be combining all of them in one analyzer - which seems like a bad idea as they probably handle the same things differently and you risk destroying terms. Another option is to have different indexes based on language...

An analyzer is a combination of filters and tokenizers - so taking the sum of those together with the ASCIIFoldingFilter might be an option.

The big difference between the language analyzers is that they have different stemmers (StemmingFilter). Stemmers handle variations of the same word - to ensure "developer" = "developers" = "developing" kind of rules.

You can see the TokenStream() method here for German:

https://lucenenet.apache.org/docs/3.0.3/de/d4a/_german_analyzer_8cs_source.html

Last filter is the German stemmer and if you put all language stemmers after each other I am sure 'bad' things happen. Also stop words are different inside these analyzers.

So we would probably need to create a document-specific analyzer that can take a country code and, based on that, apply the correct filter. That would require a bit of change to our indexing to locate and pass that identifier - and then have a "Dynamicweb mother of all" analyzer that can do all of the above.
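As a rough illustration of that idea, the Python sketch below picks language-specific stop words based on a language code and then applies the same folding step to every token. It is not Dynamicweb or Lucene code: the `analyze` function, the `STOP_WORDS` table and the tiny word lists are hypothetical, and the stemming step is only marked with a comment.

```python
import re
import unicodedata
from typing import Dict, List, Set

# Tiny, made-up stop word lists; real analyzers ship much larger ones.
STOP_WORDS: Dict[str, Set[str]] = {
    "de": {"der", "die", "das", "und"},
    "fr": {"le", "la", "les", "et"},
    "nl": {"de", "het", "een", "en"},
}


def fold(token: str) -> str:
    """Fold ß to ss and strip accents (a simplified stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", token.replace("ß", "ss"))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str, language: str) -> List[str]:
    """Tokenize, lowercase, drop language-specific stop words, then fold."""
    tokens = re.findall(r"\w+", text.lower())
    # Unknown languages simply get no stop word filtering.
    stop_words = STOP_WORDS.get(language, set())
    tokens = [t for t in tokens if t not in stop_words]
    # A language-specific stemmer would be applied here, one per language,
    # never several stemmers chained after each other.
    return [fold(t) for t in tokens]
```

For example, `analyze("Das Meßrad und die Straße", "de")` keeps only the two content words and folds them to `messrad` and `strasse`. The point of the design is that only the language-dependent steps vary per document, while the folding step is shared.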

I will be happy to play with this with you?

Imar Spaanjaars

Posted on 16/12/2022 09:40:51

>> I will be happy to play with this with you?

Yes, that would be great, thank you. How would you like to approach that?

Imar

Nicolai Pedersen

Posted on 20/12/2022 13:53:04

This post has been marked as an answer

Hi Imar

Attached find a new version of the Lucene provider - it now contains a "DynamicwebStandardAnalyzer" that acts like the StandardAnalyzer but contains the ASCIIFoldingFilter which should take care of the diacritics. Give it a spin and see if that is closer to your goal.

BR Nicolai

Dynamicweb.Indexing.Lucene.zip

Votes for this answer: 1

Imar Spaanjaars

Posted on 22/12/2022 09:50:47

Perfect, thanks. We'll give this a try!

Imar

Imar Spaanjaars

Posted on 30/12/2022 12:37:51

Hi Nicolai,

Yes, this gets me closer, but I am not quite there yet. It works well when I create a custom field type with this analyzer and then use my type as the type for a single field. It works great, for example, on a product name or long description field.

However, where I need it the most is where I cannot get it to work yet: summary fields. A summary field doesn't have a type (in the UI) so I can't make it use my custom type it seems. I tried a few solutions:

Manually updated the index file and changed the type from System.String to my custom type. Then rebuilt the index. But search against the summary field didn't work if the accents didn't match.
Created new fields for Name, Short and Long description, each one using my custom type; Then I created a new summary field using these 3 new fields. Search against the summary field didn't work if the accents didn't match.

I then manually updated my query to search against the separate fields directly, bypassing the summary field. That does the trick and it'll be my work around for now. But it does eliminate the option to use summary fields.
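Conceptually, the workaround looks like the Python sketch below: every field is normalized on its own and the query is checked against each field in turn, instead of against one combined summary field. This is only an illustration of the approach, not Dynamicweb query code; the `terms` and `search` functions and the field names are made up for the example.

```python
import re
import unicodedata
from typing import Dict, List, Sequence, Set


def terms(text: str) -> Set[str]:
    """Lowercase, fold ß and accents to plain ASCII, and split into words."""
    folded = unicodedata.normalize("NFKD", text.lower().replace("ß", "ss"))
    stripped = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return set(re.findall(r"\w+", stripped))


def search(
    products: Sequence[Dict[str, str]], query: str, fields: Sequence[str]
) -> List[Dict[str, str]]:
    """Return products where any single field contains all query terms."""
    wanted = terms(query)
    if not wanted:
        return []
    hits = []
    for product in products:
        # Check each field separately instead of one combined summary field.
        if any(wanted <= terms(product.get(field, "")) for field in fields):
            hits.append(product)
    return hits
```

Note that in this sketch all query terms must occur in the same field, which is stricter than a summary field where the terms can be spread over name and description.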

So, in summary: the analyzer seems to be doing its work (I still have more languages and characters to try out but so far it looks good) but its usage is limited to separate fields. Having the option to specify the type on a Summary field would greatly expand its usefulness.

Thanks!

Imar

Imar Spaanjaars

Posted on 11/01/2023 11:54:41

Bump. Any solution for the above where I could use the analyzer on summary fields?

Imar

Nicolai Pedersen

Posted on 18/01/2023 17:40:29

Not a simple solution. Requires a bit of re-writing which is currently out of scope...

Imar Spaanjaars

Posted on 19/01/2023 09:05:03

Understood. Do you recommend a solution? Should I change the setup and query to not query summary fields but search in individual fields instead? Or is there a better alternative?

Nicolai Pedersen

Posted on 19/01/2023 10:48:48

Hi Imar

Just had a call with Jeppe and looked into the code. And it seems that the code supports a lot.

I.e. it is possible to add an analyzer directly on a field, including a summary field (in code CopyFieldDefinition) - and it also supports the custom fields which can then have analyzers on them.

If you want to test - you can do something like this in your .index file

We are trying to add the option to choose a custom field (using another analyzer) for summary field and see if that is enough to make it work.

Imar Spaanjaars

Posted on 19/01/2023 16:07:35

That sounds great. Will give this a try.

Imar
