Developer forum

Searching for products with special characters in their name

Imar Spaanjaars Dynamicweb Employee

Hi there,

I know this topic has come up a number of times. However, some of the posts I found were very old so I am not sure what the current state and recommendations are.

I have products in the database that contain special characters like the ß, which represents two consecutive s's. For example: MEßRAD. When users search for this, they will search for MESSRAD instead. What can I do to have a search for MESSRAD return a product called MEßRAD? Are there built-in tools (like analyzers) for this? Or do I have to build my own? One option could be to store the plain text version in an additional field and use that in a freetext summary, but that feels a bit like a backup plan :-)

Thanks in advance!

Imar


Replies

 
Nicolai Pedersen Dynamicweb Employee
This post has been marked as an answer

Hi Imar

You probably need the German analyzer: https://lucenenet.apache.org/docs/3.0.3/de/dbc/class_lucene_1_1_net_1_1_analysis_1_1_de_1_1_german_analyzer.html

Attached is a version of the Lucene provider that exposes the German analyzer.

With that, you can create a field type that uses the German analyzer like below:

Votes for this answer: 1
 
Nuno Aguiar Dynamicweb Employee

Hi Imar,

 

CaseInsensitive Whitespace Analyzer

I also used the Case Insensitive Whitespace analyzer in the past with positive results https://doc.dynamicweb.com/forum/ecommerce-standard-features/ecommerce-standard-features/lucene-query-with-special-characters. You'd have to create a new index field to store this data and then search on both the Product name and the new field with equal Boost values.
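As a rough illustration of what a case-insensitive whitespace analyzer does, here is a small Python sketch. It is not the Lucene.NET implementation, and the function name is made up for this example. The idea is that text is split on whitespace only and each token is lowercased, so punctuation and special characters inside a token survive.

```python
def whitespace_lowercase_analyze(text: str) -> list[str]:
    """Split on whitespace only and lowercase each token.

    Illustrative stand-in for a case-insensitive whitespace analyzer;
    not the Lucene.NET implementation.
    """
    return [token.lower() for token in text.split()]


if __name__ == "__main__":
    # Hyphens, slashes and letters such as ß stay inside the token.
    print(whitespace_lowercase_analyze("MEßRAD 12-V/AC"))
```

Note that lowercasing alone leaves the ß untouched, which is why this approach still depends on what is stored in the extra index field.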

 

Snowball Analyzer

Also standard, we've used the Snowball analyzer. It's meant to support stemming too, which might be helpful in your case. Just like in the previous case, you'd point this to a new index field and query both "Product name" and this one equally.

 

Synonym Analyzer / Synonyms

There is a built-in tool for synonyms https://doc.dynamicweb.com/documentation-9/repositories/other/synonyms which can be used for this. However, this means you'd need to feed it every word you want to match, which may not be too helpful.
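To make the maintenance cost of the synonym route concrete, here is a minimal Python sketch of query-time synonym expansion. The table and function name are hypothetical and have nothing to do with how the Dynamicweb synonym tool is implemented. It only shows that every spelling variant has to be entered by hand.

```python
# Hypothetical synonym table: every spelling variant has to be listed by hand.
SYNONYMS = {
    "messrad": {"meßrad"},
    "meßrad": {"messrad"},
}


def expand_query(term: str) -> set[str]:
    """Return the lowercased term plus any hand-maintained synonyms for it."""
    key = term.lower()
    return {key} | SYNONYMS.get(key, set())
```

A term that nobody added to the table, for example "Fussball", is returned unchanged, so it would not match a product spelled "Fußball".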

 

Hope this gets you going. When I have to go through this, I download the index files and validate them against Luke.Net. I would recommend you try all approaches at once (meaning creating multiple index fields with different analyzers) and then validate the results in Luke.Net.

A lesson learned is also to try with multiple product names, to get a good set of variations and ensure you are getting the right analyzer set up, as some could give you the same output for a particular term.

 

Let me know if I can help you more.

 

Best Regards,

Nuno Aguiar

 
Imar Spaanjaars Dynamicweb Employee

Thanks Nicolai. That seems like a custom version? Will this be included in the core product if it works for me? Or should I build a custom analyzer? The solution is quite old (9.10 I believe) and I am trying to find a way to add this without redoing too much work.

@Nuno: all your solutions seem to suggest adding a new, normalized field, right? Was hoping to avoid that if I can. 

Imar

 
Nuno Aguiar Dynamicweb Employee

Hi Imar

 

Yes, correct. Adding a field, and adding a new expression in the Query.

 

Or adding that field to the summary field afterwards (I am not sure whether summary fields are processed after all other fields or whether they follow a top-down approach, so I don't know if that is even an option).

 

Nuno

 
Nicolai Pedersen Dynamicweb Employee

Hi Imar

Yes, I have added it to my local build of the Lucene provider, and will commit it for you if it does what is expected.

BR Nicolai

 
Imar Spaanjaars Dynamicweb Employee

Nice! I dropped the DLL in the solution, added a type using the GermanAnalyzer, and switched the ProductName_Search field to my custom type and lo and behold: I got data on a double s search term for products with an Eszett.

I added the DLL to the 9.10 project and so far so good. I looked at its dependencies and they seem to target low versions of packages, so should I be good to go? Or is there a way I can extract the GermanAnalyzer from your DLL and put it in a custom one, minimizing version conflicts with standard DW functionality?

Thanks!

Imar

 
Adrian Ursu Dynamicweb Employee

Hi guys,

This looks promising and very useful and it does not seem to be very complicated to achieve it.

@Nicolai: maybe it would be good to add some sort of guide for guys like me (Lucene analyzers for Dummies) on the documentation site for incorporating language specific analyzers into DW.

Thank you,
Adrian

 
Nicolai Pedersen Dynamicweb Employee

Cool - added Devops#10996 that adds German, French, Greek, Dutch and Chinese analyzers.

Will be out with next version of this package.

 
Imar Spaanjaars Dynamicweb Employee

Great, thanks! And what about old solutions? I need to have this work on 9.10. It does, currently, but I am not sure if such a new package compared to the rest of the app will cause issues.

Also, I may come back one day for Spanish which also has a bunch of special characters.

Imar

 
Adrian Ursu Dynamicweb Employee

Hi Nicolai,

Would it be too much to ask to add the Romanian one as well? 

Of course, if it's not already there.

Thank you,
Adrian

 
Nicolai Pedersen Dynamicweb Employee

@Adrian: I do not have a Romanian analyzer - but I have a Czech one which I just added. The only Romanian I can find is for the Java version of Lucene - https://tool.oschina.net/uploads/apidocs/lucene-3.6.0/org/apache/lucene/analysis/ro/RomanianAnalyzer.html

@Imar: The Dynamicweb Lucene package is also valid for 9.10 - no changes in dependencies to add these analyzers. Spanish is not available out of the box in the analysis.

You can see the available analyzers here: https://lucenenet.apache.org/docs/3.0.3/dd/de5/namespace_lucene_1_1_net_1_1_analysis.html

 
Imar Spaanjaars Dynamicweb Employee

Are these analyzers only needed for special / combined characters? Does the default index engine support characters like é and â or will those need to be analyzed separately? 

Imar

 
Imar Spaanjaars Dynamicweb Employee

Bump. Would love to know if and how this would work for other languages like Danish, Spanish, German and Dutch.

 
Nicolai Pedersen Dynamicweb Employee

No - term suggestions (spellcheck) will find those versions for you if misspelled. But the analyzer does not handle it.

You will need an analyzer with the ASCIIFoldingFilter https://lucenenet.apache.org/docs/3.0.3/d2/d34/class_lucene_1_1_net_1_1_analysis_1_1_a_s_c_i_i_folding_filter.html

An example can be found here: https://gist.github.com/jmcd/4110064
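For readers who want a feel for what ASCII folding does before wiring up a Lucene.NET filter, here is a small Python approximation. It is a sketch, not a port of ASCIIFoldingFilter: the explicit table below is an assumption that covers only a handful of letters, whereas the real filter handles many more. To my knowledge the Lucene filter also folds ß to "ss", which is what the sketch mimics.

```python
import unicodedata

# Characters that have no Unicode decomposition need an explicit mapping.
# This small table is illustrative; Lucene's ASCIIFoldingFilter covers far more.
EXPLICIT_FOLDS = {"ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE"}


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding: strip accents, map a few special letters."""
    replaced = "".join(EXPLICIT_FOLDS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop combining marks (accents) left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Folding does not change case, so in a real analyzer chain it would normally sit next to a lowercase step.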

 
Imar Spaanjaars Dynamicweb Employee

Sorry to keep coming back to this but I still need a little help. I have a solution with a bunch of different languages:

German
French
Danish
Spanish
Dutch

Each of these languages has special characters which we need to support. The ASCIIFoldingFilter looks really good and using it with a custom analyzer seems simple and straightforward enough. However, how would I use it together with the German analyzer? Having multiple analyzers on a field won't work nor would that make sense. Should I create separate types for each analyzer and then duplicate the fields for each analyzer type? Or maybe the ASCIIFoldingFilter also solves the Eszett in German? Or maybe I should create an AsciiGermanFrenchAndTheKitchenSink analyzer that does all of those in one? Or?

A little confused so more practical guidance would be much appreciated.

Thanks!

Imar

 
Nicolai Pedersen Dynamicweb Employee

Good question - I do not know. I think the issue is that you want different behavior on different documents in the same index. One way could be combining all of them in one analyzer - which seems like a bad idea as they probably handle the same things differently and you risk destroying terms. Another option is to have different indexes based on language...

An analyzer is a combination of filters and tokenizers - so taking the sum of those together with the ASCIIFoldingFilter might be an option.

The big difference between the different language analyzers is that they have different stemmers (StemmingFilter). Stemmers handle plurals and other forms of the same word - to ensure "developer" = "developers" = "developing" kind of rules.
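The "developer" = "developers" = "developing" rule can be illustrated with a toy suffix stripper in Python. This is deliberately not the Snowball or German stemming algorithm; the suffix list and function name are invented for the example, and real stemmers apply far more careful, language-specific rules.

```python
# Toy suffix stripper, NOT the Snowball/Porter algorithm: it only shows
# the idea that different word forms collapse to one indexed term.
SUFFIXES = ("ers", "ing", "er", "s")  # longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered
```

Because the suffix rules are specific to one language, running several such stemmers one after another over the same token can strip more than intended.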

You can see the TokenStream() method here for German:

https://lucenenet.apache.org/docs/3.0.3/de/d4a/_german_analyzer_8cs_source.html

Last filter is the German stemmer and if you put all language stemmers after each other I am sure 'bad' things happen. Also stop words are different inside these analyzers.

So we would probably need to create a document-specific analyzer that can take a country code and based on that apply the correct filter. That would require a bit of change to our indexing to locate and pass that identifier - and then have a "Dynamicweb mother of all" analyzer that can do all of the above.
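The document-specific analyzer idea could look roughly like the following Python sketch. Everything in it is hypothetical: the language codes, the tiny stop-word lists and the placeholder stemmers are assumptions for illustration and do not reflect any Dynamicweb or Lucene.NET API. The point is only that the language code selects one set of stop words and one stemmer per document.

```python
from typing import Callable

# Hypothetical per-language settings; real analyzers ship much larger
# stop-word lists and real stemmers.
STOP_WORDS: dict[str, set[str]] = {
    "de": {"der", "die", "das", "und"},
    "fr": {"le", "la", "les", "et"},
    "nl": {"de", "het", "een", "en"},
}


def identity(token: str) -> str:
    """Placeholder where a language-specific stemmer would go."""
    return token


STEMMERS: dict[str, Callable[[str], str]] = {
    "de": identity,
    "fr": identity,
    "nl": identity,
}


def analyze(text: str, language: str) -> list[str]:
    """Tokenize, then apply only the stop words and stemmer of one language."""
    stop_words = STOP_WORDS.get(language, set())
    stem = STEMMERS.get(language, identity)
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in stop_words]
```

For example, "de" is dropped as a stop word when the text is analyzed as Dutch but kept when it is analyzed as French, which is the kind of conflict that makes a single combined analyzer risky.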

I will be happy to play with this with you?

 
Imar Spaanjaars Dynamicweb Employee

>> I will be happy to play with this with you?

Yes, that would be great, thank you. How would you like to approach that?

Imar

 
Nicolai Pedersen Dynamicweb Employee
This post has been marked as an answer

Hi Imar

Attached find a new version of the Lucene provider - it now contains a "DynamicwebStandardAnalyzer" that acts like the StandardAnalyzer but contains the ASCIIFoldingFilter which should take care of the diacritics. Give it a spin and see if that is closer to your goal.
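Why a folding analyzer helps can be sketched with a tiny in-memory index in Python. This does not reproduce the DynamicwebStandardAnalyzer; the class and function names are made up, and the analysis step is a simplified stand-in that relies on Python's casefold (which turns ß into ss) plus accent stripping. What it shows is that the same analysis runs at index time and at query time, so both sides end up with the same terms.

```python
import unicodedata
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Casefold (turns ß into ss), strip accents, split on whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.split()


class TinyIndex:
    """Minimal inverted index: term -> set of document ids."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in analyze(text):  # index-time analysis
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set[int]:
        terms = analyze(query)  # query-time analysis, same function
        if not terms:
            return set()
        results = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*results)
```

With a product named "MEßRAD" added as document 1, a search for "MESSRAD" returns document 1 because both strings are analyzed to the term "messrad".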

BR Nicolai

Votes for this answer: 1
 
Imar Spaanjaars Dynamicweb Employee

Perfect, thanks. We'll give this a try!

Imar

 
Imar Spaanjaars Dynamicweb Employee

Hi Nicolai,

Yes, this gets me closer, but I am not quite there yet. It works well when I create a custom field type with this analyzer and then use my type as the type for a single field. It works great, for example, on a product name or long description field.

However, where I need it the most is where I cannot get it to work yet: summary fields. A summary field doesn't have a type (in the UI) so I can't make it use my custom type it seems. I tried a few solutions:

  1. Manually updated the index file and changed the type from System.String to my custom type. Then rebuilt the index. But search against the summary field didn't work if the accents didn't match.
  2. Created new fields for Name, Short and Long description, each one using my custom type; Then I created a new summary field using these 3 new fields. Search against the summary field didn't work if the accents didn't match.

I then manually updated my query to search against the separate fields directly, bypassing the summary field. That does the trick and it'll be my workaround for now. But it does eliminate the option to use summary fields.
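The workaround of querying the individual fields instead of the summary field amounts to an OR across fields. The Python sketch below illustrates that shape only; the product data, field names and scoring are hypothetical and are not how the Dynamicweb query engine works. Field values are assumed to have been analyzed (lowercased and folded) at index time already.

```python
# Hypothetical products with three separately indexed fields.
PRODUCTS = {
    1: {"name": "messrad", "short": "praezisions rad", "long": "rad zum messen"},
    2: {"name": "fahrrad", "short": "city bike", "long": "mit messrad sensor"},
}
SEARCH_FIELDS = ("name", "short", "long")


def search_fields(term: str) -> dict[int, int]:
    """OR the term across each field; score = number of fields that match.

    Every field counts equally here (boost 1); field values are assumed to
    be already analyzed at index time.
    """
    scores: dict[int, int] = {}
    for doc_id, fields in PRODUCTS.items():
        hits = sum(1 for f in SEARCH_FIELDS if term in fields[f].split())
        if hits:
            scores[doc_id] = hits
    return scores
```

The cost of this approach is that the query has to list every field explicitly, which is exactly the bookkeeping a summary field would otherwise take care of.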

So, in summary: the analyzer seems to be doing its work (I still have more languages and characters to try out but so far it looks good) but its usage is limited to separate fields. Having the option to specify the type on a summary field would greatly expand its usefulness.

Thanks!

Imar

 
Imar Spaanjaars Dynamicweb Employee

Bump. Any solution for the above where I could use the analyzer on summary fields?

Imar

 

 
Nicolai Pedersen Dynamicweb Employee

Not a simple solution. Requires a bit of re-writing which is currently out of scope...

 
Imar Spaanjaars Dynamicweb Employee

Understood. Do you recommend a solution? Should I change the setup and query to not query summary fields but search in individual fields instead? Or is there a better alternative?

 
Nicolai Pedersen Dynamicweb Employee

Hi Imar

Just had a call with Jeppe and looked into the code. It seems that the code supports a lot.

I.e. it is possible to add an analyzer directly on a field, including a Summaryfield (in code CopyFieldDefinition) - and it also supports the custom fields which can then have analyzers on them.

If you want to test - you can do like this in your .index file

 

We are trying to add the option to choose a custom field (using another analyzer) for summary field and see if that is enough to make it work.

 
Imar Spaanjaars Dynamicweb Employee

That sounds great. Will give this a try.

Imar

 
