Developer forum

Adding wildcard search to Repository

Søren Heide Larsen

Hi,

Is there any way to support wildcard searches, e.g. so that searching for "he" will match "hello"? I know that you can develop another analyzer, but I would prefer not to break all keywords down into N-gram terms, as I know Lucene supports wildcards.

Best Regards
Søren


Replies

 
Nicolai Pedersen

Hi Søren

If you do a "Contains" on a text field, that is exactly how it works. But it will not result in finding "Hello" when searching for "lo".

For free text searches, you should define a summary field containing the fields that should be searched, and in the query define a "Contains" on that field. See the attached screenshots.

Capture.PNG Capture1.PNG
 
Søren Heide Larsen

Hi Nicolai,

Thank you for the fast response. This was actually exactly what I did, but it still does not work as I would like it to. I have now realised that Contains does a wildcard search, but only at the end of the search term, e.g. "Hel" will match "Hello" but "llo" will not match "Hello".

Is there any chance for modifying this?
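
To make the difference concrete, here is a minimal Python sketch of the three behaviours discussed in this thread. It uses the standard library's glob-style matcher as a stand-in for Lucene's wildcard syntax; the helper name `matches` is made up for illustration and is not part of Dynamicweb or Lucene.

```python
from fnmatch import fnmatchcase


def matches(term: str, pattern: str) -> bool:
    """Case-insensitive wildcard match; '*' stands for any run of characters."""
    return fnmatchcase(term.lower(), pattern.lower())


term = "Hello"

# What Contains appears to do today: the wildcard is only appended.
print(matches(term, "Hel*"))   # "Hel" is a prefix of "Hello"
print(matches(term, "llo*"))   # "llo" is not a prefix of "Hello"

# What is being asked for: a wildcard on both sides of the search term.
print(matches(term, "*llo*"))
```

Only the last pattern finds "Hello" from the fragment "llo", which is the behaviour requested above.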

 

2016-10-04_09_13_21-Hummel_-_Dynamicweb_Administration.png 2016-10-04_09_13_44-Hummel_-_Dynamicweb_Administration.png
 
Keld Gøtterup

I would also very much like to know if there is a possibility to modify this.

 
Søren Møller Larsen

I would love if you could do either of these:

Option 1: Modify the Contains method in Dynamicweb.Indexing.Lucene.Helpers, line 306, to use the following line. You could also argue that Contains in general should behave like the WildcardQuery below. The functionality today is more like "StartsWith", which would be nice to have as an OperatorType as well.
 

booleanQuery.Add(new WildcardQuery(new Term(fieldName, "*" + term + "*")), Occur.MUST);

Option 2: Create a new OperatorType and add it there

Option 3: Make Indexing.Lucene.Helpers extendable so we can modify this ourselves without creating a whole new provider :-)

 

 
Nicolai Pedersen

Hi Søren & Keld

The current implementation is the version recommended by Lucene. Adding the leading * will impact performance and memory usage dramatically, especially on larger indexes. Lucene has to search in a completely different way that puts a lot of data in memory: it can no longer seek in its index and instead has to scan all the terms/tokens in the index after loading them into memory. That is of course slow if the index does not already sit in memory (which can also be a problem).

So that's the explanation. Not sure this feature will really be a feature...
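
The cost difference described above can be sketched in a few lines of Python. This is a simplified model, not Lucene's actual implementation: it assumes the term dictionary is a sorted list, so a prefix lookup can binary-search to its starting point, while a lookup with a leading wildcard has to inspect every term. The function names are illustrative.

```python
from bisect import bisect_left


def prefix_lookup(sorted_terms, prefix):
    """Binary-search to the first candidate, then walk while the prefix holds."""
    hits = []
    start = bisect_left(sorted_terms, prefix)
    for i in range(start, len(sorted_terms)):
        if not sorted_terms[i].startswith(prefix):
            break  # sorted order guarantees no later term can match
        hits.append(sorted_terms[i])
    return hits


def infix_lookup(sorted_terms, fragment):
    """Sorting does not help here: every term in the dictionary is inspected."""
    return [term for term in sorted_terms if fragment in term]


terms = sorted(["hello", "help", "helmet", "jello", "world", "yellow"])

print(prefix_lookup(terms, "hel"))  # touches only the matching neighbourhood
print(infix_lookup(terms, "llo"))   # touches all six terms
```

In this model the prefix lookup does work proportional to the number of matches, while the infix lookup does work proportional to the size of the whole dictionary.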

Your thoughts?

 
Søren Heide Larsen

I believe you're correct that this will impact performance, so this is a tough one. I think the scenario is unavoidable, as some customers simply require this, so it should definitely be configurable, either through code or through the configurator. The option that makes most people happy would probably be to make it configurable, to avoid too much C#.

Performance reflections

The thing is that it does not scan the entire text strings, only the dictionary of terms, so I do not believe the performance hit will be that big unless you have an insanely large Lucene index, which is probably not the case. It would be nice to have a test, though, to get some facts instead of speculating :-)

My dream scenario would be the following:

  1. Implement StartsWith and Contains, and maybe add some sort of hint that Contains might be expensive. (We might consider creating a new operator for the double wildcard, so that people who just update Dynamicweb do not get a performance hit because Contains changes.)
  2. Put in a little Notification just before the query hits Lucene, so we can both log and modify the query for all sorts of interesting scenarios.
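
The notification idea in point 2 could look roughly like the Python sketch below. It is a generic hook pattern under stated assumptions, not the Dynamicweb notification API: `QueryPipeline`, `subscribe`, `prepare`, the string form of the query, and the `CustomerNumber` field name are all hypothetical.

```python
from typing import Callable, List

QueryHook = Callable[[str], str]


class QueryPipeline:
    """Hypothetical stand-in for the point just before a query reaches the index."""

    def __init__(self) -> None:
        self._hooks: List[QueryHook] = []

    def subscribe(self, hook: QueryHook) -> None:
        self._hooks.append(hook)

    def prepare(self, query: str) -> str:
        # Each subscriber sees the query and may return a modified version.
        for hook in self._hooks:
            query = hook(query)
        return query


seen: List[str] = []


def log_query(query: str) -> str:
    seen.append(query)  # logging only; the query passes through unchanged
    return query


def widen_customer_number(query: str) -> str:
    # Add a leading wildcard, but only for one deliberately chosen field.
    field, sep, value = query.partition(":")
    if sep and field == "CustomerNumber" and not value.startswith("*"):
        return f"{field}:*{value}"
    return query


pipeline = QueryPipeline()
pipeline.subscribe(log_query)
pipeline.subscribe(widen_customer_number)
print(pipeline.prepare("CustomerNumber:123*"))
```

A hook like this would let a single solution opt in to the expensive query for one field without changing what Contains does for everyone else.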
 
Nicolai Pedersen
This post has been marked as an answer

Hi Søren

There are lots of 'scary' stories out there about enabling double wildcards on Lucene, with performance 20 to 40 times slower; just google it. It will probably not be a big problem for 9 out of 10 implementations because of limited index sizes, but then we have customers with 700,000 products adding 10 criteria using this new approach and THEN....! (The site will crash; this has been tested.)

Also, not even Google supports searching for *something - as it makes little sense.

@1: Contains in this context does not mean "each term in this field contains part of the word being searched for"; it means "this text field contains the word being searched for". Renaming it to StartsWith would be wrong for all fields other than those holding just a single word. If you have a name field with the value "Søren Heide Larsen", the criterion "Name StartsWith 'la'" would return that record, which you would not expect, since the name starts with 'Sø'...

ANYWAYS. I will consider a "Bad performing term double contains" criteria for the next service release... With warnings.

@2: Will be happy to do that.
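
The distinction made under @1 can be shown with a small Python sketch. It assumes a simple whitespace-and-lowercase analyzer, which is an approximation of what an indexed text field goes through; the function names are illustrative only.

```python
def tokenize(text: str) -> list:
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def field_contains(text: str, query: str) -> bool:
    """Current Contains semantics: some term in the field begins with the query."""
    q = query.lower()
    return any(term.startswith(q) for term in tokenize(text))


def field_starts_with(text: str, query: str) -> bool:
    """What the name StartsWith would suggest: the whole field value begins with it."""
    return text.lower().startswith(query.lower())


name = "Søren Heide Larsen"

print(field_contains(name, "la"))     # the term "larsen" begins with "la"
print(field_starts_with(name, "la"))  # the field value begins with "sø"
print(field_contains(name, "ars"))    # no term begins with "ars"
```

The first two lines give different answers for the same input, which is why calling the existing operator StartsWith would mislead on multi-word fields.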

BR Nicolai

Votes for this answer: 1
 
Søren Heide Larsen

Hi Nicolai,

Love scary stories, nice that you have tested it through - Always nice to have these discussions with you! :-) 

@1: OK. I will do a workaround, because the customer scenario where I need it now only involves the Customer Number, so I can make an extension to the IndexBuilder that manages this.

@2: Great, looking forward to this.
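
The thread does not say how the workaround under @1 was built. One possible approach for a single short field, sketched in Python below, is to index every suffix of the value so that an ordinary prefix match also finds fragments in the middle. All names are hypothetical, and the dictionary here is a plain dict scanned linearly; a real index would seek in its sorted term dictionary instead.

```python
def suffix_terms(value: str) -> list:
    """All suffixes of the value, e.g. 'ab12' -> ['ab12', 'b12', '12', '2']."""
    v = value.lower()
    return [v[i:] for i in range(len(v))]


def build_index(customer_numbers) -> dict:
    """Map every suffix term to the customer numbers it came from."""
    index = {}
    for number in customer_numbers:
        for term in suffix_terms(number):
            index.setdefault(term, set()).add(number)
    return index


def search(index: dict, fragment: str) -> set:
    """Prefix match on suffix terms behaves like a substring match on the value."""
    f = fragment.lower()
    hits = set()
    for term, numbers in index.items():
        if term.startswith(f):
            hits |= numbers
    return hits


index = build_index(["AB1234", "CD5612"])
print(search(index, "12"))
print(search(index, "34"))
```

The trade-off is a larger index (one term per character of each value), which is usually acceptable for a short identifier such as a customer number but not for long text fields.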

Best Regards
Søren

 
Nicolai Pedersen
This post has been marked as an answer

Hi Søren.

I've added two tasks for the next service release covering these things.

TFS#26668 "Add a ContainsExtended criteria to the search builder to allow leading wildcard searches"

and

TFS#26669 "Add a notification to ParseQueryExpression"

BR Nicolai

Votes for this answer: 1
 
Søren Heide Larsen

Great! :)

 
Keld Gøtterup

In which version are these features implemented?

 
Jonas Krarup Dam

These features are currently planned for 8.9.1, which is scheduled to be released at the end of November.

 

Regards, Jonas

 
Kristian Kirkholt Dynamicweb Employee

Hi Søren and Keld

The features TFS#26669 "Add a notification to ParseQueryExpression" and TFS#26668 "Add a ContainsExtended criteria to the search builder to allow leading wildcard searches" have now been implemented in version 8.9.1.0.

You are able to find this build in the download section:

http://doc.dynamicweb.com/releases-and-downloads/releases

Please contact Dynamicweb Support if you need any additional help regarding this.

Kind Regards
Dynamicweb Support
Kristian Kirkholt

 

 
