Indexes - Free text search and html entities (ø = &oslash; etc.)

Jacob Storgaard Jensen

Posted on 26/03/2019 14:51:21

Hi guys,

I'm trying to set up a free-text search on page items and their paragraph item content. It's indexing just fine, but whenever the search query includes special characters it returns nothing.
I have inspected the index with Luke and found that all HTML editor fields have the special characters stored as HTML entities (ø = &oslash; etc.). It seems to be CKEditor that does this.
Is there any way we can have it match those entities?

Do we have to implement something like this?: https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.HTMLStripCharFilterFactory
Nicolai linked to a Romanian analyzer in another post and in that same place I found this: https://github.com/apache/lucenenet/tree/master/src/Lucene.Net.Analysis.Common/Analysis/CharFilter

P.S. Just tested on Rapido 3.0... same problem there...
So I would say we need to have HTMLStripCharFilterFactory.cs added to standard DW. Otherwise free-text search for e.g. Danish content made via rich text editors is sort of useless...

Replies

Nicolai Pedersen

Posted on 26/03/2019 17:34:31

You can just HTML-encode the search string, so q=k&oslash;dannelse

Jacob Storgaard Jensen

Posted on 26/03/2019 17:42:51

Ok, and that will also match ø in non-html fields?

Nicolai Pedersen

Posted on 26/03/2019 17:53:57

It depends on how your search has been set up. Generally, all free-text search should search in a summary field and not use the same parameter (e.g. q=) on multiple fields.

So instead of adding multiple expressions using the search term, add multiple fields to the summary.

And maybe make the q parameter a string array and search for both variants: q=sø,s&oslash;
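A minimal sketch of the idea, in standalone Python rather than anything Dynamicweb-specific: build both the raw and the entity-encoded variant of a search term, so the pair can be passed as the array value of q. The function names `encode_entities` and `query_variants` are made up for illustration, and it assumes the editor stores named entities such as &oslash; for non-ASCII characters.

```python
from html.entities import codepoint2name


def encode_entities(text):
    """Replace non-ASCII characters with named HTML entities where one exists."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append("&" + name + ";")
        else:
            out.append(ch)
    return "".join(out)


def query_variants(term):
    """Return the raw term plus its entity-encoded form (hypothetical helper)."""
    encoded = encode_entities(term)
    # Plain ASCII terms have no encoded variant, so only one value is needed.
    return [term] if encoded == term else [term, encoded]


print(",".join(query_variants("sø")))
```

For "sø" this prints sø,s&oslash; and a plain ASCII term comes back unchanged as a single value.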

Jacob Storgaard Jensen

Posted on 26/03/2019 18:40:07

Ok, already used a summary field, and I'll try with the string array, should work for now. Thanks!

But in the end, wouldn't it be prettier to have the entities replaced using a custom field type with an analyzer like the one I linked to (it might remove too much, though), and then add that stripped field to the summary field? That would be a little bit cleaner. But I can't add the analyzer on the solution I'm working on now, that's why I think it could be a nice one to include in default DW :-)

Jacob Storgaard Jensen

Posted on 26/03/2019 19:32:31

Ok, so I tried with the HTML-encoded search string q=f&oslash;rerhund. The word førerhund is only present on one page, but it gave me more or less all pages on the site. Luke actually does the same when I search for f&oslash;rerhund in the index...

Nicolai Pedersen

Posted on 26/03/2019 19:41:55

Hm, probably because the f gets its own term...
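To illustrate what that could look like, here is a rough sketch in Python. It assumes the analyzer splits on anything that is not a letter or digit, which is only an approximation of what the real index analyzer does, and `simple_tokens` is a made-up helper, not a Lucene or Dynamicweb function.

```python
import re


def simple_tokens(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


# The encoded query is broken up at '&' and ';'.
print(simple_tokens("f&oslash;rerhund"))

# The decoded word stays in one piece.
print(simple_tokens("førerhund"))
```

The first call gives ['f', 'oslash', 'rerhund'] and the second gives ['førerhund']. If the query is split like that, the term oslash alone would match every field that contains an encoded ø, which would fit with almost all pages coming back.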

You can also change the editor config to not encode the html:

config.htmlEncodeOutput = false;
config.entities = false;

Also, I've changed the content index builder to HTML-decode text from items when indexing (as it already does with paragraph texts).
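The effect of decoding before indexing can be sketched like this. It is a Python illustration of the principle only, not the actual index builder code, and `text_for_index` is a hypothetical name.

```python
import html


def text_for_index(stored_text):
    """Decode HTML entities so the index holds the real characters."""
    return html.unescape(stored_text)


print(text_for_index("f&oslash;rerhund"))
```

This prints førerhund, so a plain q=førerhund can match the field without the query having to be encoded.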

Jacob Storgaard Jensen

Posted on 26/03/2019 19:53:48

Yep, I was going down that road earlier with the editor config, but all content is already in place on the site I'm working on, so that's not the most viable solution in this case.

Ok nice!! Will that change come in a release soon or a package? :-)

Jacob Storgaard Jensen

Posted on 27/03/2019 12:31:31

Nicolai, it's just so I can tell the client to wait a little bit and use Search Weighted until I can fully implement the freetext search... :-)

Nicolai Pedersen

Posted on 27/03/2019 12:50:36

Will merge for the release next Tuesday. You can use the database search and replace tool to replace all encoded strings with æ, ø and å.

https://doc.dynamicweb.com/documentation-9/platform/advanced-settings/database#4801

Jacob Storgaard Jensen

Posted on 27/03/2019 13:37:10

Cool! Thanks, will just make my templates ready for Tuesday then ;-)

I did know about the search and replace, but right now my editor config just ignores the entity settings somehow, so it will just encode them again when the client edits texts.
UPDATE: just upgraded from 9.5.3 to 9.6.4 which fixed the editor config problem.
