Change token delimiter with the indexing?

Mats Lycken

Posted on 27/04/2015 20:28:48

Hi!
I've started experimenting with the new indexing feature and am doing a proof of concept for a customer to evaluate it. I'm using 8.6.1.5 with the wrap-template.

I would now like to customize how the input strings are broken down into terms. I have a field, manufacturer, that contains the value Northern Star. When indexed, this is separated into two different terms, "Northern" and "Star". This also shows up in the facet for filtering, as two alternatives in the loop, "northern" and "star". (They are lower case for some reason.)

Also, I know the documentation is on its way, but could someone please explain the difference between a field facet and a list facet? Doesn't have to be thorough, just something to give me a basic understanding of what it is.

Could someone also point me to where in the API these features live, so I can have a look at which methods we as developers can call from the outside?

Thanks!

BR,
Mats

Replies

Nicolai Høeg Pedersen

Posted on 27/04/2015 21:38:57

Hi Mats

Please find the preliminary documentation attached.

You are using the schema extender - that will map all fields from the source (Ecommerce) to the index. When doing that, it stores and analyzes all fields. Analyzers will handle how tokens are put into the index, and that is what you want to change. Here is how.

1: First you need to take out the "ManufacturerName" from the schema extender (See dump 1)

2: Then you need to add it again as a field - but this time, uncheck "Analyzed". This way the term will be indexed as is - the full string. (See dump 2)

3: Remember to re-index after this change.

4: Now add the new field to your search and create a parameter for it (See dump 3)

5: Then add it to facets (See dump 4)
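
As a rough illustration of why unchecking "Analyzed" matters, here is a minimal Python sketch. It is not the Dynamicweb or Lucene API: `analyzed_terms` and `not_analyzed_terms` are hypothetical helpers that only approximate what an analyzer does (split into words and lower-case them) compared with indexing the full string as a single term.

```python
import re

def analyzed_terms(value):
    """Rough stand-in for an analyzed field: split into words and lower-case them."""
    return [token.lower() for token in re.findall(r"\w+", value)]

def not_analyzed_terms(value):
    """Stand-in for a field with "Analyzed" unchecked: the full string is one term."""
    return [value]

print(analyzed_terms("Northern Star"))      # ['northern', 'star']
print(not_analyzed_terms("Northern Star"))  # ['Northern Star']
```

The first result matches the two lower-case facet options described in the question; the second is the single option you get once the field is no longer analyzed.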

When running the frontend, you get data like this (I've converted it to JSON):

{
  "name": "Manufacturer facet",
  "optionWithResultCount": "2",
  "optionCount": "2",
  "optionResultTotalCount": "4",
  "queryParameter": "manu",
  "options": [
   {
    "value": "Cube",
    "name": "Cube",
    "count": "3"
   },{
    "value": "Northen Star",
    "name": "Northen Star",
    "count": "1"
   }
  ]
}
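
Conceptually, a facet over a non-analyzed field is a count of documents per whole-string value. The Python sketch below shows that idea only. The product list is made-up sample data and `facet_options` is a hypothetical helper, not how Dynamicweb builds facets internally.

```python
from collections import Counter

# Hypothetical sample data: manufacturer values stored as whole, non-analyzed terms.
products = [
    {"name": "Bike A", "manufacturer": "Cube"},
    {"name": "Bike B", "manufacturer": "Cube"},
    {"name": "Bike C", "manufacturer": "Cube"},
    {"name": "Bike D", "manufacturer": "Northern Star"},
]

def facet_options(docs, field):
    """Count documents per distinct field value and return one option per value."""
    counts = Counter(doc[field] for doc in docs if doc.get(field))
    return [
        {"value": value, "name": value, "count": count}
        for value, count in sorted(counts.items())
    ]

for option in facet_options(products, "manufacturer"):
    print(option)
```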

Hope that works!

BR Nicolai

Indexing_-_second_draft.docx

Nicolai Høeg Pedersen

Posted on 27/04/2015 21:40:58

You'll find the code, by the way, in Dynamicweb.Indexing, Dynamicweb.Querying and Dynamicweb.Repository.

BR Nicolai

Mats Lycken

Posted on 28/04/2015 08:08:37

Great, thanks Nicolai!

It worked exactly as described and my facets look much better now. Another thing I will run into quite soon is that a product has a list of engines that it fits. These engines have names that I would not like to analyze, but I assume I need to do some analyzing to split the individual names into a list of engines.

Engines have names like: V6 231 3,8L 2 bbl. (VIN A) and V8 305 5,0L F.I. (VIN E). I'm thinking I could create a simple CharTokenizer that splits on a specific character and somehow make the builder use that (writing my own Builder?). I'll dive into the documentation and have a look at the API.

Thanks for the info!

BR,
Mats

Nicolai Høeg Pedersen

Posted on 28/04/2015 08:11:11

Hi Mats

It might be easier to just add that engine information to a custom field on the products - and do the 'magic' while importing or something?

BR Nicolai

Mats Lycken

Posted on 28/04/2015 09:18:03

How would I add engine information to the custom field? That sounds more convenient but I couldn't figure out how.

Nicolai Høeg Pedersen

Posted on 28/04/2015 09:20:49

How do you get data in? Importing or typing?

Mats Lycken

Posted on 28/04/2015 09:22:53

I'm running a SQL query to export it from the ERP, then using Data Integration to import it from CSV.

Nicolai Høeg Pedersen

Posted on 28/04/2015 09:41:06

Then as part of the SQL query you use to export, add a field to your output with the manipulated data and map that to a custom field in the product catalog. Then that field will be indexed as well.

Mats Lycken

Posted on 28/04/2015 09:58:10

Wait, I realise now that I totally misread your response. When you wrote engine information I read it as Lucene indexing engine information... hehehe.

I'm adding it as a custom field but I'm not sure how to specify to Lucene that this string should be split up into terms using a special character. When I give it the example input string: @V8 430 7,0L Carb (VIN )@@V8 430 7,0L Carb (VIN )@@V8 430 7,0L Carb (VIN )@ I want that to be translated to an array ["V8 430 7,0L Carb (VIN )", "V8 430 7,0L Carb (VIN )", "V8 430 7,0L Carb (VIN )"]. That way one spare part can fit many different engines, and I can also create a facet for engines.
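
To make the desired behaviour concrete, here is a small Python sketch of the split itself. It only illustrates the transformation, not a Lucene tokenizer, and `split_engines` is a hypothetical helper name.

```python
def split_engines(raw, delimiter="@"):
    """Split a delimiter-wrapped string into whole engine names, dropping empty pieces."""
    return [part.strip() for part in raw.split(delimiter) if part.strip()]

raw = "@V8 430 7,0L Carb (VIN )@@V8 430 7,0L Carb (VIN )@@V8 430 7,0L Carb (VIN )@"
for engine in split_engines(raw):
    print(engine)
```

Each engine name stays intact, including its spaces, commas and parentheses; only the delimiter character separates terms.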

Mats Lycken

Posted on 28/04/2015 17:39:29

Ok, I've made some progress but I had to write a custom IndexProvider and IndexWriter to add special handling to fix the issue with engines being split up.

As a PoC-solution I'm checking EndsWith of each field being added to the document, or when searching, and selecting an appropriate Analyzer using a PerFieldAnalyzerWrapper. I'm using an analyzer that internally uses a CharTokenizer to split up the string into desired chunks.
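
The dispatch logic, roughly, looks like the Python sketch below. This is a plain-Python illustration of the idea, not the actual Lucene.Net code: the `_Engines` suffix, the field names and all function names are made up for the example, and the "standard" analyzer is only approximated by word splitting plus lower-casing.

```python
import re

def standard_like(value):
    """Approximation of a default analyzer: split into words and lower-case them."""
    return [token.lower() for token in re.findall(r"\w+", value)]

def char_split(value, delimiter="@"):
    """Split on a single delimiter character and keep each chunk as a whole term."""
    return [part for part in value.split(delimiter) if part]

# Hypothetical naming convention: fields ending in "_Engines" get the delimiter split.
SUFFIX_ANALYZERS = {"_Engines": char_split}

def analyzer_for(field_name):
    """Pick an analyzer by field-name suffix, falling back to the default one."""
    for suffix, analyzer in SUFFIX_ANALYZERS.items():
        if field_name.endswith(suffix):
            return analyzer
    return standard_like

def analyze(field_name, value):
    """Run the analyzer chosen for this field; use it at index time and at search time."""
    return analyzer_for(field_name)(value)

print(analyze("Product_Engines", "@V6 231 3,8L@@V8 305 5,0L@"))
print(analyze("ProductName", "Northern Star"))
```

The same lookup has to be used both when indexing and when searching, otherwise the query terms will not match the indexed terms.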

A nicer solution than my hack above would be the ability to specify an analyzer on a field-by-field basis when you want something other than the StandardAnalyzer. That could come in handy in several cases, I think.

BR,
Mats

Nicolai Høeg Pedersen

Posted on 28/04/2015 18:06:54

Hi Mats

Yes an analyzer per field is coming up - that would make a lot of sense and is in progress actually.

Thank you for the input.

BR Nicolai
