Indexes

Dynamicweb 9 features a lightning fast generalized search framework, usually referred to as New Indexing, which can be used to index and search through all kinds of data on your solution – products, content, users, etc.

Broadly speaking, New Indexing consists of the following elements:

  • Indexes – which are data structures optimized for data retrieval
  • Queries – which are used to retrieve data based on criteria you define
  • Facets – which are used to create filters in frontend
  • Tasks – which are used to rebuild indexes at an interval

This article will tell you more about Indexes.

An index is a data structure optimized for data retrieval operations – which means that querying it is much faster than searching through each row in the database whenever a table is accessed.
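As a rough illustration of why an index is faster than scanning rows, the sketch below compares the two approaches on a tiny, made-up data set. It is a toy inverted index in Python, not how Lucene or Dynamicweb store data; the rows and function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical rows standing in for a database table.
rows = [
    {"id": 1, "name": "red mountain bike"},
    {"id": 2, "name": "blue road bike"},
    {"id": 3, "name": "red helmet"},
]

def scan(rows, word):
    """Check every row on every search (no index)."""
    return [row["id"] for row in rows if word in row["name"].split()]

def build_index(rows):
    """Build a word -> row ids lookup once, ahead of time."""
    index = defaultdict(list)
    for row in rows:
        for word in row["name"].split():
            index[word].append(row["id"])
    return index

index = build_index(rows)
print(scan(rows, "red"))      # [1, 3]
print(index.get("red", []))   # [1, 3]
```

Both calls return the same rows, but the indexed lookup does not have to visit every row each time a search is made.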

To create an index:

  • Click Add index in the repository toolbar (Figure 2.1)
  • Name it
  • Click OK
Figure 2.1 Creating an index

This will open the index configuration page (Figure 2.2) from which you can add and configure the various index components.

Figure 2.2 The index configuration page

An index consists of the following components:

  • Instances – the physical data structures which are queried
  • Build configurations – a set of instructions for retrieving data from Dynamicweb and building an instance
  • Field definitions – a set of instructions detailing what goes in the index and how it should be stored
  • (Optional) Field types – custom field types can be used when you need to analyze data in a non-standard manner.

Read about the components below.

Instances are used to specify a location and an IndexProvider, which will create the index files at the location.

By default, Dynamicweb uses Lucene 3.0.3 and comes with a LuceneIndexProvider which builds index files in a folder in the file archive, e.g. /Files/System/Indexes/YourIndexName/YourInstanceName.

To create an instance:

  • Click Add instance on the index configuration page to open the Instance configuration (Figure 3.1)
  • Provide a name
  • Select a provider – this step is easy, as we supply only one provider out of the box
  • Specify a folder
  • Click OK
Figure 3.1 Creating an instance

Once created, an instance will look like the Lucene B index in Figure 3.2 – this is because it has not been built yet. Once it has been built it will look like the Lucene A index.

Figure 3.2 Instance configuration page

An instance only contains the data which existed at the time it was built. This means that your instances must be periodically rebuilt to include new data - and that you should always have more than one instance defined, since indexes cannot be queried when they are being built.

You can create scheduled tasks for automatically rebuilding your instances at an interval – see details below – and these tasks will build the instances sequentially, and will not rebuild the last instance if the previous instance failed to build correctly. This means that you will never be without an instance to query – even if something goes wrong during the build process.
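The sequential behaviour described above can be sketched as follows. This is a simplified Python model, assuming that a build either succeeds or fails as a whole; the function and the `build` callable are placeholders and not part of the Dynamicweb API.

```python
def rebuild_sequentially(instances, build):
    """Build instances in order and stop at the first failure.

    `instances` is a list of instance names and `build` is a callable
    returning True on success. Both are placeholders for this sketch.
    Returns the names that were rebuilt successfully.
    """
    rebuilt = []
    for name in instances:
        if not build(name):
            # Leave the remaining instances untouched so they can
            # still be queried with their previous data.
            break
        rebuilt.append(name)
    return rebuilt

# Simulate a failure on instance "A": "B" is never touched.
print(rebuild_sequentially(["A", "B"], lambda name: name != "A"))  # []
```

Because the loop stops at the first failure, the instance after the failed one keeps its previous data and remains available for queries.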

If you have more than three instances defined, you can choose between two different methods for selecting an alternative index when the primary index is being rebuilt:

  • ActivePassive mode selects the next active instance available on the list. So if instance A is unavailable (because it is being built or has failed to build), instance B will be used unless it is also unavailable, in which case instance C will be used, and so forth
  • LastUpdated mode selects the most recently built index and uses that

ActivePassive mode is used by default – to change to LastUpdated mode:

  • Click Balancer in the ribbon bar (Figure 4.1)
  • Use the dropdown to select the LastUpdated balancer
  • Click OK
Figure 4.1 Choosing a balancing mode
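The difference between the two balancing modes can be sketched as below. This is a Python model of the selection rules only; the `Instance` class is invented for the example, and the assumption that LastUpdated skips unavailable instances is ours rather than something stated in this article.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Instance:
    name: str
    available: bool          # False while building or after a failed build
    last_built: Optional[datetime] = None

def active_passive(instances: List[Instance]) -> Optional[Instance]:
    """Return the first available instance in list order."""
    for instance in instances:
        if instance.available:
            return instance
    return None

def last_updated(instances: List[Instance]) -> Optional[Instance]:
    """Return the available instance with the most recent build time."""
    candidates = [i for i in instances if i.available and i.last_built]
    return max(candidates, key=lambda i: i.last_built, default=None)

instances = [
    Instance("A", available=False, last_built=datetime(2024, 1, 3)),
    Instance("B", available=True, last_built=datetime(2024, 1, 1)),
    Instance("C", available=True, last_built=datetime(2024, 1, 2)),
]
print(active_passive(instances).name)  # B
print(last_updated(instances).name)    # C
```

With instance A unavailable, ActivePassive falls through to B because it comes next in the list, while LastUpdated picks C because it was built more recently than B.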

A build is a set of instructions for retrieving data from Dynamicweb and delivering it to the IndexProvider on an instance, which will then build the physical index files.

To create a build configuration:

  • Click Add build on the index configuration page to open the build configuration dialog
  • Provide a name
  • Select a builder
  • (Optional) Configure the settings exposed by the builder
  • (Optional) Set up notifications on run or failure
  • Click OK

See below for a detailed look at the builders provided by Dynamicweb.

The ProductIndexBuilder is used to index products. It indexes data from multiple Ecommerce data tables, calculating group hierarchies and more – so the index contains all products fields, variant group fields, custom fields, category fields, stock location fields and a number of generated fields.

Figure 6.1 The ProductIndexBuilder

The ProductIndexBuilder supports the following builder actions:

  • Full builds everything from scratch
  • Update rebuilds only the products which have been edited within the timespan which falls between the current time and the HoursToUpdate setting
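The time window used by the Update action can be sketched like this. It is a minimal Python illustration; the product dictionaries and the `updated` key are invented for the example and do not reflect the actual Ecommerce table columns.

```python
from datetime import datetime, timedelta

def products_to_update(products, hours_to_update, now):
    """Return products edited within the last `hours_to_update` hours."""
    cutoff = now - timedelta(hours=hours_to_update)
    return [p for p in products if p["updated"] >= cutoff]

now = datetime(2024, 1, 10, 12, 0)
products = [
    {"id": "P1", "updated": datetime(2024, 1, 10, 9, 0)},   # 3 hours ago
    {"id": "P2", "updated": datetime(2024, 1, 8, 12, 0)},   # 48 hours ago
]
print([p["id"] for p in products_to_update(products, 24, now)])  # ['P1']
```

With HoursToUpdate set to 24, only the product edited three hours ago falls inside the window.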

The following settings can be configured:

| Setting | Value | Comments |
| --- | --- | --- |
| EmptyStringReplacement | String – default is an empty string | NULL values are not indexed by Lucene, so to be able to locate an empty field you need to index it with a dummy value – this dummy value can be specified here. |
| BulkSize | Integer – default is 500 | The number of products being built at a time |
| OnlyIndexActiveProducts | Boolean – defaults to False | If set to True, only active products are indexed |
| MaxProductsToIndex | Integer – default is 2147483647 | The maximum number of products to index |
| SkipGrouping | Boolean – defaults to False | If set to True, the fields "GroupIDs", "ShopIDs", "GroupNames", "GroupNumbers", "GroupDescriptions", "PrimaryGroupSort", "ParentGroupIDs", and "ParentGroupNames" are skipped |
| SkipExtenders | Boolean – defaults to False | If set to True, no custom Extenders can extend (update, remove, add) the fields in the index |
| SkipAllExtendedFields | Boolean – defaults to False | If set to True, the fields "CampaignStartTime", "CampaignEndTime", "CampaignShowProductsAfterExpiration", "IsVariant", "ManufacturerName", "AssortmentIDs", and "StockLocationProductAvailable" are skipped |
| SkipCategoryFields | Boolean – defaults to False | If set to True, all product category fields are skipped |
| DoNotStoreDefaultFields | Boolean – defaults to False | If True, schema extender fields are not set to stored by default |
| DoNotAnalyzeDefaultFields | Boolean – defaults to False | If True, schema extender fields are not set to analyzed by default |
| HoursToUpdate | Integer – not set by default | If combined with the builder action Update, only the products updated within the hours specified here are rebuilt |
| HandleInheritedCategoryValues | Boolean – defaults to False | If True, inherited product category values are indexed. This is very slow, so please don't set this to True unless you really need to. |
| SkipImages | Boolean – defaults to False | If True, image paths are not indexed |
| DoNotFailOnMismatchingProductCount | Boolean – defaults to False | If True, building an index will not fail even if the product count before indexing and after indexing is different. This may be desirable if an import job happens while the index is being built. |
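Two of the settings above, EmptyStringReplacement and BulkSize, are easiest to understand with a small example. The sketch below is a Python illustration of the two ideas (substituting a dummy value for empty fields and processing products in batches); the helper names and the `__empty__` value are invented, and the real builder may work differently internally.

```python
def replace_empty(value, replacement):
    """Swap None or empty strings for a searchable dummy value."""
    return replacement if value is None or value == "" else value

def in_batches(items, bulk_size=500):
    """Yield successive slices of at most `bulk_size` items."""
    for start in range(0, len(items), bulk_size):
        yield items[start:start + bulk_size]

print(replace_empty(None, "__empty__"))                  # __empty__
print([len(b) for b in in_batches(list(range(1200)))])   # [500, 500, 200]
```

A query can then look for the dummy value to find products where the field is empty, which would otherwise be impossible because the field would not be in the index at all.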

 

The ContentIndexBuilder is used for indexing content – pages, their paragraphs, and their item fields.

The index is built by enumerating all available pages, then handling active paragraphs & item fields for each page.

The corresponding schema extender – the ContentIndexSchemaExtender – contains the following types of fields:

  • All fields from the Page table – e.g. PageActive, PageID, PageItemType, etc.
  • A number of Page content fields:
    • Paragraph headers contains an array of all paragraph headers on a page
    • Paragraph texts contains an array of all paragraph text content on a page
    • Paragraph content contains an array of the item type properties for each item-based paragraph on a page
    • Page property item type contains the name of the item type used to extend the page properties of this page (if relevant)
  • All item type fields in the format [item.SystemName]_[itemField.SystemName] and Property_[item.SystemName]_[itemField.SystemName], except the fields marked as 'do not include in search' in the item field settings.
  • Possibly a number of App fields – see more below.
Figure 7.1 The ContentIndexBuilder
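The naming format for item type fields can be illustrated with a short Python sketch. The item type "Article" and its field "Title" are made-up examples, and the sketch simply formats both patterns; this article does not state when each variant is produced.

```python
def item_field_names(item_system_name, field_system_names):
    """Format both index field name patterns for each item field."""
    names = []
    for field in field_system_names:
        names.append(f"{item_system_name}_{field}")
        names.append(f"Property_{item_system_name}_{field}")
    return names

print(item_field_names("Article", ["Title"]))
# ['Article_Title', 'Property_Article_Title']
```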

The following settings are available:

  • ExcludeItemsFromIndex allows you to control whether or not item-based content should be indexed. False by default – which means item content IS indexed.
  • AppsToHandle allows you to specify exactly which ContentAppIndexProviders to include. Valid input is a comma-separated list of ContentAppIndexProviders to include. If nothing is set here, all ContentAppIndexProviders are included.
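The AppsToHandle rule (a comma-separated list, or everything when empty) can be sketched as below. This is a Python illustration only; "MyCustomIndexProvider" is a hypothetical name, and the exact identifiers the setting expects are an assumption here.

```python
def select_providers(apps_to_handle, available):
    """Return the providers to run for a given AppsToHandle value."""
    wanted = [name.strip() for name in apps_to_handle.split(",") if name.strip()]
    if not wanted:
        return list(available)   # nothing set: include everything
    return [name for name in available if name in wanted]

available = ["ContentForumIndexProvider", "MyCustomIndexProvider"]
print(select_providers("", available))
# ['ContentForumIndexProvider', 'MyCustomIndexProvider']
print(select_providers("ContentForumIndexProvider", available))
# ['ContentForumIndexProvider']
```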

By default, we deliver a ContentAppIndexProvider for the forum – see the API documentation for the ContentForumIndexProvider for details.

Due to complexity issues, the ItemListEditor and ItemRelationListEditor field types are not indexed.

Creating a custom ContentAppIndexProvider

If you want to extend the content index with app-specific fields or documents, you must create a class inheriting from the ContentAppIndexProvider class and override the relevant methods.

As an example, please take a look at the ContentForumIndexProvider class.

The SqlIndexBuilder is used to index a table from the SQL Server database – it executes a query without manipulating any data. It currently only understands the builder action Execute.

Figure 8.1 The SqlIndexBuilder

The following settings are available:

  • Connection String can contain an SQL connection string, e.g. “Server=.;Database=test;User Id=sa;Password=sa;”
  • Query can contain an SQL query which retrieves the columns and rows which should be indexed, e.g. “SELECT * FROM AccessUser”
  • Query to get count can contain an SQL query which returns a count of the rows being added to the index, e.g. “SELECT COUNT(*) FROM AccessUser”
  • UseStoredProcedure – a setting which can be set in the index XML config file. When set to True, the stored procedure name must be set in the “Query” setting which will be executed when running the builder to index the data.
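To show how the Query and Query to get count settings relate to each other, here is a small Python sketch. It uses an in-memory SQLite database as a stand-in for SQL Server, and the two columns in the table are simplified for the example; it is not the builder's actual implementation.

```python
import sqlite3

# An in-memory SQLite database stands in for the SQL Server database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE AccessUser (AccessUserID INTEGER, AccessUserName TEXT)")
connection.executemany(
    "INSERT INTO AccessUser VALUES (?, ?)",
    [(1, "Anna"), (2, "Bo")],
)

query = "SELECT * FROM AccessUser"
count_query = "SELECT COUNT(*) FROM AccessUser"

# The count query tells us how many rows to expect.
expected = connection.execute(count_query).fetchone()[0]

# Each row returned by the main query becomes one document,
# with one field per selected column.
cursor = connection.execute(query)
columns = [description[0] for description in cursor.description]
documents = [dict(zip(columns, row)) for row in cursor.fetchall()]

print(expected, len(documents))  # 2 2
print(documents[0])              # {'AccessUserID': 1, 'AccessUserName': 'Anna'}
```

Every column selected by the query becomes available as a source field, so selecting only the columns you need keeps the index smaller.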

The UserIndexBuilder indexes all fields on users, including custom fields (but not user behavior, like orders placed, or order value, etc.).

It executes the following query to retrieve users: 

"SELECT * FROM AccessUser WHERE AccessUserType in (1, 3, 5)"

It uses no settings and only understands the builder action Full (Figure 9.2).

Figure 9.2 The UserIndexBuilder

In addition to the standard user fields, the index contains the following generated fields for each user:

| Field | Field content |
| --- | --- |
| Groups | An array of user group IDs where the user is a member |
| GroupNames | An array of user group names where the user is a member |
| Is Admin | True if System Administrator or Administrator |
| Combined order totals | The sum of Order Price with VAT from orders completed by this user |
| Largest order price | Largest Order Price with VAT entry associated with this user |
| Order count for last 30 days | A count of completed orders associated with this user within the last 30 days |
| Bought products | An array of product IDs from orders completed by this user |
| Loyalty points total | The sum of LoyaltyUserTransactionPoints from EcomLoyaltyUserTransaction associated with this user |
| Loyalty point last added | A DateTime entry of the last time loyalty points were added to the user |
| Loyalty point next expirery | The date of the user's oldest loyalty point transaction plus the period set in the global setting /Globalsettings/Ecom/LoyaltyPoints/ExpirationPeriodInMonths |

 

The FileIndexBuilder (Figure 10.1) indexes various data about the files in the file system – NOT the content of the files. This can be used to create e.g. a searchable media library for images, pdf files, etc.

The following standard data is indexed:

  • File name
  • Directory path (/Files/whatever/Folder/OtherFolder/)
  • Directory (OtherFolder)
  • ParentDirectory (Folder)
  • RootDirectory (Files)
  • Extension (e.g. jpg, png, txt)
  • Filesize in bytes
  • LastWriteTime

The following fields are generated:

  • FileFullName - file path and name
  • Date created time/Date created time UTC
  • Last access time/Last access time UTC
  • Last write time UTC
  • Is read only

We also index metadata (EXIF, XMP, and IPTC) for certain types of (image) files.

Currently, we can index metadata for the following file formats: .pdf, .gif, .jpg, .jpeg, .psd, .bmp, .png, .tiff, .tif, and .ai.

Figure 10.1 The FileIndexBuilder

The following settings can be used to tweak the builder behavior:

  • Recursive can contain a Boolean value, and controls whether subfolder content is indexed. Defaults to True.
  • StartFolder contains the path to a folder, defaults to /Files.
  • SkipMetadata contains a Boolean value, and controls whether metadata (EXIF, XMP, and IPTC) on image files is indexed. 
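The kind of data collected per file, and the effect of the Recursive and StartFolder settings, can be sketched in Python as follows. The dictionary keys are illustrative labels rather than the exact index field names, and the sketch does not read image metadata.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

def describe_file(path: Path) -> dict:
    """Collect name, location, size and timestamps for one file."""
    stat = path.stat()
    return {
        "FileName": path.name,
        "DirectoryPath": str(path.parent),
        "Directory": path.parent.name,
        "ParentDirectory": path.parent.parent.name,
        "Extension": path.suffix.lstrip("."),
        "FileSize": stat.st_size,
        "LastWriteTimeUtc": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        "IsReadOnly": not os.access(path, os.W_OK),
    }

def index_files(start_folder: str, recursive: bool = True) -> list:
    """Describe every file under start_folder, optionally with subfolders."""
    root = Path(start_folder)
    paths = root.rglob("*") if recursive else root.glob("*")
    return [describe_file(p) for p in paths if p.is_file()]
```

Calling `index_files` with `recursive=False` only describes files placed directly in the start folder, which mirrors setting Recursive to False.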

Fields are mappings between the data retrieved by the builder and the index – a set of instructions detailing which fields should be added to the index and how they should be stored.

To make things easier for you, we’ve created schema extenders for products, content and users – these are predefined sets of field mappings with everything defined for you.

To use a schema extender:

  • Click Add field on the index configuration page to open the build configuration overlay (Figure 11.1)
  • Select the schema extender field type
  • Select the appropriate schema extender: FileIndexSchemaExtender, ContentIndexSchemaExtender, ProductIndexSchemaExtender, or UserIndexSchemaExtender
  • Click OK
  • Save
Figure 11.1 Using the Schema Extender

Once you’ve saved the index, you will see a list of fields provided by the schema extender in question, e.g. the fields provided by the ProductIndexSchemaExtender in Figure 11.2.

Figure 11.2 The ProductIndexSchemaExtender

Now, the schema extender naturally makes some choices on your behalf – that’s the tradeoff with a predefined set.

Here are some headlines:

  • All string type fields are analyzed by default which means that spaces are considered a divider (which in turn makes it possible to conduct free-text searches on the data).
  • The fields cannot be assigned a custom boost value

If this behavior is a problem for your setup – which it often will be – you can exclude fields from the schema extender, and then add them manually and with the settings matching your needs.

To exclude a field from the schema extender:

  • Click the schema extender in the Fields area to open the settings (Figure 11.3)
  • Under Excluded fields click Add
  • Select the field(s) you want to exclude
  • Click OK
  • Save

The field will now be excluded the next time the index is built.

Figure 11.3 Excluded fields in the Schema Extender

If you want more control over the way your fields are indexed, you can add them manually – this is especially useful when you need to sort on, create facets from, or boost data from particular fields.

You can add three different types of fields:

  • A standard Field maps a single field from the builder to a corresponding field in the index
  • A Summary field maps data from more than one source field to a single destination field, converts the data to text and splits on whitespace. This is typically used to implement free-text search
  • A Grouping field allows you to group values together under a common label – this can be used to create facet-friendly index entries from values not inherently suited for faceting

When fields are added manually, you have access to the following settings:

  • Custom boost values – a factor multiplied onto the base relevance of the field (as automatically calculated by Lucene)
  • The storage setting, which details how the data is added to the index:
    • A stored field has its values stored in the index – this is necessary if you want to either store data from an external data source in the index, or publish data using the Query Publisher. 
    • An indexed field is made searchable, and the value stored as a single value.
    • An analyzed field is run through an analyzer and split into tokens – this means you can search for and find individual tokens (words). The field must also be indexed. To perform wildcard search, the search field must NOT be analyzed.

Generally speaking, a field you want to query or show in frontend should be indexed, and a field you want to be able to free-text search on should be analyzed. If you want to publish a field using the query publisher, it must also be stored. Fields you want to display as facets should be indexed but not analyzed.

To create any of the field types manually, click Add field on the index configuration page, select the field type you want to add, and configure the settings associated with the field type in question, as outlined below.

Stop words

By default, we use the Lucene StandardAnalyzer to analyze and tokenize data – and the StandardAnalyzer has a built-in list of stop words, which are words that will not be indexed (and therefore cannot be searched for).

The default stop words are: “a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”, “for”, “if”, “in”, “into”, “is”, “it”, “no”, “not”, “of”, “on”, “or”, “such”, “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to”, “was”, “will”, “with”.

To override the default list of stop words, place a file called stopwords.txt in the /Files/System/Repositories/ folder – the format is one word on each line:

a
an
and
are
as
etc.
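The effect of a custom stop word list can be illustrated with a short Python sketch. The tokenizer here simply lowercases and splits on whitespace, which is far simpler than the Lucene StandardAnalyzer, and the function names are invented for the example.

```python
def load_stop_words(text):
    """Parse stop word file content: one word per line."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def tokenize(value, stop_words):
    """Lowercase, split on whitespace and drop stop words."""
    return [t for t in value.lower().split() if t not in stop_words]

stop_words = load_stop_words("a\nan\nand\nare\nas\n")
print(tokenize("Bikes and helmets are in stock", stop_words))
# ['bikes', 'helmets', 'in', 'stock']
```

Note that "in" survives here because the custom list in the example does not contain it, even though it is on the default list; a custom file replaces the defaults rather than adding to them.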

Wildcard search

In order to use wildcard search (? and *), it is important that the affected search fields are NOT analyzed. For example, with a query parameter "q" used to search for names, "q=J%3Fnsen" would find both Jansen and Jensen (or any other name that matches with a different second character).
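The wildcard example above can be reproduced with a small Python sketch. It uses `fnmatchcase` from the standard library as a stand-in for the index's wildcard matching (the two are not identical in every detail), and the list of names is made up.

```python
from fnmatch import fnmatchcase
from urllib.parse import unquote

names = ["Jansen", "Jensen", "Johansen"]

pattern = unquote("J%3Fnsen")   # URL-decoded query value
print(pattern)                                        # J?nsen
print([n for n in names if fnmatchcase(n, pattern)])  # ['Jansen', 'Jensen']
print([n for n in names if fnmatchcase(n, "J*sen")])  # ['Jansen', 'Jensen', 'Johansen']
```

The `?` wildcard stands for exactly one character, while `*` stands for any number of characters, which is why only the second pattern also finds Johansen.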

A standard Field maps a single field from the builder or an external source to a field in the index (Figure 13.1).

Figure 13.1 A standard Field

To create a standard Field:

  • Provide a Name and a System name for the field in the index
  • Select a source of data – if mapping to a custom source from a builder extender, use the plus icon and enter a source name matching the name in the code
  • In the settings area:
    • Select the data type you want the data to be indexed as – please ensure that the source and destination data types are compatible
    • (optional) Provide a custom boost value
    • Check Stored, Indexed, and/or Analyzed as appropriate to your needs

A Summary field maps data from more than one source field to a single destination field, converts the data to text and splits on whitespace (Figure 14.1).

Figure 14.1 A Summary field

To create a Summary field:

  • Provide a Name and a System name for the field in the index
  • In the settings area:
    •  (optional) Provide a custom boost value
    • Check Stored, Indexed, and/or Analyzed as appropriate to your needs
  • Select a number of sources

For a summary field, you do not need to select a data type – the data added to a summary field is always aggregated, converted to a string and split on whitespace.
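What a Summary field does with its sources can be sketched in a few lines of Python. The product dictionary and field names are invented for the example, and the real field is additionally run through an analyzer if Analyzed is checked.

```python
def summary_tokens(document, sources):
    """Join the chosen source fields as text and split on whitespace."""
    parts = [str(document[name]) for name in sources if document.get(name) is not None]
    return " ".join(parts).split()

product = {"Name": "Mountain bike", "Number": 10042, "Description": "Light alloy frame"}
print(summary_tokens(product, ["Name", "Number", "Description"]))
# ['Mountain', 'bike', '10042', 'Light', 'alloy', 'frame']
```

The numeric product number ends up as the text token "10042", which is why no data type needs to be selected for a Summary field.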

A Grouping field allows you to group values together under a common label – this can be used to create facet-friendly index entries from values not inherently suited for faceting (Figure 15.1).

Figure 15.1 A Grouping field

To create a Grouping field:

  • Provide a Name and a System name for the field in the index
  • Select a source of data – if mapping to a custom source from a builder extender, use the plus icon and enter a source name matching the name in the code
  • In the settings area:
    • Select the data type you want the data to be indexed as – please ensure that the source and destination data types are compatible
    •  (optional) Provide a custom boost value
    • Check Stored, Indexed, and/or Analyzed as appropriate to your needs

Finally, add a number of groups and specify the values contained therein – you will then be able to create facets on the group names from frontend.
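The idea behind a Grouping field can be sketched as a lookup from raw values to group labels. The Python example below uses made-up colour groups; how Dynamicweb handles values that match no group is not covered here, so the `default` argument is an assumption of the sketch.

```python
def group_label(value, groups, default=None):
    """Return the label of the first group containing `value`."""
    for label, values in groups.items():
        if value in values:
            return label
    return default

# Hypothetical groups: colour names collapsed into broader facet labels.
groups = {
    "Red tones": {"crimson", "scarlet", "ruby"},
    "Blue tones": {"navy", "azure"},
}
print(group_label("scarlet", groups))  # Red tones
print(group_label("olive", groups))    # None
```

A facet built on the group labels would then show two options ("Red tones" and "Blue tones") instead of one option per individual colour.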

Standard Field type fields use the Lucene StandardAnalyzer to analyze and tokenize the source data and add it to the index – and this is perfectly adequate for most applications. However, in some cases you need to do more advanced stuff to the source data before indexing it.

In those cases, you can create custom field types and select one or more analyzers:

  • Click Add field type on the index configuration page to open the build configuration dialog (Figure 16.1)
  • Provide a name for the field type
  • Select a data type for the field type and optionally a custom boost value
  • Select one or more analyzers to run the data through
Figure 16.1 A custom field type

Once a custom field type has been created, you can use it by adding a standard Field type field and selecting the custom field type using the Type-dropdown (Figure 16.2).

Figure 16.2 Using the custom field type

On Analyzers

The Lucene StandardAnalyzer, which we use by default, has the following behavior:

  • Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

For information on the other analyzers available, please refer to the Lucene documentation.