Did you mean odd suggestions

Nuno Aguiar Dynamicweb Employee

Hi,

 

We have a customer whose users are used to searching for multiple terms at once. Since we introduced the "Did you mean" suggestions feature, it has been recommending some unexpected results, and we're wondering whether anything can be done to improve it, either through a custom implementation or in the core platform.

 

Consider this https://www.screencast.com/t/GGZJEGZbjRI

  • Searching for "mouse" I get good suggestions
  • Searching for "cd8" I get good suggestions (it's a technical term of theirs)
  • Searching for both terms at the same time, I get odd suggestions

 

Best Regards,

Nuno Aguiar


Replies

 
Nicolai Pedersen

The feature is a lookup of terms in the index that are as close as possible to your term and that yield a result.

"mouse" and "cd8" give you 2 lists separately.

The combination of all suggestions for mouse and cd8 is not necessarily available in the index...? So when you search mouse first (which is a term that exists), it will find the next terms that exist in that combination.

 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

In short, we always get results with the suggestions presented, however they are not the most logical (and I haven't figured out how they could be better).

 

I guess in my mind it would recommend something like this

  • Assuming search terms are "A" + "B"
  • Assuming "A" would return
    • A1
    • A2
    • A3
    • A4
    • A5
    • A6
    • A7
    • A8
    • A9
  • Assuming "B" would return
    • B1
    • B2
    • B3
    • B4
    • B5
    • B6
    • B7
    • B8
    • B9
  • Searching for "A B" I would expect recommendations to be
    • A1 B1
    • A1 B2
    • A1 B3
    • A2 B1
    • A2 B2
    • A2 B3
    • A3 B1
    • A3 B2
    • A3 B3
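
The expectation above amounts to pairing the top three suggestions of each term. A minimal sketch, assuming two hypothetical in-memory lists (`suggestionsForA`, `suggestionsForB`) and a made-up helper `CombineTopSuggestions`; none of these names come from the platform, and the sketch does not check whether a combined phrase matches any document in the index:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-term suggestion lists, as if each term had been searched on its own.
var suggestionsForA = new List<string> { "A1", "A2", "A3", "A4" };
var suggestionsForB = new List<string> { "B1", "B2", "B3", "B4" };

// Pair the top `perTerm` suggestions of each term, first term outermost (A1 B1, A1 B2, ...).
static List<string> CombineTopSuggestions(IEnumerable<string> first, IEnumerable<string> second, int perTerm)
{
    var topSecond = second.Take(perTerm).ToList();
    return first.Take(perTerm)
                .SelectMany(a => topSecond.Select(b => a + " " + b))
                .ToList();
}

var combined = CombineTopSuggestions(suggestionsForA, suggestionsForB, 3);
Console.WriteLine(string.Join(", ", combined));
```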

 

I guess what I am saying is that the list of suggestions would be a mix of what we would get if we searched for each term individually, and right now we're getting some other ones.

 

Does that make sense? Sorry I can't be clearer

 

Nuno

 
Nicolai Pedersen

I understand what you expect. But why return a suggestion that will not yield a result?

If document 1 has the term A1 and document 2 has B1, giving you a suggestion of A1 + B1 would give you a suggestion that does not give you any results.

So given A1, what are the best options for B? It might not be B1 or B2, but C3 because it exists in documents with A1 and is the closest match.

In other words - if your index has 100 documents, and A1 is in only 3 of them, suggestions for the term B will be the best match of the terms found in the 3 documents that have the term A1. And not the best terms found in all 100 documents...
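
To make the restriction concrete, here is a minimal sketch over a hypothetical in-memory "index" in which each document is just a set of terms. `CoOccurringTerms` is a made-up helper, not part of Lucene or the platform; it ranks candidates by how many of the matching documents contain them, which is only a stand-in for the real "closest match" logic:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mini-index: a1 and b1 never appear in the same document.
var documents = new List<HashSet<string>>
{
    new HashSet<string> { "a1", "c3", "c4" },
    new HashSet<string> { "a1", "c3" },
    new HashSet<string> { "b1", "b2" },
    new HashSet<string> { "b1", "d9" },
};

// Candidate second terms are taken only from documents that contain the first term,
// ordered by how many of those documents contain the candidate.
static List<string> CoOccurringTerms(List<HashSet<string>> docs, string anchorTerm)
{
    return docs.Where(d => d.Contains(anchorTerm))
               .SelectMany(d => d)
               .Where(t => t != anchorTerm)
               .GroupBy(t => t)
               .OrderByDescending(g => g.Count())
               .ThenBy(g => g.Key, StringComparer.Ordinal)
               .Select(g => g.Key)
               .ToList();
}

var pool = CoOccurringTerms(documents, "a1");
Console.WriteLine(string.Join(", ", pool));
```

With this data the pool for "a1" holds only "c3" and "c4", so "a1 b1" can never be suggested, even though "b1" is a common term in the index as a whole.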

Makes sense?

 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

I agree with you and my suggestion was not well thought through. But from my perspective, the feature is not behaving quite like that (I know these are scientific terms, which makes this harder)

 

Consider a true search for "mouse cd8" like I did in the screencast

  • If you do search for that you'll get this URL https://development.biolegend.com/en-us/search-results?Keywords=mouse+cd8
  • There actually are results that would match 
    • mouse cd80b
    • mouse cd80b.2
  • The point being that suggesting "cd80b" and "cd80b.2" (or even "cd38" and other cd variations) would be more expected recommendations (that yield results) than "ak" and "fitc"

 

Lastly, having a "mouse mouse" suggestion is odd to say the least. I can't think of a case where a duplicate term as suggestion would be a good use case, but I might be wrong.

 

So I'm with you on the logic, but based on the customer's results so far, I don't think the end result is achieving it.

(BTW, if you check the website, the customer temporarily changed the site to use a different field for suggestions. If you find there are grounds to dig deeper with that site/data, I can get some test scenarios back.)

 

Does that make sense?

Nuno Aguiar

 
Nicolai Pedersen

Hi Nuno

Sorry about that. This is just a computer and some algorithms that are provided out of the Lucene library that we utilize.

As with your synonyms you want something completely custom, and I think you should just go for that.

You are more than welcome to get the code and see what you can do with it.

BR Nicolai

 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

Of course :) I'd be happy to get the code and see what I can do with it. I am sure the customer would need something custom, if nothing else because of their non-language terms.

 

Thanks,

Nuno

 
Nicolai Pedersen

Consider the problem

  • One word searches - e.g. "Mouse"
    • Term exists in the index = suggest closest match that is not the term
    • Term does not exist in the index = suggest closest match
  • Two word searches - e.g. "Mouse" "Cd8"
    • Term 1 exists in the index
      • Term 2 does exist in the index
      • Term 2 does not exist in the index
    • Term 1 does not exist in the index
      • Term 2 does exist in the index
      • Term 2 does not exist in the index
  • Three word searches - e.g. "Mouse" "Cd8" "bla"
    • Term 1 exists in the index
      • Term 2 does exist in the index
        • Term 3 does exist
        • Term 3 does not exist
      • Term 2 does not exist in the index
        • Term 3 does exist
        • Term 3 does not exist
    • Term 1 does not exist in the index
      • Term 2 does exist in the index
        • Term 3 does exist
        • Term 3 does not exist
      • Term 2 does not exist in the index
        • Term 3 does exist
        • Term 3 does not exist

In all scenarios return a suggestion that will yield a result.
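
The case tree above doubles with every extra word, since each term either exists in the index or does not. A small sketch, assuming a hypothetical set of indexed terms in place of the real Lucene index (`DescribeCase` and `BranchCount` are illustrative names only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical set of indexed terms standing in for the real index.
var indexedTerms = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "mouse", "cd80b" };

// Describe which branch of the case tree a query falls into: one exists / does-not-exist flag per term.
static string DescribeCase(IEnumerable<string> queryTerms, ISet<string> index)
{
    return string.Join(", ", queryTerms.Select((term, i) =>
        $"term {i + 1} {(index.Contains(term) ? "exists" : "does not exist")}"));
}

// n query terms give 2^n possible branches (2, 4, 8, ...).
static int BranchCount(int termCount) => 1 << termCount;

var branch = DescribeCase(new[] { "Mouse", "Cd8" }, indexedTerms);
Console.WriteLine(branch);
Console.WriteLine(BranchCount(3));
```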

In your case you think it should suggest "mouse cd80b". It might look pretty. But it will give 0 results, so it is not a valid suggestion.

 
Nicolai Pedersen

They do not want computer logic.

They want a list of inputs and suggestions they can apply themselves, because it requires a human brain with a human decision process to give them what they find logical.

Below is the code in question - enjoy :-). The project is attached. Then you can also do your own synonyms.

BR Nicolai

 

    /// <summary>
    /// Represents Lucene spell checker
    /// </summary>
    public class LuceneSpellChecker
    {
        private readonly SpellChecker.Net.Search.Spell.SpellChecker checker;
        private readonly IndexReader indexReader;
        private readonly string indexField;
        private readonly int numberOfSuggestions;
        private bool isIndexed;

        /// <summary>
        /// Constructs new spell checker instance
        /// </summary>
        /// <param name="reader"></param>
        /// <param name="field"></param>
        public LuceneSpellChecker(IndexReader reader, string field)
        {
            indexReader = reader;
            indexField = field;
            //checker = new SpellChecker.Net.Search.Spell.SpellChecker(new RAMDirectory(), new JaroWinklerDistance());
            checker = new SpellChecker.Net.Search.Spell.SpellChecker(new RAMDirectory());
            numberOfSuggestions = Configuration.SystemConfiguration.Instance.GetInt32("/GlobalSettings/System/Repository/LuceneSpellChecker/NumberOfSuggestions");
            if (numberOfSuggestions <= 0)
                numberOfSuggestions = 10;
        }

        private void EnsureIndexed()
        {
            if (!isIndexed)
            {
                checker.IndexDictionary(new LuceneDictionary(indexReader, indexField));
                isIndexed = true;
            }
        }

        /// <summary>
        /// Suggest similar words.
        /// </summary>
        /// <param name="searchString">Word to find alternative suggestions for</param>
        public IEnumerable<string> SuggestSimilar(string searchString)
        {
            return SuggestSimilar(searchString, false);
        }

        /// <summary>
        /// Suggest similar words.
        /// </summary>
        /// <param name="searchString">Word to find alternative suggestions for</param>
        /// <param name="analyzed">If the field that is looked at for suggestions is analyzed</param>
        /// <returns></returns>
        public IEnumerable<string> SuggestSimilar(string searchString, bool analyzed)
        {
            EnsureIndexed();
            if (analyzed)
            {
                searchString = searchString.ToLowerInvariant();
            }

            List<string> searchTerms = new List<string>();
            Analyzer analyzer = new mylucene.Analysis.Standard.StandardAnalyzer(mylucene.Util.Version.LUCENE_30);
            using (var tokenStream = analyzer.TokenStream("inputquery", new System.IO.StringReader(searchString)))
            {
                tokenStream.Reset();
                while (tokenStream.IncrementToken())
                {
                    var termAttr = tokenStream.GetAttribute<ITermAttribute>();
                    searchTerms.Add(termAttr.Term);
                }
            }

            //var searchTerms = searchString.Split(new[] { ' ', '\r', '\n' }, System.StringSplitOptions.RemoveEmptyEntries).ToList();
            int depth = 1;
            bool foundSuggestions = false;
            string combinedWordTermMatch = string.Empty;
            SortedDictionary<int, string> suggestionsCombined = new SortedDictionary<int, string>(); //Our final result - will hold a combination of results for each word in the original search string
            foreach (string searchTerm in searchTerms) //Iterate the single words
            {
                int suggestionsToRequest = numberOfSuggestions;
                if (searchTerms.Count > depth)
                {
                    suggestionsToRequest = 1;
                }
                List<string> singlewordSuggestions;
                if (depth > 1 && depth == searchTerms.Count) //If we have a search string of several words, we will use the first suggestion(s) for first (2nd, 3rd, etc) word in search string, and a list of suggestions for last word
                {
                    singlewordSuggestions = GetTermsSuggestionsFromSearch(suggestionsCombined[0], searchTerm);
                    if (singlewordSuggestions.Count == 0)
                    {
                        string tempCombinedWordTermMatch = string.Empty;
                        singlewordSuggestions = GetTermSuggestions(searchTerm, suggestionsToRequest, string.Empty, out tempCombinedWordTermMatch); //Find suggestions for the singleword
                    }
                }
                else
                {
                    string combinedWord = string.Empty;
                    string tempCombinedWordTermMatch = string.Empty;
                    if (searchTerms.Count > 1)
                    {
                        combinedWord = searchTerms[0] + searchTerms[1];
                        suggestionsToRequest = 10;
                    }
                    singlewordSuggestions = GetTermSuggestions(searchTerm, suggestionsToRequest, combinedWord, out tempCombinedWordTermMatch); //Find suggestions for the singleword
                    if (string.IsNullOrEmpty(combinedWordTermMatch))
                    {
                        combinedWordTermMatch = tempCombinedWordTermMatch;
                    }
                }

                if (singlewordSuggestions == null)
                {
                    singlewordSuggestions = new List<string>();
                }
                if (singlewordSuggestions.Count > 0)
                {
                    foundSuggestions = true;
                }

                //If we do not have a suggestion for place i, use the original word in this place
                var wordToAdd = searchTerm;
                for (var i = 0; i < numberOfSuggestions; i++) //For each expected result, we will add a record to the result
                {
                    if (singlewordSuggestions.Count > i)
                    {
                        if (searchTerms.Count > 1 && depth == 1)
                        {
                            wordToAdd = singlewordSuggestions[0];
                        }
                        else
                        {
                            wordToAdd = singlewordSuggestions[i];
                        }
                        
                    }
                    //Add or update the result with our finding for this single word and insert the right place
                    if (suggestionsCombined.ContainsKey(i))
                    {
                        suggestionsCombined[i] += ' ' + wordToAdd;
                    }
                    else
                    {
                        suggestionsCombined.Add(i, wordToAdd);
                    }
                }
                depth++;
            }
            if (foundSuggestions && suggestionsCombined.Count > 0)
            {
                List<string> result = new List<string>();
                if (!string.IsNullOrEmpty(combinedWordTermMatch))
                {
                    result.Add(combinedWordTermMatch);
                }
                //if (searchTerms.Count > 1)
                //{
                //    string combinedWord = searchTerms[0] + searchTerms[1];
                //    string combinedWordSuggestion = GetTermSuggestions(combinedWord, 1)?.FirstOrDefault();
                //    if (!string.IsNullOrEmpty(combinedWordSuggestion) && combinedWordSuggestion.StartsWith(combinedWord, StringComparison.OrdinalIgnoreCase))
                //    {
                //        result.Add(combinedWordSuggestion);
                //    }
                //}
                foreach (var suggestion in suggestionsCombined)
                {
                    if (!result.Contains(suggestion.Value, StringComparer.OrdinalIgnoreCase))
                    {
                        result.Add(suggestion.Value);
                    }
                }
                return result.Take(numberOfSuggestions);
            }
            else
            {
                return Enumerable.Empty<string>();
            }
        }

        internal List<string> GetTermSuggestions(string word, int neededSuggestions, string combinedTwoWordTerm, out string combinedWordTermMatch)
        {
            combinedWordTermMatch = string.Empty;
            //Terms - find existing terms in the field that starts with the passed word
            List<string> termSuggestions = new List<string>(numberOfSuggestions);
            TermEnum terms = indexReader.Terms(new Term(indexField, word));
            int maxSuggestsCpt = 0;
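            do
            {
                //terms.Term is null once the enumeration is exhausted, and the enumeration continues into other fields - stop in both cases
                if (terms.Term == null || terms.Term.Field != indexField)
                    break;
                var term = terms.Term.Text;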
                if (!string.IsNullOrEmpty(combinedTwoWordTerm) && term.StartsWith(combinedTwoWordTerm, System.StringComparison.OrdinalIgnoreCase))
                {
                    combinedWordTermMatch = term;
                }
                if (term.StartsWith(word, System.StringComparison.OrdinalIgnoreCase))
                {
                    if (!termSuggestions.Contains(term, StringComparer.OrdinalIgnoreCase))
                    {
                        termSuggestions.Add(term);
                        maxSuggestsCpt++;
                    }
                }

                if (maxSuggestsCpt >= neededSuggestions || maxSuggestsCpt == 0) //if maxSuggestsCpt = 0 means that there are no terms in this list starting with the search word - no reason to iterate
                    break;
            }
            while (terms.Next());

            if (termSuggestions.Count >= neededSuggestions || word.Length < 2) //If there are enough suggestions or the word is one letter only
            {
                return termSuggestions;
            }

            //Add suggestions to the list of existing terms
            int missingSuggestions = neededSuggestions - termSuggestions.Count;

            List<string> metrics = GetSimilarSuggestions(word, missingSuggestions);

            foreach (string suggestion in metrics)
            {
                termSuggestions.Add(suggestion);
            }

            return termSuggestions.Distinct().ToList();
        }

        internal List<string> GetSimilarSuggestions(string word, int numberOfSuggestions)
        {
            //Suggestions
            var suggestions = checker.SuggestSimilar(word, numberOfSuggestions, indexReader, indexField, true);
            var jaro = new JaroWinklerDistance();
            var leven = new LevenshteinDistance();
            var ngram = new NGramDistance();

            var metrics = suggestions.Select(s => new
            {
                suggestion = s,
                freq = indexReader.DocFreq(new Term(indexField, s)),
                jaro = jaro.GetDistance(word, s),
                leven = leven.GetDistance(word, s),
                ngram = ngram.GetDistance(word, s)
            })
            .OrderByDescending(metric =>
                (
                    (metric.freq / 10f) +
                    metric.jaro +
                    metric.leven +
                    metric.ngram
                )
                / 4f
            )
             .ToList();
            return metrics.Select(m => m.suggestion).ToList();
        }

        internal List<string> GetTermsSuggestionsFromSearch(string termToSearch, string word)
        {
            List<string> termSuggestions = new List<string>();
            List<string> spellCheckedSuggestions = new List<string>(numberOfSuggestions);
            List<string> fallbacTermSuggestions = new List<string>(numberOfSuggestions);
            Analyzer analyzer = new mylucene.Analysis.Standard.StandardAnalyzer(mylucene.Util.Version.LUCENE_30);
            //QueryParser parser = new QueryParser(mylucene.Util.Version.LUCENE_30, indexField, analyzer);
            var parser = new MultiFieldQueryParser(mylucene.Util.Version.LUCENE_30, new[] { indexField }, analyzer);
            parser.DefaultOperator = QueryParser.Operator.AND;
            var query = parser.Parse(termToSearch);

            var booleanQuery = new BooleanQuery();
            booleanQuery.Add(query, Occur.MUST);
            var filter = new QueryWrapperFilter(booleanQuery);

            bool debugAdded = false;
            using (Searcher searcher = new IndexSearcher(indexReader))
            {
                string spellCheckedWord = string.Empty;
                var spellingSuggestion = GetSimilarSuggestions(word, 1);
                if (spellingSuggestion != null)
                {
                    spellCheckedWord = spellingSuggestion.FirstOrDefault();
                }

                TopScoreDocCollector collector = TopScoreDocCollector.Create(25, true);
                searcher.Search(booleanQuery, filter, collector);
                //TopDocs docs = searcher.Search(query, 10);
                var hits = collector.TopDocs().ScoreDocs;

                for (int i = 0; i < hits.Length; i++)
                {
                    ITermFreqVector vector = indexReader.GetTermFreqVector(hits[i].Doc, indexField);
                    if (vector == null) //No term vector is stored for this document and field, so there are no terms to suggest from
                    {
                        continue;
                    }

                    //Get all terms and sort them by frequency - one document at a time.
                    List<TermFrequency> termFrequencies = new List<TermFrequency>();
                    var termCounts = vector.GetTermFrequencies();
                    int termArrayPointer = 0;
                    foreach (string term in vector.GetTerms())
                    {
                        termFrequencies.Add(new TermFrequency(termCounts[termArrayPointer], term));
                        termArrayPointer++;
                    }
                    if (!debugAdded)
                    {
                        //string terms = string.Empty;
                        //foreach (var term in termFrequencies.OrderByDescending(o => o.Frequency))
                        //{
                        //    terms += $"{term.Term} ({term.Frequency}) ";
                        //}
                        //termSuggestions.Add($"DEBUG (F:{termToSearch} c:{hits.Count()} tf:{termFrequencies.Count} q:{query.ToString()}) {terms}");
                        debugAdded = true;
                    }
                    foreach (var term in termFrequencies.OrderByDescending(o => o.Frequency))
                    {
                        if (term.Term.StartsWith(word, StringComparison.OrdinalIgnoreCase))
                        {
                            termSuggestions.Add(term.Term);
                        }
                        else if (!string.IsNullOrEmpty(spellCheckedWord) && term.Term.StartsWith(spellCheckedWord, StringComparison.OrdinalIgnoreCase))
                        {
                            spellCheckedSuggestions.Add(term.Term);
                        }
                        else
                        {
                            if (fallbacTermSuggestions.Count < numberOfSuggestions)
                            {
                                fallbacTermSuggestions.Add(term.Term);
                            }
                        }
                    }
                    if (termSuggestions.Count > numberOfSuggestions)
                    {
                        break;
                    }
                }
            }

            if (termSuggestions.Count < numberOfSuggestions)
            {
                //Suggestions are missing. Add falback suggestions:
                termSuggestions.AddRange(spellCheckedSuggestions.Take(numberOfSuggestions - termSuggestions.Count));
                termSuggestions.AddRange(fallbacTermSuggestions.Take(numberOfSuggestions - termSuggestions.Count));
            }

            return termSuggestions;
        }

        internal class TermFrequency
        {
            public int Frequency;
            public string Term;
            public TermFrequency(int frequency, string term)
            {
                Frequency = frequency;
                Term = term;
            }
        }
    }
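
One detail of `GetSimilarSuggestions` worth noting: the sort key adds `freq / 10f` to three similarity values. As far as I know, the Lucene string distance classes return a value between 0 and 1 (1 meaning identical), so the three similarity terms together contribute at most 3, while the document frequency term is unbounded. The sketch below repeats that arithmetic with made-up numbers; `Score` is a hypothetical helper, and in the real code the candidates have already been pre-filtered by the spell checker:

```csharp
using System;

// Same arithmetic as the OrderByDescending key in GetSimilarSuggestions.
static float Score(int docFreq, float jaro, float leven, float ngram)
{
    return ((docFreq / 10f) + jaro + leven + ngram) / 4f;
}

// Made-up inputs: a near-identical but rare term versus a dissimilar but common term.
float closeButRare = Score(docFreq: 2, jaro: 0.95f, leven: 0.8f, ngram: 0.8f);
float farButCommon = Score(docFreq: 60, jaro: 0.4f, leven: 0.2f, ngram: 0.2f);

Console.WriteLine($"close but rare: {closeButRare}, far but common: {farButCommon}");
```

Under these assumed numbers the common term ranks first, so on an index with very frequent terms the frequency part can outweigh string similarity. Capping or log-scaling the frequency would be one possible adjustment in a custom version.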
 
Nicolai Pedersen

Hi Nuno

Try the attached dll. I have changed this line

TopScoreDocCollector.Create(25, true);

To

TopScoreDocCollector.Create(300, true);

That might give you better results. But performance is not as good.
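
Why the hit limit matters can be shown with a simplified, in-memory imitation of `GetTermsSuggestionsFromSearch`. Everything here is an assumption for illustration: `rankedHits` is invented data, `SuggestSecondTerm` is a made-up helper, the spell-check bucket is left out, and duplicates are removed up front, which the original does later:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ranked hits for the first term ("mouse"); each entry lists one document's terms, most frequent first.
var rankedHits = new List<string[]>
{
    new[] { "mouse", "fitc", "ak" },
    new[] { "mouse", "ak" },
    new[] { "mouse", "cd80b", "fitc" },
    new[] { "mouse", "cd8a" },
};

// Scan only the first `maxHits` documents: terms starting with the prefix come first, all other terms are fallback.
static List<string> SuggestSecondTerm(List<string[]> hits, string prefix, int maxHits, int wanted)
{
    var matches = new List<string>();
    var fallback = new List<string>();
    foreach (var docTerms in hits.Take(maxHits))
    {
        foreach (var term in docTerms)
        {
            var bucket = term.StartsWith(prefix, StringComparison.OrdinalIgnoreCase) ? matches : fallback;
            if (!bucket.Contains(term)) bucket.Add(term);
        }
    }
    return matches.Concat(fallback).Take(wanted).ToList();
}

var shallow = SuggestSecondTerm(rankedHits, "cd8", maxHits: 2, wanted: 3);
var deep = SuggestSecondTerm(rankedHits, "cd8", maxHits: 4, wanted: 3);
Console.WriteLine("shallow: " + string.Join(", ", shallow));
Console.WriteLine("deep: " + string.Join(", ", deep));
```

With the shallow limit no "cd8" term is seen, so only fallback terms are returned, including the first search term itself. That may be how suggestions such as "mouse mouse", "mouse ak" and "mouse fitc" come about. Scanning more hits surfaces the "cd8" terms, at the cost of reading more term vectors.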

BR Nicolai

 
Nuno Aguiar Dynamicweb Employee

Hi Nicolai,

 

Back from vacation and looking at this now. Yeah, their search, especially being about scientific terms (some of which they come up with themselves, as industry trendsetters), makes it trickier. Sacrificing performance for accuracy should be an acceptable tradeoff. We can always lazy-load the suggestions to work around that issue.

 

I'll give this a try and let you know the results.

 

Thanks,

Nuno Aguiar

 

 
Nuno Aguiar Dynamicweb Employee

That shift from 25 to 300 was brutal. Please be careful, you're starting to sound like me when I was building v1 of this customer's website :)

 
