Microsoft Search Service Dynamicweb Cloud

Morten Snedker

Posted on 14/09/2020 13:41:43

Hi Martin,

Yes, it is possible. But I am not entirely sure that is what you really want.

The documentation in your link points to Weighted Search, and that app is in the winter of its life span (in terms of fixes and features, at least). We strongly encourage using Repository instead. This is where we put our developer efforts, and it will give you far better future support and "safety" than Weighted Search. Weighted Search is easy to set up, but its overhead and performance are clear drawbacks compared to Repository / Lucene.

Remember that you can use the index for content, files and ecommerce alike. So when you ask for Microsoft Search Service, I take it that file search is what is in question here. What specifically do you want to search for that requires Microsoft Search?

/Snedker

Martin Grønbekk Moen

Posted on 14/09/2020 14:01:39

Thank you Morten for your reply.

What we would really like is the ability to index and search the content inside files.
For example, a PDF file should produce a search hit when the search term appears inside the file.

Is this possible at the moment?

Morten Snedker

Posted on 14/09/2020 16:13:46

This post has been marked as an answer

Searching inside PDF files is not possible out of the box. But in the T3 training material we actually build this exact functionality, extracting the text content of PDF files:

using Dynamicweb.Diagnostics.Tracking;
using Dynamicweb.Indexing.Schemas;
using Dynamicweb.Configuration;
using Dynamicweb.Content.Files;
using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;
using iTextSharp;

namespace Dynamicweb.Indexing
{
    public class CustomPDFIndexBuilder : IndexBuilderBase
    {
        // No http context available - getting domain from custom setting. Used for building complete link to file.
        private string Domain = SystemConfiguration.Instance.GetValue("/Globalsettings/Settings/CustomPDFFileIndexer/Domain"); // your-domain.com
        private string StartFolder = FilesAndFolders.GetFilesFolderName();

        /// <summary>
        /// List of supported actions
        /// </summary>
        public override IEnumerable<string> SupportedActions
        {
            get
            {
                return new string[] { "Full", "Update" };
            }
        }
        /// <summary>
        /// Gets default settings collection
        /// </summary>
        public override IDictionary<string, object> DefaultSettings
        {
            get { return new Dictionary<string, object> { { "StartFolder", StartFolder }, { "Domain", Domain } }; }
        }

        /// <summary>
        /// Default constructor
        /// </summary>
        public CustomPDFIndexBuilder()
        {
            Action = "Full";
            Settings = new Dictionary<string, string>();
        }

        /// <summary>
        /// Creates new object using settings data
        /// </summary>
        /// <param name="settings"></param>
        public CustomPDFIndexBuilder(IDictionary<string, string> settings)
        {
            Action = "Full";
            Settings = settings;
        }

        /// <summary>
        /// Gets index builder fields
        /// </summary>
        /// <returns>Set of key-value pairs</returns>        
        public override IEnumerable<FieldDefinitionBase> GetFields()
        {
            FileIndexSchemaExtender extender = new FileIndexSchemaExtender();
            var schemaExtenderFields = extender.GetFields() as List<FieldDefinitionBase>;

            // Add your custom fields
            if (schemaExtenderFields != null)
            {
                schemaExtenderFields.Add(new FieldDefinition() { Name = "Text Content", SystemName = "TextContent", Source = "TextContent", TypeName = "System.String", Group = "PDF Specific", Indexed = true, Analyzed = false, Stored = true });
                schemaExtenderFields.Add(new FieldDefinition() { Name = "Link to file", SystemName = "LinkToFile", Source = "LinkToFile", TypeName = "System.String", Group = "PDF Specific", Indexed = true, Analyzed = false, Stored = true });
            }
            return schemaExtenderFields;
        }

        /// <summary>
        /// Builds the PDF file index
        /// </summary>
        /// <param name="writer"></param>
        /// <param name="tracker"></param>        
        public override void Build(IIndexWriter writer, Tracker tracker)
        {
            string directory = string.Empty;
            tracker.LogInformation("{0} building using {1}", GetType().FullName, writer.GetType().FullName);
            try
            {
                tracker.LogInformation("Opening index writer.");
                writer.Open(false);
                tracker.LogInformation("Opened index writer to overwrite index");

                //load builder settings
                if (Settings.ContainsKey("StartFolder"))
                    StartFolder = Settings["StartFolder"];

                if (Settings.ContainsKey("Domain"))
                    Domain = Settings["Domain"];

                tracker.LogInformation("StartFolder: '{0}'", StartFolder);
                tracker.LogInformation("Domain: '{0}'", Domain);

                if (Action.Equals("Full"))
                {
                    //process files
                    tracker.LogInformation("Starting processing files.");
                    directory = Core.SystemInformation.MapPath("/Files/") + "\\" + StartFolder.Trim(new char[] { '/', '\\' });
                    if (Directory.Exists(directory))
                    {
                        List<string> fileList = FileList(directory, tracker);
                        tracker.Status.TotalCount = fileList.Count();

                        foreach (string file in fileList)
                        {
                            try
                            {
                                FileInfo fileInfo = new FileInfo(file);
                                IndexDocument document = new IndexDocument();
                                document["FileName"] = fileInfo.Name;
                                document["FileFullName"] = fileInfo.FullName;
                                document["LinkToFile"] = LinkToFile(fileInfo.FullName);
                                document["Extension"] = fileInfo.Extension;
                                document["TextContent"] = GetPdfText(fileInfo.FullName, tracker);
                                document["DirectoryFullName"] = fileInfo.DirectoryName;
                                WriteDocument(writer, tracker, document, fileInfo.FullName);
                            }
                            catch (Exception ex)
                            {
                                tracker.LogInformation(string.Format("Failed getting file-info from '{0}'. Failed with exception: {1}", file, ex.Message));
                            }
                        }
                    }
                    tracker.LogInformation("--- Finished processing files ---");
                }
                else
                {
                    //check other actions and handle them
                }
            }
            catch (Exception ex)
            {
                tracker.Fail("Custom index builder experienced a fatal error: ", ex);
            }
        }

        private void WriteDocument(IIndexWriter writer, Tracker tracker, IndexDocument document, string filePath)
        {
            //allow extenders to process the index document
            foreach (var extender in Extenders)
            {
                extender.ExtendDocument(document);
            }
            //write to index
            writer.AddDocument(document);

            tracker.Status.Meta["CurrentFile"] = filePath;
            tracker.IncrementCounter();
        }

        private List<string> FileList(string dir, Tracker tracker)
        {
            // Prepare the final list of PDF files
            string[] files = Directory.GetFiles(dir, "*.pdf", SearchOption.AllDirectories);
            List<string> returnList = new List<string>();

            for (int i = 0; i < files.Length; i++)
            {
                try
                {
                    if (files[i].Length > 260)
                    {
                        tracker.LogInformation(string.Format("Length of full path to file exceeded 260 characters. File ignored: '{0}'", files[i].ToString()));
                    }
                    else
                    {
                        FileInfo fileInfo = new FileInfo(files[i].ToString());
                        if (fileInfo != null)
                            returnList.Add(files[i].ToString());
                    }
                }
                catch (Exception ex)
                {
                    tracker.LogInformation(string.Format("Preparing file list failed with the exception: '{0}'", ex.Message));
                }
            }
            return returnList;
        }

        private string GetPdfText(string InputFile, Tracker tracker)
        {
            string sOut = string.Empty;
            iTextSharp.text.pdf.PdfReader reader = null;
            try
            {
                reader = new iTextSharp.text.pdf.PdfReader(InputFile);
                // iTextSharp page numbers are 1-based, so the last page is NumberOfPages itself.
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy tes = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
                    sOut += iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, i, tes);
                }
            }
            catch (Exception ex)
            {
                tracker.LogInformation(string.Format("iTextSharp failed parsing PDF: '{0}'. Failed with exception: {1}", InputFile, ex.Message));
            }
            finally
            {
                // Release the file handle held by the reader.
                if (reader != null)
                    reader.Close();
            }
            return sOut;
        }

        private string LinkToFile(string File)
        {
            try
            {
                if (Domain == string.Empty)
                    return "";

                string file = File.Substring(File.IndexOf(@"\Files"));
                file = file.Replace(@"\", "/");
                string link = string.Format("https://{0}{1}", Domain, file);
                return link;
            }
            catch (Exception)
            {
                return "";
            }
        }
    }
}
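
To make the LinkToFile step easier to follow, here is a minimal standalone sketch of the same idea: take the physical path, keep everything from the \Files folder onwards, flip the slashes and prepend the domain. The class name FileLinkSketch, the helper BuildLink, and the example path and domain are made up for illustration and are not part of Dynamicweb. Unlike the builder above, this sketch searches for `\Files\` with a trailing backslash and ignores case, which is an assumption on my part to avoid matching folders that merely start with "Files".

```csharp
using System;

public static class FileLinkSketch
{
    // Hypothetical helper (not a Dynamicweb API): turns a physical path under
    // the \Files folder into an absolute https link on the given domain.
    public static string BuildLink(string domain, string physicalPath)
    {
        if (string.IsNullOrEmpty(domain) || string.IsNullOrEmpty(physicalPath))
            return "";

        // Locate the \Files\ segment; everything from there on is the site-relative part.
        int start = physicalPath.IndexOf(@"\Files\", StringComparison.OrdinalIgnoreCase);
        if (start < 0)
            return ""; // the file is not under \Files, so no link can be built

        // Convert Windows separators to URL separators.
        string relative = physicalPath.Substring(start).Replace('\\', '/');
        return string.Format("https://{0}{1}", domain, relative);
    }

    public static void Main()
    {
        // Example path and domain are invented for this sketch.
        Console.WriteLine(BuildLink("your-domain.com", @"C:\Solutions\Demo\Files\Docs\manual.pdf"));
        // prints: https://your-domain.com/Files/Docs/manual.pdf
    }
}
```

Note that the sketch does not URL-encode the path, so file names with spaces or special characters would probably need extra handling before the link is used.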

From Repository it looks like

So it is indeed possible.

BR
Snedker

Votes for this answer: 1
