Indexing PDF Documents with Sitecore Search

Jaya Jha
Apr 15, 2024


A composable tech stack isn’t complete without addressing search. I recently had the opportunity to work with Sitecore Search, and in this blog I’ll discuss a use case that comes up in most projects that rely on search functionality.

Use Case: Unlike regular HTML pages, enabling PDF search in Sitecore Search requires a few additional configuration steps. Let’s walk through them.

Step 1: First, in Sitecore Search, configure a temporary source to reveal the HTML structure of a PDF document. This is needed because document extractors can only parse HTML or JSON. To extract attribute values from PDF content reliably, you first need to know the HTML structure that your PDFs are converted into.

Let me briefly describe the term ‘source’, which is a crucial part of Sitecore Search when you use it to index your content.

Source: A source is where you set up the indexing configuration. When you create one, you specify what content to index, which portions of that content to extract as attributes, and where the content comes from.

The image below shows the screen after logging into the Customer Engagement Console (CEC). Click on the ‘Add Source’ link.

Screen After Login

Step 2: After you click the ‘Add Source’ link, a popup window appears. Fill in the details and click the ‘Save’ button.

adding a source
adding a source detail

Step 3: Once you’ve added a source, it shows up under the ‘Sources’ section. Click ‘Edit’ to open the configuration screen. Because this is a temporary source, you only need to fill in the highlighted areas shown below.

Triggers: In the Triggers section, add a few sample PDF URLs from your website. Choosing PDFs of different types and categories lets you compare their HTML structures.

Trigger Screen

Note: You can add multiple triggers here, each pointing to a different PDF link.

Document Extractor: In this case, we add a JavaScript document extractor to extract the HTML content from the PDF links added as triggers above.

As with HTML content, you can use any type of document extractor to extract PDF content. However, a JavaScript document extractor is generally used for PDFs because it can handle complex use cases, such as text sanitization and conditional logic.

When you click on ‘Content’ under ‘Taggers’, a popup window will appear where you can add the following sample code.

Source: Sitecore documentation
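As a rough idea of what such a temporary extractor can look like, here is a minimal sketch that returns the whole converted document in a single attribute. It assumes the extractor receives a Cheerio-style object in `response.body` and that an attribute named `pdf_to_html` has been defined in your domain. Both the attribute name and the `fakeBody` stub used for the local check are hypothetical, so verify them against your own setup and the Sitecore documentation.

```javascript
// Temporary extractor: dump the HTML that the PDF was converted into.
// Assumption: response.body is a Cheerio-style object exposing .html().
function extract(request, response) {
  const $ = response.body;
  return [
    {
      name: 'PDF to HTML',   // placeholder name, enough for a throwaway source
      pdf_to_html: $.html(), // hypothetical attribute holding the raw HTML
    },
  ];
}

// Local check only: a hand-rolled stand-in for the object the crawler passes in.
const fakeBody = {
  html: () => '<html><head><title>Sample</title></head><body><p>Hello</p></body></html>',
};
const result = extract(
  { url: 'https://www.example.com/files/sample.pdf' },
  { body: fakeBody }
);
console.log(result[0].pdf_to_html);
```

Only the `extract` function would go into the extractor popup; the stub below it exists purely so the sketch can be run outside the CEC.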

Important: The ‘Validate’ button is particularly useful while you write your document extractor code. It lets you run the extractor against a sample PDF link and displays the validation result.

Step 4 — After completing the above configuration, publish and scan the newly created source, then go to the content collection in the CEC portal. Here, it will display a list of the indexed PDF documents.

Click on one of the documents and inspect the ‘PDF To HTML’ field. Here, you can see the HTML structure of a PDF, which will assist you in extracting attributes like the title, description, last modified date, and so forth.

Source: Sitecore documentation

Step 5 — Now that we understand the HTML structure of a PDF through the temporary source, we can configure a new source or update an existing one to enable the PDF indexing feature as well.

Repeat steps 2 to 4. This time, in step 3, please configure all the settings if this is a new source for you. If it is an existing source, add a new document extractor specifically for PDFs under ‘Document Extractor,’ and add the attribute extraction code.

Remember the following considerations when configuring a document extractor for PDF content:

  • To make sure the extraction rules you set apply exclusively to PDFs and not to other content types like HTML, you need to define ‘URLs to Match.’ This ensures the crawler only applies rules to URLs that match the defined pattern.
  • Generally, the glob expression **/*.pdf is sufficient, because ** matches any number of path segments and therefore covers PDFs at any depth.
  • To ensure all PDF documents are correctly marked, it’s recommended to set the ‘type’ attribute to a fixed value of ‘pdf’.
  • To extract other attributes such as the title, description, URL, and parent_url, use the HTML structure of the PDFs that you obtained through the temporary source.
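To make the effect of the **/*.pdf pattern concrete, here is a small sketch of how such a glob could be evaluated against URLs. It is only an illustration of glob semantics, not Sitecore Search’s actual matcher; the `globToRegExp` helper and the example URLs are hypothetical.

```javascript
// Convert a simple glob into a regular expression.
// '**' matches anything, including '/'; a single '*' matches within one path segment.
function globToRegExp(glob) {
  let pattern = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*') {
      if (glob[i + 1] === '*') {
        pattern += '.*';
        i++; // skip the second '*'
      } else {
        pattern += '[^/]*';
      }
    } else {
      // Escape characters that have a special meaning in regular expressions.
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp(`^${pattern}$`);
}

const pdfOnly = globToRegExp('**/*.pdf');

console.log(pdfOnly.test('https://www.example.com/docs/2024/guide.pdf')); // PDF at any depth
console.log(pdfOnly.test('https://www.example.com/docs/guide.html'));     // regular page
console.log(pdfOnly.test('https://www.example.com/guide.pdf?download=1')); // query string after .pdf
```

In this simplified model, a URL with a query string after `.pdf` does not match, so it is worth checking how your real PDF links look before relying on a single pattern.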

Step 6: Crawler Settings for Indexing PDFs

Max Depth: If you want the crawler to locate PDF URLs by following hyperlinks, make sure that MAX DEPTH is set to at least 1. If MAX DEPTH is set to 0, the crawler does not follow any hyperlinks, including links to PDFs.

Note: When a sitemap is used as the trigger, MAX DEPTH defaults to 0.

Allowed Domains: If your PDF files are hosted on a different domain than your HTML pages, make sure to add those domains.
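The interaction between these two settings can be pictured with a toy crawl. The sketch below is a simplified model, not the real Sitecore crawler: the `crawl` function, the link graph, and the domains are all made up for illustration.

```javascript
// Toy breadth-first crawl: start URLs are depth 0, links found on them are depth 1, and so on.
function crawl(startUrls, linkGraph, maxDepth, allowedDomains) {
  const visited = new Set();
  let frontier = [...startUrls];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      if (visited.has(url)) continue;
      visited.add(url);
      for (const link of linkGraph[url] || []) {
        // Only queue links whose host is in the allowed list.
        const host = new URL(link).hostname;
        if (allowedDomains.includes(host) && !visited.has(link)) next.push(link);
      }
    }
    frontier = next;
  }
  return [...visited];
}

// Hypothetical site: the home page links to one HTML page and one PDF on another domain.
const home = 'https://www.example.com/';
const linkGraph = {
  [home]: ['https://www.example.com/about', 'https://cdn.example.com/files/guide.pdf'],
};

const depthZero = crawl([home], linkGraph, 0, ['www.example.com', 'cdn.example.com']);
const sameDomainOnly = crawl([home], linkGraph, 1, ['www.example.com']);
const withPdfDomain = crawl([home], linkGraph, 1, ['www.example.com', 'cdn.example.com']);

console.log(depthZero);      // only the start URL is visited
console.log(sameDomainOnly); // the PDF on the other domain is skipped
console.log(withPdfDomain);  // the PDF is reached
```

In this model the PDF is only reached when the depth is at least 1 and its host is listed as an allowed domain, which mirrors the two settings described above.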

Below is the HTML structure of a sample PDF that I’ve used, along with the document extractor code.

The above code can be modified based on project requirements and attributes specific to the project.
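As an illustration only (not the exact code used in the project), a PDF extractor along these lines might look like the sketch below. It assumes the converted HTML exposes a `<title>` element and `<p>` paragraphs, that `response.body` is a Cheerio-style selector function, and that `request.url` holds the PDF URL; confirm all three against your own ‘PDF To HTML’ output and the Sitecore documentation. The `fakeDollar` helper and the 200-character description limit are hypothetical and exist only so the sketch runs locally.

```javascript
// Sketch of a PDF document extractor.
// Assumptions: response.body behaves like Cheerio's $, request.url is the PDF URL.
function extract(request, response) {
  const $ = response.body;
  const title = $('title').text().trim();
  // Fall back to the file name when the PDF has no title metadata.
  const fileName = decodeURIComponent(request.url.split('/').pop()).replace(/\.pdf$/i, '');
  return [
    {
      type: 'pdf', // fixed value so PDF results can be told apart from pages
      name: title || fileName,
      description: $('p').text().replace(/\s+/g, ' ').trim().slice(0, 200), // arbitrary limit
      url: request.url,
    },
  ];
}

// Local stand-in for $: supports only plain tag selectors and .text().
// Unlike Cheerio, it joins the text of multiple matches with a space.
function fakeDollar(html) {
  const $ = (tag) => ({
    text() {
      const re = new RegExp(`<${tag}(?:\\s[^>]*)?>([\\s\\S]*?)</${tag}>`, 'gi');
      const parts = [];
      let m;
      while ((m = re.exec(html)) !== null) {
        parts.push(m[1].replace(/<[^>]+>/g, '').trim());
      }
      return parts.join(' ');
    },
  });
  $.html = () => html;
  return $;
}

const sampleHtml =
  '<html><head><title>Annual Report</title></head>' +
  '<body><div class="page"><p>First  paragraph.</p><p>Second.</p></div></body></html>';
const docs = extract(
  { url: 'https://www.example.com/files/annual-report.pdf' },
  { body: fakeDollar(sampleHtml) }
);
console.log(docs[0]);
```

Only the `extract` function belongs in the document extractor; the rest is scaffolding for trying the logic on a sample of your own converted HTML.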

Since Sitecore Search is new to many of us, I have tried to explain each important aspect of the configuration, in the hope that it will save time for others who encounter similar use cases.

Happy Learning with Composable Solutions!

