Excluding a page from being indexed: The Sitecore Search Way

Jaya Jha
5 min read · Sep 18, 2024


I recently encountered a use case where I had to prevent a page from being indexed by Sitecore Search. In this post, I’ll explain the methods the Sitecore Search platform provides and the solution that I implemented.

There are several ways to exclude a page from being indexed, but when using a composable tech stack and a content management system, we must also consider the perspective of the content authors. Let’s explore these methods now.

Sitecore Search itself offers two options for excluding pages from being indexed.

1. You can exclude URLs by using the ‘Exclusion Patterns’ in the Web Crawler Settings.

Navigate to Sources > Select the source > Go to Web Crawler Settings > Go to Exclusion Patterns.

Let’s understand what Exclusion Patterns are.

To exclude certain URL patterns from the crawler’s scope, click ‘Add Exclusion Pattern’. Then, in the ‘Type’ drop-down menu, select either ‘Glob expression’ or ‘Regular Expression’. Next, in the ‘Value’ field, input the expression that matches the URLs you want to exclude.

For instance, if you want to prevent the crawler from crawling your search page, enter the following Glob expression:

Regular expression:

Either expression type works for excluding pages from the index. The drawback is that every change has to be made in the Sitecore Search platform itself, so non-technical users who are unfamiliar with it would depend on developers for this task.
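
To make the idea concrete, here is a small sketch of how a regular-expression exclusion behaves. The pattern and the URLs are hypothetical (they are not taken from my configuration), and the platform’s own matching may differ in detail, so treat it as an illustration only.

```javascript
// Hypothetical pattern: match any URL whose path starts with /search
const exclusionPattern = /^https?:\/\/[^/]+\/search(\/|\?|$)/;

// Returns true when a URL would be skipped under this pattern
function isExcluded(url) {
  return exclusionPattern.test(url);
}

console.log(isExcluded('https://www.example.com/search?q=sitecore'));    // true
console.log(isExcluded('https://www.example.com/products/search-tips')); // false
```

Anchoring the pattern to the start of the path keeps it from accidentally excluding pages that merely contain the word ‘search’ further down the URL.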

2. You can specify the URLs you want to exclude in the ‘Document Extractors’ code.

Navigate to Sources > Select the source > Go to Document Extractors > Go to JS Extractor code.

In this approach, we write JavaScript extractor code that defines the URL patterns, or lists the exact URLs, that need to be excluded. However, this still requires a developer to configure it.
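
A minimal sketch of what such an extractor could look like is shown below. The patterns are hypothetical, and I am assuming that `request.url` holds the URL of the page being crawled and that the extractor runtime accepts this syntax, so verify both against your own source before relying on it.

```javascript
// Hypothetical URL patterns to keep out of the index
const EXCLUDED_URL_PATTERNS = [/\/search(\/|\?|$)/, /\/thank-you$/];

// Assumption: request.url is the URL of the page being crawled
function extract(request, response) {
  const $ = response.body;
  const url = request.url;

  // Skip the page when its URL matches any exclusion pattern
  if (EXCLUDED_URL_PATTERNS.some((pattern) => pattern.test(url))) {
    return null;
  }

  return [{
    name: $('title').text() || 'no title',
    type: 'website_content',
    url: url
  }];
}
```

The list of patterns lives in the extractor code, which is exactly why this option still depends on a developer.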

In my use case, I was looking for a solution that leverages Sitecore XM Cloud (XMC), giving authors the flexibility to configure which pages they want to include or exclude, while still benefiting from the features of the Sitecore Search platform.

I’m going to explain the method of combining Sitecore Search and XMC.

In Sitecore XMC, I’ve created a base template for search fields that can be inherited by all page types in the Sitecore content tree. This template includes a checkbox field called ‘Do Not Index’.

The idea behind this implementation is to provide a field available on each page within the CMS, which content authors can use if they decide not to index a page but still want it included in the website’s sitemap.

In our head application, similar to how we handle other meta tags, we can introduce an additional meta tag called ‘do-not-index’, rendered from the ‘Do Not Index’ field. When writing our document extractor code in the Customer Engagement Console (CEC), we can read the value of this meta tag. If the value is set, the extractor returns null instead of a document. This effectively excludes the page from the index, while giving content authors the flexibility to change the setting themselves later.
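
On the head application side, the tag can be produced by a small helper like the one sketched here. The function name, the `doNotIndex` field name, and its plain boolean shape are all assumptions for illustration; the actual field shape depends on how your head application reads page fields from XMC.

```javascript
// Hypothetical helper for the head application.
// `fields` stands in for the page's field values from XMC; the field name
// `doNotIndex` and its boolean shape are assumptions for this sketch.
function buildDoNotIndexMeta(fields) {
  const checked = Boolean(fields && fields.doNotIndex);
  // Emit the tag only when the author has ticked the checkbox
  return checked ? '<meta name="do-not-index" content="1" />' : '';
}

console.log(buildDoNotIndexMeta({ doNotIndex: true }));  // <meta name="do-not-index" content="1" />
console.log(buildDoNotIndexMeta({ doNotIndex: false })); // (empty string)
```

Leaving the tag out entirely when the checkbox is clear keeps the markup of ordinary pages unchanged.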

Here is an example of document extractor code that excludes a URL based on the presence of a ‘Do Not Index’ meta tag:

Navigate to Sources > Select source > Go to Document Extractors > Go to JS Extractor code.

In the document extractor code, we look for the ‘do-not-index’ meta tag in the page’s DOM (Document Object Model). If the tag exists and its content value is 1, we return null, which excludes the page from the index. If the tag is absent, we include the page and return the document from the extractor function.

// Sample extractor function. Change the function to suit your individual needs.
function extract(request, response) {
    // response.body is a jQuery-style selector function over the crawled page
    const $ = response.body;

    // Content of <meta name="do-not-index">, or undefined when the tag is absent
    const doNotIndex = $('meta[name="do-not-index"]').attr('content');

    // A value of 1 means the author ticked 'Do Not Index': drop the page
    if (doNotIndex == 1) {
        return null;
    }

    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text() || $('#content').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text() || 'no title',
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content')
    }];
}

Now, the question arises: how can we ensure that this code is functioning correctly?

Sitecore Search offers a very handy feature on the platform where you can readily validate the JavaScript extractor code.

After clicking ‘Validate’, a window will appear where you can insert the URL you want to use to test the JavaScript code.

When I validated a page that had the flag set, the result showed an ‘excluded’ message, which confused me at first. So, I raised a support case to understand its exact meaning. The support team explained that it means the document is either dropped or not being indexed.

They also raised a feature request to enhance the explanation of this message in future releases to make it more self-explanatory.
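
The exclusion logic can also be sanity-checked outside the platform before pasting it in. The sketch below uses a hand-written stand-in for the `$` function and a trimmed copy of the extractor above; the stub and its test data are hypothetical and only imitate the two calls the logic needs, so it complements the ‘Validate’ feature rather than replacing it.

```javascript
// Minimal stand-in for the jQuery-style `$` the extractor receives.
// `metaByName` maps meta tag names to content values (hypothetical test data).
function makeFakeDom(metaByName, title) {
  return function (selector) {
    const match = /^meta\[name="([^"]+)"\]$/.exec(selector);
    return {
      attr: () => (match ? metaByName[match[1]] : undefined),
      text: () => (selector === 'title' ? title : '')
    };
  };
}

// Trimmed copy of the exclusion logic from the extractor above
function extract(request, response) {
  const $ = response.body;
  if ($('meta[name="do-not-index"]').attr('content') == 1) {
    return null;
  }
  return [{ name: $('title').text() || 'no title', type: 'website_content' }];
}

const hidden = extract({}, { body: makeFakeDom({ 'do-not-index': '1' }, 'Hidden page') });
const visible = extract({}, { body: makeFakeDom({}, 'Visible page') });

console.log(hidden === null);  // true
console.log(visible[0].name);  // Visible page
```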

Besides our traditional approach of using the “noindex”, “nofollow” meta tags, there is also the above-mentioned approach which is more composable and doesn’t require much modification of the code.

With the above approach, we made use of both XMC and Sitecore Search.

Happy Learning with Composable Solutions!


Written by Jaya Jha

I am a full-stack Web Application Developer with extensive experience in the Sitecore ecosystem. Passionate about exploring cutting-edge technologies.
