Validating Sitecore Search Document Extractor Code

Jul 31, 2024

In Sitecore Search, we use document extractor code to pull information out of a document so that it can be indexed. This code is written in JavaScript and uses Cheerio, a library for parsing HTML and querying it with jQuery-style selectors. In today’s post, I’ll explain how to test document extractor code in the Cheerio sandbox before rebuilding the index.

Let’s use a sample page from the Sitecore Developer Portal as an example.

example — Sitecore Search

Here are the steps to use the Cheerio sandbox:

Step 1: ScrapeNinja Cheerio Live Sandbox
Open the sandbox link (see References) in your browser. The page asks for the sample HTML of a page in a field labeled “Sample HTML.”

Step 2: Copy the HTML of the page you want to test. In our case, that is the “Sitecore Search” page. Use the Inspect command in your browser, right-click the root <html> element, choose “Edit as HTML,” and copy the markup into the sandbox under “Sample HTML.”

Now that the HTML is prepared, it’s time to write the document extractor code that we will test.

Step 3: With the sample HTML in place, scroll down to the Extractor section. This is where you write and modify the code under test. The sandbox shows sample extractor code there, which we will adapt to our needs.

Step 4: Decide which meta tags to extract. This typically depends on your client’s requirements; the goal is to index the fields that help users find content quickly and efficiently.

In this scenario, I want to extract the Open Graph meta tag values for site name, title, and description.
Let’s modify the code accordingly.

// define a function that accepts the page HTML and the cheerio module as arguments
function extract(input, cheerio) {
    // parse the HTML so it can be queried with CSS selectors
    const $ = cheerio.load(input);
    // return an object with the extracted values
    return {
        title: $('meta[property="og:title"]').attr('content'),
        description: $('meta[property="og:description"]').attr('content'),
        sitename: $('meta[property="og:site_name"]').attr('content')
    };
}
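Not every page carries every Open Graph tag, so it can be worth testing a fallback in the sandbox as well. The sketch below keeps the same `extract(input, cheerio)` signature and falls back to the `<title>` element and the standard `meta[name="description"]` tag. The helper name `metaContent` and the choice of fallbacks are assumptions for illustration, not part of the sandbox sample.

```javascript
// Hypothetical helper: return the trimmed content attribute of the first
// element matching the selector, or undefined when nothing usable is found.
function metaContent($, selector) {
    const value = $(selector).attr('content');
    return value ? value.trim() : undefined;
}

function extract(input, cheerio) {
    const $ = cheerio.load(input);
    return {
        // Prefer the Open Graph tag, then fall back to the <title> element.
        title: metaContent($, 'meta[property="og:title"]')
            || $('title').text().trim()
            || undefined,
        // Prefer the Open Graph tag, then the standard description meta tag.
        description: metaContent($, 'meta[property="og:description"]')
            || metaContent($, 'meta[name="description"]'),
        // No obvious fallback for the site name, so it may stay undefined.
        sitename: metaContent($, 'meta[property="og:site_name"]')
    };
}
```

Pasting this into the Extractor section and removing a tag from the sample HTML is a quick way to see whether the fallback is picked up.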

Step 5: Run the extractor. The output should be an object containing the title, description, and sitename values read from the page’s meta tags.
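When a selector matches nothing, `.attr('content')` returns `undefined`, so it helps to check the result for missing fields before moving on. The helper below is a hypothetical sketch of such a check; `missingFields` and the sample values are made up for illustration and are not part of the sandbox.

```javascript
// Hypothetical helper: list the keys whose extracted value is missing or blank.
function missingFields(result) {
    return Object.keys(result).filter((key) => {
        const value = result[key];
        return value === undefined || value === null || String(value).trim() === '';
    });
}

// Made-up sample output for illustration; real values depend on the page.
const sample = {
    title: 'Example title',
    description: '',
    sitename: undefined
};

console.log(missingFields(sample)); // [ 'description', 'sitename' ]
```

An empty array means every field was extracted, which is the state you want before rebuilding the index.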

Conclusion:
The sandbox is especially useful when your site hosts various types of content, such as article pages, news search pages, and different kinds of PDFs. It provides a practical environment to write and test the JavaScript functions needed to extract the required data before you rebuild the index.
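Once the selectors behave as expected in the sandbox, they still need to be moved into the Sitecore Search extractor, where the function wrapper differs from the sandbox one. The sketch below assumes the pattern shown in the Sitecore documentation: the function receives `request` and `response`, `response.body` is the already-parsed page that can be queried with selectors, and the return value is an array of objects. Treat this as an assumption to verify against your own source configuration; the attribute keys (`name`, `description`, `site_name`) are placeholders and must match attributes defined in your domain.

```javascript
function extract(request, response) {
    // Assumption: response.body is a Cheerio-style selector function
    // for the crawled page, so cheerio.load() is not needed here.
    const $ = response.body;
    // Assumption: the extractor returns an array with one object per document.
    return [{
        name: $('meta[property="og:title"]').attr('content'),
        description: $('meta[property="og:description"]').attr('content'),
        site_name: $('meta[property="og:site_name"]').attr('content')
    }];
}
```

The selectors are unchanged from the sandbox version; only the wrapper and the return shape differ.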

References

Cheerio Sandbox: Basic example (scrapeninja.net)

Sitecore Search
The industry standard for working with HTML in JavaScript | cheerio
Configuring request extractors | Sitecore Documentation

Written by Jaya Jha
