In this post, I highlight several common errors you may encounter when you start configuring a source to crawl PDF documents.
If you’re unsure about how to index PDF documents using Sitecore Search, please refer to my previous blog post: ‘Indexing PDF Documents with Sitecore Search’.
Let’s get started by addressing the first error!
Path to check the error: Analytics > Sources > Overview > Select the Source > Job Run List > Select Job Name > View Details (to view the issues)
Error 1 — Invalid payload size.
This error indicates that a PDF on the website exceeds the upload size limit set by Sitecore.
Resolution — At the time of writing, Sitecore does not offer a setting that lets customers control or adjust the PDF size limit. It is a feature request they are considering for a future release. I assume the default limit is 10 MB.
So, you should raise a support case with the Sitecore team. Make sure to mention the source ID for which you wish to crawl the PDF, along with your domain ID, and attach a screenshot of the error.
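Before raising the case, it can help to know which PDFs are actually over the limit. Below is a minimal Python sketch, not an official tool. It assumes the 10 MB figure mentioned above (my guess, not a documented Sitecore value) and that your web server reports `Content-Length` on a HEAD request. The function names are hypothetical.

```python
import urllib.request

# Assumed limit: 10 MB is the guess from this post, not a documented Sitecore value.
ASSUMED_LIMIT_BYTES = 10 * 1024 * 1024


def exceeds_limit(size_bytes, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return True when a known size is above the assumed limit."""
    return size_bytes is not None and size_bytes > limit_bytes


def remote_size(url, timeout_seconds=10):
    """Ask the server for Content-Length via HEAD; None if it is not reported."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


def oversized(sizes_by_url, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Keep only the entries of a {url: size_in_bytes} mapping above the limit."""
    return {
        url: size
        for url, size in sizes_by_url.items()
        if exceeds_limit(size, limit_bytes)
    }
```

To use it, build the mapping with `remote_size(url)` for each PDF URL on your site and pass it to `oversized(...)`. The URLs it returns are the ones worth listing in the support case.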
Error 2 — Context deadline exceeded.
As the message suggests, this error is a timeout: the crawler did not get a response within the allowed time.
Let’s look at how we can fix this.
Resolution —
The first thing to try is increasing the timeout value in the source's web crawler settings.
If the issue persists, raise a support case: a timeout caused by a server-side error can only be resolved by the Sitecore team.
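To decide what timeout value to try, it helps to measure how long the PDF actually takes to download. Here is a rough Python sketch with hypothetical function names. The measurement runs from your own machine, so it only approximates what the Sitecore crawler experiences. The result is in seconds, so convert it to whatever unit the timeout field in your crawler settings expects.

```python
import math
import time
import urllib.request


def fetch_seconds(url, timeout_seconds=60):
    """Download the whole document and return the elapsed time in seconds."""
    started = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        response.read()
    return time.monotonic() - started


def suggested_timeout_seconds(measured_seconds, headroom=2.0):
    """Scale the slowest measured download by a headroom factor, rounded up."""
    if not measured_seconds:
        raise ValueError("at least one measurement is required")
    return math.ceil(max(measured_seconds) * headroom)
```

Calling `fetch_seconds(url)` a few times for the slowest PDFs and passing the results to `suggested_timeout_seconds(...)` gives a starting point. The headroom factor of 2 is an arbitrary choice, not a Sitecore recommendation.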
Error 3 — Error parsing body. Body size: 838315 bytes. Cause: Put "some url": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
At first glance this looks like a plain timeout. However, the underlying failure is in parsing the document body, which here is 838315 bytes (about 0.8 MB).
Resolution —
As with Error 2, the first thing to try is increasing the timeout value in the source's web crawler settings.
If the issue still persists, raise a support case: a timeout caused by a server-side error can only be resolved by the Sitecore team.
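The three errors in this post can be told apart from the message text alone. That is handy when you copy issues out of the job details for several sources. Here is a small Python sketch: the message patterns come from the errors shown above, while the function name, category labels, and wording of the suggested actions are my own shorthand.

```python
import re

# Matches the "Body size: <n> bytes" fragment seen in Error 3.
BODY_SIZE = re.compile(r"Body size: (\d+) bytes")


def triage(message):
    """Map an issue message to (category, body_size_or_None, suggested_action)."""
    text = message.lower()
    match = BODY_SIZE.search(message)
    body_size = int(match.group(1)) if match else None
    if "invalid payload size" in text:
        return ("payload-size", body_size,
                "raise a support case with source ID and domain ID")
    # Checked before the generic timeout because Error 3 contains both phrases.
    if "error parsing body" in text:
        return ("parse-timeout", body_size,
                "increase the crawler timeout, then raise a support case")
    if "context deadline exceeded" in text:
        return ("timeout", body_size,
                "increase the crawler timeout, then raise a support case")
    return ("unknown", body_size, "inspect the job details manually")
```

The order of the checks matters: the Error 3 message also contains "context deadline exceeded", so the more specific phrase is tested first.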
Important Note -
Raise a separate support case for each source (for example Dev, UAT, and Production) and for each error, in line with best practices for support cases.
I hope this blog post has alerted you to potential issues you may encounter in the near future and has provided solutions to address them efficiently without the need for extensive investigation.
Happy Learning with Composable Solutions!