After moving my blog to Hugo and then back to WordPress, I knew there would be a few broken links on my website.
The question I asked myself was:
How can I create a spider that scans through my website and tests all the links?
Initially I came across the fast and popular Scrapy Python library, and tried to create a script that would perform the simple task of spidering my site: fetching all anchor and image tags, testing their corresponding URLs, and reporting any that were broken.
However, while I was able to cobble together such a script from others who had shared their Scrapy scripts, I found I wanted better reporting, and I didn't trust that I was using the Scrapy library properly.
Then I thought, why not create my own simple spider script?
It couldn’t be that hard, right?
The basic process would be:
- Use the Requests & BeautifulSoup libraries
- Start by loading a web page – perhaps the home page of my website.
- Scrape the web page and obtain all the necessary links from anchor tags (the href attribute) and image tags (the src attribute).
- Check each link's HTTP response code. If it's a 404 response, log the web page where the link was found, and the link itself – log this in a CSV file.
- If the link is listed in the permitted domains to follow, start back at step 2 with that new URL.
- If the link is not listed in the permitted domains, just record the HTTP response code.
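The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact script from the post: the function names, the `follow_domains` parameter, and the CSV layout are all my assumptions.

```python
import csv
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs from anchor href and image src attributes."""
    soup = BeautifulSoup(html, "html.parser")
    raw = [t.get("href") for t in soup.find_all("a")]
    raw += [t.get("src") for t in soup.find_all("img")]
    return [urljoin(base_url, link) for link in raw if link]


def spider(start_urls, follow_domains, report_path="broken_links.csv"):
    """Crawl pages breadth-first, logging 404 links to a CSV file."""
    seen = set()
    queue = list(start_urls)
    with open(report_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["page", "link", "status"])
        while queue:
            page = queue.pop(0)
            if page in seen:
                continue
            seen.add(page)
            try:
                resp = requests.get(page, timeout=10)
            except requests.RequestException:
                continue
            for url in extract_links(resp.text, page):
                try:
                    status = requests.head(
                        url, allow_redirects=True, timeout=10
                    ).status_code
                except requests.RequestException:
                    status = 0  # connection failure, not an HTTP code
                if status == 404:
                    writer.writerow([page, url, status])
                # Only pages on permitted domains re-enter the crawl loop
                if urlparse(url).netloc in follow_domains and url not in seen:
                    queue.append(url)
```

The crawl itself needs a live site, but the pure parsing step can be exercised on its own with a snippet of HTML.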
While coding this up in Python was relatively straightforward, I found on running the code that it was rather slow compared to Scrapy. This may have had something to do with the lack of asynchronous support in my code.
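One likely speed-up would be checking links concurrently rather than one at a time. A minimal sketch using the standard-library `concurrent.futures` module (the helper names here are hypothetical, not from the original script):

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def check(url):
    """Return (url, HTTP status) for one URL; 0 signals a connection failure."""
    try:
        return url, requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        return url, 0


def check_all(urls, workers=8):
    """Check many URLs in parallel threads instead of sequentially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check, urls))
```

Because link checking is I/O-bound, threads are enough here; there is no need for `asyncio` to get most of the benefit.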
To run this code all you need do is populate the values in the lists at the bottom of the script. For example, if you want to spider your website https://example.com, you would set the URL to start spidering from in the urls list, then insert all the domains you would like to follow into the follow_domains list. Finally, if there are URLs which would be caught by the follow_domains list which you wouldn't want to follow, for example a subdomain notme.example.com or a subdirectory example.com/do-not-follow/, then you would enter those into the corresponding exclusion list.
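Under those assumptions, the configuration at the bottom of the script might look something like this (the `exclude_urls` name is illustrative, since the original list name isn't shown):

```python
# Start spidering from the home page
urls = ["https://example.com"]

# Domains whose pages should themselves be spidered
follow_domains = ["example.com", "www.example.com"]

# URLs matched by follow_domains that should NOT be followed
exclude_urls = [
    "notme.example.com",
    "example.com/do-not-follow/",
]
```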
Creating my own broken-link testing script proved to be an enjoyable experience. Special thanks to Ari Bajo, who helped provide the inspiration that this was possible. You can check out his version on the ScrapingBee blog here.
If you’d like to make any improvements feel free to let me know!