Check Broken Links on Your Website Without Using Scrapy


After moving my blog to Hugo and then back to WordPress, I knew there would be a few broken links on my website.

The question I asked myself was:

How can I create a spider that scans through my website and tests all the links?

Initially I came across the fast and popular Scrapy Python library and tried to create a script to perform the simple task of spidering my site: fetching all anchor and image tags, and testing their corresponding href and src attributes.

However, while I was able to cobble such a script together from others who had shared their Scrapy scripts, I found I wanted better reporting, and I didn’t trust that I was using the Scrapy library properly.

Then I thought, why not create my own simple spider script?

It couldn’t be that hard, right?

The basic process, sketched in code after the list, would be:

  1. Use the Requests & BeautifulSoup libraries
  2. Start by loading a web page – perhaps the home page of my website.
  3. Scrape the web page and collect all the necessary links from anchor tags (href attribute), image tags (src attribute), and so on.
  4. Check each link’s HTTP response code. If it’s a 404 response, log the web page where the link was found and the link itself in a CSV file.
  5. If the link is within the permitted domains to follow, spider that new page by starting back at step 2 with the new URL.
  6. If the link is not within the permitted domains, just grab its HTTP response code.

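Putting those steps together, here is a minimal sketch of that process using Requests and BeautifulSoup. It is my own illustrative reconstruction rather than the exact script behind this post; only the list names urls, follow_domains and denied_domains (covered below) come from the description here, everything else is an assumption.

```python
import csv
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

visited = set()
broken = []  # rows of (page the link was found on, link, HTTP status)


def should_follow(url, follow_domains, denied_domains):
    """Follow a link only if it matches a permitted domain and no denied entry."""
    if any(denied in url for denied in denied_domains):
        return False
    return any(domain in urlparse(url).netloc for domain in follow_domains)


def check(url):
    """Return the HTTP status code for a URL (0 if the request fails)."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        return 0


def spider(page_url, follow_domains, denied_domains):
    if page_url in visited:
        return
    visited.add(page_url)

    # Step 2: load the page.
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")

    # Step 3: gather links from anchor (href) and image (src) tags.
    links = [tag.get("href") for tag in soup.find_all("a")]
    links += [tag.get("src") for tag in soup.find_all("img")]

    for link in filter(None, links):
        link = urljoin(page_url, link)  # resolve relative links
        # Step 4: log 404s against the page they were found on.
        status = check(link)
        if status == 404:
            broken.append((page_url, link, status))
        # Steps 5 and 6: only recurse into pages on the permitted domains.
        elif should_follow(link, follow_domains, denied_domains):
            spider(link, follow_domains, denied_domains)


if __name__ == "__main__":
    # Populate these lists at the bottom of the script, as described below.
    urls = ["https://example.com"]
    follow_domains = ["example.com"]
    denied_domains = []

    for start_url in urls:
        spider(start_url, follow_domains, denied_domains)

    with open("broken_links.csv", "w", newline="") as f:
        csv.writer(f).writerows(broken)
```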
While coding this up in Python was relatively straightforward, I did find when running it that it was rather slow compared to Scrapy. This likely had something to do with the lack of asynchronous support in my code.
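As an aside, one low-effort way to claw back some speed (not something the original script does, just a sketch of the idea) would be to check a page’s collected links concurrently with the standard library’s ThreadPoolExecutor, since most of the time is spent waiting on HTTP responses:

```python
# Sketch only: check many links at once with a thread pool instead of one by one.
from concurrent.futures import ThreadPoolExecutor

import requests


def check(url):
    """Return the HTTP status code for a URL (0 if the request fails)."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        return 0


def check_links(links, max_workers=10):
    """Check a batch of links concurrently and return {link: status_code}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(links, pool.map(check, links)))
```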

To run this code, all you need to do is populate the values in the lists at the bottom of the script. For example, if you want to spider the website https://example.com, you would set the URL to start spidering from in the urls list, then insert all the domains you would like to follow into the follow_domains list.

Finally, if there are URLs that would be matched by the follow_domains list but that you wouldn’t want to follow, for example a subdomain notme.example.com or a subdirectory example.com/do-not-follow/, you would enter those into the denied_domains list, as in the snippet below.
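Using the hypothetical example.com values above, the lists at the bottom of the script might end up looking like this:

```python
urls = ["https://example.com"]                   # where to start spidering
follow_domains = ["example.com"]                 # spider pages on these domains
denied_domains = ["notme.example.com",           # skip this subdomain
                  "example.com/do-not-follow/"]  # and this subdirectory
```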

Conclusion

Creating my own broken link testing script proved to be an enjoyable experience. Special thanks to Ari Bajo, who provided the inspiration and showed that this was possible. You can check out his version on the ScrapingBee blog.

If you’d like to make any improvements feel free to let me know!

Ryan
