Download Multiple Files On Web Page Using Requests And BeautifulSoup

Ever found yourself scrolling through a web page filled with multiple files, feeling overwhelmed with the tedious process of clicking on each one individually to download them?

Wouldn’t it be great to automate this task, saving both time and effort?

I had a similar requirement: I needed to download a large number of PDFs from a single web page, and clicking on each link, waiting for the PDF to download, and then waiting for the browser to become responsive again quickly became tedious.

In this article, I will explore how I accomplished this task using two popular Python libraries: requests and BeautifulSoup.

Prerequisites

Before diving into the tutorial, ensure you have Python 3 installed on your system. Visit the official Python website for download and installation instructions.

Install both the requests and beautifulsoup4 packages using the following commands in your environment:

pip install requests
pip install beautifulsoup4

If you’re new to web scraping, learning about the principles and best practices, such as respecting websites’ terms of service and robots.txt files, is recommended.

The remainder of this web page assumes you have adhered to any website’s terms of service if you download their files.
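
If you want to check a site’s robots.txt programmatically, Python’s standard library includes urllib.robotparser. Below is a minimal sketch, with https://example.com standing in as a placeholder site:

from urllib import robotparser

# Fetch and parse the site's robots.txt (https://example.com is a placeholder)
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# True if the rules allow a generic crawler ("*") to fetch this path
print(parser.can_fetch("*", "https://example.com/files/report.pdf"))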

Requests Library

The requests library is a popular third-party Python library used for making HTTP requests. It simplifies the process of interacting with web services, such as APIs, by providing a user-friendly and efficient interface for handling various types of HTTP requests and responses.

The library provides various methods to send HTTP requests, including GET, POST, PUT, DELETE, and others. These methods correspond to the standard HTTP verbs that define the actions taken when interacting with web resources.

Some of the main features of the requests library include:

  1. Handling redirects and following URLs in response headers.
  2. Supporting various authentication methods.
  3. Allowing customisation of request headers, query parameters, and request data.
  4. Supporting various types of response content.
  5. Providing easy access to response status codes, headers, and cookies.
  6. Supporting timeouts and retries for more robust request handling.
  7. Handling sessions and connection pooling for improved performance.

In essence, the requests library allows you to send HTTP requests and handle response data.
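
As a quick illustration, here is a minimal sketch of a GET request with query parameters, a timeout, and a status check; the URL and parameters are placeholders:

import requests

# Send a GET request with query parameters and a 10-second timeout
response = requests.get(
    "https://example.com/search",
    params={"q": "pdf"},
    timeout=10,
)
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
print(response.status_code, response.headers.get("Content-Type"))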

BeautifulSoup4 Library

Beautiful Soup is a third-party Python library used to extract data from HTML and XML documents. It provides a convenient and easy-to-use way to parse HTML and XML markup, allowing you to navigate, search, and modify the tree.

Beautiful Soup is particularly useful when extracting information from web pages or other structured text documents.

Some key features of Beautiful Soup include:

  1. Parsing HTML and XML documents and creating a parse tree from the page source.
  2. Navigating and searching the tree using tag names, attributes, and CSS selectors.
  3. Modifying the tree, such as adding, deleting, or altering tags and their attributes.
  4. Extracting specific data from the parse tree and exporting it to a desired format (e.g., CSV, JSON).

BeautifulSoup is therefore useful in this exercise: it lets you parse the web page returned from the request and extract the links to the files you need to download.
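
To see this in action, here is a minimal sketch that parses a small HTML snippet and searches it both by tag name and by CSS selector:

from bs4 import BeautifulSoup

html = '<ul><li><a href="a.pdf" class="FileDown">Report A</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Search by tag name and attribute
link = soup.find("a", {"class": "FileDown"})
print(link["href"], link.text)  # a.pdf Report A

# The same search expressed as a CSS selector (via SoupSieve)
print(soup.select("a.FileDown"))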

Import Libraries

Once both of these libraries have been installed in your Python environment, you’re ready to start writing your Python code.

Start by importing these two libraries at the top of your Python code:

import requests
from bs4 import BeautifulSoup

Once you have this at the top of your Python file, you can move on to the next step.

Fetch Links From Web Page

The next step is to download the page where the links are found.

In my use case, the web page contained a list of links, and I needed to extract the URL of these links to perform the individual downloading of each file. If you already have a list of links, then you can skip this section.

With the requests library, use the get() method to retrieve the page’s content for a specific URL:

import requests
from bs4 import BeautifulSoup


response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, 'html.parser')

In the two lines added after the import statements, the request fetches the web page, and the response is then parsed by BeautifulSoup.
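
In practice you may want to guard against slow or broken servers. A hedged variant of the same fetch adds a timeout and a status check:

import requests
from bs4 import BeautifulSoup

# Fail fast if the server is slow or returns an error status
response = requests.get("https://example.com", timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")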

Extract Links

Using the find_all() method on the new soup object, extract all the links on the page. This requires some skill: you will need to inspect the web page’s HTML to see how the links are coded.

For example, do the anchor tags containing the files you want to download all share a specific attribute? Or could you use the SoupSieve implementation to target the HTML tags with CSS selectors?

In my use case, the anchor tags I needed all had a class attribute of FileDown, so my find_all() call looked like this:

links = soup.find_all("a", {"class": "FileDown"})
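
If you prefer CSS selectors, the equivalent query using BeautifulSoup’s select() method (backed by SoupSieve) would be:

links = soup.select("a.FileDown")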

However, you may find that you need to loop through all the anchor tags and apply further filtering to find the list of links you want.

For example, your code for extracting the downloadable links might extract all anchor tags and then identify which href attributes end in .pdf:

anchors = soup.find_all("a")
# a.get() avoids a KeyError for anchors without an href attribute
links = [a for a in anchors if a.get("href", "").lower().endswith(".pdf")]

In the list comprehension above, I loop through all the anchor tags and append every anchor whose href attribute, converted to lower case, ends with .pdf to the links list (using get() so anchors without an href are skipped rather than raising a KeyError).

In both examples above, the links variable is a list of Tag objects containing the URLs of the files I want to download.
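
One caveat: href attributes are often relative (e.g. /files/report.pdf), so you may need to resolve them against the page’s URL before downloading. A small sketch using urllib.parse.urljoin, with the page URL as a placeholder:

from urllib.parse import urljoin

page_url = "https://example.com"  # the URL the page was fetched from
# Resolve each href (relative or absolute) against the page URL
download_urls = [urljoin(page_url, a["href"]) for a in links]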

Download Files

The next part of your Python code downloads each file by passing its URL to the requests library:

for link in links:
    download_url = link["href"]
    file_name = link.text.strip()  # use the link's text as the file name
    response = requests.get(download_url)
    with open(file_name, "wb") as file:
        file.write(response.content)

The above code is a simple for loop that iterates through the links list of soup Tag objects, using each tag’s href attribute to download the corresponding file.

After capturing the URL, pass it to requests.get() and store the file locally on your machine. In this example, I’m using the link’s text to define the name of the file.
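
For larger files or flaky connections, a more defensive variant streams each response in chunks and falls back to the URL’s last path segment when the link text is empty. This is a sketch, assuming the URLs are absolute:

import os
from urllib.parse import urlparse

import requests

for link in links:
    download_url = link["href"]
    # Fall back to the last path segment if the link has no usable text
    file_name = link.text.strip() or os.path.basename(urlparse(download_url).path)
    with requests.get(download_url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(file_name, "wb") as file:
            # Write the file in 8 KB chunks instead of loading it all into memory
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)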

Summary

In this article, we’ve explored how to download multiple files from a web page using the Python libraries requests and BeautifulSoup. By combining these two powerful tools, we can extract file URLs and then download each file.

We began by installing the required libraries and importing them into the script. Next, we fetched the web page’s content and parsed it with BeautifulSoup to identify and extract the relevant file URLs.

Once we had the list of file URLs, we used requests again to send download requests, saving the files to the local machine.

The final code, assembled from the snippets above (with https://example.com standing in for your target page), is here:
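
import requests
from bs4 import BeautifulSoup


# Fetch the page containing the links (replace with your target URL)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the anchor tags pointing at the files to download
links = soup.find_all("a", {"class": "FileDown"})

# Download each file, naming it after the link's text
for link in links:
    download_url = link["href"]
    file_name = link.text.strip()
    response = requests.get(download_url)
    with open(file_name, "wb") as file:
        file.write(response.content)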

Ryan Sheehy
Ryan has been dabbling in code since the late '90s when he cut his teeth exploring VBA in Excel. Having his eyes opened to the potential of automating repetitive tasks, he expanded to Python and then moved on to web technologies such as HTML, CSS, JavaScript and PHP.