+5 votes
214 views
in Web development by (242k points)
reopened
Web scraping with Python: introduction and tutorial

1 Answer

+3 votes
by (1.6m points)
edited
 
Best answer

Web scraping with Python: introduction and tutorial

The World Wide Web is made up of many millions of linked documents, also known as web pages. The source text of a web page is written in the Hypertext Markup Language (HTML). HTML source code is a mix of human-readable information and machine-readable code, known as tags or labels. The browser, such as Chrome, Firefox, Safari or Edge, processes the source text, interprets the tags and presents the information they contain to the user.

To extract from the source text only the information that interests the user, special software is used. These programs, called web scrapers, crawlers, spiders or simply bots, examine the source text of web pages in search of specific patterns and extract the information they contain. The data obtained through web scraping is subsequently summarized, combined, evaluated or stored for later use.

In this article we explain why the Python language is especially useful for creating web scrapers and present an introduction to the topic along with a practical tutorial.

Index
  1. Why precisely use Python for web scraping?
  2. Web scraping in general terms
    1. Web scraping application cases
    2. A simple example of web scraping
    3. Legal risks of web scraping
    4. Technical limitations of web scraping
    5. APIs as an alternative to web scraping
  3. Scraping tools for Python
    1. Web scraping with Scrapy
    2. Web scraping with Selenium
    3. Web scraping with BeautifulSoup
    4. Comparison of web scraping tools with Python
  4. Web scraping with Python and BeautifulSoup: tutorial
    1. Set up the web scraping project with Python on your own computer
    2. Extract citations and authors with Python and BeautifulSoup
    3. Use Python packages to scrape

Why precisely use Python for web scraping?

Python, the popular programming language, lends itself especially well to creating web scraping programs. Since web pages are constantly modified and updated, their contents change over time. Their design may change, for example, or new elements may be added to them. A web scraper is developed with the specific structure of a web page in mind, so if that structure changes, the scraper must be modified too. This process is especially easy with Python.

Likewise, Python's strengths include text processing and opening web resources, two of the technical foundations of web scraping. Python is also an established standard for data analysis and processing. As if this weren't enough, Python offers a vast programming ecosystem, which includes libraries, open source projects, documentation and language references, as well as forum posts, bug reports and blog articles.

More specifically, there are several well-established tools designed for web scraping with Python. We present three of the best known: Scrapy, Selenium and BeautifulSoup. If you want to start practicing, take a look at the web scraping with Python tutorial below, in which we use BeautifulSoup to help you understand the scraping process.

Web scraping in general terms

The basic scheme of web scraping is easy to explain. First, the scraper developer analyzes the HTML source text of the web page in question. It usually contains clear patterns that allow the desired information to be extracted. The scraper is then programmed to identify these patterns and does the rest of the job automatically:

  1. Open the web page via its URL
  2. Automatically extract structured data following the patterns
  3. Summarize, store, evaluate or combine the extracted data, among other actions
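
As a minimal sketch of these three steps, the following snippet uses the Quotes to Scrape demo site that also appears in the tutorial further below; the choice of data to extract is purely illustrative:

  import requests
  from bs4 import BeautifulSoup
  
  # 1. Open the web page via its URL
  response = requests.get("http://quotes.toscrape.com/")
  
  # 2. Automatically extract structured data following a pattern
  #    (here: all quote texts marked with the CSS class 'text')
  html = BeautifulSoup(response.text, "html.parser")
  quotes = [tag.text for tag in html.find_all("span", class_="text")]
  
  # 3. Summarize, store or evaluate the extracted data (here: just count it)
  print(len(quotes), "quotes found")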

Web scraping application cases

Web scraping has many different applications. In addition to search engine indexing, web scraping can be used for the following purposes, among many others:

  • Create contact databases
  • Check and compare offers online
  • Gather data from various online sources
  • Observe the evolution of online presence and reputation
  • Gather financial, weather, or other data
  • Observe changes in the content of web pages
  • Gather data for research purposes
  • Perform data exploration or data mining

A simple example of web scraping

Imagine a second-hand car sales website that, when opened in the browser, shows a list of available cars. The source text for the entry of one of the cars could look like this:

  raw_html = """
  <h1>Coche de segunda mano en venta</h1>
  <ul>
      <li>
          <div class="car-title">Volkswagen Escarabajo</div>
          <div>
              <span class="car-make">Volkswagen</span>
              <span class="car-model">Escarabajo</span>
              <span>1973</span>
          </div>
          <div class="sales-price">
              <span>14.998 €</span>
          </div>
      </li>
  </ul>
  """

A web scraper could, for example, examine the online list of cars for sale and, depending on the intention with which it was created, search for a specific model. In our example, it is a Volkswagen Beetle. In the source text, the make and model of the car are indicated by the CSS classes car-make and car-model, respectively. Thanks to the class names, the desired data can easily be extracted, or scraped. Here is the corresponding scraping example with BeautifulSoup:

  from bs4 import BeautifulSoup
  
  # Parse the HTML source text stored in raw_html
  html = BeautifulSoup(raw_html, 'html.parser')
  # Extract the tag with the class 'car-title'
  car_title_tag = html.find(class_='car-title')
  car_title = car_title_tag.text.strip()
  # If the car in question turns out to be a Volkswagen Beetle
  if car_title == 'Volkswagen Escarabajo':
      # Climb from the car title up to the enclosing list item <li></li>
      car_entry = car_title_tag.find_parent('li')
      # Determine the price of the car within this entry
      car_price = car_entry.find(class_='sales-price').text.strip()
      # Display the price of the car
      print(car_price)

Legal risks of web scraping

Web scraping techniques can be very useful, but they are not always free of legal risk. Since the operator of a website designs it with human users in mind, its automated opening by a web scraper may constitute a breach of the terms of use. This becomes especially relevant when large volumes of information are accessed from multiple pages at the same time or in rapid succession, in a way no person could ever interact with the page.

If carried out automatically, the opening, storage and evaluation of data published on a website could infringe intellectual property rights. In addition, if the data obtained is of a personal nature, storing and analyzing it without the authorization of the people affected violates current data protection regulations. For this reason, it is not permitted, for example, to scrape Facebook profiles to obtain personal data.

Note

Infringements of data protection and intellectual property law are punishable by significant fines. Make sure, therefore, to act within the law whenever you use web scraping techniques. If you run into technical security barriers, under no circumstances try to circumvent them.

Technical limitations of web scraping

For web page operators, it is often advantageous to limit the automated scraping of their online content. On the one hand, massive access to the site by scrapers can harm its performance; on the other, there are often internal sections of the site that should not appear in search results.

To limit access by scrapers, the robots.txt standard has become widespread. It is a text file that web operators place in the main directory of the website. It contains special entries that establish which scrapers or bots are authorized to access which areas of the site. The entries in the robots.txt file always apply to an entire domain.

Here is an example of the content of a robots.txt file that prohibits scraping by any type of bot on the entire site:

 

  # Any bot
  User-agent: *
  # Exclude the entire main directory
  Disallow: /

The robots.txt file only acts as a security measure insofar as it invites voluntary restraint by bots, which are expected to adhere to its rules. On a technical level, however, it is not an obstacle. Therefore, to effectively control access by web scrapers, web operators also implement more aggressive techniques: one is to throttle performance; another is to block the scraper's IP address after several access attempts that violate the site's rules.
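
If you want your own bot to respect these limits, Python's standard library can evaluate a robots.txt file for you. Here is a minimal sketch using urllib.robotparser and the Quotes to Scrape demo site; the user agent name "MyScraperBot" is purely illustrative:

  from urllib.robotparser import RobotFileParser
  
  # Download and parse the robots.txt file of the demo site
  robots = RobotFileParser("http://quotes.toscrape.com/robots.txt")
  robots.read()
  
  # Check whether our (illustrative) bot may open a given URL
  url = "http://quotes.toscrape.com/page/2/"
  if robots.can_fetch("MyScraperBot", url):
      print("robots.txt allows scraping:", url)
  else:
      print("robots.txt disallows scraping:", url)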

APIs as an alternative to web scraping

Despite its effectiveness, web scraping is not always the best method for obtaining data from web pages. In fact, there is often a better alternative: many web operators publish their data in a structured, machine-readable format. Such data is accessed through special programming interfaces called Application Programming Interfaces (APIs).

Using an API offers significant benefits:

  • The website owner creates the API precisely to allow access to the data. This reduces the risk of infringements, and it lets the web operator regulate access to the data better, for example by requiring an API key. It also allows the operator to set performance limits more precisely.
  • The API delivers the data directly in a machine-readable format. This makes the laborious extraction of data from the source text unnecessary. In addition, the structure of the data is separated from its visual presentation, so it remains intact even if the design of the website changes.

If an API is available and offers complete data, it is the best method for accessing the information. Keep in mind, however, that with web scraping, in principle, any text a person could read on a web page can be extracted.
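
To make the contrast concrete, the following sketch queries a JSON API with the requests package. The endpoint, parameters and API key below are purely hypothetical assumptions for illustration, not a real service:

  import requests
  
  # Hypothetical endpoint that returns second-hand car listings as JSON
  response = requests.get(
      "https://api.example.com/v1/cars",
      params={"make": "Volkswagen", "model": "Escarabajo"},
      headers={"Authorization": "Bearer <API-KEY>"},  # many APIs require a key
  )
  response.raise_for_status()
  
  # The data arrives in a machine-readable structure:
  # no HTML parsing is needed, and it survives redesigns of the website
  for car in response.json():
      print(car["year"], car["price"])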

Scraping tools for Python

The Python ecosystem includes several well-established tools for carrying out scraping projects:

  • Scrapy
  • Selenium
  • BeautifulSoup

Next, we present the advantages and disadvantages of each of these technologies.

Web scraping with Scrapy

Scrapy, the first of the web scraping tools for Python that we present, uses an HTML parser to extract data from the source text (in HTML) of the web page, following this scheme:

URL → HTTP request → HTML → Scrapy

The key concept in developing scrapers with Scrapy is the so-called web spider: a simple scraping program based on Scrapy. Each spider is programmed to scrape a specific website and works its way from page to page. The programming is object-oriented: each spider is its own Python class.

In addition to the Python package itself, the Scrapy installation includes a command line tool, the Scrapy Shell, which allows you to try out spiders. Furthermore, spiders you have created can be stored in the Scrapy Cloud, where they run on a schedule. In this way, even complex websites can be scraped without using your own computer or Internet connection. Alternatively, you can set up your own web scraping server using the open source software Scrapyd.

Scrapy is a consolidated platform for applying web scraping techniques with Python. Its architecture is oriented to the needs of professional projects. Scrapy includes, for example, an integrated pipeline for processing the extracted data. Pages are opened asynchronously in Scrapy, that is, several pages can be downloaded simultaneously. For this reason, Scrapy is a good option for scraping projects that have to process large volumes of pages.
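
As a minimal sketch of such a spider class, assuming the Quotes to Scrape demo site that is also used in the tutorial below, a spider could look like this (the file name quotes_spider.py is arbitrary; run it, for example, with scrapy runspider quotes_spider.py -o quotes.json):

  import scrapy
  
  class QuotesSpider(scrapy.Spider):
      # Each spider is its own Python class with a unique name
      name = "quotes"
      start_urls = ["http://quotes.toscrape.com/"]
  
      def parse(self, response):
          # Extract each quote and its author via CSS selectors
          for quote in response.css("div.quote"):
              yield {
                  "text": quote.css("span.text::text").get(),
                  "author": quote.css("small.author::text").get(),
              }
          # Work through the site page by page via the 'Next' link
          next_page = response.css("li.next a::attr(href)").get()
          if next_page:
              yield response.follow(next_page, self.parse)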

Web scraping with Selenium

The free software Selenium is a framework for automated software testing of web applications. Originally, it was developed to test websites and web applications, but the Selenium WebDriver can also be used with Python for scraping. Although Selenium itself is not written in Python, the software's functions can be accessed from this programming language.

Unlike Scrapy and BeautifulSoup, Selenium does not work directly with the HTML source text of the web page in question, but loads the page in a browser without a user interface. The browser then interprets the page's source code and creates a Document Object Model (DOM) from it. This standardized interface makes it possible to test user interactions: for example, clicks can be simulated and forms filled in automatically. Changes to the page resulting from such actions are reflected in the DOM. The web scraping process with Selenium has the following structure:

URL → HTTP request → HTML → Selenium → DOM

Since the DOM is generated dynamically, Selenium also makes it possible to scrape pages whose content was generated with JavaScript. Access to dynamic content is Selenium's most important advantage. In practice, Selenium can also be used in combination with Scrapy or BeautifulSoup: Selenium provides the source text, while the second tool takes care of parsing and evaluating the data. In this case, the scheme looks like this:

URL → HTTP request → HTML → Selenium → DOM → HTML → Scrapy / BeautifulSoup
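
A minimal sketch of this combined scheme could look as follows. It assumes Chrome and a matching ChromeDriver are installed and uses the JavaScript-rendered variant of the Quotes to Scrape demo site:

  from bs4 import BeautifulSoup
  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options
  
  # Load the page in a browser without a user interface (headless)
  options = Options()
  options.add_argument("--headless")
  driver = webdriver.Chrome(options=options)
  driver.get("http://quotes.toscrape.com/js/")
  
  # The browser has built the DOM, including JavaScript-generated content;
  # hand the rendered HTML over to BeautifulSoup for parsing and evaluation
  html = BeautifulSoup(driver.page_source, "html.parser")
  driver.quit()
  
  for quote in html.find_all("span", class_="text"):
      print(quote.text)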

Web scraping with BeautifulSoup

Of the three tools we present for web scraping with Python, BeautifulSoup is the oldest. As with Scrapy, it is an HTML parser. Web scraping with BeautifulSoup has the following structure:

URL → HTTP request → HTML → BeautifulSoup

However, unlike Scrapy, with BeautifulSoup the development of a scraper does not require object-oriented programming: the scraper is written as a simple script. BeautifulSoup thus offers the easiest way to fish the desired information out of the "tag soup", living up to its name.

Comparison of web scraping tools with Python

Each of the three tools presented has its advantages and disadvantages, which we have summarized in the following table:

                                        Scrapy    Selenium    BeautifulSoup
  Ease of learning and handling         ++        +           +++
  Reading dynamic content               ++        +++         +
  Realization of complex applications   +++       +           ++
  Robustness against HTML failures      ++        +           +++
  Scraping performance optimization     +++       +           +
  Ecosystem quality                     +++       +           ++
In summary

Which tool should you choose for your project? In short: choose BeautifulSoup if you need rapid development or if you first want to familiarize yourself with the concepts of Python and web scraping. Scrapy, for its part, allows you to build complex web scraping applications in Python, provided you have the necessary knowledge. Selenium is your best option if your priority is extracting dynamic content with Python.

Web scraping with Python and BeautifulSoup: tutorial

In this section, we show you how to extract data from a web page using BeautifulSoup. The first steps are to install Python and some helpful tools. You will need:

  • Python version 3.4 or later
  • The pip package manager for Python
  • The venv module

To install Python, follow the installation instructions on its website.

On macOS, once you have installed the free Homebrew package manager on your system, you can install Python using the following command:

 

  brew install python  
Note

The explanations and code examples below are for Python 3 on macOS. In theory, the code should also work on other operating systems, but it may require some modifications, especially on Windows.

Set up the web scraping project with Python on your own computer

For this Python tutorial, we will place the web-scraper project folder on the desktop. To do this, first open the command line interface (Terminal.app on macOS), copy the following lines of code into it and execute them:

  # Change to the desktop folder
  cd ~/Desktop/
  # Create the project directory
  mkdir ./web-scraper/ && cd ./web-scraper/
  # Create a virtual environment
  # Among other things, this ensures that pip3 is used later on
  python3 -m venv ./env
  # Activate the virtual environment
  source ./env/bin/activate
  # Install the packages
  pip install requests
  pip install beautifulsoup4

Extract citations and authors with Python and BeautifulSoup

The Quotes to Scrape website offers a whole collection of quotes from famous people, designed specifically as a target for scraping tests, so you don't have to worry about breaching any terms of use.

Let's get to work. Open the command line interface (Terminal.app on macOS) and start the Python interpreter from the web-scraper project folder. To do this, copy the following lines of code into the interface and run them:

  # Go to the project directory
  cd ~/Desktop/web-scraper/
  # Activate the virtual environment
  source ./env/bin/activate
  # Start the Python interpreter
  # Since we are inside the virtual environment, python refers to Python 3
  python

Now, copy the following code and paste it into the Python interpreter in the command line interface. Then press Enter (several times, if necessary) to run it. You can also save the code as a file named scrape_quotes.py in the web-scraper project folder. If you do, you can run the Python script with the command python scrape_quotes.py.

As the final result of running the code, your web-scraper project folder should contain a file called zitate.csv. This file contains a table with the quotes and their authors and can be opened with the spreadsheet program of your choice.

  # Import modules
  import requests
  import csv
  from bs4 import BeautifulSoup
  
  # Address of the web page
  url = "http://quotes.toscrape.com/"
  
  # Execute the GET request
  response = requests.get(url)
  
  # Parse the HTML source text with BeautifulSoup
  html = BeautifulSoup(response.text, 'html.parser')
  
  # Extract all quotes and authors from the HTML
  quotes_html = html.find_all('span', class_="text")
  authors_html = html.find_all('small', class_="author")
  
  # Create a list of the quotes
  quotes = list()
  for quote in quotes_html:
      quotes.append(quote.text)
  
  # Create a list of the authors
  authors = list()
  for author in authors_html:
      authors.append(author.text)
  
  # For testing: combine and display the entries of both lists
  for t in zip(quotes, authors):
      print(t)
  
  # Save the quotes and authors to a CSV file in the current directory
  # Open the file with Excel / LibreOffice, etc.
  with open('./zitate.csv', 'w', newline='') as csv_file:
      csv_writer = csv.writer(csv_file, dialect='excel')
      csv_writer.writerows(zip(quotes, authors))

Use Python packages to scrape

No two web scraping projects are the same. Sometimes you just want to check whether a page has changed, and sometimes you want to carry out complex evaluations, among other goals. Python gives you a wide range of packages to choose from:

  1. Install packages in the command line interface with pip3
  pip3 install <package>
  2. Integrate modules within Python scripts with import
  from <package> import <module>

These are some of the most commonly used packages in web scraping projects:

  Package     Purpose
  venv        Manage the project's virtual environment
  requests    Open web pages
  lxml        Use alternative parsers for HTML and XML
  csv         Read and write data tables in CSV format
  pandas      Process and evaluate data
  scrapy      Use Scrapy
  selenium    Use Selenium WebDriver
Tip

Use the Python Package Index (PyPI) to see all available packages.

Please take into account the legal notice related to this article.

