Web scraping with R and rvest

Learn how to use R and rvest to scrape data from websites in this tutorial: inspect HTML elements, write CSS selectors, and store your scraped data in a tidy format.

Lucas Walker
Article content
  1. Choose between Python and R for web scraping
  2. Install necessary packages (rvest, dplyr)
  3. Get the page’s HTML
  4. Find relevant HTML elements
  5. Store scraped data in a dataframe
  6. Use rvest to collect attributes
  7. Frequently Asked Questions

Web scraping in R is a technique for extracting data from websites using the R programming language. It can be useful for collecting information that is not available through an API or in a structured format. For example, you might want to scrape product reviews, news articles, social media posts, or any other type of web content that interests you. In this tutorial, you'll learn how to use rvest, a popular R package for web scraping, to download a page's HTML, select elements with CSS selectors and XPath expressions, extract text and attributes, and store your scraped data in a tidy format.

Choose between Python and R for web scraping

Python and R are both popular programming languages for web scraping. They both have a rich set of libraries and tools that make it easy to extract data from websites. However, they also have some differences that might make one more suitable than the other for certain tasks or preferences. In this section, we'll compare data scraping in R and Python and highlight the advantages of each.

Python's advantages

  • Python has a simple and expressive syntax that makes it easy to write and read code.
  • Python has a large and active community of developers and users that provide support and resources for web scraping.
  • Python has a wide range of libraries and frameworks for web scraping, such as BeautifulSoup, Scrapy, Selenium, Requests, and more. These libraries offer different levels of abstraction and functionality for web scraping, from parsing HTML to crawling and scraping entire websites.
  • Python is a general-purpose language that can be used for many other tasks besides web scraping, such as data analysis, machine learning, web development, automation, and more. This makes it a versatile and powerful language for data science.

R's advantages

  • R is a language designed for statistical computing and data analysis. It has a rich set of built-in functions and packages for data manipulation, visualization, modeling, and reporting.
  • R has a concise and consistent syntax that makes it easy to perform complex operations on data with minimal code.
  • R has a small but effective set of libraries for web scraping, such as rvest, httr, xml2, and RSelenium. These libraries offer similar functionality to Python's libraries but with a more consistent and tidy interface.
  • R has a strong integration with HTML widgets and Shiny apps that allow you to create interactive web applications and dashboards from your scraped data. This makes it easy to share and communicate your results with others.

Scraping data with R and rvest

Before you can start web scraping with R and rvest, you will need to install some software on your computer. The two main tools you will need are:

  1. R: This is the programming language that you will use to write your web scraping code. R is free and open source, and you can download it from CRAN (https://cran.r-project.org).
  2. RStudio: This is an integrated development environment (IDE) that makes working with R easier and more enjoyable. RStudio provides a user-friendly interface, code editor, debugger, and many other features. You can download RStudio from the Posit website.

To install R and RStudio, follow the instructions on their respective websites. Once you have them installed, you can launch RStudio – and we’ll be able to move to the next step.

1. Install necessary packages (rvest, dplyr)

Let’s begin our project by opening RStudio. Select Create a project to make a new folder for your code. Next, select New file and create an R script. You can name it rvest_scraper.

We need to install two main libraries for our project: rvest and dplyr. To install a package, use install.packages("packageName"). For our project, run this code:

install.packages("rvest")
install.packages("dplyr")

rvest helps us scrape data from web pages by allowing us to select and extract elements using CSS selectors or XPath expressions.

dplyr helps us manipulate the data with the pipe operator (%>%) and a set of useful functions. The pipe operator lets us chain multiple operations together without creating intermediate variables or nesting functions. dplyr also has an extensive grammar for data manipulation.
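As a small illustration of the pipe, the sketch below runs the same calculation twice, once with nested calls and once with %>%. The data frame movies_demo and its values are made up for this example and are not part of the scraper.

```r
library(dplyr)

# Hypothetical data, invented for this illustration only
movies_demo <- data.frame(
  title  = c("Movie A", "Movie B", "Movie C"),
  rating = c(8.1, 6.4, 7.9),
  stringsAsFactors = FALSE
)

# Nested calls: read from the inside out
nested <- arrange(filter(movies_demo, rating > 7), desc(rating))

# Piped calls: read from top to bottom
piped <- movies_demo %>%
  filter(rating > 7) %>%
  arrange(desc(rating))

# Both approaches give the same result
identical(nested, piped)
```

Both versions do the same work. The piped form simply lists the steps in the order they happen, which tends to matter more as the chain gets longer.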

After installing the libraries, you can remove the install lines from your script.

2. Get the page’s HTML

We need to load our libraries into our project before we can use them. To do that, type library(rvest) and library(dplyr) in your script. The first step of web scraping is to get the HTML document from the server. We can store the URL of the page we want to scrape as a variable and use the read_html() function to download its source code.

link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000&genres=adventure"
page = read_html(link)

Now our scraper has access to the HTML and can parse it with just two lines of code.
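read_html() also accepts a string of HTML, which can be handy for trying out code without requesting a live page. The snippet below is a sketch that uses a made-up page (demo_html is an invented name and invented markup) to show what the function returns.

```r
library(rvest)

# A made-up HTML snippet standing in for a downloaded page
demo_html <- "<html><head><title>Demo page</title></head>
<body><h1>Adventure movies</h1></body></html>"

# Parse the string exactly as we would parse a URL
demo_page <- read_html(demo_html)

# The parsed page is an xml_document object
class(demo_page)

# Pull the text of the <title> element as a quick check
demo_title <- demo_page %>% html_node("title") %>% html_text()
demo_title
```

Working against a small local snippet like this makes it easier to test selectors before pointing the scraper at a real site.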

3. Find relevant HTML elements

To pull movie titles out of the HTML document, we first need to know which element holds them. Open the page in your browser, right-click a title, and choose Inspect to see the markup. We can then describe that element with a CSS selector or an XPath expression. For example, on the IMDb Top 250 chart, each movie title is a link (an <a> element) inside a <td> element with the class titleColumn.

You can use the CSS selector td.titleColumn a to select all the <a> elements inside those <td> elements, which contain the movie titles. Alternatively, you can use the XPath expression //td[@class='titleColumn']/a to achieve the same result.

rvest applies a CSS selector or XPath expression to the HTML document and returns the matching nodes, from which we can extract the movie titles as a vector of strings. Use the html_nodes() function to select the nodes that match your selector or expression, and then use the html_text() function to get the text content of those nodes:

# Load rvest library
library(rvest)

# Get HTML document from IMDB top 250 chart
url <- "http://www.imdb.com/chart/top"
html <- read_html(url)

# Use CSS selector to get movie titles
titles_css <- html_nodes(html, "td.titleColumn a") %>%
  html_text()

# Use XPath expression to get movie titles
titles_xpath <- html_nodes(html, xpath = "//td[@class='titleColumn']/a") %>%
  html_text()

# Print first 10 movie titles
head(titles_css, 10)
head(titles_xpath, 10)
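Because a live page's markup can change over time, it may help to check both selector styles against a small local snippet first. The sketch below uses made-up markup (chart_html, with invented titles and links) that imitates the structure described above.

```r
library(rvest)

# Made-up markup that imitates the table structure described above
chart_html <- "<table>
  <tr><td class='titleColumn'><a href='/title/tt0000001/'>First Movie</a></td></tr>
  <tr><td class='titleColumn'><a href='/title/tt0000002/'>Second Movie</a></td></tr>
</table>"

chart <- read_html(chart_html)

# Same elements, selected two different ways
via_css   <- chart %>% html_nodes("td.titleColumn a") %>% html_text()
via_xpath <- chart %>% html_nodes(xpath = "//td[@class='titleColumn']/a") %>% html_text()

# Both vectors hold the two titles
identical(via_css, via_xpath)
```

One difference worth keeping in mind: the CSS selector matches any <td> whose class list contains titleColumn, while this XPath expression matches only when the class attribute is exactly titleColumn.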

On the search results page we loaded in step 2, the titles are links inside elements with the class lister-item-header, so we can add this line to our script:

titles = page %>% html_nodes(".lister-item-header a") %>% html_text()

Finally, we can run our script and then type titles in the R console to see a vector of IMDb movie titles.

4. Store scraped data in a dataframe

We can enrich the movie list with additional fields, including release year (.text-muted.unbold), rating (.ratings-imdb-rating strong), and synopsis (.ratings-bar+ .text-muted). Extract each one into its own vector (year, rating, and synopsis) the same way we extracted titles: call html_nodes() with the matching selector, then html_text().

We can then use the data.frame() function to make a data frame with our variables as columns. A data frame is a table-like structure that stores data in rows and columns. Note that data.frame() expects the vectors to have the same length, so check for movies that lack, say, a rating.

movies = data.frame(titles, year, rating, synopsis, stringsAsFactors = FALSE)

To see the data frame we just made, run the code and type View(movies) in your console. This will open a new window that shows the data frame in a spreadsheet format.
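If some items are missing a field, separately scraped vectors can end up with different lengths. One possible way around this, sketched below on made-up markup, is to select one container node per movie first and then use html_node(), which returns exactly one result (or NA) per container. The container class .lister-item and the sample values are assumptions made for this illustration.

```r
library(rvest)

# Made-up listing: the second movie has no rating
listing_html <- "<div class='lister-item'>
  <h3 class='lister-item-header'><a href='/title/tt0000001/'>First Movie</a>
    <span class='text-muted unbold'>(1999)</span></h3>
  <div class='ratings-imdb-rating'><strong>8.1</strong></div>
</div>
<div class='lister-item'>
  <h3 class='lister-item-header'><a href='/title/tt0000002/'>Second Movie</a>
    <span class='text-muted unbold'>(2005)</span></h3>
</div>"

listing <- read_html(listing_html)

# One node per movie (hypothetical container class)
items <- listing %>% html_nodes(".lister-item")

# html_node() yields one value per item, NA where nothing matches
movies_demo <- data.frame(
  titles = items %>% html_node(".lister-item-header a") %>% html_text(),
  year   = items %>% html_node(".text-muted.unbold") %>% html_text(),
  rating = items %>% html_node(".ratings-imdb-rating strong") %>% html_text(),
  stringsAsFactors = FALSE
)

movies_demo
```

The missing rating shows up as NA in its own row instead of shifting the values that follow it.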

5. Use rvest to collect attributes

Sometimes you may want to scrape the link inside the href attribute of an element. This can help you make your scraper follow links, keep track of the data source, and more. With rvest, you can do that by using html_attr("href") instead of html_text(). This returns the value of the href attribute of each selected element.

movie_url = page %>% html_nodes(".lister-item-header a") %>% html_attr("href")

But if we print movie_url, we can see that each link is incomplete. The href holds a relative path that is missing the https://www.imdb.com part, so the string can't be opened on its own. Opening any movie in the browser confirms that the full address starts with https://www.imdb.com. So we just need to tell our scraper to add this part to the front of each link. We can use the paste() function to join two strings together.

movie_url = page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste("https://www.imdb.com", ., sep="")

The %>% operator passes the result of the previous function as the first argument of the next one. But if we put a dot (.) in another position, here after the string we want to prepend, we tell the operator to pass the value in that position instead, as the second argument. Also, paste() adds a space between joined strings by default, which would make our link invalid, so we add sep="" to remove the space. Now if we run the code, each link comes back as a full URL.
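To see the effect of the dot placeholder in isolation, here is a small sketch that skips the scraping step. The relative links are made up and stand in for what html_attr("href") might return.

```r
library(dplyr)  # provides the %>% pipe

# Made-up relative links, standing in for scraped href values
relative_links <- c("/title/tt0000001/", "/title/tt0000002/")

# Without the dot: the piped value becomes the first argument
first_arg <- relative_links %>% paste("https://www.imdb.com", sep = "")

# With the dot: the piped value goes where the dot is
second_arg <- relative_links %>% paste("https://www.imdb.com", ., sep = "")

# paste0() behaves like paste() with sep = "" built in
with_paste0 <- relative_links %>% paste0("https://www.imdb.com", .)

first_arg   # domain ends up after the path
second_arg  # domain ends up before the path
```

Only the versions with the dot put the domain in front of the path. paste0() is an optional shorthand that saves typing the sep argument.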

Conclusion

In this tutorial, you've learned how to use R and rvest to scrape data from a web page. You've learned how to download a page's HTML, locate elements with CSS selectors and XPath expressions, extract text and attributes, and store your scraped data in a data frame. Web scraping can be a powerful tool for data analysis and visualization, as long as you use it responsibly and ethically, for example by respecting each website's terms of service and rate limits. Happy scraping!

Frequently Asked Questions

Web scraping is a technique for gathering data from websites. It can be useful when you need to collect information that is not available in a structured format or through an API. R is a popular programming language for data science with several packages that help with web scraping, such as rvest, httr, xml2, and jsonlite. Scraping with R lets you collect, manipulate, and analyze the data in the same environment.

The main steps of web scraping with R are:

  1. Identify the target website and the data you want to extract.
  2. Inspect the HTML structure of the website and locate the elements that contain the data.
  3. Use rvest functions to read the HTML code, select the elements, and extract the data.
  4. Format and clean the data as needed.
  5. Store the data in a data frame or export it to a file.

To use rvest to read HTML code from a website, you need to use the read_html() function. This function takes a URL or a file path as an argument and returns an object of class xml_document.

To use rvest to select and extract HTML elements, you need to use the html_nodes() function. This function takes an xml_document object and a CSS selector or an XPath expression as arguments and returns an xml_nodeset, a list-like collection of the nodes that match the selector or expression. In rvest 1.0.0 and later, the same function is also available under the name html_elements().

To format and clean the data you scraped with rvest, you can use functions from other R packages, such as stringr, tidyr, dplyr, and lubridate. For example, you can use stringr functions to remove whitespace, punctuation, or unwanted characters from the text; tidyr functions to reshape or separate the data into columns; dplyr functions to filter, group, or summarize the data; and lubridate functions to parse or manipulate dates.
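As a rough sketch of that cleaning step, the example below tidies a few made-up values with stringr and dplyr. The raw data frame and its contents are invented to mimic freshly scraped text. Real pages may need different patterns.

```r
library(dplyr)
library(stringr)

# Made-up raw values that mimic freshly scraped text
raw <- data.frame(
  titles = c("  First Movie ", "Second   Movie"),
  year   = c("(1999)", "(I) (2005)"),
  rating = c("8.1", NA),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  mutate(
    titles = str_squish(titles),                       # trim and collapse whitespace
    year   = as.integer(str_extract(year, "\\d{4}")),  # keep the four-digit year
    rating = as.numeric(rating)                        # text to number, NA stays NA
  )

clean
```

After this step the year and rating columns are numeric, so they can be filtered, sorted, or summarized directly with dplyr.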
