Web scraping in R is a technique for extracting data from websites using the R programming language. It can be useful for collecting information that is not available through an API or a structured format. For example, you might want to scrape product reviews, news articles, social media posts, or any other type of web content that interests you. In this rvest tutorial, you'll learn how to use R and rvest, a popular package for web scraping in R. You'll learn how to scrape HTML elements, write CSS selectors and XPath expressions, extract text and attributes, and store your scraped data in a tidy format.
Choose between Python and R for web scraping
Python and R are both popular programming languages for web scraping. They both have a rich set of libraries and tools that make it easy to extract data from websites. However, they also have some differences that might make one more suitable than the other for certain tasks or preferences. In this chapter, we'll compare data scraping in R and Python and highlight their advantages.
Python's advantages
- Python has a simple and expressive syntax that makes it easy to write and read code.
- Python has a large and active community of developers and users that provide support and resources for web scraping.
- Python has a wide range of libraries and frameworks for web scraping, such as BeautifulSoup, Scrapy, Selenium, Requests, and more. These libraries offer different levels of abstraction and functionality for web scraping, from parsing HTML to crawling and scraping entire websites.
- Python is a general-purpose language that can be used for many other tasks besides web scraping, such as data analysis, machine learning, web development, automation, and more. This makes it a versatile and powerful language for data science.
🏸 Further reading: How to Scrape Facebook Pages: Step-by-Step Guide
📷 Further reading: How to scrape images from a website
R's advantages
- R is a language designed for statistical computing and data analysis. It has a rich set of built-in functions and packages for data manipulation, visualization, modeling, and reporting.
- R has a concise and consistent syntax that makes it easy to perform complex operations on data with minimal code.
- R has a smaller but effective set of libraries for web scraping, such as rvest, httr, xml2, RSelenium, and more. These libraries offer similar functionality to Python's but with a more consistent and tidy interface.
- R has a strong integration with HTML widgets and Shiny apps that allow you to create interactive web applications and dashboards from your scraped data. This makes it easy to share and communicate your results with others.
Scraping data with R and rvest
Before you can start web scraping with R and rvest, you will need to install some software on your computer. The two main tools you will need are:
- R: This is the programming language that you will use to write your web scraping code. R is free and open source, and you can download it here.
- RStudio: This is an integrated development environment (IDE) that makes working with R easier and more enjoyable. RStudio provides a user-friendly interface, code editor, debugger, and many other features. You can download RStudio here.
To install R and RStudio, follow the instructions on their respective websites. Once you have them installed, you can launch RStudio – and we’ll be able to move to the next step.
1. Install necessary packages (rvest, dplyr)
Let's begin our project by opening RStudio. Select Create a project to make a new folder for your code. Next, select New file and create an R script. You can name it rvest_scraper. We need to install two main libraries for our project: rvest and dplyr. To do that, use install.packages("packageName") and run this code:
install.packages("rvest")
install.packages("dplyr")
Rvest helps us scrape data from web pages by allowing us to select and extract elements using CSS selectors or XPath expressions.
Dplyr helps us manipulate the data with the pipe operator (%>%) and a set of useful functions. The pipe operator lets us chain multiple operations together without creating intermediate variables or nesting functions. Dplyr also has an extensive grammar for data manipulation.
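To make the pipe concrete, here is a quick sketch using R's built-in mtcars dataset (the dataset and column are just placeholders for illustration, not part of our scraper):
library(dplyr)
# Without the pipe: nested calls read inside-out
head(arrange(mtcars, desc(mpg)), 5)
# With the pipe: each step reads left to right
mtcars %>% arrange(desc(mpg)) %>% head(5)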
After installing the libraries, you can remove the install lines from your script.
2. Get the page’s HTML
We need to load our libraries into our project before we can use them. To do that, type library(rvest) and library(dplyr) in your script. The first step of web scraping is to get the HTML document from the server. We can store the URL of the page we want to scrape as a variable and use the read_html() function to download its source code.
link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000&genres=adventure"
page = read_html(link)
Now our scraper has access to the HTML and can parse it with just two lines of code.
3. Find relevant HTML elements
To find the HTML elements that hold the movie titles, we can use a CSS selector or an XPath expression to select and extract elements from the HTML document. For example, if you inspect the source code of the IMDb list we loaded above, you'll find that each movie title is an <a> element inside a heading with the class lister-item-header, so the selector .lister-item-header a matches it. IMDb's Top 250 chart uses a slightly different structure: each title is an <a> element inside a <td> with the class titleColumn.
There you can use the CSS selector td.titleColumn a to select all the <a> elements inside those <td> elements, which contain the movie titles. Alternatively, you can use the XPath expression //td[@class='titleColumn']/a to achieve the same result.
Additionally, we can make rvest apply CSS selectors or XPath expressions to the HTML document and extract the movie titles as a vector of strings. For example, you can use the html_nodes() function to select the nodes that match your selector or expression, and then use the html_text() function to get the text content of those nodes:
# Load rvest library
library(rvest)
# Get HTML document from IMDB top 250 chart
url <- "http://www.imdb.com/chart/top"
html <- read_html(url)
# Use CSS selector to get movie titles
titles_css <- html_nodes(html, "td.titleColumn a") %>%
html_text()
# Use XPath expression to get movie titles
titles_xpath <- html_nodes(html, xpath = "//td[@class='titleColumn']/a") %>%
html_text()
# Print first 10 movie titles
head(titles_css, 10)
head(titles_xpath, 10)
Upon finding the correct HTML element, we can add this code line:
titles = page %>% html_nodes(".lister-item-header a") %>% html_text()
Finally, we can run our script and type titles in the console – and we'll see a list of IMDb movie titles.
4. Store scraped data in a dataframe
We can enrich the movie list data with additional parameters, including release year (.text-muted.unbold), ratings (.ratings-imdb-rating strong), and synopsis (.ratings-bar+ .text-muted).
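Here is a minimal sketch of how those selectors could be scraped into their own variables, mirroring the titles line above (the selectors are the ones listed above and may need adjusting if IMDb changes its markup):
year = page %>% html_nodes(".text-muted.unbold") %>% html_text()
rating = page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
synopsis = page %>% html_nodes(".ratings-bar+ .text-muted") %>% html_text()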
movies = data.frame(titles, year, rating, synopsis, stringsAsFactors = FALSE)
Now we can use the data.frame() function to make a data frame with our variables as columns. A data frame is a table-like structure that stores data in rows and columns. To see the data frame we just made, run the code and type View(movies) in your console. This will open a new window that shows the data frame in a spreadsheet format.
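If you also want to save the results to a file (not part of the steps above, just a common follow-up), base R's write.csv() works directly on the data frame:
write.csv(movies, "imdb_movies.csv", row.names = FALSE)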
5. Use rvest to collect attributes
Sometimes you may want to scrape the link inside the href attribute of an element. This can help you make your scraper follow links, keep track of the data source, and more. With rvest and dplyr, you can easily do that by using html_attr("href") instead of html_text(). This will select the href attribute from the element.
movie_url = page %>% html_nodes(".lister-item-header a") %>% html_attr("href")
But if we look at the page, we can see that the link inside the href is not complete: it is missing the https://www.imdb.com/ part. So if we run the code like this, we will get broken links. To fix this, we need to tell our scraper to prepend that missing part to each link before returning it. We can use the paste() function to join two strings together.
movie_url = page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste("https://www.imdb.com", ., sep="")
The %>% operator normally passes the result of the previous function as the first argument of the next one. But if we place a dot (.) among the arguments, we are telling the operator to insert the value at that position instead – here, as the second argument of paste(). Also, paste() adds a space between the joined strings by default, which would make our links invalid, so we add sep="" to remove the space. Now if we run the code, we can see that it works as expected.
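As a quick illustration of the "follow links" idea mentioned above, here is a minimal sketch that loads the first collected movie page (what you select on that page afterwards is up to you and not shown here):
# Follow the first collected link and parse that movie's page
first_movie_page = read_html(movie_url[1])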
Conclusion
In this tutorial, you've learned how to use R and rvest to scrape data from a website. You've learned how to inspect HTML elements, write CSS selectors and XPath expressions, extract text and attributes, and store your scraped data in a tidy data frame. Keep in mind the usual best practices for web scraping, such as respecting the website's terms of service and rate limits, to avoid legal or ethical issues. Web scraping can be a powerful tool for data analysis and visualization, as long as you use it responsibly and ethically. Happy scraping!
Frequently Asked Questions
Is R good for web scraping?
Yes. R has several packages for web scraping and handling web data, such as rvest, httr, xml2, and jsonlite. Web scraping with R is easy and scalable, and allows you to manipulate and analyze the scraped data in the same environment.
What are the main steps of web scraping with R?
The main steps of web scraping with R are listed below (a short sketch putting them together follows the list):
- Identify the target website and the data you want to extract.
- Inspect the HTML structure of the website and locate the elements that contain the data.
- Use rvest functions to read the HTML code, select the elements, and extract the data.
- Format and clean the data as needed.
- Store the data in a data frame or export it to a file.
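As a rough sketch of those steps put together, using the page and selector from earlier in this tutorial (both may need adjusting if the site changes):
library(rvest)
library(dplyr)
# 1-2. Target page and the elements that hold the data
page <- read_html("https://www.imdb.com/search/title/?title_type=feature&num_votes=25000&genres=adventure")
# 3. Select the elements and extract their text
titles <- page %>% html_nodes(".lister-item-header a") %>% html_text()
# 4. Clean the text
titles <- trimws(titles)
# 5. Store the result in a data frame (or export it with write.csv())
movies <- data.frame(title = titles, stringsAsFactors = FALSE)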
How do you get a page's HTML in R?
You can download and parse a page's HTML with rvest's read_html() function. This function takes a URL or a file path as an argument and returns an object of class xml_document.
How do you select HTML elements in R?
You can select elements with rvest's html_nodes() function. This function takes an xml_document object and a CSS selector or an XPath expression as arguments and returns a list of xml_node objects that match the selector or expression.
How do you clean scraped data in R?
You can clean and format scraped data with packages such as stringr, tidyr, dplyr, lubridate, etc. For example, you can use stringr functions to remove whitespace, punctuation, or unwanted characters from the text; tidyr functions to reshape or separate the data into columns; dplyr functions to filter, group, or summarize the data; lubridate functions to parse or manipulate dates; etc.
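As a small sketch of what that cleanup might look like on the movies data frame built earlier in this tutorial (the column names come from that data frame; the exact steps depend on what the page actually returns):
library(dplyr)
library(stringr)
movies_clean <- movies %>%
  mutate(
    year = as.integer(str_remove_all(year, "[()]")),  # strip the parentheses around the year
    rating = as.numeric(rating),                      # ratings arrive as text
    synopsis = str_squish(synopsis)                   # collapse stray whitespace and newlines
  ) %>%
  filter(!is.na(rating)) %>%   # drop rows where the rating could not be parsed
  arrange(desc(rating))        # highest-rated movies first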