Web Scraping in C#: A Beginner-Friendly Tutorial

Want to extract web data using C#? This in-depth tutorial covers everything from setting up scraping tools to bypassing anti-scraping measures with proxies and Selenium.

Jan Wiśniewski · 10 min read
Article content
  1. Why Choose C# for Web Scraping?
  2. Setting Up the C# Web Scraping Environment
  3. Making HTTP Requests in C#
  4. Parsing HTML Data
  5. Web Scraping with Selenium in C#
  6. Using Proxy Servers in C# Web Scraping
  7. Additional Methods Against Anti-Scraping Measures
  8. Storing Scraped Data
  9. Frequently Asked Questions

Web scraping in C# allows developers to extract data from websites efficiently, whether for research, automation, or business intelligence. With powerful tools like HttpClient, HtmlAgilityPack, Selenium, and Infatica proxies, you can collect and process web data while handling challenges like dynamic content and anti-scraping measures. In this guide, we’ll explore the best C# web scraping techniques, provide code examples, and show you how to store your scraped data effectively.

Why Choose C# for Web Scraping?

C# is a powerful, strongly typed language with a rich ecosystem that makes it well-suited for web scraping. Here’s why C# is a great choice:

  • Robust .NET ecosystem: The .NET framework and .NET Core provide extensive libraries for handling HTTP requests, parsing HTML, and managing concurrency.
  • Efficient HTTP handling: The HttpClient class in C# enables efficient and asynchronous HTTP requests for retrieving webpage content.
  • Powerful HTML parsing libraries: C# has well-established libraries like HtmlAgilityPack and AngleSharp for extracting and processing web data.
  • Support for automation: With Selenium WebDriver, C# can handle dynamic, JavaScript-heavy pages.
  • Performance and scalability: C# offers multi-threading and async programming capabilities, making it efficient for scraping large datasets.

Setting Up the C# Web Scraping Environment

Choosing an IDE for C# Development

First, we’ll need a suitable development environment. Here are some popular options – for most users, Visual Studio Community Edition is the best choice, as it provides a complete development environment for free:

  • Visual Studio (Recommended): A full-fledged IDE with powerful debugging tools, IntelliSense, and built-in support for .NET development.
  • Visual Studio Code (VS Code): A lightweight code editor with C# extensions available for debugging and syntax highlighting.
  • JetBrains Rider: A commercial IDE with advanced features for C# and .NET development.
  • Other editors: While other code editors like Notepad++ or Sublime Text can be used, they lack the robust debugging and project management features of dedicated C# IDEs.

Installing the .NET SDK

Before you start writing C# code, you need to install the .NET SDK. This includes the C# compiler and the runtime needed to execute applications.

Download the .NET SDK from the official Microsoft .NET website and install the SDK by following the on-screen instructions.

Verify the installation by running the following command in the terminal and seeing the installed .NET SDK version:

dotnet --version

Creating a New C# Project

Once the SDK is installed, create a new C# project: Open a terminal or command prompt and run the command to create a new console application:

dotnet new console -n WebScraper

Navigate into the project folder:

cd WebScraper

Open the project in your preferred IDE. Then build and run the application from the terminal:

dotnet run

Installing Required Libraries

For web scraping in C#, you need to install libraries that help with HTTP requests and HTML parsing. The two most commonly used libraries are:

  • HtmlAgilityPack (for parsing HTML)
  • AngleSharp (alternative HTML parser with DOM manipulation capabilities)

To install these libraries, run the following commands in your terminal inside the project directory:

dotnet add package HtmlAgilityPack
dotnet add package AngleSharp

If you plan to scrape JavaScript-heavy websites, you may also need Selenium:

dotnet add package Selenium.WebDriver
dotnet add package Selenium.Support

Making HTTP Requests in C#

Web scraping involves retrieving webpage content using HTTP requests. In C#, the HttpClient class provides a powerful way to send requests and handle responses efficiently.

Using HttpClient for GET Requests

The most common way to retrieve webpage content is by sending a GET request. Below is a simple example:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using HttpClient client = new HttpClient();
        
        try
        {
            string url = "https://example.com";
            HttpResponseMessage response = await client.GetAsync(url);
            
            if (response.IsSuccessStatusCode)
            {
                string content = await response.Content.ReadAsStringAsync();
                Console.WriteLine(content);
            }
            else
            {
                Console.WriteLine($"Error: {response.StatusCode}");
            }
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"Request error: {e.Message}");
        }
    }
}
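
A couple of practical notes: reuse a single HttpClient instance across requests instead of creating a new one per request, and consider capping how long a request may take. As a minimal sketch (the 30-second value below is just an example):

using HttpClient client = new HttpClient
{
    // Give up on requests that take longer than 30 seconds
    Timeout = TimeSpan.FromSeconds(30)
};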

Handling POST Requests

Some websites require sending data using a POST request. Here’s an example of sending a POST request with form data:

using System;
using System.Net.Http;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using HttpClient client = new HttpClient();
        
        var postData = new Dictionary<string, string>
        {
            { "username", "testuser" },
            { "password", "mypassword" }
        };
        
        using FormUrlEncodedContent content = new FormUrlEncodedContent(postData);
        HttpResponseMessage response = await client.PostAsync("https://example.com/login", content);
        
        if (response.IsSuccessStatusCode)
        {
            string responseData = await response.Content.ReadAsStringAsync();
            Console.WriteLine(responseData);
        }
        else
        {
            Console.WriteLine($"Error: {response.StatusCode}");
        }
    }
}
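
If the endpoint expects a JSON body instead of form data, you can serialize the payload and send it with StringContent. A minimal sketch, assuming a hypothetical https://example.com/api/login endpoint:

using System.Text;
using System.Text.Json;

var payload = new { username = "testuser", password = "mypassword" };
string json = JsonSerializer.Serialize(payload);

// Set the Content-Type to application/json so the server parses the body correctly
using var content = new StringContent(json, Encoding.UTF8, "application/json");
HttpResponseMessage response = await client.PostAsync("https://example.com/api/login", content);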

Handling Headers and User-Agent Rotation

To avoid being blocked, it’s good practice to modify request headers, including the User-Agent:

client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
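
DefaultRequestHeaders applies to every request the client sends. If you need different headers per request (for example, when rotating user agents), you can attach them to an HttpRequestMessage instead; a short sketch:

// Headers that rarely change can live on the client
client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.9");

// Per-request headers go on an HttpRequestMessage
var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com");
request.Headers.Add("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)");
HttpResponseMessage response = await client.SendAsync(request);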

Parsing HTML Data

Extracting Data Using HtmlAgilityPack (XPath and CSS Selectors)

HtmlAgilityPack is a powerful library for parsing HTML and extracting specific elements. It supports XPath queries natively, and CSS selectors are available through the Fizzler extension shown below.

Using XPath to extract data:

using HtmlAgilityPack;

// Load the page and parse it into a document
var doc = new HtmlWeb().Load("https://example.com");

// Note: SelectNodes returns null if no matching nodes are found
var links = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var link in links)
{
    Console.WriteLine(link.Attributes["href"].Value);
}

Using CSS Selectors with HtmlAgilityPack: While HtmlAgilityPack does not natively support CSS selectors, you can use Fizzler:

dotnet add package Fizzler.Systems.HtmlAgilityPack

using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Sample HTML content
        string html = "<div class='classname'><a href='#'>Click me</a></div>";

        // Load the HTML into an HtmlDocument
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select all <a> elements inside <div class='classname'>
        var nodes = doc.DocumentNode.QuerySelectorAll("div.classname a");

        // Print the text inside each <a> tag
        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerText);
        }
    }
}

Using AngleSharp as an Alternative Parser

AngleSharp provides full DOM manipulation capabilities similar to a browser environment. Parsing HTML with AngleSharp:

using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;

class Program
{
    static async Task Main()
    {
        // Sample HTML content
        string html = "<html><body><a href='https://example.com'>Example</a></body></html>";

        // Create a configuration and a browsing context
        var config = Configuration.Default;
        var context = BrowsingContext.New(config);

        // Load the HTML content into an AngleSharp document
        var document = await context.OpenAsync(req => req.Content(html));

        // Select all anchor (<a>) elements
        var links = document.QuerySelectorAll("a");

        // Print href attributes of all links
        foreach (var link in links)
        {
            Console.WriteLine(link.GetAttribute("href"));
        }
    }
}
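
AngleSharp can also download the page for you when the configuration includes the default loader. A minimal sketch:

// WithDefaultLoader() enables AngleSharp's built-in HTTP requester
var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);

// Fetch and parse the page in one step
var document = await context.OpenAsync("https://example.com");
Console.WriteLine(document.Title);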

Web Scraping with Selenium in C#

Setting Up Selenium WebDriver in C#

Selenium WebDriver allows automating web interactions in C#. To install Selenium WebDriver:

dotnet add package Selenium.WebDriver

Download and install a browser driver (e.g., ChromeDriver) and place the driver in your project directory or add it to your system PATH.
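
Recent Selenium releases (4.6 and newer) also include Selenium Manager, which can typically locate or download a matching driver automatically, so the manual download step may not be necessary. For scraping, it is also common to run the browser headless; here is a minimal sketch using ChromeOptions:

using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
// Run Chrome without a visible window (the "new" headless mode requires a recent Chrome)
options.AddArgument("--headless=new");
options.AddArgument("--window-size=1920,1080");

using var driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("https://example.com");
Console.WriteLine(driver.Title);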

Automating Interactions (Clicks, Form Submissions)

With Selenium, you can interact with web elements such as buttons, text fields, and dropdowns. Here’s how to open a webpage and click a button:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using System;

class Program
{
    static void Main()
    {
        // Initialize Chrome WebDriver
        using (IWebDriver driver = new ChromeDriver())
        {
            // Navigate to the webpage
            driver.Navigate().GoToUrl("https://example.com");

            // Find and click a button by its ID
            IWebElement button = driver.FindElement(By.Id("submit-button"));
            button.Click();

            // Close the browser
            driver.Quit();
        }
    }
}

Filling and submitting a form:

var inputField = driver.FindElement(By.Name("username"));
inputField.SendKeys("testuser");
var submitButton = driver.FindElement(By.Name("submit"));
submitButton.Click();
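
Dropdowns are handled through the SelectElement helper from the Selenium.Support package. A short sketch (the "country" field name is just an example):

using OpenQA.Selenium.Support.UI;

// Wrap the <select> element to get option-selection helpers
var dropdown = new SelectElement(driver.FindElement(By.Name("country")));
dropdown.SelectByText("Germany");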

Handling JavaScript-Rendered Content

Selenium is useful for handling dynamic content loaded via JavaScript. Here’s how to wait for an element to load:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class Program
{
    static void Main()
    {
        // Initialize Chrome WebDriver
        using (IWebDriver driver = new ChromeDriver())
        {
            // Navigate to a webpage
            driver.Navigate().GoToUrl("https://example.com");

            // Initialize WebDriverWait (set max wait time to 10 seconds)
            WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

            // Wait for the element with ID 'dynamic-content' to appear
            var element = wait.Until(d => d.FindElement(By.Id("dynamic-content")));

            // Print the text content of the element
            Console.WriteLine(element.Text);

            // Close the browser
            driver.Quit();
        }
    }
}
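
Some pages only load more content as you scroll (infinite scrolling). A common workaround is to run JavaScript through the driver to scroll the page before reading it; a rough sketch:

// Scroll to the bottom of the page to trigger lazy-loaded content
((IJavaScriptExecutor)driver).ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");

// Give the page a moment to fetch the newly requested items
System.Threading.Thread.Sleep(2000);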

Using Proxy Servers in C# Web Scraping

Proxies are essential for web scraping, especially when dealing with websites that implement anti-scraping measures. A proxy server acts as an intermediary between your scraper and the target website, masking your real IP address and distributing requests across multiple IPs.

Why Use Proxies?

  • Avoid IP bans: Scraping from a single IP address may trigger security measures. Using proxies allows IP rotation.
  • Access geo-restricted content: Some websites restrict content based on location. Proxies let you appear as if you're browsing from another country.
  • Bypass request limits: Some sites limit requests from a single IP. Rotating proxies help distribute traffic.

Adding Proxies to Your C# Scraper

C#’s HttpClientHandler allows configuring a proxy server for HTTP requests. Here’s how to set up a proxy with authentication:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var httpClientHandler = new HttpClientHandler()
        {
            // Set the proxy credentials explicitly; user:password embedded in the
            // proxy URL is not reliably honored by HttpClientHandler
            Proxy = new WebProxy("http://proxy.infatica.io:port")
            {
                Credentials = new NetworkCredential("username", "password")
            },
            UseProxy = true
        };

        using HttpClient client = new HttpClient(httpClientHandler);

        try
        {
            HttpResponseMessage response = await client.GetAsync("https://example.com");
            string content = await response.Content.ReadAsStringAsync();
            Console.WriteLine(content);
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"Request failed: {e.Message}");
        }
    }
}

In this example, replace proxy.infatica.io:port with your proxy server address and supply your actual username and password in the NetworkCredential. The proxy then forwards your requests through a different IP address.

Rotating Proxies for Each Request

To further enhance anonymity, you can use a list of proxy IPs and rotate them for each request. This ensures that each request uses a different proxy from the list, making it harder for the target website to detect scraping patterns.

string[] proxies =
{
    "http://proxy1.infatica.io:port",
    "http://proxy2.infatica.io:port",
    "http://proxy3.infatica.io:port"
};

// Pick a random proxy from the list for this request
Random random = new Random();
string selectedProxy = proxies[random.Next(proxies.Length)];

var handler = new HttpClientHandler()
{
    Proxy = new WebProxy(selectedProxy)
    {
        Credentials = new NetworkCredential("username", "password")
    },
    UseProxy = true
};

using HttpClient client = new HttpClient(handler);
HttpResponseMessage response = await client.GetAsync("https://example.com");

Additional Methods Against Anti-Scraping Measures

To scrape even more effectively without getting blocked, you can also implement the following strategies:

Rotating User Agents and Headers

Many websites check the User-Agent header to identify bots. By rotating user agents, you can make your scraper appear as different browsers. Set a User-Agent in HttpClient:

HttpClient client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
client.DefaultRequestHeaders.Add("Referer", "https://example.com");

To rotate user agents, store multiple values in a list and select one randomly for each request.

string[] userAgents = 
{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
};

Random random = new Random();
client.DefaultRequestHeaders.Add("User-Agent", userAgents[random.Next(userAgents.Length)]);

Implementing Delays and Request Throttling

If you send too many requests in a short time, websites may block your scraper. Adding delays between requests mimics human browsing behavior.

await Task.Delay(2000); // 2-second delay between requests

For more controlled request rates, you can use an exponential backoff strategy:

async Task<string> FetchData(string url)
{
    int[] delays = { 1000, 2000, 5000, 10000 }; // Delays in milliseconds

    foreach (int delay in delays)
    {
        try
        {
            HttpResponseMessage response = await client.GetAsync(url);
            if (response.IsSuccessStatusCode)
            {
                return await response.Content.ReadAsStringAsync();
            }
        }
        catch (HttpRequestException)
        {
            // Network-level failure; fall through to the retry delay below
        }

        Console.WriteLine($"Request failed. Retrying in {delay / 1000} seconds...");
        await Task.Delay(delay);
    }

    return null; // All retries failed
}

Storing Scraped Data

Once you've extracted data from a website, you'll need to store it for later analysis or processing. Depending on your use case, you can save the data in various formats, such as CSV, JSON, or a database. Let’s take a closer look at different storage methods and demonstrate how to use System.Text.Json for JSON serialization.

1. Saving Data to a CSV File

CSV (Comma-Separated Values) is a simple and widely used format for storing tabular data. To write data to a CSV file, you can use StreamWriter:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        List<string[]> data = new List<string[]>
        {
            new string[] { "Title", "Price", "URL" },
            new string[] { "Product 1", "$20", "https://example.com/product1" },
            new string[] { "Product 2", "$35", "https://example.com/product2" }
        };

        using (StreamWriter writer = new StreamWriter("scraped_data.csv"))
        {
            foreach (var row in data)
            {
                await writer.WriteLineAsync(string.Join(",", row));
            }
        }

        Console.WriteLine("Data saved to scraped_data.csv");
    }
}

Here’s how this code works:

  • Each row is represented as an array of strings.
  • StreamWriter writes the data to a file, with values separated by commas.
  • The first row acts as a header.
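
Keep in mind that this simple approach breaks if a value itself contains a comma, quote, or newline. For real-world data, use a dedicated CSV library (such as CsvHelper) or escape the values yourself; a minimal escaping helper might look like this sketch:

static string EscapeCsv(string value)
{
    // Quote the value if it contains a comma, quote, or newline,
    // and double any embedded quotes (standard CSV escaping)
    if (value.Contains(',') || value.Contains('"') || value.Contains('\n'))
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
    return value;
}

You would then write each row by joining the escaped values rather than the raw strings.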

2. Saving Data as JSON (Using System.Text.Json)

JSON (JavaScript Object Notation) is a structured format widely used for APIs and data storage. C# provides System.Text.Json for JSON serialization.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

class Program
{
    public class Product
    {
        public string Title { get; set; }
        public string Price { get; set; }
        public string Url { get; set; }
    }

    static async Task Main()
    {
        List<Product> products = new List<Product>
        {
            new Product { Title = "Product 1", Price = "$20", Url = "https://example.com/product1" },
            new Product { Title = "Product 2", Price = "$35", Url = "https://example.com/product2" }
        };

        string jsonString = JsonSerializer.Serialize(products, new JsonSerializerOptions { WriteIndented = true });

        await File.WriteAllTextAsync("scraped_data.json", jsonString);

        Console.WriteLine("Data saved to scraped_data.json");
    }
}

Here’s how this code works:

  • The Product class represents the data model.
  • JsonSerializer.Serialize() converts the object list into a JSON-formatted string.
  • File.WriteAllTextAsync() writes the JSON string to a file.
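
Reading the data back later is the mirror operation: deserialize the file contents into the same Product model. A short sketch:

// Load the JSON file and convert it back into a list of Product objects
string json = await File.ReadAllTextAsync("scraped_data.json");
List<Product> loaded = JsonSerializer.Deserialize<List<Product>>(json);

foreach (var product in loaded)
{
    Console.WriteLine($"{product.Title} - {product.Price}");
}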

3. Storing Data in a Database (SQLite Example)

Databases are useful for handling large-scale data efficiently; SQLite, a lightweight, file-based database, works well for scraping projects.

Install the Microsoft.Data.Sqlite package via NuGet:

dotnet add package Microsoft.Data.Sqlite

To write data to SQLite:

using System;
using System.Collections.Generic;
using Microsoft.Data.Sqlite;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        string connectionString = "Data Source=scraped_data.db";

        using (var connection = new SqliteConnection(connectionString))
        {
            await connection.OpenAsync();

            string createTableQuery = @"CREATE TABLE IF NOT EXISTS Products (
                                        Id INTEGER PRIMARY KEY AUTOINCREMENT,
                                        Title TEXT,
                                        Price TEXT,
                                        Url TEXT)";
            using (var command = new SqliteCommand(createTableQuery, connection))
            {
                await command.ExecuteNonQueryAsync();
            }

            List<(string Title, string Price, string Url)> products = new()
            {
                ("Product 1", "$20", "https://example.com/product1"),
                ("Product 2", "$35", "https://example.com/product2")
            };

            foreach (var product in products)
            {
                string insertQuery = "INSERT INTO Products (Title, Price, Url) VALUES (@Title, @Price, @Url)";
                using (var command = new SqliteCommand(insertQuery, connection))
                {
                    command.Parameters.AddWithValue("@Title", product.Title);
                    command.Parameters.AddWithValue("@Price", product.Price);
                    command.Parameters.AddWithValue("@Url", product.Url);
                    await command.ExecuteNonQueryAsync();
                }
            }
        }

        Console.WriteLine("Data saved to SQLite database.");
    }
}

Here’s how this code works:

  • Creates an SQLite database file (`scraped_data.db`) if it doesn’t exist.
  • Defines a Products table with columns for title, price, and URL.
  • Inserts scraped data into the database using parameterized queries.
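
To read the stored rows back later, run a SELECT query with a data reader. A minimal sketch using the same database file:

using (var connection = new SqliteConnection("Data Source=scraped_data.db"))
{
    await connection.OpenAsync();

    using (var command = new SqliteCommand("SELECT Title, Price, Url FROM Products", connection))
    using (var reader = await command.ExecuteReaderAsync())
    {
        while (await reader.ReadAsync())
        {
            // Columns are read by their position in the SELECT list
            Console.WriteLine($"{reader.GetString(0)} | {reader.GetString(1)} | {reader.GetString(2)}");
        }
    }
}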

Frequently Asked Questions

Is web scraping legal?

Web scraping legality depends on the website’s terms of service and the type of data being collected. Publicly available data is generally safe to scrape, but scraping private or copyrighted content without permission may violate legal guidelines.

Which C# libraries are commonly used for web scraping?

Popular libraries include HtmlAgilityPack (for parsing HTML), AngleSharp (for advanced HTML parsing), and Selenium (for scraping JavaScript-heavy sites). HttpClient is often used for sending requests and fetching page content efficiently.

How can I avoid getting blocked while scraping?

To reduce the chances of getting blocked, use rotating proxies (e.g., Infatica’s residential proxies), vary your user agents, implement request throttling, and avoid sending too many requests in a short period.

How do I scrape JavaScript-rendered pages?

For JavaScript-rendered pages, use Selenium WebDriver to interact with the browser. Alternatively, try Puppeteer Sharp (a C# wrapper for Puppeteer) or scrape API endpoints directly if available.

How should I store scraped data?

Depending on your needs, you can save data in CSV (easy to read and process), JSON (structured and API-friendly), or SQLite/MySQL (for scalable and queryable storage).

Jan Wiśniewski

Jan is a content manager at Infatica. He is curious to see how technology can be used to help people and explores how proxies can help to address the problem of internet freedom and online safety.
