- Is Golang Suitable for Web Scraping?
- Libraries for Go Web Scraping
- Step 1: Install Go
- Step 2: Install Colly
- Step 3: Create the Main Go File
- Step 4: Initialize Colly Collector
- Step 5: Visit the Target Website
- Step 6: Send HTTP Requests with Colly
- Step 7: Inspect the HTML page
- Step 8: Define Data Extraction
- Step 9: Save Scraped Data
- Step 10: Refine Selectors
- Handling Pagination
- Go Web Scraping Challenges
- Frequently Asked Questions
The Go programming language is rapidly gaining popularity as a powerful choice for web scraping due to its efficiency and concurrency capabilities. In this article, you'll discover the fundamentals of web scraping in Go, from setting up your development environment to managing HTTP requests and extracting data. We'll explore key web scraping frameworks like Colly and GoQuery, providing complete scraper code for parsing HTML, handling pagination, managing sessions, and exporting scraped data to formats like CSV and JSON.
Is Golang Suitable for Web Scraping?
Golang (commonly known as Go) is an excellent choice for automated data retrieval – here's why:
1. Performance: Go is compiled and known for its high execution speed, making it ideal for tasks like web scraping that may involve processing large amounts of data quickly.
2. Concurrency: Go's built-in support for concurrency through goroutines allows multiple web pages to be scraped simultaneously, improving efficiency and reducing the total time needed for large scraping jobs (see the sketch after this list).
3. Lightweight: Go’s small memory footprint makes it suitable for handling multiple web scraping tasks at once without consuming too many system resources.
4. Library support: With open-source libraries like Colly, Go can simplify data extraction. Colly offers a clean web scraping API for managing scraping tasks, handling parallel requests, and avoiding common issues like getting blocked by servers.
5. Error handling: Go’s error-handling mechanisms make it easier to manage edge cases, ensuring robust and reliable scraping workflows.
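To make the concurrency point concrete, here's a minimal sketch of how Colly's Async option runs requests on goroutines; the URLs are placeholders, and this is only an illustration of the idea rather than part of the tutorial's scraper.

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Async(true) tells Colly to run requests concurrently on goroutines.
    c := colly.NewCollector(colly.Async(true))

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    // Placeholder URLs; replace with real pages you're allowed to scrape.
    for _, url := range []string{"http://example.com/page1", "http://example.com/page2"} {
        c.Visit(url)
    }

    // Wait blocks until all in-flight requests have finished.
    c.Wait()
}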
Libraries for Go Web Scraping
| Feature | Colly | GoQuery | Goutte | Surf | Rod |
| --- | --- | --- | --- | --- | --- |
| Ease of Use | Simple and beginner-friendly | More flexible but lower-level | Easy, but limited features | Moderate, browser-like behavior | Advanced, browser automation |
| Concurrency | Built-in, supports parallel scraping | No built-in concurrency, handled manually | No built-in concurrency | Built-in concurrency | Built-in concurrency |
| HTML Parsing | Built-in, XPath and CSS selectors | Built-in, jQuery-like selectors | CSS selectors only | Built-in, browser-based parsing | Full browser DOM and CSS parsing |
| Request Handling | Customizable (headers, cookies) | Basic request handling | Basic request handling | Advanced request handling | Full control (including JavaScript execution) |
| JavaScript Support | No (static pages only) | No (static pages only) | No (static pages only) | No (static pages only) | Yes (headless browser) |
| Error Handling | Good, built-in retry mechanisms | Limited, requires custom logic | Basic error handling | Good, automatic retries | Full error control |
| Documentation | Excellent, well-documented | Moderate, requires external references | Limited | Moderate | Good, but advanced features |
| Best Use Case | Efficient, high-speed scraping | Complex HTML parsing | Simple scraping projects | Browser-like behavior for static sites | Dynamic page scraping, including JavaScript |
| Active Maintenance | Yes, regularly updated | Yes, but slower updates | Limited updates | Somewhat active | Yes, regularly updated |
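This tutorial focuses on Colly, but for comparison, here's a minimal GoQuery sketch: fetch a page with net/http and select elements jQuery-style. The URL and selector are placeholders.

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Fetch the page with the standard library, then hand the body to GoQuery.
    res, err := http.Get("http://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    // jQuery-like selection: print the text of every <h1> on the page.
    doc.Find("h1").Each(func(i int, s *goquery.Selection) {
        fmt.Println("Heading:", s.Text())
    })
}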
Step-by-Step Guide to Web Scraping with Go
Before you can start scraping with Go and Colly, you'll need to ensure Go is properly installed on your system. Here's how to set it up on macOS, Windows, and Linux:
Step 1.1: Install Go on macOS
1. Using Homebrew: If you don’t already have Homebrew installed, open a terminal and run the following command to install it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Once Homebrew is installed, you can install Go by running:
brew install go
2. Manual installation: Download the latest Go package for macOS from Go’s official website. Run the installer and follow the on-screen instructions. After installation, verify it by opening the terminal and typing:
go version
Step 1.2: Install Go on Windows
1. Using the MSI installer: Go to Go’s official website and download the .msi installer for Windows. Run the installer and follow the prompts. Once installed, open the Command Prompt and verify by typing:
go version
2. Setting the Path: Ensure the Go binary is included in your system’s PATH environment variable. The installer should handle this, but if not, you can manually add C:\Go\bin to the system environment variables.
Step 1.3: Install Go on Linux
1. Using Package Manager: For Ubuntu/Debian-based systems, run the following commands to install Go:
sudo apt update
sudo apt install golang-go
For Fedora-based systems:
sudo dnf install golang
2. Manual installation: Download the Go tarball for Linux from Go’s official website. Extract the tarball and move the files to /usr/local:
sudo tar -C /usr/local -xzf go1.X.X.linux-amd64.tar.gz
Add Go to your PATH by adding the following to your ~/.profile or ~/.bashrc:
export PATH=$PATH:/usr/local/go/bin
Finally, verify the installation:
go version
Step 2: Install Colly
Once Go is set up, the next step is to install the Colly library, which will be the main tool for handling web scraping in your Go project.
1. Initialize your project: First, create a new directory for your project. Open a terminal or command prompt and run:
mkdir go-scraper
cd go-scraper
Inside your project folder, initialize a new Go module by running:
go mod init go-scraper
This command creates a go.mod file, which manages all the required dependencies for your project.
2. Install Colly: To install Colly, use the go get command. This will fetch the Colly package and add it to your project’s dependencies:
go get -u github.com/gocolly/colly/v2
The -u flag ensures that the latest version of Colly is installed.
3. Verify installation: To confirm that Colly is installed correctly, open the go.mod file in your project directory. You should see github.com/gocolly/colly/v2 listed as one of the dependencies.
Step 3: Create the Main Go File
With Go and Colly set up, it's time to create the main Go file where you'll write the code for building web scrapers.
1. Create a new file: Inside your project folder, create a new file called main.go:
touch main.go
2. Add the code snippet: Before getting started, set up a basic Go program. In main.go, add the following snippet, which consists of only a few lines:
package main

import (
    "fmt"

    // Blank import for now; Colly is used directly from Step 4 onward.
    _ "github.com/gocolly/colly/v2"
)

func main() {
    fmt.Println("Web scraping with Colly!")
}
This basic code imports the necessary Go packages (fmt for printing messages and Colly for web scraping; the underscore keeps the not-yet-used Colly import from breaking the build) and defines the main function, which will be the entry point of your program.
3. Test the program: Run the program to make sure everything is set up correctly. In the terminal, inside your project folder, run:
go run main.go
You should see the output:
Web scraping with Colly!
Step 4: Initialize Colly Collector
Now that your Go project is set up and running, the next step is to initialize Colly’s Collector, which will handle the core functionality of your web scraper.
1. Initialize the Collector: In the main.go file, modify the main function to initialize a new Colly Collector. This object will be responsible for making requests and receiving responses from websites. Add the following code:
func main() {
    // Initialize a new Colly collector
    // (remove the underscore from the Colly import now that the package is used)
    c := colly.NewCollector(
        // Set options, like limiting the domain scope
        colly.AllowedDomains("example.com"),
    )

    _ = c // the collector is put to work in the next step; this keeps the program compiling

    fmt.Println("Collector initialized!")
}
2. Configure Collector options: The NewCollector() function takes optional parameters to customize the behavior of web scrapers. For example, in the code above, we use colly.AllowedDomains() to restrict the web scraper to only scrape URLs from a specific domain (in this case, example.com). This is useful to prevent the scraper from wandering off into unrelated domains during a crawl.
3. Additional Collector settings (optional): You can further customize the Collector with additional settings such as:
- User-Agent string: Customize the User-Agent header to mimic different browsers and avoid being blocked by websites.
- Rate limiting: Control the rate at which requests are sent to avoid overloading the target server.
- Cookies and headers: Manage cookies and HTTP headers to handle authentication or session-based scraping.
Example of setting a User-Agent and enabling logging:
c := colly.NewCollector(
    colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    // Requires: import "github.com/gocolly/colly/v2/debug"
    colly.Debugger(&debug.LogDebugger{}),
)
4. Test the initialization: To confirm the collector is working, you can add a simple fmt.Println statement as shown in the code, and re-run the program:
go run main.go
If the program runs successfully and prints "Collector initialized!", you've correctly initialized the Colly Collector and are ready to start scraping!
Step 5: Visit the Target Website
Now that the Colly Collector is initialized, the next step is to tell it which website to visit. Colly makes it simple to send requests to target URLs.
1. Visit a website: Use the Visit method to instruct your Collector to send an HTTP request to a specific URL. Add the following code inside the main function:
func main() {
    // Initialize a new Colly collector
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
    )

    // Define what to do when visiting the target URL
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Visit the target website
    err := c.Visit("http://example.com")
    if err != nil {
        fmt.Println("Error visiting website:", err)
    }
}
2. Handle the request: The OnRequest method is a Colly event handler that gets triggered each time the web scraper makes a request. It can be used to log or modify requests before they are sent. In this example, it prints the URL being visited, which helps track the progress of the web scraper.
3. Error handling: The Visit method returns an error if the request fails (e.g., if the website is down or unreachable). It’s a good practice to check for errors and handle them accordingly.
4. Test the code: To see the scraper in action, run the program:
go run main.go
You should see the output:
Visiting http://example.com
Step 6: Send HTTP Requests with Colly
Now that your Collector can visit a website, the next step is to handle the HTTP responses and interact with the data that Colly receives from the server. Let’s use Colly’s event handlers to manage responses and extract useful information.
1. Handling responses: Use the OnResponse method to define what happens when the Colly collector receives a response from the website. Add this code inside the main function:
func main() {
    // Initialize a new Colly collector
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
    )

    // Print the URL of the request
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Handle the HTTP response
    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Received response from", r.Request.URL)
        fmt.Println("Response size:", len(r.Body), "bytes")
    })

    // Visit the target website
    err := c.Visit("http://example.com")
    if err != nil {
        fmt.Println("Error visiting website:", err)
    }
}
2. Extracting information from the response: The OnResponse method provides access to the Response object, which contains the raw HTML and metadata from the web page. In the example above, the Golang web scraper prints the size of the response body in bytes. You can modify this to process or inspect the content.
3. Handling different response status codes: You can also manage different HTTP status codes by checking the status of the response. For example:
c.OnResponse(func(r *colly.Response) {
    if r.StatusCode == 200 {
        fmt.Println("Success! Page loaded.")
    } else {
        fmt.Println("Failed to load page:", r.StatusCode)
    }
})
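Keep in mind that Colly normally reports failed requests (network errors and non-2xx statuses) through a separate OnError callback rather than OnResponse, so it's worth registering one alongside the handler above. A minimal sketch:

c.OnError(func(r *colly.Response, err error) {
    // Fires on network failures and HTTP error statuses
    fmt.Println("Request to", r.Request.URL, "failed:", err)
})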
4. Test the code: Run the program to test the HTTP request and response handling:
go run main.go
If successful, you should see output similar to:
Visiting http://example.com
Received response from http://example.com
Response size: 1256 bytes
Step 7: Inspect the HTML page
With your Colly scraper successfully sending requests and receiving responses, the next step is to inspect the HTML content of the web page you’re scraping. This will help you identify the elements you want to retrieve data from.
1. Viewing the HTML structure: Before diving into the code, open your web browser and navigate to the target website (e.g., http://example.com). Right-click on the page and select “Inspect” or “View Page Source” to examine the HTML structure.
Look for the specific HTML elements that contain the data you want to scrape. For example, if you're looking for article titles, you might find them wrapped in <h1>, <h2>, or <div> tags with specific class attributes.
2. Using Colly to parse the HTML: Colly makes it easy to extract elements from the HTML response using selectors. You’ll set up event handlers to process the HTML once the page is successfully loaded.
Modify your main.go file to include the OnHTML method, which allows you to specify the HTML elements to target. Here’s an example:
c.OnHTML("h1, h2, .article-title", func(e *colly.HTMLElement) {
    fmt.Println("Found title:", e.Text)
})
In this example, the scraper looks for all <h1> and <h2> tags, as well as any elements with the class article-title. Whenever it finds a match, it prints the text content.
3. Exploring more selectors: Colly supports various selectors, including CSS selectors and XPath, allowing for flexible data extraction. You can chain selectors for more complex queries. For instance, if you wanted to extract links within a specific section, you could use:
c.OnHTML(".links a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Println("Found link:", link)
})
4. Test the code: After adding the OnHTML event handler, run your program:
go run main.go
Depending on the structure of the target web page, you should see output similar to:
Visiting http://example.com
Found title: Welcome to Example
Found title: About Us
Step 8: Define Data Extraction
With the HTML element identified, the next step is to define how to extract the specific data you want from the web page. This involves setting up your Colly scraper to parse the relevant information and store it for later use.
1. Setting up data structures: Before extracting data, it’s a good idea to define a structure to hold the scraped information. This makes it easier to manage the data later on. For instance, if you are scraping articles, you might want to create a data structure to hold the title and link:
type Article struct {
    Title string
    Link  string
}

var articles []Article
2. Modifying the OnHTML handler: Update the OnHTML method to populate the defined structure with the extracted data. Here’s how you can modify your existing code:
c.OnHTML("h1, h2, .article-title", func(e *colly.HTMLElement) {
    title := e.Text
    link := e.Request.AbsoluteURL(e.Attr("href")) // Get the absolute URL if applicable
    articles = append(articles, Article{Title: title, Link: link})
    fmt.Println("Found article:", title, "Link:", link)
})
3. Extracting additional data: You can also extract other relevant data such as publication dates, summaries, or categories by adding more OnHTML handlers. For example, if there’s a date associated with each article in a specific <span> tag:
c.OnHTML(".article-date", func(e *colly.HTMLElement) {
    date := e.Text
    fmt.Println("Publication date:", date)
})
4. Handling nested structures: If your data structure is nested (for instance, if each article has comments), you can use ForEach inside the OnHTML handler to process the child elements:
c.OnHTML(".article", func(e *colly.HTMLElement) {
    title := e.ChildText("h2")
    link := e.ChildAttr("a", "href")
    article := Article{Title: title, Link: link}

    // Extract nested comments
    e.ForEach(".comment", func(_ int, c *colly.HTMLElement) {
        comment := c.Text
        fmt.Println("Comment:", comment)
    })

    articles = append(articles, article)
})
5. Test the code: After defining the web scraping process, run your program to see if it collects the intended information:
go run main.go
You should see output that reflects the articles and any additional data you are extracting.
Step 9: Save Scraped Data
After successfully extracting the data from the target website, the next step is to save it for later use. You can choose to store the data in various formats, such as JSON, CSV, or even a database. In this example, we'll focus on saving the scraped data in JSON format, as it’s widely used and easy to work with.
1. Import the required packages: To handle JSON serialization and file creation, you need to import the encoding/json and os packages. Update the import block at the beginning of your main.go file:
import (
    "encoding/json"
    "fmt"
    "os"

    "github.com/gocolly/colly/v2"
)
2. Create a function to save data: Define a function that will take the scraped articles and save them to a JSON file. Add this function to your main.go file:
func saveToJSON(filename string, articles []Article) error {
    file, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    encoder := json.NewEncoder(file)
    encoder.SetIndent("", "  ") // Format the JSON with indentation
    return encoder.Encode(articles)
}
3. Call the save function: After the scraping process is complete, call the saveToJSON function to write the scraped data to a file. Update your main function as follows:
func main() {
    // ... [previous code for initialization]

    // Visit the target website and handle data extraction
    err := c.Visit("http://example.com")
    if err != nil {
        fmt.Println("Error visiting website:", err)
    }

    // Save scraped articles to a JSON file
    if err := saveToJSON("articles.json", articles); err != nil {
        fmt.Println("Error saving data:", err)
    } else {
        fmt.Println("Data saved to articles.json")
    }
}
4. Run the program: Execute your program to scrape data and save it to a JSON file:
go run main.go
If everything is set up correctly, you should see the output confirming that the data has been saved:
Data saved to articles.json
5. Verify the output: Check your project directory for the newly created articles.json file. Open it to ensure the scraped data is correctly formatted. You should see something like this:
[
  {
    "Title": "Article Title 1",
    "Link": "http://example.com/article1"
  },
  {
    "Title": "Article Title 2",
    "Link": "http://example.com/article2"
  }
]
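If you’d rather export to CSV (the other common format mentioned at the start of this step), a sketch along the same lines as saveToJSON might look like this, using the standard encoding/csv package; the column layout is just an assumption based on the Article struct above.

import (
    "encoding/csv"
    "os"
)

func saveToCSV(filename string, articles []Article) error {
    file, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    // Header row followed by one row per article
    if err := writer.Write([]string{"Title", "Link"}); err != nil {
        return err
    }
    for _, a := range articles {
        if err := writer.Write([]string{a.Title, a.Link}); err != nil {
            return err
        }
    }
    return nil
}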
Step 10: Refine Selectors
Now that you’ve successfully scraped and saved data from the target website, it’s important to refine your selectors. This ensures that your scraper targets the correct elements and extracts the most relevant information. Refining selectors also helps improve the scraper's resilience to changes in the website's structure.
1. Review the HTML structure: Before making changes, revisit the target website and inspect the HTML structure again. Look for any changes or inconsistencies in the elements you initially targeted. Take note of specific classes, IDs, or attributes that can help you refine your selectors.
2. Use specific selectors: Instead of using broad selectors that may match multiple elements, consider using more specific selectors to narrow down the results. For example, if you initially used a generic class selector like .article-title, refine it by scoping it to its parent container:
c.OnHTML(".article-container .article-title", func(e *colly.HTMLElement) {
    title := e.Text
    link := e.ChildAttr("a", "href")
    articles = append(articles, Article{Title: title, Link: link})
})
3. Combine selectors: You can combine multiple selectors to target elements more accurately. For instance, if you want to capture titles from specific sections:
c.OnHTML("section#featured h2.article-title, div.latest-articles h2", func(e *colly.HTMLElement) {
    title := e.Text
    link := e.ChildAttr("a", "href")
    articles = append(articles, Article{Title: title, Link: link})
})
4. Utilize XPath: If a CSS selector isn’t sufficient, you can use XPath for more complex queries. Colly supports XPath through the OnXML method. For example:
c.OnXML("//div[@class='article']", func(e *colly.XMLElement) {
    title := e.ChildText("h2")
    link := e.ChildAttr("a", "href")
    articles = append(articles, Article{Title: title, Link: link})
})
5. Testing and iteration: After refining your selectors, run your scraper to test the changes. Ensure that it captures the intended data without missing relevant items or returning unwanted results:
go run main.go
If you notice any discrepancies, revisit the HTML structure and continue refining the selectors as needed.
6. Implementing error handling: To make your scraper more robust, implement error handling for scenarios where elements may not be found. For example:
c.OnHTML(".article-title", func(e *colly.HTMLElement) {
    title := e.Text
    if title == "" {
        fmt.Println("Warning: Title not found")
        return
    }
    link := e.ChildAttr("a", "href")
    articles = append(articles, Article{Title: title, Link: link})
})
Handling Pagination
When scraping websites that display data across multiple HTML pages, handling pagination is crucial to ensure you capture all the relevant content. Here’s a step-by-step guide on how to manage pagination using the Colly library in Go:
1. Identify pagination links: Start by inspecting the target website to find the pagination links or buttons. Look for elements like “Next,” “Previous,” or numbered links at the bottom of the page. They often have specific classes or IDs that you can use as selectors.
2. Set up Colly collector: Initialize your Colly collector as you would normally. You’ll add Golang web scraping logic to handle pagination as part of your scraping routine.
3. Create a function to visit pages: Use the OnHTML event handler to extract the link to the next page and follow it. Here’s how you can modify your existing code:
c.OnHTML(".pagination a.next", func(e *colly.HTMLElement) {
    nextPage := e.Attr("href")
    if nextPage != "" {
        fmt.Println("Found next page:", nextPage)
        // Visit the next page
        e.Request.Visit(nextPage)
    }
})
In this example, the selector .pagination a.next targets the “Next” link. When the web crawler finds this link, it extracts the URL and issues a new request to visit the next page.
4. Handling multiple pages: You might want to ensure that you don’t get stuck in an infinite loop. To do this, consider keeping track of visited URLs:
visited := make(map[string]bool)

c.OnHTML(".pagination a.next", func(e *colly.HTMLElement) {
    nextPage := e.Attr("href")
    if nextPage != "" && !visited[nextPage] {
        visited[nextPage] = true
        fmt.Println("Visiting next page:", nextPage)
        e.Request.Visit(nextPage)
    }
})
5. Start the scraping process: Initially, start the scraping by visiting the first page. For example:
err := c.Visit("http://example.com/start-page")
if err != nil {
    fmt.Println("Error visiting starting page:", err)
}
6. Complete example: Here’s a complete example that combines all the steps:
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()
    visited := make(map[string]bool)

    c.OnHTML(".article-title", func(e *colly.HTMLElement) {
        title := e.Text
        link := e.ChildAttr("a", "href")
        fmt.Println("Found article:", title, "Link:", link)
    })

    c.OnHTML(".pagination a.next", func(e *colly.HTMLElement) {
        nextPage := e.Attr("href")
        if nextPage != "" && !visited[nextPage] {
            visited[nextPage] = true
            fmt.Println("Visiting next page:", nextPage)
            e.Request.Visit(nextPage)
        }
    })

    err := c.Visit("http://example.com/start-page")
    if err != nil {
        fmt.Println("Error visiting starting page:", err)
    }
}
7. Testing your scraper: Run your program to ensure it follows the pagination correctly and scrapes each page. If done correctly, you should see output for articles across all the pages, indicating that your scraper has successfully navigated through pagination.
8. Handling edge cases: Some websites serve dynamic pages (e.g., using JavaScript for infinite scrolling). In such cases, you might need to simulate scrolling or trigger JavaScript events, which is more complex and may require additional tools such as a headless browser (e.g., Rod from the comparison table above, or external tools like Puppeteer or Selenium).
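For reference only, here's a rough sketch of what loading a JavaScript-rendered page could look like with Rod, the headless-browser library from the comparison table; the URL and selector are placeholders, and this is independent of the Colly pipeline built in this guide.

package main

import (
    "fmt"

    "github.com/go-rod/rod"
)

func main() {
    // Launch (or connect to) a headless browser
    browser := rod.New().MustConnect()
    defer browser.MustClose()

    // Load the page and wait for it to finish rendering
    page := browser.MustPage("http://example.com").MustWaitLoad()

    // Read text from a rendered element (placeholder selector)
    title := page.MustElement("h1").MustText()
    fmt.Println("Title:", title)
}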
Go Web Scraping Challenges
Despite Go’s data scraping prowess, a pipeline like the one above isn’t without its challenges. Some of them are:
Rate Limiting and IP Blocking
Many websites implement anti-bot measures to prevent excessive scraping, which can result in blocked requests or even banned IP addresses. Here’s how to manage these challenges effectively:
Respect robots.txt: Before scraping, check the website’s robots.txt file (e.g., http://example.com/robots.txt). This file specifies which pages can be scraped and the crawl rate the site expects. Adhere to these guidelines.
Implement delays: Introduce a delay between requests to avoid hitting the server too frequently. Use time.Sleep() in Go to add pauses between requests:
import "time"
time.Sleep(2 * time.Second) // Sleep for 2 seconds between requests
Randomize delays: Instead of using a fixed delay, randomize the wait time to mimic human behavior. This makes your scraper less predictable:
import (
    "math/rand"
    "time"
)

delay := time.Duration(rand.Intn(5)+1) * time.Second // Random delay between 1 and 5 seconds
time.Sleep(delay)
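Alternatively, Colly can apply delays for you: its LimitRule supports a fixed Delay plus a RandomDelay per matching domain. A minimal sketch (the glob and durations are just example values):

err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*",             // Apply the rule to every domain
    Delay:       2 * time.Second, // Fixed delay between requests
    RandomDelay: 3 * time.Second, // Extra random delay of up to 3 seconds
    Parallelism: 2,               // At most two concurrent requests (relevant with Async collectors)
})
if err != nil {
    fmt.Println("Error setting limit rule:", err)
}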
Rotating User Agents: Websites may also block requests based on the User-Agent header. Rotating your User-Agent string with each request can help avoid detection. You can set custom User-Agent headers in Colly:
c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36")
})
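To rotate rather than fix the User-Agent, you can pick a random string from a small pool on every request; the strings below are illustrative examples only.

// Requires: import "math/rand"
userAgents := []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
}

c.OnRequest(func(r *colly.Request) {
    // Set a randomly chosen User-Agent for each outgoing request
    r.Headers.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
})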
Implementing backoff strategies: If you encounter an error (like a 429 Too Many Requests), implement a backoff strategy. Gradually increase the delay between requests after receiving an error, allowing the server to recover before trying again:
backoff := 1 * time.Second
for {
    err := c.Visit(url)
    if err != nil {
        fmt.Println("Error visiting:", err)
        time.Sleep(backoff)
        backoff *= 2 // Exponential backoff
        continue
    }
    break // Exit loop if successful
}
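A variation on the same idea is to back off from inside Colly's OnError callback and re-issue the failed request with Request.Retry; a rough sketch (not concurrency-safe, and in practice you should add a retry cap):

backoff := 1 * time.Second

c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode == 429 { // Too Many Requests
        fmt.Println("Rate limited, retrying in", backoff)
        time.Sleep(backoff)
        backoff *= 2 // Exponential backoff
        r.Request.Retry()
    }
})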
Handling CAPTCHA
Adding Infatica rotating proxies to your Golang web scraper using the Colly library can help you avoid CAPTCHAs and collect data more efficiently.
1. Configure proxies in your scraper: You can set a proxy for your requests by modifying the Transport of your Colly collector. You need to configure the HTTP transport to use a proxy.
package main

import (
    "fmt"
    "net/http"
    "net/url"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // Define your proxy URL
    proxyURL, err := url.Parse("http://your-proxy:port") // Replace with your proxy URL
    if err != nil {
        fmt.Println("Error parsing proxy URL:", err)
        return
    }

    // Set the proxy by swapping in a custom HTTP transport
    c.WithTransport(&http.Transport{
        Proxy: http.ProxyURL(proxyURL),
    })

    // Define your scraping logic
    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    // Start scraping
    err = c.Visit("http://example.com") // Replace with your target URL
    if err != nil {
        fmt.Println("Error visiting:", err)
    }
}
2. Rotating proxies: If you're using a rotating proxy service, you may need to randomly select a proxy from a list. Here's an example of how you can manage multiple proxies:
import (
    "math/rand"
    "time"
)

// List of proxies
var proxies = []string{
    "http://proxy1:port",
    "http://proxy2:port",
    "http://proxy3:port",
}

// Function to get a random proxy
func getRandomProxy() string {
    rand.Seed(time.Now().UnixNano()) // Seed the random number generator
    return proxies[rand.Intn(len(proxies))]
}

// Inside main(): set up the collector with a random proxy
// (also requires the fmt, net/http, and net/url imports from the previous example)
proxyURL, err := url.Parse(getRandomProxy())
if err != nil {
    fmt.Println("Error parsing proxy URL:", err)
    return
}

c.WithTransport(&http.Transport{
    Proxy: http.ProxyURL(proxyURL),
})
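If you'd rather let Colly rotate proxies for you, the library also ships a proxy helper package with a round-robin switcher; a minimal sketch using the same placeholder proxy URLs:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/proxy"
)

func main() {
    c := colly.NewCollector()

    // Rotate through the proxy URLs round-robin on each request
    rp, err := proxy.RoundRobinProxySwitcher(
        "http://proxy1:port",
        "http://proxy2:port",
        "http://proxy3:port",
    )
    if err != nil {
        fmt.Println("Error creating proxy switcher:", err)
        return
    }
    c.SetProxyFunc(rp)

    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    if err := c.Visit("http://example.com"); err != nil {
        fmt.Println("Error visiting:", err)
    }
}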
3. Testing and adjusting: After setting up your proxy, run your scraper and monitor the output. If you encounter issues, such as request failures or CAPTCHAs, you may need to adjust:
- The frequency of requests.
- The list of proxies being used.
Session Management and Authentication
When scraping websites that require login or session management, it’s essential to maintain authentication states across requests. This typically involves managing cookies, headers, and possibly session tokens.
1. Identify login requirements: Before you begin coding, analyze the target website to understand its login mechanism:
- Form-based login: Check if the site uses a form to capture login credentials.
- API-based authentication: Some sites may use APIs that require authentication tokens.
2. Create a login function: You’ll need to send a POST request to the login endpoint with the required credentials. Here’s an example of how to do this:
package main

import (
    "fmt"
    "net/http/cookiejar"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Set up a cookie jar to manage session cookies
    jar, _ := cookiejar.New(nil)
    c.SetCookieJar(jar)

    // Login details
    loginURL := "http://example.com/login" // Replace with the actual login URL
    loginData := map[string]string{
        "username": "your-username", // Replace with actual form field names and values
        "password": "your-password",
    }

    // Confirm the login once the response comes back
    c.OnResponse(func(r *colly.Response) {
        if r.Request.URL.String() == loginURL {
            fmt.Println("Logged in successfully!")
        }
    })

    // POST login request (Colly's Post takes the form data as a map[string]string)
    err := c.Post(loginURL, loginData)
    if err != nil {
        fmt.Println("Login error:", err)
        return
    }

    // Now, you can visit pages that require authentication
    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    err = c.Visit("http://example.com/dashboard") // Replace with a protected page
    if err != nil {
        fmt.Println("Error visiting dashboard:", err)
    }
}
3. Maintaining session across requests: The SetCookieJar() method in the example above allows the collector to maintain cookies received from the server during the login process. This way, any subsequent requests will automatically include the session cookie.
4. Handling token-based authentication: If the website uses token-based authentication (like JWT), you may need to retrieve a token upon login and include it in the headers for each request. Here’s how to do that:
// After logging in, extract the token from the response
// (requires the encoding/json import)
var token string

c.OnResponse(func(r *colly.Response) {
    if r.Request.URL.String() == loginURL {
        // Extract the token (assuming it's in a JSON response)
        var jsonResponse map[string]string
        json.Unmarshal(r.Body, &jsonResponse)
        token = jsonResponse["token"] // Adjust based on actual response structure
    }
})

// Add the token to headers for subsequent requests
c.OnRequest(func(r *colly.Request) {
    if token != "" {
        r.Headers.Set("Authorization", "Bearer "+token)
    }
})
5. Logout (optional): If the target website has a logout functionality, consider implementing it to cleanly end the session:
logoutURL := "http://example.com/logout" // Replace with actual logout URL

err = c.Visit(logoutURL)
if err != nil {
    fmt.Println("Error logging out:", err)
}