Web Scraping With Selenium



In the last tutorial, we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.

Some common use cases of Selenium for web scraping are automating a login, submitting form elements, handling alert prompts, adding/deleting cookies, and much more. It can handle exceptions as well. For more details on Selenium, you can refer to the official documentation. Let's dive into the world of Selenium right away and start by navigating to a URL.


Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!

Selenium is useful when you have to perform an action on a website such as:

  • Clicking on buttons
  • Filling forms
  • Scrolling
  • Taking a screenshot

It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application, and you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.

Installation

We will use Chrome in our example, so make sure you have it installed on your local machine. You will need:

  • the Chrome browser
  • ChromeDriver (the driver matching your Chrome version)
  • the selenium package

To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
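
For example, on macOS or Linux (the activation command differs slightly on Windows):

    virtualenv venv
    source venv/bin/activate
    pip install selenium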

Quickstart

Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
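
A minimal sketch, assuming chromedriver is available on your PATH (recent Selenium versions can also locate a matching driver automatically):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://news.ycombinator.com/")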

This will launch Chrome in headful mode (like regular Chrome, but controlled by your Python code). You should see a message stating that the browser is controlled by automated software.

To run Chrome in headless mode (without any graphical user interface), which is what you would typically do on a server, see the following example:
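
A sketch using Chrome options (the exact flag has varied across Chrome versions; --headless=new is the current spelling, plain --headless works on older versions):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # or "--headless" on older Chrome versions
    driver = webdriver.Chrome(options=options)
    driver.get("https://news.ycombinator.com/")
    print(driver.page_source)  # the full HTML of the rendered page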

The driver.page_source will return the full page HTML code.

Here are two other interesting WebDriver properties:

  • driver.title gets the page's title
  • driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)

Locating Elements

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).

There are many methods available in the Selenium API to select elements on the page. You can use:

  • Tag name
  • Class name
  • IDs
  • XPath
  • CSS selectors

We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (or Cmd + Shift + C on macOS) instead of having to right click + inspect each time.


find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag of a page:
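
Assume, for illustration, that the page's markup contains something like <h1 class="someclass" id="greatID">Super title</h1> (the class and ID values here are made up). Any of the following locator strategies would find it; the examples use the Selenium 4 find_element(By, ...) style, which replaces the older find_element_by_* helpers:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/")  # hypothetical page containing the h1 above

    # Any one of these locates the same element:
    h1 = driver.find_element(By.TAG_NAME, "h1")
    h1 = driver.find_element(By.CLASS_NAME, "someclass")
    h1 = driver.find_element(By.ID, "greatID")
    h1 = driver.find_element(By.XPATH, "//h1")
    h1 = driver.find_element(By.CSS_SELECTOR, "h1.someclass")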

All of these locator strategies can also be used with find_elements (note the plural), which returns a list of matching elements.

For example, to get all anchors on a page, use the following:
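
For example (a sketch using the plural form):

    from selenium.webdriver.common.by import By

    # driver is the webdriver.Chrome() instance created earlier
    all_links = driver.find_elements(By.TAG_NAME, "a")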

Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).

XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.

WebElement

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements. Here are the most useful:

  • Accessing the text of the element with the property element.text
  • Clicking on the element with element.click()
  • Accessing an attribute with element.get_attribute('class')
  • Sending text to an input with: element.send_keys('mypassword')

There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.

It can be useful for avoiding honeypots (for example, by not filling hidden inputs).

Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:
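
For instance (a sketch with a made-up field name), a form might contain <input type="hidden" name="extra_field" value=""> alongside its visible fields. Checking is_displayed() before typing keeps a scraper from falling into the trap:

    from selenium.webdriver.common.by import By

    # driver is the WebDriver instance from earlier; the form and field name are hypothetical
    for field in driver.find_elements(By.CSS_SELECTOR, "form input"):
        if field.is_displayed():            # the hidden honeypot input fails this check
            field.send_keys("some value")   # only fill what a real user could see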

This input value is supposed to be blank. If a bot visits the page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.

That's a classic honeypot.

Full example

Here is a full example using the Selenium API methods we just covered.

We are going to log into Hacker News:


In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.

In order to authenticate we need to:

  • Go to the login page using driver.get()
  • Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
  • Follow the same process with the password input
  • Click on the login button using element.click()

Should be easy, right? Let's see the code:
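
A sketch of those steps in Python. The form field names ("acct", "pw") and the submit selector are assumptions about the Hacker News login page and may need adjusting; find_element(By, ...) is the Selenium 4 spelling of the find_element_by_* helpers mentioned above:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    USERNAME = "your_username"   # placeholder credentials
    PASSWORD = "your_password"

    driver = webdriver.Chrome()
    driver.get("https://news.ycombinator.com/login")

    driver.find_element(By.NAME, "acct").send_keys(USERNAME)        # username input (assumed name)
    driver.find_element(By.NAME, "pw").send_keys(PASSWORD)          # password input (assumed name)
    driver.find_element(By.CSS_SELECTOR, "input[type=submit]").click()  # login button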

Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?


We could try a couple of things:

  • Check for an error message (like “Wrong password”)
  • Check for one element on the page that is only displayed once logged in.

So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!

We can't just check if the element is None because all of the find_element_by_* methods raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:
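
A sketch of that check:

    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    try:
        driver.find_element(By.ID, "logout")
        print("Successfully logged in")
    except NoSuchElementException:
        print("Login failed")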

Taking a screenshot

We could easily take a screenshot using:
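
For example (the file name is arbitrary):

    driver.save_screenshot("screenshot.png")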


Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.

In our Hacker News case it's simple and we don't have to worry about these issues.

Waiting for an element to be present

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.

If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:

  • Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
  • Use a WebDriverWait object.

If you use time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting too long or not long enough. Also, the website can load slowly on your local Wi-Fi connection, but will be 10 times faster on your cloud server. With the WebDriverWait method, you will wait the exact amount of time necessary for your element/data to be loaded.
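
A minimal sketch, assuming the element we are waiting for has the ID "mySuperId":

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )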

This will wait five seconds for an element located by the ID “mySuperId” to be loaded. There are many other interesting expected conditions like:

  • element_to_be_clickable
  • text_to_be_present_in_element
  • visibility_of_element_located

You can find more information about this in the Selenium documentation.

Executing Javascript

Sometimes, you may need to execute some Javascript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
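
A sketch that scrolls down before taking the screenshot (the pixel offset is arbitrary):

    driver.execute_script("window.scrollTo(0, 1000);")  # scroll 1000 pixels down the page
    driver.save_screenshot("after_scroll.png")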

Conclusion

I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.

Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.


Selenium is also an excellent tool to automate almost anything on the web.

If you perform repetitive tasks, like filling forms or checking information behind a login form where the website doesn't have an API, it might be a good idea to automate it with Selenium, just don't forget this xkcd.


Web scraping or data extraction from internet sites can be done with various tools and methods. The more complex sites to scrape are always the ones that look for suspicious behaviors and non-human patterns. Therefore, the tools we use for scraping must simulate human behavior as much as possible.

Tools that can simulate human behavior are typically testing tools, and one of the most commonly used and well-known of these is Selenium.

In our previous blog post in this series, we talked about Puppeteer and how it is used for web scraping. In this article we will focus on another library called Selenium.

What is Selenium?

Selenium is a framework for web testing that allows simulating various browsers and was initially made for testing front-end components and websites. As you can probably guess, whatever one would like to test, another would like to scrape. And in the case of Selenium, this is a perfect library for scraping. Or is it?

A Brief History

Selenium was first developed by Jason Huggins in 2004 as an internal tool for a company he worked for. Since then it has evolved a lot, but the concept has remained the same: a framework that simulates (or in truth really operates as) a web browser.

How does Selenium work?

Basically, Selenium is a library that can control an automated version of Google Chrome, Firefox, Safari, Vivaldi, etc. You can use it in an automated process and imitate a normal user’s behavior. If for example, you would like to check your competitor’s top 10 ranking products daily, you would write a piece of code that will open a Chrome window (automatically, using Selenium), surf to your competitor’s storefront or search results on Amazon and scrape the data of their leading products.

Selenium is composed of four components that make the testing (or scraping) process possible:

Selenium IDE:

This is a real IDE for testing. It is actually a Chrome extension or Firefox add-on that allows recording, editing and debugging tests. This component is less useful for scraping, since scraping would usually be done through an API.

Selenium Client API:

This is the API that allows our code to communicate with Selenium. There are various libraries for different programming languages, so scraping can be done in JavaScript, Java, C#, R, Python, and Ruby. It is continuously supported and improved by a strong community.

Selenium WebDriver:

This is the component of Selenium that actually plays the browser’s part in the scraping. The driver is just like a remote control that connects to a specific browser (like in anything else, each remote control is designed to control a specific browser), and through it, we can control the browser and tell it what to do.

Selenium Grid:


This is not a part of Selenium itself, but more of a tool that allows us to use multiple Selenium instances on remote machines.

Since our previous article talked about Puppeteer and praised it for being the best tool for web scraping, let’s examine the differences and similarities between Selenium and Puppeteer.


What is the difference between Selenium and Puppeteer?

This is a very common question and the distinction is very important. Both of these libraries are very similar in concept, but there are a few key points to consider: Puppeteer's main disadvantage is that it is limited to JavaScript, since the API Google publishes supports JavaScript only. And since it is a library written by Google, it supports only the Chrome browser. If you prefer to write all of your code in a language other than JavaScript, or it is important to your company to use a web browser other than Google Chrome, I would consider using Selenium.

For scraping purposes, the fact that Puppeteer supports only Chrome really does not matter in my opinion. It matters more for testing usages when you would want to test your app or website on different browsers.

The library's limitations, and the fact that you need to know JavaScript in order to use it, might be more restrictive, though I believe it's always good to learn new skills and programming languages, and controlling Puppeteer with JavaScript is a great place to start.

Should I choose Puppeteer or Selenium for web scraping?

The advantages of Puppeteer over Selenium are significant, and at the top of the list stands the fact that Puppeteer is faster than Selenium. If you are planning a high-scale scraping operation, that may be a point to consider.

In addition, Puppeteer has more options that are vital for scraping, such as network interception. This is another great advantage over Selenium that allows you to handle each network request and response that is generated while loading a page and log them, stop them, or generate more of them.

This allows you to intercept and stop requests for certain resources such as images, CSS, or JavaScript files, reducing your response time and the traffic used, both of which are very important when scraping on a large scale.

If you are planning on scraping for business purposes, whether to build a comparison website, a business intelligence tool, or anything else business-oriented, we would suggest using an API solution.

To sum it up, here are the main pros and cons of Selenium and Puppeteer for web scraping.

Pros of Selenium for web scraping:

  • Works with many programming languages.
  • It can be used with many different browsers and platforms.
  • Can manually record tests/small scrape operations.
  • It is powered by a great community.

Cons of Selenium for web scraping:

  • Slower than Puppeteer.
  • It gives you less control over the way the scraping is done and has fewer advanced features.

Pros of Puppeteer for web scraping:

  • Faster than other libraries.
  • It is easier to use: no need to install any external driver.
  • It gives you more control and allows more options, like network interception.

Cons of Puppeteer for web scraping:

  • The main drawback of Puppeteer is that it currently works only with JavaScript.

To conclude, since most of the advantages of Selenium over Puppeteer are mainly centered around testing, for scraping I would definitely recommend giving Puppeteer a try. Even at the price of learning JavaScript.

I know Puppeteer isn't for everyone: you might already be into Selenium, want to write a Python scraper, or just not like Puppeteer for any reason. That's OK, and in that case, let's see how to use Selenium for web scraping.

How is Selenium used for web scraping?

Web scraping using Selenium is rather straightforward. I don't want to go into a specific language (we will add more language-specific tutorials in the future), but the steps are very similar in every language. Feel free to use the official Selenium documentation for in-depth details. The first and most important step is to install the Selenium web driver component.

Install Selenium in Javascript:

npm install selenium-webdriver

Install Selenium in Python:


pip install selenium

Then you can start using the library according to the documentation.
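
For instance, a minimal Python sketch (the URL and the CSS selector below are placeholders) that navigates to a page and extracts some text:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/products")  # placeholder URL

    for title in driver.find_elements(By.CSS_SELECTOR, ".product-title"):  # placeholder selector
        print(title.text)

    driver.quit()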

3 Best practices for web scraping with Selenium

Scraping with Selenium is rather straightforward. First, you need to get the HTML of the div, component, or page you are scraping. This is done by navigating to that page using the web driver and then using a selector to extract the data you need. There are three key points you should notice though:

1. Use a good proxy server with IP rotation

This is the pitfall you are most likely to fall into if you are a programmer just starting your scraping journey: if you do not use a good web proxy service, you will get blocked. All modern sites, including Google, Amazon, Airbnb, and eBay, use advanced anti-scraping mechanisms. Some put more effort into it than others, but all of them will start blocking you very quickly if you don't use a proxy service and change your address every X requests. What is X? It depends on the website, but it usually varies between 5 and 20.

Once you have your proxy in place and have a rotation mechanism for the IPs you use, the number of times you will be blocked by websites will be reduced to around zero.
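
As an illustration, Chrome can be pointed at a proxy endpoint through a command-line switch; the address below is a placeholder, and the rotation itself is handled by your proxy provider or your own pool:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--proxy-server=http://my.rotating.proxy:8080")  # placeholder address
    driver = webdriver.Chrome(options=options)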

2. Find the sweet spot in your crawling rate

Crawling too fast is a big problem and will cause you to hit a lot of anti-scraping defenses on the websites you are scraping. Of course, you want the job to finish as fast as possible, but crawling too aggressively will get you blocked, while crawling too cautiously makes the whole process painfully slow.

Try to play with it: start with one request per second and increase from there. Make sure that for this test you are using a different IP address every time and that you are fetching different objects from the website you are testing.
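
One simple way to experiment with the rate is a randomized pause between page loads, for example:

    import random
    import time

    for url in urls:  # urls is whatever list of test pages you are using
        driver.get(url)
        # ... extract what you need here ...
        time.sleep(random.uniform(1.0, 3.0))  # widen or narrow this window while looking for the sweet spot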

3. Use user-agent rotation

When sending a request for a webpage or an object, the browser sends a header called User-Agent. It is necessary to change this every few requests, and always send a “legitimate” User-Agent (one that conceals the fact that this is a headless browser). Read more about User-Agents.
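
With Chrome, the User-Agent can be overridden through a command-line switch; here is a sketch (the UA string below is just an example of a common desktop value to rotate through):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")  # example value

    options = Options()
    options.add_argument(f"--user-agent={ua}")
    driver = webdriver.Chrome(options=options)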


In the next posts we will continue to discuss the challenges you may face when scraping online information, and some alternatives. You are also more than welcome to check our recent article about the differences between in-house scraping and a web scraping API.