Using Selenium to crawl data from JavaScript pages


In order to extract data from JavaScript (JS) web pages in Python, we have to use Selenium rather than standard libraries such as urllib or Beautiful Soup. Standard libraries usually send a request to a website and download the HTML code of the page, from which we can extract the data.

The problem with JS pages, though, is that their HTML code might not contain the data that we want to crawl. Instead, the HTML of a JS page includes pieces of JS code that dynamically load the data from the backend database into the page, but not the actual data itself. To deal with this problem we can use the Selenium library, which comes with a variety of useful functions, one of which allows us to extract data loaded by JS from the backend.
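To see this problem in action, consider the following minimal sketch, which uses only the standard library. The URL structure is illustrative, and booking.com may block plain HTTP clients, so treat this as an illustration rather than a guaranteed result: the HTML downloaded this way contains the JS code, but typically not the rendered property data.

import urllib.request

# a search-results URL of the kind the tutorial generates later
# (illustrative; booking.com may block plain HTTP clients)
url = "https://www.booking.com/searchresults.html?ss=Dublin"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="ignore")

# the property cards are injected by JS after the page loads, so searching
# the raw HTML for them typically comes up empty
print('data-testid="property-card"' in html)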

In addition to its ability to scrape data loaded from the backend, another advantage of Selenium is that it lets us control the browser, for example by clicking buttons or entering strings into search bars. For instance, in some cases you need to log into your user account in order to view the data, or you might want to conduct specific search requests.

In this post I will show how to crawl accommodation data from the website www.booking.com. In doing so, we will instruct our Selenium crawler to enter a location into the search bar, select a check-in date and specify the number of guests. Next, we will submit the search request by clicking on the search button and crawl all accommodation data on all result pages.

In order to build our Selenium crawling bot, we need to prepare two things: (1) we have to install the Selenium library in Python by executing pip3 install selenium in our command shell, and (2) we have to download a webdriver such as the chromedriver and place the downloaded .exe file in our working directory or somewhere else in our file structure. Note that the actual browser needs to be installed in addition to the webdriver. If you use Chrome as your browser, install the chromedriver from this source, but make sure that the version of the webdriver matches the version of your installed browser.
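Before we start coding the bot itself, it can be useful to verify that the webdriver starts and matches the installed browser. The following minimal sketch does this; the driver path is a placeholder for wherever you placed the file, and a version mismatch typically raises a SessionNotCreatedException.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# placeholder path; point this at the chromedriver you downloaded
service = Service("chromedriver")
driver = webdriver.Chrome(service=service)

print(driver.capabilities["browserVersion"])                 # installed Chrome version
print(driver.capabilities["chrome"]["chromedriverVersion"])  # chromedriver version

driver.quit()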

In the following example we will crawl accommodation data for all counties in Ireland from booking.com. For this, we create a .csv file that contains the list of all Irish counties. This .csv file will help us crawl all counties in one loop. The list of counties can be copied from here. The .csv file will be named Counties.csv, with the first line being the header:
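For illustration, the first few lines of the file look like this, one county per line (the remaining counties follow in the same way):

Counties
Carlow
Cavan
Clare
Cork
...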

Now, we can start coding our crawling bot.

First of all, we import the libraries necessary for the following crawling bot.

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC 

In the next step, we assign our parameters to a couple of variables that will be used by the crawling bot and that are necessary to enter the search queries. Moreover, we define the path to the chromedriver and load the counties into the program using pandas.

url = "https://www.booking.com/"
checkin_month = "August 2022"
checkin_day = "26, Friday"
# despite its name, this variable holds a Service object wrapping the chromedriver location
chrome_driver_path = Service("C:/Users/Documents/Python/Booking_Crawling/chromedriver")
df_counties = pd.read_csv('Counties.csv', usecols=['Counties'])

Now that we have prepared our variables and loaded the counties into the program, we can loop over the counties. Note that the next few code blocks all live inside this outer loop, but before it starts we create a list in which the data will be stored during the loop. At the top of each iteration, a try/except block closes the webdriver that is still open from the previous iteration. During the first iteration this block will simply print an error message, because we have not initiated the webdriver yet. After that, we initiate the webdriver and navigate it to the booking.com website.

df_properties = []

for row in df_counties.iterrows():
        county_search = row[1]['Counties']

        try:
                driver.close()
        except Exception as e:
                print(e)

        try:
                driver = webdriver.Chrome(service=chrome_driver_path)
                driver.get(url)

In the next step, we tell Selenium to wait until the search bar of booking.com appears and then enter the county into it.

                search_entry = wait(driver,20).until(EC.presence_of_element_located((By.ID, "ss")))
                search_entry.send_keys(county_search)

Next, the webdriver is instructed to find the date picker and to send the month and day to the date picker element. In this case we use a CSS selector to find the web element. For this, just go to booking.com, right-click on the page and select “Inspect”, which allows us to view the HTML code of the page. Then we search for the web element in the HTML code and, as soon as we have found it, we right-click on the specific HTML code and copy the selector, a process that is repeated for all the other web elements.

After we have entered the county into the search bar, we specify the check-in date. In this case, we do not need a check-out date because we want to crawl data for only one night.

                Datebox_checkin_month = driver.find_element(By.CSS_SELECTOR,
                        ".div:nth-child(1) select:nth-child(2)")
                Datebox_checkin_month.send_keys(checkin_month)
                Datebox_checkin_day = driver.find_element(By.CSS_SELECTOR,
                        "select:nth-child(2)")
                Datebox_checkin_day.send_keys(checkin_day)
                age_child1_input = wait(driver,20).until(EC.presence_of_element_located((By.CSS_SELECTOR,
                        ".sb-group__children__field > select:nth-child(1)")))
                age_child1_input.send_keys("10")

In the next step, we define the number of persons. In this case, we want accommodation data for two adults and two kids. Since two adults are predefined by booking.com, we only need to specify the number of kids. Booking.com also asks us to specify the age of the kids. Finally, we tell the crawling bot to click the submit button and to wait for 3 seconds until the search results are loaded.

                Number_of_guests = driver.find_element(By.CSS_SELECTOR,
                        ".xp__guests")
                Number_of_guests.click()

                add_child_button = wait(driver,20).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR,
                                "div.sb-group__field:nth-child(2) > div:nth-child(1) > div:nth-child(2) > button:nth-child(4)")))
                add_child_button.click()
                add_child_button.click()

                age_child2_input = wait(driver,20).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR,
                                ".sb-group__children__field > select:nth-child(2)")))
                age_child2_input.send_keys("10")

                Submit_button = wait(driver,20).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, ".sb-searchbox__button")))
                Submit_button.click()

                time.sleep(3)

As soon as the search request has been submitted and the data have been loaded, we can start crawling the data and assign it to the variable results. Note that we use presence_of_all_elements_located here because we want the full list of property cards rather than just the first one, so that we can iterate over it below.

                results = wait(driver,20).until(
                        EC.presence_of_all_elements_located((By.CSS_SELECTOR,
                                '[data-testid="property-card"]')))

In the next step, we start an inner loop, in which we parse the data and append them to the df_properties list we created above.

                for prop in results:
                        propertyArr = prop.text.split("\n")
                        print(propertyArr)
                        df_properties.append(propertyArr)

Next, we start an inner while loop in order to crawl the data of the following result pages. The while loop allows us to escape in cases where there is no next page: we check whether the “Next Page” button is disabled. If it is disabled, we break out of the loop and continue with the next county in the outer loop. If it is enabled, we click it, crawl the data and append the data to the df_properties list.

                while True:
                        time.sleep(3)
                        if not (driver.find_element(By.CSS_SELECTOR,
                                "[aria-label='Next page']").is_enabled()):
                                break
                        elm = driver.find_element(By.CSS_SELECTOR, "[aria-label='Next page']")
                        elm.click()
                        time.sleep(3)
                        results = driver.find_elements(By.CSS_SELECTOR,
                                '[data-testid="property-card"]')
                        print(results)
                        for prop in results:
                                propertyArr = prop.text.split("\n")
                                print(propertyArr)
                                df_properties.append(propertyArr)
                                       
        except Exception as e:
                print(e)

Finally, we use pandas to transform the list into a dataframe and save the dataframe in our working directory. Note that at this point we have left not only the inner loop but also the outer loop.

df_mydata = pd.DataFrame(df_properties)
df_mydata.to_csv(r'2_Adults_2_Kids_Booking_Data.csv', header=False, index=False)

Selenium is a very practical library for extracting data from JS web pages and for controlling a browser. The ability to control a browser allows us to build different web tools, one of which could be a chatbot for YouTube live chats. You can find the full code in my GitHub repository.
