Web scraping dynamic content created by JavaScript with Python

Scraping websites that contain dynamic content created by JavaScript sounds easier than it is. This is partly because browser technology is constantly evolving, which forces web scraping libraries to change with it. As a result, many articles on the topic reference deprecated libraries like PhantomJS and dryscrape, which makes it difficult to find up-to-date information.

In this article we will show you how to scrape dynamic content with Python and Selenium in headless mode. Selenium is a browser automation tool that is often used for web scraping. Unlike an HTML parser such as BeautifulSoup, it drives a real browser and can therefore handle content that is loaded by JavaScript.
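
To see why a plain HTML parser is not enough, here is a minimal sketch that uses only Python's standard library. The markup in STATIC_HTML is made up for illustration (it is an assumption, not the real tsviewer.com source): the container div is present in the downloaded page, but its content would only appear after a browser runs the script.

```python
from html.parser import HTMLParser

# Hypothetical page source as a plain HTTP client would receive it:
# the container exists, but a script is expected to fill it later.
STATIC_HTML = """
<html><body>
  <div id="ts3viewer_1111111"></div>
  <script>/* would fill the div above once it runs in a browser */</script>
</body></html>
"""

class DivTextCollector(HTMLParser):
    """Collects the text found inside the <div> with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.div_depth = 0  # number of open <div> tags, counted from the target div
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.div_depth:
            self.div_depth += 1  # nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.div_depth = 1   # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1

    def handle_data(self, data):
        if self.div_depth and data.strip():
            self.text.append(data.strip())

def text_inside(html, target_id):
    """Return the list of text fragments inside the div with id target_id."""
    parser = DivTextCollector(target_id)
    parser.feed(html)
    return parser.text

# The parser never executes the script, so the div stays empty.
print(text_inside(STATIC_HTML, "ts3viewer_1111111"))
```

A parser only sees what the server sent. A browser driven by Selenium executes the script first, so the same lookup would find the rendered content.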

To make things more exciting, we will work through an example with a real-life use case: sending a notification to your Android or iOS device when certain TeamSpeak users enter or leave a given TeamSpeak server.

First make sure to install Selenium and the Simplepush library.

sudo pip3 install selenium
sudo pip3 install simplepush

Then we need to make sure that ChromeDriver is installed.

On Ubuntu or Raspbian:

sudo apt install chromium-chromedriver

On Debian:

sudo apt install chromium-driver

On macOS:

brew install --cask chromedriver
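
Before moving on, it can help to confirm that the driver binary is actually reachable. The following is a small sketch that assumes the binary is called chromedriver and is expected on your PATH. The helper name find_driver is ours and not part of Selenium.

```python
import shutil

def find_driver(name="chromedriver"):
    """Return the full path of the driver binary, or None if it is not on PATH."""
    return shutil.which(name)

if __name__ == "__main__":
    path = find_driver()
    print(path if path else "chromedriver not found on PATH")
```

If nothing is found, the package may have installed the binary in a directory that is not on your PATH.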

Now we can start coding.

#!/usr/bin/python3
from multiprocessing import Process, Manager
from requests.exceptions import ConnectionError
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from simplepush import send
import time

# Get this URL from the tsviewer.com search
TSVIEWER_URL = "https://www.tsviewer.com/index.php?page=ts_viewer&ID=1111111"
# The element ID is "ts3viewer_" followed by the numeric ID at the end of TSVIEWER_URL
TSVIEWER_ID = "ts3viewer_1111111"
# You will immediately get your personal Simplepush key after installing the Simplepush app
SIMPLEPUSH_KEY = "YourSimplepushKey"
# The usernames of your friends you want to be notified about
FRIENDS = ["TeamSpeakUser"]

def update(friends_online):
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(TSVIEWER_URL)
        # Wait until Javascript loaded a div where the id is TSVIEWER_ID
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, TSVIEWER_ID)))
        html = driver.page_source

        # This check unfortunately seems to be necessary since sometimes WebDriverWait doesn't do its job
        if TSVIEWER_ID in html:
            for friend in FRIENDS:
                send_notification_on_change(friend, f"{friend} entered the server", f"{friend} left the server", html, friends_online)
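    except Exception as error:
        # Report what went wrong instead of hiding it; the next run will retry
        print(f"Error: {error}")
    finally:
        # quit() closes every window and shuts down the driver process
        driver.quit()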

def send_notification_on_change(name, message_join, message_leave, html, friends_online):
    name = name.lower()
    html = html.lower()

    if name in html and name not in friends_online:
        try:
            friends_online.append(name)
            send(SIMPLEPUSH_KEY, "A friend joined", message_join)
        except ConnectionError:
            friends_online.remove(name)

    if name in friends_online and name not in html:
        try:
            friends_online.remove(name)
            send(SIMPLEPUSH_KEY, "A friend left",  message_leave)
        except ConnectionError:
            friends_online.append(name)

if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    # Run Chrome without opening a visible window
    options.add_argument("--headless")

    manager = Manager()
    friends_online = manager.list()

    try:
        while True:
            p = Process(target=update, args=[friends_online])
            p.start()
            # make the main process wait for `update` to end
            p.join()
            # all memory used by the subprocess will be freed to the OS
            time.sleep(5)
    except (KeyboardInterrupt, SystemExit):
        print("Stopped")
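
The WebDriverWait call in update works by polling: it evaluates the condition repeatedly (every half second by default) until the condition returns a truthy value or the timeout expires, in which case it raises a TimeoutException. The sketch below shows the same idea in plain Python. The helper wait_until is hypothetical and not part of Selenium, and it raises the built-in TimeoutError instead.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Usage: a stand-in condition that only becomes true on the third call
attempts = []

def ready_on_third_call():
    attempts.append(1)
    return len(attempts) >= 3

wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01)
print(len(attempts))  # 3
```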

Did you notice how we use the multiprocessing library to start Selenium in its own process? Without this, our program could eventually run out of memory, since Python has difficulties reclaiming the memory of WebDriver instances that are no longer used. By running each instance inside its own process, we make sure that all of its memory is released back to the OS once the process finishes.
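
Stripped of Selenium, the pattern looks roughly like the sketch below. The function run_once is a stand-in for update and only appends a placeholder name, and run_isolated is a helper name we made up. The Manager list is what lets the parent see changes made by the child.

```python
from multiprocessing import Manager, Process

def run_once(shared_names):
    """Stand-in for update(): do some work and record the result."""
    # In the real program, the WebDriver would be created and quit here.
    shared_names.append("TeamSpeakUser")

def run_isolated(shared_names):
    """Run run_once in a child process and wait for it to finish."""
    worker = Process(target=run_once, args=(shared_names,))
    worker.start()
    worker.join()           # the child's memory goes back to the OS when it exits
    return worker.exitcode  # 0 means the child ended without an uncaught error

if __name__ == "__main__":
    with Manager() as manager:
        names = manager.list()  # proxy object that both processes can modify
        exit_code = run_isolated(names)
        print(exit_code, list(names))
```

A regular Python list would not work here, because the child process would only modify its own copy.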

Now if you run our little program, it will check tsviewer.com roughly every five seconds to see whether one of our friends joined or left the server (as defined by TSVIEWER_URL and TSVIEWER_ID). If so, it sends a notification to the Simplepush key defined by SIMPLEPUSH_KEY.
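
The join and leave detection can also be looked at in isolation. The sketch below restates the logic of send_notification_on_change as a pure function without the network call. The name presence_changes and the page snippets are made up for illustration.

```python
def presence_changes(html, friends, online):
    """Return (joined, left) for the given page source and last known online names."""
    page = html.lower()
    joined, left = [], []
    for friend in friends:
        name = friend.lower()
        present = name in page  # plain substring match, as in the program above
        if present and name not in online:
            joined.append(name)
        elif not present and name in online:
            left.append(name)
    return joined, left

# Made-up page snippets, only for illustration
print(presence_changes("<div>TeamSpeakUser</div>", ["TeamSpeakUser"], []))
print(presence_changes("<div></div>", ["TeamSpeakUser"], ["teamspeakuser"]))
```

Because this is a substring match on the whole page, a short name such as "Bob" would also match a user called "Bobby", so it is safer to use names that are unlikely to appear elsewhere on the page.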
