Web scraping dynamic content created by JavaScript with Python
Scraping websites that contain dynamic content created by JavaScript sounds easier than it is.
This is partly because browser technology is constantly evolving, which forces web scraping libraries to change with it.
As a result, many articles written about the topic reference deprecated libraries like PhantomJS and dryscrape, which makes it difficult to find up-to-date information.
In this article we will show you how to scrape dynamic content with Python and Selenium in headless mode.
Unlike BeautifulSoup, which only parses static HTML, Selenium drives a real browser and can therefore handle content that is rendered by JavaScript.
To make things more exciting, we will work through a real-life use case: sending a notification to your Android or iOS device whenever certain TeamSpeak users enter or leave a given TeamSpeak server.
First, make sure to install Selenium and the Simplepush library, for example with pip:
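```bash
pip install selenium simplepush
```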
Then we need to make sure ChromeDriver is installed. The package name varies a bit by platform and release, but the following commands typically do the trick.
On Ubuntu or Raspbian:
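```bash
sudo apt-get install chromium-chromedriver
```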
On Debian:
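```bash
sudo apt-get install chromium-driver
```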
On macOS (with Homebrew):
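```bash
brew install --cask chromedriver
```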
Now we can start coding.
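Below is a minimal sketch of such a script. The CSS selector used to read the nicknames out of the TSViewer widget is an assumption about tsviewer.com's markup and will likely need to be adapted, the configuration values and the FRIENDS set are placeholders, and the notification call assumes the simplepush library's send(key, title, message) form.

```python
import time
from multiprocessing import Process, Queue

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from simplepush import send

# Replace these placeholders with your own values.
SIMPLEPUSH_KEY = 'YourSimplepushKey'
TSVIEWER_ID = '1234567'
TSVIEWER_URL = 'https://www.tsviewer.com/index.php?page=ts_viewer&ID=' + TSVIEWER_ID
# TeamSpeak nicknames we want to be notified about.
FRIENDS = {'Alice', 'Bob'}


def fetch_online_users(queue):
    """Render the TSViewer page in headless Chrome and put the set of online nicknames on the queue."""
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(TSVIEWER_URL)
        time.sleep(2)  # give the JavaScript viewer time to render
        # Assumed selector for the client names in the viewer widget; adjust it to the real markup.
        elements = driver.find_elements(By.CSS_SELECTOR, '.ts3_viewer_client .name')
        queue.put({element.text.strip() for element in elements if element.text.strip()})
    finally:
        driver.quit()


def main():
    previously_online = set()
    while True:
        queue = Queue()
        # Run Selenium in its own process so all of its memory is released once it exits.
        process = Process(target=fetch_online_users, args=(queue,))
        process.start()
        process.join()
        if not queue.empty():
            online = queue.get()
            # Simplepush call assumed to be send(key, title, message).
            for name in (online - previously_online) & FRIENDS:
                send(SIMPLEPUSH_KEY, 'TeamSpeak', name + ' joined the server')
            for name in (previously_online - online) & FRIENDS:
                send(SIMPLEPUSH_KEY, 'TeamSpeak', name + ' left the server')
            previously_online = online
        time.sleep(5)


if __name__ == '__main__':
    main()
```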
Did you notice how we use the multiprocessing library to start Selenium in its own process?
We do this because our program could otherwise run out of memory, since Python has difficulties collecting unused WebDriver instances.
By running each WebDriver inside its own process, we make sure that all memory is released back to the OS once that process finishes.
If you now run our little program, it will check tsviewer.com every five seconds to see whether one of our friends has joined or left the server (as defined by TSVIEWER_URL and TSVIEWER_ID).
If that is the case, it sends a notification to the Simplepush key defined by SIMPLEPUSH_KEY.