Well, if like me, you are trying to find a vaccine for your parents, then here is the place to look. Yes, you can click relentlessly on a website but that sure does seem like a waste of time, so while I’m doing that, the other alternative is to learn a little bit about computers at the same time. I’ve always taken every opportunity to expand technical horizons while toiling in the world.
(And yes, in the course of doing this, we did finally get an appointment the good old-fashioned way, by spamming the mouse and typing faster than everyone else to get a same-day appointment. YMMV. Spamming is great, but code lives forever.)
So here are some notes on how to do this and see this Deepnote project for experimentation with it:
- Safari Develop menu, Show Web Inspector. Turn on developer mode and you can use this (or the Chrome equivalent) to see what is being generated. The more sophisticated the site, the more tools you need.
- Deepnote. The first question is where to do this kind of experimentation. In the old days, like last year, I would have just run Python on my local machine, and then I would have used Google Colab. Right now I’m experimenting with Deepnote. It also runs Jupyter notebooks online, but the big advantage is keystroke-by-keystroke sharing that works just like Google Workspace.
So those are the two tools. Now, what library should you use? Well, it depends on the site and what it has in it, but tl;dr: use Beautiful Soup for static pages and Selenium with headless Chrome for dynamic pages that are generated with JavaScript.
- Beautiful Soup for static HTML. My first attempt was with Python because, well, it’s a simpler syntax, at least for me, and all roads lead to Beautiful Soup, which is a Python library that makes it easy to scrape static HTML. The classic use is for a page that has sports scores. You can navigate the page and search for web links. The main problem is that it only parses a single page that you already have, so you normally combine it with the requests package to fetch each page.
- Selenium to enter data and walk through pages. The main issue with the Beautiful Soup/requests combination is that it is really only for reading data; it doesn’t help when you want to automate entering data. This is what Selenium is for. It starts a browser in the background, and then you can do things like send_keys and hit Enter to walk through a website. While it is used mainly for testing websites, it also works for scraping static HTML, especially when you use XPath syntax to dive deep into the hierarchical structure of a page. There are some tricks, but you can get Selenium to run in Colab or Deepnote.
- Selenium execute_script to run a single script and check results. The main problem is that most modern websites do not just lay things out statically. Yes, if the site comes from a static site generator, the approaches above work, but most sites generate their HTML dynamically from JavaScript. So one way to handle this is to use Selenium to execute specific JavaScript with the execute_script command. This means that you are really deep in a page trying to figure out what to do. This is nice for debugging and automated testing.
- Selenium plus Beautiful Soup to browse and then find static HTML. Here the final page is generated so you can use Selenium to get the page and then Beautiful Soup to find the tables. You can just pass the HTML that Selenium finds to Beautiful Soup to search.
- Selenium only. Or better yet, just use Selenium and its various find_element methods.
- Selenium also works on dynamic HTML where JavaScript modifies the DOM and fills in data with Ajax or WebSockets. OK, the most complex case is that some JavaScript actually generates the HTML, and it is very hard to pick this up. You used to load a headless PhantomJS, apparently, which emulates all this and gives you the real static HTML, but today you just load headless Chrome, which runs in the background, and then do dynamic searches with XPath to find things.
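To make the static case concrete, here is a minimal sketch of Beautiful Soup pulling table rows and links out of a page. The HTML string is a made-up stand-in (the table id, team names, and URLs are all hypothetical). In practice the HTML would come from requests.get(url).text or, for a JavaScript-heavy page, from Selenium’s driver.page_source.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; a real one would come from
# requests.get(url).text or from Selenium's driver.page_source.
SAMPLE_HTML = """
<html><body>
  <table id="scores">
    <tr><th>Team</th><th>Score</th></tr>
    <tr><td>Home</td><td>3</td></tr>
    <tr><td>Away</td><td>1</td></tr>
  </table>
  <a href="/schedule">Schedule</a>
  <a href="https://example.com/standings">Standings</a>
</body></html>
"""


def parse_scores(html):
    """Return (rows, links) scraped from a static HTML string."""
    soup = BeautifulSoup(html, "html.parser")

    # Each data row becomes a list of cell strings. The header row has
    # only <th> cells, so it yields an empty list and is skipped.
    rows = []
    for tr in soup.find("table", id="scores").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

    # Every anchor that has an href, as (link text, target) pairs.
    links = [(a.get_text(strip=True), a["href"])
             for a in soup.find_all("a", href=True)]
    return rows, links


rows, links = parse_scores(SAMPLE_HTML)
print(rows)
print(links)
```

The same parse_scores function works no matter where the HTML string came from, which is why the Selenium plus Beautiful Soup combination is easy to wire up.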
Real world Example: WA DOH COVID Vaccine locations
Every site is different, so here is a real example. The Washington State COVID vaccine page is really complicated, with a Google Map and then all kinds of scripts. This is where you have to be patient because you first need to figure out the structure of the page.
So the trick is to open up the Show Page Source mode of your browser and work your way down. You need to click on each element and wait for the data you want to be shaded in the top window. In this case, it is really unclear how it works because there is a Google Map and there is a dynamic section that you can open and close by county.
But by carefully working your way through each part, eventually you can see the hierarchy. In this case, the real data is buried literally 8 layers down, so here are the levels. The main surprise is that the real data is in a form, of all places:
- html
- body id=Body
- form id=Form class=mm-page.mm-slideout
- div class=body_bg.full
- div class=dnn_wrapper
- div class=wrapper
- div id=Breadcrumb_style_4 class=Breaddcrumb_bg
- div id=dnn_content
- main id=MainContent
So this at least gets us down into something that looks like the actual data. We’re not done yet though, so let’s keep moving down. It shows why patience is needed with all this autogenerated HTML. As an aside, each div allows another kind of formatting option. But we are getting close, and you can see the layout happening as we get to the content itself:
- div class="dnn_layout"
- div class="pane_layout"
- div class="col-sm-12"
- div id="dnn_ContentPane" class="ContentPane"
- div class="DnnModule DnnModule-DNNUserDefinedTable DnnModule-34979"
- div class="White"
- div id="dnn_ctr34979_ContentPane" class="content pane"
- div id="dnn_ctr34979_ModuleContent"
- div id="dnn_ctr34979_DefaultPlaceHolderControl"
- div class="dnnForm dnnClear"
- div id="accordion34979" role="tablist" class="panel-group accordian_6"
So this is the pay dirt: the thing called tablist is where the data lives. It takes real patience to do scraping well, but here you see a series of div class=panel panel-default elements, and so the scraper can search for these.
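Rather than clicking down through every wrapper by hand, one option is to let the parser jump straight to the panels and then report how deep they are. Here is a rough sketch using Beautiful Soup on a heavily trimmed, hypothetical stand-in for the page (only a few of the real wrapper elements are kept, and the describe helper is my own invention):

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for the real page.
SAMPLE_HTML = """
<html><body id="Body"><form id="Form">
  <main id="MainContent">
    <div id="accordion34979" role="tablist" class="panel-group">
      <div class="panel panel-default"><h2 class="panel-title">King County</h2></div>
      <div class="panel panel-default"><h2 class="panel-title">Snohomish County</h2></div>
    </div>
  </main>
</form></body></html>
"""


def describe(tag):
    """Render a tag as name#id, or just name if it has no id."""
    tag_id = tag.get("id")
    return f"{tag.name}#{tag_id}" if tag_id else tag.name


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The CSS selector skips the wrapper divs: any panel under the tablist.
panels = soup.select('div[role="tablist"] div.panel.panel-default')

# Walk up from the first panel to the root to recover the nesting.
# .parents yields the closest ancestor first and ends with the soup
# object itself (named "[document]"), so drop that and reverse the rest.
ancestors = [describe(p) for p in panels[0].parents if p.name != "[document]"]
path = " > ".join(reversed(ancestors))
print(len(panels), path)
```

On the real page the printed path would be much longer, but the selector would not need to change as long as the tablist and panel classes stay the same.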
I think you can see why scraping is so difficult and how much it depends on the structure of the HTML. But the name of the county, for instance, is in:
- div class="panel panel-default"
- div class="panel-heading"
- h2 class="panel-title"
And in a link you can see “King County”, but the pay dirt is finding where it points. You see a reference to pnlKing, and indeed just below this is:
- div id="pnlKing" class="panel-collapse collapse in"
- div class="panel-body"
- div data-search-content
And so a search system just needs the names of all the counties with a prefix of pnl, so “pnlKing, pnlSnohomish, pnlJefferson, …”, and can search for those ids to get the data and then go and parse all the data-search-content elements.
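The lookup described above can be sketched roughly like this with Beautiful Soup. The HTML fragment is hypothetical: it copies the panel structure from the notes, but the provider names are invented, and the href="#pnlKing" form of the link is my assumption about how the reference to pnlKing appears. The locations_by_county function is likewise just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the structure in the notes;
# the provider names and the href values are invented.
SAMPLE_HTML = """
<div class="panel panel-default">
  <div class="panel-heading">
    <h2 class="panel-title"><a href="#pnlKing">King County</a></h2>
  </div>
  <div id="pnlKing" class="panel-collapse collapse in">
    <div class="panel-body">
      <div data-search-content>Example Pharmacy, Seattle</div>
      <div data-search-content>Example Clinic, Bellevue</div>
    </div>
  </div>
</div>
<div class="panel panel-default">
  <div class="panel-heading">
    <h2 class="panel-title"><a href="#pnlSnohomish">Snohomish County</a></h2>
  </div>
  <div id="pnlSnohomish" class="panel-collapse collapse">
    <div class="panel-body">
      <div data-search-content>Example Hospital, Everett</div>
    </div>
  </div>
</div>
"""


def locations_by_county(html, counties):
    """Map each county name to the text of its data-search-content blocks."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for county in counties:
        panel = soup.find("div", id=f"pnl{county}")  # e.g. pnlKing
        if panel is None:
            results[county] = []  # no panel with that id on the page
            continue
        entries = panel.select("div[data-search-content]")
        results[county] = [e.get_text(" ", strip=True) for e in entries]
    return results


found = locations_by_county(SAMPLE_HTML, ["King", "Snohomish", "Jefferson"])
print(found)
```

One caveat: I have not checked how the site builds the id for county names that contain a space, so the simple pnl plus name rule may need adjusting for those.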
So I think you can see that building a parser for a website is waaaay more complicated than a simple API call. In this case, it sure would be nice just to have a single call to get the vaccine locations data rather than doing this.
Note that Selenium makes this a little easier with XPath expressions, which let you do a hierarchical search through a page so that, for example, you can find the lowest element and then finally extract the actual text string.
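As a rough illustration of the XPath approach, here is a sketch that builds one expression per county and runs it in headless Chrome. It assumes Selenium 4 and a recent Chrome are installed, and that the panel ids follow the pnl prefix pattern noted earlier. The function names county_xpath and scrape_county are mine, not part of Selenium.

```python
def county_xpath(county):
    """XPath for every data-search-content div inside one county panel."""
    return f"//div[@id='pnl{county}']//div[@data-search-content]"


def scrape_county(url, county):
    """Load url in headless Chrome and return the text of each entry."""
    # Imported here so county_xpath() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    options.add_argument("--no-sandbox")    # often needed in hosted notebooks
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        elements = driver.find_elements(By.XPATH, county_xpath(county))
        # textContent also returns text inside collapsed (hidden) panels,
        # where the .text property would come back empty.
        return [el.get_attribute("textContent").strip() for el in elements]
    finally:
        driver.quit()  # always shut the browser down


print(county_xpath("King"))
```

Calling scrape_county(page_url, "King"), with page_url set to the address of the vaccine locations page, would then return one string per location. If the page fills in its data after loading, a WebDriverWait before the find_elements call may be needed.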
The most aggravating thing was that I couldn’t figure out how to get the actual hyperlink text. I can see the href with a get_attribute(‘href’), but .text reveals nothing. It turns out the trick is that you need to do a get_attribute(‘innerHTML’) and not a .text, and this doesn’t seem to be well documented. (A likely reason is that .text only returns text that is actually visible on the page, so a link inside a collapsed panel comes back empty.) So the correct way to get the hyperlink and the text is:
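# Assumes Selenium 4 with Chrome installed (older releases used find_element_by_tag_name)
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# As a random example use this page
driver.get("https://www.tutorialpoint.com/index.html")
# pick up the first link on the page
link = driver.find_element(By.TAG_NAME, 'a')
href = link.get_attribute('href')       # the URL the link points to
text = link.get_attribute('innerHTML')  # the link's inner HTML, even when .text is empty
driver.quit()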
Net, net, you can do it and I’m glad I spent a day learning it, but I sure have a lot of respect for people who do this work for a living!