A WordCloud from a JobCloud, or a very short exercise in Web Scraping, Regular Expressions and Data Visualization.
When looking for open job positions in the German-speaking part of Switzerland, the go-to online source is the website jobs.ch. To play a bit with web scraping, text parsing and data visualization, I wanted to create a wordcloud based on the text of the open positions found for a given keyword in the search field of the website.
After looking at the HTML and CSS elements using the amazing Chrome Inspector, I found very quickly where the links needed to parse and retrieve the relevant text were located. That usually reflects a well-structured website, easy to inspect and debug. After writing down a simple strategy, it was just a matter of setting up the loops and transforming the strings.
Everything is done with the well-known Python modules urllib, BeautifulSoup, re, wordcloud and matplotlib, inside a JupyterLab notebook.
import bs4 as bs
import urllib.request
import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from tqdm import tqdm

The tqdm module is used to monitor the iterables, especially useful at the beginning, when trying to understand the efficiency of the approach. wordcloud is still a bit of a black box for me at this point, but it seems reliable, and for the sake of speed it will be used to quickly visualize the data.
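As a quick illustration, wrapping any iterable in tqdm is enough to get a live progress bar and iteration rate, which gives an immediate feel for how long a full scraping run will take (the loop body here is just a placeholder):

import time

for _ in tqdm(range(200)):  # one iteration per job link in the real loop
    time.sleep(0.01)  # placeholder for the actual request and parsing work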
The site to parse (or scrape) is:
The URL for a given search looks like:
Inside the results page, the link for an individual position is something like:
So at any given time, given a keyword or term to search for and a location, a list of links from which to retrieve the text can be generated automatically. At this point I have tried the code with one term only. The location can also be left blank, in which case the search simply returns everything for the given term, without filtering by location. I have tested the code for 10 pages with 20 results per page, which gives 200 parsed open job positions. This has worked very well so far, given that there are more than 200 positions for the term used.
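As a minimal sketch, the list of result-page URLs could be generated along these lines; the base URL path and the query-parameter names (term, location, page) are my assumptions about the site's URL scheme, not values taken from jobs.ch itself:

import urllib.parse

def build_search_urls(term, location='', pages=10):
    # Hypothetical URL scheme: the path and parameter names are assumptions
    # and would need to be checked against the real search URL
    base = 'https://www.jobs.ch/en/vacancies/'
    urls = []
    for page in range(1, pages + 1):
        query = urllib.parse.urlencode({'term': term, 'location': location, 'page': page})
        urls.append(base + '?' + query)
    return urls

search_urls = build_search_urls('data scientist')  # 10 pages x 20 results = 200 positions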
After building the BeautifulSoup object from the requested page, we search for the links in the class:
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for link in soup.find_all('a'):
    pos = link.get('href')
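Since only some of those anchors point to actual postings, a hedged refinement of the loop above could filter them by a substring; the '/vacancies/detail/' pattern is a hypothetical stand-in for the real link format shown earlier:

job_links = []
for link in soup.find_all('a'):
    pos = link.get('href')
    # keep only anchors pointing to an individual posting; the substring
    # below is an assumed pattern and would need to match the real links
    if pos and '/vacancies/detail/' in pos:
        job_links.append(pos)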
Interestingly enough, a very simple regular expression was needed to retrieve the text relevant to the position and ignore all the back-end stuff:

re.findall(r'kopieren.*Jobs —', page_text)  # page_text: the page content as one string

This means that everything between the word "kopieren" (which will most likely not be present in a job description) and "Jobs —" gives us the actual text of the job description. This assumption has held for all of the runs performed; however, it should be used carefully, as it is inferred from a superficial look at the CSS style of the page.
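Putting the pieces together for the individual positions, a minimal sketch of the extraction loop could look as follows; flattening each page with get_text() and passing re.DOTALL so that .* can span line breaks are my assumptions, not necessarily the original implementation:

descriptions = []
for job_url in tqdm(job_links):
    sauce = urllib.request.urlopen(job_url).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    page_text = soup.get_text()  # flatten the whole page into a single string
    # re.DOTALL lets .* match across newlines, so multi-line descriptions are captured
    matches = re.findall(r'kopieren.*Jobs —', page_text, re.DOTALL)
    if matches:
        descriptions.append(matches[0])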
As usual, stopwords need to be removed before creating the wordcloud. The file stopwords_de_plain.txt contains a small set of German and English words that catches most of the low-value words. Inside the wordcloud API an additional set of stopwords was defined and adjusted on the fly while plotting the different terms.
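A minimal sketch of this last step, assuming the scraped descriptions have been collected in a list as above; the extra stopwords here are placeholders for the ones that were adjusted on the fly:

with open('stopwords_de_plain.txt') as f:
    stopwords = set(f.read().split())
stopwords.update({'wir', 'sie', 'with', 'your'})  # placeholders for the on-the-fly additions

all_text = ' '.join(descriptions)
wc = WordCloud(stopwords=stopwords, background_color='white',
               width=800, height=400).generate(all_text)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()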
See the notebook
After playing around with several different keywords or search terms, here are a few interesting results, leaving the location blank and parsing the first 200 open positions found for each term:
- If the HTML or CSS structure changes, our parsing will probably no longer work. If that happens, we can catch the error very quickly and tweak the code accordingly.
- The code works well with 200 end links. One improvement would be to identify how many positions were found in total and incorporate that number into the loop, so the text for all of the positions is retrieved (see the sketch below). Loose search terms could easily lead to 1000+ results, which could become a challenge for the network and for computing performance... time for the yellow toy elephant and his friends to join the party?
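A hedged sketch of that improvement; the class name used to locate the total result count is hypothetical and would first have to be identified with the inspector:

import math

count_tag = soup.find('span', class_='search-result-count')  # hypothetical class name
total = int(re.sub(r'\D', '', count_tag.get_text())) if count_tag else 0
pages = math.ceil(total / 20)  # 20 results per page, as observed above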