This project contains a web scraper that can retrieve data from several job sites; here it is used to collect data from LinkedIn. It is based on the JobSpy Python package, adapted to this project's needs, and a rotating proxy was configured to prevent requests from being blocked.
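A minimal sketch of how the scraper can be invoked, assuming the adapted code keeps JobSpy's `scrape_jobs` interface; the search term, proxy value, and output path are illustrative, and parameter names may differ in the adapted version:

```python
# Illustrative only: assumes the vendored jobspy package still exposes scrape_jobs.
from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["linkedin"],
    search_term="data engineer",              # placeholder query
    results_wanted=50,
    proxies=["user:pass@proxy-host:8080"],    # rotating proxy endpoint, placeholder value
)

# scrape_jobs returns a pandas DataFrame; persist it to the data folder as CSV.
jobs.to_csv("data/linkedin_jobs.csv", index=False)
```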
The scraper job can be triggered by running the main file. The output CSVs are stored in the data folder and are then manually uploaded to the Databricks volume "/Volumes/tabular/dataexpert/fegvilela_jobs".
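As a hedged sketch, this upload step could also be done programmatically with the Databricks Python SDK; this is an assumption about tooling, not part of the current project, and the file name is illustrative:

```python
# Assumption: databricks-sdk is installed and credentials are configured
# via the environment or ~/.databrickscfg.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

with open("data/linkedin_jobs.csv", "rb") as f:   # illustrative file name
    w.files.upload(
        "/Volumes/tabular/dataexpert/fegvilela_jobs/linkedin_jobs.csv",
        f,
        overwrite=True,
    )
```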
A Databricks workflow is triggered by the arrival of new files in this volume.
The workflow loads the data incrementally into a bronze table, with schema enforcement and data quality checks, mainly around nullability.
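A minimal sketch of what the bronze load could look like in a PySpark notebook, assuming Auto Loader is used for the incremental ingestion; the column names, table name, and checkpoint path are assumptions, not the project's actual code:

```python
# Runs inside a Databricks notebook, where `spark` is already available.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

VOLUME_PATH = "/Volumes/tabular/dataexpert/fegvilela_jobs"

# Enforced schema instead of inference; field names and nullability are assumptions.
job_schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("title", StringType(), nullable=False),
    StructField("company", StringType(), nullable=True),
    StructField("location", StringType(), nullable=True),
    StructField("description", StringType(), nullable=True),
    StructField("date_posted", StringType(), nullable=True),
])

bronze_stream = (
    spark.readStream.format("cloudFiles")          # Auto Loader for incremental loads
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(job_schema)
    .load(VOLUME_PATH)
    # Basic nullability check: drop rows missing the key fields.
    .filter(F.col("id").isNotNull() & F.col("title").isNotNull())
)

(bronze_stream.writeStream
    .option("checkpointLocation", f"{VOLUME_PATH}/_checkpoints/bronze")  # illustrative path
    .trigger(availableNow=True)
    .toTable("bronze_jobs"))                                             # assumed table name
```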
After the bronze load, a staging silver table is created that deduplicates the data. This staging data is then used to build the silver table, which applies NLP techniques to each job description field to extract its most important tokens.
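A hedged sketch of the staging/silver step in PySpark; the table and column names are assumptions, and frequency-based token scoring is only one possible way to surface important tokens:

```python
# Illustrative only: deduplicate the bronze data, then tokenize job descriptions.
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer

# Staging silver: deduplicate on the job id (assumed key column).
staging = spark.table("bronze_jobs").dropDuplicates(["id"])

# Tokenize descriptions and remove stop words.
tokenizer = RegexTokenizer(inputCol="description", outputCol="tokens", pattern="\\W+")
remover = StopWordsRemover(inputCol="tokens", outputCol="clean_tokens")
tokenized = remover.transform(
    tokenizer.transform(staging.na.fill({"description": ""}))
)

# Count token frequencies across descriptions; the fitted vocabulary is ordered
# by corpus frequency, so its head gives the most common tokens overall.
cv = CountVectorizer(inputCol="clean_tokens", outputCol="token_counts", vocabSize=1000)
cv_model = cv.fit(tokenized)
top_tokens = cv_model.vocabulary[:20]

cv_model.transform(tokenized).write.mode("overwrite").saveAsTable("silver_jobs")  # assumed name
```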
The repository is organized as follows:
- jobspy folder: scraper code
- data folder: directory for the scraped data
- databricks folder: Databricks assets (workflow, notebooks, etc.)
For future development, the scraper should run on an AWS Lambda function, with the data saved to an S3 bucket or fed into a Kafka topic. This was the original approach, but unfortunately an AWS account was not available.
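If the S3 route is eventually pursued, the upload could look something like this minimal sketch; the bucket and key names are hypothetical:

```python
# Hypothetical future step: push the scraped CSV to S3 (e.g. from a Lambda handler).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="data/linkedin_jobs.csv",      # local scraper output
    Bucket="jobs-raw-bucket",               # hypothetical bucket name
    Key="linkedin/linkedin_jobs.csv",       # hypothetical object key
)
```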
Beyond that, the gold layer still needs to be developed, following the dimensional model described in the pre-capstone project report. A visualization layer is also important so the insights can be consumed and displayed more effectively.
