Fernanda Vilela Capstone Project

This project contains a web scraper that can retrieve data from several job sites; here it is used to get data from LinkedIn. It is based on the JobSpy Python package, adapted for this project's needs. A rotating proxy is also configured to prevent requests from being blocked.
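A minimal sketch of how the scraper can be invoked, assuming the upstream JobSpy scrape_jobs interface (the adapted code in the jobspy folder may differ); the proxy endpoints and search parameters below are placeholders, not values from this repo:

```python
# Sketch based on the upstream JobSpy API; the adapted code in this repo may differ.
from jobspy import scrape_jobs

# Rotating proxies help avoid LinkedIn blocking repeated requests.
# These endpoints are placeholders, not real proxy servers.
proxies = [
    "user:pass@proxy1.example.com:8080",
    "user:pass@proxy2.example.com:8080",
]

jobs = scrape_jobs(
    site_name=["linkedin"],       # only LinkedIn is scraped in this project
    search_term="data engineer",  # illustrative search term
    location="Brazil",            # illustrative location
    results_wanted=100,
    proxies=proxies,
)

# scrape_jobs returns a pandas DataFrame; persist it as CSV in the data folder,
# matching the project layout described below.
jobs.to_csv("data/linkedin_jobs.csv", index=False)
```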

The scraper job is triggered by running the main file. The output CSVs are stored in the data folder and manually uploaded to the Databricks volume "/Volumes/tabular/dataexpert/fegvilela_jobs". A Databricks workflow is triggered by new file arrival in this volume: it loads the data incrementally into a bronze table with schema enforcement and data quality checks, mainly around nullability. After the load, a staging silver table is created that deduplicates the data. That data is then used to build the silver table, which applies NLP techniques to each job description field to extract the most important tokens.
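A hedged PySpark sketch of what the Databricks notebooks might do, assuming Auto Loader handles the file-arrival ingestion and that spark is the ambient Databricks session; the schema, table names, checkpoint path, and token logic are illustrative, not taken from the repo:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Illustrative schema; the real bronze schema lives in the Databricks notebooks.
schema = StructType([
    StructField("job_id", StringType(), nullable=False),
    StructField("title", StringType(), nullable=False),
    StructField("company", StringType(), nullable=True),
    StructField("description", StringType(), nullable=True),
    StructField("date_posted", StringType(), nullable=True),
])

# Bronze: incremental load with Auto Loader, schema enforcement, and a
# nullability check on the key columns.
bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(schema)
    .load("/Volumes/tabular/dataexpert/fegvilela_jobs")
    .filter(F.col("job_id").isNotNull() & F.col("title").isNotNull())
)

(bronze.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/bronze_jobs")  # illustrative path
    .trigger(availableNow=True)
    .toTable("bronze_jobs"))

# Staging silver: deduplicate the bronze data.
staging = spark.table("bronze_jobs").dropDuplicates(["job_id"])
staging.write.mode("overwrite").saveAsTable("silver_jobs_staging")

# Silver: a simple token pass over the description field; the real notebook
# may use a different NLP approach to rank the most important tokens.
tokens = (
    staging
    .withColumn("token", F.explode(F.split(F.lower("description"), "\\W+")))
    .filter(F.length("token") > 2)
    .groupBy("job_id", "token")
    .count()
)
tokens.write.mode("overwrite").saveAsTable("silver_jobs_tokens")
```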

Organization

  • jobspy folder: scraper code
  • data folder: scraped data (CSV outputs)
  • databricks folder: Databricks assets (workflow, notebooks, etc.)

Considerations

For future development, the scraper should be executed on AWS Lambda, and the data could be saved to an S3 bucket or fed into a Kafka topic. This was the original approach, but unfortunately an AWS account wasn't available.
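If that approach is picked up, a Lambda handler could look roughly like the sketch below; the bucket name, object key, and event shape are hypothetical, since none of this AWS infrastructure exists yet:

```python
import boto3

BUCKET = "fegvilela-capstone-jobs"  # hypothetical bucket name

def handler(event, context):
    # A sketch of the future-work idea only: in practice the scraper itself
    # would run inside this Lambda and produce the CSV before the upload.
    csv_path = event.get("csv_path", "/tmp/linkedin_jobs.csv")
    s3 = boto3.client("s3")
    s3.upload_file(csv_path, BUCKET, "raw/linkedin_jobs.csv")
    return {"uploaded": csv_path}
```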

Beyond that, the gold layer still has to be developed, using the dimensional model presented in the pre-capstone project report. A visualization layer is also important so the insights can be better consumed and displayed.

[Figure: Databricks workflow]
