This project contains a web scraper that can retrieve data from several job sites; here it is used to collect data from LinkedIn. It is based on the JobSpy Python package, adapted to this project's needs, and a rotating proxy was configured to prevent requests from being blocked.
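A minimal sketch of how the scraper can be invoked, assuming the adapted code keeps JobSpy's `scrape_jobs` interface; the search term, proxy value, and output path are illustrative, and parameter names may differ in the adapted version:

```python
# Illustrative only: assumes the vendored jobspy package still exposes scrape_jobs.
from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["linkedin"],
    search_term="data engineer",              # placeholder query
    results_wanted=50,
    proxies=["user:pass@proxy-host:8080"],    # rotating proxy endpoint, placeholder value
)

# scrape_jobs returns a pandas DataFrame; persist it to the data folder as CSV.
jobs.to_csv("data/linkedin_jobs.csv", index=False)
```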
The scraper job can be triggered by running the main file. The output CSVs are stored in the data folder and are then manually uploaded to the Databricks volume "/Volumes/tabular/dataexpert/fegvilela_jobs".
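As a hedged sketch, this upload step could also be done programmatically with the Databricks Python SDK; this is an assumption about tooling, not part of the current project, and the file name is illustrative:

```python
# Assumption: databricks-sdk is installed and credentials are configured
# via the environment or ~/.databrickscfg.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

with open("data/linkedin_jobs.csv", "rb") as f:   # illustrative file name
    w.files.upload(
        "/Volumes/tabular/dataexpert/fegvilela_jobs/linkedin_jobs.csv",
        f,
        overwrite=True,
    )
```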
A Databricks workflow is triggered by the arrival of new files in this volume.
The workflow loads the data incrementally into a bronze table, with schema enforcement and data quality checks, mainly around nullability.
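A minimal sketch of what the bronze load could look like in a PySpark notebook, assuming Auto Loader is used for the incremental ingestion; the column names, table name, and checkpoint path are assumptions, not the project's actual code:

```python
# Runs inside a Databricks notebook, where `spark` is already available.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

VOLUME_PATH = "/Volumes/tabular/dataexpert/fegvilela_jobs"

# Enforced schema instead of inference; field names and nullability are assumptions.
job_schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("title", StringType(), nullable=False),
    StructField("company", StringType(), nullable=True),
    StructField("location", StringType(), nullable=True),
    StructField("description", StringType(), nullable=True),
    StructField("date_posted", StringType(), nullable=True),
])

bronze_stream = (
    spark.readStream.format("cloudFiles")          # Auto Loader for incremental loads
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(job_schema)
    .load(VOLUME_PATH)
    # Basic nullability check: drop rows missing the key fields.
    .filter(F.col("id").isNotNull() & F.col("title").isNotNull())
)

(bronze_stream.writeStream
    .option("checkpointLocation", f"{VOLUME_PATH}/_checkpoints/bronze")  # illustrative path
    .trigger(availableNow=True)
    .toTable("bronze_jobs"))                                             # assumed table name
```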
After the bronze load, a staging silver table is created that deduplicates the data. This staging data is then used to build the silver table, which applies NLP techniques to each job description field to extract its most important tokens.
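A hedged sketch of the staging/silver step in PySpark; the table and column names are assumptions, and frequency-based token scoring is only one possible way to surface important tokens:

```python
# Illustrative only: deduplicate the bronze data, then tokenize job descriptions.
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer

# Staging silver: deduplicate on the job id (assumed key column).
staging = spark.table("bronze_jobs").dropDuplicates(["id"])

# Tokenize descriptions and remove stop words.
tokenizer = RegexTokenizer(inputCol="description", outputCol="tokens", pattern="\\W+")
remover = StopWordsRemover(inputCol="tokens", outputCol="clean_tokens")
tokenized = remover.transform(
    tokenizer.transform(staging.na.fill({"description": ""}))
)

# Count token frequencies across descriptions; the fitted vocabulary is ordered
# by corpus frequency, so its head gives the most common tokens overall.
cv = CountVectorizer(inputCol="clean_tokens", outputCol="token_counts", vocabSize=1000)
cv_model = cv.fit(tokenized)
top_tokens = cv_model.vocabulary[:20]

cv_model.transform(tokenized).write.mode("overwrite").saveAsTable("silver_jobs")  # assumed name
```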
The repository is organized as follows:
- jobspy folder: scraper code
- data folder: directory for the scraped data
- databricks folder: Databricks assets (workflow, notebooks, etc.)
For future development, the scraper should run on an AWS Lambda function, with the data saved to an S3 bucket or fed into a Kafka topic. This was the original approach, but unfortunately an AWS account was not available.
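If the S3 route is eventually pursued, the upload could look something like this minimal sketch; the bucket and key names are hypothetical:

```python
# Hypothetical future step: push the scraped CSV to S3 (e.g. from a Lambda handler).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="data/linkedin_jobs.csv",      # local scraper output
    Bucket="jobs-raw-bucket",               # hypothetical bucket name
    Key="linkedin/linkedin_jobs.csv",       # hypothetical object key
)
```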
Beyond that, the gold layer still needs to be developed, following the dimensional model described in the pre-capstone project report. A visualization layer is also important so the insights can be consumed and displayed more effectively.
