Final project for Data Engineering Zoomcamp 2024
For my project I decided to analyse meteorite landings on Earth, with their coordinates, years and other parameters. I take the data from The Meteoritical Society, which contains information about all known meteorite landings, and I want to figure out in which areas meteorite fragments are concentrated.
In this project I implement some data engineering best practices (partitioned tables, pre-commit hooks and others) and derive interesting metrics, such as:
- number of meteorites by year
- distribution of the number of meteorites by latitude
- distribution of meteorites by type
- interactive map of meteorite landings
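The first three metrics are simple aggregations. Below is a minimal plain-Python sketch of how they could be computed, not the dbt or Spark code used in this project. The sample rows are made up, and the field names (recclass, year, reclat) are an assumption about the dataset's column names.

```python
from collections import Counter

# Made-up sample rows for illustration only; the field names are assumed.
rows = [
    {"name": "sample-1", "recclass": "L5", "year": 1880, "reclat": 50.775},
    {"name": "sample-2", "recclass": "H6", "year": 1951, "reclat": 56.183},
    {"name": "sample-3", "recclass": "EH4", "year": 1952, "reclat": 54.217},
    {"name": "sample-4", "recclass": "L6", "year": 1976, "reclat": 16.883},
    {"name": "sample-5", "recclass": "L6", "year": 1902, "reclat": -33.167},
    {"name": "sample-6", "recclass": "H6", "year": 1952, "reclat": None},
]


def count_by_year(rows):
    """Number of meteorites per year, skipping rows without a year."""
    return Counter(r["year"] for r in rows if r["year"] is not None)


def count_by_type(rows):
    """Number of meteorites per class."""
    return Counter(r["recclass"] for r in rows)


def latitude_histogram(rows, bin_size=10):
    """Number of meteorites per latitude band, keyed by the band's lower edge."""
    hist = Counter()
    for r in rows:
        lat = r["reclat"]
        if lat is None:
            continue  # rows without coordinates cannot be placed in a band
        lower_edge = int(lat // bin_size) * bin_size
        hist[lower_edge] += 1
    return hist


print(count_by_year(rows))
print(count_by_type(rows))
print(latitude_histogram(rows))
```

Floor division keeps negative latitudes in the correct band, so -33.167 falls into the band starting at -40 rather than -30.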
Go to https://lookerstudio.google.com/reporting/6c8488a2-2e39-4b79-a966-de9cba50b83c/page/lo3tD to view the report.
In fact, a partitioned table is not needed for such a small amount of data, but I decided to add it to show that I can do it. The same goes for processing with Spark.
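To show why partitioning pays off on larger tables, here is a small plain-Python sketch of the idea. It is only an illustration, not how BigQuery implements partitions, and partitioning by year is an assumption rather than a statement about this project's table. The rows are made up.

```python
from collections import defaultdict


def partition_by_year(rows):
    """Group rows into one bucket per year, like a table partitioned on year."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["year"]].append(row)
    return dict(partitions)


def query_year_range(partitions, start, end):
    """Return matching rows and how many rows were scanned.

    Only partitions whose key falls inside [start, end] are read;
    all other partitions are skipped without touching their rows.
    """
    scanned = 0
    result = []
    for year, part in partitions.items():
        if start <= year <= end:
            scanned += len(part)
            result.extend(part)
    return result, scanned


# Made-up rows for illustration only.
rows = [
    {"name": "sample-1", "year": 1880},
    {"name": "sample-2", "year": 1951},
    {"name": "sample-3", "year": 1952},
    {"name": "sample-4", "year": 1952},
    {"name": "sample-5", "year": 1976},
]

partitions = partition_by_year(rows)
matches, scanned = query_year_range(partitions, 1950, 1960)
print(f"matched {len(matches)} rows, scanned {scanned} of {len(rows)}")
```

With only a handful of rows the saving is negligible, which is the point made above: the benefit only shows up when the skipped partitions are large.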
In this project I get the Meteorite Landings dataset from the NASA Open Data Portal.
- Mage for orchestrating workflow
- dbt for data transformation
- Spark for data transformation
- Google BigQuery for data warehousing and analysis
- Google Looker Studio for dashboard
- Terraform for provisioning BigQuery dataset
- Docker for running services on local machine
- Docker Compose for running services on local machine
- A Google Cloud Platform account
- Docker (https://www.docker.com/get-started/)
- Terraform (https://developer.hashicorp.com/terraform/install)
Go to the Manage Resources page in the Google Cloud console. Click Create Project, fill in the fields, and then click Finish. After that, add a billing account to the project.
cd ~
git clone git@github.com:grozwalker/de-zoomcamp-meterorite-landings.git
cd de-zoomcamp-meterorite-landings # submodule commands must run inside the cloned repository
git submodule update --init --recursive
- In the Google Cloud console, go to the Create service account page
- Select a Google Cloud project
- Fill necessary fields
- Add these roles: BigQuery Admin, Cloud Datastore Owner, Cloud SQL Admin, Storage Admin, Storage Object Admin, Viewer
- Click Done to finish creating the service account.
- In the Service accounts dashboard, find the newly created account and click on Actions -> Manage keys
- Click on Add key -> Create new key and choose key type JSON
- Save the file as gcp-service-account.json and store it in your project folder, in {project_folder}/key.
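Before moving on, it can help to confirm that the downloaded key is in place and looks like a service account key. The following is an optional sketch and not part of the repository. The helper name check_service_account_key is hypothetical, and the fields it checks are the ones normally present in a Google service account JSON key.

```python
import json
from pathlib import Path

# Fields normally present in a Google service account JSON key.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not data.get(name)]
    key_type = data.get("type")
    if key_type and key_type != "service_account":
        problems.append(f"unexpected type {key_type!r}, expected 'service_account'")
    return problems


# Example call, run from the project folder:
# print(check_service_account_key("key/gcp-service-account.json"))
```

An empty list means the file exists, parses as JSON and contains the expected fields. It does not verify that the key is still valid or that the roles are assigned.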
cd ~/de-zoomcamp-meterorite-landings/terraform
cp terraform.tfvars.example terraform.tfvars
nano terraform.tfvars # fill the variable **project_id** with the value of the project ID that you created above
terraform init
terraform apply
cd ~/de-zoomcamp-meterorite-landings
cp dev.env .env
nano .env # fill GOOGLE_PROJECT_ID
make build
make ingest_data # It takes several minutes
If you want access to the Mage UI, run make ui and open http://localhost:6789/pipelines/meteorite_landings
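If the build or ingestion fails, a quick thing to check is that GOOGLE_PROJECT_ID in .env was actually filled in. Here is a small optional sketch of such a check. The helper names are hypothetical, and it handles only simple KEY=VALUE lines, not the full .env syntax that Docker Compose accepts.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(text, required=("GOOGLE_PROJECT_ID",)):
    """Return the required variable names that are absent or empty."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]


# Example call, run from the project folder:
# with open(".env", encoding="utf-8") as f:
#     print(missing_variables(f.read()))
```

An empty list means every required variable has a non-empty value.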
When you are done, you can destroy all infrastructure:
cd ~/de-zoomcamp-meterorite-landings/terraform
terraform destroy