This repository is a collection of DataCamp projects.
American Sign Language (ASL) is the primary language used by many deaf individuals in North America, and it is also used by hard-of-hearing and hearing individuals. The language is as rich as spoken languages and employs signs made with the hand, along with facial gestures and bodily postures.
In this project, you will train a convolutional neural network to classify images of ASL letters. After loading, examining, and preprocessing the data, you will train the network and test its performance.
Data Manipulation
Data Visualization
Machine Learning
Importing & Cleaning Data
- Set the global random seed with TensorFlow
- One-hot encode the labels
import tensorflow as tf
from keras.utils import np_utils
# Sets the global random seed
tf.set_random_seed(2)
# One-hot encode the training labels
# Labels are indexed starting from zero
y_train_OH = np_utils.to_categorical(y_train, num_classes=3)
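The project's model is a small Keras CNN; the sketch below shows the kind of architecture involved. The 50×50×3 input shape and the three output classes are assumptions based on the dataset, not the project's exact model.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Minimal CNN sketch for the ASL letter images
# (input shape and layer sizes are assumptions, not the project's exact model)
model = Sequential([
    Conv2D(5, kernel_size=5, padding='same', activation='relu', input_shape=(50, 50, 3)),
    MaxPooling2D(pool_size=4),
    Conv2D(15, kernel_size=5, padding='same', activation='relu'),
    MaxPooling2D(pool_size=4),
    Flatten(),
    Dense(3, activation='softmax')  # one output per letter class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()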
Original creator: Alexis Cook
Using a dataset comprised of songs of two music genres (Hip-Hop and Rock), you will train a classifier to distinguish between the two genres based only on track information derived from Echonest (now part of Spotify). You will first make use of the pandas and seaborn packages in Python for subsetting the data, aggregating information, and creating plots when exploring the data for obvious trends or factors you should be aware of when doing machine learning.
Next, you will use the scikit-learn package to predict whether you can correctly classify a song's genre based on features such as danceability, energy, acousticness, tempo, etc. You will go over implementations of common algorithms such as PCA, logistic regression, decision trees, and so forth.
Data Manipulation
Data Visualization
Machine Learning
Importing & Cleaning Data
- If there is no obvious elbow in the scree plot of explained variance, use the cumulative explained variance instead and pick a threshold such as 0.85 (rule of thumb); a sketch of this follows the PCA code below
# Merge the relevant columns of tracks and echonest_metrics
echo_tracks = echonest_metrics.merge(tracks[['genre_top', 'track_id']], on='track_id')
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Get our explained variance ratios from PCA using all features
pca = PCA()
pca.fit(scaled_train_features)
exp_variance = pca.explained_variance_ratio_
# plot the explained variance using a barplot
fig, ax = plt.subplots()
ax.bar(range(pca.n_components_), exp_variance)
ax.set_xlabel('Principal Component #')
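A minimal sketch of the cumulative-explained-variance approach mentioned above, reusing the variable names from the snippet (the 0.85 threshold is the rule of thumb, not a fixed requirement):
import numpy as np

# Cumulative explained variance and the number of components needed
# to pass the 0.85 threshold (variable names follow the snippet above)
cum_exp_variance = np.cumsum(exp_variance)
n_components = (cum_exp_variance >= 0.85).argmax() + 1

fig, ax = plt.subplots()
ax.plot(cum_exp_variance)
ax.axhline(y=0.85, linestyle='--')

# Refit PCA keeping only the chosen number of components
pca = PCA(n_components=n_components)
pca.fit(scaled_train_features)
pca_projection = pca.transform(scaled_train_features)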
Original creator: Ahmed Hasan
Time series data is everywhere; from temperature records, to unemployment rates, to the S&P 500 Index. Another rich source of time series data is Google Trends, where you can freely download the search interest of terms and topics from as far back as 2004. This project dives into manipulating and visualizing Google Trends data to find unique insights.
In this project’s guided variant, you will explore the search data underneath the Kardashian family's fame and make custom plots to find how the most famous Kardashian/Jenner sister has changed over time. In the unguided variant, you will analyze the worldwide search interest of five major internet browsers to calculate metrics such as rolling average and percentage change.
Data Manipulation
Data Visualization
Importing & Cleaning Data
- Cast the columns to `int` type
- Cast the columns to type `datetime64[ns]`
- Smooth out the fluctuations with rolling means
# Cast the search-interest columns to integer type
# (assumes the first column of `trends` is the month column)
for col in trends.columns[1:]:
    trends[col] = pd.to_numeric(trends[col])
# Cast the columns to type `datetime64[ns]`
trends['month'] = pd.to_datetime(trends['month'])
# Smooth out the fluctuations with rolling means (index by month first)
trends.set_index('month').rolling(window=12).mean()
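The unguided variant also computes percentage change; a minimal sketch, assuming the same `trends` DataFrame indexed by month:
# Month-over-month percentage change of search interest
# (assumes the numeric columns of the same `trends` DataFrame as above)
trends_pct_change = trends.set_index('month').pct_change()
trends_pct_change.head()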
Original creator: David Venturi
In 1847, the Hungarian physician Ignaz Semmelweis made a breakthrough discovery: the importance of handwashing. Contaminated hands were a major cause of childbed fever, and by enforcing handwashing at his hospital he saved hundreds of lives.
Data Manipulation
Data Visualization
Probability & Statistics
Importing & Cleaning Data
Case Studies
- Bootstrap analysis
# A bootstrap analysis of the reduction of deaths due to handwashing
boot_mean_diff = []
for i in range(3000):
    boot_before = before_washing['proportion_deaths'].sample(frac=1, replace=True)
    boot_after = after_washing['proportion_deaths'].sample(frac=1, replace=True)
    boot_mean_diff.append(boot_before.mean() - boot_after.mean())
# Calculating a 95% confidence interval from boot_mean_diff
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
confidence_interval
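The `proportion_deaths` column used in the bootstrap above comes from the monthly clinic data; a minimal sketch of how it could be derived (the `monthly` DataFrame and its column names are assumptions):
# Assumed derivation of the `proportion_deaths` column used above
monthly['proportion_deaths'] = monthly['deaths'] / monthly['births']

# Split the data into the periods before and after handwashing was made mandatory
handwashing_start = pd.to_datetime('1847-06-01')
before_washing = monthly[monthly['date'] < handwashing_start]
after_washing = monthly[monthly['date'] >= handwashing_start]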
Original creator: Rasmus Bååth
The Rebrickable database includes data on every LEGO set that has ever been sold; the names of the sets, what bricks they contain, what color the bricks are, etc. It might be small bricks, but this is big data! In this project, you will get to explore the Rebrickable database and answer a series of questions related to the history of Lego!
Data Manipulation
Data Visualization
Importing & Cleaning Data
- Extract information with pandas's `groupby`
- `.size()` includes `NaN` values while `.count()` does not (a toy example follows the code below)
import pandas
# colors_summary: Distribution of colors based on transparency
colors_summary = colors.groupby('is_trans').count()
nan_colors_summary = colors.groupby('is_trans').size()
# Create a summary of average number of parts by year: `parts_by_year`
parts_by_year = sets.groupby('year')['num_parts'].mean()
# themes_by_year: Number of themes shipped by year
themes_by_year = sets.groupby('year')[['theme_id']].nunique()
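A quick toy example (not from the Rebrickable data) showing the `.size()` vs `.count()` difference noted above:
import numpy as np
import pandas as pd

# .size() counts every row in a group, .count() skips NaN values
demo = pd.DataFrame({'is_trans': ['t', 't', 'f'], 'name': ['Red', np.nan, 'Blue']})
print(demo.groupby('is_trans').size())   # t -> 2, f -> 1
print(demo.groupby('is_trans').count())  # name column: t -> 1, f -> 1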
Original creator: Ramnath Vaidyanathan
In this project, you’ll apply the skills you learned in Introduction to Python and Intermediate Python to solve a real-world data science problem. You’ll press “watch next episode” to discover if Netflix’s movies are getting shorter over time and which guest stars appear in the most popular episode of "The Office", using everything from lists and loops to pandas and matplotlib.
You’ll also gain experience in an essential data science skill — exploratory data analysis. This will allow you to perform critical tasks such as manipulating raw data and drawing conclusions from plots you create of the data.
Data Manipulation
Data Visualization
Programming
- Assign colors according to genre
- Reverse a boolean mask with `~`
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for ind_movie, movie in netflix_movies_col_subset.iterrows():
    if movie['genre'] == 'Children':
        colors.append('red')
    elif movie['genre'] == 'Documentaries':
        colors.append('blue')
    elif movie['genre'] == 'Stand-Up':
        colors.append('green')
    else:
        colors.append('black')
# Set the figure style and initialize a new figure
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(12,8))
# Create a scatter plot of duration versus release_year
plt.scatter(netflix_movies_col_subset['release_year'], netflix_movies_col_subset['duration'], c=colors)
# Create a title and axis labels
plt.title('Movie duration by year of release')
plt.xlabel('Release year')
plt.ylabel('Duration (min)')
# Show the plot
plt.show()
# Reverse boolean mask
typical_netflix_movies = netflix_movies_col_subset[~netflix_movies_col_subset['genre'].isin(['Children',
'Documentaries',
'Stand-Up'])]
typical_netflix_movies.head()
Original creator: Justin Saddlemyer
Can a machine distinguish between a honey bee and a bumble bee? Being able to identify bee species from images, while challenging, would allow researchers to more quickly and effectively collect field data. In this Project, you will use the Python image library Pillow to load and manipulate image data. You'll learn common transformations of images and how to build them into a pipeline.
This project is the first part of a series of projects that walk through working with image data, building classifiers using traditional techniques, and leveraging the power of deep learning for computer vision. The second project in the series is Naïve Bees: Predict Species from Images.
Data Manipulation
Data Visualization
Machine Learning
Importing & Cleaning Data
- `PIL` library for images
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
plt.imshow(img_data)
plt.imshow(img_data[:,:, 0], cmap=plt.cm.Reds_r)
honey = Image.open('datasets/bee_12.jpg')
# convert honey to grayscale
honey_bw = honey.convert('L')
display(honey_bw)
# convert the image to a NumPy array
honey_bw_arr = np.array(honey_bw)
# get the shape of the resulting array
honey_bw_arr_shape = honey_bw_arr.shape
print("Our NumPy array has the shape: {}".format(honey_bw_arr_shape))
# save an image
honey_bw.save("saved_images/bw_flipped.jpg")
# plot the array using matplotlib
plt.imshow(honey_bw_arr, cmap=plt.cm.gray)
plt.show()
# plot the kde of the new black and white array
plot_kde(honey_bw_arr, 'k')
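`plot_kde` is a helper defined in the project notebook; a minimal sketch of what such a helper could look like (an assumption, not the project's exact code):
import pandas as pd

def plot_kde(channel, color='b'):
    """Plot a kernel density estimate of the pixel values in one channel."""
    data = channel.flatten()
    return pd.Series(data).plot.density(c=color)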
Original creators: Peter Bull, Emily Miller
Commercial banks receive a lot of applications for credit cards. Many of them get rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning, and pretty much every commercial bank does so nowadays. In this project, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.
The dataset used in this project is the Credit Card Approval dataset from the UCI Machine Learning Repository.
Data Manipulation
Machine Learning
Importing & Cleaning Data
Applied Finance
- Read a `.data` file with pandas's `read_csv()`
- Impute missing values by mean imputation
- Impute with the most frequent value
- Convert the non-numeric data into numeric
- GridSearchCV
import pandas as pd

# Read the .data file with pandas's read_csv()
cc_apps = pd.read_csv('datasets/cc_approvals.data', header=None)
# Impute the missing values with mean imputation
cc_apps_train.fillna(cc_apps_train.mean(axis=0), inplace=True)
cc_apps_test.fillna(cc_apps_train.mean(axis=0), inplace=True)
# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
# Iterate over each column of cc_apps_train
for col in cc_apps_train:
    # Check if the column is of object type
    if cc_apps_train[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps_train = cc_apps_train.fillna(cc_apps_train[col].value_counts().index[0])
        cc_apps_test = cc_apps_test.fillna(cc_apps_test[col].value_counts().index[0])
# Convert the categorical features in the train and test sets to numeric data independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)
from sklearn.model_selection import GridSearchCV
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
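Fitting the grid search and reading out the best score and parameters could then look like the sketch below; `rescaledX_train` and `y_train` are assumed names for the prepared training arrays.
# Fit grid_model to the (rescaled) training data and summarize the search
grid_model_result = grid_model.fit(rescaledX_train, y_train)

# Best cross-validated score and the hyperparameters that produced it
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))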
Original creator: Sayak Paul
When you assess whether to invest in an asset, you want to look not only at how much money you could make but also at how much risk you are taking. The Sharpe Ratio, developed by Nobel Prize winner William Sharpe some 50 years ago, does precisely this: it compares the return of an investment to that of an alternative and relates the relative return to the risk of the investment, measured by the standard deviation of returns.
In this project, you will apply the Sharpe ratio to real financial data using pandas.
Applied Finance
Case Studies
- Extract information from financial data
# Calculate daily stock_data returns with pandas's pct_change() method
stock_returns = stock_data.pct_change()
# calculate the daily sharpe ratio
# divide avg_excess_return by sd_excess_return
daily_sharpe_ratio = avg_excess_return.div(sd_excess_return)
# Annualize the sharpe ratio: multiply daily_sharpe_ratio
# by the square root of the number of trading days in a year (252)
annual_factor = np.sqrt(252)
annual_sharpe_ratio = daily_sharpe_ratio.mul(annual_factor)
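The `avg_excess_return` and `sd_excess_return` inputs above come from the excess of the stock returns over a benchmark; a minimal sketch, assuming a `benchmark_returns` Series (e.g. S&P 500 daily returns) aligned on the same dates:
# Excess returns of the stocks over the benchmark
# (`benchmark_returns` is an assumed Series aligned on the same dates)
excess_returns = stock_returns.sub(benchmark_returns, axis=0)

# Average and standard deviation of the daily excess returns
avg_excess_return = excess_returns.mean()
sd_excess_return = excess_returns.std()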
Original creator: Stefan Jansen
Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world where groundbreaking work is published. In this Project, you will analyze a large collection of NIPS research papers from the past decade to discover the latest trends in machine learning. The techniques used here to handle large amounts of data can be applied to other text datasets as well.
Familiarity with Python and pandas is required to complete this Project, as well as experience with Natural Language Processing in Python (`sklearn` specifically).
Data Manipulation
Data Visualization
Machine Learning
Probability & Statistics
Importing & Cleaning Data
- `map` and `lambda`
- `wordcloud` library for NLP
- NLP in scikit-learn
# Import the wordcloud library
import wordcloud
# Join the different processed titles together.
long_string = ' '.join(papers['title_processed'])
# Create a WordCloud object
wordcloud = wordcloud.WordCloud()
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(papers['title_processed'])
# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
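The LDA model is then fit on the vectorized titles; a minimal sketch of that step (the number of topics and the topic-printing loop are illustrative assumptions, not the project's exact code):
# Fit LDA on the vectorized titles (10 topics is an assumed example value)
number_topics = 10
lda = LDA(n_components=number_topics)
lda.fit(count_data)

# Print the top ten words per topic (illustrative helper, not project code)
words = count_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[:-11:-1]]
    print("Topic %d: %s" % (topic_idx, ' '.join(top_words)))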
Original creator: Lars Hulstaert