Data Analysis Using Python (Python For Beginners) - CloudxLab

Data Analysis with Python (For Beginners)

[email protected]
About CloudxLab

Making learning fun and for life

Videos Quizzes Hands-On Projects Case Studies

Real Life Use Cases


CloudxLab - Playground with Feedback
Playground for hands-on practice. The system evaluates the code automatically
and nudges the user by giving appropriate feedback

CloudxLab - Online Cloud Based Lab

Cloud-based lab with pre-installed tools and software for practicing AI, Machine Learning, Deep Learning, Data Science, Big Data and related technologies

● Real-world Experience: the lab setup is exactly the same as the setup in enterprises. Become job ready from Day 1
● Seamless Experience: no endless downloading/installations. No hardware, permissions or configuration issues
● Central Dataset: upload your own dataset, or use open source datasets available on the lab
● Any Device Anywhere: connect from ANY browser, SSH, device or operating system
CloudxLab - Social
We learn better with peers. Social proof and leaderboard
increases engagement and motivation
CloudxLab - Hiring Partners
Dedicated Job Portal → Upgrade career, enhance salary & move
jobs by applying to jobs posted by our hiring partners
CloudxLab - University Partners
Instructors / Authors

● Sandeep Giri: Founder at CloudxLab.com | AI Advisor at Algoworks | Speaker - AI, Machine Learning, Deep Learning, Big Data. Amazon, InMobi, D.E.Shaw. 18+ Years of Exp. in Enterprise Softwares, Machine Learning & Churning Humongous Data
● Praveen Pavithran: CTO/Co-Founder at Yatis | IOT, ML, Computer Vision, Edge ML. Cypress Semiconductors, Philips. Multiple patents, conference papers. IIT Bombay Dual Degree
● Abhinav Singh: Co-Founder, CloudxLab.com | AI & Big Data | Visiting Faculty at SCMHRD. Byjus, HashCube. 9+ Years of Exp. in EdTech, Game Development & Building Product

What is Python

- Python is an interpreted, high-level language
- First released in 1991 by Guido van Rossum
- It is easy to use and improves engineer productivity
- Libraries for multiple applications
- Django framework for web applications
- We will focus on libraries for Data Analysis

Numpy

What is NumPy

Stands for "Numeric Python" or "Numerical Python".

● Open Source
● Module of Python
● Provides fast mathematical functions

[Figure: the complete Machine Learning eco-system, showing Python, numpy, pandas, matplotlib, scikitlearn and tensorflow]

Why use NumPy ?

● Array-oriented computing
● Efficiently implemented multi-dimensional arrays
● Designed for scientific computation
● Library of high-level mathematical functions

Numpy - Introduction

● NumPy’s main object is the homogeneous multidimensional array
● It is a table of elements
○ usually numbers
○ all of the same type
○ indexed by a tuple of non-negative integers
● In NumPy, dimensions are called axes
● The number of axes is called the rank

Numpy - Introduction

array([[ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.]])

First Dimension / Axis (rows), Len = 3
Second Dimension / Axis (columns), Len = 4

The above array has a rank of 2 since it is 2-dimensional.

Creating Numpy arrays
np.array - Creating a NumPy array from a Python list or tuple

NumPy arrays can be created from a Python list or tuple in the following way.

>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> type(a)
<class 'numpy.ndarray'>
>>> b = np.array((3, 4, 5))
>>> type(b)
<class 'numpy.ndarray'>

Creating Numpy arrays
np.zeros - An array with all zeros

To create an array with all zeros, the function np.zeros is used.

>>> x = np.zeros( (3,4) )
>>> x
array([[ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.]])

Creating Numpy arrays
np.ones - An array with all Ones

To create an array with all ones the function np.ones is used.

>>> np.ones( (3,4), dtype=np.int16 )


array([[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]])

Creating Numpy arrays
np.full - An array with a given value

To create an array with a given shape and a given value np.full


is used.

>>> np.full( (3,4), 0.11 )


array([[ 0.11, 0.11, 0.11, 0.11],
[ 0.11, 0.11, 0.11, 0.11],
[ 0.11, 0.11, 0.11, 0.11]])

Creating Numpy arrays
np.arange - Creating sequence of Numbers

>>> np.arange( 10, 30, 5 )
array([10, 15, 20, 25])
>>> np.arange( 0, 2, 0.3 ) # it accepts float arguments
array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])

Creating Numpy arrays
np.linspace - Creating an array with evenly distributed numbers

● Returns an array having a specific number of points
● Evenly distributed between two values
● The maximum value is included, contrary to arange
● Arguments, in order: starting number, ending number, total number of points

>>> np.linspace(0, 5/3, 6)
array([0. , 0.33333333, 0.66666667, 1. , 1.33333333, 1.66666667])
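
The difference between arange and linspace can be sketched side by side. This is a minimal illustration, not part of the original slides; the variable names are arbitrary, and `endpoint=False` is the standard `np.linspace` keyword for excluding the ending value.

```python
import numpy as np

# arange takes a step and excludes the stop value.
stepped = np.arange(0, 1, 0.25)

# linspace takes a number of points and includes the stop value by default.
spaced = np.linspace(0, 1, 5)

# endpoint=False makes linspace exclude the stop value, like arange.
spaced_open = np.linspace(0, 1, 4, endpoint=False)

print(stepped)      # [0.   0.25 0.5  0.75]
print(spaced)       # [0.   0.25 0.5  0.75 1.  ]
print(spaced_open)  # [0.   0.25 0.5  0.75]
```

With a floating-point step, linspace is usually the safer choice when the number of points matters, because the length of an arange result can be affected by rounding.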

Creating Numpy arrays
np.random.rand - Creating an array with random numbers

Make a 2x3 matrix having random floats between 0 and 1:

>>> np.random.rand(2,3)
array([[ 0.55365951, 0.60150511, 0.36113117],
[ 0.5388662 , 0.06929014, 0.07908068]])

Creating Numpy arrays
np.empty - Creating an empty array

np.empty creates an uninitialised array with a given shape. Its content is not predictable.

>>> np.empty((2,3))
array([[ 0.21288689, 0.20662218, 0.78018623],
[ 0.35294004, 0.07347101, 0.54552084]])

Important attributes of a NumPy object

NumPy’s array class is called ndarray. The important attributes of an ndarray object are:

ndarray.ndim
the number of axes (dimensions) of the array.
[[ 1., 0., 0.],
[ 0., 1., 2.]]

For the above array the value of ndarray.ndim is 2.

Important attributes of a NumPy object

ndarray.shape
the dimensions of the array. This is a tuple of integers
indicating the size of the array in each dimension.
[[ 1., 0., 0.],
[ 0., 1., 2.]]
For the above array the value of ndarray.shape is (2,3)

Important attributes of a NumPy object

ndarray.size
the total number of elements of the array. This is equal to
the product of the elements of shape.
[[ 1., 0., 0.],
[ 0., 1., 2.]]

For the above array the value of ndarray.size is 6.

Important attributes of a NumPy object

ndarray.dtype
Tells the datatype of the elements in the numpy array. All
the elements in a numpy array have the same type.
>>> c = np.arange(1, 5)
>>> c.dtype
dtype('int64')

Important attributes of a NumPy object

ndarray.itemsize
The itemsize attribute returns the size (in bytes) of each
item:
>>> c = np.arange(1, 5)
>>> c.itemsize
8
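
The attributes above can be checked together on one array. This is a small sketch rather than code from the slides; the dtype is set explicitly to `np.float64` (an assumption made here) so that the result does not depend on platform defaults.

```python
import numpy as np

# A 2 x 3 array of 64-bit floats (dtype fixed explicitly so the
# result does not depend on the platform default).
arr = np.array([[1., 0., 0.],
                [0., 1., 2.]], dtype=np.float64)

print(arr.ndim)      # 2
print(arr.shape)     # (2, 3)
print(arr.size)      # 6
print(arr.dtype)     # float64
print(arr.itemsize)  # 8

# size is the product of shape; nbytes is size * itemsize.
total_bytes = arr.size * arr.itemsize
print(total_bytes == arr.nbytes)  # True
```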

Reshaping Arrays

The function reshape is used to reshape the numpy array.


The following example illustrates this.

>>> a = np.arange(6)
>>> print(a)
[0 1 2 3 4 5]
>>> b = a.reshape(2, 3)
>>> print(b)
[[0 1 2]
 [3 4 5]]

Indexing and Accessing NumPy arrays

Indexing one dimensional NumPy Arrays

Index: 0 1 2 3 4 5 6

>>> a = np.array([1, 5, 3, 19, 13, 7, 3])

>>> a[3]
19
>>> a[2:5] #range
array([ 3, 19, 13])
>>> a[2::2] # How many to jump
array([ 3, 13, 3])
>>> a[::-1] #Go reverse
array([ 3, 7, 13, 19, 3, 5, 1])

Difference with regular Python lists

1. If you assign a single value to an ndarray slice, it is copied across the whole slice:
>>> a = np.array([1, 2, 5, 7, 8])
>>> a[1:3] = -1
>>> a
array([ 1, -1, -1, 7, 8])
----
>>> b = [1, 2, 5, 7, 8]
>>> b[1:3] = -1
TypeError: can only assign an iterable

Difference with regular Python lists

2. ndarray slices are actually views on the same data buffer. If you modify a slice, the original ndarray is modified as well.

>>> a = np.array([1, 2, 5, 7, 8])


>>> a_slice = a[1:5]
>>> a_slice[1] = 1000
>>> a
array([ 1, 2, 1000, 7, 8])
# Original array was modified

Difference with regular Python lists

3. If you want a copy of the data, you need to use the copy method, as in another_slice = a[2:6].copy(). If we modify another_slice, a remains the same.
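
The view-versus-copy behaviour can be verified directly. The following is a minimal sketch with illustrative variable names; `np.shares_memory` is used here only as a convenient check and does not appear in the original slides.

```python
import numpy as np

a = np.array([1, 2, 5, 7, 8])

view = a[1:4]           # a view: shares memory with a
copied = a[1:4].copy()  # an independent copy

view[0] = -1            # also changes a[1]
copied[1] = 1000        # leaves a untouched

print(a)                            # [ 1 -1  5  7  8]
print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, copied))  # False
```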

Indexing multi dimensional NumPy arrays
Multi-dimensional arrays can be accessed as
>>> b[1, 2] # row 1, col 2
>>> b[1, :] # row 1, all columns
>>> b[:, 1] # all rows, column 1

The following format is used while indexing multi-dimensional arrays:
Array[row_start_index:row_end_index, column_start_index:column_end_index]
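
As a concrete instance of the format above, here is a short sketch on a 3 x 4 array. The array `grid` and the other names are made up for illustration.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# grid is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

element = grid[1, 2]    # row 1, column 2 -> 6
row = grid[1, :]        # row 1, all columns -> [4 5 6 7]
column = grid[:, 1]     # all rows, column 1 -> [1 5 9]
block = grid[0:2, 1:3]  # rows 0-1, columns 1-2 -> [[1 2] [5 6]]

print(element, row, column)
print(block)
```

As with one-dimensional slices, the end index of each range is excluded.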

Boolean Indexing

We can also index arrays using an ndarray of boolean values on


one axis to specify the indices that we want to access.

>>> a = np.arange(12).reshape(3, 4)
>>> rows_on = np.array([ True, False, True])
>>> a[rows_on, :] # Rows 0 and 2, all columns
array([[ 0, 1, 2, 3],
[ 8, 9, 10, 11]])
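
Boolean masks work along the other axis as well, and a mask with the same shape as the array selects individual elements. The sketch below is illustrative; the names `cols_on` and `evens` are not from the slides.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Boolean mask along the second axis: keep columns 1 and 3.
cols_on = np.array([False, True, False, True])
picked_columns = a[:, cols_on]

# A mask with the same shape as the array selects individual
# elements and returns them as a 1D array.
evens = a[a % 2 == 0]

print(picked_columns)
print(evens)  # [ 0  2  4  6  8 10]
```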

Linear Algebra with NumPy

Vectors

● A vector is a quantity defined by a magnitude and a direction.


● A vector can be represented by an array of numbers called
scalars.

Vectors

For example, say the rocket is going up at a slight angle: it has a


vertical speed of 5,000 m/s, and also a slight speed towards the
East at 10 m/s, and a slight speed towards the North at 50 m/s.
The rocket's velocity may be represented by the following
vector:

velocity = (10 m/s East, 50 m/s North, 5,000 m/s vertical)

Use of Vectors in Machine Learning
● Vectors have many purposes in Machine Learning, most
notably to represent observations and predictions.
● For example, say we built a Machine Learning system to
classify videos into 3 categories (good, spam, clickbait) based
on what we know about them.
Use of Vectors in Machine Learning
● For each video, we would have a vector representing what we know about it, such as:

video = (10.5, 5.2, 3.25, 7.0)

● This vector could represent a video that lasts 10.5 minutes, but only 5.2% of viewers watch for more than a minute, it gets 3.25 views per day on average, and it was flagged 7 times as spam. As you can see, each axis may have a different meaning.

Use of Vectors in Machine Learning

● Based on this vector our Machine Learning system may predict that there is an 80% probability that it is a spam video, 18% that it is clickbait, and 2% that it is a good video. This could be represented as the following vector:

class_probabilities = (0.80, 0.18, 0.02), ordered as (spam, clickbait, good)

Representing Vectors in Python

● In python, a vector can be represented in many ways, the


simplest being a regular python list of numbers.
○ [1,1,1,1]
● Since Machine Learning requires lots of scientific calculations,
it is much better to use NumPy's ndarray, which provides a
lot of convenient and optimized implementations of essential
mathematical operations on vectors.
● numpy.array([1,1,1,1])

Vectorized Operations

● Vectorized operations are far more efficient than loops written in Python to do the same thing
● Let’s test it

Vectorized Operations

Matrix multiplication
1. Using for loops
>>> def multiply_loops(A, B):
...     # C[i, j] is the dot product of row i of A and column j of B
...     C = np.zeros((A.shape[0], B.shape[1]))
...     for i in range(A.shape[0]):
...         for j in range(B.shape[1]):
...             for k in range(A.shape[1]):
...                 C[i, j] += A[i, k] * B[k, j]
...     return C

2. Using NumPy's matrix-matrix multiplication operator
>>> def multiply_vector(A, B):
...     return A @ B

Vectorized Operations

Matrix multiplication - Sample data

# Two randomly-generated, 100x100 matrices

>>> X = np.random.random((100, 100))


>>> Y = np.random.random((100, 100))

Vectorized Operations
Matrix multiplication - Loops - timeit

# First, using the explicit loops:
>>> %timeit multiply_loops(X, Y)
4.23 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Result - It took about 4.23 milliseconds (4.23 × 10⁻³ seconds) to perform one matrix-matrix multiplication

Matrix multiplication - Vector - timeit

# Second, the NumPy multiplication:
>>> %timeit multiply_vector(X, Y)
46.6 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Result - 46.6 microseconds (46.6 × 10⁻⁶ seconds) per multiplication

Conclusion - Two orders of magnitude faster
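
Outside IPython, where `%timeit` is not available, a similar comparison can be sketched with the standard `timeit` module. This is an illustrative setup, not the measurement from the slides: the matrices are 30 x 30 instead of 100 x 100 to keep the run short, the seed is arbitrary, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

def multiply_loops(A, B):
    """Matrix product with explicit Python loops (slow reference)."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)  # seeded so the run is repeatable
X = rng.random((30, 30))
Y = rng.random((30, 30))

# Both approaches must agree before the timings mean anything.
same_result = np.allclose(multiply_loops(X, Y), X @ Y)

loop_time = timeit.timeit(lambda: multiply_loops(X, Y), number=3)
numpy_time = timeit.timeit(lambda: X @ Y, number=3)

print("results match:", same_result)
print(f"loops: {loop_time:.4f} s, NumPy: {numpy_time:.6f} s")
```

Checking that both functions return the same matrix first guards against comparing the speed of two different computations.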

Basic Operations on NumPy arrays

Addition in NumPy arrays

Addition can be performed on NumPy arrays as shown below.


They apply element wise.

>>> a = np.array( [20, 30, 40, 50] )


>>> b = np.arange( 4 )
>>> b
array([0, 1, 2, 3])
>>> c = a + b
>>> c
array([20, 31, 42, 53])

Subtraction in NumPy arrays

Subtraction can be performed on NumPy arrays as shown


below. They apply element wise.
>>> a = np.array( [20, 30, 40, 50] )
>>> b = np.arange( 4 )
>>> b
array([0, 1, 2, 3])
>>> c = a - b
>>> c
array([20, 29, 38, 47])

Element wise product in NumPy arrays

Element wise product can be performed on NumPy arrays as


shown below.
>>> A = np.array( [[1,1],
... [0,1]] )
>>> B = np.array( [[2,0],
... [3,4]] )
>>> A*B # element wise product
array([[2, 0],
[0, 4]])

Matrix Product in NumPy arrays

Matrix product can be performed on NumPy arrays as shown


below.
>>> A = np.array( [[1,1],
... [0,1]] )
>>> B = np.array( [[2,0],
... [3,4]] )
>>> np.dot(A, B) # matrix product
array([[5, 4],
[3, 4]])

Division in NumPy arrays

Division can be performed on NumPy arrays as shown below.


They apply element wise.

a = np.array( [20, 30, 40, 50] )


b = np.arange(1, 5)
c = a / b
c
array([ 20. , 15. , 13.33333333, 12.5 ])

Integer Division in NumPy arrays

Integer (floor) division can be performed on NumPy arrays as shown below. It applies element-wise.

a = np.array( [20, 30, 40, 50] )


b = np.arange(1, 5)
c = a // b
c
array([20, 15, 13, 12])

Modulus in NumPy arrays

Modulus operator can be applied on NumPy arrays as shown


below. They apply element wise.
a = np.array( [20, 30, 40, 50] )
b = np.arange(1, 5)
c = a % b
c
array([0, 0, 1, 2])

Exponents in NumPy arrays

We can find the exponent of each element in a NumPy array


in the following way. It is applied element wise.

a = np.array( [20, 30, 40, 50] )


b = np.arange(1, 5)
c = a ** b
c
array([ 20, 900, 64000, 6250000])

Conditional Operators on NumPy arrays

Conditional operators are also applied element-wise


m = np.array([20, -5, 30, 40])
m < [15, 16, 35, 36]
array([False, True, True, False], dtype=bool)

m < 25
array([ True, True, False, False], dtype=bool)

To get the elements below 25


m[m < 25]
array([20, -5])

Broadcasting in NumPy arrays

What is Broadcasting ?

[[1, 2],     [[0, 2],     [[1, 4],
 [4, 5]]  +   [3, 4]]  =   [7, 9]]

[[1, 2],     [[5],
 [4, 5]]  +   [7]]  =  ???

What is Broadcasting ?

In general, when NumPy expects arrays of the same shape but


finds that this is not the case, it applies the so-called
broadcasting rules.

Basically there are 2 rules of Broadcasting to remember.

First rule of Broadcasting

[[[1, 3]]] + [5] = [[[6, 8]]]

Shapes: (1, 1, 2) + (1,) = (1, 1, 2)

If the arrays do not have the same rank, then 1s are prepended to the shapes of the lower-rank arrays until their ranks match.

First rule of Broadcasting

>>> h = np.arange(5).reshape(1, 1, 5)
>>> h
array([[[0, 1, 2, 3, 4]]])

Let's try to add a 1D array of shape (5,) to this 3D array of shape (1,1,5), applying the first rule of broadcasting.

>>> h + [10, 20, 30, 40, 50] # same as: h + [[[10, 20, 30, 40, 50]]]
array([[[10, 21, 32, 43, 54]]])

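Second rule of Broadcasting

Arrays with a size of 1 along a particular dimension act as if they had the size of the array with the largest shape along that dimension.

On adding a 1D array of shape (3,) to a 2D ndarray of shape (2, 3), NumPy first prepends a 1 to give the shape (1, 3) (first rule), then repeats that single row along the first axis to match the shape (2, 3) (second rule).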

>>> k = np.arange(6).reshape(2, 3)
>>> k
array([[0, 1, 2],
[3, 4, 5]])

>>> k + [100, 200, 300]


array([[100, 201, 302],
[103, 204, 305]])
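
Both rules can be seen at work in a short sketch. The names `row` and `col` are illustrative, and the last case shows what happens when the shapes cannot be matched by the rules.

```python
import numpy as np

k = np.arange(6).reshape(2, 3)   # shape (2, 3)

row = np.array([100, 200, 300])  # shape (3,)   -> treated as (1, 3)
col = np.array([[100], [200]])   # shape (2, 1) -> stretched to (2, 3)

print(k + row)
# [[100 201 302]
#  [103 204 305]]
print(k + col)
# [[100 101 102]
#  [203 204 205]]

# Shapes that cannot be matched by these rules raise a ValueError.
try:
    k + np.array([1, 2])         # shape (2,) vs trailing axis of size 3
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)         # True
```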

Mathematical and statistical
functions on NumPy arrays

Finding Mean of NumPy array elements

The ndarray object has a method mean() which finds the mean
of all the elements in the array regardless of the shape of the
numpy array.

>>> a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])


>>> print("mean =", a.mean())
mean = 6.76666666667

Other useful ndarray methods

Similar to mean there are other ndarray methods which can be


used for various computations.

min - returns the minimum element in the ndarray


max - returns the maximum element in the ndarray
sum - returns the sum of the elements in the ndarray
prod - returns the product of the elements in the ndarray
std - returns the standard deviation of the elements in the
ndarray.
var - returns the variance of the elements in the ndarray.

Other useful ndarray methods
>>> a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])

>>> for func in (a.min, a.max, a.sum, a.prod, a.std, a.var):
...     print(func.__name__, "=", func())
...

min = -2.5
max = 12.0
sum = 40.6
prod = -71610.0
std = 5.08483584352
var = 25.8555555556
Summing across different axes
We can sum across different axes of a numpy array by
specifying the axis parameter of the sum function.

>>> c=np.arange(24).reshape(2,3,4)
>>> c
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

Summing across different axes

>>> c.sum(axis=0) # sum across matrices


array([[12, 14, 16, 18],
[20, 22, 24, 26],
[28, 30, 32, 34]])
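
Summing over the other axes of the same array works the same way. The sketch below is an illustration with made-up variable names; passing a tuple to `axis` collapses several axes at once.

```python
import numpy as np

c = np.arange(24).reshape(2, 3, 4)

sum_axis1 = c.sum(axis=1)      # collapse the 3 rows -> shape (2, 4)
sum_axis2 = c.sum(axis=2)      # collapse the 4 columns -> shape (2, 3)
sum_both = c.sum(axis=(0, 2))  # collapse axes 0 and 2 -> shape (3,)

print(sum_axis1)
# [[12 15 18 21]
#  [48 51 54 57]]
print(sum_axis2)
# [[ 6 22 38]
#  [54 70 86]]
print(sum_both)  # [ 60  92 124]
```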

Transposing Matrices
The T attribute is equivalent to calling transpose() when the
rank is ≥2

>>> m1 = np.arange(6).reshape(2,3)
>>> m1
array([[0, 1, 2],
[3, 4, 5]])
>>> m1.T
array([[0, 3],
[1, 4],
[2, 5]])

Solving a system of linear scalar equations
The solve function solves a system of linear scalar equations,
such as:

2x + 6y = 6
5x + 3y = -9

Solving a system of linear scalar equations
>>> coeffs = np.array([[2, 6], [5, 3]])
>>> depvars = np.array([6, -9])
>>> solution = np.linalg.solve(coeffs, depvars)
>>> solution
array([-3., 2.])

Solving a system of linear scalar equations
Let’s check the solution.

>>> coeffs.dot(solution), depvars

(array([ 6., -9.]), array([ 6, -9]))
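
The check above can also be done with a tolerance, which is the safer habit for floating-point results. The sketch below is illustrative; the singular matrix is a made-up example added to show the failure case.

```python
import numpy as np

coeffs = np.array([[2, 6], [5, 3]])
depvars = np.array([6, -9])

solution = np.linalg.solve(coeffs, depvars)

# Compare with a tolerance rather than ==, since the solver works
# in floating point.
is_valid = np.allclose(coeffs @ solution, depvars)
print(solution, is_valid)

# A singular matrix (second row is twice the first) has no unique
# solution, and solve() raises LinAlgError.
singular = np.array([[1, 2], [2, 4]])
try:
    np.linalg.solve(singular, depvars)
    singular_rejected = False
except np.linalg.LinAlgError:
    singular_rejected = True
print(singular_rejected)
```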

References

● NumPy
○ https://docs.scipy.org/doc/

Questions?
https://discuss.cloudxlab.com
Pandas

What is Pandas?

● One of the most widely used Python libraries in Data Science after
NumPy and Matplotlib
● The Pandas library provides
○ High-performance
○ Easy-to-use data structures and
○ Data analysis tools

Pandas - DataFrame

● The main data structure is the DataFrame

● In memory 2D table

○ Like Spreadsheet with column names and row label

Pandas - Data Analysis

● Many features available in Excel are available programmatically like

○ Creating pivot tables

○ Computing columns based on other columns

○ Plotting graphs

Pandas - Data Structures

● Series objects

○ 1D array, similar to a column in a spreadsheet

● DataFrame objects

○ 2D table, similar to a spreadsheet

● Panel objects (deprecated in pandas 0.20 and removed in pandas 0.25)

○ Dictionary of DataFrames

Pandas - Series Objects

Creating a Series
>>> import pandas as pd
>>> s = pd.Series([2,-1,3,5])

Output -
0 2
1 -1
2 3
3 5
dtype: int64

Pandas - Series Objects

Pass as parameters to NumPy functions


>>> import numpy as np
>>> np.square(s)

Output -
0 4
1 1
2 9
3 25
dtype: int64

Pandas - Series Objects

Arithmetic operation on the series


>>> s + [1000,2000,3000,4000]

Output -
0 1002
1 1999
2 3003
3 4005
dtype: int64

Pandas - Series Objects

Broadcasting
>>> s + 1000

Output -
0 1002
1 999
2 1003
3 1005
dtype: int64

Pandas - Series Objects

Binary and conditional operations


>>> s < 0

Output -
0 False
1 True
2 False
3 False
dtype: bool

Pandas - Series Objects

Index labels - Integer location


>>> s2 = pd.Series([68, 83, 112, 68])
>>> print(s2)

Output -
0 68
1 83
2 112
3 68
dtype: int64

Pandas - Series Objects

Index labels - Set Manually


>>> s2 = pd.Series([68, 83, 112, 68],
index=["alice", "bob", "charles", "darwin"])
>>> print(s2)

Output -
alice 68
bob 83
charles 112
darwin 68
dtype: int64

Pandas - Series Objects

Access the items in series

● By specifying integer location

>>> s2[1]

● By specifying label

>>> s2["bob"]

Pandas - Series Objects

Access the items in series - Recommendations

● Use the loc attribute when accessing by label

>>> s2.loc["bob"]

● Use iloc attribute when accessing by integer location

>>> s2.iloc[1]
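
One reason to prefer loc and iloc is that they make the intent explicit, and they also slice differently. The following is a small sketch using the s2 Series from the earlier slides; the other variable names are illustrative.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68],
               index=["alice", "bob", "charles", "darwin"])

by_label = s2.loc["bob"]               # label lookup -> 83
by_position = s2.iloc[1]               # position lookup -> 83

# Label slices include the end label; position slices exclude
# the end position.
label_slice = s2.loc["bob":"charles"]  # bob and charles
position_slice = s2.iloc[1:2]          # bob only

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```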

Pandas - Series Objects

Init from Python dict

>>> weights = {"alice": 68, "bob": 83, "colin": 86,


"darwin": 68}
>>> s3 = pd.Series(weights)
>>> print(s3)

Output -
alice 68
bob 83
colin 86
darwin 68
dtype: int64
Pandas - Series Objects

Control the elements to include and specify their order

>>> s4 = pd.Series(weights, index = ["colin", "alice"])


>>> print(s4)

Output -
colin 86
alice 68
dtype: int64

Pandas - Series Objects

Automatic alignment

● When an operation involves multiple Series objects

● Pandas automatically aligns items by matching index labels

Pandas - Series Objects

Automatic alignment - example

>>> print(s2+s3)
Output -
alice 136.0
bob 166.0
charles NaN
colin NaN
darwin 136.0
dtype: float64

* Note NaN
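
If NaN is not the desired result for labels present on only one side, `Series.add` with `fill_value` is one way to handle it. This is a sketch of that option, not something shown in the slides; whether filling with 0 is appropriate depends on what the data means.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68],
               index=["alice", "bob", "charles", "darwin"])
s3 = pd.Series({"alice": 68, "bob": 83, "colin": 86, "darwin": 68})

# Plain + gives NaN wherever a label is missing from one side.
plain = s2 + s3

# add() with fill_value treats a label missing from one side as 0,
# so charles and colin keep their single known value.
filled = s2.add(s3, fill_value=0)

print(plain)
print(filled)
```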

Pandas - Series Objects

Automatic alignment

Do not forget to set the right index labels, else you may get surprising
results
>>> s5 = pd.Series([1000,1000,1000,1000])
>>> print(s2 + s5)
Output -
alice NaN
bob NaN
charles NaN
darwin NaN
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
Pandas - Series Objects

Init with a scalar

>>> meaning = pd.Series(42, ["life", "universe",


"everything"])
>>> print(meaning)

Output-

life 42
universe 42
everything 42
dtype: int64

Pandas - Series Objects

Series name - A Series can have a name

>>> s6 = pd.Series([83, 68], index=["bob", "alice"],


name="weights")
>>> print(s6)

* Here series name is weights

Output-
bob 83
alice 68
Name: weights, dtype: int64

Pandas - Series Objects

Plotting a series

>>> %matplotlib inline
>>> import matplotlib.pyplot as plt
>>> temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]
>>> s7 = pd.Series(temperatures, name="Temperature")
>>> s7.plot()
>>> plt.show()

Pandas - DataFrame Objects

● A DataFrame object represents


○ A spreadsheet,
○ With cell values,
○ Column names
○ And row index labels

● Visualize DataFrame as dictionaries of Series

Pandas - DataFrame Objects

Creating a DataFrame - Pass a dictionary of Series objects

>>> people_dict = {
...     "weight": pd.Series([68, 83, 112],
...                         index=["alice", "bob", "charles"]),
...     "birthyear": pd.Series([1984, 1985, 1992],
...                            index=["bob", "alice", "charles"], name="year"),
...     "children": pd.Series([0, 3], index=["charles", "bob"]),
...     "hobby": pd.Series(["Biking", "Dancing"],
...                        index=["alice", "bob"]),
... }

Pandas - DataFrame Objects

Creating a DataFrame

>>> people = pd.DataFrame(people_dict)


>>> people

Pandas - DataFrame Objects

Creating a DataFrame - Important Notes

● The Series were automatically aligned based on their index


● Missing values are represented as NaN
● Series names are ignored (the name "year" was dropped)
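
The three notes above can be checked on a reduced version of people_dict. This sketch leaves out the "hobby" column for brevity; everything else follows the slide's data.

```python
import pandas as pd

people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992],
                           index=["bob", "alice", "charles"], name="year"),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
}
people = pd.DataFrame(people_dict)

# Values follow their index labels, not their position in the list.
print(people.loc["alice", "birthyear"])     # 1985

# alice has no entry in "children", so that cell is NaN and the
# column becomes float64.
print(people["children"].isna().to_dict())
print(people["children"].dtype)             # float64

# The Series name "year" is not used; the dict key names the column.
print(list(people.columns))
```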

Pandas - DataFrame Objects

DataFrame - Access a column

>>> people["birthyear"]

Output -

alice 1985
bob 1984
charles 1992
Name: birthyear, dtype: int64

Pandas - DataFrame Objects

DataFrame - Access the multiple columns

>>> people[["birthyear", "hobby"]]

Output -

Pandas - DataFrame Objects

Creating DataFrame - Include columns and/or rows and


guarantee order

>>> d2 = pd.DataFrame(
people_dict,
columns=["birthyear", "weight", "height"],
index=["bob", "alice", "eugene"]
)
>>> print(d2)

Pandas - DataFrame Objects

DataFrame - Accessing rows

● Using loc
○ people.loc["charles"]
● Using iloc
○ people.iloc[2]
Output -
birthyear 1992
children 0
hobby NaN
weight 112
Name: charles, dtype: object
Pandas - DataFrame Objects

DataFrame - Get a slice of rows

>>> people.iloc[1:3]

Output -

Pandas - DataFrame Objects

DataFrame - Pass a boolean array

>>> people[np.array([True, False, True])]

Output -

Pandas - DataFrame Objects

DataFrame - Pass boolean expression

>>> people[people["birthyear"] < 1990]

Output -

Pandas - DataFrame Objects

DataFrame - Adding and removing columns

>>> # Adds a new column "age"


>>> people["age"] = 2016 - people["birthyear"]

>>> # Adds another column "over 30"


>>> people["over 30"] = people["age"] > 30

>>> # Removes "birthyear" and "children" columns


>>> birthyears = people.pop("birthyear")
>>> del people["children"]

>>> people

Pandas - DataFrame Objects

DataFrame - A new column must have the same number of rows

>>> # alice is missing, eugene is ignored

>>> people["pets"] = pd.Series({


"bob": 0,
"charles": 5,
"eugene":1
})

>>> people

Pandas - DataFrame Objects

DataFrame - Add a new column using insert method after an


existing column

>>> people.insert(1, "height", [172, 181, 185])


>>> people

Pandas - DataFrame Objects

DataFrame - Add new columns using assign method

>>> (people
.assign(body_mass_index = lambda df:df["weight"]
/ (df["height"] / 100) ** 2)
.assign(overweight = lambda df:
df["body_mass_index"] > 25)
)
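
A point worth noting about assign is that it returns a new DataFrame instead of modifying the original. The sketch below uses a simplified `people` frame with only the weight and height columns; the name `with_bmi` is illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83, 112], "height": [172, 181, 185]},
    index=["alice", "bob", "charles"],
)

# assign() returns a new DataFrame and leaves `people` unchanged,
# so the result has to be stored to be used later.
with_bmi = (
    people
    .assign(body_mass_index=lambda df: df["weight"] / (df["height"] / 100) ** 2)
    .assign(overweight=lambda df: df["body_mass_index"] > 25)
)

print(with_bmi.round(2))
print("body_mass_index" in people.columns)  # False
```

Chaining two assign calls lets the second lambda refer to the column created by the first.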

Pandas - DataFrame Objects

DataFrame - Sorting a DataFrame

● Use sort_index method


○ It sorts the rows by their index label
○ In ascending order
○ Reverse the order by passing ascending=False
○ Returns a sorted copy of DataFrame

Pandas - DataFrame Objects

DataFrame - Sorting a DataFrame

>>> people.sort_index(ascending=False)

Pandas - DataFrame Objects

DataFrame - Sorting a DataFrame - inplace argument

>>> people.sort_index(inplace=True)
>>> people

Pandas - DataFrame Objects

DataFrame - Sorting a DataFrame - Sort By Value

>>> people.sort_values(by="age", inplace=True)


>>> people

Pandas - DataFrame Objects

Plotting a DataFrame

>>> # Assumes the "body_mass_index" column returned by assign() was stored back into people
>>> people.plot(
...     kind = "line",
...     x = "body_mass_index",
...     y = ["height", "weight"]
... )
>>> plt.show()

Pandas - DataFrame Objects

DataFrames - Saving and Loading

● Pandas can save DataFrames to various backends such as


○ CSV
○ Excel (requires openpyxl library)
○ JSON
○ HTML
○ SQL database

Pandas - DataFrame Objects

DataFrames - Saving

Let’s create a new DataFrame my_df and save it in various formats

>>> my_df = pd.DataFrame(
...     [
...         ["Biking", 68.5, 1985, np.nan],
...         ["Dancing", 83.1, 1984, 3]
...     ],
...     columns=["hobby","weight","birthyear","children"],
...     index=["alice", "bob"]
... )
>>> my_df

Pandas - DataFrame Objects

DataFrames - Saving

● Save to CSV
○ >>> my_df.to_csv("my_df.csv")
● Save to HTML
○ >>> my_df.to_html("my_df.html")
● Save to JSON
○ >>> my_df.to_json("my_df.json")

Pandas - DataFrame Objects

DataFrames - What was saved?

>>> for filename in ("my_df.csv", "my_df.html", "my_df.json"):
...     print("#", filename)
...     with open(filename, "rt") as f:
...         print(f.read())
...     print()

Pandas - DataFrame Objects

DataFrames - What was saved?

Note that the index is saved as the first column (with no name) in a CSV file

Pandas - DataFrame Objects
DataFrames - What was saved?

Note that the index is saved as <th> tags in HTML

Pandas - DataFrame Objects

DataFrames - What was saved?

Note that the index is saved as keys in JSON

Pandas - DataFrame Objects

DataFrames - Loading

● read_csv # For loading CSV files

● read_html # For loading HTML files

● read_excel # For loading Excel files

Pandas - DataFrame Objects

DataFrames - Load CSV file

>>> my_df_loaded = pd.read_csv("my_df.csv", index_col=0)

>>> my_df_loaded
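
The save-and-load round trip can be tried without touching the disk. The sketch below writes the CSV to an in-memory `io.StringIO` buffer instead of "my_df.csv"; that substitution is an assumption made here to keep the example self-contained.

```python
import io
import numpy as np
import pandas as pd

my_df = pd.DataFrame(
    [["Biking", 68.5, 1985, np.nan],
     ["Dancing", 83.1, 1984, 3]],
    columns=["hobby", "weight", "birthyear", "children"],
    index=["alice", "bob"],
)

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
my_df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# index_col=0 turns the unnamed first column back into the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

Without `index_col=0`, the saved index would come back as an ordinary column and a fresh integer index would be created.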

Pandas - DataFrame Objects

DataFrames - Overview

● When dealing with large DataFrames, it is useful to get a quick overview


of its content
● Load housing.csv inside dataset directory to create a DataFrame and
get a quick overview

Pandas - DataFrame Objects

DataFrames - Overview

● Let’s understand below methods


○ head()
○ tail()
○ info()
○ describe()

Pandas - DataFrame Objects

DataFrames - Overview - head()

● The head method returns the top 5 rows

>>> housing = pd.read_csv("dataset/housing.csv")


>>> housing.head()

Pandas - DataFrame Objects

DataFrames - Overview - tail()

● The tail method returns the bottom 5 rows


● We can also pass the number of rows we want

>>> housing.tail(n=2)

Pandas - DataFrame Objects

DataFrames - Overview - info()

● The info method prints out the summary of each column's contents

>>> housing.info()

Pandas - DataFrame Objects

DataFrames - Overview - describe()

● The describe method gives a nice overview of the main aggregated


values over each column
○ count: number of non-null (not NaN) values
○ mean: mean of non-null values
○ std: standard deviation of non-null values
○ min: minimum of non-null values
○ 25%, 50%, 75%: 25th, 50th and 75th percentile of non-null values
○ max: maximum of non-null values
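
The rows returned by describe can be read off a tiny example. The DataFrame below is a made-up stand-in, not the housing dataset from the slides, and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for a real dataset; the column names are made up.
df = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 5.0],
    "price": [100.0, 150.0, 200.0, 250.0],
})

summary = df.describe()
print(summary)

# count skips the NaN in "rooms"; the 50% row is the median.
print(summary.loc["count", "rooms"])  # 3.0
print(summary.loc["50%", "price"])    # 175.0
print(df.head(2))                     # first two rows only
```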
References

● Pandas
○ http://pandas.pydata.org/pandas-docs/stable/

Questions?
https://discuss.cloudxlab.com
Matplotlib

Matplotlib - Overview

● Matplotlib is a Python 2D plotting library


● Produces publication quality figures in a variety of
○ Hardcopy formats and
○ Interactive environments

Matplotlib - Overview

● Matplotlib can be used in


○ Python scripts
○ Python and IPython shell
○ Jupyter notebook
○ Web application servers
○ GUI toolkits

Matplotlib - pyplot Module

● matplotlib.pyplot
○ Collection of functions that make matplotlib work like MATLAB
○ Majority of plotting commands in pyplot have MATLAB analogs with
similar arguments

Matplotlib - pyplot Module - plot()

>>> import matplotlib.pyplot as plt


>>> plt.plot([1,2,3,4])
>>> plt.ylabel('some numbers')
>>> plt.show()

Matplotlib - pyplot Module - plot()

plot x versus y
>>> import matplotlib.pyplot as plt
>>> plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
>>> plt.ylabel('some numbers')
>>> plt.show()

Matplotlib - pyplot Module - Histogram

>>> import matplotlib.pyplot as plt
>>> x = [21,22,23,4,5,6,77,8,9,10,31,32,33,34,35,36,37,18,49,50,100]
>>> num_bins = 5
>>> plt.hist(x, num_bins, facecolor='blue')
>>> plt.show()
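
To see the numbers behind the histogram without opening a plot window, the binning can be computed with `np.histogram`. This is a sketch added for illustration; it uses the same data and number of bins as the slide.

```python
import numpy as np

x = [21, 22, 23, 4, 5, 6, 77, 8, 9, 10, 31, 32, 33, 34, 35, 36, 37,
     18, 49, 50, 100]

# np.histogram applies the same equal-width binning rule as plt.hist,
# but returns the numbers instead of drawing them.
counts, edges = np.histogram(x, bins=5)

print(edges)   # 6 edges for 5 bins, from min(x) to max(x)
print(counts)  # number of values falling in each bin
```

Each bar in the plotted histogram corresponds to one entry of `counts`, spanning the interval between two consecutive entries of `edges`.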

References

● Matplotlib
○ https://matplotlib.org/tutorials/index.html

Questions?
https://discuss.cloudxlab.com