Lab 05 Presentation

Uploaded by

amisskpop

What are the chances?
INTRODUCTION TO STATISTICS IN PYTHON

Maggie Matsui
Content Developer, DataCamp
Measuring chance
What's the probability of an event?

$$P(\text{event}) = \frac{\#\text{ ways event can happen}}{\text{total } \#\text{ of possible outcomes}}$$

Example: a coin flip

$$P(\text{heads}) = \frac{1 \text{ way to get heads}}{2 \text{ possible outcomes}} = \frac{1}{2} = 50\%$$

INTRODUCTION TO STATISTICS IN PYTHON


Assigning salespeople

$$P(\text{Brian}) = \frac{1}{4} = 25\%$$


Sampling from a DataFrame
print(sales_counts)

     name  n_sales
0    Amir      178
1   Brian      128
2  Claire       75
3  Damian       69

sales_counts.sample()

    name  n_sales
1  Brian      128

sales_counts.sample()

     name  n_sales
2  Claire       75


Setting a random seed
np.random.seed(10)
sales_counts.sample()

    name  n_sales
1  Brian      128

np.random.seed(10)
sales_counts.sample()

    name  n_sales
1  Brian      128

np.random.seed(10)
sales_counts.sample()

    name  n_sales
1  Brian      128
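A minimal sketch of the reproducibility idea above. The `sales_counts` DataFrame is rebuilt here from the values printed on the slide (an assumption, since the original data file is not shown), and `random_state=` is shown as an alternative to seeding NumPy's global generator. Which row comes back depends on the library versions, so no specific row is claimed.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the table printed on the slide.
sales_counts = pd.DataFrame({
    "name": ["Amir", "Brian", "Claire", "Damian"],
    "n_sales": [178, 128, 75, 69],
})

# Re-seeding NumPy's global generator before each call repeats the same draw.
np.random.seed(10)
first = sales_counts.sample()
np.random.seed(10)
second = sales_counts.sample()
print(first.equals(second))  # True

# Alternative: pass the seed to the call itself, leaving global state alone.
third = sales_counts.sample(random_state=10)
fourth = sales_counts.sample(random_state=10)
print(third.equals(fourth))  # True
```

Passing `random_state` keeps the seed local to one call, which can be easier to reason about than global seeding in longer scripts.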


A second meeting
Sampling without replacement

$$P(\text{Claire}) = \frac{1}{3} \approx 33\%$$


Sampling twice in Python
sales_counts.sample(2)

name n_sales
1 Brian 128
2 Claire 75



Sampling with replacement

$$P(\text{Claire}) = \frac{1}{4} = 25\%$$


Sampling with/without replacement in Python
sales_counts.sample(5, replace = True)

name n_sales
1 Brian 128
2 Claire 75
1 Brian 128
3 Damian 69
0 Amir 178

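The difference between the two modes can be sketched as follows. The DataFrame is again a hypothetical reconstruction of the slide's table. Drawing 5 rows from 4 only works with `replace=True`; without replacement pandas raises a `ValueError`.

```python
import pandas as pd

# Hypothetical reconstruction of the table printed on the slides.
sales_counts = pd.DataFrame({
    "name": ["Amir", "Brian", "Claire", "Damian"],
    "n_sales": [178, 128, 75, 69],
})

# With replacement: 5 picks from 4 rows is fine, so at least one row repeats.
with_repl = sales_counts.sample(5, replace=True, random_state=0)
print(with_repl.index.duplicated().any())  # True

# Without replacement: each row can be picked at most once.
no_repl = sales_counts.sample(4, replace=False, random_state=0)
print(no_repl.index.is_unique)  # True

# Asking for more rows than exist, without replacement, is an error.
try:
    sales_counts.sample(5, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)  # True
```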


Independent events
Two events are independent if the probability of the second event isn't affected by the outcome of the first event.

Sampling with replacement = each pick is independent


Dependent events
Two events are dependent if the probability of the second event is affected by the outcome of the first event.

Sampling without replacement = each pick is dependent
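One way to check the two definitions is to enumerate every equally likely pair of picks and compute the conditional probability exactly. This is an illustrative sketch; the helper `p_second_given_first` is a made-up name, not part of the course material.

```python
from fractions import Fraction
from itertools import permutations, product

names = ["Amir", "Brian", "Claire", "Damian"]

def p_second_given_first(pairs, first, second):
    """P(second pick == second | first pick == first) over equally likely ordered pairs."""
    given = [pair for pair in pairs if pair[0] == first]
    hits = [pair for pair in given if pair[1] == second]
    return Fraction(len(hits), len(given))

without_repl = list(permutations(names, 2))  # 12 ordered pairs, no repeats
with_repl = list(product(names, repeat=2))   # 16 ordered pairs, repeats allowed

# Without replacement, the first pick changes the odds of the second (dependent).
print(p_second_given_first(without_repl, "Brian", "Claire"))   # 1/3
print(p_second_given_first(without_repl, "Claire", "Claire"))  # 0

# With replacement, the odds stay at 1/4 whatever happened first (independent).
print(p_second_given_first(with_repl, "Brian", "Claire"))   # 1/4
print(p_second_given_first(with_repl, "Claire", "Claire"))  # 1/4
```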


Let's practice!
Discrete distributions

Rolling the dice

Choosing salespeople


Probability distribution
Describes the probability of each possible outcome in a scenario

Expected value: mean of a probability distribution

Expected value of a fair die roll =

$$(1 \times \tfrac{1}{6}) + (2 \times \tfrac{1}{6}) + (3 \times \tfrac{1}{6}) + (4 \times \tfrac{1}{6}) + (5 \times \tfrac{1}{6}) + (6 \times \tfrac{1}{6}) = 3.5$$


Visualizing a probability distribution



Probability = area
P(die roll ≤ 2) = ?

P(die roll ≤ 2) = 2 × 1/6 = 1/3


Uneven die

Expected value of uneven die roll =

$$(1 \times \tfrac{1}{6}) + (2 \times 0) + (3 \times \tfrac{1}{3}) + (4 \times \tfrac{1}{6}) + (5 \times \tfrac{1}{6}) + (6 \times \tfrac{1}{6}) \approx 3.67$$


Visualizing uneven probabilities



Adding areas
P(uneven die roll ≤ 2) = ?

P(uneven die roll ≤ 2) = 1/6 + 0 = 1/6
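The expected values and "area" probabilities for both dice can be reproduced with exact fractions. This is a small sketch; the helper names `expected_value` and `prob_at_most` are illustrative only.

```python
from fractions import Fraction

# Probabilities read off the slides; Fraction keeps the arithmetic exact.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
uneven_die = {1: Fraction(1, 6), 2: Fraction(0), 3: Fraction(1, 3),
              4: Fraction(1, 6), 5: Fraction(1, 6), 6: Fraction(1, 6)}

def expected_value(dist):
    """Mean of a discrete distribution: sum of outcome * probability."""
    return sum(face * prob for face, prob in dist.items())

def prob_at_most(dist, limit):
    """P(outcome <= limit): add the bar areas up to and including limit."""
    return sum(prob for face, prob in dist.items() if face <= limit)

for name, dist in [("fair", fair_die), ("uneven", uneven_die)]:
    assert sum(dist.values()) == 1  # a valid distribution sums to 1
    print(name, expected_value(dist), prob_at_most(dist, 2))
# fair 7/2 1/3
# uneven 11/3 1/6
```

Here 7/2 is 3.5 and 11/3 is roughly 3.67, matching the slide values.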


Discrete probability distributions
Describe probabilities for discrete outcomes

Fair die: discrete uniform distribution
Uneven die


Sampling from discrete distributions
print(die)

   number      prob
0       1  0.166667
1       2  0.166667
2       3  0.166667
3       4  0.166667
4       5  0.166667
5       6  0.166667

np.mean(die['number'])

3.5

rolls_10 = die.sample(10, replace = True)
rolls_10

   number      prob
0       1  0.166667
0       1  0.166667
4       5  0.166667
1       2  0.166667
0       1  0.166667
0       1  0.166667
5       6  0.166667
5       6  0.166667
...


Visualizing a sample
rolls_10['number'].hist(bins=np.linspace(1,7,7))
plt.show()



Sample distribution vs. theoretical distribution
Sample of 10 rolls: np.mean(rolls_10['number']) = 3.0
Theoretical probability distribution: np.mean(die['number']) = 3.5


A bigger sample
Sample of 100 rolls: np.mean(rolls_100['number']) = 3.4
Theoretical probability distribution: np.mean(die['number']) = 3.5


An even bigger sample
Sample of 1000 rolls: np.mean(rolls_1000['number']) = 3.48
Theoretical probability distribution: np.mean(die['number']) = 3.5


Law of large numbers
As the size of your sample increases, the sample mean will approach the expected value.

Sample size   Mean
10            3.00
100           3.40
1000          3.48
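A quick simulation of the law of large numbers. This sketch uses NumPy's `default_rng` instead of the `die.sample(...)` call from the slides, and the seed 42 is arbitrary. The exact means will differ from the table above, but they should drift toward 3.5 as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

means = {}
for n in (10, 100, 1000, 100_000):
    # integers(1, 7) draws fair die faces 1..6 (the upper bound is exclusive)
    rolls = rng.integers(1, 7, size=n)
    means[n] = rolls.mean()
    print(f"{n:>7} rolls: sample mean = {means[n]:.3f}")
```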


Continuous distributions

Waiting for the bus

Continuous uniform distribution


Probability still = area
P(4 ≤ wait time ≤ 7) = ?

P(4 ≤ wait time ≤ 7) = 3 × 1/12 = 3/12


Uniform distribution in Python
P(wait time ≤ 7)

from scipy.stats import uniform
uniform.cdf(7, 0, 12)

0.5833333


"Greater than" probabilities
P(wait time ≥ 7) = 1 − P(wait time ≤ 7)

from scipy.stats import uniform
1 - uniform.cdf(7, 0, 12)

0.4166667


P(4 ≤ wait time ≤ 7)

from scipy.stats import uniform
uniform.cdf(7, 0, 12) - uniform.cdf(4, 0, 12)

0.25
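One detail worth spelling out: SciPy's `uniform` is parameterised by `loc` and `scale`, covering the interval from `loc` to `loc + scale`. The calls above read like (min, max) only because the minimum wait is 0. The sketch below makes that explicit; the shifted 2-to-12 example is an illustration, not something from the slides.

```python
from scipy.stats import uniform

# Wait time uniform on [0, 12]: loc is the lower bound, scale is the width.
low, high = 0, 12
wait = uniform(loc=low, scale=high - low)

p_le_7 = wait.cdf(7)                  # P(wait <= 7)
p_ge_7 = 1 - wait.cdf(7)              # P(wait >= 7)
p_4_to_7 = wait.cdf(7) - wait.cdf(4)  # P(4 <= wait <= 7)
print(p_le_7, p_ge_7, p_4_to_7)

# Hypothetical shifted case: uniform on [2, 12] needs scale=10, not 12.
shifted = uniform(loc=2, scale=10)
print(shifted.cdf(7))  # 5 of the 10 minutes lie at or below 7
```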


Total area = 1
P(0 ≤ wait time ≤ 12) = ?

P(0 ≤ wait time ≤ 12) = 12 × 1/12 = 1


Generating random numbers according to uniform distribution
from scipy.stats import uniform
uniform.rvs(0, 5, size=10)

array([1.89740094, 4.70673196, 0.33224683, 1.0137103 , 2.31641255,
       3.49969897, 0.29688598, 0.92057234, 4.71086658, 1.56815855])


Let's practice!
The binomial distribution

Coin flipping



Binary outcomes



A single flip
binom.rvs(# of coins, probability of heads/success, size=# of trials)

1 = heads, 0 = tails

from scipy.stats import binom
binom.rvs(1, 0.5, size=1)

array([1])


One flip many times
binom.rvs(1, 0.5, size=8)

array([0, 1, 1, 0, 1, 0, 1, 1])



Many flips one time
binom.rvs(8, 0.5, size=1)

array([5])



Many flips many times
binom.rvs(3, 0.5, size=10)

array([0, 3, 2, 1, 3, 0, 2, 2, 0, 0])



Other probabilities
binom.rvs(3, 0.25, size=10)

array([1, 1, 1, 1, 0, 0, 2, 0, 1, 0])



Binomial distribution
Probability distribution of the number of successes in a sequence of independent trials

E.g. Number of heads in a sequence of coin flips

Described by n and p
n: total number of trials
p: probability of success


What's the probability of 7 heads?
P(heads = 7)

# binom.pmf(num heads, num trials, prob of heads)
binom.pmf(7, 10, 0.5)

0.1171875


What's the probability of 7 or fewer heads?
P (heads ≤ 7)

binom.cdf(7, 10, 0.5)

0.9453125



What's the probability of more than 7 heads?
P (heads > 7)

1 - binom.cdf(7, 10, 0.5)

0.0546875

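The three results above can be cross-checked against the textbook formula P(k) = C(n, k) p^k (1 − p)^(n − k). This is a sketch; `binom_pmf` is an illustrative helper, not a SciPy function.

```python
from math import comb
from scipy.stats import binom

n, p = 10, 0.5  # 10 coin flips, fair coin

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

manual_pmf = binom_pmf(7, n, p)                         # P(heads = 7)
manual_cdf = sum(binom_pmf(k, n, p) for k in range(8))  # P(heads <= 7)
manual_tail = 1 - manual_cdf                            # P(heads > 7)

print(manual_pmf, binom.pmf(7, n, p))   # both 0.1171875
print(manual_cdf, binom.cdf(7, n, p))   # both 0.9453125
print(manual_tail, binom.sf(7, n, p))   # both 0.0546875
```

`binom.sf` is SciPy's survival function, which returns the same value as `1 - binom.cdf`.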


Expected value
Expected value = n × p

Expected number of heads out of 10 flips = 10 × 0.5 = 5



Independence
The binomial distribution is a probability distribution of the number of successes in a sequence of independent trials

Probabilities of the second trial are altered due to the outcome of the first

If trials are not independent, the binomial distribution does not apply!


Let's practice!
The normal distribution

What is the normal distribution?

Symmetrical

Area = 1

Curve never hits 0


Described by mean and standard deviation

Mean: 20

Standard deviation: 3

Standard normal distribution

Mean: 0

Standard deviation: 1



Areas under the normal distribution
68% falls within 1 standard deviation
95% falls within 2 standard deviations
99.7% falls within 3 standard deviations
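The 68 / 95 / 99.7 figures are rounded values and can be checked with `norm.cdf` on the standard normal distribution. A short sketch:

```python
from scipy.stats import norm

within = {}
for k in (1, 2, 3):
    # Area between -k and +k standard deviations of a standard normal
    within[k] = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {within[k]:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

The same areas hold for any mean and standard deviation, because the rule is stated in units of standard deviations.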


Lots of histograms look normal
Normal distribution
Women's heights from NHANES
Mean: 161 cm
Standard deviation: 7 cm


Approximating data with the normal distribution



What percent of women are shorter than 154 cm?
from scipy.stats import norm
norm.cdf(154, 161, 7)

0.158655

16% of women in the survey are shorter than 154 cm


What percent of women are taller than 154 cm?
from scipy.stats import norm
1 - norm.cdf(154, 161, 7)

0.841345



What percent of women are 154-157 cm?

norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)

0.1252


What height are 90% of women shorter than?
norm.ppf(0.9, 161, 7)

169.97086



What height are 90% of women taller than?
norm.ppf((1-0.9), 161, 7)

152.029

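`norm.ppf` is the inverse of `norm.cdf`: it takes a probability and returns the height below which that share of the distribution falls. A small sketch using the slide's mean of 161 cm and standard deviation of 7 cm:

```python
from scipy.stats import norm

mean, sd = 161, 7

h90 = norm.ppf(0.9, mean, sd)  # 90% of women are shorter than this
h10 = norm.ppf(0.1, mean, sd)  # 90% of women are taller than this

# Feeding the height back into cdf recovers the probability.
print(norm.cdf(h90, mean, sd))

# By symmetry, the two cut-offs sit equally far from the mean.
print(h90 - mean, mean - h10)
```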


Generating random numbers
# Generate 10 random heights
norm.rvs(161, 7, size=10)

array([155.5758223 , 155.13133235, 160.06377097, 168.33345778,
       165.92273375, 163.32677057, 165.13280753, 146.36133538,
       149.07845021, 160.5790856 ])


Let's practice!
The central limit theorem

Rolling the dice 5 times
die = pd.Series([1, 2, 3, 4, 5, 6])
# Roll 5 times
samp_5 = die.sample(5, replace=True)
print(samp_5)

array([3, 1, 4, 1, 1])

np.mean(samp_5)

2.0



Rolling the dice 5 times
# Roll 5 times and take mean
samp_5 = die.sample(5, replace=True)
np.mean(samp_5)

4.4

samp_5 = die.sample(5, replace=True)
np.mean(samp_5)

3.8


Rolling the dice 5 times 10 times
Repeat 10 times:
    Roll 5 times
    Take the mean

sample_means = []
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))
print(sample_means)

[3.8, 4.0, 3.8, 3.6, 3.2, 4.8, 2.6, 3.0, 2.6, 2.0]


Sampling distributions
Sampling distribution of the sample mean



100 sample means
sample_means = []
for i in range(100):
    sample_means.append(np.mean(die.sample(5, replace=True)))

1000 sample means
sample_means = []
for i in range(1000):
    sample_means.append(np.mean(die.sample(5, replace=True)))
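A vectorised sketch of the same experiment. It swaps the pandas loop from the slides for a single NumPy `choice` call, and the seed 0 and helper name `sample_means_of` are arbitrary choices. For a fair die the sample means should centre near 3.5, with a spread of about sqrt(35/12) / sqrt(5) ≈ 0.76 for 5 rolls.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
die = np.arange(1, 7)           # faces 1..6

def sample_means_of(n_means, n_rolls=5):
    """Roll the die n_rolls times, n_means times over, and average each row."""
    rolls = rng.choice(die, size=(n_means, n_rolls), replace=True)
    return rolls.mean(axis=1)

for n_means in (10, 100, 1000):
    means = sample_means_of(n_means)
    print(n_means, round(means.mean(), 3), round(means.std(), 3))

means_1000 = sample_means_of(1000)
```

Plotting `means_1000` as a histogram should show the roughly bell-shaped sampling distribution that the slides describe.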


Central limit theorem
The sampling distribution of a statistic becomes closer to the normal distribution as the
number of trials increases.

* Samples should be random and independent



Sampling distribution of proportion



Mean of sampling distribution
# Estimate expected value of die
np.mean(sample_means)

3.48

# Estimate proportion of "Claire"s
np.mean(sample_props)

0.26

Estimate characteristics of unknown underlying distribution
More easily estimate characteristics of large populations


The Poisson distribution

Poisson processes
Events appear to happen at a certain rate, but completely at random

Examples
Number of animals adopted from an animal shelter per week
Number of people arriving at a restaurant per hour
Number of earthquakes in California per year

Time unit is irrelevant, as long as you use the same unit when talking about the same situation


Poisson distribution
Probability of some # of events occurring over a fixed period of time

Examples
Probability of ≥ 5 animals adopted from an animal shelter per week

Probability of 12 people arriving at a restaurant per hour

Probability of < 20 earthquakes in California per year



Lambda (λ)
λ = average number of events per time interval
Average number of adoptions per week = 8



Probability of a single value
If the average number of adoptions per week is 8, what is P (# adoptions in a week = 5)?

from scipy.stats import poisson
poisson.pmf(5, 8)

0.09160366


Probability of less than or equal to
If the average number of adoptions per week is 8, what is P (# adoptions in a week ≤ 5)?

from scipy.stats import poisson
poisson.cdf(5, 8)

0.1912361


Probability of greater than
If the average number of adoptions per week is 8, what is P (# adoptions in a week > 5)?

1 - poisson.cdf(5, 8)

0.8087639

If the average number of adoptions per week is 10, what is P (# adoptions in a week > 5)?

1 - poisson.cdf(5, 10)

0.932914

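The Poisson results above follow from the formula P(k) = e^(−λ) λ^k / k!. The sketch below recomputes them by hand and compares with SciPy; `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial
from scipy.stats import poisson

lam = 8  # average number of adoptions per week

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return exp(-lam) * lam**k / factorial(k)

manual_pmf = poisson_pmf(5, lam)                         # P(adoptions = 5)
manual_cdf = sum(poisson_pmf(k, lam) for k in range(6))  # P(adoptions <= 5)
manual_tail = 1 - manual_cdf                             # P(adoptions > 5)

print(round(manual_pmf, 8), round(poisson.pmf(5, lam), 8))   # 0.09160366 twice
print(round(manual_cdf, 7), round(poisson.cdf(5, lam), 7))   # 0.1912361 twice
print(round(manual_tail, 7), round(poisson.sf(5, lam), 7))   # 0.8087639 twice
```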


The CLT still applies!



Let's practice!