Pandas Summarized Visually in 8
pandas
Methods to read data are all named pd.read_*, where * is the file type. Series and DataFrames can be saved to disk using their to_* method.
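As a minimal sketch of the read_*/to_* pairing, the round trip below writes a small DataFrame to CSV text and reads it back. The data and variable names are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical data, used only to illustrate the naming pattern.
df = pd.DataFrame({"X": [1, 2, 3], "Y": [4.0, 5.5, 6.0]}, index=["a", "b", "c"])

# to_csv without a path returns the CSV text instead of writing a file.
csv_text = df.to_csv()

# read_csv accepts a path or a file-like object; index_col=0 restores the row labels.
df_back = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Passing a file name to both calls instead of the buffer gives the on-disk version of the same round trip.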
Usage Patterns
• Use pd.read_clipboard() for one-off data extractions.
Reading Text Files into a DataFrame
Colors highlight how different arguments map from the data file to a DataFrame.
    # Historical_data.csv
    Date, Cs, Rd
    2005-01-03, 64.78, -
    2005-01-04, 63.79, 201.4
    2005-01-05, 64.46, 193.45
    ...
    Data from Lab Z.
    Recorded by Agent E

>>> read_table(
        'historical_data.csv',
        sep=',',
        header=1,
        skiprows=1,
        skipfooter=2,
        index_col=0,
        parse_dates=True,
        na_values=['-'])
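The following is a runnable sketch of the same parse, with the file contents inlined (minus the elided rows) so that nothing has to exist on disk. It uses pd.read_csv, which with sep=',' reads the same data as read_table. Two details are assumptions of this sketch rather than part of the call above: header is counted after the skipped rows, so header=0 is used for the file exactly as laid out here, and skipinitialspace=True strips the blank that follows each comma.

```python
import io

import pandas as pd

# Contents of the example file, inlined for a self-contained sketch.
raw = """# Historical_data.csv
Date, Cs, Rd
2005-01-03, 64.78, -
2005-01-04, 63.79, 201.4
2005-01-05, 64.46, 193.45
Data from Lab Z.
Recorded by Agent E"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=",",
    skiprows=1,             # skip the leading comment line
    header=0,               # counted after the skipped rows (assumption for this layout)
    skipfooter=2,           # drop the two trailing note lines
    engine="python",        # skipfooter requires the Python parsing engine
    skipinitialspace=True,  # assumption: strip the blank after each comma
    index_col=0,            # use the Date column as the index
    parse_dates=True,       # parse the index as dates
    na_values=["-"],        # treat '-' as missing
)
```

The result has a DatetimeIndex named Date, two float columns, and a missing value where the file contains '-'.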
>>> df_list = read_html(url)
Take your Pandas skills to the next level! Register at www.enthought.com/pandas-mastery-workshop
© 2019 Enthought, Inc., licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/
Pandas Data Structures: Series and DataFrames
A Series, s, maps an index to values. It is:
• Like an ordered dictionary
• A Numpy array with row labels and a name
A DataFrame, df, maps index and column labels to values. It is:
• Like a dictionary of Series (columns) sharing the same index
• A 2D Numpy array with row and column labels
s_df applies to both Series and DataFrames.
Assume that manipulations of Pandas objects return copies.
Indexing and Slicing
Use these attributes on Series and DataFrames for indexing, slicing, and assignments:
s_df.loc[]            Refers only to the index labels
s_df.iloc[]           Refers only to the integer location, similar to lists or Numpy arrays
s_df.xs(key, level)   Select rows with label key in level level of an object with MultiIndex.
Creating Series and DataFrames
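A minimal sketch of creating both structures and selecting from them with loc, iloc and xs might look like this. All labels and values are hypothetical.

```python
import pandas as pd

# A Series maps an index to values and can carry a name.
s = pd.Series({"a": 1, "b": 2, "c": 3}, name="X")

# A DataFrame is like a dictionary of columns sharing one index.
df = pd.DataFrame({"X": [1, 2, 3], "Y": [10.0, 20.0, 30.0]}, index=["a", "b", "c"])

by_label = df.loc["b", "Y"]      # label-based: row 'b', column 'Y'
by_position = df.iloc[1, 1]      # position-based: second row, second column
label_slice = s.loc["a":"b"]     # label slices include both endpoints
position_slice = s.iloc[0:2]     # position slices exclude the stop, as with lists

# xs selects by label within one level of a MultiIndex.
midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["letter", "number"])
s_multi = pd.Series([10, 20, 30, 40], index=midx)
ones = s_multi.xs(1, level="number")   # rows where number == 1, indexed by letter
```

Note the difference between the two slices: the label slice and the position slice return the same two rows here, but only because label slicing keeps its end point.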
Computation with Series and DataFrames
Pandas objects do not behave exactly like Numpy arrays. They follow three main rules (listed below). Aligning objects on the index (or columns) before calculations might be the most important difference. There are built-in methods for most common statistical operations, such as mean or sum, and they apply across one dimension at a time. To apply custom functions, use one of three methods to do tablewise (pipe), row or column-wise (apply) or elementwise (applymap) operations.
The 3 Rules of Binary Operations
Rule 1: Operations between multiple Pandas objects implement auto alignment based on index first.
Rule 2: Mathematical operators (+ - * / exp, log, ...) apply element by element, on the values.
Rule 3: Reduction operations (mean, std, skew, kurt, sum, prod, ...) are applied column by column by default.
Rule 1: Alignment First
> s1 + s2
> s1.add(s2, fill_value=0)
Use add, sub, mul, div, to set fill value.
Rule 2: Element-By-Element Mathematical Operations
df + 1    df.abs()    np.log(df)
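Rules 1 and 2 can be sketched with two small Series whose indexes only partly overlap. The values are chosen for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([4, 5], index=["b", "c"])

# Rule 1: values are matched by label first; labels present on one side only give NaN.
aligned = s1 + s2

# The method form accepts fill_value, which stands in for the missing side.
filled = s1.add(s2, fill_value=0)

# Rule 2: operators and Numpy functions act on each value independently.
df = pd.DataFrame({"X": [-1.0, -1.0], "Y": [-1.0, -1.0]}, index=["a", "b"])
shifted = df + 1
absolute = df.abs()
logged = np.log(df.abs())
```

Only label b exists in both Series, so aligned is NaN at a and c, while filled keeps all three labels.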
Rule 3: Reduction Operations
>>> df.sum()
Operates across rows by default (axis=0, or axis='rows').
Operate across columns with axis=1 or axis='columns'.
Apply a Function to Each Value
Apply a function to each value in a Series or DataFrame:
s.apply(value_to_value)        Series
df.applymap(value_to_value)    DataFrame
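The sketch below shows a reduction along each axis, followed by the elementwise, column-wise and tablewise ways of applying a custom function. The data and the add_total helper are hypothetical. Since pandas 2.1, DataFrame.applymap is deprecated in favor of DataFrame.map, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 3], "Y": [10, 20, 30]}, index=["a", "b", "c"])

col_sums = df.sum()          # axis=0: collapses the rows, one value per column
row_sums = df.sum(axis=1)    # axis=1: collapses the columns, one value per row

# Elementwise on a Series.
squared = df["X"].apply(lambda v: v ** 2)

# Elementwise on a DataFrame: use DataFrame.map when available, else applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
doubled = elementwise(lambda v: v * 2)

# Column-wise: the function receives one column (a Series) at a time.
ranges = df.apply(lambda col: col.max() - col.min())

# Tablewise: the function receives the whole DataFrame.
def add_total(frame):
    """Hypothetical helper that appends a row-total column."""
    return frame.assign(total=frame.sum(axis=1))

with_total = df.pipe(add_total)
```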
Plotting with Pandas Series and DataFrames
With a Series, Pandas plots values against the index:
> ax = s.plot()
With a DataFrame, Pandas creates one line per column:
> ax = df.plot()
When plotting the results of complex manipulations with groupby, it's often useful to stack/unstack the resulting DataFrame to fit the one-line-per-column assumption (see Data Structures cheatsheet).
Use Matplotlib to override or add annotations:
> ax.set_xlabel('Time')
> ax.set_ylabel('Value')
> ax.set_title('Experiment A')
Pass labels if you want to override the column names and set the legend location:
> ax.legend(labels, loc='best')
Useful Arguments to plot
• subplots=True: one subplot per column, instead of one line
• figsize: set figure size, in inches
• x and y: plot one column against another
Kinds of Plots
df.plot.scatter(x, y)    df.plot.bar()    df.plot.hist()    df.plot.box()
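Putting these calls together, a minimal sketch could look like the following. It assumes Matplotlib is installed and selects the non-interactive Agg backend so it also runs without a display. The data and the legend labels are made up.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, no window is opened

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"Y": np.arange(5.0), "Z": np.arange(5.0) ** 2})

# One line per column; plot returns a Matplotlib Axes that can be annotated.
ax = df.plot(figsize=(6, 4))
ax.set_xlabel("Time")
ax.set_ylabel("Value")
ax.set_title("Experiment A")
ax.legend(["first", "second"], loc="best")   # override the column names

# One subplot per column instead of one line per column.
axes = df.plot(subplots=True)

# Plot one column against another.
scatter_ax = df.plot.scatter(x="Y", y="Z")

plt.close("all")   # release the figures once they are no longer needed
```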
Manipulating Dates and Times
Use a Datetime index for easy time-based indexing and slicing, as
well as for powerful resampling and data alignment.
Timestamps vs Periods
Pandas makes a distinction between timestamps, called Datetime objects, and time spans, called Period objects.
Frequency Offsets
Used by date_range, period_range and resample:
• B: Business day
• D: Calendar day
• W: Weekly
• M: Month end
• MS: Month start
• BM: Business month end
• Q: Quarter end
• A: Year end
• AS: Year start
• H: Hourly
• T, min: Minutely
• S: Secondly
• L, ms: Milliseconds
• U, us: Microseconds
• N: Nanoseconds
For more: Look up "Pandas Offset Aliases" or check out the pandas.tseries.offsets and pandas.tseries.holiday modules.
Creating Ranges or Periods
> pd.period_range(start=None, end=None, periods=None, freq=offset)
Resampling
> s_df.resample(freq_offset).mean()
resample returns a groupby-like object that must be aggregated with mean, sum, std, apply, etc. (See also the Split-Apply-Combine cheat sheet.)
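As a short sketch of ranges, periods and resampling, the example below builds 60 daily values and reduces them to monthly means. The dates and values are hypothetical. It uses the D, M (for periods) and MS aliases, which the sketch assumes are accepted by the installed pandas version, since recent releases have renamed some of the other aliases.

```python
import numpy as np
import pandas as pd

# Timestamps mark instants: one entry per calendar day.
days = pd.date_range(start="2019-01-01", periods=60, freq="D")
s = pd.Series(np.arange(60.0), index=days)

# Periods represent spans: three monthly periods starting in January 2019.
months = pd.period_range(start="2019-01", periods=3, freq="M")

# A Datetime index supports selection by partial date string.
january = s.loc["2019-01"]

# resample groups rows into bins; the result must then be aggregated.
monthly_mean = s.resample("MS").mean()
```

The 60 days cover all of January and February plus March 1, so monthly_mean has three rows labelled by month start.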
Combining DataFrames
Tools for combining Series and DataFrames together, with SQL-type joins and concatenation. Use join if merging on indices, otherwise use merge.
Concatenating DataFrames
> pd.concat(df_list)
“Stacks” DataFrames on top of each other.
Set ignore_index=True to replace the index with a RangeIndex.
Note: Faster than repeated df.append(other_df) (df.append was removed in pandas 2.0).
Merge on Column Values
> pd.merge(left, right, how='inner', on='id')
Ignores index, unless on=None. See value of how below.
Use on if merging on same column in both DataFrames, otherwise use left_on, right_on.
Join on Index
> df.join(other)
Merge DataFrames on indexes. Set on=columns to join on index of other and on columns of df. join uses pd.merge under the covers.
Merge Types: The how Keyword
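To make concat, merge and the how keyword concrete, here is a small sketch with two hypothetical tables that share an id column but only overlap on ids 2 and 3.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [20, 30, 40]})

# Stack frames on top of each other and renumber the rows.
stacked = pd.concat([left, left], ignore_index=True)

# how decides which keys survive the merge.
inner = pd.merge(left, right, how="inner", on="id")      # keys present in both
outer = pd.merge(left, right, how="outer", on="id")      # keys present in either
left_join = pd.merge(left, right, how="left", on="id")   # all keys of the left frame

# join matches on the index instead of a column (a left join by default).
joined = left.set_index("id").join(right.set_index("id"))
```

Rows without a partner in the other table get NaN in the columns that come from that table, as with id 1 in joined.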
Split / Apply / Combine with DataFrames
1. Split the data based on some criteria.
2. Apply a function to each group to aggregate, transform, or filter.
3. Combine the results.
The apply and combine steps are typically done together in Pandas.
Split/Apply/Combine
Split
• Groupby
• Window Functions
Apply
• Apply
• Group-specific transformations
• Aggregation
• Group-specific Filtering
Combine
Split: Group By
Group by a single column:
> g = df.groupby(col_name)
Grouping with list of column names creates DataFrame with MultiIndex (see “Reshaping DataFrames and Pivot Tables” cheatsheet):
> g = df.groupby(list_col_names)
Pass a function to group based on the index:
> g = df.groupby(function)
Split: What’s a GroupBy Object?
It keeps track of which rows are part of which group.
> g.groups
Dictionary, where keys are group names, and values are indices of rows in a given group.
It is iterable:
> for group, sub_df in g:
      ...
Apply/Combine: General Tool: apply
More general than agg, transform, and filter. Can aggregate, transform or filter. The resulting dimensions can change, for example:
> g.apply(lambda x: x.describe())
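The split step and the GroupBy object can be explored with a small hypothetical table, as in the sketch below. One deviation from the call above: the sketch selects column Z before calling apply, because the treatment of the grouping column inside DataFrame-level apply differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "X": ["a", "b", "a", "b", "c"],
        "Y": [1, 3, 2, 1, 2],
        "Z": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

g = df.groupby("X")                     # group by a single column
g_multi = df.groupby(["X", "Y"])        # list of columns: results get a MultiIndex
g_func = df.groupby(lambda i: i % 2)    # function applied to each index label

# groups maps each group name to the row labels that belong to it.
groups = {name: list(rows) for name, rows in g.groups.items()}

# Iterating yields (group name, sub-DataFrame) pairs.
sizes = {name: len(sub_df) for name, sub_df in g}

# apply may change the shape: describe returns several rows per group.
described = g["Z"].apply(lambda x: x.describe())
```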
Apply/Combine: Transformation
The shape and the index do not change.
> g.transform(df_to_df)
Example, normalization:
> def normalize(grp):
.     return (grp - grp.mean()) / grp.var()
> g.transform(normalize)
Apply/Combine: Aggregation
Perform computations on each group. The shape changes; the categories in the grouping columns become the index. Can use built-in aggregation methods: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max, for example:
> g.mean()
… or aggregate using custom function:
> g.agg(series_to_value)
… or aggregate with multiple functions at once:
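The three group-specific operations can be compared side by side in a sketch like the one below, which uses hypothetical data. Passing a list of function names to agg is one way to run several aggregations at once, and filter keeps or drops whole groups based on a condition.

```python
import pandas as pd

df = pd.DataFrame({"X": ["a", "b", "a", "b", "c"], "Z": [1.0, 2.0, 3.0, 6.0, 5.0]})
g = df.groupby("X")["Z"]

# Aggregation: one value per group, with the group names as the index.
means = g.mean()
spans = g.agg(lambda s: s.max() - s.min())     # custom series-to-value function
several = g.agg(["mean", "sum", "count"])      # one column per aggregation

# Transformation: the result keeps the shape and index of the input.
def normalize(grp):
    # Same formula as in the text: center each group, divide by its variance.
    return (grp - grp.mean()) / grp.var()

normalized = g.transform(normalize)

# Filtering: keep only the rows of groups that have more than one row.
multi_row = df.groupby("X").filter(lambda sub_df: len(sub_df) > 1)
```

Group c has a single row, so its variance is undefined and its normalized value is NaN; the same group is dropped by the filter.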
g.filter(…)
Other Groupby-Like Operations: Window Functions
• resample, rolling, and ewm (exponential weighted function) methods behave like GroupBy objects. They keep track of which row is in which “group”. Results must be aggregated with sum, mean, count, etc. (see Aggregation).
• resample is often used before rolling, expanding, and ewm when using a DateTime index.
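A minimal sketch of the window methods on a short daily series follows. The values, window sizes and the 2-day bin width are arbitrary choices for illustration.

```python
import pandas as pd

idx = pd.date_range("2019-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Each window object must be aggregated, like a GroupBy object.
rolling_mean = s.rolling(window=3).mean()   # NaN until three values are available
running_sum = s.expanding().sum()           # window grows from the first row
smoothed = s.ewm(span=3).mean()             # exponentially weighted mean

# Resample to regular 2-day bins first, then roll over the bins.
two_day = s.resample("2D").sum()
two_day_rolling = two_day.rolling(window=2).mean()
```

Resampling first guarantees evenly spaced rows, so a rolling window of fixed row count then covers a fixed span of time.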
Reshaping DataFrames and Pivot Tables
Tools for reshaping DataFrames from the wide to the long format and back. The long format can be tidy, which means that "each variable is a column, each observation is a row" [1]. Tidy data is easier to filter, aggregate, transform, sort, and pivot. Reshaping operations often produce multi-level indices or columns, which can be sliced and indexed.
[1] Hadley Wickham (2014) "Tidy Data", http://dx.doi.org/10.18637/jss.v059.i10
Long to Wide Format and Back with stack() and unstack()
df.pivot() vs pd.pivot_table
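The sketch below moves a small hypothetical table from wide to long and back, then contrasts pivot with pivot_table. As the example suggests, pivot only rearranges values and needs each index/column pair to be unique, whereas pivot_table aggregates duplicate pairs.

```python
import pandas as pd

wide = pd.DataFrame(
    {"X": [1, 2], "Y": [3, 4]},
    index=pd.Index(["a", "b"], name="row"),
)
wide.columns.name = "variable"

# Wide to long: stack moves the column labels into an inner index level.
long = wide.stack()

# Long to wide: unstack moves the inner index level back into columns.
back = long.unstack()

# A tidy table: one row per observation, one column per variable.
tidy = long.rename("value").reset_index()

# pivot rearranges without aggregating; each (row, variable) pair must be unique.
pivoted = tidy.pivot(index="row", columns="variable", values="value")

# pivot_table aggregates duplicated pairs, here by summing them.
duplicated = pd.concat([tidy, tidy], ignore_index=True)
table = duplicated.pivot_table(
    index="row", columns="variable", values="value", aggfunc="sum"
)
```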