
Reading and Writing Data with Pandas

Methods to read data are all named pd.read_*, where * is the file type.
Series and DataFrames can be saved to disk using their to_* method.

Usage Patterns

Use pd.read_clipboard() for one-off data extractions.

Use the other pd.read_* methods in scripts for repeatable analyses.

Reading Text Files into a DataFrame
Colors highlight how different arguments map from the data file to a DataFrame.

# Historical_data.csv

Date, Cs, Rd
2005-01-03, 64.78, -
2005-01-04, 63.79, 201.4
2005-01-05, 64.46, 193.45
...
Data from Lab Z.
Recorded by Agent E

>>> read_table(
        'historical_data.csv',
        sep=',',
        header=1,
        skiprows=1,
        skipfooter=2,
        index_col=0,
        parse_dates=True,
        na_values=['-'])

Other arguments:

names: set or override column names
parse_dates: accepts multiple argument types (see below)
converters: manually process each element in a column
comment: character indicating commented line
chunksize: read only a certain number of rows each time

Possible values of parse_dates:

[0, 2]: Parse columns 0 and 2 as separate dates
[[0, 2]]: Group columns 0 and 2 and parse as single date
{'Date': [0, 2]}: Group columns 0 and 2, parse as single date in a column named Date.
Dates are parsed after the converters have been applied.
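
As an illustration of how these arguments interact, here is a minimal sketch that parses an in-memory copy of a file shaped like the example above. The file contents are an assumption (one title line, a header row, three data rows, two footer lines), and header=0 is used because in this assumed layout only the title line precedes the header.

```python
import io

import pandas as pd

# Hypothetical file contents modeled on the example: a title line,
# a header row, three data rows, and a two-line footer.
raw = """# Historical_data.csv
Date,Cs,Rd
2005-01-03,64.78,-
2005-01-04,63.79,201.4
2005-01-05,64.46,193.45
Data from Lab Z.
Recorded by Agent E
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,        # skip the title line
    header=0,          # the first remaining line holds the column names
    skipfooter=2,      # drop the two trailing comment lines
    engine="python",   # skipfooter is only supported by the Python engine
    index_col=0,       # use the Date column as the index
    parse_dates=True,  # parse the index as dates
    na_values=["-"],   # treat '-' as a missing value
)
print(df)
```

The index comes back as datetimes and the '-' entry in Rd becomes NaN, which is what the na_values and parse_dates arguments are for.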

Parsing Tables from the Web

>>> df_list = read_html(url)

read_html returns a list of DataFrames, one per table found on the page.

Writing Data Structures to Disk

Writing data structures to disk:
> s_df.to_csv(filename)
> s_df.to_excel(filename)

Write multiple DataFrames to single Excel file:
> writer = pd.ExcelWriter(filename)
> df1.to_excel(writer, sheet_name='First')
> df2.to_excel(writer, sheet_name='Second')
> writer.close()

From and To a Database

Read, using SQLAlchemy. Supports multiple databases:
> from sqlalchemy import create_engine
> engine = create_engine(database_url)
> conn = engine.connect()
> df = pd.read_sql(query_str_or_table_name, conn)

Write:
> df.to_sql(table_name, conn)
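
The database round trip can be tried without a database server. The sketch below swaps the SQLAlchemy engine for an in-memory sqlite3 connection from the standard library, which pandas also accepts; the table name and columns are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical data; any DataFrame would do.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})

# An in-memory SQLite database stands in for a real database URL.
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table, then read part of it back with a query.
df.to_sql("measurements", conn, index=False)
result = pd.read_sql(
    "SELECT name, value FROM measurements WHERE value > 1", conn
)
conn.close()
print(result)
```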

Take your Pandas skills to the next level! Register at www.enthought.com/pandas-master-class
2016 Enthought, Inc., licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/

Split / Apply / Combine with DataFrames
1. Split the data based on some criteria.
2. Apply a function to each group to aggregate, transform, or filter.
3. Combine the results.
The apply and combine steps are typically done together in Pandas.

[Figure: Split/Apply/Combine. A DataFrame with columns X and Y is split into groups a, b, and c by the values of X; a function is applied to each group; the results are combined into one row per group (a: 1.5, b: 2, c: 2).]

Split: Group By

Group by a single column:
> g = df.groupby(col_name)
Grouping with list of column names creates DataFrame with MultiIndex
(see Reshaping DataFrames and Pivot Tables cheatsheet):
> g = df.groupby(list_col_names)
Pass a function to group based on the index:
> g = df.groupby(function)
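
The three ways of splitting can be sketched on a small DataFrame. The data below is an assumption, chosen so that the group means match the figure (a: 1.5, b: 2, c: 2).

```python
import pandas as pd

# Hypothetical data: two rows per group in column X.
df = pd.DataFrame({"X": ["a", "b", "c", "a", "b", "c"],
                   "Y": [1, 3, 2, 2, 1, 2]})

# Group by a single column, then aggregate each group.
means = df.groupby("X")["Y"].mean()

# Group by a list of columns: the result is indexed by a MultiIndex.
counts = df.groupby(["X", "Y"]).size()

# Group by a function of the index: here, even vs. odd row labels.
by_parity = df.groupby(lambda idx: idx % 2)["Y"].sum()

print(means)
print(counts)
print(by_parity)
```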

[Figure: Groupby and Window Functions both feed an Apply step, which covers group-specific transformations, aggregation, and group-specific filtering. In the example, df.groupby('X') splits a DataFrame with columns X, Y, Z into group a (rows 0 and 2), group b (rows 1 and 3), and group c (row 4).]

Split: What's a GroupBy Object?

It keeps track of which rows are part of which group.
> g.groups
Dictionary, where keys are group names, and values are indices of rows in a given group.
It is iterable:
> for group, sub_df in g:
.     ...

Apply/Combine: General Tool: apply

More general than agg, transform, and filter. Can aggregate, transform or filter.
The resulting dimensions can change, for example:
> g.apply(lambda x: x.describe())
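
A short sketch of inspecting a GroupBy object, using made-up data with the same row layout as the figure (rows 0 and 2 in group a, rows 1 and 3 in group b, row 4 in group c). The apply call is made on a single selected column, which is a choice made here to keep the output small.

```python
import pandas as pd

# Hypothetical data laid out like the figure.
df = pd.DataFrame({"X": ["a", "b", "a", "b", "c"],
                   "Y": [1, 2, 3, 4, 5]})
g = df.groupby("X")

# g.groups maps each group name to the row labels it contains.
group_rows = {name: list(labels) for name, labels in g.groups.items()}

# Iterating yields (group name, sub-DataFrame) pairs.
sizes = {name: len(sub_df) for name, sub_df in g}

# apply can change the shape: describe() returns 8 statistics per group.
described = g["Y"].apply(lambda s: s.describe())

print(group_rows)
print(sizes)
print(described)
```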
Apply/Combine: Aggregation

Perform computations on each group. The shape changes;
the categories in the grouping columns become the index.
Can use built-in aggregation methods: mean, sum, size,
count, std, var, sem, describe, first, last, nth,
min, max, for example:
> g.mean()
or aggregate using custom function:
> g.agg(series_to_value)
or aggregate with multiple functions at once:
> g.agg([s_to_v1, s_to_v2])
or use different functions on different columns:
> g.agg({'Y': s_to_v1, 'Z': s_to_v2})

[Figure: g.agg() returns one row per group (a, b, c), indexed by X, with columns Y and Z.]

Apply/Combine: Transformation

The shape and the index do not change.
> g.transform(df_to_df)
Example, normalization:
> def normalize(grp):
.     return (grp - grp.mean()) / grp.std()
> g.transform(normalize)

[Figure: g.transform() returns a DataFrame with the same rows and index (0 to 4) as the input.]

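
The difference in output shape between aggregation and transformation shows up clearly on a small example. The data and the choice of functions below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with one grouping column and two value columns.
df = pd.DataFrame({"X": ["a", "b", "a", "b", "c"],
                   "Y": [1.0, 2.0, 3.0, 6.0, 5.0],
                   "Z": [10, 20, 30, 40, 50]})
g = df.groupby("X")

# Aggregation: one row per group, the group labels become the index.
summary = g.agg({"Y": "mean", "Z": "sum"})

# Several functions at once on one column: one output column per function.
extremes = g["Y"].agg(["min", "max"])

# Transformation: same length and index as the input.
centered = g["Y"].transform(lambda s: s - s.mean())

print(summary)
print(extremes)
print(centered)
```

Here summary and extremes have three rows (one per group), while centered keeps all five rows in their original order.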
Apply/Combine: Filtering

Returns a group only if condition is true.
> g.filter(lambda x: len(x)>1)

[Figure: g.filter() keeps the rows of groups a and b (rows 0 to 3) and drops group c, which has a single row.]

Other Groupby-Like Operations: Window Functions

resample, rolling, and ewm (exponential weighted function) methods behave
like GroupBy objects. They keep track of which row is in which group.
Results must be aggregated with sum, mean, count, etc. (see Aggregation).
resample is often used before rolling, expanding, and ewm when using a
DateTime index.
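
A minimal sketch of filtering and of two window functions, on made-up data. The window size of 2 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical data: groups a and b have two rows, group c has one.
df = pd.DataFrame({"X": ["a", "b", "a", "b", "c"],
                   "Y": [1, 2, 3, 4, 5]})

# Keep only the rows belonging to groups with more than one row.
kept = df.groupby("X").filter(lambda grp: len(grp) > 1)

# Window functions also need an aggregation step.
s = pd.Series([1.0, 2.0, 3.0, 4.0])
rolled = s.rolling(window=2).mean()   # mean over each pair of neighbors
expanded = s.expanding().sum()        # running total

print(kept)
print(rolled)
print(expanded)
```

The first value of rolled is NaN because a window of two values is not yet full at the first row.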

Manipulating Dates and Times
Use a Datetime index for easy time-based indexing and slicing,
as well as for powerful resampling and data alignment.
Pandas makes a distinction between timestamps, called
Datetime objects, and time spans, called Period objects.

Timestamps vs Periods

[Figure: Timestamps mark points in time (2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04); Periods cover the spans between them (2016-01-01, 2016-01-02, 2016-01-03).]

Converting Objects to Time Objects

Convert different types, for example strings, lists, or arrays to
Datetime with:
> pd.to_datetime(value)
Convert timestamps to time spans: set period duration with
frequency offset (see below).
> date_obj.to_period(freq=freq_offset)

Creating Ranges of Timestamps

> pd.date_range(start=None, end=None,
                periods=None, freq=offset,
                tz='Europe/London')
Specify either a start or end date, or both. Set number of
"steps" with periods. Set "step size" with freq; see "Frequency
offsets" for acceptable values. Specify time zones with tz.

Save Yourself Some Pain: Use ISO 8601 Format

When entering dates, to be consistent and to lower the risk of error
or confusion, use ISO format YYYY-MM-DD:

>>> pd.to_datetime('12/01/2000') # 1st December
Timestamp('2000-12-01 00:00:00')
>>> pd.to_datetime('13/01/2000') # 13th January!
Timestamp('2000-01-13 00:00:00')
>>> pd.to_datetime('2000-01-13') # 13th January
Timestamp('2000-01-13 00:00:00')
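
The conversions above can be combined in a short sketch. Passing an explicit format for the day-first string is an addition made here; it is one way to avoid relying on how an ambiguous date gets guessed.

```python
import pandas as pd

# ISO 8601 input is unambiguous.
ts = pd.to_datetime("2000-01-13")

# Slash-separated input is read month-first by default.
month_first = pd.to_datetime("12/01/2000")

# An explicit format removes the guesswork for day-first strings.
day_first = pd.to_datetime("13/01/2000", format="%d/%m/%Y")

# Four daily timestamps starting on a given date.
days = pd.date_range(start="2016-01-01", periods=4, freq="D")

# Convert a timestamp into the one-day period that contains it.
period = ts.to_period(freq="D")

print(ts, month_first, day_first)
print(days)
print(period)
```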

Frequency Offsets

Used by date_range, period_range and resample:

B: Business day           A: Year end
D: Calendar day           AS: Year start
W: Weekly                 H: Hourly
M: Month end              T, min: Minutely
MS: Month start           S: Secondly
BM: Business month end    L, ms: Milliseconds
Q: Quarter end            U, us: Microseconds
                          N: Nanoseconds

For more: Look up "Pandas Offset Aliases" or check out the
pandas.tseries.offsets and pandas.tseries.holiday modules.

Creating Ranges or Periods

> pd.period_range(start=None, end=None,
                  periods=None, freq=offset)

Resampling

> s_df.resample(freq_offset).mean()
resample returns a groupby-like object that must be
aggregated with mean, sum, std, apply, etc. (See also the
Split-Apply-Combine cheat sheet.)
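
A small sketch of resampling and of a period range, on made-up daily data. The two-day bin size and the three-month range are arbitrary choices.

```python
import pandas as pd

# Hypothetical daily series: six consecutive days with values 1..6.
idx = pd.date_range("2016-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Resample into two-day bins, then aggregate each bin.
two_day = s.resample("2D").sum()

# Three consecutive monthly periods.
months = pd.period_range(start="2016-01", periods=3, freq="M")

print(two_day)
print(months)
```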

Vectorized String Operations


Pandas implements vectorized string operations named
after Python's string methods. Access them through the
str attribute of string Series.

Some String Methods

> s.str.lower()      > s.str.strip()
> s.str.isupper()    > s.str.normalize(form)
> s.str.len()        and more
Index by character position:
> s.str[0]
True if regular expression pattern or string in Series:
> s.str.contains(str_or_pattern)

Splitting and Replacing

split returns a Series of lists:
> s.str.split()
Access an element of each list with get:
> s.str.split(char).str.get(1)
Return a DataFrame instead of a list:
> s.str.split(expand=True)
Find and replace with string or regular expressions:
> s.str.replace(str_or_regex, new)
> s.str.extract(regex)
> s.str.findall(regex)
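
A few of these methods applied to a made-up Series of strings. The sample values and regular expressions are assumptions for illustration.

```python
import pandas as pd

# Hypothetical string data.
s = pd.Series(["Red Panda", "giant panda", "Snow Leopard"])

initials = s.str[0]                                  # first character
has_panda = s.str.contains("panda", case=False)      # boolean Series
second_word = s.str.split(" ").str.get(1)            # element 1 of each list
parts = s.str.split(" ", expand=True)                # one column per piece
replaced = s.str.replace(r"\s+", "_", regex=True)    # regex replacement
first_word = s.str.extract(r"^(\w+)", expand=False)  # first capture group

print(parts)
print(replaced)
```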

Pandas Data Structures: Series and DataFrames
A Series, s, maps an index to values. It is:
  Like an ordered dictionary
  A Numpy array with row labels and a name
A DataFrame, df, maps index and column labels to values. It is:
  Like a dictionary of Series (columns) sharing the same index
  A 2D Numpy array with row and column labels
s_df applies to both Series and DataFrames.
Assume that manipulations of Pandas objects return copies.

Creating Series and DataFrames

Series
> pd.Series(values, index=index,
            name=name)
> pd.Series({'idx1': val1, 'idx2': val2})
Where values, index, and name are sequences or arrays.

Index    Values    Integer location
n1       Cary      0
n2       Lynn      1
n3       Sam       2

DataFrame
> pd.DataFrame(values, index=index,
               columns=col_names)
> pd.DataFrame({'col1': series1_or_seq,
                'col2': series2_or_seq})
Where values is a sequence of sequences or a 2D array.

Index    Age    Gender
Cary     32     M
Lynn     18     F
Sam      26     M

(Age and Gender are the columns; each cell is a value.)

Indexing and Slicing

Use these attributes on Series and DataFrames for indexing,
slicing, and assignments:

s_df.loc[]             Refers only to the index labels
s_df.iloc[]            Refers only to the integer location,
                       similar to lists or Numpy arrays
s_df.xs(key, level)    Select rows with label key in level
                       level of an object with MultiIndex.

Masking and Boolean Indexing

Create masks with, for example, comparisons
  mask = df['X'] < 0
Or isin, for membership mask
  mask = df['X'].isin(list_valid_values)
Use masks for indexing (must use loc)
  df.loc[mask] = 0
Combine multiple masks with bitwise operators (and (&), or (|), xor
(^), not (~)) and group them with parentheses:
  mask = (df['X'] < 0) & (df['Y'] == 0)

Common Indexing and Slicing Patterns
rows and cols can be values, lists, Series or masks.

s_df.loc[rows]          Some rows (all columns in a DataFrame)
df.loc[:, cols_list]    All rows, some columns
df.loc[rows, cols]      Subset of rows and columns
s_df.loc[mask]          Boolean mask of rows (all columns)
df.loc[mask, cols]      Boolean mask of rows, some columns

Using [ ] on Series and DataFrames

On Series, [ ] refers to the index labels, or to a slice
s['a']            Value
s[:2]             Series, first 2 rows

On DataFrames, [ ] refers to column labels:
df['X']           Series
df[['X', 'Y']]    DataFrame
df['new_or_old_col'] = series_or_array

EXCEPT! with a slice or mask.
df[:2]            DataFrame, first 2 rows
df[mask]          DataFrame, rows where mask is True

NEVER CHAIN BRACKETS!

> df[mask]['X'] = 1
SettingWithCopyWarning
> df.loc[mask , 'X'] = 1

Manipulating Series and DataFrames

Manipulating Columns
df.rename(columns={old_name: new_name})      Renames column
df.drop(name_or_names, axis='columns')       Drops column name

Manipulating Index
s_df.reindex(new_index)                      Conform to new index
s_df.drop(labels_to_drop)                    Drops index labels
s_df.rename(index={old_label: new_label})    Renames index labels
s_df.reset_index()                           Moves index to a column (drop=True discards it), replaces with Range index
s_df.sort_index()                            Sorts index labels
df.set_index(column_name_or_names)

Manipulating Values
All row values and the index will follow:
df.sort_values(col_name, ascending=True)
df.sort_values(['X','Y'], ascending=[False, True])

Important Attributes and Methods

s_df.index                     Array-like row labels
df.columns                     Array-like column labels
s_df.values                    Numpy array, data
s_df.shape                     (n_rows, m_cols)
s.dtype, df.dtypes             Type of Series, of each column
len(s_df)                      Number of rows
s_df.head() and s_df.tail()    First/last rows
s.unique()                     Array of unique values
s_df.describe()                Summary stats
df.info()                      Column types, non-null counts, memory usage
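
The indexing patterns above, sketched on a made-up DataFrame. The labels n1 to n3 and the column values are assumptions; the assignment goes through a single .loc call rather than chained brackets.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame({"X": [-1, 2, -3], "Y": [0, 0, 5]},
                  index=["n1", "n2", "n3"])

# Combine two conditions; each one needs its own parentheses.
mask = (df["X"] < 0) & (df["Y"] == 0)

# Assign through a single .loc call (rows from the mask, one column).
df.loc[mask, "X"] = 1

by_label = df.loc["n2", "X"]     # label-based lookup
by_position = df.iloc[2, 1]      # integer-location lookup

# Membership mask on rows, list of columns on the other axis.
subset = df.loc[df["X"].isin([1, 2]), ["Y"]]

print(df)
print(subset)
```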

Combining DataFrames
Tools for combining Series and DataFrames
together, with SQL-type joins and concatenation.
Use join if merging on indices, otherwise use merge.

Concatenating DataFrames

> pd.concat(df_list)
Stacks DataFrames on top of each other.
Set ignore_index=True to replace index with RangeIndex.
Note: Faster than repeated df.append(other_df), which was removed in pandas 2.0.

Merge on Column Values

> pd.merge(left, right, how='inner', on='id')
Ignores the index unless left_index=True or right_index=True. If on=None,
merges on the columns common to both DataFrames. See value of how below.
Use on if merging on same column in both DataFrames, otherwise
use left_on, right_on.

Join on Index

> df.join(other)
Merge DataFrames on index. Set on=keys to join on the keys columns of df
and on the index of other. Join uses pd.merge under the covers.

Merge Types: The how Keyword

Merging left and right with left_on='X', right_on='Y':

left                 right
   long  X              Y  short
0  aaaa  a           0  b  bb
1  bbbb  b           1  c  cc

how="outer"
   long  X    Y    short
0  aaaa  a    NaN  NaN
1  bbbb  b    b    bb
2  NaN   NaN  c    cc

how="inner"
   long  X  Y  short
0  bbbb  b  b  bb

how="left"
   long  X  Y    short
0  aaaa  a  NaN  NaN
1  bbbb  b  b    bb

how="right"
   long  X    Y  short
0  bbbb  b    b  bb
1  NaN   NaN  c  cc
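
The four merge types can be reproduced with the left and right tables from this section. This is a sketch for checking the row counts; the row order of an outer merge may differ between pandas versions, so only the other three are inspected by position.

```python
import pandas as pd

# The two example tables.
left = pd.DataFrame({"long": ["aaaa", "bbbb"], "X": ["a", "b"]})
right = pd.DataFrame({"Y": ["b", "c"], "short": ["bb", "cc"]})

# Run the same merge with each value of how.
results = {
    how: pd.merge(left, right, how=how, left_on="X", right_on="Y")
    for how in ["outer", "inner", "left", "right"]
}

# Concatenation stacks rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([left, left], ignore_index=True)

for how, merged in results.items():
    print(how, len(merged))
print(stacked)
```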

Cleaning Data with Missing Values


Pandas represents missing values as NaN (Not a Number). It
comes from Numpy and is of type float64. Pandas has
many methods to find and replace missing values.

Find Missing Values

> s_df.isnull() or > pd.isnull(obj)
> s_df.notnull() or > pd.notnull(obj)

Replacing Missing Values

s_df.loc[s_df.isnull()] = 0          Use mask to replace NaN
s_df.interpolate(method='linear')    Interpolate using different methods
s_df.fillna(method='ffill')          Fill forward (last valid value)
s_df.fillna(method='bfill')          Or backward (next valid value)
s_df.dropna(how='any')               Drop rows if any value is NaN
s_df.dropna(how='all')               Drop rows if all values are NaN
s_df.dropna(how='all', axis=1)       Drop across columns instead of rows
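
A sketch of the find-and-replace methods on a made-up Series. It uses s.ffill() for the forward fill, which is the method form of fillna(method='ffill') and the spelling that recent pandas versions prefer.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan])

n_missing = int(s.isnull().sum())           # count the NaN entries
zero_filled = s.fillna(0)                   # replace NaN with a constant
forward = s.ffill()                         # carry the last valid value forward
interpolated = s.interpolate(method="linear")
dropped = s.dropna()                        # keep only the valid entries

print(forward)
print(interpolated)
```

With the default settings, linear interpolation fills the gap between 1.0 and 3.0 with 2.0 and repeats the last valid value for the trailing NaN.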

Reshaping Dataframes and Pivot Tables
Tools for reshaping DataFrames from the wide to the long format and back.
The long format can be tidy, which means that "each variable is a column,
each observation is a row" [1]. Tidy data is easier to filter, aggregate,
transform, sort, and pivot. Reshaping operations often produce multi-level
indices or columns, which can be sliced and indexed.
[1] Hadley Wickham (2014) "Tidy Data", http://dx.doi.org/10.18637/jss.v059.i10

MultiIndex: A Multi-Level Hierarchical Index

Often created as a result of:
> df.groupby(list_of_columns)
> df.set_index(list_of_columns)

Contiguous labels are displayed together but apply to each row. The concept is
similar to multi-level columns.
A MultiIndex allows indexing and slicing one or multiple levels at once. Using
the Long example below:

long.loc[1900]                     All 1900 rows
long.loc[(1900, 'March')]          value 2
long.xs('March', level='Month')    All March rows

Simpler than using boolean indexing, for example:
> long[long.Month == 'March']

Long to Wide Format and Back with stack() and unstack()

Pivot column level to index, i.e. "stacking the columns" (wide to long):
> df.stack()
Pivot index level to columns, "unstack the columns" (long to wide):
> df.unstack()
If multiple indices or column levels, use level number or name to
stack/unstack:
> df.unstack(0) or > df.unstack('Year')
A common use case for unstacking, plotting group data vs index
after groupby:
> (df.groupby(['A', 'B'])['relevant'].mean()
     .unstack().plot())

Long
Year  Month  Value
1900  Jan.   1
      Feb.   7
      Mar.   2
2000  Jan.   4
      Feb.   3
      Mar.   9

Wide (Stack converts Wide to Long; Unstack converts Long to Wide)
Year  Jan.  Feb.  Mar.
1900  1     7     2
2000  4     3     9
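
The Wide and Long tables can be rebuilt and converted into each other in a few lines. This sketch uses the month labels as printed in the tables ('Jan.', 'Feb.', 'Mar.'), and the variable names wide, long, and back are chosen here for illustration.

```python
import pandas as pd

# The Wide table: one row per year, one column per month.
wide = pd.DataFrame(
    {"Jan.": [1, 4], "Feb.": [7, 3], "Mar.": [2, 9]},
    index=pd.Index([1900, 2000], name="Year"),
)
wide.columns.name = "Month"

# Stack the columns: the result is indexed by (Year, Month).
long = wide.stack()

value = long.loc[(1900, "Mar.")]          # one cell, selected by both levels
march = long.xs("Mar.", level="Month")    # all March rows, indexed by Year

# Unstack the Month level to get back to the wide layout.
back = long.unstack("Month")

print(long)
print(march)
print(back)
```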

Pivot Tables

> pd.pivot_table(df,
      index=cols,         (keys to group by for index)
      columns=cols2,      (keys to group by for columns)
      values=cols3,       (columns to aggregate)
      aggfunc='mean')     (what to do with repeated values)

Omitting index, columns, or values will use all remaining columns of df.
You can "pivot" a table manually using groupby, stack and unstack.

Example input (df):

   Recently updated  Number of stations  Continent code
1  FALSE             1                   EU
2  FALSE             1                   EU
3  FALSE             1                   EU
4  TRUE              1                   EU
5  FALSE             1                   AN
6  TRUE              1                   AN
7  TRUE              1                   AN

pd.pivot_table(df,
    index="Recently updated",
    columns="Continent code",
    values="Number of stations",
    aggfunc=np.sum)

Result:

Continent code    AN  EU
Recently updated
FALSE             1   3
TRUE              2   1

From Wide to Long with melt

Specify which columns are identifiers (id_vars, values will be
repeated for each row) and which are "measured variables"
(value_vars, whose names become the values of the variable column;
defaults to all remaining columns).

pd.melt(df, id_vars=id_cols, value_vars=value_columns)

pd.melt(team, id_vars=['Color'],
        value_vars=['A', 'B', 'C'],
        var_name='Team', value_name='Score')

Input (team):

   Color  A  B  C
0  Red    1  3  4
1  Blue   2  -  6

Result of melt:

   Color  Team  Score
0  Red    A     1
1  Blue   A     2
2  Red    B     3
3  Blue   B     -
4  Red    C     4
5  Blue   C     6
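
Both reshaping calls can be checked on data shaped like the examples in this section. The seven station rows and the team scores below are reconstructed from the tables and should be read as assumptions; the missing Blue/B score is represented as NaN.

```python
import pandas as pd

# Seven station rows, each counting one station.
stations = pd.DataFrame({
    "Recently updated": [False, False, False, True, False, True, True],
    "Number of stations": [1, 1, 1, 1, 1, 1, 1],
    "Continent code": ["EU", "EU", "EU", "EU", "AN", "AN", "AN"],
})

# Sum the station counts for each (updated, continent) combination.
table = pd.pivot_table(
    stations,
    index="Recently updated",
    columns="Continent code",
    values="Number of stations",
    aggfunc="sum",
)

# Wide team scores; the Blue/B score is missing.
team = pd.DataFrame({"Color": ["Red", "Blue"],
                     "A": [1, 2],
                     "B": [3.0, float("nan")],
                     "C": [4, 6]})

# Melt the three score columns into (Team, Score) pairs.
melted = pd.melt(team, id_vars=["Color"], value_vars=["A", "B", "C"],
                 var_name="Team", value_name="Score")

print(table)
print(melted)
```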

df.pivot() vs pd.pivot_table

df.pivot()          Does not deal with repeated values in index.
                    It's a declarative form of stack and unstack.
pd.pivot_table()    Use if you have repeated values in index
                    (specify aggfunc argument).

