Practical Medium Data Analytics with Python

PyData NYC 2013
10 Things I Hate
About pandas
Wes McKinney

Former quant and MIT math dude
Creator of the pandas project for Python
Author of Python for Data Analysis (O'Reilly)

Founder and CEO of DataPad

www.datapad.io
> 20k copies since Oct 2012
Bringing many new people
to Python and data analysis
with code

Founded in 2013, located in SF

In private beta, join us!

Hiring for engineering

Why hate on pandas?
pandas rocks!
So, pandas

Easy-to-use, fast in-memory data wrangling and analytics library

Enabled loads of complex data work to be done by mere mortals in Python

Might have kept R from taking over the world (hehe)

pandas, the project

170 distinct contributors

Over 5400 issues and pull requests on GitHub

Upcoming 0.13 release


pandas's broad applicability also a

Only game in town for a lot of things

pandas being used in some unplanned ways

Some things to love

No more structured dtype drudgery!

Easy IO!
Data alignment!
Hierarchical indexing!
Time series analytics!
More things to love

Table reshaping
Missing data handling
pandas.merge, pandas.concat

Expressive groupby machinery

Some pandas use cases
General data wrangling
ETL jobs
Business analytics (incl. BI uses)

Time series analysis, statistical


pandas does many things
that are tedious, slow, or
difficult to do correctly
without it
Unfortunately, pandas is
not a database
#1 Slightly too far from
the metal
DataFrame's internal structure intended to make row-oriented ops fast on numerical data

Python objects can be used as data, indices (a feature, not a bug)

#2 No support (yet) for
memory maps
Many analytics ops require a small portion of the data

Many ways to materialize the full data set in memory by accident

Axis indexes wouldn't necessarily make sense on out-of-core data sets

N.B. HDF5/PyTables support is a partial solution

#3 No tight database integration
Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system

Inadequacy of pandas/NumPy data type systems

Jobs with heavy SQL-reading are slow and use tons of memory

TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays

#4 Best-efforts NA
Inconsistent representation of missing data

No Boolean or Integer NA values

NA needs to be a first-class citizen in analytics operations
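
To make the missing-data points above concrete, here is a minimal sketch of how default, NumPy-backed columns change dtype as soon as a missing value appears. The sample values are made up for illustration, and the behavior shown assumes pandas' default dtypes.

```python
import numpy as np
import pandas as pd

# Integer data with no missing values keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# With one missing value, the column falls back to float64,
# because NaN is a floating-point value.
ints_with_na = pd.Series([1, 2, np.nan])

# Boolean data with a missing value falls back to the generic object dtype.
bools_with_na = pd.Series([True, False, None])

print(ints.dtype, ints_with_na.dtype, bools_with_na.dtype)
```

The same logical "missing" state ends up stored as a float NaN in one column and as a Python None in another, which is the inconsistency the slide points at.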

#5 RAM management

Difficult to understand footprint of pandas


Ample data copying throughout library

Would benefit from being able to compress data in-memory or shuttle data temporarily to disk

#6 Weak support for
categorical data
Makes pandas not quite a fully-fledged R replacement

GroupBy and Joins slower than they could be

#7 Complex GroupBy
operations get messy
Must write custom functions to pass to .apply(...)

Easy to run up against DRY problems and general Python syntax limitations
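
As an illustration of the point about custom functions, the sketch below computes a weighted mean per group, an aggregation that needs two columns at once and so ends up in a hand-written function passed to .apply. The data, column names, and the weighted_mean helper are all hypothetical.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 30.0],
    "weight": [1.0, 3.0, 1.0, 1.0],
})


def weighted_mean(g):
    # Receives the sub-frame for one group and returns a single number.
    return (g["value"] * g["weight"]).sum() / g["weight"].sum()


# Select the needed columns, then hand each group to the custom function.
result = df.groupby("group")[["value", "weight"]].apply(weighted_mean)
print(result)
```

Every new multi-column aggregation of this kind tends to need another small function like weighted_mean, which is where the repetition creeps in.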

#8 Appending data slow
and tedious
DataFrame not intended as a database table

Makes streaming data use a


B+ tree tables interesting?
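
A common workaround for slow appends, sketched here under the assumption that incoming records arrive as plain dictionaries, is to buffer rows in a Python list and build the DataFrame once. The record layout is hypothetical.

```python
import pandas as pd

# Hypothetical stream of incoming records.
records = ({"id": i, "value": i * 0.5} for i in range(1000))

# Buffer rows in a plain Python list; appending to a list is cheap.
buffer = []
for rec in records:
    buffer.append(rec)

# Build the DataFrame once at the end instead of growing it row by row,
# since growing a DataFrame copies the data it already holds.
df = pd.DataFrame.from_records(buffer)
print(df.shape)
```

This sidesteps the copying cost but does not give a truly appendable table, which is the gap the slide describes.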

#9 Limited type system,
column metadata
Currencies, units
Time zones
Geographic data

Composite data types

#10 No true query
processing layer
Aggregate: SUM, MEAN, ...
Limit/TopK: LIMIT
Sorting: ORDER BY
#11 Slow: no multicore /
distributed algos
Hampered by use of Python data structures / GIL interactions

Object internals not designed for concurrent use

Oh no what do we do
Stop believing in the one
tool to rule them all
Real Artists Ship
- Steve Jobs
Focus on results

I am heavily biased by focus on business analytics/BI use cases

Need production-ready software to ship in relatively short time frame

A new project

In internal development at DataPad

Code named badger
pandas-ish syntax: designed for
data processing and analytical

Badger in a nutshell
Consistent data type system

Compressed columnar binary storage

High perf analytical query processor

Data preparation/cleaning tools

Time series analytics

Immutable array data, little copying

Analytics kernels: written in C with no


Caching of useful intermediates

Some benchmarks
Data set: 2012 Election data (FEC)
5.3 million records, 7 columns
R: data.table
SQL: PostgreSQL, SQLite
Query 1
Total contributions by candidate
SELECT cand_nm,
sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm
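
For comparison, a rough pandas equivalent of Query 1 might look like the sketch below. Only the column names come from the query; the tiny table is made-up stand-in data, not the FEC data set used in the benchmarks.

```python
import pandas as pd

# Made-up stand-in for the fec table; only the column names are from the query.
fec = pd.DataFrame({
    "cand_nm": ["A", "B", "A", "B"],
    "contb_receipt_amt": [100.0, 50.0, 25.0, 10.0],
})

# GROUP BY cand_nm, then SUM(contb_receipt_amt).
total = fec.groupby("cand_nm")["contb_receipt_amt"].sum()
print(total)
```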

Query 1
Total contributions by candidate
badger (in-memory) : 19ms (1x)
badger (from-disk) : 131ms (6.9x)
pandas (in-memory) : 273ms (14.3x)
R data.table 1.8.10: 382ms (20x)
PostgreSQL : 4.7s (247x)
SQLite : 72s (3800x)

Query 2
Total contributions by candidate
and state
SELECT cand_nm, contbr_st,
sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm, contbr_st
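
A rough pandas counterpart to Query 2, grouping on two keys, could be sketched as follows. The rows are made-up stand-in data; only the column names are taken from the query.

```python
import pandas as pd

# Made-up stand-in for the fec table; only the column names are from the query.
fec = pd.DataFrame({
    "cand_nm": ["A", "B", "A", "B"],
    "contbr_st": ["NY", "NY", "CA", "NY"],
    "contb_receipt_amt": [100.0, 50.0, 25.0, 10.0],
})

# GROUP BY cand_nm, contbr_st gives a result indexed by both keys.
total = fec.groupby(["cand_nm", "contbr_st"])["contb_receipt_amt"].sum()
print(total)
```

The result carries a hierarchical index of (candidate, state) pairs, one entry per combination present in the data.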

Query 2
Total contributions by candidate and state
badger (in-memory) : 269ms (1x)
badger (from-disk) : 391ms (1.5x)
R data.table 1.8.10: 500ms (1.8x)
pandas (in-memory) : 770ms (2.9x)
PostgreSQL : 5.96s (23x)

Query 3
Total contributions by candidate
and state with 2 filter predicates
SELECT cand_nm,
sum(contb_receipt_amt) as total
FROM fec
WHERE contb_receipt_dt BETWEEN
'2012-05-01' and '2012-11-05'
AND contb_receipt_amt BETWEEN
0 and 2500
GROUP BY cand_nm
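
The two BETWEEN predicates in Query 3 can be approximated in pandas with boolean masks, as in the sketch below. The rows are made-up stand-in data chosen to exercise both filters; only the column names and the bounds come from the query.

```python
import pandas as pd

# Made-up stand-in for the fec table; only the column names are from the query.
fec = pd.DataFrame({
    "cand_nm": ["A", "A", "B", "B"],
    "contb_receipt_dt": pd.to_datetime(
        ["2012-04-30", "2012-06-01", "2012-07-04", "2012-11-05"]
    ),
    "contb_receipt_amt": [100.0, 200.0, 3000.0, 50.0],
})

# Series.between includes both endpoints by default.
in_dates = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_amounts = fec["contb_receipt_amt"].between(0, 2500)

# Apply both filters, then group and sum as in Query 1.
total = fec[in_dates & in_amounts].groupby("cand_nm")["contb_receipt_amt"].sum()
print(total)
```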
Query 3
Total contributions by candidate
and state with 2 filter predicates

badger (in-memory) : 96ms (1x)

badger (from-disk) : 275ms (2.9x)
pandas (in-memory) : 946ms (9.8x)
PostgreSQL : 6.2s (65x)

Badger, the future
Distributed in-memory analytics
Multicore algorithms
ETL job-building tools
Open source in some form someday
Looking for algorithms hackers to help

Thank you!
