
Reading json directly into pandas

Andy Hayden
12 Jun 2013

New to the pandas 0.12 release is a read_json function (which uses the speedy ujson under the hood).

It's as easy as whacking in the path/URL/string of some valid JSON:

In [1]: df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=5')
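
A string of JSON works just as happily as a URL, and so does a local path; a minimal sketch (the tiny JSON string is a throwaway, and issues.json is a hypothetical file name):

import pandas as pd

# a JSON string works directly...
df = pd.read_json('{"a": {"0": 1, "1": 2}, "b": {"0": 3, "1": 4}}')

# ...as does a path to a local file ('issues.json' is hypothetical)
# df = pd.read_json('issues.json')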


Let's inspect a few columns to see how we've done:


In [2]: df[['created_at', 'title', 'body', 'comments']]
Out[2]:
           created_at                                    title                                  body  comments
0 2013-06-12 02:54:37           DOC add to_datetime to api.rst  Either I'm being thick or `to_da...        ...
1 2013-06-12 01:16:19              ci/after_script.sh missing?  https://travis...                          ...
2 2013-06-11 23:07:52         ENH Prefer requests over urllib2  At the moment we...                        ...
3 2013-06-11 21:12:45            Nothing in docs about io.data  There's nothing on the docs abou...        ...
4 2013-06-11 19:50:17  DOC: Clarify quote behavior parameters   I've been bit many times recentl...        ...

The convert_dates argument has a good crack at parsing any columns which look like they're dates, and
it's worked in this example (converting created_at to Timestamps). It looks carefully at the datatype and at
the column names (you can also pass a column name explicitly to ensure it gets converted) to choose
which to parse.
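
For instance, a minimal sketch of passing a column explicitly (issues.json is again a hypothetical local file):

# convert_dates also accepts a list of column names to parse explicitly
df = pd.read_json('issues.json', convert_dates=['created_at'])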
After you've done some analysis in your favourite data analysis library, the corresponding to_json allows
you to export results to valid JSON.
In [4]: res = df[['created_at', 'title', 'body', 'comments']].head()
In [5]: res.to_json()
Out[5]: '{"created_at":{"0":1370695148000000000,"1":1370665875000000000,"2":1370656273000000000
Here, orient decides how we should layout the data:
orient : {'split', 'records', 'index', 'columns', 'values'},
default is 'index' for Series, 'columns' for DataFrame
The format of the JSON string
split : dict like
{index -> [index], columns -> [columns], data -> [values]}
records : list like [{column -> value}, ... , {column -> value}]
index : dict like {index -> {column -> value}}
columns : dict like {column -> {index -> value}}
values : just the values array
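
A tiny sketch of a couple of these layouts on a throwaway frame (the exact key order in the output strings is from memory, so treat the commented results as approximate):

tiny = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

tiny.to_json(orient='split')
# e.g. '{"columns":["a","b"],"index":[0,1],"data":[[1,3],[2,4]]}'

tiny.to_json(orient='values')
# e.g. '[[1,3],[2,4]]'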
For example (note times have been exported as epoch, but we could have used ISO via date_format='iso'):
In [6]: res.to_json(orient='records')

Out[6]: '[{"created_at":1370695148000000000,"title":"CLN: refactored url accessing and filepath


Note, our times have been converted to unix timestamps (which also means we'd need to use the same
pd.to_datetime trick when we read_json it back in). Also NaNs, NaTs and Nones will be converted to JSON's
null.
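
Something like this round-trip sketch (reusing res from above):

# if a column of epoch ints isn't picked up by the date heuristics,
# convert it back to Timestamps by hand
df2 = pd.read_json(res.to_json())
df2['created_at'] = pd.to_datetime(df2['created_at'])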
And save it to a file:
In [7]: res.to_json(file_name)
Useful.
Warning: read_json requires valid JSON, so doing something like the following will raise an exception:
In [8]: pd.read_json("{'0':{'0':1,'1':3},'1':{'0':2,'1':4}}")
# ValueError, since this isn't valid JSON (single quotes)

In [9]: pd.read_json('{"0":{"0":1,"1":3},"1":{"0":2,"1":4}}')
Out[9]:
   0  1
0  1  2
1  3  4
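
One way to sidestep the single-quote trap is to build the string with the standard library's json module; a minimal sketch:

import json

# json.dumps always emits valid JSON (double-quoted keys and strings)
data = {'0': {'0': 1, '1': 3}, '1': {'0': 2, '1': 4}}
df = pd.read_json(json.dumps(data))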

Just as a further example, here I can get all the issues from github (there's a limit of 100 per request);
this is how easy it is to extract data with pandas:
In [10]: page = 1
         dfs = {}
         df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?page=%d&per_page=100' % page)
         while len(df):
             dfs[page] = df
             page += 1
             df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?page=%d&per_page=100' % page)

In [11]: dfs.keys()  # 7 requests come back with issues
Out[11]: [1, 2, 3, 4, 5, 6, 7]

In [12]: df = pd.concat(dfs, ignore_index=True).set_index('number')
In [13]: df
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 613 entries, 3813 to 39
Data columns (total 18 columns):
assignee          27  non-null values
body             613  non-null values
closed_at        ...  non-null values
comments         613  non-null values
comments_url     613  non-null values
created_at       613  non-null values
events_url       613  non-null values
html_url         613  non-null values
id               613  non-null values
labels           613  non-null values
labels_url       613  non-null values
milestone        586  non-null values
pull_request     613  non-null values
state            613  non-null values
title            613  non-null values
updated_at       613  non-null values
url              613  non-null values
user             613  non-null values
dtypes: datetime64[ns](1), int64(2), object(15)


In [14]: df.comments.describe()
Out[14]:
count    613.000000
mean       3.590538
std        9.641128
min        0.000000
25%        0.000000
50%        1.000000
75%        4.000000
max      185.000000
dtype: float64
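
Wrapped up as a function, the whole fetch looks something like this sketch (fetch_issues is my own hypothetical helper, not part of pandas):

def fetch_issues(repo, per_page=100):
    # page through the github issues API until an empty page comes back
    dfs = []
    page = 1
    while True:
        url = ('https://api.github.com/repos/%s/issues?page=%d&per_page=%d'
               % (repo, page, per_page))
        df = pd.read_json(url)
        if not len(df):
            break
        dfs.append(df)
        page += 1
    return pd.concat(dfs, ignore_index=True).set_index('number')

issues = fetch_issues('pydata/pandas')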
It deals with moderately sized files fairly efficiently; here's a 200Mb file (this is on my 2009 macbook
air, I'd expect times to be faster on better hardware e.g. an SSD):
In [15]: %time pd.read_json('citylots.json')
CPU times: user 4.78 s, sys: 684 ms, total: 5.46 s
Wall time: 5.89 s
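
Outside IPython the same measurement can be done with the standard library; a minimal sketch (assuming a local copy of citylots.json):

import time

start = time.time()
df = pd.read_json('citylots.json')  # local copy of the ~200Mb file
print('took %.2fs' % (time.time() - start))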
Thanks to wesm, jreback and Komnomnomnom for putting it together.