
Introduction to Data Science

Lecture 3
Manipulating Tabular Data

Intro. to Data Science Fall 2015


John Canny
including notes from Michael Franklin and
others
Outline for this Evening
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures
Data Science – One Definition
The Big Picture

Extract, Transform, Load
Two views of tables
First view
Key Concept: Structured Data
A data model is a collection of concepts for describing data.

A schema is a description of a particular collection of data, using a given data model.
The Relational Model*
• The Relational Model is Ubiquitous:
• MySQL, PostgreSQL, Oracle, DB2, SQLServer, …
• Foundational work done at
• IBM - System R
• UC Berkeley - Ingres
E. F. “Ted” Codd
Turing Award 1981
• Object-oriented concepts have been merged in
• Early work: POSTGRES research project at Berkeley
• Informix, IBM DB2, Oracle 8i

• Also has support for XML (semi-structured data)

*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 37
Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
Schema : specifies name of relation, plus name and
type of each column
Students(sid: string, name: string, login: string,
age: integer, gpa: real)
Instance : the actual data at a given time
• #rows = cardinality
• #fields = degree / arity
• A relation is a mathematical object (from set theory)
which is true for certain arguments.
• An instance defines the set of arguments for which the
relation is true (it’s a table not a row).
Ex: Instance of Students Relation
sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8

• Cardinality = 3, arity = 5 , all rows distinct

• The relation is true for these tuples and false for others
SQL - A language for Relational DBs*

• SQL = Structured Query Language


• Data Definition Language (DDL)
– create, modify, delete relations
– specify constraints
– administer users, security, etc.
• Data Manipulation Language (DML)
– Specify queries to find tuples that satisfy criteria
– add, modify, remove tuples
• The DBMS is responsible for efficient evaluation.

* Developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the 1970s.
Used to be SEQUEL (Structured English QUEry Language)
Creating Relations in SQL
• Create the Students relation.
– Note: the type (domain) of each field is specified,
and enforced by the DBMS whenever tuples are
added or modified.

CREATE TABLE Students
  (sid CHAR(20),
   name CHAR(20),
   login CHAR(10),
   age INTEGER,
   gpa FLOAT)
Table Creation (continued)

• Another example: the Enrolled table holds information about courses students take.

CREATE TABLE Enrolled
  (sid CHAR(20),
   cid CHAR(20),
   grade CHAR(2))
Adding and Deleting Tuples
• Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa)
VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)

• Can delete all tuples satisfying some condition (e.g., name = Smith):
DELETE
FROM Students S
WHERE S.name = 'Smith'
Queries in SQL
• Single-table queries are straightforward.

• To find all 18 year old students, we can write:

SELECT *
FROM Students S
WHERE S.age=18

• To find just names and logins, replace the first line:

SELECT S.name, S.login
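The statements on the last few slides can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: the in-memory database and the ORDER BY clause are choices made for the illustration, and SQLite is more relaxed about column types than the slides assume of a DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE Students "
    "(sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)"
)
rows = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@eecs", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
]
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", rows)

# Single-table query: names and logins of all 18-year-old students.
eighteen = conn.execute(
    "SELECT S.name, S.login FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()
print(eighteen)

# Delete every tuple whose name is Smith, then count what is left.
conn.execute("DELETE FROM Students WHERE name = 'Smith'")
remaining = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
print(remaining)
```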
Joins and Inference
• Chaining relations together is the basic inference
method in relational DBs. It produces new
relations (effectively new facts) from the data:
SELECT S.name, M.mortality
FROM Students S, Mortality M
WHERE S.Race=M.Race
S (Students):
Name           Race
Socrates       Man
Thor           God
Barney         Dinosaur
Blarney stone  Stone

M (Mortality):
Race      Mortality
Man       Mortal
God       Immortal
Dinosaur  Mortal
Stone     Non-living

Result:
Name Mortality
Socrates Mortal
Thor Immortal
Barney Mortal
Blarney stone Non-living
Basic SQL Query
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification

• relation-list : A list of relation names
  • possibly with a range-variable after each name
• target-list : A list of attributes of tables in relation-list
• qualification : Comparisons combined using AND, OR and NOT.
  • Comparisons are Attr op const or Attr1 op Attr2, where op is one of =, ≠, <, >, ≤, ≥
• DISTINCT: optional keyword indicating that the
answer should not contain duplicates.
• In SQL SELECT, the default is that duplicates are not
eliminated! (Result is called a “multiset”)
SQL Inner Joins
SELECT S.name, E.classid
FROM Students S [INNER] JOIN Enrolled E
ON S.sid=E.sid

S (Students):
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

E (Enrolled):
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150

Note the previous version of this query (with no join keyword) is an “Implicit join”
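As a quick check that the explicit and implicit forms agree, here is a small sketch using Python's sqlite3 module with the S and E rows from this slide. The ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, sid TEXT)")
conn.execute("CREATE TABLE Enrolled (sid TEXT, classid TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [("Jones", "11111"), ("Smith", "22222"), ("Brown", "33333")],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?)",
    [("11111", "History105"), ("11111", "DataScience194"),
     ("22222", "French150"), ("44444", "English10")],
)

# Explicit join: the join condition lives in the ON clause.
explicit = conn.execute(
    "SELECT S.name, E.classid FROM Students S INNER JOIN Enrolled E "
    "ON S.sid = E.sid ORDER BY S.name, E.classid"
).fetchall()

# Implicit join: list both tables and put the condition in WHERE.
implicit = conn.execute(
    "SELECT S.name, E.classid FROM Students S, Enrolled E "
    "WHERE S.sid = E.sid ORDER BY S.name, E.classid"
).fetchall()

print(explicit)
print(explicit == implicit)
```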
Unmatched keys: Brown (sid 33333) has no row in Enrolled, and sid 44444 in Enrolled has no row in Students, so neither appears in the inner join result.
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
(S and E as above.)

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
Brown   NULL
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT OUTER JOIN Enrolled E
ON S.sid=E.sid
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
(S and E as above.)

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
SQL Joins
SELECT S.name, E.classid
FROM Students S RIGHT OUTER JOIN Enrolled E
ON S.sid=E.sid
SQL Joins
SELECT S.name, E.classid
FROM Students S ? JOIN Enrolled E
ON S.sid=E.sid
(S and E as above.)

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
Brown   NULL
SQL Joins
SELECT S.name, E.classid
FROM Students S FULL OUTER JOIN Enrolled E
ON S.sid=E.sid
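The three outer joins can be reproduced with Python's sqlite3 module. This sketch deliberately uses only LEFT OUTER JOIN, because older SQLite builds lack RIGHT and FULL OUTER JOIN: the right outer join is written by swapping the operands, and the full outer join as a left join plus the unmatched rows of E. Those rewrites are choices made for this example, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, sid TEXT)")
conn.execute("CREATE TABLE Enrolled (sid TEXT, classid TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [("Jones", "11111"), ("Smith", "22222"), ("Brown", "33333")],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?)",
    [("11111", "History105"), ("11111", "DataScience194"),
     ("22222", "French150"), ("44444", "English10")],
)

# Left outer join: every Students row survives; missing matches become NULL.
left = conn.execute(
    "SELECT S.name, E.classid FROM Students S "
    "LEFT OUTER JOIN Enrolled E ON S.sid = E.sid"
).fetchall()

# Right outer join, written as a left outer join with the operands swapped.
right = conn.execute(
    "SELECT S.name, E.classid FROM Enrolled E "
    "LEFT OUTER JOIN Students S ON S.sid = E.sid"
).fetchall()

# Full outer join: the left outer join plus the Enrolled rows with no student.
full = conn.execute(
    "SELECT S.name, E.classid FROM Students S "
    "LEFT OUTER JOIN Enrolled E ON S.sid = E.sid "
    "UNION ALL "
    "SELECT NULL, E.classid FROM Enrolled E "
    "WHERE NOT EXISTS (SELECT 1 FROM Students S WHERE S.sid = E.sid)"
).fetchall()

print(left)
print(right)
print(full)
```

SQL NULL comes back as Python None, so the unmatched rows appear as ('Brown', None) and (None, 'English10').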
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
(S and E as above.)

Result:
S.name  E.classid
Jones   History105
Smith   French150
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT SEMI JOIN Enrolled E
ON S.sid=E.sid
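Note: LEFT SEMI JOIN is not standard SQL; dialects such as HiveQL and Spark SQL provide it, and standard SQL expresses the same filter with EXISTS or IN. A semi join returns each row of S that has at least one match in E exactly once. Strictly, only S's columns are available in the SELECT list, so the E.classid column in the result above is illustrative.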
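In standard SQL a semi join is usually written with EXISTS. The sketch below, using Python's sqlite3 module and the same S and E rows, contrasts it with the inner join: Jones appears once in the semi join but twice in the inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, sid TEXT)")
conn.execute("CREATE TABLE Enrolled (sid TEXT, classid TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [("Jones", "11111"), ("Smith", "22222"), ("Brown", "33333")],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?)",
    [("11111", "History105"), ("11111", "DataScience194"),
     ("22222", "French150"), ("44444", "English10")],
)

# Semi join: keep a student if at least one enrollment exists for them.
semi = conn.execute(
    "SELECT S.name FROM Students S "
    "WHERE EXISTS (SELECT 1 FROM Enrolled E WHERE E.sid = S.sid) "
    "ORDER BY S.name"
).fetchall()

# Inner join for comparison: one output row per matching pair.
inner = conn.execute(
    "SELECT S.name FROM Students S JOIN Enrolled E ON S.sid = E.sid "
    "ORDER BY S.name"
).fetchall()

print(semi)
print(inner)
```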
What kind of Join is this?
SELECT *
FROM Students S ?? Enrolled E
S (Students):
S.name  S.sid
Jones   11111
Smith   22222

E (Enrolled):
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150

Result:
S.name  S.sid  E.sid  E.classid
Jones   11111  11111  History105
Jones   11111  11111  DataScience194
Jones   11111  22222  French150
Smith   22222  11111  History105
Smith   22222  11111  DataScience194
Smith   22222  22222  French150
SQL Joins
SELECT *
FROM Students S CROSS JOIN Enrolled E
What kind of Join is this?
SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid

(S and E as in the previous example.)

Result:
S.name  S.sid  E.sid  E.classid
Jones   11111  11111  History105
Jones   11111  11111  DataScience194
Jones   11111  22222  French150
Smith   22222  22222  French150
Theta Joins
SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid

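A theta join is a cross product filtered by an arbitrary comparison. The sketch below, using Python's sqlite3 module, runs both on the two-student example. Storing sid as INTEGER and adding ORDER BY are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, sid INTEGER)")
conn.execute("CREATE TABLE Enrolled (sid INTEGER, classid TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [("Jones", 11111), ("Smith", 22222)],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?)",
    [(11111, "History105"), (11111, "DataScience194"), (22222, "French150")],
)

# Cross join: every Students row paired with every Enrolled row (2 x 3 rows).
cross_count = conn.execute(
    "SELECT COUNT(*) FROM Students S CROSS JOIN Enrolled E"
).fetchone()[0]

# Theta join: the same pairs, kept only where the comparison holds.
theta = conn.execute(
    "SELECT S.name, S.sid, E.sid, E.classid FROM Students S, Enrolled E "
    "WHERE S.sid <= E.sid ORDER BY S.name, E.classid"
).fetchall()

print(cross_count)
for row in theta:
    print(row)
```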
Recall: Tweet JSON Format
Map of a Twitter Status Object
Raffi Krikorian <raffi@twitter.com>, 18 April 2010

{"id"=>12296272736,
 "text"=>
  "An early look at Annotations:
   http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
 "created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
 "in_reply_to_user_id"=>nil,
 "in_reply_to_screen_name"=>nil,
 "in_reply_to_status_id"=>nil,
 "favorited"=>false,
 "truncated"=>false,
 "user"=>
  {"id"=>6253282,
   "screen_name"=>"twitterapi",
   "name"=>"Twitter API",
   "description"=>
    "The Real Twitter API. I tweet about API changes, service issues and
     happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
   "url"=>"http://apiwiki.twitter.com",
   "location"=>"San Francisco, CA",
   "profile_background_color"=>"c1dfee",
   "profile_background_image_url"=>
    "http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
   "profile_background_tile"=>false,
   "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
   "profile_link_color"=>"0000ff",
   "profile_sidebar_border_color"=>"87bc44",
   "profile_sidebar_fill_color"=>"e0ff92",
   "profile_text_color"=>"000000",
   "created_at"=>"Wed May 23 06:01:13 +0000 2007",
   "contributors_enabled"=>true,
   "favourites_count"=>1,
   "statuses_count"=>1628,
   "friends_count"=>13,
   "time_zone"=>"Pacific Time (US & Canada)",
   "utc_offset"=>-28800,
   "lang"=>"en",
   "protected"=>false,
   "followers_count"=>100581,
   "geo_enabled"=>true,
   "notifications"=>false,
   "following"=>true,
   "verified"=>true},
 "contributors"=>[3191321],
 "geo"=>nil,
 "coordinates"=>nil,
 "place"=>
  {"id"=>"2b6ff8c22edd9576",
   "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
   "name"=>"SoMa",
   "full_name"=>"SoMa, San Francisco",
   "place_type"=>"neighborhood",
   "country_code"=>"US",
   "country"=>"The United States of America",
   "bounding_box"=>
    {"coordinates"=>
      [[[-122.42284884, 37.76893497],
        [-122.3964, 37.76893497],
        [-122.3964, 37.78752897],
        [-122.42284884, 37.78752897]]],
     "type"=>"Polygon"}},
 "source"=>"web"}

Annotations from the diagram:
• "id": The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc).
• "text": Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he).
• "created_at": Tweet's creation date.
• "in_reply_to_status_id": The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned.
• "in_reply_to_user_id", "in_reply_to_screen_name": The screen name & user ID of replied to tweet author.
• "truncated": Truncated to 140 characters. Only possible from SMS.
• "user": The author of the tweet. This embedded object can get out of sync.
  – "id": The author's user ID.
  – "screen_name": The author's screen name.
  – "name": The author's user name.
  – "description": The author's biography.
  – "url": The author's URL.
  – "location": The author's "location". This is a free-form text field, and there are no guarantees on whether it can be geocoded.
  – "profile_*": Rendering information for the author. Colors are encoded in hex values (RGB).
  – "created_at": The creation date for this account.
  – "contributors_enabled": Whether this account has contributors enabled (http://bit.ly/50npuu).
  – "favourites_count": Number of favorites this user has.
  – "statuses_count": Number of tweets this user has.
  – "friends_count": Number of users this user is following.
  – "time_zone", "utc_offset": The timezone and offset (in seconds) for this user.
  – "lang": The user's selected language.
  – "protected": Whether this user is protected or not. If the user is protected, then this tweet is not visible except to "friends".
  – "followers_count": Number of followers for this user.
  – "geo_enabled": Whether this user has geo enabled (http://bit.ly/4pFY77).
  – "notifications", "following": DEPRECATED in this context.
  – "verified": Whether this user has a verified badge.
• "contributors": The contributors' (if any) user IDs (http://bit.ly/50npuu).
• "geo", "coordinates": The geo tag on this tweet in GeoJSON (http://bit.ly/b8L1Cp).
• "place": The place associated with this Tweet (http://bit.ly/b8L1Cp).
  – "id": The place ID.
  – "url": The URL to fetch a detailed polygon for this place.
  – "name", "full_name": The printable names of this place.
  – "place_type": The type of this place - can be a "neighborhood" or "city".
  – "country_code", "country": The country this place is in.
  – "bounding_box": The bounding box for this place.
• "source": The application that sent this tweet.
• The diagram also marks further fields DEPRECATED; the extraction does not preserve which ones.
Normalization
Raw twitter data storage is very inefficient because, e.g. user
records are repeated with every tweet by that user.
Normalization is the process of minimizing data redundancy.

Tweets:
Tweet id  User id  Location id  Body
11        111      1111         I need a Jamba juice
22        111      1111         Cal Soccer rules
33        111      2222         Why do we procrastinate?
44        222      3333         Close your eyes and push “go”

Users:
User.id  Name   Attribs…
111      Jones
222      Smith

Locations:
Loc.id  Name      Attribs…
1111    Berkeley
2222    Oakland
3333    Hayward
Normalization
Normalized tables include only a foreign key to the information in
another table for repeated data.
The original table is the result of inner joins between tables.

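To make the point about joins concrete, the sketch below stores the three normalized tables from this slide with Python's sqlite3 module and rebuilds the wide, denormalized view with two inner joins. The table and column names (Tweets, Users, Locations, tid, uid, lid) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (uid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Locations (lid INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE Tweets (tid INTEGER PRIMARY KEY, "
    "uid INTEGER REFERENCES Users(uid), "
    "lid INTEGER REFERENCES Locations(lid), body TEXT)"
)
conn.executemany("INSERT INTO Users VALUES (?, ?)", [(111, "Jones"), (222, "Smith")])
conn.executemany(
    "INSERT INTO Locations VALUES (?, ?)",
    [(1111, "Berkeley"), (2222, "Oakland"), (3333, "Hayward")],
)
conn.executemany(
    "INSERT INTO Tweets VALUES (?, ?, ?, ?)",
    [
        (11, 111, 1111, "I need a Jamba juice"),
        (22, 111, 1111, "Cal Soccer rules"),
        (33, 111, 2222, "Why do we procrastinate?"),
        (44, 222, 3333, 'Close your eyes and push "go"'),
    ],
)

# Each tweet stores only foreign keys; the joins pull the repeated
# user and location details back in from the lookup tables.
wide = conn.execute(
    "SELECT T.tid, U.name, L.name, T.body FROM Tweets T "
    "JOIN Users U ON T.uid = U.uid "
    "JOIN Locations L ON T.lid = L.lid "
    "ORDER BY T.tid"
).fetchall()

for row in wide:
    print(row)
```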
Aggregate Queries
Including reference counts in the lookup tables allows you to
perform aggregate queries on those tables alone:
Average age of users, most popular location,…

Tweets:
Tweet id  User id  Location id  Body
11        111      1111         I need a Jamba Juice
22        111      1111         Cal Soccer rules
33        111      2222         Why do we procrastinate?
44        222      3333         Close your eyes and push “go”

Users:
U.id  Name   Count  Attr…
111   Jones  3
222   Smith  1

Locations:
L.id  Name      Count  Attr…
1111  Berkeley  2
2222  Oakland   1
3333  Hayward   1
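The reference counts in the lookup tables are just per-key tallies of the foreign keys in the tweet table. A minimal sketch in plain Python, using the four tweets above as (tweet id, user id, location id) tuples:

```python
from collections import Counter

# (tweet id, user id, location id) for the four tweets on the slide.
tweets = [(11, 111, 1111), (22, 111, 1111), (33, 111, 2222), (44, 222, 3333)]

# Reference counts: how many tweets point at each user / location.
user_counts = Counter(uid for _tid, uid, _lid in tweets)
loc_counts = Counter(lid for _tid, _uid, lid in tweets)

# With the counts stored in the lookup table, "most popular location"
# needs no scan of the (much larger) tweet table.
most_popular = loc_counts.most_common(1)[0]

print(dict(user_counts))
print(dict(loc_counts))
print(most_popular)
```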
SQL Query Semantics
Semantics of an SQL query are defined in terms of the
following conceptual evaluation strategy:
1. do FROM clause: compute cross-product of tables (e.g.,
Students and Enrolled).
2. do WHERE clause: Check conditions, discard tuples
that fail. (i.e., “selection”).
3. do SELECT clause: Delete unwanted fields. (i.e.,
“projection”).
4. If DISTINCT specified, eliminate duplicate rows.
Probably the least efficient way to compute a query!
– An optimizer will find more efficient strategies to get
the same answer.
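The four conceptual steps can be written out literally in plain Python. This sketch evaluates SELECT [DISTINCT] S.name FROM Students S, Enrolled E WHERE S.sid = E.sid over small in-memory lists; the function name and data layout are made up for the illustration, and a real DBMS would not evaluate the query this way.

```python
from itertools import product

students = [
    {"name": "Jones", "sid": 11111},
    {"name": "Smith", "sid": 22222},
    {"name": "Brown", "sid": 33333},
]
enrolled = [
    {"sid": 11111, "classid": "History105"},
    {"sid": 11111, "classid": "DataScience194"},
    {"sid": 22222, "classid": "French150"},
    {"sid": 44444, "classid": "English10"},
]

def conceptual_query(distinct=False):
    # 1. FROM: cross product of the two tables (3 x 4 = 12 pairs).
    crossed = list(product(students, enrolled))
    # 2. WHERE: discard the pairs that fail the condition ("selection").
    selected = [(s, e) for s, e in crossed if s["sid"] == e["sid"]]
    # 3. SELECT: keep only the wanted field ("projection").
    projected = [(s["name"],) for s, _e in selected]
    # 4. DISTINCT: drop duplicate rows, keeping the first occurrence.
    if distinct:
        projected = list(dict.fromkeys(projected))
    return projected

multiset_result = conceptual_query()
distinct_result = conceptual_query(distinct=True)
print(multiset_result)
print(distinct_result)
```

Without DISTINCT the result is a multiset, so Jones appears once per enrollment.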
Data Model (Tabular)
• SQLite
– Table: fixed number of named columns of specified
type
– 5 storage classes for columns
• NULL
• INTEGER
• REAL
• TEXT
• BLOB
– Data stored on disk in a single file in row-major order
– Operations performed via sqlite3 shell
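The five storage classes can be inspected with SQLite's typeof() function, here called through Python's sqlite3 module rather than the shell. The second half of the sketch shows that in an ordinary (non-STRICT) SQLite table the declared column type acts as a preference rather than a hard constraint; the table name t is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One literal per storage class.
classes = conn.execute(
    "SELECT typeof(NULL), typeof(42), typeof(3.5), typeof('abc'), typeof(x'00ff')"
).fetchone()
print(classes)

# Column types are affinities: text that looks like a number is converted,
# other text is stored as-is even in an INTEGER column.
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [("12",), ("abc",)])
stored = [row[0] for row in conn.execute("SELECT typeof(n) FROM t ORDER BY rowid")]
print(stored)
```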

Reductions and GroupBy
• One of the most common operations on Data Tables is
aggregation or reduction (count, sum, average, min, max,…).
• They provide a means to see high-level patterns in the data,
to make summaries of it etc.
• You need ways of specifying which columns are being
aggregated over, which is the role of a GroupBy operator.
SID  Name   Course    Semester  Grade  GPA
111  Jones  Stat 134  F13       A      4.0
111  Jones  CS 162    F13       B-     2.7
222  Smith  EE 141    S14       B+     3.3
222  Smith  CS 162    F14       C+     2.3
222  Smith  CS 189    F14       A-     3.7

SELECT SID, Name, AVG(GPA)
FROM Students
GROUP BY SID, Name

Result:
SID  Name   GPA
111  Jones  3.35
222  Smith  3.1

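The GroupBy example can be run with Python's sqlite3 module. This sketch groups by both SID and Name so that every selected non-aggregate column is a grouping column, and rounds the averages in Python only to keep the printed values tidy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students "
    "(SID INTEGER, Name TEXT, Course TEXT, Semester TEXT, Grade TEXT, GPA REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?, ?)",
    [
        (111, "Jones", "Stat 134", "F13", "A", 4.0),
        (111, "Jones", "CS 162", "F13", "B-", 2.7),
        (222, "Smith", "EE 141", "S14", "B+", 3.3),
        (222, "Smith", "CS 162", "F14", "C+", 2.3),
        (222, "Smith", "CS 189", "F14", "A-", 3.7),
    ],
)

# One output row per (SID, Name) group; AVG reduces the GPA column in each group.
result = conn.execute(
    "SELECT SID, Name, AVG(GPA) FROM Students GROUP BY SID, Name ORDER BY SID"
).fetchall()
rounded = [(sid, name, round(avg, 2)) for sid, name, avg in result]
print(rounded)
```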
Pandas/Python
• Series: a named, ordered dictionary
– The keys of the dictionary are the indexes
– Built on NumPy’s ndarray
– Values can be any Numpy data type object

• DataFrame: a table with named columns
– Represented as a Dict (col_name -> series)
– Each Series object represents a column
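The "dict of columns" idea can be sketched without Pandas itself. The toy below is not the Pandas API: plain lists stand in for Series, and filter_rows is a hypothetical helper written for this illustration. It shows how a row filter and a column aggregate look on a column-oriented table.

```python
# A toy column-oriented table: column name -> list of values
# (each list plays the role of a Series).
frame = {
    "name": ["Jones", "Smith", "Smith"],
    "age": [18, 18, 19],
    "gpa": [3.4, 3.2, 3.8],
}

def filter_rows(table, predicate):
    """Keep the rows for which predicate(row_as_dict) is true."""
    n_rows = len(next(iter(table.values())))
    keep = [
        i for i in range(n_rows)
        if predicate({col: values[i] for col, values in table.items()})
    ]
    return {col: [values[i] for i in keep] for col, values in table.items()}

# Filter: rows with age == 18. Aggregate: mean of one whole column.
age18 = filter_rows(frame, lambda row: row["age"] == 18)
mean_gpa = sum(frame["gpa"]) / len(frame["gpa"])

print(age18)
print(mean_gpa)
```

Aggregating a single column touches only that column's list, which is one reason column-wise layouts suit analytical work.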

Operations
• map() functions
• filter (apply predicate to rows)
• sort/group by
• aggregate: sum, count, average, max, min
• Pivot or reshape
• Relational:
– union, intersection, difference, cartesian product
(CROSS JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join,
etc.
– rename
Pandas vs SQL
+ Pandas is lightweight and fast.
+ Full SQL expressiveness plus the expressiveness of
Python, especially for function evaluation.
+ Integration with plotting functions like Matplotlib.

- Tables must fit into memory.
- No post-load indexing functionality: indices are built when a table is created.
- No transactions, journaling, etc.
- Large, complex joins probably slower.
Jacobs Update
• Room 310 not ready, but other rooms are. For
the next few (?) weeks, we will meet:
• Mondays in 155 Donner
• Wednesdays in 110/120 Jacobs Hall

Starting this Weds.


5 min break
The other view of tables: OLAP
• OnLine Analytical Processing
• Conceptually like an n-dimensional spreadsheet (Cube)
• (Discrete) columns become dimensions
• The goal is live interaction with numerical data for business
intelligence
The other view of tables: OLAP
From a table to a cube:

name   classid         Semester  Grade  Units
Jones  History105      F13       3.3    4.0
Jones  DataScience194  S12       4.0    3.0
Jones  French150       F14       3.7    4.0
Smith  History105      S15       2.3    3.0
Smith  DataScience194  F14       2.7    3.0
Smith  French150       F13       3.0    4.0
From tables to OLAP cubes
From a table to a cube:
• name, classid, Semester: variables used as qualifiers (in WHERE, GROUP BY clauses); normally discrete
• Grade, Units: variables we want to measure; normally numeric
Constructing OLAP cubes
name   classid         Semester  Grade  Units
Jones  History105      F13       3.3    4.0
Jones  DataScience194  S12       4.0    3.0
…      …               …         …      …

Cube dimensions: Name, Classid, Semester
Cube values: cell contents are Grade, Unit values
Queries on OLAP cubes
• Once the cube is defined, it's easy to do aggregate queries by
projecting along one or more axes.
• E.g. to get student GPAs, we project the Grade field onto the
student (Name) axis.
• In fact, such aggregates are precomputed and maintained
automatically in an OLAP cube, so queries are instantaneous.
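A cube can be mimicked in plain Python as a dictionary keyed by the three dimensions. The sketch below projects onto the Name axis and takes one slice. Treating GPA as a units-weighted average of Grade is an assumption made for this example, and unlike a real OLAP engine nothing here is precomputed.

```python
from collections import defaultdict

# cube[(name, classid, semester)] = (grade, units)
cube = {
    ("Jones", "History105", "F13"): (3.3, 4.0),
    ("Jones", "DataScience194", "S12"): (4.0, 3.0),
    ("Jones", "French150", "F14"): (3.7, 4.0),
    ("Smith", "History105", "S15"): (2.3, 3.0),
    ("Smith", "DataScience194", "F14"): (2.7, 3.0),
    ("Smith", "French150", "F13"): (3.0, 4.0),
}

def gpa_by_name(cells):
    """Project onto the Name axis: aggregate over classid and semester."""
    points = defaultdict(float)
    units = defaultdict(float)
    for (name, _classid, _semester), (grade, unit) in cells.items():
        points[name] += grade * unit
        units[name] += unit
    return {name: points[name] / units[name] for name in points}

# Slice: fix the Semester dimension at F13.
slice_f13 = {key: value for key, value in cube.items() if key[2] == "F13"}

gpas = gpa_by_name(cube)
print(gpas)
print(sorted(slice_f13))
```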

OLAP
• Slicing: fixing one or more variables
• Dicing: selecting a range of values for one or more variables
OLAP
• Drilling Up/Down (change levels of a hierarchically-indexed variable)
• Pivoting: produce a two-axis view for viewing as a spreadsheet.
Outline
• To support real-time querying, OLAP DBs store aggregates
of data values along many dimensions.
• This works best if axes can be tree-structured. E.g. time can
be expressed as a hierarchy:
hour → day → week → month → year
OLAP tradeoffs
• Aggregates increase space and the cost of updates.
• On the other hand, since they are projections of data, or
tree structures, the storage overhead can be small.
• Aggregates are limited, but cover a lot of common cases:
avg, stdev, min, max.
• Operations (slice, dice, pivot, etc.) are conceptually simpler
than SQL, but cover a lot of common cases.
• Good integration with clients, e.g. spreadsheets, for visual
interaction, although there is an underlying query
language (MDX).
Numpy/Matlab and OLAP
• Numpy and Matlab have an efficient implementation of nd-
arrays for dense data.
• Indices must be integer, but you can implement general
indices using dictionaries from indexval->int.
• Slicing and dicing are available using index ranges:
a[5,1:3,:] etc.
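• Roll-up/drill-down involve aggregates along dimensions, such as
a[3,4:6,:].sum(axis=1) in Numpy (the integer index 3 drops that dimension, so the slice is 2-d and the valid axes are 0 and 1)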
• Pivoting involves index permutations (.transpose()) and
aggregation over the other indices.
• Limitation: MATLAB and Numpy currently only support dense
nd-arrays (or sparse 2d arrays).
What’s Wrong with Tables?

• Too limited in structure?


• Too rigid?
• Too old fashioned?
What’s Wrong with (RDBMS) Tables?
• Indices: Typical RDBMS table storage is mostly indices
– Can't afford this overhead for large datastores
• Transactions:
– Safe state changes require journals etc., and are slow
• Relations:
– Checking relations adds further overhead to updates
• Sparse Data Support:
– RDBMS Tables are very wasteful when data is very sparse
– Very sparse data is common in modern data stores
– RDBMS tables might have dozens of columns, modern data
stores might have many thousands.
RDBMS tables – row based
Table:

sid name login age gpa


53831 Jones jones@cs 18 3.4
53831 Smith smith@ee 18 3.2

Represented as:
53831 Jones jones@cs 18 3.4

53831 Smith smith@ee 18 3.2


Tweet JSON Format
(The same Twitter status object as shown earlier: a deeply nested record with embedded "user" and "place" objects and several nil or optional fields.)
RDBMS tables – row based
Table:
ID name login loc locid LAT LONG ALT State
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL
53831 Smith smith@ee NULL NULL NULL NULL NULL NULL
55541 Brown brown@ee NULL NULL NULL NULL NULL NULL

Represented as:
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL

53831 Smith smith@ee NULL NULL NULL NULL NULL NULL

55541 Brown brown@ee NULL NULL NULL NULL NULL NULL


Column-based store
Table:
ID name login loc locid LAT LONG ALT State
52841 Jones jones@cs Albany 2341 38.4 122.7 100 CA
53831 Smith smith@ee NULL NULL NULL NULL NULL NULL
55541 Brown brown@ee NULL NULL NULL NULL NULL NULL

Represented as column (key-value) stores:


ID     name       ID     login       ID     loc       ID     locid
52841  Jones      52841  jones@cs    52841  Albany    52841  2341
53831  Smith      53831  smith@ee
55541  Brown      55541  brown@ee

ID     LAT        ID     LONG
52841  38.4       52841  122.7
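The storage saving on sparse data can be sketched in plain Python: a row store keeps a cell for every column of every row, while a column store keeps one (ID, value) entry per non-NULL value. The helper to_columns is a hypothetical name used only in this illustration.

```python
COLUMNS = ["ID", "name", "login", "loc", "locid", "LAT", "LONG", "ALT", "State"]
raw = [
    (52841, "Jones", "jones@cs", "Albany", 2341, 38.4, 122.7, 100, "CA"),
    (53831, "Smith", "smith@ee", None, None, None, None, None, None),
    (55541, "Brown", "brown@ee", None, None, None, None, None, None),
]

# Row store: one full record per row, NULLs included.
rows = [dict(zip(COLUMNS, record)) for record in raw]

def to_columns(records, key="ID"):
    """Build one {ID: value} map per column, skipping NULL (None) values."""
    columns = {}
    for record in records:
        for col, value in record.items():
            if col != key and value is not None:
                columns.setdefault(col, {})[record[key]] = value
    return columns

columns = to_columns(rows)

row_cells = len(rows) * len(COLUMNS)                     # cells held by the row store
stored_values = sum(len(col) for col in columns.values())  # entries in the column store
print(row_cells, stored_values)
print(columns["loc"])
```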

NoSQL Storage Systems

Column-Family Stores (Cassandra)
A column-family groups data columns together, and is
analogous to a table (and similar to Pandas DataFrame)
Static column family (Apache Cassandra): columns fixed.

Dynamic column family (Cassandra): can add or remove columns from a dynamic column family.
CouchDB Data Model (JSON)
• “With CouchDB, no schema is enforced, so new document
types with new meaning can be safely added alongside
the old.”
• A CouchDB document is an object that consists of named
fields. Field values may be:
– strings, numbers, dates,
– ordered lists, associative maps
"Subject": "I like Plankton"
"Author": "Rusty"
"PostedDate": "5/23/2006"
"Tags": ["plankton", "baseball", "decisions"]
"Body": "I decided today that I don't like baseball. I like plankton."
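The practical effect of "no schema is enforced" can be sketched with Python's json module. This is not the CouchDB API: the documents are simply parsed into dictionaries, and the second document, with its "Rating" field, is invented for the example to show two document shapes living side by side.

```python
import json

raw_docs = [
    '{"Subject": "I like Plankton", "Author": "Rusty", "PostedDate": "5/23/2006", '
    '"Tags": ["plankton", "baseball", "decisions"], '
    '"Body": "I decided today that I don\'t like baseball. I like plankton."}',
    # A hypothetical second document with a different set of fields.
    '{"Author": "Rusty", "Rating": 4}',
]
docs = [json.loads(text) for text in raw_docs]

# Queries must tolerate missing fields, since no schema guarantees them.
tagged = [d["Subject"] for d in docs if "plankton" in d.get("Tags", [])]
all_fields = sorted({field for d in docs for field in d})

print(tagged)
print(all_fields)
```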

Key-value stores
• A key-value store is an even simpler approach.
• It implements storage and retrieval of (key,value) pairs.
• i.e. Basic functionality is that of a dictionary:
age["john"] = 25
• But some KV-stores also implement sorting and
indexing with the keys (e.g. leveldb).
• You can build either column-based or row-based DBs
on top of such KV-stores to optimize performance (e.g.
omitting indices or ACID qualities).
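A minimal in-memory sketch of a key-value store that also keeps its keys sorted, so it can answer range scans as well as point lookups. The class and method names are invented for this illustration and are not leveldb's API; real stores also handle persistence, which is omitted here.

```python
import bisect

class SortedKV:
    """Toy key-value store with ordered keys (in memory only)."""

    def __init__(self):
        self._keys = []   # all keys, kept in sorted order
        self._data = {}   # key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)  # insert while preserving order
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def scan(self, lo, hi):
        """Return (key, value) pairs with lo <= key < hi, in key order."""
        start = bisect.bisect_left(self._keys, lo)
        stop = bisect.bisect_left(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[start:stop]]

age = SortedKV()
age.put("john", 25)
age.put("alice", 31)
age.put("jane", 22)

print(age.get("john"))
print(age.scan("j", "k"))  # every key that starts with "j"
```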

Pig
• Started at Yahoo! Research
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
• Schema is optional
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc)
– Easy to plug in Java functions
An Example Problem
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

Dataflow:
• Load Users
• Load Pages
• Filter by age
• Join on name
• Group on url
• Count clicks
• Order by clicks
• Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt


In MapReduce

[Figure: the same task written as a MapReduce program]


In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by
    age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
    COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';
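For comparison, the same dataflow on small in-memory lists in plain Python. This is only an illustration of the steps: it is not how Pig executes the script, the sample users and pages are made up, and the set-based join assumes user names are unique.

```python
from collections import Counter

# Made-up sample data: (name, age) and (user, url).
users = [("amy", 22), ("bob", 30), ("cat", 19)]
pages = [
    ("amy", "a.com"), ("amy", "b.com"), ("bob", "a.com"),
    ("cat", "a.com"), ("cat", "c.com"), ("cat", "a.com"),
]

# Filter by age: keep the names of users aged 18 to 25.
filtered = {name for name, age in users if 18 <= age <= 25}

# Join on name: keep the page visits made by a filtered user.
joined = [(user, url) for user, url in pages if user in filtered]

# Group on url and count clicks.
clicks = Counter(url for _user, url in joined)

# Order by clicks (descending) and take the top 5.
top5 = clicks.most_common(5)
print(top5)
```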



Hive
• Developed at Facebook
• Relational database built on Hadoop
– Maintains table schemas
– SQL-like query language (which can also call
Hadoop Streaming scripts)
– Supports table partitioning,
complex data types, sampling,
some query optimization
• Used for most Facebook jobs
– Less than 1% of daily jobs at Facebook use
MapReduce directly!!! (SQL – or PIG – wins!)
– Note: Google also has several SQL-like systems in use.
Summary
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures

Wednesday come to 110/120 Jacobs Hall for Pandas Lab
