Lecture 3
Manipulating Tabular Data
Extract
Transform
Load
Two views of tables
First view
Key Concept: Structured Data
A data model is a collection of concepts for
describing data.
*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 377–387
Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
– Schema: specifies name of relation, plus name and
type of each column, e.g.
Students(sid: string, name: string, login: string,
age: integer, gpa: real)
– Instance: the actual data at a given time
• #rows = cardinality
• #fields = degree / arity
• A relation is a mathematical object (from set theory)
which is true for certain arguments.
• An instance defines the set of arguments for which the
relation is true (it’s a table not a row).
Ex: Instance of Students Relation
sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8
• The relation is true for these tuples and false for others
SQL - A language for Relational DBs*
SELECT *
FROM Students S
WHERE S.age=18
Name           Mortality
Socrates       Mortal
Thor           Immortal
Barney         Mortal
Blarney stone  Non-living
Basic SQL Query
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
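The SELECT / FROM / WHERE form can be tried out without a database server. The following is a minimal sketch using Python's built-in sqlite3 module and the Students instance from the earlier slide; the in-memory database and the ORDER BY clauses (added only to make the output order predictable) are assumptions of this example, not part of the lecture.

```python
import sqlite3

# In-memory database holding the Students instance shown earlier.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@eecs", 18, 3.2),
        ("53650", "Smith", "smith@math", 19, 3.8),
    ],
)

# target-list = S.sid, S.name; relation-list = Students S; qualification = S.age = 18
rows = conn.execute(
    "SELECT S.sid, S.name FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

# DISTINCT removes duplicate rows from the result.
names = conn.execute(
    "SELECT DISTINCT S.name FROM Students S ORDER BY S.name"
).fetchall()

print(rows)
print(names)
```

Without DISTINCT, the second query would return "Smith" twice, once per matching row.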
Note the previous version of this query (with no join keyword) is an “Implicit join”
SQL Inner Joins
SELECT S.name, E.classid
FROM Students S (INNER) JOIN Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Unmatched keys: S.sid 33333 (Brown), E.sid 44444 (English10)

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
Brown   NULL
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT OUTER JOIN Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
Brown   NULL
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
SQL Joins
SELECT S.name, E.classid
FROM Students S RIGHT OUTER JOIN Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
SQL Joins
SELECT S.name, E.classid
FROM Students S ? JOIN Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
Brown   NULL
SQL Joins
SELECT S.name, E.classid
FROM Students S FULL OUTER JOIN Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
Brown   NULL
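The inner, left, right, and full outer joins from the preceding slides can be reproduced in Pandas. This is a minimal sketch using DataFrame.merge with its how parameter; the DataFrame names and the helper function join are hypothetical, chosen for this illustration. Pandas marks unmatched cells as missing (NaN) where SQL shows NULL.

```python
import pandas as pd

# The two example tables from the join slides.
students = pd.DataFrame(
    {"name": ["Jones", "Smith", "Brown"], "sid": [11111, 22222, 33333]}
)
enrolled = pd.DataFrame(
    {
        "sid": [11111, 11111, 22222, 44444],
        "classid": ["History105", "DataScience194", "French150", "English10"],
    }
)


def join(how):
    """Join students with enrolled on sid and keep the two selected columns."""
    merged = students.merge(enrolled, on="sid", how=how)
    return merged[["name", "classid"]]


inner = join("inner")   # only matching sids
left = join("left")     # every student, classid missing for Brown
right = join("right")   # every enrollment, name missing for English10
full = join("outer")    # union of left and right outer joins

for label, frame in [("inner", inner), ("left", left), ("right", right), ("full", full)]:
    print(label, len(frame), "rows")
```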
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Smith   French150
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT SEMI JOIN Enrolled E
ON S.sid=E.sid
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
Brown   33333       22222   French150
                    44444   English10

Result:
S.name  E.classid
Jones   History105
Smith   French150
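A semi-join keeps each left-hand row at most once when at least one match exists on the right. Pandas has no dedicated semi-join option in merge, so the sketch below uses a membership test with Series.isin instead; this is one common idiom, not the only one. In this formulation the result contains only columns of the left table, and the anti-join shown alongside is an addition for contrast.

```python
import pandas as pd

students = pd.DataFrame(
    {"name": ["Jones", "Smith", "Brown"], "sid": [11111, 22222, 33333]}
)
enrolled = pd.DataFrame(
    {
        "sid": [11111, 11111, 22222, 44444],
        "classid": ["History105", "DataScience194", "French150", "English10"],
    }
)

# Semi-join: students with at least one enrollment, each listed once
# even though Jones matches two rows of enrolled.
has_match = students["sid"].isin(enrolled["sid"])
semi = students[has_match]

# Anti-join: students with no enrollment at all.
anti = students[~has_match]

print(semi)
print(anti)
```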
What kind of Join is this?
SELECT *
FROM Students S ?? Enrolled E
S                   E
S.name  S.sid       E.sid   E.classid
Jones   11111       11111   History105
Smith   22222       11111   DataScience194
                    22222   French150
[Figure: Map of a Twitter Status Object. An annotated tweet record whose fields include id, text, created_at, in_reply_to_user_id, in_reply_to_screen_name, in_reply_to_status_id, coordinates, place, favorited, and source, together with author fields such as description, url, location, profile rendering settings, utc_offset, lang, protected, followers_count, notifications, and contributors_enabled. Several fields are nil or marked DEPRECATED.]
Reductions and GroupBy
• One of the most common operations on Data Tables is
aggregation or reduction (count, sum, average, min, max,…).
• They provide a means to see high-level patterns in the data,
to make summaries of it etc.
• You need ways of specifying which columns are being
aggregated over, which is the role of a GroupBy operator.
SID  Name   Course    Semester  Grade  GPA
111  Jones  Stat 134  F13       A      4.0
111  Jones  CS 162    F13       B-     2.7
222  Smith  EE 141    S14       B+     3.3
222  Smith  CS 162    F14       C+     2.3
222  Smith  CS 189    F14       A-     3.7
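As an illustration of reduction with GroupBy, the following sketch loads the grade table into a Pandas DataFrame and aggregates the GPA column per student and per (student, semester) pair. The variable names are hypothetical; the data is the table from the slide.

```python
import pandas as pd

grades = pd.DataFrame(
    {
        "SID": [111, 111, 222, 222, 222],
        "Name": ["Jones", "Jones", "Smith", "Smith", "Smith"],
        "Course": ["Stat 134", "CS 162", "EE 141", "CS 162", "CS 189"],
        "Semester": ["F13", "F13", "S14", "F14", "F14"],
        "Grade": ["A", "B-", "B+", "C+", "A-"],
        "GPA": [4.0, 2.7, 3.3, 2.3, 3.7],
    }
)

# Group rows by student, then reduce the GPA column with a mean.
gpa_by_student = grades.groupby("Name")["GPA"].mean()

# Group by two columns and compute several reductions at once.
summary = grades.groupby(["Name", "Semester"])["GPA"].agg(["count", "mean", "max"])

print(gpa_by_student)
print(summary)
```

The GroupBy columns decide which rows are combined; the aggregate function decides how their values are reduced.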
Pandas/Python
• Series: a named, ordered dictionary
– The keys of the dictionary are the indexes
– Built on NumPy’s ndarray
– Values can be of any NumPy data type
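A short sketch of these properties, using student IDs from the earlier Students table as index labels. The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# A Series has a name, an ordered index (the "keys"), and NumPy-backed values.
gpa = pd.Series([3.4, 3.2, 3.8], index=["53666", "53688", "53650"], name="gpa")

by_key = gpa["53688"]         # dictionary-style lookup by index label
by_position = gpa.iloc[0]     # positional lookup, since the Series is ordered
values = gpa.to_numpy()       # the underlying data as a NumPy ndarray

# Vectorized comparison yields a boolean Series that can be used as a filter.
high = gpa[gpa > 3.3]

print(by_key, by_position, values.dtype)
print(high)
```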
Operations
• map() functions
• filter (apply predicate to rows)
• sort/group by
• aggregate: sum, count, average, max, min
• Pivot or reshape
• Relational:
– union, intersection, difference, cartesian product
(CROSS JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join,
etc.
– rename
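Most of the operations listed above have a direct Pandas counterpart. The sketch below shows one possible spelling for each on a small grade table; the variable names are hypothetical, and how="cross" assumes pandas 1.2 or newer.

```python
import pandas as pd

grades = pd.DataFrame(
    {
        "Name": ["Jones", "Jones", "Smith", "Smith", "Smith"],
        "Course": ["Stat 134", "CS 162", "EE 141", "CS 162", "CS 189"],
        "Semester": ["F13", "F13", "S14", "F14", "F14"],
        "GPA": [4.0, 2.7, 3.3, 2.3, 3.7],
    }
)

# map: apply a function to every value of a column
passed = grades["GPA"].map(lambda g: g >= 2.5)

# filter (select): keep rows that satisfy a predicate
f14 = grades[grades["Semester"] == "F14"]

# project: keep a subset of columns; rename: change column names
projected = grades[["Name", "GPA"]].rename(columns={"GPA": "GradePoints"})

# sort
ranked = grades.sort_values("GPA", ascending=False)

# pivot / reshape: one row per student, one column per semester
wide = grades.pivot_table(index="Name", columns="Semester", values="GPA", aggfunc="mean")

# set operations on two tables with the same columns
cs = grades[grades["Course"].str.startswith("CS")]
union = pd.concat([f14, cs]).drop_duplicates()
intersection = f14.merge(cs)                 # natural join on all shared columns
difference = cs[~cs.index.isin(f14.index)]   # rows of cs that are not in f14

# cartesian product (CROSS JOIN)
names = grades[["Name"]].drop_duplicates()
semesters = grades[["Semester"]].drop_duplicates()
pairs = names.merge(semesters, how="cross")

print(wide)
print(pairs)
```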
Pandas vs SQL
+ Pandas is lightweight and fast.
+ Full SQL expressiveness plus the expressiveness of
Python, especially for function evaluation.
+ Integration with plotting functions like Matplotlib.
Queries on OLAP cubes
• Once the cube is defined, it's easy to do aggregate queries by
projecting along one or more axes.
• E.g. to get student GPAs, we project the Grade field onto the
student (Name) axis.
• In fact, such aggregates are precomputed and maintained
automatically in an OLAP cube, so queries are instantaneous.
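To make the idea of projecting precomputed aggregates concrete, here is a toy sketch of a two-axis cube (Name by Semester) built with NumPy. It stores a running sum and count per cell, so a projection onto either axis only touches the aggregates and not the raw records. The layout and variable names are assumptions for illustration; real OLAP engines organize their aggregates differently.

```python
import numpy as np

names = ["Jones", "Smith"]
semesters = ["F13", "S14", "F14"]
records = [
    ("Jones", "F13", 4.0),
    ("Jones", "F13", 2.7),
    ("Smith", "S14", 3.3),
    ("Smith", "F14", 2.3),
    ("Smith", "F14", 3.7),
]

# Per-cell aggregates, maintained incrementally as records arrive.
sums = np.zeros((len(names), len(semesters)))
counts = np.zeros((len(names), len(semesters)))
for name, semester, points in records:
    i, j = names.index(name), semesters.index(semester)
    sums[i, j] += points
    counts[i, j] += 1

# Project onto the Name axis: collapse the Semester axis.
gpa_by_name = sums.sum(axis=1) / counts.sum(axis=1)

# Project onto the Semester axis: collapse the Name axis.
gpa_by_semester = sums.sum(axis=0) / counts.sum(axis=0)

print(dict(zip(names, gpa_by_name)))
print(dict(zip(semesters, gpa_by_semester)))
```

Keeping sums and counts rather than averages is what lets cells be combined correctly after the fact.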
[Figure: OLAP cube with axes including Name and Semester]
OLAP
• Slicing: fixing one or more variables
• Dicing: selecting a range of values for one or more variables
OLAP
• Drilling Up/Down: change levels of a hierarchically-indexed variable
• Pivoting: produce a two-axis view for viewing as a spreadsheet
Outline
• To support real-time querying, OLAP DBs store aggregates
of data values along many dimensions.
• This works best if axes can be tree-structured. E.g. time can
be expressed as a hierarchy:
hour → day → week → month → year
OLAP tradeoffs
• Aggregates increase space and the cost of updates.
• On the other hand, since they are projections of data, or
tree structures, the storage overhead can be small.
• Aggregates are limited, but cover a lot of common cases:
avg, stdev, min, max.
• Operations (slice, dice, pivot, etc.) are conceptually simpler
than SQL, but cover a lot of common cases.
• Good integration with clients, e.g. spreadsheets, for visual
interaction, although there is an underlying query
language (MDX).
NumPy/MATLAB and OLAP
• NumPy and MATLAB have efficient implementations of
nd-arrays for dense data.
• Indices must be integers, but you can implement general
indices using dictionaries from indexval->int.
• Slicing and dicing are available using index ranges:
a[5,1:3,:] etc.
• Roll-up/drill-down involve aggregates along dimensions, such as
np.sum(a[3,4:6,:], axis=0) in NumPy
• Pivoting involves index permutations (.transpose()) and
aggregation over the other indices.
• Limitation: MATLAB and NumPy currently only support dense
nd-arrays (or sparse 2d arrays).
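The OLAP operations above can be sketched on a small dense NumPy array. The axis meanings (student, semester, course), the array contents, and the label dictionary are hypothetical, chosen only to make the shapes easy to follow.

```python
import numpy as np

# Dense 3-d array with axes (student, semester, course); values are placeholders.
a = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

# General (non-integer) indices via a dictionary from index value to position.
student_idx = {"Jones": 0, "Smith": 1}

# Slicing: fix one variable, leaving a 2-d array of shape (3, 4).
smith = a[student_idx["Smith"], :, :]

# Dicing: select ranges on several axes, giving shape (2, 2, 2).
diced = a[:, 1:3, 0:2]

# Roll-up: aggregate away the course axis, giving shape (2, 3).
rolled = a.sum(axis=2)

# Pivot: aggregate over students, then permute the remaining axes
# to get a course-by-semester view of shape (4, 3).
pivoted = a.sum(axis=0).transpose()

print(smith.shape, diced.shape, rolled.shape, pivoted.shape)
```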
What’s Wrong with Tables?
Represented as:
53831 Jones jones@cs 18 3.4
[Figure: Map of a Twitter Status Object. An annotated tweet record whose fields include id, text, created_at, in_reply_to_user_id, in_reply_to_screen_name, in_reply_to_status_id, coordinates, place, favorited, and source, together with author fields such as description, url, location, profile rendering settings, utc_offset, lang, protected, followers_count, notifications, and contributors_enabled. Several fields are nil or marked DEPRECATED.]
Represented as:
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL
Column-Family Stores (Cassandra)
A column family groups data columns together, and is
analogous to a table (and similar to a Pandas DataFrame).
[Figure: static column family from Apache Cassandra; the columns are fixed]
Key-value stores
• A key-value store is an even simpler approach.
• It implements storage and retrieval of (key,value) pairs.
• I.e. the basic functionality is that of a dictionary:
age["john"] = 25
• But some KV-stores also implement sorting and
indexing with the keys (e.g. LevelDB).
• You can build either column-based or row-based DBs
on top of such KV-stores to optimize performance (e.g.
omitting indices or ACID qualities).
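The following toy class sketches what "a dictionary plus sorted keys" can look like. It is an in-memory illustration only: the class name SortedKVStore and its methods are hypothetical and do not correspond to the API of LevelDB or any other real store.

```python
import bisect


class SortedKVStore:
    """Toy in-memory key-value store that keeps its keys sorted for range scans."""

    def __init__(self):
        self._data = {}   # key -> value, the dictionary part
        self._keys = []   # the same keys, kept in sorted order

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]


age = SortedKVStore()
age.put("john", 25)
age.put("alice", 31)
age.put("zoe", 19)

print(age.get("john"))
print(age.scan("a", "k"))
```

Point lookups use the dictionary, while the sorted key list is what makes range scans possible.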
Pig
• Started at Yahoo! Research
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
• Schema is optional
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc)
– Easy to plug in Java functions
An Example Problem
Suppose you have user data in one file, website
data in another, and you need to find the top 5
most visited pages by
Dataflow:
• Load Users; Load Pages
• Filter by age
• Join on name
• Count clicks
• Order by clicks
• Take top 5
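The same dataflow can be sketched in Pandas, one statement per step. Everything specific here is an assumption made for illustration: the column names (name, age, user, url), the sample rows, and the age bounds, which the slide does not state.

```python
import pandas as pd

# Load Users / Load Pages (inline sample data stands in for the two files).
users = pd.DataFrame(
    {"name": ["amy", "bob", "cat", "dan"], "age": [20, 40, 23, 17]}
)
pages = pd.DataFrame(
    {
        "user": ["amy", "amy", "amy", "bob", "cat", "cat", "cat", "dan"],
        "url": ["a.com", "a.com", "b.com", "a.com", "b.com", "c.com", "a.com", "c.com"],
    }
)

# Filter by age (hypothetical bounds; the slide does not give the range).
MIN_AGE, MAX_AGE = 18, 25
filtered = users[users["age"].between(MIN_AGE, MAX_AGE)]

# Join on name.
joined = filtered.merge(pages, left_on="name", right_on="user")

# Count clicks per page.
clicks = joined.groupby("url").size().reset_index(name="clicks")

# Order by clicks and take the top 5.
top5 = clicks.sort_values("clicks", ascending=False).head(5)

print(top5)
```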