
SENG 501.50 Engineering Large Scale Analytics Systems


Winter 2015 Lab #3 (30 Marks)
WordCount: A Handy Tool
Emad Mohammed
Department of Electrical & Computer Engineering
University of Calgary
Winter 2015
Acknowledgment
This document was designed by Emad Mohammed and revised by Dr. Behrouz H. Far.
1. Administrative details
1.1 There is no pre-lab for this lab
1.2 Due dates
The due dates are as follows:
Lab 3 Component          Due
In-lab 3 submission      Wednesday 04-February-2015, 11:50 AM
Post-lab 3 submission    Tuesday 10-February-2015, 11:00 AM
Assignments handed in late, but no more than 24 hours after a due date, will lose 20% of
the marks. Please note that ALL lab submissions are made through the D2L dropbox.
2- In-Lab: Arrival and Departure delay statistics (Max, Min, STD, and Mean) (15 Marks)
The WordCount MapReduce job reads text files (and other file formats, e.g., CSV) and counts
how often words/tokens occur. The input to the map function is a data chunk consisting of a
number of lines (the size of the data chunk depends on the file size and the split size). Each
line contains words/tokens separated by a delimiter (e.g., tab, comma, or space).
Each mapper takes ONE data chunk as input and breaks it into words. For every word/token it
emits a key/value pair of the word/token and 1. Each reducer sums the counts (or, in this lab,
finds the Max, Min, STD, and Mean) for each word/token. Each word/token is a distinct key; the
reduce function is invoked ONCE per key and emits a single key/value pair with the word and
its sum or other statistics.
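The map, group-by-key, and reduce steps described above can be sketched as follows. This is a minimal illustration in plain Python rather than the R code the lab asks for, and it runs in memory instead of on Hadoop. The function names (`map_chunk`, `shuffle`, `reduce_counts`, `word_count`) and the sample chunks are made up for the example.

```python
from collections import defaultdict

def map_chunk(chunk, sep=None):
    """Map step: emit a (token, 1) pair for every token in one data chunk."""
    for line in chunk.splitlines():
        for token in line.split(sep):  # sep=None splits on any whitespace
            if token:                  # skip empty fields such as "a,,b"
                yield token, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: called once per distinct key, emits one (key, sum) pair."""
    return key, sum(values)

def word_count(chunks, sep=None):
    """Run map over every chunk, group by key, then reduce each group."""
    pairs = (pair for chunk in chunks for pair in map_chunk(chunk, sep))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())

# Two hypothetical data chunks, each holding one or more lines of tokens.
counts = word_count(["AA UA AA\nDL AA", "UA DL"])
print(counts)
```

Replacing `sum` in the reduce step with another aggregate (max, min, mean, standard deviation) gives the modified WordCount used in the rest of this lab.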

2-1 What to do
In this in-lab we will compute some simple statistics and answer some questions about the
airline data using a modified version of the WordCount MapReduce job. Download the sample
code and data files from the D2L tab Week 3 lecture 06. Save your data to user/biadmin/lab3
on the sandbox. Use the supplied sample code, modifying and adding to it as needed, to achieve
the following:
2-1-1 Write a MapReduce job (in R) to find out how many unique carriers, origins, and
destinations are in the given data (air.csv). View the carrier list using the R function View.
2-1-2 Write a MapReduce job (in R) to find the max departure delay per month and per
carrier (ALL carriers) and plot the results. The results should look like the following
two figures.

[Figure: Maximum Departure Delay per Month]

[Figure: Maximum Departure Delay per Carrier]

2-1-3 Write ONE MapReduce job (in R) to find the max, min, and average arrival delay
per carrier and draw a line plot of these statistics.
2-1-4 Write your own interpretation of and comments on the results and graphs for 2-1-2 and
2-1-3.
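One way to obtain several statistics per key in a single job is to have each mapper emit a small tuple of partial aggregates that the reducer can merge. The sketch below illustrates the idea in plain Python (not the R code required for submission). The records are made-up toy values, not taken from air.csv, and the standard deviation shown is the population form, which is an assumption; divide by n - 1 instead if the sample form is wanted.

```python
import math

# Hypothetical toy records: (carrier, arrival delay in minutes).
records = [("AA", 10.0), ("AA", -5.0), ("UA", 30.0), ("AA", 25.0), ("UA", 0.0)]

def map_record(record):
    """Map step: emit (carrier, partial aggregates) for one flight."""
    carrier, delay = record
    # value = (count, sum, sum of squares, min, max)
    return carrier, (1, delay, delay * delay, delay, delay)

def combine(a, b):
    """Merge two partial aggregates for the same key."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2], min(a[3], b[3]), max(a[4], b[4]))

def reduce_stats(pairs):
    """Reduce step: merge all values per key, then derive the final statistics."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    stats = {}
    for key, (n, total, total_sq, lowest, highest) in acc.items():
        mean = total / n
        variance = max(total_sq / n - mean * mean, 0.0)  # population variance
        stats[key] = {"max": highest, "min": lowest,
                      "mean": mean, "std": math.sqrt(variance)}
    return stats

stats = reduce_stats(map(map_record, records))
for carrier, values in sorted(stats.items()):
    print(carrier, values)
```

Because `combine` is associative, the same function could also serve as a combiner on the map side. The sum-of-squares formula is simple but can lose precision when delays are large relative to their spread.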
2-2 What to hand in
Write your code in a single file named lastname_wordcount_inlab3.R, with comments on the
code (some marks will be awarded for commenting), and upload it to D2L before the due date.
3- Post-Lab: Understand the travel pattern
In this post-lab we will count the frequency of every origin-to-destination airport pair in the
air.csv file (you can download this file from the D2L tab Week 3 lecture 06).

3-1 What to do
Write MapReduce jobs (in R) to find the following:
3-1-1 Top 10 Airports by total volume of flights for all destinations
It is required to rank every airport by the volume of flights that have this airport as
a destination (e.g. destination=JFK). You are asked to find the number of flights
that satisfy this condition per airport and then rank all airports according to the
calculated flight volumes. The results should be presented in a table with the top
10 airports and the calculated number of flights.
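The ranking described in 3-1-1 is a WordCount keyed on the destination airport, followed by a sort. A minimal sketch in plain Python (illustrative only, the submission must be in R), using made-up toy flights and hypothetical helper names:

```python
from collections import Counter

# Hypothetical toy flights: (origin, destination).
flights = [("JFK", "LAX"), ("ORD", "LAX"), ("LAX", "JFK"),
           ("ORD", "JFK"), ("JFK", "ORD"), ("SFO", "LAX")]

def map_destination(flight):
    """Map step: emit (destination, 1) for one flight."""
    origin, dest = flight
    return dest, 1

def reduce_sum(pairs):
    """Reduce step: total the emitted counts for each destination."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def top_n(totals, n=10):
    """Rank by descending flight count; ties are broken by airport code."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

ranking = top_n(reduce_sum(map(map_destination, flights)))
for rank, (airport, n_flights) in enumerate(ranking, start=1):
    print(f"{rank:>2}  {airport}  {n_flights}")
```

The final sort runs on the reducer output, which has one row per airport and is therefore small enough to handle outside the MapReduce job.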
3-1-2 Busy Routes per year per month
Some cities are more attractive than others, so people visit them more
frequently. It is required to plot the monthly flight volume per airport and
highlight the busiest month per airport. The results should be tabulated and a
demonstration bar plot should be provided. Calculate this only for the top 10
destination airports (one table and chart per input).
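For the monthly volumes in 3-1-2, one option is a composite key of (airport, year, month), with a second pass that picks the largest count per airport. The plain-Python sketch below illustrates this under stated assumptions: the records, the `top_airports` set (which would come from the 3-1-1 ranking), and the function names are all hypothetical, and the lab itself requires R.

```python
from collections import defaultdict

# Hypothetical toy records: (year, month, destination).
flights = [(2008, 1, "LAX"), (2008, 1, "LAX"), (2008, 2, "LAX"),
           (2008, 2, "JFK"), (2008, 3, "JFK"), (2008, 3, "JFK"),
           (2008, 1, "ORD")]
top_airports = {"LAX", "JFK"}  # assumed output of the top-10 ranking

def map_flight(flight):
    """Map step: emit ((airport, year, month), 1) for top airports only."""
    year, month, dest = flight
    if dest in top_airports:
        yield (dest, year, month), 1

def monthly_volume(flights):
    """Reduce step: sum the counts for each (airport, year, month) key."""
    volume = defaultdict(int)
    for flight in flights:
        for key, value in map_flight(flight):
            volume[key] += value
    return dict(volume)

def busiest_month(volume):
    """Pick the (year, month) with the most flights for each airport."""
    best = {}
    # Sorted iteration makes ties resolve to the earliest month.
    for (dest, year, month), n in sorted(volume.items()):
        if dest not in best or n > best[dest][1]:
            best[dest] = ((year, month), n)
    return best

volume = monthly_volume(flights)
best = busiest_month(volume)
print(best)
```

The `volume` dictionary supplies the rows for the table and the bar heights for the plot, and `best` identifies which bar to highlight for each airport.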
3-1-3 Create directed graph
A directed graph is a set of items (called vertices or nodes) that are
connected together by edges, where every edge is directed from one
vertex/node to another. When drawing a directed graph, the edges are typically
drawn as arrows indicating the direction. We can use a directed graph to
visualize the flow of flights and the possible paths between cities/airports. In this
exercise, it is required to draw a directed graph for JFK airport. The
vertices/nodes represent the airports and the link direction represents the outgoing
or the ingoing flight direction. Each node should contain the name of the airport,
the number of flights to this airport, and the distance between JFK and this
airport. The results should look like the following graph.

[Figure: Sample directed outgoing graph for JFK airport and 3 different destinations, with
the number of flights and the distance between JFK and each of the other 3 airports. Arrows
run from the node "JFK" to the nodes "LA, 100, 2500", "NY, 50, 2000", and "PH, 20, 3000".]

The nodes should be sorted such that the top node is the airport with the highest
frequency of flights between JFK and that airport. Following these requirements,
develop the directed graphs using R and MapReduce jobs as described below:
3-1-3-1 Create an outgoing directed graph with JFK as origin and other airports as
destinations
3-1-3-2 Create an ingoing directed graph with JFK as destination and other airports as
origins
3-1-3-3 Combine both graphs into one big graph joined at the JFK airport node
3-1-3-4 Write your own interpretation of and comments on the results and graphs
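Before drawing, the graphs in 3-1-3 can be reduced to an edge list with one labelled, ranked edge per neighbouring airport. The sketch below shows that aggregation in plain Python (illustrative only, the submission must be in R). The flights and distances are made-up toy values, `edges_for` is a hypothetical helper, and it assumes every flight between a given pair of airports reports the same distance.

```python
from collections import defaultdict

# Hypothetical toy records: (origin, destination, distance). Values are made up.
flights = [("JFK", "LAX", 2500), ("JFK", "LAX", 2500), ("JFK", "ORD", 700),
           ("LAX", "JFK", 2500), ("ORD", "JFK", 700), ("ORD", "JFK", 700),
           ("ORD", "LAX", 1700)]

def edges_for(hub, flights, direction):
    """Build labelled edges for one hub.

    direction="out": hub is the origin; direction="in": hub is the destination.
    Returns (from_node, to_node, label) tuples, busiest neighbour first.
    """
    counts = defaultdict(int)
    distance = {}
    for origin, dest, dist in flights:
        if direction == "out" and origin == hub:
            other = dest
        elif direction == "in" and dest == hub:
            other = origin
        else:
            continue  # flight does not touch the hub in this direction
        counts[other] += 1
        distance[other] = dist  # assumes one distance per airport pair
    ranked = sorted(counts, key=lambda airport: (-counts[airport], airport))
    edges = []
    for other in ranked:
        label = f"{other}, {counts[other]}, {distance[other]}"
        if direction == "out":
            edges.append((hub, other, label))
        else:
            edges.append((other, hub, label))
    return edges

outgoing = edges_for("JFK", flights, "out")
ingoing = edges_for("JFK", flights, "in")
combined = outgoing + ingoing  # both graphs share the JFK node
for edge in combined:
    print(edge)
```

The node labels follow the "name, flights, distance" layout of the sample graph. The resulting edge list can then be passed to a graph-drawing package of your choice.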
3-2 What to hand in
Write your code in a single file named lastname_wordcount_postlab3.R, with comments on
the code (some marks will be awarded for commenting), and upload it to D2L before the
due date.
