Big Data
Introduction
Big data is a broad term for data sets so large or complex that traditional data processing
applications are inadequate. Challenges include analysis, capture, data curation, search, sharing,
storage, transfer, visualisation, and information privacy. The term often refers simply to the use of
predictive analytics or certain other advanced methods to extract value from data, and seldom to a
particular size of data set. Accuracy in big data may lead to more confident decision making, and
better decisions can mean greater operational efficiency, cost reductions and reduced risk.
Analysis of data sets can find new correlations, to "spot business trends, prevent diseases, combat
crime and so on." Scientists, practitioners of media and advertising and governments alike regularly
meet difficulties with large data sets in areas including Internet search, finance and business
informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,
connectomics, complex physics simulations, and biological and environmental research.
Data sets grow in size in part because they are increasingly being gathered by cheap and numerous
information-sensing mobile devices, aerial sensors (remote sensing), software logs, cameras,
microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's
technological per-capita capacity to store information has roughly doubled every 40 months since
the 1980s; as of 2012, 2.5 exabytes of data were created every day. The challenge for large
enterprises is determining who should own big data initiatives that straddle the entire organisation.
Work with big data is nevertheless uncommon; most analysis is of "PC size" data, on a desktop PC or
notebook that can handle the available data set.
Relational database management systems and desktop statistics and visualisation packages often
have difficulty handling big data. The work instead requires "massively parallel software running on
tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on
the capabilities of the users and their tools, and expanding capabilities make Big Data a moving
target. Thus, what is considered to be "Big" in one year will become ordinary in later years. "For
some organisations, facing hundreds of gigabytes of data for the first time may trigger a need to
reconsider data management options. For others, it may take tens or hundreds of terabytes before
data size becomes a significant consideration."
Definition
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to
capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a
constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
Big data is a set of techniques and technologies that require new forms of integration to uncover
large hidden values from large datasets that are diverse, complex, and of a massive scale.
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney
defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume
(amount of data), velocity (speed of data in and out), and variety (range of data types and sources).
Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In
2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high
variety information assets that require new forms of processing to enable enhanced decision
making, insight discovery and process optimisation." Additionally, some organisations add a fourth
V, "Veracity", to describe the trustworthiness of the data.
While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters
a sharper distinction between big data and Business Intelligence, regarding data and their use:
A. Business Intelligence applies descriptive statistics to data with high information density to
measure things, detect trends, etc.;
B. Big data uses inductive statistics and concepts from nonlinear system identification to infer
laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low
information density, in order to reveal relationships and dependencies and to predict outcomes
and behaviours.
A more recent, consensual definition states that "Big Data represents the Information assets
characterised by such a High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value".
Characteristics
Big data can be described by the following characteristics:
Volume
The quantity of data generated is central here: it is the size of the data set that determines its
potential value and whether it can actually be considered Big Data at all. The name Big Data itself
refers to this characteristic of size.
Variety
The next aspect of Big Data is its variety, i.e. the category or type to which the data belongs.
Knowing this helps the analysts who work closely with the data to use it effectively to their
advantage.
Velocity
Velocity in this context refers to the speed at which data is generated, and how fast it must be
processed to meet the demands and challenges that lie ahead.
Variability
Variability refers to the inconsistency the data can show at times, which can be a problem for those
who analyse the data and hampers the ability to handle and manage it effectively.
Veracity
The quality of the data being captured can vary greatly. Accuracy of analysis depends on the
veracity of the source data.
Complexity
Data management can become a very complex process, especially when large volumes of data come
from multiple sources. These data need to be linked, connected and correlated in order to extract
the information they are supposed to convey. This is termed the complexity of Big Data.
Factory work and cyber-physical systems may follow a "6C system" (connection, cloud, cyber,
content/context, community, customisation).
In this scenario and in order to provide useful insight to the factory management and gain correct
content, data has to be processed with advanced tools (analytics and algorithms) to generate
meaningful information. Considering the presence of visible and invisible issues in an industrial
factory, the information generation algorithm has to be capable of detecting and addressing invisible
issues such as machine degradation, component wear, etc. in the factory floor.
Day 3 - 5
Hadoop
Introduction
Apache Hadoop was born out of a need to process an avalanche of big data. The web was
generating more and more information on a daily basis, and it was becoming very difficult to index
over one billion pages of content. In order to cope, Google invented a new style of data processing
known as MapReduce. A year after Google published a white paper describing the MapReduce
framework, Doug Cutting and Mike Cafarella, inspired by the white paper, created Hadoop to apply
these concepts to an open-source software framework to support distribution for the Nutch search
engine project. Given this original use case, Hadoop was designed with a simple write-once storage
infrastructure.
Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries
for a huge variety of tasks that all share the common theme of high variety, volume and velocity
of data, both structured and unstructured. It is now widely used across industries, including
finance, media and entertainment, government, healthcare, information services, retail, and other
industries with big data requirements but the limitations of the original storage infrastructure
remain.
Hadoop is increasingly becoming the go-to framework for large-scale, data-intensive deployments.
Hadoop is built to process large amounts of data, from terabytes to petabytes and beyond. With this
much data, it is unlikely that the data set would fit on a single computer's hard drive, much less in
memory.
The beauty of Hadoop is that it is designed to efficiently process huge amounts of data by
connecting many commodity computers together to work in parallel. Using the MapReduce model,
Hadoop can take a query over a dataset, divide it, and run it in parallel over multiple nodes.
Distributing the computation solves the problem of having data that is too large to fit onto a single
machine.
Hadoop Software
The Hadoop software stack introduces entirely new economics for storing and processing data at
scale. It gives organisations unparalleled flexibility in how they are able to leverage data of all
shapes and sizes to uncover insights about their business. Users can now deploy the complete
hardware and software stack, including the OS and Hadoop software, across the entire cluster and
manage the full cluster through a single management interface.
Apache Hadoop includes a Distributed File System (HDFS), which breaks up input data and stores
data on the compute nodes. This makes it possible for data to be processed in parallel using all of
the machines in the cluster. The Apache Hadoop Distributed File System is written in Java and runs
on different operating systems.
Hadoop was designed from the beginning to accommodate multiple file system implementations
and there are a number available. HDFS and the S3 file system are probably the most widely used,
but many others are available, including the MapR File System.
How is Hadoop Different from Past Techniques?
A. Hadoop can handle data in a very fluid way. Hadoop is more than just a faster, cheaper
database and analytics tool. Unlike databases, Hadoop doesn't insist that you structure your
data: data may be unstructured and schemaless, and users can dump their data into the framework
without needing to reformat it. By contrast, relational databases require that data be structured
and schemas be defined before storing the data.
B. Hadoop has a simplified programming model. Hadoop's simplified programming model
allows users to quickly write and test software in distributed systems. Performing computation
on large volumes of data has been done before, usually in a distributed setting, but writing
software for distributed systems is notoriously hard. By trading away some programming
flexibility, Hadoop makes it much easier to write distributed programs.
C. Because Hadoop accepts practically any kind of data, it stores information in far more diverse
formats than what is typically found in the tidy rows and columns of a traditional database.
Some good examples are machine-generated data and log data, written out in storage formats
including JSON, Avro and ORC.
D. The majority of data preparation work in Hadoop is currently being done by writing code in
scripting languages like Hive, Pig or Python.
E. Hadoop is easy to administer. Alternative high performance computing (HPC) systems allow
programs to run on large collections of computers, but they typically require rigid program
configuration and generally require that data be stored on a separate storage area network
(SAN) system. Schedulers on HPC clusters require careful administration, and program execution
there is sensitive to node failure; by contrast, a Hadoop cluster is much easier to administer.
F. Hadoop invisibly handles job control issues such as node failure. If a node fails, Hadoop makes
sure the computations are run on other nodes and that data stored on that node are recovered
from other nodes.
G. Hadoop is agile. Relational databases are good at storing and processing data sets with
predefined and rigid data models. For unstructured data, relational databases lack the agility and
scalability that is needed. Apache Hadoop makes it possible to cheaply process and analyze
huge amounts of both structured and unstructured data together, and to process data without
defining all structure ahead of time.
Hadoop Architecture
The Hadoop framework includes the following four modules:
Hadoop Common: These are Java libraries and utilities required by other Hadoop modules.
These libraries provide filesystem and OS level abstractions and contain the necessary
Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-
throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large data
sets.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems. However, the
differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is
designed to be deployed on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets. HDFS relaxes a few
POSIX requirements to enable streaming access to file system data. HDFS was originally built as
infrastructure for the Apache Nutch web search engine project.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of
a file are replicated for fault tolerance. The block size and replication factor are configurable per
file. An application can specify the number of replicas of a file. The replication factor can be
specified at file creation time and can be changed later. Files in HDFS are write-once and have
strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.
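The blocks-and-replicas scheme above can be illustrated with a small Python sketch. The 256-byte block size and node names are made up for readability; real HDFS defaults to a 128 MB block size and a replication factor of 3, and the NameNode's actual placement policy is rack-aware rather than simple round-robin:

```python
def split_into_blocks(data: bytes, block_size: int):
    # Split a file into fixed-size blocks; only the last block may be smaller
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int):
    # Assign each block's replicas to distinct DataNodes, round-robin
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 1000, block_size=256)
print([len(b) for b in blocks])  # [256, 256, 256, 232]
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3))
```

Note that only the last block (232 bytes) is smaller than the configured block size, exactly as the text describes.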
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process large
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner. The term MapReduce refers to the following two tasks that
Hadoop programs perform:
The Map Task: This is the first task, which takes input data and converts it into a set of data,
where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those data
tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically both the input and the output are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executes the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster node. The master is responsible for resource management, tracking resource consumption and
availability, and scheduling the job's component tasks on the slaves, monitoring them and
re-executing failed tasks. The slave TaskTrackers execute the tasks as directed by the master and
provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if
the JobTracker goes down, all running jobs are halted.
MapReduce Algorithm
1. Map function: Splitting and Mapping
2. Shuffle function: Merging and Sorting
3. Reduce function: Reduction
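The three stages can be simulated outside Hadoop with a small, self-contained word count in Python (illustrative only; real Hadoop jobs implement Mapper and Reducer classes in Java and run distributed):

```python
from itertools import groupby

def map_phase(lines):
    # Map: break each input line into (key, value) tuples, here (word, 1)
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: sort by key and group all values for the same key together
    pairs.sort(key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    # Reduce: combine each key's values into a smaller set of tuples
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data is everywhere"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
print(result)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

The shuffle stage is what Hadoop performs between the map and reduce tasks: it guarantees that every value for a given key reaches the same reducer.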
Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for processing by
specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a JAR file, containing the implementation of the map and
reduce functions.
Stage 2
The Hadoop job client then submits the job (JAR/executable, etc.) and configuration to the
JobTracker, which then assumes the responsibility of distributing the software/configuration to the
slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the
job client.
Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and
the output of the reduce function is stored in output files on the file system.
Pig
Introduction
Pig was initially developed at Yahoo Research around 2006 but moved into the Apache Software
Foundation in 2007. Apache Pig is a platform for analysing large data sets that consists of a high-
level language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelisation, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin,
which has the following key properties:
Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly
parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations
are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimisation opportunities. The way in which tasks are encoded permits the system to
optimise their execution automatically, allowing the user to focus on semantics rather than
efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
Components of Pig
There are two major components of Pig:
Execution Environment
There are two choices of execution environment: a local environment and a distributed environment.
A local environment is good for testing when you do not have a full distributed Hadoop
environment deployed. You tell Pig to run in the local environment when you start Pig's command
line interpreter by passing it the -x local option. You tell Pig to run in a distributed environment by
passing -x mapreduce instead. Alternatively, you can start Pig's command line interpreter without
any arguments and it will start in the distributed environment. There are three different ways to run
Pig. You can run your Pig Latin code as a script, just by passing the name of your script file to the
pig command. You can run it interactively through the grunt command line, launched by running Pig
with no script argument. Finally, you can call into Pig from within Java using Pig's embedded form.
Execution Modes
Local Mode
You invoke pig from your terminal as follows:
pig -x local
It will bring you to the grunt shell as seen below:
grunt>
Local mode means that Pig will work on files available on your local file system and store your
results in your local file system once the analysis is done.
MapReduce Mode
You invoke pig from your terminal as follows:
pig OR pig -x mapreduce
It will bring you to the grunt shell as seen below:
grunt>
MapReduce mode means that Pig will work on files available in your HDFS and store your
results in HDFS once the analysis is done.
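As an illustration, a short Pig Latin script run in either mode might look like the following sketch; the file name, delimiter and field layout are hypothetical:

```pig
-- Count hits per URL in a tab-separated log file (illustrative names)
visits  = LOAD 'visits.log' USING PigStorage('\t')
          AS (ip:chararray, url:chararray);
by_url  = GROUP visits BY url;
counts  = FOREACH by_url GENERATE group AS url, COUNT(visits) AS hits;
ordered = ORDER counts BY hits DESC;
STORE ordered INTO 'url_counts';
```

In local mode the paths refer to the local file system; in MapReduce mode the same script reads from and writes to HDFS, and Pig compiles it into a sequence of MapReduce jobs.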
Day 9-11
Hive
Introduction
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarisation, query, and analysis. Hive gives an SQL-like interface to query data stored in various
databases and file systems that integrate with Hadoop. Traditional SQL queries must be
implemented in the MapReduce Java API to execute SQL applications and queries over
distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries
(HiveQL) into the underlying Java API without the need to implement queries in the low-level Java
API. Since most data warehousing applications work with SQL-based query languages, Hive
supports easy portability of SQL-based applications to Hadoop. While initially developed by
Facebook, Apache Hive is now used and developed by other companies such as Netflix and the
Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache
Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Features
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem. It provides an SQL-like language called HiveQL with
schema on read and transparently converts queries to MapReduce, Apache Tez and Spark jobs. All
three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes,
including bitmap indexes. Other features of Hive include:
Indexing for acceleration; index types include compaction and bitmap indexes as of version 0.10,
and more index types are planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks
during query execution.
Operating on compressed data stored in the Hadoop ecosystem, using algorithms including
DEFLATE, BWT, Snappy, etc.
Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools.
Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark
jobs.
By default, Hive stores metadata in an embedded Apache Derby database, and other client/server
databases like MySQL can optionally be used.
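As an illustration, a HiveQL session over web-server logs might look like the following sketch; the table, column and path names are hypothetical:

```sql
-- Define a table over tab-separated log data (illustrative schema)
CREATE TABLE page_views (ip STRING, url STRING, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load raw data from HDFS into the table
LOAD DATA INPATH '/logs/page_views.tsv' INTO TABLE page_views;

-- Hive implicitly compiles this query into MapReduce (or Tez/Spark) jobs
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC;
```

No Java MapReduce code is written: the compiler described below turns the query into an execution plan of map and reduce stages.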
Architecture
Major components of the Hive architecture are:
Metastore
Stores metadata for each of the tables such as their schema and location. It also includes the
partition metadata which helps the driver to track the progress of various data sets distributed over
the cluster. The data is stored in a traditional RDBMS format. The metadata helps the driver keep
track of the data and is highly crucial; hence, a backup server regularly replicates the data so that it
can be retrieved in case of data loss.
Driver
Acts like a controller which receives the HiveQL statements. It starts the execution of a statement by
creating sessions, and monitors the life cycle and progress of the execution. It stores the necessary
metadata generated during the execution of a HiveQL statement. The driver also acts as a
collection point for the data or query results obtained after the reduce operation.
Compiler
Performs compilation of the HiveQL query, which converts the query to an execution plan. This
plan contains the tasks and steps needed to be performed by the Hadoop MapReduce to get the
output as translated by the query. The compiler converts the query to an Abstract syntax tree (AST).
After checking for compatibility and compile time errors, it converts the AST to a directed acyclic
graph (DAG). DAG divides operators to MapReduce stages and tasks based on the input query and
data.
Optimiser
Performs various transformations on the execution plan to produce an optimised DAG.
Transformations can be aggregated together, such as converting a pipeline of joins into a single
join, for better performance. The optimiser can also split tasks, such as applying a transformation on
data before a reduce operation, to provide better performance and scalability. However, the
transformation logic used for optimisation can be modified or replaced using another optimiser.
Executor
After compilation and optimisation, the executor executes the tasks according to the DAG. It
interacts with the job tracker of Hadoop to schedule the tasks to be run. It takes care of pipelining
the tasks by making sure that a task with dependencies is executed only after all of its prerequisites
have run.
Flume
Introduction
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data. It has a simple and flexible architecture based on streaming
data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and
recovery mechanisms. It uses a simple extensible data model that allows for online analytic
application.
Features
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically,
Flume allows users to:
Stream data
Ingest streaming data from multiple sources into Hadoop for storage and analysis.
Insulate systems
Buffer the storage platform from transient spikes, when the rate of incoming data exceeds the rate at
which data can be written to the destination.
Scale horizontally
To ingest new data streams and additional volume as needed.
Sources
Put simply, Flume sources listen for and consume events. Events can range from newline-
terminated strings in stdout to HTTP POSTs and RPC calls; it all depends on what sources the
agent is configured to use. Flume agents may have more than one source, but must have at least
one. Sources require a name and a type; the type then dictates additional configuration parameters.
On consuming an event, Flume sources write the event to a channel. Importantly, sources write to
their channels as transactions. By dealing in events and transactions, Flume agents maintain end-to-
end flow reliability. Events are not dropped inside a Flume agent unless the channel is explicitly
allowed to discard them due to a full queue.
Channels
Channels are the mechanism by which Flume agents transfer events from their sources to their
sinks. Events written to the channel by a source are not removed from the channel until a sink
removes that event in a transaction. This allows Flume sinks to retry writes in the event of a failure
in the external repository (such as HDFS or an outgoing network connection). For example, if the
network between a Flume agent and a Hadoop cluster goes down, the channel will keep all events
queued until the sink can correctly write to the cluster and close its transactions with the channel.
Channels are typically of two types: in-memory queues and durable disk-backed queues. In-
memory channels provide high throughput but no recovery if an agent fails. File or database-backed
channels, on the other hand, are durable. They support full recovery and event
replay in the case of agent failure.
Sink
Sinks give Flume agents pluggable output capability: if you need to write to a new type of
storage, just write a Java class that implements the necessary interfaces. Like sources, each sink
corresponds to a type of output: writes to HDFS or HBase, remote procedure calls to other agents,
or any number of other external repositories. Sinks remove events from the channel in transactions
and write them to output. Transactions close only when the event is successfully written, ensuring
that all events are committed to their final destination.
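The source-channel-sink pipeline described above is wired together in an agent's properties file. A minimal sketch, assuming a netcat source and a logger sink (the agent and component names a1, r1, c1 and k1 are arbitrary labels, and the port is an example):

```properties
# One agent (a1) with one source, one channel, and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-terminated strings on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory queue (high throughput, no recovery on agent failure)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events to the console
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Swapping the memory channel for a file-backed channel, or the logger sink for an HDFS sink, changes only the relevant type and its parameters; the source-channel-sink topology stays the same.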
Sqoop
Introduction
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores
such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the
EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data
from Hadoop and export it into external structured datastores. Sqoop works with relational
databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
Features
Data imports
Moves certain data from external stores and EDWs into Hadoop to optimise cost-effectiveness of
combined data storage and processing.
Load balancing
Mitigates excessive storage and processing loads by shifting them to other systems.
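A typical Sqoop import invocation might look like the following sketch; the JDBC connection string, database, table, username and target directory are placeholders:

```shell
# Import a MySQL table into HDFS (hypothetical host/db/table names)
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --table customers \
  --username dbuser -P \
  --target-dir /data/customers
```

The corresponding `sqoop export` command moves data in the other direction, from an HDFS directory back into a relational table.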
Day 19
Hadoop Configuration
Hadoop is installed and configured in pseudo-distributed mode, a distributed simulation on a
single machine. Each Hadoop daemon, such as HDFS, YARN and MapReduce, runs as a separate
Java process. This mode is useful for development.
You can find all the Hadoop configuration files in the location $HADOOP_HOME/etc/hadoop. It
is required to make changes in those configuration files according to your Hadoop infrastructure.
In order to develop Hadoop programs in Java, you have to set the Java environment variable in the
hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance,
memory allocated for the file system, memory limit for storing the data, and size of Read/Write
buffers.
Open the core-site.xml and add the following properties in between <configuration>,
</configuration> tags.
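For example, a minimal pseudo-distributed setup commonly sets the default file system URI; the host and port below are typical values for a single-node cluster, not mandatory ones:

```xml
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
```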
hdfs-site.xml
The hdfs-site.xml file contains information such as the replication factor, the namenode path, and
the datanode paths of your local file systems, i.e. the places where you want to store the Hadoop
infrastructure.
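For example, a replication factor of 1 suits a single-node setup, and the storage paths below are hypothetical; adjust them to your own directories:

```xml
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
```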
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following
properties in between the <configuration>, </configuration> tags in this file.
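For example, the auxiliary shuffle service needed by MapReduce on YARN is commonly enabled as follows:

```xml
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
```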
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop
contains a template of this file, mapred-site.xml.template. First of all, it is required to copy
mapred-site.xml.template to mapred-site.xml using the following command.
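Assuming the default file locations under $HADOOP_HOME, the copy can be done as follows:

```shell
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template \
   $HADOOP_HOME/etc/hadoop/mapred-site.xml
```

Then, to run MapReduce on YARN, add the framework property between the configuration tags:

```xml
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
```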
Day 20-28
Project
Title
Web VDA [Web Visitor Data Analytics Using Big Data Ecosystem]
Project Description
The Web VDA (Visitor Data Analytics) project was initiated to demonstrate capability in Big Data
processing and analytics to prospective clients who are looking for Big Data analytics solutions to
understand their customers' behaviour, which will help them acquire new customers and retain
existing ones. This project involves analysing log data of the web visitors.
The analytics team is interested in understanding web activity on the site, in particular the referrer
URLs used to access the website. They will get lakhs (hundreds of thousands) of records with
columns recording IP, date, timestamp, URL, page, browser used and other details related to users
who have accessed the web pages.
This project involves analysing semi-structured/unstructured data of the web visitors, which is
mainly in the form of log files. The big log data is first ingested into the Hadoop Distributed File
System using scripting and the Apache Flume utility. After ingestion, the data is cleaned and
transformed using the Apache Pig/Hive utilities.
The cleansed, structured data is then transferred to a relational database system such as MySQL.
Technologies Used
Operating System
Ubuntu 12/14.04 Server
Data Storage
HDFS (Hadoop Distributed File System), MySQL
4. Generate a report with the top 3 viewed products of the years 2012 & 2011
5. Generate a report with the top 3 IP addresses accessing the portal in the years 2012 & 2011
7. Generate a report containing all products & their view counts in descending order
8. Generate a report containing all user IPs & their hit counts in descending order