Académique Documents
Professionnel Documents
Culture Documents
INDUSTRIAL APPLICATIONS
GI.IADS5
PR. HAJJI TARIK
2021/2022
Introduction to Big Data and Hadoop
Learning Objectives
6 1
Banking
Manufacturing
5 2
Consumer Healthcare
4 3
Technology Energy
According to US Bureau of Labour Statistics, Big Data alone will fetch 11.5 million jobs
by 2026.
Traditional Decision-Making
La prise de décision repose sur ce que vous savez, ce qui, à son tour,
repose sur l'analyse des données.
Solution
Cela contribue à accélérer la prise de décision, améliorant ainsi l'avantage
concurrentiel et permettant de gagner du temps et
de l'énergie.
Case Study: Google’s Self-DrivingCar
Technical
Data
Community
Data
Personal
Data
Big Data Analytics Pipeline
Introduction to Big Data and Hadoop
What Is BigData?
Big data refers to extremely large data sets that may be analyzed computationally to
reveal patterns, trends, and associations, especially relating to human behavior
and interactions.
Volume Variety
Veracity Velocity
Web Multimedia
Logs
Social
Media
Case Study: Royal Bank of Scotland
Case Study: Royal Bank of Scotland
The case study of Royal Bank of Scotland gave the following three
things:
Improved customer
Sentiment Reduced processing
satisfaction
analysis time
Introduction to Big Data and Hadoop
Challenges of Traditional Systems (RDBMS and DWH)
DATA SIZE
Data ranges from
terabytes (10 ^12 bytes) to
exabytes (10 ^18 bytes).
GROWTH RATE
RDBMS systems are
designed for steady data
retention rather than rapid
growth.
UNSTRUCTURED DATA
Relational databases can’t
categorize unstructured
data.
Advantages of BigData
Multiple systems
● It utilizes and adds hardware or software to increase the output and storage of
data.
● Fault tolerance in Big data or Hadoop HDFS refers to the working strength of
a system in unfavorable conditions and how that system can handle such a
situation.
Data =1 Data =1
Terabyte Terabyte
System Limited
High
1 2 3
failure programming
bandwidth
complexity
Any Hadoo
solution? p
Introduction to Big Data and Hadoop
What Is Hadoop?
Scalable
Can follow both horizontal
and vertical scaling
Reliable
Stores copies of the data
on different machines Flexible
and is resistant to
hardware failure Can store huge data and
decide to use it later
Economical
Can use ordinary
computers
for data processing
Traditional Database Systemsvs. Hadoop
VS.
Data
Processing
YARN Resource
Management
Storage
Hadoop Core
Introduction to Big Data and Hadoop
Components of HadoopEcosystem
Sqoop
Cluster Resource
YARN Management
Flume
Components of HadoopEcosystem
HDFS (HADOOP DISTRIBUTED FILESYSTEM)
A storage layer
for Hadoop
Hadoop provides a
command line
interface to interact Suitable for the
with HDFS distributed
storage and
processing
A NoSQL database
or non-relational
database
Apache Spark
Components of HadoopEcosystem
HADOOP MAP-REDUCE
An alternative to
writing Map-Reduce
code
An open
source
dataflow
system
Executes queries
using
Map-Reduce
Similar to
Impala
Oozie Oozie
Coordinator Workflow
Start Engine Engine
B
Action A
Action1 C
Action2
Action3
End
Components of HadoopEcosystem
HUE (HADOOP USEREXPERIENCE)
HDInsight
Big Data Processing
a Volume
.
b Variety
.
c Velocity
.
d Veracity
.
Knowledge
Check A bank wants to process 1000 transactions per second.
Which one of the following Vs reflects this real-world use case?
2
a Volume
.
b Variety
.
c Velocity
.
d Veracity
.
c Abundance of unstructured
. data
d None of the
. above
Knowledge
Check
Why has popularity of big data increased tremendously in the recent years?
3
c Abundance of unstructured
. data
d None of the
. above
Unstructured data is growing at astronomical rates, contributing to the big data deluge that's sweeping across
enterprise data storage environments.
Knowledge
Check
What is Hadoop?
4
Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity
computers using a simple programming model.
Knowledge
Check Which of the following is a column-oriented NoSQL database that runs on
top of HDFS?
5
a MongoDB
.
b Flume
.
c Ambari
.
d HBase
.
Knowledge
Check Which of the following is a column-oriented NoSQL database that runs on
top of HDFS?
5
a MongoDB
.
b Flume
.
c Ambari
.
d HBase
.
Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable big
data store.
Knowledge
Check
Scoop is used to .
6
a Import data from relational databases to Hadoop HDFS and export from
Hadoop file system to relational databases
.
b Execute queries using Map-Reduce
.
c Enable nontechnical users to search and explore data stored in or
ingested into Hadoop and HBase
.
d Stream event data from multiple
. systems
Knowledge
Check
Scoop is used to .
6
a Import data from relational databases to Hadoop HDFS and export from
Hadoop file system to relational databases
.
b Execute queries using Map-Reduce
.
c Enable nontechnical users to search and explore data stored in or
ingested into Hadoop and HBase
.
d Stream event data from multiple
. systems
Scoop is used to import data from relational databases to Hadoop HDFS and export from Hadoop file system to
relational databases.