
Report

For
Academic Progress
(1140372001)
on
“Big Data: Technical Issues & Security Challenges”

Submitted by:
Kebebe Abebe
Student ID.No: B20176111W
School of Software Engineering
Beijing University of Technology

Submitted to:
Prof. Jingsha He
School of Software Engineering
Beijing University of Technology

Date: 2017/11/13
ABSTRACT

This report provides a review of the Academic Progress course, with particular regard to Big
Data. Big Data refers to amounts of data too large to be processed using traditional data
processing methods. Due to the widespread use of computing devices such as smartphones,
laptops, and wearable devices, billions of people are connected to the internet worldwide,
generating large amounts of data at a rapid rate. Data generation over the internet has
exceeded what modern computers can handle, and this explosive growth coined the term Big
Data. In addition to growth in volume, Big Data exhibits other unique characteristics,
such as velocity, variety, value and veracity. This large, rapidly growing, and varied body
of data is becoming a key basis of competition, underpinning new waves of productivity growth,
innovation and consumer surplus. However, the fast growth rate of such large data generates
numerous challenges, such as data analysis, storage, querying, inconsistency and
incompleteness, scalability, timeliness, and security. Key industry segments are heavily
represented: financial services, where data is plentiful and data investments are substantial, and
life sciences, where data usage is rapidly emerging. This report provides a brief introduction to
Big Data technology and its importance in the contemporary world, and addresses the
concepts, characteristics, architecture, management, technologies, challenges and
applications of Big Data.

i
Contents
1. INTRODUCTION ................................................................................................................. 1
1.1 Definition of Big Data ...................................................................................................... 2
1.2 Sources of Big Data.......................................................................................................... 2
1.3 Types of data .................................................................................................................... 3
1.4 Benefits of Big Data ......................................................................................................... 4

2. CHARACTERISTICS OF BIG DATA ............................................................................... 5


2.1 Data Volume .................................................................................................................... 5
2.2 Data Velocity.................................................................................................................... 6
2.3 Data Variety ..................................................................................................................... 6
2.4 Data Value ........................................................................................................................ 6
2.5 Data Veracity.................................................................................................................... 7

3. ARCHITECTURE OF BIG DATA ..................................................................................... 7


3.1 Data Sources Layer .......................................................................................................... 8
3.2 Ingestion Layer ................................................................................................................. 9
3.3 Hadoop Storage Layer .................................................................................................... 10
3.4 Hadoop Infrastructure Layer .......................................................................................... 11
3.5 Hadoop Platform Management Layer ............................................................................ 11
3.6 Security Layer ................................................................................................................ 13
3.7 Monitoring Layer ........................................................................................................... 13
3.8 Analytics Engines Layer ................................................................................................ 13
3.9 Visualization Layer ........................................................................................................ 14
3.10 Big Data Applications Layer .......................................................................................... 15

4. BIG DATA MANAGEMENT ............................................................................................ 16


4.1 Data Collection ............................................................................................................... 16
4.2 Data Processing .............................................................................................................. 17
4.3 Data Analysis ................................................................................................................. 18

4.4 Data Interpretation.......................................................................................................... 18

5. BIG DATA TECHNOLOGIES .......................................................................................... 18
5.1 Hadoop ........................................................................................................................... 19
5.2 Hadoop Components ...................................................................................................... 19
5.3 Hadoop technology works.............................................................................................. 22
5.4 Advantages of Hadoop ................................................................................................... 22
6. BIG DATA CHALLENGES ............................................................................................... 22
6.1 Privacy and security ....................................................................................................... 23
6.2 Data access and sharing of information ......................................................................... 23
6.3 Storage and processing issues ........................................................................................ 23
6.4 Analytical challenges ..................................................................................................... 23
6.5 Technical challenges ...................................................................................................... 23
6.6 Human resources and manpower ................................................................................... 24
6.7 Future challenges............................................................................................................ 24
7. APPLICATIONS OF BIG DATA ...................................................................................... 24
8. CONCLUSION .................................................................................................................... 25
REFERENCES: .......................................................................................................................... 26

List of Figures
Figure 1. Sources of Big Data ........................................................................................................ 3
Figure 2. Types of data being used in big data .............................................................................. 4
Figure 3. Five Vs Big Data Characteristics [2] .............................................................................. 5
Figure 4. Velocity of Big Data [3] ................................................................................................. 6
Figure 5. Variety of Big Data [3] ................................................................................................... 6
Figure 6. The big data architecture ................................................................................................ 7
Figure 7. The variety of data sources ............................................................................................. 8
Figure 8. Components of data ingestion layer ............................................................................. 10
Figure 9. NoSQL databases ........................................................................................................ 11
Figure 10. Big data platform architecture .................................................................................... 12
Figure 11. Search engine conceptual architecture ....................................................................... 14
Figure 12. Visualization conceptual architecture ......................................................................... 15
Figure 13. Typical ETL process framework [4] .......................................................................... 17
Figure 14. The architecture of Hadoop ........................................................................................ 20
Figure 15. MapReduce parallel programming ............................................................................. 20
Figure 16. NoSQL database typical business scenarios ............................................................... 21

List of Tables
Table 1. Legacy data sources ......................................................................................................... 8
Table 2. New age data sources - telecom industry ......................................................................... 9
Table 3. Big data typical software stack ...................................................................................... 16

1. INTRODUCTION

Recent advancements in technology have led to the generation of great quantities of data from
distinctive domains over the past 20 years. Big Data is a broad term for datasets so large in
volume or so complicated that traditional data processing applications are inadequate [1]. Beyond
its sheer volume, Big Data also possesses a number of unique characteristics that distinguish it
from traditional data. The term Big Data often refers to large amounts of data that require new
technologies and architectures to make it possible to extract value through capture and analysis,
and seldom to a particular size of dataset. Big Data is usually unstructured and requires more
time for analysis and processing. This development calls for new system architectures for data
acquisition, transmission, storage, and large-scale data processing mechanisms.

Big Data has emerged because we live in a society that makes increasing use of data-intensive
technologies. Because of the sheer size of the data, it is very difficult to perform effective
analysis using existing traditional techniques. Since Big Data is a recent and still-emerging
technology in the market, the various challenges and issues associated with adopting and
adapting to it need to be understood. The Big Data concept denotes datasets that continue to
grow so much that they become difficult to manage using existing database management
concepts and tools.

Because of its characteristics of volume, velocity, variety, value and veracity, Big Data poses
many challenges. The challenges of large-scale data management include analysis, capture,
curation, search, sharing, storage, transfer, visualization, and information privacy, among many
others. In addition to variations in the amount of data stored in different sectors, the types of
data generated and stored, i.e. encoded video, images, audio, or text/numeric information, also
differ markedly from industry to industry. The data is so enormous and generated so fast that it
does not fit the structures of regular database architectures; new, alternative approaches must
be used to process it.

In this report, the next sections address the basic concepts, characteristics, architecture,
management, technologies, challenges and applications of Big Data.

1.1 Definition of Big Data
The term “Big Data” is used in a variety of contexts with a variety of characteristics. The
following are a few definitions of Big Data.
Gartner’s definition:
“Big Data is high-volume, high-velocity and/or high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight,
decision making, and process optimization.”
Working definition:
Big Data is a collection of large datasets that cannot be processed using traditional
computing technologies and techniques in order to extract value. It is not a single technique
or a tool; rather it involves many areas of business and technology.
The ultimate goals of processing big data include:
 Using, in the data analysis undertaken, a high volume of data from a variety of sources,
including structured, semi-structured, unstructured or even incomplete data.
 Handling data sets whose size (volume), and the velocity with which they need to be
analyzed, have outpaced the current abilities of standard business intelligence tools and
methods of analysis.
1.2 Sources of Big Data
Rapid growth in the acquisition, production and use of data has been attributed to a range of
technological, societal and economic factors. Big data thus involves data produced by many
different devices and applications. Some of the sources of big data are given below:

 Black box data: a component of helicopters, airplanes, jets, etc. It captures the voices
of the flight crew, recordings from microphones and earphones, and performance
information of the aircraft.
 Social media data: social media such as Facebook and Twitter hold information and
views posted by millions of people across the globe.
 Stock exchange data: holds information about the buy and sell decisions that customers
make on shares of different companies.
 Power grid data: holds information about the power consumed by a particular node with
respect to a base station.

 Transport data: includes the model, capacity, distance and availability of a vehicle.
 Search engine data: search engines retrieve large amounts of data from different databases.

Figure 1. Sources of Big Data

1.3 Types of data


Since Big Data includes huge volume, high velocity, and an extensible variety of data, the data
in it can be classified into the following types:
1. Structured data: refers to data that is identifiable and organized in a structured way. The
most common form of structured data is a database, where specific information is stored
in columns and rows. It is machine readable and also efficiently organized for human
readers; for example, employee records in a database.
2. Semi-structured data: refers to data that does not conform to a formal structure based on
standardized data models. However, semi-structured data may contain tags or other
metadata to organize it; for example, personal data stored in an XML file.
3. Unstructured data: refers to any data that has no identifiable structure; for example,
images, videos, email, documents, text, etc.
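
The three types can be contrasted in a short sketch; the records below are invented purely for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns with a fixed schema, as in a database table.
structured = io.StringIO("id,name,salary\n1,Alice,5000\n2,Bob,4500\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no rigid schema, but tags organize the content.
semi = ET.fromstring("<person><name>Alice</name><city>Beijing</city></person>")
name = semi.find("name").text

# Unstructured: free text with no identifiable structure; any structure
# (here, naive word splitting) must be imposed by the program itself.
unstructured = "Alice emailed Bob the quarterly report on Monday."
words = unstructured.split()

print(rows[0]["name"], name, len(words))  # Alice Alice 8
```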

Figure 2. Types of data being used in big data

1.4 Benefits of Big Data


Big Data has numerous benefits for society, science and technology; its value depends on how
human beings put it to use. Some of its benefits are:
 Understanding and targeting customers
 Understanding and optimizing business processes
 Personal quantification and performance optimization
 Improving healthcare and public health
 Improving sports performance
 Improving science and research
 Optimizing machine and device performance
 Improving security and law enforcement
 Improving and optimizing cities and countries
 Financial trading

2. CHARACTERISTICS OF BIG DATA

The characteristics of Big Data are usually described by what is referred to as a multi-V model.
The three main V characteristics of big data (volume, velocity and variety) are well captured in
Gartner's definition. In this report, the 5V characteristics (Volume, Velocity, Variety, Value
and Veracity) of big data are described below.

Figure 3. Five Vs Big Data Characteristics [2]

2.1 Data Volume


Data volume measures the amount of data available to an organization, which does not
necessarily have to own all of it as long as it can access it. As data volume increases, the value
of individual data records decreases in proportion to age, type, richness, and quantity, among
other factors.

2.2 Data Velocity
Data velocity is a measure of the speed of data generation, streaming, and arithmetic
operations. E-commerce and other startups have rapidly increased the speed and richness of the
data used for different business transactions (for instance, web-site clicks). Managing data
velocity is much more than a bandwidth issue; it is also an ingest issue (extract-transform-load).
To complicate matters further, the arrival of data and its processing or analysis occur at
different speeds, as illustrated in Figure 4 [3].

Figure 4. Velocity of Big Data [3]

2.3 Data Variety


Data variety is a measure of the richness of the data representation, i.e. the different types of
data stored in the database: text, images, video, audio, etc. From an analytic perspective, it is
probably the biggest obstacle to effectively using large volumes of data.

Figure 5. Variety of Big Data [3]


2.4 Data Value
Data value measures the usefulness of data in making decisions. It has been noted that “the
purpose of computing is insight, not numbers”. Data science is exploratory and useful in getting
to know the data, but “analytic science” encompasses the predictive power of big data.

2.5 Data Veracity
Data veracity refers to the degree to which a leader trusts information in order to make a
decision. Finding the right correlations in Big Data is therefore very important for the future of
a business. However, since one in three business leaders do not trust the information they use to
reach decisions, generating trust in Big Data presents a huge challenge as the number and types
of sources grow.
3. ARCHITECTURE OF BIG DATA
A big data management architecture should be able to consume myriad data sources in a fast
and inexpensive manner. Figure 6 outlines the architecture of big data with its components in
the big data tech stack.

Figure 6. The big data architecture

3.1 Data Sources Layer
Big data begins in the data sources layer, where data sources of different volumes, velocity, and
variety vie with each other to be included in the final big data set to be analyzed. These big data
sets, also called data lakes, are pools of data that are tagged for inquiry or searched for patterns
after they are stored in the Hadoop framework. Figure 7 illustrates the various types of data
sources.

Figure 7. The variety of data sources


Traditionally, different industries designed their data-management architecture around the legacy
data sources listed in Table 1. The technologies, adapters, databases, and analytics tools were
selected to serve these legacy protocols and standards.

Legacy Data Sources


HTTP/HTTPS web services
RDBMS
FTP
JMS/MQ based services
Text / flat file /csv logs
XML data sources
IM Protocol requests

Table 1. Legacy data sources

Some of the “new age” data sources that have seen an increase in volume, velocity, or variety are
illustrated in Table 2.

New Age Data Sources


High Volume Sources
1. Switching devices data
2. Access point data messages
3. Call data record due to exponential growth in user base
4. Feeds from social networking sites
Variety of Sources
1. Image and video feeds from social Networking sites
2. Transaction data
3. GPS data
4. Call center voice feeds
5. E-mail
6. SMS
High Velocity Sources
1. Call data records
2. Social networking site conversations
3. GPS data
4. Call center - voice-to-text feeds

Table 2. New age data sources - telecom industry

3.2 Ingestion Layer


The ingestion layer loads the final relevant information, sans the noise, to the distributed Hadoop
storage layer based on multiple commodity servers. It should have the capability to validate,
cleanse, transform, reduce, and integrate the data into the big data tech stack for further
processing.

The building blocks of the ingestion layer should include components for the following:
 Identification: - involves detection of the various known data formats or assignment of
default formats to unstructured data.
 Filtration: - involves selection of inbound information relevant to the enterprise, based
on the Enterprise MDM repository.
 Validation: - involves analysis of data continuously against new MDM metadata.
 Noise Reduction: - involves cleansing data by removing the noise and minimizing
disturbances.
 Transformation: - involves splitting, converging, de-normalizing or summarizing data.

 Compression: - involves reducing the size of the data but not losing the relevance of the
data in the process. It should not affect the analysis results after compression.
 Integration: - involves integrating the final massaged data set into the Hadoop storage
layer, that is, Hadoop distributed file system (HDFS) and NoSQL databases.
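
Of these steps, compression in particular must be lossless so that analysis results are unaffected; a small round trip with a standard codec sketches the idea (the record is invented for illustration):

```python
import zlib

# Lossless compression: the data shrinks on disk or on the wire, but
# decompressing recovers it exactly, so downstream analysis is unaffected.
record = b'{"user": "alice", "clicks": 3}' * 100  # repetitive, compresses well
packed = zlib.compress(record)

assert len(packed) < len(record)          # smaller...
assert zlib.decompress(packed) == record  # ...but nothing lost
```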

Figure 8. Components of data ingestion layer

There are multiple ingestion patterns (data source-to-ingestion-layer communication) that can
be implemented based on performance, scalability, and availability requirements.
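
Put together, the building blocks above can be sketched as a chain of small functions; the format detection, validation rule, and noise filter here are invented purely for illustration:

```python
# A toy sketch of the ingestion layer's building blocks applied in order.

def identify(record):
    """Identification: tag the record with a detected format."""
    record["format"] = "json" if "payload" in record else "unknown"
    return record

def filter_relevant(records):
    """Filtration: keep only records relevant to the enterprise."""
    return [r for r in records if r["format"] != "unknown"]

def validate(record):
    """Validation: check the record against expected metadata."""
    return isinstance(record.get("payload"), dict)

def reduce_noise(record):
    """Noise reduction: drop empty fields from the payload."""
    record["payload"] = {k: v for k, v in record["payload"].items() if v}
    return record

def transform(record):
    """Transformation: summarize the payload for downstream storage."""
    return {"fields": sorted(record["payload"])}

def ingest(raw):
    records = filter_relevant([identify(r) for r in raw])
    records = [reduce_noise(r) for r in records if validate(r)]
    return [transform(r) for r in records]

raw = [
    {"payload": {"user": "alice", "clicks": 3, "note": ""}},
    {"junk": True},  # unknown format, filtered out
]
print(ingest(raw))  # [{'fields': ['clicks', 'user']}]
```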

3.3 Hadoop Storage Layer


The storage layer is usually loaded with data using a batch process. The integration component
of the ingestion layer invokes various mechanisms, such as Sqoop, MapReduce jobs, and ETL
jobs, to upload data to the distributed Hadoop storage layer (DHSL). The storage layer provides
storage patterns (communication from the ingestion layer to the storage layer) that can be
implemented based on performance, scalability, and availability requirements. The Hadoop
storage layer consists of NoSQL databases and HDFS, which are the cornerstones of the big
data storage layer.

NoSQL databases are used to store the varieties of data prevalent in the big data world,
including key-value pair, document, graph, columnar, and geospatial databases.

Figure 9. NoSQL databases

3.4 Hadoop Infrastructure Layer


The Hadoop infrastructure layer is responsible for the operation and scalability of the big data
architecture. It is based on a distributed computing model with a “share-nothing” architecture,
in which the data and the functions required to manipulate it reside together on a single node.
Its main components are bare-metal clustered workstations and virtualized cloud services.
Hadoop and HDFS can manage the infrastructure layer in a virtualized cloud environment
(on-premises as well as in a public cloud) or on a distributed grid of commodity servers over a
fast gigabit network.

3.5 Hadoop Platform Management Layer


This layer provides the tools and query languages to access the NoSQL databases and the
HDFS file system sitting on top of the Hadoop physical infrastructure layer. Figure 10 shows
how the platform layer of the big data tech stack communicates with the layers below it.

Figure 10. Big data platform architecture
The Hadoop platform management layer accesses data, runs queries, and manages the lower
layers using scripting languages like Pig and Hive.
The key building blocks of the Hadoop platform management layer are Zookeeper, Pig, Hive,
Sqoop and MapReduce.
 MapReduce simplifies the creation of processes that analyze large amounts of
unstructured and structured data in parallel.
 Hive is a data-warehouse system for Hadoop that provides the capability to aggregate
large volumes of data. This SQL-like interface increases the compression of stored data
for improved storage-resource utilization without affecting access speed.
 Pig is a scripting language that allows us to manipulate the data in the HDFS in parallel.
 Sqoop is a command-line tool that enables importing individual tables, specific columns,
or entire database files straight to the distributed file system or data warehouse.
 ZooKeeper is a coordinator for keeping the various Hadoop instances and nodes in sync
and protected from the failure of any of the nodes.

3.6 Security Layer
In this layer, security has to be implemented in a way that does not harm performance,
scalability, or functionality, and that is relatively simple to manage and maintain. To
implement a security baseline, a big data tech stack should be designed so that, at a minimum,
it does the following:
 Authenticates nodes using protocols like Kerberos
 Enables file-layer encryption
 Subscribes to a key management service for trusted keys and certificates
 Uses tools for validation during deployment of datasets
 Logs the communication between nodes, and uses distributed logging mechanisms to
trace any anomalies across layers
 Ensures all communication between nodes is secure, for example, by using Secure
Sockets Layer (SSL), TLS, and so forth.

3.7 Monitoring Layer


This layer provides tools for the storage and visualization of monitoring data. Performance is a
key consideration: monitoring itself must impose very low overhead and support high
parallelism. Open source tools like Ganglia and Nagios are widely used for monitoring big
data tech stacks.

3.8 Analytics Engines Layer


In this layer, the data loaded from various enterprise applications into the big data tech stack
is indexed and made searchable for big data analytics processing. Figure 11 shows the
conceptual architecture of the search engine layer and how it interacts with the various layers
of a big data tech stack.

Figure 11. Search engine conceptual architecture

3.9 Visualization Layer


This is the layer where visualization is incorporated as an integral part of the big data tech
stack, helping data analysts and scientists gain insights faster and increasing their ability to
look at different aspects of the data in various visual modes. Figure 12 shows the interactions
between the layers of the big data stack that allow us to harness the power of visualization
tools.

Figure 12. Visualization conceptual architecture

3.10 Big Data Applications Layer


This is the layer in which companies develop applications designed specifically to take
advantage of the unique characteristics of big data. These applications rely on huge volumes,
velocities, and varieties of data to transform the behavior of a market.

There is a wide choice of tools and products that can be used to build an application
architecture end to end. Products commonly selected by enterprises to begin their big data
journey are shown in Table 3.

Purpose                                     Products/Tools
Ingestion Layer                             Apache Flume, Storm
Hadoop Storage                              HDFS
NoSQL Databases                             HBase, Cassandra
Rules Engines                               MapReduce jobs
NoSQL Data Warehouse                        Hive
Platform Management Query Tools             MapReduce, Pig, Hive
Search Engine                               Solr
Platform Management Co-ordination Tools     ZooKeeper, Oozie
Analytics Engines                           R, Pentaho
Visualization Tools                         Tableau, QlikView, Spotfire
Big Data Analytics Appliances               EMC Greenplum, IBM Netezza, IBM PureSystems, Oracle Exalytics
Monitoring                                  Ganglia, Nagios
Data Analyst IDE                            Talend, Pentaho
Hadoop Administration                       Cloudera, DataStax, Hortonworks, IBM BigInsights
Public Cloud-Based Virtual Infrastructure   Amazon AWS & S3, Rackspace

Table 3. Big data typical software stack

4. BIG DATA MANAGEMENT

Increasing quantities of data are being collected and analyzed, producing new insights into how
people think and act, and how systems behave. This often requires innovative processing and
analysis known as big data analytics. Making use of any kind of data requires data collection,
processing, analysis and interpretation of results.

4.1 Data Collection


Big data can be acquired in myriad formats from a vast, and increasing, number of sources.
These include images, sound recordings, user click streams that measure internet activity, and
data generated by computer simulations (such as those used in weather forecasting). Key to
managing data collection is metadata, which are data about data. For example, an e-mail
automatically generates metadata containing the addresses of the sender and recipient, and the
date and time it was sent, to aid the manipulation and storage of e-mail archives. Producing
metadata for big data sets can be challenging, and may not capture all the nuances of the data.
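
For instance, the e-mail metadata described above can be separated from the message body with a standard parser; the message itself is fabricated for illustration:

```python
from email import message_from_string

# A minimal, fabricated e-mail used purely for illustration.
raw = """From: alice@example.com
To: bob@example.com
Date: Mon, 13 Nov 2017 10:00:00 +0000
Subject: Quarterly report

Body text goes here.
"""

msg = message_from_string(raw)
# The headers are the metadata; the payload is the data itself.
metadata = {h: msg[h] for h in ("From", "To", "Date")}
print(metadata["From"])  # alice@example.com
```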

4.2 Data Processing
Data may undergo numerous processes to improve quality and usability before analysis. After
recording, big data must be filtered and compressed: only relevant data should be recorded, by
means of filters that discard useless information, using specialized tools such as ETL (Extract-
Transform-Load).

Phases in ETL Process:


1. Extraction: In this phase, relevant information is extracted. To make this phase efficient,
only data sources that have changed since the last ETL run are considered.
2. Transformation: Data is transformed through the following various sub phases:
 Data analysis
 Definition of transformation workflow and mapping rules
 Verification
 Transformation
 Backflow of cleaned data
3. Loading: Finally, once the data is in the required format, it is loaded into the data
warehouse/destination.

The ETL process framework is shown in Figure 13 below.

Figure 13. Typical ETL process framework [4]

These processes can be more difficult when applied to big data. For example: it may contain
multiple data formats that are difficult to extract; require rapid real-time processing to enable the
user to react to a changing situation; or involve the linkage of different databases, which requires
data formats that are compatible with each other.

4.3 Data Analysis
Analytics are used to gain insight from data. They typically involve applying an algorithm (a
sequence of calculations) to data to find patterns, which can then be used to make predictions or
forecasts. Big data analytics encompass various inter-related techniques, including the following
examples.

 Data mining – identifies patterns by sifting through data.


 Machine learning - describes systems that learn from data.
 Simulation - can be used to model the behaviour of complex systems.
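
As a toy illustration of applying an algorithm to data to find a pattern and make a forecast, the sketch below fits a least-squares line to invented monthly figures:

```python
# Toy analytics example: fit a least-squares line to data and use the
# pattern to forecast the next value. The numbers are invented.
xs = [1, 2, 3, 4, 5]       # e.g. month number
ys = [10, 12, 14, 16, 18]  # e.g. sales in that month

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

forecast = slope * 6 + intercept  # predict month 6
print(forecast)  # 20.0
```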

4.4 Data Interpretation


For the results of analysis to be useful, they need to be interpreted and communicated.
Interpreting big data needs to take context into account, such as how the data were collected,
their quality and any assumptions made. Interpretation requires care for several reasons:
 Despite being large, a data set may still contain biases and anomalies, or exclude
behaviour not captured by the data.
 There may be limitations to the usefulness of big data analytics, which can identify
correlations (consistent patterns between variables) but not necessarily causation.
Correlations can be extremely useful for making predictions or measuring previously
unseen behaviour, provided they occur reliably.
 Techniques can be reductionist and not appropriate for all contexts.

5. BIG DATA TECHNOLOGIES

Big data technologies are important in providing more accurate analysis, which may lead to more
concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.

Some key characteristics of these technologies include:


 Accessing data stored in a variety of standard configurations.
 Relying on standard relational data access methods.
 Enabling canonical means of virtualizing data access for consumer applications.
 Employing the push-down capabilities of a wide variety of data management systems (ranging from conventional RDBMS data stores to newer NoSQL approaches) to optimize data access.
 Rapidly applying data transformations as data sets are migrated from sources to the big data target platforms.

The technologies that handle big data can be examined as two complementary classes, which are frequently deployed together:
1. Operational big data technology: systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored (e.g., MongoDB and other NoSQL databases).
2. Analytical big data technology: systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data (e.g., MPP databases and MapReduce).

Even though there are many technologies available for data management, one of the most widely
used technologies is Hadoop.

5.1 Hadoop
Hadoop is an open-source, Java-based programming framework that supports the processing and
storage of extremely large datasets in a distributed computing environment. Hadoop runs
applications using the MapReduce algorithm, where the data is processed in parallel on different
CPU nodes. Its distributed file system facilitates rapid data transfer rates among nodes and
allows the system to continue operating in case of a node failure. Hadoop can perform complete
statistical analysis for a huge amount of data.

5.2 Hadoop Components


 File System (the Hadoop Distributed File System)
 Programming Paradigm (MapReduce)

Figure 14. The architecture of Hadoop
A. MapReduce

MapReduce is a parallel programming model, devised at Google, for writing distributed applications that efficiently process large amounts of data (multi-terabyte data sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework. Figure 15 shows how data is processed with MapReduce parallel programming.

Figure 15. MapReduce parallel programming
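A toy, single-process sketch of the MapReduce model described above can help make the phases concrete (this is illustrative pure Python, not actual Hadoop code): the map phase emits key-value pairs, the shuffle groups them by key, and the reduce phase aggregates each group.

```python
# Word count in the MapReduce style, collapsed to one process. Real Hadoop
# jobs distribute these same phases across cluster nodes.
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data moves fast"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'moves': 1, 'fast': 1}
```

In a real cluster, each mapper and reducer runs on a different node, and the shuffle moves data across the network between them.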

B. Hadoop Distributed File System (HDFS)

HDFS is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. It provides high-throughput access to application data and is suitable for applications with large data sets, though it is not accessible as a logical data structure for easy data manipulation. The data stores prevalent in the big data world, including key-value pair, document, graph, columnar, and geospatial databases, are collectively referred to as NoSQL databases.

Figure 16. NoSQL database typical business scenarios

Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules:
 Hadoop Common: Java libraries and utilities required by other Hadoop modules.
 Hadoop YARN: framework for job scheduling and cluster resource management.

5.3 How Hadoop works
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

 Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 64 MB or 128 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
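The block splitting and replication steps above can be sketched on paper. The calculation below assumes the 128 MB block size and the default HDFS replication factor of 3; the 1 GB file is an invented example.

```python
# Back-of-the-envelope HDFS-style block layout: a file is divided into
# fixed-size blocks and each block is replicated to survive node failure.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the preferred block size
REPLICATION = 3                  # HDFS's default replication factor

def block_layout(file_size):
    """Return (block count, worst-case raw storage) for a file in bytes."""
    blocks = math.ceil(file_size / BLOCK_SIZE)
    storage = blocks * BLOCK_SIZE * REPLICATION
    return blocks, storage

blocks, storage = block_layout(1024 ** 3)   # a 1 GB file
print(blocks)                               # 8 blocks of 128 MB
print(storage // (1024 ** 2))               # 3072 MB of raw storage
```

The threefold storage overhead is the price HDFS pays for tolerating the hardware failures mentioned in the task list.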

5.4 Advantages of Hadoop


 The Hadoop framework allows the user to quickly write and test distributed systems.
 Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself is designed to detect and handle failures at the application layer.
 Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.
 Apart from being open source, Hadoop is compatible with all platforms since it is Java-based.

6. BIG DATA CHALLENGES

Big data challenges are real obstacles to implementation that require immediate attention; any implementation that proceeds without handling them may lead to failure of the technology and unfavorable results [5].

Big data challenges can be classified as privacy and security, data access and sharing of information, storage and processing issues, analytical challenges, human resources and manpower, technical challenges, and future challenges.

6.1 Privacy and security
Privacy and security is the most important issue with big data; it is sensitive and has conceptual, technical, and legal significance. When a person's personal information is combined with external large data sets, new facts about that person can be inferred, and such facts may be ones the person wants to keep secret from the data owner or from anyone else.

6.2 Data access and sharing of information


If data is to be used to make accurate decisions in time, it must be available in an accurate, complete, and timely manner. This makes the data management and governance process more complex, adding the need to make data open and available.
6.3 Storage and processing issues
The storage available is not enough for the large amount of data being produced by almost everything; social media sites are themselves a great contributor, along with sensor devices and other sources.
6.4 Analytical challenges
Big data brings with it some huge analytical challenges. Analyzing this huge amount of data, whether unstructured, semi-structured, or structured, requires a large technical skill set.

6.5 Technical challenges


 Fault tolerance – fault-tolerant computing is extremely hard, involving difficult algorithms.
 Quality of data – storing and collecting huge amounts of data is costly, so big data focuses on the quality of the stored data rather than on keeping very large amounts of irrelevant data, so that better results and conclusions can be drawn.
 Heterogeneous data – in big data, unstructured data covers almost every kind of data being produced: social media interactions, recorded meetings, PDF documents, fax transfers, email, and more. Converting all of this unstructured data into structured data is not feasible.

6.6 Human resources and manpower
Since big data is still in its youth as an emerging technology, organizations need to attract people with diverse new skill sets and develop those skills in individuals, which requires organizations to hold training programs.

6.7 Future challenges


 Distributed mining – many data mining techniques are not trivial to parallelize. Developing distributed versions of some methods requires a lot of practical and theoretical research to devise new methods.
 Analytics architecture – it is not yet clear what an optimal architecture for analytics systems should look like in order to deal with historic data and real-time data at the same time.
 Compression – when dealing with big data, the quantity of storage space needed is very relevant. Using compression, we may spend more time and use less space, so compression can be viewed as a transformation from time to space.
 Visualization – a main task of big data analysis is visualizing its results. Because the data is so big, it is very difficult to find user-friendly visualizations.
 Hidden big data – large quantities of useful data are getting lost because new data is largely untagged, file-based, and unstructured.
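The compression trade-off listed above (spending time to save space) can be measured directly with the standard-library zlib module. The input below is invented, highly repetitive data, so the ratio here is far better than what typical data achieves.

```python
# Demonstrate the time-for-space trade: the highest compression level is
# the slowest but yields the smallest output, and the original data stays
# fully recoverable.
import zlib

data = b"big data " * 100_000           # ~900 KB of repetitive input
packed = zlib.compress(data, level=9)   # highest level: slowest, smallest

print(len(data), len(packed))           # compressed form is much smaller
assert zlib.decompress(packed) == data  # lossless: original is recoverable
```

Lower levels (e.g., `level=1`) run faster and compress less, which is exactly the time/space dial the challenge describes.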

7. APPLICATIONS OF BIG DATA


Big data is applied in many areas. Here are some examples of big data applications:

 Smart grid case: it is crucial to manage national electric power consumption in real time and to monitor smart grid operations.
 E-health: connected health platforms are already used to personalize health services
(e.g., CISCO solution).
 Internet of Things (IoT): IoT represents one of the main markets for big data applications. Because of the high variety of objects, IoT applications are continuously evolving. Nowadays, there are various big data applications supporting logistics enterprises.

 Transportation and logistics: Many public road transport companies use RFID (Radio-Frequency Identification) and GPS to track buses and explore interesting data to improve their services.
 Political services and government monitoring: Many governments, such as those of India and the United States, mine data to monitor political trends and analyze population sentiment.
 Big Data Analytics Applications (BDAs) are a new type of software applications, which
analyze big data using massive parallel processing frameworks (e.g., Hadoop).
 Data Mining: Decision trees automatically help users understand what combination of data attributes results in a desired outcome. The structure of the decision tree reflects structure that may be hidden in the data.
 Banking: The use of customer data invariably raises privacy issues. By uncovering
hidden connections between seemingly unrelated pieces of data, big data analytics could
potentially reveal sensitive personal information.
 Marketing: Marketers have begun to use facial recognition software to learn how well
their advertising succeeds or fails at stimulating interest in their products.
 Telecom: Big data is now used in many different fields, and it plays a significant role in telecom as well.
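The decision-tree idea mentioned under Data Mining rests on choosing, at each node, the attribute split that best separates outcomes. A common criterion is information gain, sketched below in pure Python; the churn records, attribute names, and values are invented for illustration.

```python
# Information gain: the entropy reduction achieved by splitting records on
# one attribute. A decision-tree learner splits on the highest-gain
# attribute first.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target minus its entropy after splitting."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Hypothetical records: 'plan' separates the outcome perfectly while
# 'region' carries no information, so a tree would split on 'plan' first.
records = [
    {"plan": "basic", "region": "east", "churned": "yes"},
    {"plan": "basic", "region": "west", "churned": "yes"},
    {"plan": "premium", "region": "east", "churned": "no"},
    {"plan": "premium", "region": "west", "churned": "no"},
]
print(information_gain(records, "plan", "churned"))    # 1.0 (perfect split)
print(information_gain(records, "region", "churned"))  # 0.0 (no information)
```

The resulting tree structure mirrors whatever structure was hidden in the data, which is exactly the property the Data Mining bullet describes.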

8. CONCLUSION

In this report, some important concepts are covered that organizations need to analyze when estimating the significance of implementing big data technology, together with some direct challenges to the infrastructure of the technology. The availability of big data, low-cost commodity hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. The convergence of these trends means that we
have the capabilities required to analyze astonishing data sets quickly and cost-effectively for the
first time in history. These capabilities are neither theoretical nor trivial. They represent a
genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency,
productivity, revenue, and profitability. The age of Big Data is here, and these are truly
revolutionary times if both business and technology professionals continue to work together and
deliver on the promise.

REFERENCES:

[1] W. Fan and A. Bifet, "Mining big data: current status, and forecast to the future," ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, Dec. 2012.
[2] "Big Data and Five V's Characteristics," Ministry of Education, Islamic University College.
[3] R. Buyya et al., "Big Data computing and clouds: Trends and future directions."
[4] M. Padmavalli, "Big Data: Emerging Challenges of Big Data and Techniques for Handling," Nov.–Dec. 2016.
[5] F. Armour, S. Kaisler, J. A. Espinosa, and W. Money, "Issues and challenges in big data," 2013.
[6] K. H. Lee, T. W. Choi, A. Ganguly, D. I. Wolinsky, P. O. Boykin, and R. Figueiredo, "Parallel data processing with MapReduce," 2011.
[7] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications, 2013.
[8] D. Feldman, M. Schmidt, and C. Sohler, "Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering," in SODA, 2013.
[9] W. Fan and A. Bifet, "Big data mining: current status and forecast to the future."
[10] N. Sawant and H. Shah, Big Data Application Architecture Q&A: A Problem-Solution Approach. Apress.
[11] M. A. Beyer and D. Laney, "The Importance of 'Big Data': A Definition," Gartner.
