INTRODUCTION TO BIG DATA

Big data is a term used to refer to data sets that are too large or complex for traditional data-processing
application software to adequately deal with. Data with many cases (rows) offer greater statistical
power, while data with higher complexity (more attributes or columns) may lead to a higher false
discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing,
transfer, visualization, querying, updating, information privacy and data source. Big data was originally
associated with three key concepts: volume, variety, and velocity. Other concepts later attributed with
big data are veracity (i.e., how much noise is in the data) and value.

Relational database management systems, desktop statistics and software packages used to visualize
data often have difficulty handling big data. The work may require "massively parallel software running
on tens, hundreds, or even thousands of servers". What qualifies as being "big data" varies depending
on the capabilities of the users and their tools, and expanding capabilities make big data a moving target.
For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to
reconsider data management options. For others, it may take tens or hundreds of terabytes before data
size becomes a significant consideration.

HISTORY OF BIG DATA

The term has been in use since the 1990s, with some giving credit to John Mashey for popularizing the
term. Big data usually includes data sets with sizes beyond the ability of commonly used software tools
to capture, curate, manage, and process data within a tolerable elapsed time. Big data philosophy
encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen
terabytes to many exabytes of data. Big data requires a set of techniques and technologies with new
forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.

A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and
notes, "This represents a distinct and clearly defined change in the computer science used, via parallel
programming theories, and losses of some of the guarantees and capabilities made by Codd’s relational
model."

CHARACTERISTICS OF BIG DATA
Big data can be described by the following characteristics:

Volume – The quantity of generated and stored data. The size of the data determines the value and
potential insight, and whether it can be considered big data or not.

Variety – The type and nature of the data. This helps people who analyse it to effectively use the
resulting insight. Big data draws from text, images, audio and video; it can also complete missing pieces through data fusion.

Velocity – In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development. Big data is often available
in real-time.

Veracity – The quality of captured data can vary greatly, affecting the accuracy of analysis.
(Martin, 2015)

Data must be processed with advanced tools (analytics and algorithms) to reveal meaningful
information. For example, to manage a factory one must consider both visible and invisible issues with
various components. Information generation algorithms must detect and address invisible issues such
as machine degradation, component wear, etc. on the factory floor. (Jay, Edzel, Behrad, & Hung-an,
2013)

ARCHITECTURE OF BIG DATA
A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too
large or complex for traditional database systems. The threshold at which organizations enter into the
big data realm differs, depending on the capabilities of the users and their tools. For some, it can mean
hundreds of gigabytes of data, while for others it means hundreds of terabytes.

Over the years, the data landscape has changed. What you can do, or are expected to do, with data has
changed. The cost of storage has fallen dramatically, while the means by which data is collected keeps
growing. Some data arrives at a rapid pace, constantly demanding to be collected and observed. Other
data arrives more slowly, but in very large chunks, often in the form of decades of historical data. You
might be facing an advanced analytics problem, or one that requires machine learning. These are
challenges that big data architectures seek to solve.

Big data solutions typically involve one or more of the following types of workload:

 Batch processing of big data sources at rest.
 Real-time processing of big data in motion.
 Interactive exploration of big data.
 Predictive analytics and machine learning.

Consider big data architectures when you need to:

 Store and process data in volumes too large for a traditional database.
 Transform unstructured data for analysis and reporting.
 Capture, process, and analyse unbounded streams of data in real time, or with low latency.

Most big data architectures include some or all of the following components:
Data sources. All big data solutions start with one or more data sources. Examples include:

 Application data stores, such as relational databases.
 Static files produced by applications, such as web server log files.
 Real-time data sources, such as IoT devices.

Data storage. Data for batch processing operations is typically stored in a distributed file store that can
hold high volumes of large files in various formats. This kind of store is often called a data lake.

Batch processing. Because the data sets are so large, often a big data solution must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually
these jobs involve reading source files, processing them, and writing the output to new files.
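
As a toy illustration of this read-filter-aggregate-write pattern, here is a minimal Python sketch. The directory layout, CSV columns and file names are invented for the example; a real big data solution would run the same logic as a distributed, long-running job rather than a single script.

# Read source files, filter and aggregate their records, write a new output file.
import csv
import glob
from collections import defaultdict

totals = defaultdict(float)

for path in glob.glob("raw/*.csv"):                  # read source files
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["status"] != "ok":                # filter out bad records
                continue
            totals[row["customer_id"]] += float(row["amount"])   # aggregate

with open("prepared/totals.csv", "w", newline="") as f:   # write the prepared output
    writer = csv.writer(f)
    writer.writerow(["customer_id", "total_amount"])
    for customer_id, total in sorted(totals.items()):
        writer.writerow([customer_id, total])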

Real-time message ingestion. If the solution includes real-time sources, the architecture must include
a way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. This portion of a streaming architecture is often referred
to as stream buffering.
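
One common choice for such an ingestion buffer is Apache Kafka. The sketch below uses the kafka-python package; the broker address, topic name and message fields are assumptions for illustration. Producers drop messages into the buffer, and downstream stream processors consume them at their own pace, which is what enables scale-out processing and reliable delivery.

import json
from kafka import KafkaProducer

# Connect to the message broker (the address is an assumption for this sketch).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An IoT device (or its gateway) publishes a reading into the buffer.
producer.send("sensor-readings", {"device": "pump-7", "temp_c": 81.4})
producer.flush()  # block until the broker has accepted the message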

Stream processing. After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to
an output sink.
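
A minimal, hedged sketch of this capture-filter-aggregate-sink flow is the classic Spark Structured Streaming word count below. The socket source and console sink are toy stand-ins, borrowed from Spark's own examples, for a real message buffer and a durable output sink.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamSketch").getOrCreate()

lines = (spark.readStream                    # capture the real-time messages
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()       # aggregate the stream

query = (counts.writeStream                  # write results to an output sink
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()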

Analytical data store. Many big data solutions prepare data for analysis and then serve the processed
data in a structured format that can be queried using analytical tools. The analytical data store used to
serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions.
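
To make "Kimball-style" concrete, here is a small sketch of the shape such a store serves: a central fact table joined to dimension tables (a star schema). SQLite stands in for the warehouse purely so the example is self-contained, and the table and column names are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, sale_date TEXT, amount REAL);
    INSERT INTO dim_product VALUES (1, 'widgets'), (2, 'gadgets');
    INSERT INTO fact_sales VALUES (1, '2016-01-02', 9.5), (2, '2016-01-02', 4.0),
                                  (1, '2016-01-03', 7.5);
""")

# The typical analytical query shape: join facts to dimensions, then aggregate.
for row in conn.execute("""
        SELECT p.category, SUM(f.amount) AS revenue
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY revenue DESC"""):
    print(row)   # ('widgets', 17.0) then ('gadgets', 4.0)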

Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyse the data, the architecture may include a data
modelling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis
Services. It might also support self-service BI, using the modelling and visualization technologies in
Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive
data exploration by data scientists or data analysts.

Orchestration. Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, which transform source data, move data between multiple sources and sinks, load the
processed data into an analytical data store, or push the results straight to a report or dashboard. To
automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache
Oozie and Sqoop. (Microsoft, 2016)
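
In miniature, orchestration amounts to running dependent steps in order, on a schedule, with retries. The Python sketch below illustrates only the idea; the three step functions are hypothetical placeholders, and a real deployment would rely on one of the tools named above.

import time

def ingest():
    print("copy source data into the data lake")

def transform():
    print("run the batch job that prepares the data")

def load():
    print("load the results into the analytical data store")

workflow = [ingest, transform, load]     # each step depends on the previous one

for step in workflow:
    for attempt in range(3):             # simple per-step retry policy
        try:
            step()
            break
        except Exception as exc:
            print(f"{step.__name__} failed ({exc}), retrying")
            time.sleep(5)
    else:
        raise RuntimeError(f"workflow aborted at step {step.__name__}")
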
Big data analytics for manufacturing applications is marketed as a "5C architecture" (connection, conversion, cyber, cognition, and configuration). Factory work and cyber-physical systems may have an extended "6C system": connection (sensor and networks); cloud (computing and data on demand); cyber (model and memory); content/context (meaning and correlation); community (sharing and collaboration); customization (personalization and value).

ARCHITECTURES USED IN BIG DATA:

1. Lambda Architecture
2. Kappa Architecture
3. Internet of Things (IoT)

TECHNOLOGIES AND TOOLS IN BIG DATA
There is no doubt that we live in a data-driven world, and that data is growing exponentially. So much so, in fact, that it is rapidly changing our lives, and organizations around the world have to adjust and adapt to this vast amount of information.

From innovative storage technologies to IoT deployment and the EU's new GDPR legislation, big data
is driving change in the industry. Big data is a challenge for even the largest of organizations, who can
no longer afford to ignore the huge potential it has to improve business decisions, reach customers with
greater accuracy, and streamline business processes. (Ceravolo, 2013)

The four main elements of any big data project are data storage, data mining, data analysis and data visualization; each of these has a number of innovative, effective and highly capable tools on offer for businesses. (Che, Safran, & Peng, 2013)

Below are the trending technologies, as well as the tools, involved in the big data environment.

DATA STORAGE
For big data projects, cloud-based storage tools are vital to maximizing the amount of information you
can store. Cloud storage options let you store data in a much more secure and accessible fashion, for
ease of use:

 Hadoop
Hadoop is an open-source platform, specifically designed to store very large datasets using clusters. It supports both structured and unstructured data and scales effortlessly, so it is a great fit for organizations that are likely to need extra capacity without much notice. It can also handle a huge number of tasks without any latency. This is a great option for organizations that have the developer resources to implement Java, but it does require some effort to get up and running.
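
Although core Hadoop jobs are typically written in Java, the bundled Hadoop Streaming utility lets any executable act as mapper and reducer. Below is a sketch of the canonical word-count pair in Python; the input/output paths and the streaming jar location in the comment vary by installation.

# Submitted with something like:
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

# mapper.py -- emit "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- sum the counts per word (Hadoop delivers input sorted by key)
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(word + "\t" + str(sum(int(count) for _, count in group)))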

 MongoDB
MongoDB is very useful for organizations that use a combination of semi-structured and unstructured
data. This could be, for example, organizations that develop mobile apps, those that need to store data
relating to product catalogues, or data used for real-time personalization.
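
A brief sketch of why MongoDB suits such mixed data, using the pymongo driver (the connection string, database and field names are assumptions): documents in the same collection need not share a schema.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
catalog = client["shop"]["products"]

# Two product documents with different shapes can live side by side.
catalog.insert_one({"name": "T-shirt", "sizes": ["S", "M", "L"], "price": 9.99})
catalog.insert_one({"name": "E-book", "format": "epub", "price": 4.99})

# Query on whatever fields a document happens to have.
for product in catalog.find({"price": {"$lt": 5}}):
    print(product["name"])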

 RainStor
Rather than simply storing big data, RainStor compresses and de-duplicates data, providing storage savings of up to a ratio of 40:1. It doesn't lose any of the datasets in the process, making it a great option if an organization wants to take advantage of storage savings. RainStor is available natively for Hadoop and uses SQL to manage data. (Sagiroglu & Sinanc, 2013)

DATA MINING
Once your data is stored, you will need tools that help filter out the information you want to analyze or visualize. The tools below let you extract the data you need without the hassle of manually trawling through it all (a task that is impossible for humans anyway once you hold thousands or more records). Effective tools for this stage include:

 IBM SPSS Modeler
IBM's SPSS Modeler can be used to build predictive models using its visual interface rather than by writing and running lines of code. It covers text analytics, entity analytics, decision management and optimization, and allows for the mining of both structured and unstructured data across an entire dataset.

 KNIME
KNIME is a scalable open source solution with more than 1,000 modules to help data scientists mine
for new insights, make predictions and uncover key points from data. Text files, databases, documents,
images, networks and even Hadoop-based data can all be read, making it a perfect solution if the data types are mixed. It features a huge range of algorithms and community contributions to offer a full suite
of data mining and analysis tools.

 RapidMiner
RapidMiner, as the name implies, is an open source data mining tool that lets customers and users work from templates rather than having to write or run lines of code. This makes it an attractive option for organizations without dedicated developer resources, or for those just looking for a tool to start mining their data.

DATA ANALYSIS
After obtaining the data you need, the next step is to find powerful tools to analyze it and glean key insights into your business, clients, customers or even the wider world at large. Top-notch data analysis tools include:

 Apache Spark
Apache Spark is perhaps one of the most well-known big data analysis tools, built with big data at the
forefront of everything it does. It's open source, fast, effective and works with all major big data
languages including Java, Scala, Python, R, and SQL.

Apache Spark takes analysis a step further, allowing developers to use large-scale SQL, batch processing, stream processing, and machine learning in one place, alongside graph processing too.
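
A small sketch of that "in one place" point: the same Spark session serves both the DataFrame API and large-scale SQL. The records below are invented sample data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

df.filter(df.age > 30).show()          # DataFrame API

df.createOrReplaceTempView("people")   # the same data, queried with SQL
spark.sql("SELECT name FROM people WHERE age > 30").show()
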
 Presto
Like Apache Spark, Presto is an open source, distributed SQL query engine designed to run interactive analytical queries against data. It supports non-relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB and HBase, as well as relational data sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata, making it a useful tool for businesses operating both types of database.
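
As an illustration, a query can be issued from Python with the presto-python-client package (prestodb). The host, catalog, schema and table here are assumptions; Presto needs a connector configured for whichever underlying source actually holds the table.

import prestodb

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()
cur.execute("SELECT status, COUNT(*) FROM web_logs GROUP BY status")
for status, hits in cur.fetchall():
    print(status, hits)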

 SAP HANA
Data analytics is just one aspect of SAP's HANA platform, but it's a feature it does exceptionally well.
Supporting text, spatial, graph and series data from one place, SAP HANA integrates with Hadoop, R
and SAS to help businesses make fast decisions based on invaluable data insights.

 Tableau
Tableau combines data analysis and visualization tools and can be used on a desktop, via a server or
online. The online version has a big focus on collaboration, meaning you can easily share your discoveries with anyone else in your organization. Interactive visualizations make it easy for everyone to make sense of the information, and with Tableau Cloud's fully hosted option, you won't need any resources to configure servers, manage software upgrades, or scale hardware capacity.

 Splunk Hunk
Designed to run on top of Apache's Hadoop framework, Splunk's Hunk is a fully equipped data analytics tool which can generate graphs and visual representations of the data it is fed, all manageable through a dashboard. Queries can be made against raw data through Hunk's interface, and graphs, charts and dashboards can be quickly created and shared. It also works with other databases and stores, including Amazon EMR, Cloudera CDH, and the Hortonworks Data Platform, among others. (Che, Safran, & Peng, 2013)

DATA VISUALIZATION
Not everyone is adept at taking key insights from a list of data points or understanding what they mean.
The best way to present your data is by turning it into data visualizations so everyone can understand
what it means. Here are our top data visualization tools.

 Plotly
Plotly supports the creation of charts, presentations and dashboards from data analyzed using
JavaScript, Python, R, Matlab, Jupyter or Excel. A huge visualization library and online chart creation
tool makes it super-simple to create great looking graphics using a highly effective import and analysis
GUI.
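
A tiny sketch of that workflow: build a chart from Python data and save it as a standalone HTML file. The figures are invented sample values.

import plotly.graph_objects as go

fig = go.Figure(go.Bar(x=["Q1", "Q2", "Q3"], y=[120, 95, 143]))
fig.update_layout(title="Quarterly sales (sample data)")
fig.write_html("sales.html")   # open in any browser; fig.show() also works
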
 DataHero
DataHero is a simple-to-use visualization tool, which can pull data from a variety of cloud services and inject it into charts and dashboards that make it easier for the entire business to understand insights. Because no coding is required, it's suitable for use by organizations without data scientists in residence.

 QlikView
With a suite of capabilities on offer, QlikView allows its users to create data visualizations from all manner of data sources, with self-service tools that remove the need for complex data models to be in place. Straightforward visualization is served up by QlikView running on top of the company's own analytics platform, and results can be shared with others, so that decisions based on the trends the data reveals can be made collaboratively. (Sebepou & Magoutis, 2010)

BIG DATA APPLICATIONS

The primary goal of Big Data applications is to help companies make more informed business decisions by analysing large volumes of data. This data could include web server logs, Internet clickstream data, social media content and activity reports, text from customer emails, mobile phone call details and machine data captured by multiple sensors.

Organizations from different domains are investing in Big Data applications, examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.

Major areas Big Data applies are:

 Big Data Applications in Healthcare
 Big Data Applications in Manufacturing
 Big Data Applications in Media & Entertainment
 Big Data Applications in IoT
 Big Data Applications in Government

Big Data Applications: Healthcare

The level of data generated within healthcare systems is not trivial. Traditionally, the healthcare industry lagged in using Big Data because of its limited ability to standardize and consolidate data.

Big Data analytics has now improved healthcare by enabling personalized medicine and prescriptive analytics. Researchers are mining the data to see which treatments are more effective for particular conditions, to identify patterns related to drug side effects, and to gain other important information that can help patients and reduce costs.

With the added adoption of mHealth, eHealth and wearable technologies, the volume of data is increasing at an exponential rate. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of data.

Big Data Applications: Manufacturing

Predictive manufacturing provides near-zero downtime and transparency. It requires an enormous amount of data and advanced prediction tools to systematically turn that data into useful information.

Major benefits of using Big Data applications in the manufacturing industry are:

 Product quality and defects tracking
 Supply planning
 Manufacturing process defect tracking
 Output forecasting
 Increasing energy efficiency
 Testing and simulation of new manufacturing processes
 Support for mass-customization of manufacturing

Big Data Applications: Media & Entertainment

Various companies in the media and entertainment industry are facing new business models for the way they create, market and distribute their content. This is driven by today's consumers, who want to search for and access content anywhere, at any time, on any device.

Big Data provides actionable points of information about millions of individuals. Publishing environments are now tailoring advertisements and content to appeal to consumers, with these insights gathered through various data-mining activities.

Big Data benefits the media and entertainment industry by:

 Predicting what the audience wants
 Scheduling optimization
 Increasing acquisition and retention
 Ad targeting
 Content monetization and new product development

Big Data Applications: Internet of Things (IoT)

Data extracted from IoT devices provides a mapping of device inter-connectivity. Such mappings have
been used by various companies and governments to increase efficiency. IoT is also increasingly
adopted as a means of gathering sensory data, and this sensory data is used in medical and
manufacturing contexts.

Big Data Applications: Government

The use and adoption of Big Data within governmental processes allows efficiencies in terms of cost, productivity, and innovation. In government use cases, the same data sets are often applied across multiple applications, which requires multiple departments to work in collaboration. (Sinha, 2018)

References
Ceravolo, A. A. (2013). Consistent process mining over big data triple stores. Proceedings of the International Congress on Big Data (BigData '13).

Che, D., Safran, M., & Peng, Z. (2013). From Big Data to Big Data Mining: Challenges, issues, and opportunities. Database Systems for Advanced Applications.

Jay, L., Edzel, L., Behrad, B., & Hung-an, K. (2013, October 3). Recent advances and trends in
predictive manufacturing systems in big data environment. Retrieved from ScienceDirect:
https://www.sciencedirect.com/science/article/pii/S2213846313000114

Martin, H. (2015, January 3). Big Data for Development: A Review of Promises and Challenges. Development Policy Review. Retrieved from martinhilbert.net: http://www.martinhilbert.net/big-data-for-development/

Microsoft. (2016). Big Data Architecture. Retrieved from Microsoft Docs: https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/#internet-of-things-iot

Sagiroglu, S., & Sinanc, D. (2013). Big data: a review. Proceedings of the International Conference on
Collaboration Technologies and Systems (CTS ’13). San Diego, California: IEEE.

Sebepou, Z., & Magoutis, K. (2010). Scalable storage support for data stream processing. Proceedings
of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10). Incline
Village, Nevada: IEEE.

Sinha, S. (2018, July 16). Real Time Big Data Applications in Various Domains. Retrieved from edureka!: https://www.edureka.co/blog/big-data-applications-revolutionizing-various-domains/
