BIG DATA MANAGEMENT

(2017)

SITI NORHAYATI BINTI MASHUDI, WAN NORAQILAH BINTI A.RAZAK,


MASTURAH BINTI MOHAMAD LIZA, NOR FADILA BINTI MOHD YUNUS,
SITI ROSNIEZA EILISA BINTI JAMAL
Introduction/background
Definition of big data
- Big data is a term for data sets that are so large or complex that traditional data
processing application software is inadequate to deal with them. The term "big data"
often refers simply to the use of predictive analytics, user behaviour analytics, or
certain other advanced data analytics methods that extract value from data, and
seldom to a particular size of data set. Variety is what makes big data really big. Big data
comes from a great variety of sources and generally falls into three types: structured,
semi-structured and unstructured. Structured data enters a data warehouse already
tagged and easily sorted, whereas unstructured data is random and difficult to analyze.
Semi-structured data does not conform to fixed fields but contains tags that separate data
elements.
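
As a minimal illustration of the three types (the records and field names below are hypothetical, not from any real data set):

    # Structured: fixed fields, ready for a relational table or a data warehouse.
    structured_row = ("C1001", "Ali", "2017-03-14", 249.90)  # (id, name, date, amount)

    # Semi-structured: no fixed schema, but tags (keys) separate the data elements.
    semi_structured_doc = {
        "id": "C1002",
        "name": "Siti",
        "orders": [{"date": "2017-03-15", "amount": 120.50}],  # nested, optional fields
    }

    # Unstructured: free text (or images, audio, video) with no tags at all.
    unstructured_text = "Customer called to say the delivery arrived late but intact."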

What is big data management?


- Big data management is the organization, administration and governance of large
volumes of both structured and unstructured data. The goal of big data management is
to ensure a high level of data quality and accessibility for business intelligence and
big data analytics applications. Corporations, government agencies and other
organizations employ big data management strategies to help them contend with fast-
growing pools of data, typically involving many terabytes or even petabytes of
information saved in a variety of file formats.

Why is managing big data becoming increasingly important?


- According to IBM, 2.5 quintillion bytes of data are generated by global businesses and
consumers every day. This shows that data volumes are growing rapidly and that big
data will soon be a concern for every organization; basic database management tools
will no longer be able to cope with the quantity and variety of sources of information. In
addition, the proliferation of multimedia data and the ever-growing demand for
multimedia applications are among the reasons why managing big data is increasingly
important. The pervasiveness of mobile devices and consumer electronics, and the
popularity of the Internet and social networks, have generated huge amounts of
multimedia information in various media types such as text, image, video, and audio,
shared among large numbers of people. This creates opportunities and intensifies the
interest of the research community in developing methods to address multimedia big
data challenges in real-world applications. Example application domains include
video-on-demand, interactive video systems, surveillance, social media, medicine, and
healthcare. With the predicted growth of data in future years, it
is essential that businesses implement processes and software that will withstand data
growth. Big data management is important to companies and other organizations that
have big data to manage, but big data is still relatively new. It's important to beef up
data management infrastructure and skills as early as possible; otherwise, an
organization can fall so far behind from a technology viewpoint that it's difficult to
catch up. From a business viewpoint, delaying the leverage of big data delays the
business value. Big data management is well worth doing because managing big data
leads to a number of benefits. Based on our research, the business and technology
tasks that improve most are analytic insights, the completeness of analytic data sets
(Emran 2015), business value drawn from big data, and all sales and marketing
activities.

Tools for Big Data Management


With new tools that address the entire data management cycle, big data technologies make it
technically and economically feasible, not only to collect and store larger datasets, but
also to analyse them in order to uncover new and valuable insights. The following are
several examples of big data tools that are widely used today.
- Big data tools: Jaspersoft BI Suite

The Jaspersoft package is one of the open source leaders for producing reports from
database columns. The software is well-polished and already installed in many
businesses turning SQL tables into PDFs that everyone can scrutinize at meetings.
The company is jumping on the big data train, and this means adding a software layer
to connect its report generating software to the places where big data gets stored. The
Jasper Reports Server now offers software to suck up data from many of the major
storage platforms, including MongoDB, Cassandra, Redis, Riak, CouchDB, and
Neo4j. Hadoop is also well-represented, with JasperReports providing a Hive
connector to reach inside of HBase. This effort feels like it is still starting up -- many
pages of the documentation wiki are blank, and the tools are not fully integrated. The
visual query designer, for instance, doesn't work yet with Cassandra's CQL. You get
to type these queries out by hand. Once you get the data from these sources,
Jaspersoft's server will boil it down to interactive tables and graphs. The reports can
be quite sophisticated interactive tools that let you drill down into various corners.
You can ask for more and more details if you need them. This is a well-developed
corner of the software world, and Jaspersoft is expanding by making it easier to use
these sophisticated reports with newer sources of data. Jaspersoft isn't offering
particularly new ways to look at the data, just more sophisticated ways to access data
stored in new locations. I found this surprisingly useful. The aggregation of my data
was enough to make basic sense of who was going to the website and when they were
going there.
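
Since the visual designer cannot yet build CQL, those hand-typed queries are ordinary CQL strings. A minimal sketch using the open-source DataStax Python driver follows; the keyspace, table, and column names are hypothetical:

    # A hand-written CQL query of the kind the visual designer cannot yet produce.
    # Requires the DataStax driver: pip install cassandra-driver
    # The keyspace, table, and column names below are hypothetical.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])       # contact point for the Cassandra node
    session = cluster.connect("weblogs")   # hypothetical keyspace

    rows = session.execute(
        "SELECT ip, page, visited_at FROM visits WHERE day = '2017-03-14' LIMIT 20"
    )
    for row in rows:
        print(row.ip, row.page, row.visited_at)

    cluster.shutdown()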

- Big data tools: Pentaho Business Analytics


Pentaho is another software platform that began as a report generating engine; it is,
like JasperSoft, branching into big data by making it easier to absorb information
from the new sources. You can hook up Pentaho's tool to many of the most popular
NoSQL databases such as MongoDB and Cassandra. Once the databases are
connected, you can drag and drop the columns into views and reports as if the
information came from SQL databases. I found the classic sorting and sifting tables to
be extremely useful for understanding just who was spending the most time at my
website. Simply sorting by IP address in the log files revealed what the heavy
users were doing. Pentaho also provides software for drawing HDFS file data and
HBase data from Hadoop clusters. One of the more intriguing tools is the graphical
programming interface known as either Kettle or Pentaho Data Integration. It has a
bunch of built-in modules that you can drag and drop onto a picture, then connect
them. Pentaho has thoroughly integrated Hadoop and the other sources into this, so
you can write your code and send it out to execute on the cluster.
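
The heavy-user analysis described above does not need a full platform to understand; a few lines of Python reproduce the idea. The log format (IP address as the first field) and the file name are assumptions:

    # Count requests per IP address in a web server log to find the heavy users.
    from collections import Counter

    hits = Counter()
    with open("access.log") as log:          # hypothetical log file
        for line in log:
            ip = line.split(" ", 1)[0]       # first whitespace-separated field
            hits[ip] += 1

    # The ten heaviest users, analogous to sorting the report table by IP.
    for ip, count in hits.most_common(10):
        print(ip, count)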

- Big data tools: Karmasphere Studio and Analyst


Many of the big data tools did not begin life as reporting tools. Karmasphere Studio,
for instance, is a set of plug-ins built on top of Eclipse. It's a specialized IDE that
makes it easier to create and run Hadoop jobs. I had a rare feeling of joy when I
started configuring a Hadoop job with this developer tool. There are a number of
stages in the life of a Hadoop job, and Karmasphere's tools walk you through each
step, showing the partial results along the way. I guess debuggers have always made it
possible for us to peer into the mechanism as it does its work, but Karmasphere
Studio does something a bit better: As you set up the workflow, the tools display the
state of the test data at each step. You see what the temporary data will look like as it
is cut apart, analyzed, and then reduced. Karmasphere also distributes a tool called
Karmasphere Analyst, which is designed to simplify the process of plowing through
all of the data in a Hadoop cluster. It comes with many useful building blocks for
programming a good Hadoop job, like subroutines for uncompressing Zipped log
files. Then it strings them together and parameterizes the Hive calls to produce a table
of output for perusing.
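
Karmasphere's tooling is visual, but the job it walks you through has the familiar stages of any Hadoop job: map, the shuffle/sort that Hadoop performs itself, and reduce. A minimal sketch of such a job as two Hadoop Streaming scripts in Python (the classic word count, not Karmasphere's own code):

    # mapper.py -- the map stage: emit "word<TAB>1" for each word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- the reduce stage: the shuffle delivers lines sorted by key,
    # so identical words arrive together and can be summed in one pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))

The pair is launched with the hadoop-streaming jar (paths illustrative): hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input logs/ -output out/. The intermediate data Karmasphere previews at each step is essentially the mapper output before and after the sort.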
- Big data tools: Talend Open Studio
Talend also offers an Eclipse-based IDE for stringing together data processing jobs
with Hadoop. Its tools are designed to help with data integration, data quality, and
data management, all with subroutines tuned to these jobs. Talend Studio allows you
to build up your jobs by dragging and dropping little icons onto a canvas. If you want
to get an RSS feed, Talend's component will fetch the RSS and add proxying if
necessary. There are dozens of components for gathering information and dozens
more for doing things like a "fuzzy match." Then you can output the results. Stringing
together blocks visually can be simple after you get a feel for what the components
actually do and don't do. This was easier for me to figure out when I started looking at
the source code being assembled behind the canvas. Talend lets you see this, and I
think it's an ideal compromise. Visual programming may seem like a lofty goal, but
I've found that the icons can never represent the mechanisms with enough detail to
make it possible to understand what's going on. I need the source code. Talend also
maintains TalendForge, a collection of open source extensions that make it easier to
work with the company's products. Most of the tools seem to be filters or libraries that
link Talend's software to other major products such as Salesforce.com and
SugarCRM. You can suck down information from these systems into your own
projects, simplifying the integration.
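
The "fuzzy match" component is worth a closer look, because the underlying idea is simple even when the component hides it. A sketch using only Python's standard difflib (the 0.85 threshold is an arbitrary choice, not Talend's):

    # What a "fuzzy match" component does underneath: score string similarity
    # and treat near-identical values as the same record.
    from difflib import SequenceMatcher

    def fuzzy_match(a, b, threshold=0.85):
        """Return True when two strings are similar enough to be one record."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    print(fuzzy_match("Jon Smith", "John Smith"))   # True: likely duplicates
    print(fuzzy_match("Jon Smith", "Jane Doe"))     # False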

- Big data tools: Skytree Server


Not all of the tools are designed to make it easier to string together code with visual
mechanisms. Skytree offers a bundle that performs many of the more sophisticated
machine-learning algorithms. All it takes is typing the right command into a
command line. Skytree is more focused on the guts than the shiny GUI. Skytree
Server is optimized to run a number of classic machine-learning algorithms on your
data using an implementation the company claims can be 10,000 times faster than
other packages. It can search through your data looking for clusters of mathematically
similar items, then invert this to identify outliers that may be problems, opportunities,
or both. The algorithms can be more precise than humans, and they can search
through vast quantities of data looking for the entries that are a bit out of the ordinary.
This may be fraud -- or a particularly good customer who will spend and spend. The
free version of the software offers the same algorithms as the proprietary version, but
it's limited to data sets of 100,000 rows. This should be sufficient to establish whether
the software is a good match.
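
Skytree's implementation is proprietary, but the workflow it describes, clustering followed by outlier hunting, can be sketched with scikit-learn on synthetic data. This shows the class of algorithm, not Skytree's code:

    # Cluster the data, then flag the points farthest from any cluster centre;
    # those are the "out of the ordinary" entries (fraud, or big spenders).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.normal(0, 1, size=(100_000, 2))   # 100,000 rows: the free-tier limit
    data[:5] += 12                               # plant five obvious outliers

    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)

    # Distance of each point to its assigned centre; the largest are anomalies.
    dists = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)
    print("most anomalous rows:", np.argsort(dists)[-5:])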

- Big data tools: Tableau Desktop and Server

Tableau Desktop is a visualization tool that makes it easy to look at your data in new
ways, then slice it up and look at it in a different way. You can even mix the data with
other data and examine it in yet another light. The tool is optimized to give you all the
columns for the data and let you mix them before stuffing it into one of the dozens of
graphical templates provided. Tableau Software started embracing Hadoop several
versions ago, and now you can treat Hadoop "just like you would with any data
connection." Tableau relies upon Hive to structure the queries, then tries its best to
cache as much information in memory to allow the tool to be interactive. While many
of the other reporting tools are built on a tradition of generating the reports offline,
Tableau wants to offer an interactive mechanism so that you can slice and dice your
data again and again. Caching helps deal with some of the latency of a Hadoop
cluster. The software is well-polished and aesthetically pleasing. I often found myself
reslicing the data just to see it in yet another graph, even though there wasn't much
new to be learned by switching from a pie chart to a bar graph and beyond. The
software team clearly includes a number of people with some artistic talent.
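
Behind the point-and-click connection, the pattern is: push one query down to Hive, then keep the result cached in memory so that re-slicing is instant. A hand-rolled sketch of that pattern with the PyHive library and pandas; the host, database, table, and column names are hypothetical:

    # One round trip to the cluster, then interactive exploration in memory.
    # Requires: pip install 'pyhive[hive]' pandas. Names below are hypothetical.
    import pandas as pd
    from pyhive import hive

    conn = hive.Connection(host="hadoop-master", port=10000, database="weblogs")
    df = pd.read_sql("SELECT page, COUNT(*) AS hits FROM visits GROUP BY page", conn)

    # Re-slicing the cached frame avoids the Hadoop latency Tableau hides.
    print(df.sort_values("hits", ascending=False).head(10))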
- Big data tools: Splunk

Splunk is a bit different from the other options. It's not exactly a report-generating
tool or a collection of AI routines, although it accomplishes much of that along the
way. It creates an index of your data as if your data were a book or a block of text.
Yes, databases also build indices, but Splunk's approach is much closer to a text
search process.
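
A toy version of that text-search indexing makes the difference concrete: map every term to the set of records containing it, then answer queries by intersecting those sets rather than scanning tables. (Illustrative only; Splunk's real index is far more elaborate.)

    # A toy inverted index over log lines, the text-search idea Splunk uses.
    from collections import defaultdict

    logs = [
        "ERROR disk full on node7",
        "INFO backup complete",
        "ERROR timeout contacting node7",
    ]

    index = defaultdict(set)
    for line_no, line in enumerate(logs):
        for term in line.lower().split():
            index[term].add(line_no)

    # A query intersects posting lists instead of scanning every record.
    matches = index["error"] & index["node7"]
    print([logs[i] for i in sorted(matches)])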

- Big data tools: MarkLogic


MarkLogic is built to handle heavy data loads and to let users access them through
real-time updates and alerts. It provides geographical data combined with content and
location relevance, along with data-filtering tools. This tool is ideal for those looking
at paid content search app development. It is a NoSQL database that supports flexible
APIs such as the Node.js Client API, and it also offers Samplestack to help show
developers how to implement a reference architecture using key MarkLogic concepts
and sample code.

Challenges of Big Data Management

- Privacy: Privacy is the most sensitive issue, with conceptual, legal, and technological
implications, and its importance only increases in the context of big data. Privacy can
also be understood in a broader sense as encompassing companies wishing to protect
their competitiveness and consumers, and states eager to preserve their sovereignty and
citizens.

- Access to and sharing of information: Private companies and other institutions are
commonly reluctant to share data about their clients and users, as well as about their
own operations. Obstacles may include legal or reputational considerations, a
need to protect their competitiveness, a culture of secrecy, and more broadly, the absence
of the right incentive and information structures. There are also institutional and technical
challenges, when data is stored in places and ways that make it difficult to be accessed
and transferred (Leza & Emran 2014).

- To rethink security for information sharing in big data use cases: Many online services
today require us to share private information (i.e., Facebook, LinkedIn, etc.), but beyond
record-level access control we do not understand what it means to share data, how the
shared data can be linked, and how to give users fine-grained control over this sharing.
The size of big data structures is also a crucial point that can constrain the performance
of the system. Managing large and rapidly increasing volumes of data has been a
challenging issue for many decades. In the past, this challenge was mitigated by
processors getting faster, which provided the resources needed to cope with
increasing volumes of data. But there is a fundamental shift under way now: data
volume is scaling faster than compute resources.

- Size issue: the larger the data set to be processed, the longer it will take to analyze. The
design of a system that effectively deals with size is likely also to result in a system that
can process a given size of data set faster. However, it is not just this speed that is usually
meant when we refer to speed in the context of big data. Rather, there is an acquisition
rate challenge in the ETL process. Scanning the entire data set to find suitable elements is
obviously impractical. Rather, index structures are created in advance to permit finding
qualifying elements quickly.

- Working with new data sources: The relevance and severity of these challenges will
vary depending on the type of analysis being conducted and on the type of decisions
that the data might eventually inform. The core challenge is to analyse what the data is
really telling us in a fully transparent manner.

- In multimedia big data: The semantic gap between high-level semantics and the visual
appearance of video is a challenge for automated ontology-driven video annotation. An
ontology builds a formal and explicit representation of semantic hierarchies for the
concepts and their relationships in video events, and allows reasoning to derive implicit
knowledge. The challenge is amplified by the rapid growth of video resources on the
World Wide Web: on YouTube alone, 35 hours of video are uploaded every minute,
and over 700 billion videos were watched in 2010.

Conclusion
- In conclusion, effective big data management helps companies locate valuable
information in large sets of unstructured data and semi-structured data from a variety
of sources, including call detail records, system logs and social media sites. Most big
data environments go beyond relational databases and traditional data warehouse
platforms to incorporate technologies that are suited to processing and storing non-
transactional forms of data. The increasing focus on collecting and analysing big data
is shaping new platforms that combine the traditional data warehouse with big data
systems in a logical data warehousing architecture. As part of the process, organizations
must decide what data must be kept for compliance reasons, what data can be disposed of
and what data should be kept and analysed in order to improve current business
processes or provide a business with a competitive advantage. This process requires
careful data classification so that ultimately, smaller sets of data can be analysed
quickly and productively.
References

Borne, K. (2014, April 14). Top 10 Big Data Challenges: A Serious Look at 10 Big Data V's.
Retrieved from https://mapr.com/blog/top-10-big-data-challenges-serious-look-10-big-data-vs/

Rouse, M. (2013, October). Big Data Management. Retrieved from
http://searchdatamanagement.techtarget.com/definition/big-data-management

Wayner, P. (2012, April 18). 7 top tools for taming big data. Retrieved from
http://www.infoworld.com/article/2616959/big-data/7-top-tools-for-taming-big-data.html

Emran, N.A. (2015). Data completeness measures. In Advances in Intelligent Systems and
Computing (pp. 117-130).

Leza, F.N.M. & Emran, N.A. (2014). Data accessibility model using QR code for lifetime
healthcare records. World Applied Sciences Journal, 30(30), pp. 395-402.
