
Understanding the data analytics project life cycle

When dealing with data analytics projects, there is a fixed set of tasks that should be followed to obtain the expected output. Here we are going to build a data analytics project life cycle: a set of standard, data-driven processes that lead from data to insights effectively.

The defined stages of the data analytics project life cycle should be followed in sequence to achieve the goal effectively using the input datasets. This process may include identifying the data analytics problem, designing and collecting the datasets, performing the analytics, and visualizing the data.

The data analytics project life cycle stages are shown in the following diagram:

Let's get some perspective on these stages for performing data analytics.

Identifying the problem

Today, business analytics trends are changing as organizations perform data analytics over web datasets to grow their business. Since their data size is increasing gradually, day by day, their analytical applications need to be scalable enough to collect insights from their datasets.

With the help of web analytics, we can solve the business analytics...

Understanding data analytics problems

In this section, we have included three practical data analytics problems that cover various stages of data-driven activity with R and Hadoop technologies. These data analytics problem definitions are designed so that readers can understand how Big Data analytics can be performed with the analytical power of R's functions and packages and the computational power of Hadoop.

The data analytics problem definitions are as follows:

 Exploring the categorization of web pages

 Computing the frequency of changes in the stock market

 Predicting the sale price of a blue book for bulldozers (case study)

Exploring web page categorization


This data analytics problem is designed to identify the category of a web page of a website, which may be categorized popularity-wise as high, medium, or low (regular), based on the visit count of the pages. While designing the data requirement stage of the data analytics life cycle, we will see how to collect these types of data from Google Analytics.
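
The exact data depends on what is exported from Google Analytics, but the core idea of bucketing pages by visit count can be sketched in a few lines of R. The column names and thresholds below are illustrative assumptions, not values taken from the book:

# Hypothetical page-visit data; in practice this would come from Google Analytics.
pages <- data.frame(
  page   = c("/home", "/products", "/blog/post-1", "/contact"),
  visits = c(12000, 4300, 150, 75),
  stringsAsFactors = FALSE
)

# Bucket pages into low / medium / high popularity by visit count.
# The cut points (500 and 5000) are arbitrary example thresholds.
pages$category <- cut(
  pages$visits,
  breaks = c(-Inf, 500, 5000, Inf),
  labels = c("low", "medium", "high")
)

print(pages)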

Identifying the problem


As this is a web analytics problem, the goal of the problem is to identify the importance of web

pages designed for...

Refer to the book above.

7 phases of a data life cycle


https://www.bloomberg.com/professional/blog/7-phases-of-a-data-life-cycle/

July 14, 2015


Most data management professionals would acknowledge that
there is a data life cycle, but it is fair to say that there is no common
understanding of what it is. If you Google “Data Life Cycle” you will
not find anything that clearly describes it. But, if data management
professionals know that there really is a Data Life Cycle, then it is
incumbent on us to try to define it.

This is one attempt to describe the Data Life Cycle. It takes the
position that a life cycle consists of phases, and each phase has its
own characteristics. Einstein, when he was a teenager, tried to think
about what it would be like to ride a beam of light. There is no chance
that we can emulate Einstein, but perhaps we can put his idea to use.
What would happen if we could ride on a piece of data as it moved
through the enterprise? What new experiences would the piece of
data have? What phases would it pass through?

1. Data Capture

The first experience that an item of data must have is to pass within
the firewalls of the enterprise. This is Data Capture, which can be
defined as:

 The act of creating data values that do not yet exist and have
never existed within the enterprise.
There are three main ways that data can be captured, and these
are very important:

1. Data Acquisition: the ingestion of already existing data that has been produced by an organization outside the enterprise
2. Data Entry: the creation of new data values for the enterprise by human operators or devices that generate data for the enterprise
3. Signal Reception: the capture of data created by devices, typically important in control systems, but becoming more important for information systems with the Internet of Things

There may well be other ways, but the three identified above have
significant Data Governance challenges. For instance, Data
Acquisition often involves contracts that govern how the enterprise
is allowed to use the data it obtains in this way.

2. Data Maintenance

Once data has been captured it usually encounters Data Maintenance. This can be defined as:

 The supplying of data to points at which Data Synthesis and Data Usage occur, ideally in a form that is best suited for these purposes.
We will deal with Data Synthesis and Data Usage in a moment.
What Data Maintenance is about is processing the data without yet
deriving any value from it for the enterprise. It often involves tasks
such as movement, integration, cleansing, enrichment, changed
data capture, as well as familiar extract-transform-load processes.
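
As a rough illustration of the kind of work this phase covers, here is a minimal cleansing-and-enrichment sketch in base R; the file names and columns are assumptions made for this example only, not anything prescribed by the article:

# Movement: read raw data from a landing area (hypothetical file).
raw <- read.csv("customers_raw.csv", stringsAsFactors = FALSE)

# Cleansing: drop exact duplicate rows and standardize the country field.
clean <- unique(raw)
clean$country <- toupper(trimws(clean$country))

# Enrichment: join a reference table that adds a region attribute (hypothetical file).
regions <- read.csv("country_regions.csv", stringsAsFactors = FALSE)
enriched <- merge(clean, regions, by = "country", all.x = TRUE)

# Load: supply the curated data to the points where synthesis and usage occur.
write.csv(enriched, "customers_curated.csv", row.names = FALSE)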

Data Maintenance is the focus of a broad range of data management activities. Because of this, Data Governance faces a lot of challenges in this area. Perhaps one of the most important is rationalizing how data is supplied to the end points for Data Synthesis and Data Usage, e.g. preventing proliferation of point-to-point transfers.

3. Data Synthesis

This is comparatively new, and perhaps still not a very common phase in the Data Life Cycle. It can be defined as:

 The creation of data values via inductive logic, using other data as input.
It is the arena of analytics that uses modeling, such as is found in
risk modeling, actuarial modeling, and modeling for investment
decisions. Derivation by deductive logic is not part of this – that
occurs in Data Maintenance. An example of deductive logic is Net
Sales = Gross Sales – Taxes. If I know Gross Sales and Taxes,
and I know the simple equation just outlined, then I can calculate
Net Sales.

Inductive logic requires some kind of expert experience, judgement, and/or opinion as a part of the logic, e.g. the way in which credit scores are created.
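
To make the distinction concrete, the R sketch below contrasts the two: a deductive rule that simply recomputes Net Sales from known values, and an inductive step, here a toy logistic regression, that learns a scoring rule from observed data. All numbers are invented purely for illustration:

# Deductive logic (handled in Data Maintenance): a fixed rule, no judgement needed.
gross_sales <- 120000
taxes       <- 20000
net_sales   <- gross_sales - taxes   # Net Sales = Gross Sales - Taxes

# Inductive logic (Data Synthesis): fit a model from observed outcomes.
# Illustrative data only; a real credit score uses far richer inputs.
history <- data.frame(
  income  = c(30, 45, 60, 80, 100),   # in thousands
  default = c(1, 0, 1, 0, 0)          # 1 = loan defaulted
)
score_model <- glm(default ~ income, data = history, family = binomial)

# Synthesize a new data value: estimated default risk for an income of 55.
predict(score_model, newdata = data.frame(income = 55), type = "response")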

4. Data Usage

So far we have seen how our single data value has entered the
enterprise via Data Capture, and has been moved around the
enterprise, perhaps being transformed and enriched in Data
Maintenance, and possibly being an input to Data Synthesis. Next,
it reaches a point where it is used in support of the enterprise. This
is Data Usage, which can be defined as:

 The application of data as information to tasks that the enterprise needs to run and manage itself.

These would normally be tasks outside the data life cycle itself. However, data is becoming more central to business models in many enterprises. For instance, data may itself be a product or service (or part of a product or service) that the enterprise offers. This too is Data Usage, even if it is part of the Data Life Cycle, because it is part of the business model of the enterprise.

Data Usage has special Data Governance challenges. One of them is whether it is legal to use the data in the ways in which business people want. This is referred to as "permitted use of data". There may be regulatory or contractual constraints on how data may actually be used, and part of the role of Data Governance is to ensure that these constraints are observed.

5. Data Publication

In being used, it is possible that our single data value may be sent
outside of the enterprise. This is Data Publication, which can be
defined as:

 The sending of data to a location outside of the enterprise.


An example would be a brokerage that sends monthly statements
to its clients. Once data has been sent outside the enterprise it
is de facto impossible to recall it. Data values that are wrong
cannot be corrected, as they are beyond the reach of the
enterprise. Data Governance may need to assist in deciding
how incorrect data that has been sent out of the enterprise will be
dealt with. Unhappily, data breaches also fall under Data
Publication.

6. Data Archival

Our single data value may experience many rounds of usage and
publication, but eventually the end of its life begins to loom
large. The first part of this is to archive the data value. Data
Archival is:

 The copying of data to an environment where it is stored in case it is needed again in an active production environment, and the removal of this data from all active production environments.
A data archive is simply a place where data is stored, but where no
maintenance, usage, or publication occurs. If necessary the data
can be restored to an environment where one or more of these
occur.
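
In practice, the "copy, then remove" semantics of archival can be pictured with a small R sketch; the directory and file names here are assumptions used only for illustration:

# Hypothetical paths for a data set leaving the active production environment.
production_file <- "production/orders_2014.csv"
archive_file    <- "archive/orders_2014.csv"

# Archive: copy the data into the archive environment...
if (file.copy(production_file, archive_file, overwrite = FALSE)) {
  # ...and only then remove it from the active production environment.
  file.remove(production_file)
}

# Restoring is simply the reverse: copy the file back from archive/ to production/.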

7. Data Purging

We now come to the actual end of life of our single data value. Data Purging is:

 The removal of every copy of a data item from the enterprise.


Ideally, this will be done from an archive. A Data Governance
challenge in this phase of the data life cycle is proving that the
purge has actually been done properly.

Critique

The terms we have used may be disputed. “Life Cycle” is not really
accurate because data does not reproduce or recycle itself, which
happens in real life cycles. “Data Life History” might be closer to the
truth, but is not a familiar term. “Life History” is used to describe the
phases of growth in an organism like a butterfly, but again data is
different. Therefore, “Data Life Cycle” might as well be used.

What has been described here are phases with logical dependencies, not actual data flows. Data flows may go round and round through these phases, e.g. from Data Synthesis back to Data Maintenance and then returning to Data Synthesis and so on in more cycles. A description of these flows is quite different from the Data Life Cycle, though design should be informed by the Data Life Cycle.

Nor have environments been described. Some environments might be such that all phases of the Data Life Cycle occur in them, i.e. silos. However, it does seem reasonable that architecture should reflect the Data Life Cycle, and that is a topic for another day.

Finally, data does not have to pass through all phases. Early
mainframe systems had nothing more than Data Capture and Data
Usage. Today, the full Data Life Cycle is more common.
What is important is that we define the Data Life Cycle because
each phase has distinct Data Governance Needs. Greater clarity
about the Data Life Cycle will help the mission of Data Governance.

This article was written by Malcolm Chisholm from Information Management and was legally licensed through the NewsCred publisher network.

Steps in the Data Life Cycle


Proposal Planning & Writing
 Review of existing data sources; determine if the project will produce new data or combine existing data
 Investigate archiving challenges, costs, consent and confidentiality
 Identify potential users of your data
 Contact Archives for advice

Project Start Up
 Create a data management plan
 Make decisions about documentation form and content
 Conduct pretest of collection materials and methods

Data Collection
 Organize files, backups & storage, QA for data collection
 Think about access control and security

Data Analysis
 Document analysis and file manipulations
 Manage file versions

Data Sharing
 Determine file formats
 Contact Archive for advice
 Further document and clean data

End of Project
 Deposit data in data archive (repository)

Remember: Managing data in a research project is a process that runs throughout the
project. Good data management is one of the foundations for reproducible research.
Good management is essential to ensure that data can be preserved and remain
accessible in the long-term, so it can be re-used and understood by future researchers.
Begin thinking about how you’ll manage your data before you start collecting it.
http://data.library.virginia.edu/data-management/lifecycle/
Other Life Cycles
 Data Curation Centre: Curation Lifecycle Model
 DataONE Best Practices Through the Data Life Cycle
 DDI Alliance: Research Life Cycle
 Life Cycle of Research, by Charles Humphrey
 ICPSR Data Life Cycle
 Research Lifecycle at UCF

https://www.capgemini.com/service/digital-services/insights-data/insights-data-strategy/insights-data-architecture/
How do you make the big promise of the new data landscape a reality – and ensure
your strategy can be executed? Answer: Make it an architected journey. Our
pragmatic approach to Insights & Data architecture provides you with a solid, yet
agile foundation for change, renewal and innovation. So that your solutions are
designed for digital right from the start.
Create real, sustainable value from the new data landscape

With the emergence of the new data landscape, the change potential for
organizations is bigger than ever before. There are no filters on what data can be
acquired and stored, no restrictions on what can be analyzed, and no waiting time
for presenting real-time, tailor-made and highly actionable insights. However,
advances in technology are moving at lightning speed and it seems more difficult
than ever to select the right technology components, while preserving the crucial
assets of the existing data estate.

How do you reap the benefits of the new data landscape right now, while being
prepared for new insights and data opportunities that may be unknown today?
How do you create an architecture for change that is fresh, pragmatic and does
justice to the new ways of thinking of the digital enterprise? This calls for a new
approach to Insights & Data architecture, one that enables new, unexplored
opportunities for the insights-driven enterprise.

Design architecture for the future


Capgemini has one of the biggest, most active and most experienced architect
communities in the world. And for good reason, as no digital transformation is
successful without a proper architectural foundation. We believe that architecture
should never be a purpose on its own. It’s meant to facilitate and enable a change,
whatever change the organization envisions.
Technology is the driver behind the new data landscape. Through our Insights & Data architectures,
we ensure you benefit from the latest advances, while modernizing your existing IT estate towards a
powerful and agile digital platform – industrializing where it counts, innovating where it makes the
difference.

But above all, architecture is a tool to bring business and technology together. The change objectives
of the organizations always have the central role and our architecture visualizations are compelling
and understandable for all stakeholders. We believe architecture should tell stories and bring
simplicity, rather than introducing piles of documentation and layers of additional complexity.

 Focus on business outcomes: We turn your digital vision into an Insights & Data
architecture that is geared towards one and one thing only: turning data into tangible,
measurable business benefits
 Pragmatic and compelling: We produce architectural assets that are exactly to the point,
precisely what is needed for the change, and convincing in their visualization
 Leveraging open standards: Capgemini is a leader in both using and developing
architectural open standards. We have the world’s largest community of certified TOGAF
architects, have donated major contributions to TOGAF 9, and are actively involved in the
Open Platform 3.0 forum of The Open Group, particularly focusing on open standards for big
data and the Business Data Lake
 Accelerated: We use tools such as our Accelerated Solution Environments (ASEs),
TechnoVision innovation framework, the WARP industrialized assessment approach, and our
powerful big data architecture reference models to have a flying start and deliver quick
results
 Full lifecycle, full landscape: We take all aspects of the Insights & Data lifecycle into the
scope of architecture, all the way from acquiring and marshaling data, to analytics,
visualization and action. But also all the way from infrastructure to business services
 Delivers step-by-step: With an architected digital platform vision at the heart of your
Insights & Data strategy, you are not only equipped to deliver quick, compelling results right
now but also to address any future business opportunities.

Respond more quickly

We helped a global consumer goods company to first industrialize and then modernize their Insights & Data landscape. With major change objectives in the areas of customer intimacy, continuous improvement and agility, we architected an Insights & Data platform that was able to collect more data from more varied sources and combined a robust, unified information core with highly industrialized access to data and new, flexible ways to analyze and visualize insights.
We maintained the proven quality and robustness of the existing data warehouse solutions, but
augmented them with next-generation technologies – such as Hadoop and Tableau – to add agility
and scalability, and to respond much more quickly to new business needs.

 Learn how data drives digital transformation


 See how enterprises are cracking the data conundrum
 Get inspired by our TechnoVision 2017 approach and our perspectives on how to ‘Thrive On
Data’
 Hear about the essentials of our big data reference architecture
 See our vision on the Business Data Lake explained, to get a flavor of what the new data
landscape looks like.

https://blogs.sas.com/content/hiddeninsights/2013/10/11/how-well-are-you-managing-the-analytical-life-cycle/

How Well Are You Managing the Analytical Life Cycle?

By Christian Moe on Hidden Insights 11/10/2013

To many organizations, analytics means building a model that reveals new knowledge from data. Unfortunately, that’s too narrow a definition. Now that advanced analytical models are becoming high-value organizational assets – essential tools to attract and retain customers in telco, manage financial risk in insurance and banking, and predict demand and set prices in retail – these models and their underlying data must be managed and governed like other strategic assets. This is rarely done, however. For example, we see that organizations are struggling to:

 Keep track of model versioning. Analysts don’t just develop one model to solve a
business problem. They develop a set of competing models and use different
techniques to address complex problems. They will have models at various stages of
development and models tailored for different product lines and business units. As a
result, your organization can quickly find itself managing thousands of models.
 Do structured and rapid deployment of new models and data. The model and data
environment is anything but static. Models will be continually re-deployed as they are
tested and as new results and data sources become available. As such, the model
deployment process is a much more iterative process than the traditional IT process
of building applications.
 Embed models into decision processes and decisions into model development. Models and
data are not of any use if they only feed reports and dashboards. The results of
analytics should guide business decisions, and the results of those business decisions
should be fed back into models and model development. In a distributed and loosely
managed modeling environment, this is hard to achieve. When different data sets and
variables are used to create the models, and there is little validation or back testing,
results become inconsistent.
As a consequence of the above, managers must make decisions based on the
model results they receive, and everyone hopes for the best. To solve these and
similar challenges in a systematic fashion, you will need to establish proper
governance of the analytics life cycle, i.e. the processes and technology support to
ensure that your organisation's operational use of analytical models sustains and
extends strategies and objectives.
In practice, you should consider establishing:

 A model repository. A central, secure repository that stores extensive documentation about the model, its scoring code and associated metadata. Using the repository, modelers can easily collaborate and reuse model code, with their activities tracked via user/group authentication, version control and audit controls (a minimal sketch of such a repository entry follows this list).
 Automated workflow. Enable the model management process to become more
automated and collaborative. As an example, users should be able to track each step
of a modeling project, from problem statement through development, deployment and
retirement.
 Governance. Accountability metrics and version control status reports on who is
allowed to change what, when control is passed from one area to another, and more.
A centralized model repository, lifecycle templates and version control can provide
visibility into analytical processes and ensure that they can be audited to comply with
internal governance and external regulations.
 Validation. Scoring logic should be validated before models are put into production,
using a systematic template and process to record each test the scoring engine goes
through, to ensure the logic embedded in the champion model is sound.
 Performance monitoring. As the champion model reaches test, stage and production
lifecycle milestones, its status and performance metrics should be pushed to subject
matter experts through standard reporting channels to gauge a model’s fitness for the
business question at hand.
 Deployment. Deployment options and procedures aligned with IT and outsourcing
partners – in batch or real time.
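
As a rough illustration of the kind of metadata such a repository might track for each model, here is a minimal sketch in R; the field names are assumptions made for this example and do not reflect the schema of any particular product:

# Hypothetical repository entry for one model; every field name is illustrative.
model_entry <- list(
  name         = "customer_churn_score",
  version      = "1.3.0",
  stage        = "test",                # e.g. development -> test -> production -> retired
  owner        = "analytics-team",
  problem      = "predict churn in the consumer segment",
  input_data   = "telco_customers_q3",
  scoring_code = "score_churn_v1_3.R",
  validated    = TRUE,                  # scoring logic checked before deployment
  deployed_on  = NA                     # set when the model reaches production
)

str(model_entry)
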
Analytical models use your data to tell you about the likelihood of some future
business event. Since nobody knows exactly what’s going to happen in the future,
managing models is about managing the uncertainty of future outcomes across the
organization. That’s an important enough purpose to deserve rigorous process
controls – strong governance of analytical lifecycle management.

Sources:
