Vous êtes sur la page 1sur 4

Data Mining using SQL Server Analysis Server

Article by Shamaran Satkunanathan

In today’s world, software applications are moving from traditional information systems to Business
Intelligent systems. The growing of data and information have within data need to raise to develop
new kind of applications for the organizations. To address this, data mining solutions have become
an integral part of many software solutions.

Data mining is the process of discovering actionable information from large sets of data. It helps
organizations analyze incredible amounts of data in order to detect common patterns or learn new
things. It uses mathematical analysis to derive patterns and trends from existing data. However
existing data need to be organized via ETL (Extract, Transform, Loading) process before applying
the mining technique on them. This is because typically these patterns cannot be discovered by
traditional data exploration methods since the relationships are too complex or because there is too
much data.

There are many data mining techniques available to analyze data and drive the useful knowledge
and patterns from those. These techniques are ranged from extremely complex to basic. Each
technique serves a slightly different purpose or goal. Here are few examples approaches to data
mining.

Clustering
Cluster detection is a type of pattern recognition that is used to detect patterns within large data
sets. It’s a bit like arranging a large amount of information into categories using patterns which
emerge during data analysis.

Classification
Classification Analysis is a systematic process for obtaining important and relevant information
about data, and metadata – data about data. The classification analysis helps identifying to which
set of categories different types of data belong. Classification analysis is closely linked to cluster
analysis as the classification can be used to cluster data.

Regression
Regression is a technique that aims to predict future outcomes using large sets of existing
variables. This is used to predict future user engagement, customer retention and even property
prices.

Anomaly or Outlier Detection


Anomaly detection refers to the search for data items in a dataset that do not match a projected
pattern or expected behaviour. Anomalies are also called outliers, exceptions, surprises or
contaminants and they often provide critical and actionable information.

1
Association Rule Learning
Association rule learning enables the discovery of interesting relations between different variables
in large databases. Association rule learning uncovers hidden patterns in the data that can be used
to identify variables within the data and the co-occurrences of different variables that appear with
the greatest frequencies.

Microsoft SQL Server Data tools


There are many data mining tools available to apply the data and predict. Here we are going to talk
about Microsoft SQL Server Data tools and how its features support to build & deploy a mining
model.

The Microsoft SQL Server Data tools includes SQL server relational databases, Azure SQL
databases, Integration Services packages, Analysis Services data models and reporting Services.
The analysis server contains data mining algorithms and query tools that make it easy to build a
comprehensive solution for a variety of projects. SQL Server Management Studio, contains tools for
browsing models and managing data mining objects.

Developing a mining application generally takes following steps


1. Defining the problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and updating Models
Following diagram explains the cyclical flow of creating a data mining model. Each step in the
process needs to be repeated many times in order to create a decent model.

Image source: https://i-msdn.sec.s-msft.com/dynimg/IC125015.jpeg

2
Defining the Problem
First step is to determine the scope of the problem and analyzing the business requirements to
defining specific goals for the data mining project. Here we might need to conduct a data availability
study.

Preparing Data
In this step, we are working with a very huge data set and cannot examine every transaction for
data quality; therefore, we might need data profiling and automated data cleansing and filtering
tools, such as Microsoft SQL Server Master Data Services or SQL Server Data Quality Services to
explore the data and find the inconsistencies.

Exploring Data
In this step, we can use tools such as Master Data Services to canvass available sources of data
and define their availability for data mining. We can use tools such as SQL Server Data Quality
Services, or the Data Profiler in Integration Services, to analyze the distribution of data and repair
issues such as incorrect or missing data.

Building Models
In this step, we state the columns of data which we want to use by creating a mining structure.
When we process the mining structure, Analysis Services produces aggregates and other statistical
information that can be used for analysis. This information can be used by any mining model that is
based on the structure.

Processing a model is called as training. Applying a specific mathematical algorithm to the data in
the structure is the process of training. By using training, we can extract patterns. The patterns that
we find in the training process depend on the following three points
 Selection of training data,
 The algorithm we chose,
 How we have configured the algorithm.

We can also use parameters to adjust each algorithm and apply filters to the training data to use
just a subset of the data. After data is passed through the model, the mining model object holds
summaries and patterns that can be queried or used for prediction.

Exploring and Validating Models


By using Analysis Services we can distinct data into training and testing a dataset so that we can
precisely assess the performance of all models on the same data. We use the training dataset to
build the model and the testing dataset to test the accuracy of the model by creating prediction
queries.

We can explore the trends and patterns that the algorithms discover by using the viewers in Data
Mining Designer in SQL Server Data Tools. We can also test how well the models create
predictions by using tools in the designer such as the lift chart and classification matrix. To verify
whether the model is specific to our data or might be used to make inferences on the general
population, we can use the statistical technique called cross-validation to automatically create
subsets of the data and test the model against each subset.

3
Deploying and Updating Models

Here we can use the use the models to create predictions, which we can then use to make
business decisions. SQL Server provides the DMX language that we can use to create prediction
queries, and Prediction Query Builder to help you build the queries.

References
 http://charc-concepts.org/the-benefits-of-data-mining/
 https://msdn.microsoft.com/en-us/library/bb522607.aspx
 https://msdn.microsoft.com/en-us/library/bb510516.aspx