A Comparative Study of Single-Step and Multi-Step Data Mining Tools

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 10, OCTOBER 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Dost Muhammad Khan1, Nawaz Mohamudally2
1 Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN & PhD Student, School of Innovative Technologies & Engineering, University of Technology, Mauritius (UTM)
2 Associate Professor & Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius
Abstract
It is widely agreed that data mining is not a single-step process and that the discovery of knowledge from a dataset is the result of successive processes, called multi-step. Current data mining tools are designed to solve discrete tasks, such as classification or clustering; hence the tools act as single-step processes and fail to produce knowledge. Furthermore, in single-step tools the extraction of knowledge depends on choosing the right algorithms and on how the output is analyzed, because most of the algorithms are generic and no context-specific logic is attached to the application. The choice of algorithm remains ad hoc in many data mining tools. The scientific community is well aware of this problem and has faced multiple challenges in establishing consensus on a unified data mining theory (UDMT) based on multi-step data mining processes. In this paper we compare single-step and multi-step data mining tools. In single-step data mining tools the selection of algorithms is ad hoc, which is inadequate to produce knowledge. The multi-step data mining tool, in which the selection of algorithms depends on the nature of the data, provides knowledge to the user.
Keywords: Unified Data Mining Theory (UDMT), ODM, MS SQL Server, Unified Data Mining Tool (UDMTool), Single-step Tool, Multi-step Tool, Unified Data Mining Processes (UDMP)
1. Introduction
The ultimate goal of data mining is to extract knowledge from a dataset. According to Michael Berry, there are two types of data mining: directed and undirected. Directed data mining explains or categorizes a particular target field, such as income or response, whereas undirected data mining finds patterns or similarities among groups of records without the use of a particular target field [1]. Data mining techniques help to find the relationships between multiple parental variables and the outcomes they influence. Because of their robustness, scalability and efficiency, these techniques are used in different fields of study such as bioinformatics, genetics, medicine and education. Classification, clustering, regression and association rules are the main areas of data mining [2][3][4]. Current data mining algorithms and techniques are designed for individual problems, such as classification or clustering. The choice of algorithm depends on the intended use of the extracted knowledge: the data can be used either to predict future behavior or to describe patterns in an understandable form within the discovery process. There exist many data mining tools and environments, among them Excel, File Search Assistant, SAS Text Miner, SPSS Clementine, SPSS LexiQuest, Oracle Data Mining (ODM) and MS SQL Server, but none of these tools unifies multiple data mining tasks [5][6][7]. The usefulness of these tools depends on the specific problem; a tool may produce good results on one type of data and misleading results on another, and each tool can solve only an individual data mining task. Researchers now emphasize that the discovery of knowledge from a dataset is not a single-step process but a multi-step process. Unfortunately, the available tools are based on a single-step process and therefore fail to produce knowledge.
The unified data mining tool (UDMTool), a multi-step tool, unifies the clustering, classification and visualization tasks of data mining and hence provides a better solution than the other available data mining tools [8]. In this paper we compare data mining tools in which the selection of algorithms is ad hoc, which is inadequate to produce knowledge, with the unified data mining theory (UDMT), in which the selection of algorithms depends on the nature of the data and which provides knowledge to the user. In a single-step tool the data mining tasks clustering, classification, association and feature selection are carried out individually; there is therefore no link between these tasks, and it is difficult to extract knowledge from the given dataset. In a multi-step tool, on the other hand, the data mining tasks clustering, classification and visualization are unified, so the tool looks like a single step to the user and provides knowledge as its output. The rest of the paper is organized as follows: section 2 deals with the data mining tools, section 3 compares the tools, the results are discussed in section 4, and the conclusion is drawn in section 5.
2012 Journal of Computing Press, NY, USA, ISSN 2151-9617
2. Data Mining Tools

In this section we discuss the single-step data mining tools ODM and MS SQL Server and a multi-step data mining tool called UDMTool.
2.1 Oracle Data Mining (ODM)
The architecture of ODM is based on the Cross Industry Standard Process for Data Mining (CRISP-DM) model, which was founded in 1997 and funded by the European Commission. The main idea was to define an industry standard for data mining [9]. The CRISP-DM process model has six steps:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
ODM implements and supports the last three steps of the CRISP-DM model. The main components of ODM are shown below:
Data Source
Modeling
Evaluation and Deployment
Data mining is an iterative process; the process continues after a solution is deployed. The lessons learned during the process can trigger new business questions, and any change in the data can require new models. Subsequent data mining processes benefit from the experience of previous ones. The remaining steps are supported by a combination of ODM and the Oracle database, especially in the context of an Oracle data warehouse. The facilities of the Oracle database can be very useful during data understanding and data preparation. ODM integrates data mining with the Oracle database and exposes data mining through the following interfaces: the Java interface, the PL/SQL interface, automated data mining, the data mining SQL functions and the graphical interfaces. ODM supports the export and import of data mining models in native format between Oracle databases or schemas as a way to move models [9][10][13]. The workflow of ODM is illustrated in figure 1.
Figure 1. The Workflow of the ODM
Figure 1 depicts the workflow of ODM. The data source is the dataset, explore data provides a view of the dataset, and the model is one of the data mining models such as clustering, classification, association and feature extraction. These are the components required for mining in ODM. The next phase is to apply the model to the dataset and finally to store the results in a separate table for further processing. The user can build a model with only two components, the data source and the model; the remaining components merely assist the user.
2.2 MS SQL Server
MS SQL Server also uses the Cross Industry Standard Process for Data Mining (CRISP-DM) model:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Data mining is a process that involves the interaction of multiple components. In MS SQL Server one can access sources of data in a SQL Server database or any other data source for training, testing or prediction; define the data mining structures and models by using Business Intelligence Development Studio or Visual Studio 2008; and manage the data mining objects and create predictions and queries by using SQL Server Management Studio. After the solution is complete, it is deployed to an instance of Analysis Services. The main components of the MS SQL Server are shown below:
Data Source
Data Mining Structure
Data Mining Models
Deployment
In MS SQL Server, data mining can be done quickly and easily on relational data tables or on any other data source that has been defined as an Analysis Services data source view. MS SQL Server 2008 Analysis Services also provides the ability to separate the data into training and testing datasets. A data mining structure is a logical data structure that defines the data domain from which mining models are built. A single mining structure can support multiple mining models that share the same domain. The data mining structure can also be partitioned into a training and a test dataset; this partitioning can be done automatically when the data mining structure is defined. A data mining model represents a combination of data, a data mining algorithm, and a collection of parameter and filter settings that affect the data used and how the data is processed. The ultimate goal of data mining development is to create a model that can be used by end users [12][14].
2.3 The Unified Data Mining Tool (UDMTool)
The Unified Data Mining Tool (UDMTool) is a new and better next-generation solution based on the UDMT, a unified way of architecting and building software solutions by integrating different data mining tasks. The foundation of the UDMTool is that knowledge can only be obtained if the data mining processes clustering, classification and visualization are unified. This is the Unified Data Mining Theory (UDMT): knowledge can be extracted from a given dataset after it has passed through all the data mining processes. This is illustrated in equation (1).
$$\text{Knowledge} = \text{Clustering} \cup \text{Classification} \cup \text{Visualization} \tag{1}$$
It can be written as in equation (2).

$$K = A \cup B \cup C \tag{2}$$
Where A is the clustering, B is the classification, C is the visualization and K is the knowledge. The architecture of the UDMTool is based on the unified data mining process (UDMP) as illustrated in figure 2.
Figure 2. The Unified Data Mining Process
The first three processes in figure 2 are data gathering, data cleansing and the preparation of a dataset. The next process unifies the clustering, classification and visualization processes of data mining, called the unified data mining processes (UDMP), followed by the output, which is the knowledge. The user evaluates and interprets the knowledge according to his or her business rules. The dataset is the only required input; the knowledge is produced as the final output of the UDMP. In contrast to the ad hoc data mining models, the UDMP selects the appropriate data mining algorithms automatically, depending on the nature and the values of the given dataset. Figure 3 depicts the architecture of the UDMTool.
Figure 3. The Architecture of the UDMTool
The UDMTool is a multiagent system (MAS). The dataset is the required input; there are many types of datasets, such as numeric, categorical, multimedia and text. The first agent takes the dataset and computes the value of the Akaike Information Criterion (AIC), a model selection criterion; the second agent creates the appropriate vertical partitions of the dataset; and the third agent computes the logarithm of the complexities O of the data mining algorithms deployed in the UDMTool. The fourth agent feeds the vertical partitions of the dataset to the UDMP, which is itself a MAS with one agent for clustering, a second for classification and a third for visualization. These agents are cascaded, i.e. the output of the first agent is the input of the second and the output of the second is the input of the third. The appropriate data mining algorithms for clustering, classification and visualization are selected through the AIC value of the given dataset; the process is completed by an agent which maps the value of AIC to the logarithmic value of the complexities O of the data mining algorithms. The function of the UDMTool is demonstrated in figure 4.
Figure 4. The Function of the UDMTool
A well-prepared dataset is the input of this framework. First, an intelligent agent computes the value of the model selection criterion AIC, which is used to select the appropriate data mining algorithm. The UDMP, a MAS, is based on the UDMT. Finally, the knowledge is extracted, which is either accepted or rejected. The relationship between the dataset and the selection criterion is one-to-one, i.e. one dataset has one value for model selection, and the relationship between the dataset and the vertical partitions is one-to-many, i.e. more than one partition is created for one dataset. The relationship between the selection criterion and the UDMP is one-to-one, i.e. one value of the selection criterion gives one data mining algorithm, and finally the relationship between the vertical partitions and the UDMP is many-to-many, i.e. many partitioned datasets are inputs for the UDMP and only one result is produced as knowledge.
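The data flow described in this section (vertical partitions fed into a cascade of clustering, classification and visualization agents) can be sketched in Python. This is only an illustration under stated assumptions, not the UDMTool implementation: the function names, the mean-threshold clustering, the range-based rules, the text rendering in place of a 2D graph and the toy records are all hypothetical.

```python
from statistics import mean

def vertical_partitions(rows, columns, width=2):
    """Split a table into column groups (vertical partitions) of `width` attributes."""
    parts = []
    for start in range(0, len(columns), width):
        names = columns[start:start + width]
        idx = [columns.index(n) for n in names]
        parts.append((names, [[row[i] for i in idx] for row in rows]))
    return parts

def cluster_agent(records):
    """Toy clustering (assumption): split records at the mean of their row sums."""
    sums = [sum(r) for r in records]
    cut = mean(sums)
    return [0 if s <= cut else 1 for s in sums]

def classify_agent(records, labels):
    """Toy classification (assumption): one value range per cluster, used as a rule."""
    rules = {}
    for label in sorted(set(labels)):
        sums = [sum(r) for r, l in zip(records, labels) if l == label]
        rules[label] = (min(sums), max(sums))
    return rules

def visualize_agent(names, rules):
    """Toy visualization (assumption): render the rules as text instead of a graph."""
    return [f"IF {lo} <= sum({', '.join(names)}) <= {hi} THEN cluster {label}"
            for label, (lo, hi) in rules.items()]

def udmp(names, records):
    """Cascade: the output of each agent is the input of the next one."""
    labels = cluster_agent(records)
    rules = classify_agent(records, labels)
    return visualize_agent(names, rules)

# Hypothetical records over four of the Breast Cancer attributes.
columns = ["CT", "UCS", "UCSh", "MAdh"]
rows = [[1, 1, 1, 1], [2, 1, 1, 1], [8, 7, 9, 6], [10, 9, 8, 7]]
for names, records in vertical_partitions(rows, columns):
    for line in udmp(names, records):
        print(line)
```

The point of the sketch is the wiring: the user supplies only the dataset, and each stage consumes the previous stage's output without a manual choice in between.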
3. A Comparison of ODM, MS SQL Server and UDMTool

A comparison is drawn between ODM, MS SQL Server and UDMTool in table 1.
Table 1. A Comparison of ODM, MS SQL Server and UDMTool

ODM: It is not a magic wand. The user has to select an appropriate data mining algorithm manually from the available pool, and if the required results are not obtained from the selected algorithm, another one has to be chosen. In this suite one algorithm serves one data mining task, e.g. k-means for clustering, but the clusters produced only present groups of the data; they are not knowledge and do not serve any purpose for the user. In order to extract a feature or pattern from the given dataset, the user has to combine or unify different algorithms manually, one by one, before the desired results are obtained.
MS SQL Server: It is not a magic wand. The user has to combine the different data mining algorithms provided by MS on an ad hoc basis in order to find the solution of the problem. MS SQL Server does not provide any facility which shows that a given combination of algorithms will produce better results for the problem. It provides a facility to view the cluster profiles, which helps the user to select a cluster for further processing.
UDMTool: It is a magic wand. The tool is based on the Unified Data Mining Theory (UDMT). There is no need to select any data mining algorithm; the tool automatically selects suitable algorithms according to the nature of the data and produces the knowledge in the form of 2D graphs. The processes for the extraction of knowledge from the given datasets are unified, which makes it easier for the user to produce the required results.

ODM: There is no need to prepare a dataset for mining. It supports already created databases. It also provides a training facility for a dataset.
MS SQL Server: There is no need to prepare a dataset for mining. It supports already created databases. It also provides a training facility for a dataset.
UDMTool: The user has to prepare the dataset in the form of a text or data file. The tool does not support any databases.

ODM: The Java Implementation Interface supports only numeric datasets, and the DBMS_DATA_Mining Interface supports categorical and numeric data.
MS SQL Server: The suite of MS algorithms supports numeric and categorical datasets.
UDMTool: The tool supports only numeric datasets because all the programs are implemented in Java.

ODM: The user has to set the parameters of each algorithm in order to produce a useful pattern from the dataset. If no parameter is set, the algorithm takes the default values automatically, i.e. the algorithms are not optimized for the requirements of the given dataset.
MS SQL Server: The user has to set the parameters of each algorithm in order to produce a useful pattern from the dataset. If no parameter is set, the algorithm takes the default values automatically, i.e. the algorithms are not optimized for the requirements of the given dataset. The algorithms in MS SQL Server have more parameters than those in ODM.
UDMTool: The algorithms are optimized in this tool. Therefore, there is no need to set default parameters.

ODM: Supports only a limited number of algorithms for each of the data mining tasks such as clustering and classification.
MS SQL Server: Supports only a limited number of algorithms for each of the data mining tasks such as clustering and classification.
UDMTool: There is no such limit in the tool; the user can add further algorithms as required.

ODM: ODM does not provide visualization of the data; for this purpose the user has to export the results to other visualization tools such as MS Excel.
MS SQL Server: The results of MS SQL Server can be opened in MS Excel using Add-ins, which we regard as a separate data visualization facility.
UDMTool: The tool directly provides the visualization of the dataset, which helps the user to draw conclusions and extract knowledge.

ODM: It supports model evaluation using BIC, export and import, comparison and cross validation only in the Java Implementation Interface. Some of these facilities are not supported by the other implementations of ODM.
MS SQL Server: In MS SQL Server, the accuracy of mining models is tested through the Mining Accuracy Chart, which plots a Lift Chart showing the performance of different models under different algorithms.
UDMTool: It supports model evaluation and selection using AIC only. If the user wants to import or export any result, copy/paste can be used.

ODM: ODM implements data mining through Java objects in function settings and algorithm settings.
MS SQL Server: MS SQL Server uses Data Mining Extensions (DMX), which extend SQL commands.
UDMTool: UDMTool implements data mining algorithms through Intelligent Agents, developed in Java.

ODM: A Graphical User Interface is provided by ODM.
MS SQL Server: An IDE is provided by MS SQL Server. Mining Model Wizards help the user to choose among the different data sources and the different MS algorithms, which makes the system user friendly.
UDMTool: A Graphical User Interface is provided by UDMTool.

ODM: There is no such limit in ODM, but if the user is applying the Java language there may be some constraints.
MS SQL Server: There is no such limit in MS SQL Server.
UDMTool: The UDMTool supports: Number of parameters = 23, Number of Attributes = 211, Sample Size = 12000.
Table 1 shows that in ODM and MS SQL Server the selection of algorithms is ad hoc. Both data mining suites provide statistical information about the dataset, but this information is not sufficient to extract knowledge from it. The data mining processes clustering, classification and visualization are carried out individually in ODM and MS SQL Server, and there is no relation between these processes; it is therefore difficult to extract knowledge. The proposed UDMTool, on the other hand, unifies all the data mining processes required to extract knowledge, and the data mining algorithm(s) in each process are selected through the value of the model selection criterion AIC and the complexities O of the data mining algorithm(s).
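For readers unfamiliar with AIC, a minimal sketch of how the criterion ranks candidate models may help. It uses the general definition AIC = 2k - 2 ln(L), which for a least-squares fit with Gaussian errors reduces to n ln(RSS/n) + 2k up to an additive constant; the candidate with the lowest value is preferred. The two toy models and the data are assumptions made for illustration. This section does not state how UDMTool computes AIC for a dataset or how it maps the value to algorithm complexities, so the sketch stops at the comparison.

```python
import math

def aic_from_rss(rss, n, k):
    """AIC of a least-squares fit with Gaussian errors (additive constant dropped)."""
    return n * math.log(rss / n) + 2 * k

def rss_constant(ys):
    """Residual sum of squares of the model y = mean(y)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_linear(xs, ys):
    """Residual sum of squares of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (assumption): roughly linear, so the linear model should win.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

scores = {
    "constant (k=1)": aic_from_rss(rss_constant(ys), len(ys), 1),
    "linear (k=2)": aic_from_rss(rss_linear(xs, ys), len(ys), 2),
}
best = min(scores, key=scores.get)  # lowest AIC is preferred
print(best, scores)
```

The penalty term 2k is what keeps the criterion from always favoring the model with more parameters.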
4. Results and Discussion

MS SQL Server, ODM and the UDMTool are tested on a variety of datasets: Diabetes, a medical dataset; Breast Cancer, a medical dataset; Iris, an agriculture dataset; Sales, an account dataset; and Cars, a vehicle dataset. We present the results for the Breast Cancer dataset. Its attributes are: Clump Thickness (CT), Uniformity of Cell Size (UCS), Uniformity of Cell Shape (UCSh), Marginal Adhesion (MAdh), Single Epithelial Cell Size (SECS), Bare Nuclei (BNu), Bland Chromatin (BCh), Normal Nucleoli (NNu), Mitoses and Class (benign, malignant) [19].
Case 1: The Results of MS SQL Server
1. The Result of MS Clustering Algorithm
Figure 5. The Diagram of the Clusters of the Breastcancer dataset
We apply the MS clustering data mining algorithm, which is similar to the k-means clustering algorithm. Figure 5 shows the 10 clusters of the given dataset without the predictable variable. The solid lines show a strong relation between clusters and the thin lines show a weak relation. As figure 5 shows, there is a strong relation between cluster 1 and clusters 7 and 3 and the other clusters. On the other hand, there is a weak relation between clusters 1 and 10, clusters 2 and 9, clusters 2 and 6, and clusters 5 and 6. From figure 5 one can only visualize the structure of the clusters and their relations; it is still difficult to derive useful information. The population of each cluster, i.e. the number of records per cluster, becomes visible when the cursor is placed on the cluster.
The MS clustering algorithm produces 10 clusters by default. If the user wants a different number, this can only be done by programming the MS clustering algorithm; the wizards offer no option to select the number of clusters. Why the algorithm produces 10 clusters for every dataset is an open issue with the MS clustering algorithm. The algorithm uses either the horizontal partition or the vertical partition. All clustering data mining algorithms are unsupervised machine learning algorithms; therefore, there is no need to specify the predicted or target variable in the dataset. The next tables are the extra features available in MS SQL Server 2005.
Table 2. Clusters Profile
Attribute | Population (All), Size: 233 | Cluster 1, Size: 84 | Cluster 2, Size: 36 | Cluster 3, Size: 27 | Cluster 4, Size: 24 | Cluster 7, Size: 18 | Cluster 5, Size: 14 | Cluster 8, Size: 12 | Cluster 6, Size: 11 | Cluster 9, Size: 5 | Cluster 10, Size: 2
B Ch | 3.27+/-2.37 | 1.90+/-0.79 | 5.76+/-2.15 | 1.78+/-0.80 | 6.55+/-2.31 | 2.19+/-1.01 | 5.03+/-2.01 | 1.78+/-0.84 | 3.31+/-2.34 | 2.41+/-0.80 | 4.00+/-1.41
B Nu | 3.22+/-3.40 | 1.04+/-0.19 | 6.53+/-3.16 | 1.00 | 7.52+/-3.33 | 1.15+/-0.37 | 8.62+/-2.51 | 2.62+/-1.35 | 3.23+/-2.17 | 1.16+/-0.39 | 2.00
CT | 4.15+/-2.75 | 2.39+/-1.40 | 6.09+/-2.33 | 3.37+/-1.66 | 7.41+/-2.32 | 2.88+/-1.70 | 8.85+/-1.26 | 2.65+/-1.65 | 3.49+/-1.78 | 3.90+/-1.13 | 3.00+/-2.83
M Adh | 2.63+/-2.65 | 1.00 | 4.88+/-2.49 | 1.67+/-0.83 | 6.47+/-3.02 | 1.00+/-0.02 | 4.76+/-3.21 | 2.98+/-2.90 | 1.26+/-0.63 | 2.07+/-1.22 | 2.00
Mitoses | 1.52+/-1.61 | 1.00 | 1.00 | 1.00 | 4.72+/-3.02 | 1.00 | 1.93+/-0.61 | 1.60+/-1.89 | 1.00 | 1.00 | 2.00
N Nuc | 2.65+/-2.83 | 1.00 | 6.04+/-3.24 | 1.13+/-0.34 | 6.46+/-3.11 | 1.74+/-0.77 | 4.69+/-2.24 | 1.00 | 1.00 | 1.87+/-0.36 | 2.50+/-0.71
SECS | 3.03+/-2.08 | 1.93+/-0.37 | 4.89+/-2.05 | 2.00 | 6.70+/-2.60 | 2.00 | 3.22+/-1.05 | 2.00+/-1.03 | 2.29+/-0.82 | 2.64+/-0.94 | 2.00
UC Sh | 2.91+/-2.81 | 1.00 | 6.14+/-2.27 | 1.93+/-0.92 | 7.66+/-2.54 | 1.40+/-0.74 | 4.19+/-1.75 | 1.10+/-0.32 | 2.47+/-1.63 | 1.19+/-0.43 | 1.50+/-0.71
UCS | 2.81+/-2.86 | 1.00 | 5.89+/-2.52 | 1.11+/-0.32 | 8.02+/-2.31 | 2.01+/-0.86 | 3.94+/-1.64 | 1.00 | 1.88+/-1.08 | 1.31+/-0.50 | 2.50+/-0.71
Class, Population (All): benign 164, malignant 69, missing 0
Class, probabilities over the ten clusters: benign 1.000, 1.000, 0.990, 1.000, 0.069, 1.000, 0.000, 1.000, 0.124, 1.000; malignant 0.000, 0.000, 0.010, 0.000, 0.931, 0.000, 1.000, 0.000, 0.876, 0.000; missing 0.000 in every cluster
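The numeric profile entries appear to pair a per-cluster mean with a deviation. A minimal sketch of how such an entry can be computed is given below. It assumes the population standard deviation and prints a bare mean when the deviation is zero, mirroring entries such as 1.00; neither choice is confirmed by the source as what MS SQL Server actually does, and the Mitoses values and cluster labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cluster_profile(values, labels):
    """Return {cluster: 'mean+/-deviation'} for one attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    profile = {}
    for label, members in sorted(groups.items()):
        m, sd = mean(members), pstdev(members)
        # Assumption: a zero deviation is displayed as the bare mean.
        profile[label] = f"{m:.2f}" if sd == 0 else f"{m:.2f}+/-{sd:.2f}"
    return profile

# Hypothetical Mitoses values for six records assigned to two clusters.
mitoses = [1, 1, 1, 2, 4, 6]
labels = [1, 1, 1, 2, 2, 2]
print(cluster_profile(mitoses, labels))
```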
Table 2 shows the profile of each cluster over all the attributes of the given dataset. The table also shows the size of each cluster, i.e. the number of records per cluster. The attribute Class has only two values, benign and malignant, and all the other attributes have integer values in the given dataset, but the MS clustering algorithm shows two values for each attribute, which may confuse the user. The value of each attribute varies from cluster to cluster. Table 2 is somewhat difficult to interpret.
Table 3. Clusters Characterizing
Variables | Values | Probability
Class | benign | 70.386%
Class | malignant | 29.614%
B Nu | 3.2 - 5.5 | 24.980%
B Ch | 3.3 - 4.9 | 24.980%
UC Sh | 2.9 - 4.8 | 24.980%
SECS | 1.6 - 3.0 | 24.980%
CT | 4.2 - 6.0 | 24.980%
CT | 2.3 - 4.2 | 24.980%
N Nuc | 2.7 - 4.6 | 24.980%
B Ch | 1.7 - 3.3 | 24.980%
UCS | 2.8 - 4.7 | 24.980%
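The two Class probabilities in Table 3 are consistent with the population counts reported in Table 2 (164 benign and 69 malignant out of 233 records). A short check, using only those published counts:

```python
# Population counts of the Class attribute as reported in Table 2.
counts = {"benign": 164, "malignant": 69}
total = sum(counts.values())  # 233 records

# Relative frequency of each class, as a percentage with three decimals.
probabilities = {name: round(100 * n / total, 3) for name, n in counts.items()}
print(probabilities)
```

In other words, these two rows report the class distribution of the whole dataset rather than a property of any single cluster.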
Table 3 characterizes the clusters: it lists each attribute (variable), its value range and the probability of the variable. The values and probabilities of the attributes SECS, MAdh, UCSh, UCS, CT and BNu are higher in some clusters than those of the remaining attributes.
Table 4. Cluster Discrimination
Columns: Variables, Values, Favors Cluster 1, Favors Complement of Cluster 1
Variables | Values
UCS | 1.0
UC Sh | 1.0
N Nuc | 1.0
M Adh | 1.0
Mitoses | 1.0
B Nu | 1.0 - 1.5
UC Sh | 1.0 - 10.0
UCS | 1.0 - 10.0
M Adh | 1.0 - 10.0
N Nuc | 1.0 - 10.0
B Nu | 1.5 - 10.0
SECS | 1.3 - 2.5
Mitoses | 1.0 - 10.0
SECS | 2.5 - 10.0
Class | benign
Class | malignant
B Ch | 1.0 - 2.8
B Ch | 2.8 - 10.0
CT | 1.0 - 3.3
CT | 3.3 - 10.0
SECS | 1.0 - 1.3
Scores: 0.000, 0.069, 0.288, 0.324, 3.879, 28.032, 51.050, 52.941, 53.583, 54.846, 59.424, 61.108, 64.856, 76.031, 79.337, 79.337, 80.075, 85.118, 90.353, 91.269, 96.887
Table 4 shows the cluster discrimination; only the results for cluster 1 are shown. The table lists what favors cluster 1 and what favors the complement of cluster 1. The results of the remaining clusters can be displayed in the same way. These are the three options available after applying the MS clustering algorithm.
2. The Results of MS Decision Tree Algorithm
Figure 6. The Decision Tree of the Breastcancer dataset
We apply the MS Decision Tree Algorithm, which is the ID3 data mining algorithm, to the Breastcancer dataset. Figure 6 depicts the structure of the decision tree. In our proposed UDMTool we produce rules instead of a tree. In MS SQL Server, in order to get the decision rules, one has to apply the MS Association Rules.
3. The Results of MS Association Rules
Table 5. The Association Rules
Support | Size | Itemset
196 | 1 | Mitoses < 1.1818008626
164 | 1 | Class = benign
160 | 2 | Class = benign, Mitoses < 1.1818008626
154 | 1 | SECS < 2.352245034
148 | 2 | SECS < 2.352245034, Mitoses < 1.1818008626
147 | 1 | N Nuc < 1.4350025798
145 | 2 | SECS < 2.352245034, Class = benign
144 | 2 | N Nuc < 1.4350025798, Mitoses < 1.1818008626
142 | 3 | SECS < 2.352245034, Class = benign, Mitoses < 1.1818008626
141 | 2 | N Nuc < 1.4350025798, Class = benign
140 | 3 | N Nuc < 1.4350025798, Class = benign, Mitoses < 1.1818008626
140 | 1 | UCS < 1.6782988738
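The Support and Size columns of Table 5 can be read as the number of records containing every item of an itemset and the number of items in it. A minimal sketch of that counting is shown below; the four records and the rounded thresholds in the item names are hypothetical, and the brute-force enumeration is a generic illustration, not the MS Association Rules implementation.

```python
from itertools import combinations

def itemset_support(transactions, max_size=2):
    """Count, for every itemset up to max_size, the records that contain it."""
    counts = {}
    for record in transactions:
        items = sorted(record)
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return counts

# Hypothetical records, each described by the discretized items it satisfies.
transactions = [
    {"Mitoses < 1.18", "Class = benign", "SECS < 2.35"},
    {"Mitoses < 1.18", "Class = benign"},
    {"Mitoses < 1.18"},
    {"SECS < 2.35", "Class = benign"},
]

support = itemset_support(transactions)
# List itemsets by descending support, as Table 5 does.
for itemset, count in sorted(support.items(), key=lambda kv: -kv[1]):
    print(count, len(itemset), ", ".join(itemset))
```

Even on four records the list grows quickly, which is the practical reason a user is left to pick out the relevant itemsets by hand.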
Table 5 shows the association rules of the Breastcancer dataset. We show only the itemsets with the highest support; the MS Association Rules algorithm produces a long list, which leaves the user unsure how to select specific values and obtain the required results. It is important to note that the MS Association algorithm has to be applied in order to get the rules, because the decision tree in MS SQL Server does not produce decision rules. The proposed UDMTool uses the C4.5 data mining algorithm for classification and produces only a few rules in if-then-else form, which make it easy for the user to take a decision.
Case 2: The Results of ODM 11g2
In ODM, unlike MS SQL Server, there is no option to save the results of each data mining process; the results are therefore saved using print screen. Figure 7 depicts the workflow of the clustering model; the other data mining models such as classification, association and feature selection are applied similarly.
Figure 7. The Workflow of the Clustering Model
ODM provides the user with a visual workflow for each model. Figure 7 shows the workflow of the clustering model. The data source, which is an Oracle table or a dataset, is the required component. The second component is explore data, which is basically a view of the dataset; we consider it optional. The last component is a model, which is one of the data mining processes clustering, classification, association and feature selection from the list provided by ODM. The user can apply only one model at a time, which is why we refer to ODM as a single-step tool. A link is created between the data source and explore data, and between the data source and a model. Finally, the model is built: ODM applies all the available data mining algorithms in the model, and the user can compare the results of all the algorithms and also view the results of a particular data mining algorithm. The user can also store the results in a separate table.
1. The Enhanced k-means Clustering Algorithm
Figure 8. The Results of the K-means Clustering Model
We apply the enhanced k-means clustering algorithm of ODM. The algorithm uses the top-down, or divisive, technique of hierarchical clustering. ODM offers an option to set the parameters of the algorithm; if the parameters are not set, ODM uses the defaults. We test the dataset with the default parameters. ODM creates the clusters in a tree structure; the clusters are shown in figure 8. ODM also characterizes each cluster, giving the centroids and the cluster rules separately, which gives the user a better understanding of the cluster. In this way we assume that ODM is unifying the clustering and classification processes.
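The top-down (divisive) idea can be illustrated with a generic bisecting k-means on one-dimensional data: start with one cluster and repeatedly split the cluster with the largest within-cluster sum of squares into two. This is a textbook sketch under stated assumptions (1-D points, splitting by squared error, toy data) and is not a description of ODM's enhanced k-means internals.

```python
def sse(points):
    """Within-cluster sum of squared errors around the mean."""
    m = sum(points) / len(points)
    return sum((p - m) ** 2 for p in points)

def two_means(points, iters=20):
    """Split 1-D points into two groups with plain 2-means, seeded at min and max."""
    centers = [min(points), max(points)]
    for _ in range(iters):
        left = [p for p in points if abs(p - centers[0]) <= abs(p - centers[1])]
        right = [p for p in points if abs(p - centers[0]) > abs(p - centers[1])]
        centers = [sum(left) / len(left), sum(right) / len(right)]
    return left, right

def bisecting_kmeans(points, k):
    """Divisive clustering: keep splitting the worst cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        if len(set(worst)) < 2:  # nothing left to split
            break
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return sorted(clusters, key=min)

# Toy data (assumption): three well-separated groups.
print(bisecting_kmeans([1, 2, 3, 10, 11, 12, 30, 31], 3))
```

The sequence of splits forms a binary tree of clusters, which matches the tree-structured output described for figure 8.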
Figure 9. The Results of the K-means Clustering Model with Centroid
Figure 9 shows the values of the centroids of a cluster. The centroid values play no role in the extraction of knowledge from a dataset.
Figure 10. The Results of the K-means Clustering Model with Cluster Rules
Figure 10 shows the rules of a cluster, which is a task of the classification data mining process. The rules of a cluster, also known as decision rules, play a vital role in the extraction of knowledge from a dataset. Our proposed UDMTool, by contrast, provides the decision rules of each cluster in the next step by using C4.5, a classification data mining algorithm. The user can apply these decision rules in simple queries for further validation of the results.
2. The Results of Classification using Decision Tree Algorithm
Figure 11. The Decision Tree Algorithm with Decision Rules
We apply the decision tree algorithm from the classification model of ODM; the results are shown in figure 11. The algorithm creates a tree structure of clusters and characterizes each cluster in the form of rules, surrogates and target values. Furthermore, the number of clusters produced by the decision tree algorithm differs from that of the enhanced k-means clustering algorithm. The decision rules give the user a better understanding of the cluster. In this way we assume that ODM is unifying the clustering and classification processes.
Figure 12. The Decision Tree Algorithm with Surrogates
Figure 12 shows the values of the surrogates of a cluster. The surrogate values play no role in the extraction of knowledge from a dataset.
Figure 13. The Decision Tree Algorithm with Target Values
Figure 13 shows the target values in a cluster. The percentage of the target values varies from cluster to cluster. We can say that the target values play no role in the extraction of knowledge from a dataset.
Remark: After applying the clustering and classification models of ODM, it is difficult for the user to select the right model, because in both models the clusters are created first and then the rules of each cluster are produced, yet the output of the two is not the same. In UDMTool the first process is clustering, followed by classification and visualization; therefore, this problem does not arise in the multi-step tool. We can say that the results of the clustering model are accurate, because in the data mining process model the clusters are created first and the rest of the processes are then applied to extract useful information and knowledge.
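The statement that decision rules can be applied in simple queries for validation can be illustrated with a short sketch. The rule below, its thresholds and the three records are hypothetical and are not rules produced by ODM or the UDMTool; the sketch only shows how an if-then-else rule is checked against labelled records.

```python
def rule(record):
    """Hypothetical rule: IF UCS <= 2 AND BNu <= 3 THEN benign ELSE malignant."""
    if record["UCS"] <= 2 and record["BNu"] <= 3:
        return "benign"
    return "malignant"

# Hypothetical labelled records used to validate the rule.
records = [
    {"UCS": 1, "BNu": 1, "Class": "benign"},
    {"UCS": 8, "BNu": 10, "Class": "malignant"},
    {"UCS": 2, "BNu": 5, "Class": "benign"},
]

# The "query": select the records on which the rule agrees with the known class.
hits = [r for r in records if rule(r) == r["Class"]]
accuracy = len(hits) / len(records)
print(len(hits), "of", len(records), "records match the rule")
```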
Case 3: The Results of UDMTool
UDMTool produces 2D scatter graphs as the final output(s) for the Breastcancer dataset, which can be interpreted as knowledge.
Figure 14. The Graph between UCSh and MAdh attributes of Breastcancer dataset
The graph in Figure 14 can be divided into two regions: in the first region the values of the attributes Uniformity of Cell Shape (UCSh) and Marginal Adhesion (MAdh) vary, and in the second region they are constant. The outcome of this graph is that if the values of the attributes vary, the patient has the malignant class of breast cancer, and if the values are constant, the patient has the benign class.
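The reading of Figure 14 (varying values in one region, constant values in the other) can also be checked numerically by measuring the spread of each attribute within a cluster. The following is a minimal sketch under stated assumptions: the values are invented for illustration and are not taken from the Breastcancer dataset, and the names `clusters`, `spread` and `describe` are hypothetical helpers, not part of UDMTool.

```python
from statistics import pstdev

# Hypothetical toy values, NOT taken from the Breastcancer dataset.
clusters = {
    "cluster_1": {"UCSh": [1, 1, 1, 1, 1], "MAdh": [1, 1, 1, 1, 1]},
    "cluster_2": {"UCSh": [3, 8, 10, 5, 7], "MAdh": [2, 9, 4, 10, 6]},
}


def spread(values):
    """Population standard deviation; 0.0 means the attribute is constant."""
    return pstdev(values)


def describe(cluster):
    """Label a cluster 'constant' only if every attribute has zero spread."""
    if all(spread(values) == 0 for values in cluster.values()):
        return "constant"
    return "variable"


summary = {name: describe(attrs) for name, attrs in clusters.items()}
print(summary)
```

On real data a small tolerance would probably replace the exact zero test, since the attributes are described as almost constant rather than strictly constant.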
Figure 15. The Graph between BCh and Mitoses attributes of Breastcancer dataset
The values of the attributes Mitoses and Bland Chromatin (BCh) are almost constant throughout the graph in Figure 15. The graph can nevertheless be divided into two main regions: the values of the attributes Bland Chromatin and Mitoses vary in the first region and remain constant in the second region. The outcome of this graph is that if the values of the attributes vary, the patient has the malignant class of breast cancer; otherwise, for constant values of the attributes, the patient has the benign class.
Table 6 below summarizes the results of the data mining processes clustering, classification and visualization using MS SQL Server, ODM and UDMTool.
Table 6. Summary of the output

Data Mining Process: Clustering
MS SQL Server:
1. Uses the MS Clustering algorithm and creates 10 clusters by default (the number of clusters is not optimized).
2. Provides further characterization of each cluster, such as population per cluster and probability of each input variable.
3. Provides the bindings (weak or strong) among clusters.
Remark: The cluster population and probability alone are not sufficient to extract knowledge.
ODM:
1. Uses the K-means Clustering algorithm and creates 10 clusters by default (the number of clusters is not optimized) in a hierarchical structure.
2. Provides further characterization of each cluster, such as centroids and cluster rules.
Remark: Cluster rules are basically an output of the classification data mining process. In this way ODM unifies the clustering and classification data mining processes, which is a step forward towards knowledge extraction.
UDMTool:
1. Uses the K-means Clustering algorithm and creates 2 clusters according to the target values of the input variable of the given dataset, because the number of clusters is optimized.
2. Provides further characterization of each cluster, such as population per cluster.

Data Mining Process: Classification
MS SQL Server:
1. Uses the MS Decision Tree algorithm and creates a horizontal tree of the whole dataset. The tree has 8 nodes in total.
2. The rules of the dataset can be created by another algorithm, MS Association. The list of rules is very long and sometimes misleading, and it confuses the user in selecting the important and best rules. The user therefore has to apply two data mining algorithms to obtain the decision rules.
Remark: The nodes of the tree do not reflect the knowledge.
ODM:
1. Uses the Decision Tree algorithm and creates a hierarchical tree of the whole dataset. The tree has 7 nodes in total.
2. Provides further characterization of each node by surrogates, decision rules and percentage of target value in each node.
3. The decision rules are in (if-then-else) form, which can be deployed in a simple query. Sometimes it appears that there is no difference between the clustering and classification models in ODM; the only difference is in the characterization. The decision rules vary from cluster to cluster.
Remark: There is still confusion in the selection of the results of these two data mining processes in ODM.
UDMTool:
1. Uses the output(s) of the clustering process as input, applies the C4.5 (Decision Tree) algorithm and classifies each cluster by providing the decision rules as output.
2. The number of decision rules varies from cluster to cluster. The list of decision rules is not as long as in MS SQL Server. This is also referred to as the characterization of classified clusters.
3. The output of this process is in (if-then-else) form, as in ODM, which can be deployed in a simple query.

Data Mining Process: Visualization
MS SQL Server:
No such model/algorithm is provided, although MS SQL Server provides a GUI in each process of data mining. The user can save the results and use MS Excel as a visualization tool.
Remark: The data mining processes are not unified; rather, each process is carried out individually, therefore it is difficult to extract the knowledge.
ODM:
No such model/algorithm is provided, although ODM provides a GUI in each process of data mining. The user can save the results and use MS Excel as a visualization tool.
Remark: The data mining processes clustering and classification are unified, which is a step forward in knowledge extraction.
UDMTool:
Provides 2D graphs of each classified cluster, which help the user to visualize and then interpret the results and finally extract the knowledge.
Remark: The data mining processes are unified, which makes it easier for the user to extract the knowledge.

Conclusion
MS SQL Server: A single-step data mining tool where the selection of algorithms is ad-hoc and it is difficult to extract knowledge.
ODM: A single-step (up to some extent a multi-step) data mining tool where the selection of algorithms is ad-hoc and knowledge extraction is easier than in MS SQL Server.
UDMTool: A multi-step data mining tool where the selection of algorithms is automatic (based on the value of the dataset) and knowledge extraction is very simple.
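Table 6 notes that decision rules in (if-then-else) form can be deployed in a simple query. A minimal sketch of what that could look like is given here, using Python's built-in `sqlite3` module. The table name `patients`, the column names, the rows and the thresholds in the rule are all hypothetical placeholders; they are not rules produced by MS SQL Server, ODM or UDMTool.

```python
import sqlite3

# In-memory table with invented rows; column names abbreviate the
# attributes named in the paper (UCSh, MAdh). All values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, ucsh INTEGER, madh INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 8, 7), (3, 2, 1), (4, 10, 9)],
)

# Hypothetical rule: IF ucsh > 3 AND madh > 3 THEN malignant ELSE benign.
# The thresholds are placeholders chosen only to illustrate the query form.
query = """
    SELECT id,
           CASE WHEN ucsh > 3 AND madh > 3 THEN 'malignant' ELSE 'benign' END
    FROM patients
    ORDER BY id
"""
labels = dict(conn.execute(query).fetchall())
print(labels)
conn.close()
```

Expressing the rule as a `CASE` expression keeps the validation inside the database, so the user can compare the rule's labels with the stored class attribute without exporting the data.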
We tested Breastcancer, a medical dataset, on three different data mining tools: MS SQL Server, ODM and UDMTool. The results obtained differ, although we used the same data mining algorithm in each process of data mining. Firstly, in MS SQL Server and ODM some of the inputs of the data mining algorithms are not optimized, whereas UDMTool uses optimized algorithms. Secondly, in MS SQL Server the data mining processes clustering, classification and visualization are carried out individually and there is no relation between them; therefore, it is difficult to extract the knowledge. In ODM, clustering and classification are unified, which helps the user to extract the knowledge. In UDMTool the data mining processes are unified: the output of clustering is the input of classification, and the output of classification is the input of visualization, which provides the user with knowledge.
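The chaining described above, in which each process consumes the output of the previous one, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation of UDMTool: the points are invented, the clustering is a plain k-means with caller-supplied start centroids, and a simple per-attribute range rule stands in for the C4.5 step. The function names `kmeans`, `range_rules` and `scatter_series` are hypothetical.

```python
from math import dist


def kmeans(points, start, iters=10):
    """Step 1 (clustering): plain k-means from the given start centroids."""
    centroids = list(start)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups


def range_rules(group, names):
    """Step 2 (stand-in for C4.5): one range rule per attribute of a cluster."""
    rules = []
    for name, column in zip(names, zip(*group)):
        rules.append(f"{min(column)} <= {name} <= {max(column)}")
    return rules


def scatter_series(group):
    """Step 3 (visualization input): x and y series for a 2D scatter graph."""
    xs, ys = zip(*group)
    return list(xs), list(ys)


# Invented 2D points; NOT taken from the Breastcancer dataset.
points = [(1, 1), (1, 2), (2, 1), (8, 9), (9, 8), (10, 10)]
clusters = kmeans(points, start=[points[0], points[-1]])
report = [(range_rules(g, ("UCSh", "MAdh")), scatter_series(g)) for g in clusters]
for rules, (xs, ys) in report:
    print(" AND ".join(rules), "| x:", xs, "| y:", ys)
```

The point of the sketch is the data flow: `range_rules` and `scatter_series` never see the whole dataset, only the clusters returned by the previous step.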
5. Conclusion
The conclusion is that in MS SQL Server the selection of the data mining algorithms, which are also called the MS data algorithms, is easy, but the choice of the algorithm depends on the user and not on the data. The user has to select a different algorithm at each step of the data mining processes to obtain the knowledge, which is the primary goal of data mining. In a single-step data mining tool like MS SQL Server, if one algorithm does not provide the required results, the user has to choose another one. In ODM the processes of clustering and classification are unified, i.e. if the user applies the clustering algorithm, it automatically produces the rules of each cluster. Similarly, if the user chooses the classification algorithm, it first produces the clusters and then the rules of each cluster. This is to some extent a step towards a multi-step knowledge extraction process, but again the choice of the algorithm depends on the user and not on the data. ODM provides a workflow facility, which is helpful for the user. The results above show that it is still difficult for the user to extract knowledge from ODM, although the tool provides a lot of statistical information about the given dataset. No single algorithm can produce the knowledge, so knowledge cannot be obtained from single-step data mining tools like MS SQL Server and ODM.
Another issue in the single-step tools is that the data mining algorithm selected for a particular task takes the whole dataset and produces the results. The produced results are not the inputs of other data mining tasks; therefore, it is difficult to extract knowledge from the given dataset. This is because in single-step tools each data mining task is carried out individually, instead of the data mining tasks being unified. Knowledge extraction is possible only if the output of one data mining task is the input of the next task, i.e. the output of the clustering task must be the input of the classification process, which is not possible in the single-step tools. One possible solution is for the user to save the results of the first step in a separate dataset and then apply the newly created dataset as input to the next step, which we believe is a lengthy process, because saving the results and preparing a dataset is not very simple. The results of both single-step data mining tools, MS SQL Server and ODM, indicate that no single algorithm can produce the knowledge, because knowledge extraction is a multi-step process, and our proposed UDMTool is the preferred choice.
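The manual workaround for single-step tools, saving the results of the first step as a separate dataset and loading it again as input to the next step, can be sketched as follows. This is only an illustration of the extra hand-over work involved: the rows are invented, an in-memory buffer stands in for a file on disk, and the field names are hypothetical.

```python
import csv
import io

# Step 1 output: hypothetical cluster assignments from a clustering run.
step1_rows = [
    {"id": 1, "ucsh": 1, "cluster": 0},
    {"id": 2, "ucsh": 8, "cluster": 1},
    {"id": 3, "ucsh": 2, "cluster": 0},
]

# Save the results as a separate dataset (in-memory file instead of disk).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "ucsh", "cluster"])
writer.writeheader()
writer.writerows(step1_rows)

# Reload the saved dataset and split it per cluster for the next step.
# CSV stores text, so the numeric fields must be converted back by hand.
buffer.seek(0)
per_cluster = {}
for row in csv.DictReader(buffer):
    per_cluster.setdefault(int(row["cluster"]), []).append(int(row["ucsh"]))
print(per_cluster)
```

Even in this small case the user has to decide on a file layout, restore the data types and regroup the rows, which is the preparation effort a multi-step tool avoids by passing the output of one process directly to the next.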
Acknowledgement
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC project 6467/F II.


2012 Journal of Computing Press, NY, USA, ISSN 2151-9617
