
RapidMiner process - getting started with Assignments 2 and 3 (fundRaising data)

Explore the data after reading it in through the Read Excel node. Look into the distribution of values in the different variables.

Would we like to transform some of the variables? What transformations? Maybe try a log transform of some variables (those with very skewed distributions)? Maybe try grouping values into ranges - these can be user-specified, based on domain knowledge or what seems like common sense.
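RapidMiner applies these transformations through its operators, but the idea is easy to sketch outside the tool. A minimal Python illustration of a log transform on a skewed donation-amount column (the values and column name here are toy stand-ins, not the actual fundRaising data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a right-skewed donation-amount column.
df = pd.DataFrame({"RAMNTALL": [10.0, 25.0, 50.0, 120.0, 5000.0]})

# log1p computes log(1 + x), which is safe for zero values and
# compresses the long right tail of a skewed distribution.
df["RAMNTALL_log"] = np.log1p(df["RAMNTALL"])

print(df["RAMNTALL_log"].round(3).tolist())
```

After the transform, the extreme value (5000) is pulled much closer to the rest, which is exactly why log transforms help with heavily skewed variables.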

The following nodes in the RapidMiner process show some examples of transformations. Check the distributions after the transformations - do the transformations help (and why)? Go into the nodes to make sure you understand how the data transformations are specified.

The top part shows how certain variable transformations are obtained. The Validation node is for splitting the data into training and validation sets and learning a model, as in the last assignment. The lower part shows how we may perform a Principal Components Analysis (PCA).

The Generate Attribute node shows how new attributes can be obtained by specifying functions on existing attributes. The dialog box for specifying new variables is opened by pressing the Edit List button. Try creating some new variables yourself.

The first Discretize node defines ranges for values of the RAMNTALL variable - i.e., it converts this variable from numeric to nominal, with multiple 'classes' or groups. These groups are specified as shown, by pressing the Edit List button.

You can specify any names for the classes - here, we have chosen the group names based on the value ranges they include. The second group is named '50-100' (you can name it '50 to 100' or '50 to Hundred Frogs' if you like). The upper limits for the ranges are specified in sequence.
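The same idea - a list of upper limits plus a name for each range - can be sketched in Python with `pandas.cut` (the edges and labels below are illustrative, not the assignment's exact ranges):

```python
import pandas as pd

# Toy values for a numeric variable; edges mirror the Discretize node's
# sequence of upper limits, labels are the user-chosen group names.
values = pd.Series([20, 75, 150, 600])
edges = [0, 50, 100, 500, float("inf")]
labels = ["0-50", "50-100", "100-500", "over-500"]

# Each value falls into the first range whose upper limit it does not exceed.
groups = pd.cut(values, bins=edges, labels=labels)
print(groups.tolist())
```

As in the Discretize node, the label text is arbitrary - only the edge values determine which group a row lands in.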

The same ranges can be specified for multiple variables in a single Discretize node - as in the case of the third Discretize node (shown below).

The attribute filter type is set to 'subset' to indicate that multiple attributes are to be selected in this node. Attributes can be selected into the right-side pane of the dialog (the dialog is opened by pressing Select Attributes).

Here, the AVGGIFT and LASTGIFT variables are selected into this node, so the discretization operation will be performed on these attributes. The value ranges for the different groups are specified in the same way as in the last node.
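Applying one set of ranges to several attributes at once, as this single Discretize node does, looks like this in a Python sketch (the column names follow the text; the edge values are illustrative):

```python
import pandas as pd

# Toy rows for the two attributes selected into the Discretize node.
df = pd.DataFrame({"AVGGIFT": [12.0, 80.0], "LASTGIFT": [45.0, 250.0]})
edges = [0, 50, 100, float("inf")]
labels = ["0-50", "50-100", "over-100"]

# One loop applies the same ranges to every selected column.
for col in ["AVGGIFT", "LASTGIFT"]:
    df[col + "_bin"] = pd.cut(df[col], bins=edges, labels=labels)

print(df[["AVGGIFT_bin", "LASTGIFT_bin"]])
```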

For the PCA part, a subset of attributes to be used for PCA is selected using the Select Attribute node. Attributes to be included are selected into the right-side panel.

Note - only the selected attributes are available at the 'exa' output port of the Select Attribute node. We next normalize these attributes using the Normalize node - its parameters can specify different methods for normalizing. We have chosen 'range transformation' with 0.0 and 1.0 as the min and max of the range.
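The 'range transformation' is just min-max scaling: each value is mapped to (x - min) / (max - min), so every column ends up between 0.0 and 1.0. A quick sketch with a toy column:

```python
import pandas as pd

# Toy column; the same formula is applied per column in practice.
df = pd.DataFrame({"AVGGIFT": [10.0, 30.0, 50.0]})

# Range transformation to [0, 1]: (x - min) / (max - min).
lo, hi = df["AVGGIFT"].min(), df["AVGGIFT"].max()
df["AVGGIFT_norm"] = (df["AVGGIFT"] - lo) / (hi - lo)

print(df["AVGGIFT_norm"].tolist())
```

Normalizing first matters for PCA, since otherwise attributes with larger numeric ranges would dominate the variance calculation.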

For the PCA node, we choose 'keep variance' and set 0.95 as the 'variance threshold'. This means we would like to retain as many of the new variables as needed to keep 95% of the information content, or variance, in the data.

The results of PCA can be seen at the 'pre' output port of the PCA node. It lists the principal components in descending order of the variance they capture, along with the cumulative variance - we see here that the first three principal components (PC1, PC2, PC3) capture 95% of the total information. So, instead of the 5 original variables, we can use just 3 principal components.

The eigenvectors are also shown:

and these can help calculate the values of the new variables (PC1, PC2, ...) from the values of the original variables. The 'exa' output port gives the new attributes (the principal components) for the data, i.e. the values of PC1, PC2, ... for each data row. We can then use these values in subsequent processing.
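The whole PCA step - eigenvectors, the 'keep variance' threshold, and computing the PC values for each row - can be sketched with NumPy. The data here is synthetic (five correlated toy columns standing in for the normalized attributes), so the number of retained components differs from the three in the actual process:

```python
import numpy as np

# Synthetic stand-in for the 5 normalized attributes fed into PCA:
# five columns built from two underlying factors, plus small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], 2 * base[:, 0], base[:, 1],
                     3 * base[:, 1], base[:, 0] + base[:, 1]])
X += 0.05 * rng.normal(size=X.shape)

# Eigendecomposition of the covariance matrix of centered data gives
# the eigenvectors (principal directions) and eigenvalues (variances).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]                 # descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 'Keep variance': retain enough components for 95% cumulative variance.
cumvar = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cumvar, 0.95) + 1)

# New variables PC1..PCk: project each row onto the kept eigenvectors.
scores = Xc @ eigvecs[:, :k]
print(k, scores.shape)
```

The projection `Xc @ eigvecs[:, :k]` is exactly the calculation the text describes: each row's PC values are linear combinations of the original variables, with the eigenvector entries as the weights.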