
Data analysis of a satisfaction poll

This part shows how to characterize global satisfaction and how to explore the interactions between variables.

The data is contained in a text file (CSV).

The file has a title line. The separator is a semicolon.

The import wizard automatically detects the file separator and the title line.

The first column is an identifier. Since this information is not useful for the analysis, the column is greyed out and left unused.

The file contains missing data. Each missing value is replaced by the mean of the values present in the same column.
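As a minimal sketch of this replacement rule outside BayesiaLab, the Python snippet below reads a semicolon-separated file with a title line and fills each empty cell with the mean of the values present in its column. The column names (ID, SE1, SK5), the sample values and the function name are illustrative assumptions, not taken from the tutorial dataset.

```python
import csv
import io
from statistics import mean

def load_and_impute(text, id_column="ID"):
    """Parse semicolon-separated text and mean-impute empty cells.

    The identifier column is left untouched because it is not analysed.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = list(reader)
    columns = [name for name in reader.fieldnames if name != id_column]
    for col in columns:
        # Mean of the values that are actually present in this column.
        present = [float(r[col]) for r in rows if r[col].strip() != ""]
        col_mean = mean(present)
        for r in rows:
            r[col] = float(r[col]) if r[col].strip() != "" else col_mean
    return rows, columns

# Hypothetical sample: one title line, then three responses with two gaps.
sample = "ID;SE1;SK5\n1;8;6\n2;;9\n3;4;\n"
rows, columns = load_and_impute(sample)
for r in rows:
    print(r)
```

Here the missing SE1 mark of response 2 becomes the mean of 8 and 4, and the missing SK5 mark of response 3 becomes the mean of 6 and 9.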

Information about the data is displayed here. The dataset gathers 711 poll responses.

The variables represent evaluation marks from 1 to 10. Manual discretization displays the cumulative distribution function of the selected continuous variable.

Discretizing continuous values

Generating an equal-distance discretization with three intervals leads to this graph.
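The equal-distance rule can be sketched in a few lines of Python. The snippet assumes the marks span the full range 1 to 10 and that each interval is closed on the right, which is consistent with the modality names <=4, <=7 and >7 that appear later in the tutorial. The function names are illustrative.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return the interior cut points of n_bins intervals of equal width."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def discretize(value, edges):
    """Map a mark to the label of the interval that contains it."""
    labels = [f"<={e:g}" for e in edges] + [f">{edges[-1]:g}"]
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

edges = equal_width_edges(1, 10, 3)   # cut points at 4 and 7
for mark in (1, 4, 5, 7, 8, 10):
    print(mark, discretize(mark, edges))
```

With a range of 1 to 10 and three intervals, each interval is 3 units wide, so the cut points fall at 4 and 7.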

Since the discretization is adequate, it can be applied to all variables.

This transfers the discretization mode to the other variables.

Press Ctrl + A to apply the discretization to all variables.

The Bayesian network is created with one node per column.

The * and % characters can be used in the search function to simplify the search.

To characterize global satisfaction, the first step is to use the search function to find the Satisfaction node.

Clicking on the line causes the node to blink.

This node is the target variable of the analysis. We are interested in the satisfaction value >7.

The augmented Markov blanket is used to characterize the target variable. It finds the minimal set of variables that characterize global satisfaction.
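For reference, the plain Markov blanket of a node in a directed graph consists of its parents, its children and the other parents of its children. The sketch below computes that set from a list of arcs. It does not reproduce BayesiaLab's augmented Markov blanket learning, which builds the structure from data, and the graph and node names used here are hypothetical.

```python
def markov_blanket(arcs, target):
    """Return parents, children and co-parents of target in a directed graph.

    arcs is a list of (parent, child) pairs.
    """
    parents = {p for p, c in arcs if c == target}
    children = {c for p, c in arcs if p == target}
    # Other parents of the target's children (sometimes called spouses).
    spouses = {p for p, c in arcs if c in children and p != target}
    return parents | children | spouses

# Hypothetical graph: D -> A -> T -> C <- B, and C -> E.
arcs = [("D", "A"), ("A", "T"), ("T", "C"), ("B", "C"), ("C", "E")]
blanket = markov_blanket(arcs, "T")
print(sorted(blanket))
```

In this example D and E are left out: once A, B and C are known, they bring no further information about T.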

Zoom in and out tools are available for better graph visualization.

The force-directed layout algorithm organizes the nodes on the workspace.

When switching to validation mode, note that the network selects only 15 of the 215 nodes as relevant.

To highlight important relationships between variables, use the arc force tool.

The thickness of an arc is proportional to its relevance with regard to the target variable. The SE1 variable is the most important for global satisfaction.
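The tutorial does not state which measure BayesiaLab uses for the relevance of an arc. As an illustration only, mutual information is one common way to quantify how strongly two discretized variables are related. The sketch below estimates it from paired observations. The data and the function name are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Estimate the mutual information (in bits) of two discrete samples."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical discretized marks for four responses.
satisfaction = [">7", ">7", "<=4", "<=4"]
strongly_linked = [">7", ">7", "<=4", "<=4"]   # moves with satisfaction
unrelated = [">7", "<=4", ">7", "<=4"]         # independent of satisfaction

print(mutual_information(satisfaction, strongly_linked))
print(mutual_information(satisfaction, unrelated))
```

A variable that moves together with the target gets a high score, while an independent one scores zero, which mirrors the idea of thick and thin arcs.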

Unconnected nodes become transparent.

BayesiaLab can generate reports.

The SE1 node is in first position: it is the most important variable of this analysis.

The probabilistic profile of poll responses with a global satisfaction mark >7 is also reported.

After closing the report, note that all correlations between variables can be monitored by right-clicking on the right side of the screen.

The monitors display the probability distributions and allow the values of the variables to be changed.

The target variable has a red background.

As the most important variable, SE1 appears in first position.

Monitors can be used to find the probabilistic profile of poll responses with a high satisfaction mark.

When clicking on this modality, the probabilities are propagated throughout the network. The probabilistic profile becomes readable.

The same technique can be applied to other modalities and variables. The results are automatically propagated to the remaining variables.
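The idea of a probabilistic profile can be sketched without a network: keep only the responses that show the chosen modality, then look at how another variable is distributed among them. In the snippet below, empirical frequencies stand in for the inference that BayesiaLab performs on the network. The records and the function name are hypothetical.

```python
from collections import Counter

def profile(records, target, target_value, variable):
    """Distribution of variable among records where target == target_value."""
    selected = [r[variable] for r in records if r[target] == target_value]
    if not selected:
        return {}
    counts = Counter(selected)
    return {value: count / len(selected) for value, count in counts.items()}

# Hypothetical discretized responses.
records = [
    {"Satisfaction": ">7", "SE1": ">7"},
    {"Satisfaction": ">7", "SE1": ">7"},
    {"Satisfaction": ">7", "SE1": "<=7"},
    {"Satisfaction": "<=4", "SE1": "<=4"},
]
high_profile = profile(records, "Satisfaction", ">7", "SE1")
print(high_profile)
```

Swapping the roles of the two variables gives the reverse reading, for example the distribution of Satisfaction among responses with a poor SE1 mark.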

The effect of a poor SE1 mark is reported on all monitors.

After target variable characterization, the second part of this tutorial explores the relationship between all variables of the poll.

In modeling mode, delete all arcs.

The SopLEQ algorithm is appropriate for discovering associations between variables.

After some computational time, SopLEQ learning finds a complex network.

With the positioning and zoom tools, the graph becomes more reader-friendly. In this case, where the graph is large but has average connectivity, symmetric positioning is adequate.

To increase the readability of the network, a comments dictionary can be linked to the graph. In this file, the name of each node is supplemented with a comment.

Clicking this button displays or hides the comments for the selected nodes.

Once done, a hint indicates that a node has a comment.

A modality dictionary can also be designed interactively. This is done by double-clicking a node and opening the modality names sheet.

Give a name to each modality.

Once the modality labels are validated, the dictionary can be exported as a text file.

The file is defined only for the SK5 node.

#Wed Oct 11 14:28:27 CEST 2006
SK5.<\=7=Average
SK5.<\=4=Poor
SK5.>7=Very good

By a simple modification, it becomes valid for all nodes of the graph.

#Wed Oct 11 14:28:27 CEST 2006
<\=7=Average
<\=4=Poor
>7=Very good
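The modification amounts to removing the node name and the dot that precede each modality. For a larger dictionary this can be scripted. The sketch below is one possible way to do it in Python, and the function name is an assumption.

```python
def generalize_dictionary(text, node_name):
    """Strip the 'node_name.' prefix so each entry applies to every node."""
    prefix = node_name + "."
    lines = []
    for line in text.splitlines():
        # The date comment and any other line are kept unchanged.
        if line.startswith(prefix):
            line = line[len(prefix):]
        lines.append(line)
    return "\n".join(lines)

# Dictionary exported for the SK5 node, as shown in the tutorial.
sk5_dictionary = "\n".join([
    "#Wed Oct 11 14:28:27 CEST 2006",
    r"SK5.<\=7=Average",
    r"SK5.<\=4=Poor",
    "SK5.>7=Very good",
])
generic_dictionary = generalize_dictionary(sk5_dictionary, "SK5")
print(generic_dictionary)
```

The backslash before the equals sign is kept as exported, since it is part of the modality name in the file.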

The dictionary can now be associated with all nodes of the graph.

The monitors in validation mode become easier to read.

The same process can be applied to assign values to modalities and generate a modality values dictionary. This is done in modeling mode, by double-clicking a node and opening the values sheet.

The Poor modality scores 0 points, Average 10 points and Very good 20 points.

The same process of exporting the dictionary, modifying the text file and importing it back can be used to assign values to the modalities of all nodes.

The total and average values of the graph modalities are calculated.

The values are also computed depending on the probability distribution.
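A value that depends on the probability distribution can be read as an expected value: each modality value is weighted by the probability of that modality. The sketch below uses the 0, 10 and 20 point values given above. The probability distribution is a made-up example, and the tutorial does not spell out BayesiaLab's exact computation.

```python
# Points assigned to each modality, as in the tutorial.
modality_values = {"Poor": 0, "Average": 10, "Very good": 20}

def expected_value(distribution, values):
    """Weight each modality value by the probability of that modality."""
    return sum(prob * values[modality] for modality, prob in distribution.items())

# Hypothetical probability distribution shown by a monitor.
distribution = {"Poor": 0.2, "Average": 0.5, "Very good": 0.3}
score = expected_value(distribution, modality_values)
print(score)
```

When evidence is set on another node and the distribution changes, the same formula gives the updated value.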

Every question is related to a theme. For instance, this poll has 36 themes. The class concept in BayesiaLab is useful for associating themes with nodes. The themes dictionary is contained in a text file.

Clicking the newly appeared icon at the bottom right of the window opens the class editor. Modifications can then be applied to classes instead of individual nodes.

Readability can be increased by applying automatic class colours. This is done by selecting all the classes with Ctrl + A and clicking the colour button.

Note that the nodes are largely grouped by colour. This provides useful information about links between and within themes. In this case, it also indicates a well-designed poll.

When closing the Edit classes window, the nodes become coloured depending on their class.

The comments are also coloured depending on the class.

A colours dictionary can also be saved as a text file.

In this example, the themes have been created based on expert knowledge. Nevertheless, BayesiaLab provides tools for automatic theme design by grouping semantically close variables.

In validation mode, variable clustering is based on the association rules discovered in the network.

Once the clustering is applied, new colours are assigned to the nodes.

Moving this slider sets the number of groups. The node colours change accordingly.

BayesiaLab identified 48 node groups.

This validates the current clustering.

This exits the clustering mode.

There are two other new icons in the clustering toolbar.

Validating prompts for confirmation.

BayesiaLab is able to build latent variables from the clustering just performed.

In modeling mode, multiple clustering clusters the individuals separately for each group of variables.

The data is saved in this directory.

This wizard configures the multiple clusterings (one per identified cluster).

Specify the number of classes for each new latent variable.

As with data clustering, an HTML report is created for each clustering. The reports are useful for renaming the new variables and their modalities.

Once the clusterings are complete, a new network is created with one node per latent variable (keeping the initial colour).

An internal database is created. It contains the most probable cluster value for each row of the initial file. This database can be saved to a separate file from the Data menu.

Probabilistic relationships between the nodes of this new network can be discovered with the SopLEQ algorithm. After computation and automatic node positioning, the resulting network presents 51 nodes representing the latent variables of the initial dataset.
