
Ease of doing business over the world

using Multivariate Statistics

Submitted by
Piyush R Zalkey (17021141073)
Vaibhav Bala Krishna (17021141111)
Introduction
According to several studies conducted across the world, doing business across the
world depends on 5 major parameters namely:
a) Starting a business
b) Getting a location
c) Accessing finance
d) Dealing with day to day operations
e) Operating in a secure business environment

Several efforts are being made towards creating a secure business environment.
www.Doingbusiness.org is one such effort, where one can find ample amounts of
information on all these efforts and the current scenario. We have used one such
dataset to apply multivariate statistics and gather insights.
Objectives
1) To determine various factors affecting the ease of doing business in various countries.
2) Clustering various nations based on the ease of doing business and other indices.
3) Listing out the key insights

Methodology
Method of data collection:
The data was collected through a report generated using the microdata and indicators
interface of https://data.worldbank.org/indicator. Initially a total of 190 countries and 15
variables were selected. However, due to the presence of missing values, about 50
countries and 6 variables were eliminated.

Sample size:
The data set finally consisted of 140 countries (as observations) and 9 variables.

Software details:
The project was completed with the help of IBM SPSS Statistics 24, which is a leading
statistical software used to solve business and research problems by means of ad-hoc
analysis, hypothesis testing and predictive analytics.

Statistical techniques:
In Advanced statistics we deal with Multivariate analysis which involves observation and
analysis of more than one statistical outcome variable at a time. In this particular project
we have used Factor and Cluster analysis techniques to achieve our objectives.
Descriptive Statistics:

Variable                                                            N  Minimum  Maximum      Mean  Std. Deviation    Variance
Building quality control index (0-15)                             140     1.00    15.00    9.5186         2.99110       8.947
Ease of doing business index (1=easiest to 185=most difficult)    140        1      188    101.27          50.835    2584.170
Extent of disclosure index (0 to 10)                              140     1.00    10.00    5.8243         2.36527       5.595
Procedures required to start a business (number)                  140     1.00    20.00    7.1000         3.18989      10.175
Profit tax (%)                                                    140      .10    53.00   18.4857         8.48275      71.957
Time required to start a business (days)                          140      .50   230.00   20.7607        25.78477     664.854
Trade: Cost to import (US$ per container)                         140    50.00  3039.00  537.6236       400.95370  160763.869
Trade: Time to export (days)                                      140     1.50   515.00   65.5493        60.31811    3638.274
Trade: Time to import (days)                                      140     1.50   588.00   87.5371        85.23761    7265.450

While most of the variables are self-explanatory, three need a short definition: a) the
extent of disclosure index measures the extent to which investors are protected through
disclosure of ownership and financial information, b) the building quality control index
relates to infrastructure quality and c) Profit tax (%) is the total tax rate (as a % of
commercial profits). The Ease of doing business index is a rank that the World Bank
assigns to countries based on its yearly studies across the world.
Cluster Analysis

Cluster analysis is an exploratory analysis that tries to identify structures within the
data. Cluster analysis is also called segmentation analysis or taxonomy analysis. More
specifically, it tries to identify homogenous groups of cases if the grouping is not
previously known. Because it is exploratory, it does not make any distinction between
dependent and independent variables.

In SPSS, cluster analyses can be found under Analyze/Classify…. SPSS offers three
methods for cluster analysis: K-Means Cluster, Hierarchical Cluster, and Two-Step
Cluster.

a) K-means cluster is a method to quickly cluster large data sets. The researcher
defines the number of clusters in advance. This is useful to test different models
with a different assumed number of clusters.

b) Hierarchical cluster is the most common method. It generates a series of models
with cluster solutions from 1 (all cases in one cluster) to n (each case is an
individual cluster). Hierarchical cluster also works with variables as opposed to
cases; it can cluster variables together in a manner somewhat similar to factor
analysis. In addition, hierarchical cluster analysis can handle nominal, ordinal, and
scale data; however, it is not recommended to mix different levels of measurement.

c) Two-step cluster analysis identifies groupings by running pre-clustering first and
then by running hierarchical methods. Because it uses a quick cluster algorithm
upfront, it can handle large data sets that would take a long time to compute with
hierarchical cluster methods. In this respect, it is a combination of the previous
two approaches. Two-step clustering can handle scale and ordinal data in the
same model, and it automatically selects the number of clusters.

The hierarchical cluster analysis follows three basic steps: 1) calculate the distances, 2)
link the clusters, and 3) choose a solution by selecting the right number of clusters.
First, we have to select the variables upon which we base our clusters. In the dialog
window we add the chosen variables to the list of variables. Since we want to cluster
cases, we leave the rest of the tick marks on the default.

For interval data, the most common measure is the Squared Euclidean Distance. It is based
on the Euclidean Distance between two observations, which is the square root of the sum
of squared differences between their values. Since the Euclidean Distance is squared, it
increases the importance of large distances, while weakening the importance of small
distances.
If we have count data we can select between Chi-Square or a standardized
Chi-Square called Phi-Square. For binary data, the Squared Euclidean Distance is
commonly used.
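
To make the effect of squaring concrete, here is a minimal Python sketch of both distance measures. The function names and the three two-variable profiles are hypothetical illustrations; they are not taken from the data set or from SPSS.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared differences between two observations."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(a, b))


# Hypothetical profiles on two variables (illustrative values only).
reference = [0.0, 0.0]
far_case = [3.0, 4.0]
near_case = [0.3, 0.4]

# Ratio of the far distance to the near distance under each measure.
ratio_plain = euclidean(reference, far_case) / euclidean(reference, near_case)
ratio_squared = squared_euclidean(reference, far_case) / squared_euclidean(reference, near_case)
print(round(ratio_plain, 2), round(ratio_squared, 2))
```

The far case is about 10 times as distant as the near case under the plain Euclidean distance, but about 100 times as distant once the distance is squared, which is why large distances dominate the clustering.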

Next, we have to choose the Cluster Method. Typically, choices are between-groups
linkage (distance between clusters is the average distance between all pairs of data
points drawn from the two clusters), nearest neighbor (single linkage: distance between
clusters is the smallest distance between two data points), furthest neighbor (complete
linkage: distance is the largest distance between two data points), and Ward’s method
(clusters are merged so that the increase in within-cluster variance is as small as
possible). Single linkage works best with long chains of clusters, while complete linkage
works best with dense blobs of clusters. Between-groups linkage works with both cluster
types. It is recommended to use single linkage first. Although single linkage tends to
create chains of clusters, it helps in identifying outliers. After excluding these outliers,
we can move on to Ward’s method. Ward’s method uses a variance-based criterion (similar
to the F value in ANOVA) to maximize the significance of differences between clusters.
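
The three pairwise linkage rules and the merge loop can be sketched in a few lines of plain Python. This is an illustrative toy, not the SPSS implementation: the function names, the one-variable data and the choice of squared Euclidean distance are assumptions, and Ward's method is left out for brevity.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linkage_distance(c1, c2, method):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairwise = [sq_dist(p, q) for p in c1 for q in c2]
    if method == "single":    # nearest neighbor
        return min(pairwise)
    if method == "complete":  # furthest neighbor
        return max(pairwise)
    if method == "average":   # between-groups linkage
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, n_clusters, method="average"):
    """Start with one cluster per case; merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical one-variable data: two tight groups and one far-away case.
toy = [(1.0,), (1.5,), (2.0,), (10.0,), (10.5,), (40.0,)]
result = agglomerate(toy, n_clusters=3, method="single")
print(result)
```

With single linkage the far-away case ends up in a cluster of its own, which illustrates how this rule helps to flag outliers before a second pass with another method.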

A last consideration is standardization. If the variables have different scales and means
we might want to standardize either to Z scores or by centering the scale. We can also
transform the values to absolute values if we have a data set where this might be
appropriate.
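
As an illustration of why standardization matters for this data set, the sketch below computes Z scores in Python. The helper names are hypothetical; the summary figures are the ones reported in the Descriptive Statistics table, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize raw values to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]


def z_from_summary(value, mean_value, std_dev):
    """Z score of a single value given a reported mean and standard deviation."""
    return (value - mean_value) / std_dev


# Maximum of "Time required to start a business (days)" against its mean and
# standard deviation from the Descriptive Statistics table.
z_slowest_start = z_from_summary(230.00, 20.7607, 25.78477)

# Maximum of "Trade: Cost to import (US$ per container)" in the same way.
z_costliest_import = z_from_summary(3039.00, 537.6236, 400.95370)

print(round(z_slowest_start, 2), round(z_costliest_import, 2))
```

In raw units the import cost (thousands of US$) would swamp the start-up time (days) in any distance calculation; after standardization both variables are expressed in standard deviations and contribute on a comparable scale.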

Cluster Analysis: Two Step Cluster

Using the Direct Marketing functionality with 4 variables, we were able to obtain 3
different clusters. The cluster quality is above 0.5, which is a good sign.
The variables used were the ease of doing business index, trade: time to import, trade:
time to export and time required to start a business. The ease of doing business index is
the most important predictor variable.
Clusters Description:

We obtained 3 distinct clusters using the two-step technique. Their properties are as
follows:
a) Cluster 1 consists of countries with a tough environment for businesses, since the
cluster mean (Cluster 1) is close to the 75th percentile. It takes much more time to
export as well as import goods from these countries. It takes about 19 days to start a
business in these countries. This group mainly consists of developing countries
including India, Nepal, Sri Lanka, Brazil etc., where conditions have improved a lot
over the years and are still improving.
b) Cluster 2 consists of only outliers (numbering only 2). Doing business in these
countries is the toughest. It takes almost 400 days to export as well as import
goods from these countries. Even starting a business takes about 118 days in
these countries. South Sudan and the Democratic Republic of the Congo are examples
where a lot of political instability has strongly impacted trade.
c) Cluster 3 consists of countries where doing business is the easiest, since the
cluster mean (Cluster 3) is lower than the 25th percentile. It takes the least time to
export as well as import goods from these countries (mostly developed countries). It
takes about 9 days on average to start a business in these countries. Countries
like New Zealand, Australia, United States, China and Canada make up this cluster.
The conditions in these countries are highly favorable for setting up new
businesses.

The two-step cluster gave us a better picture than the original model obtained through
hierarchical clustering (which was too skewed).
Factor Analysis

Factor analysis is a method of data reduction. It does this by seeking underlying
unobservable (latent) variables that are reflected in the observed variables (manifest
variables). There are many different methods that can be used to conduct a factor
analysis (such as principal axis factor, maximum likelihood, generalized least squares,
unweighted least squares). There are also many different types of rotations that can be
done after the initial extraction of factors, including orthogonal rotations, such as varimax
and equimax, which impose the restriction that the factors cannot be correlated, and
oblique rotations, such as promax, which allow the factors to be correlated with one
another. You also need to determine the number of factors that you want to
extract. Given the number of factor analytic techniques and options, it is not surprising
that different analysts could reach very different results analyzing the same data
set. However, all analysts are looking for simple structure. Simple structure is a pattern
of results such that each variable loads highly onto one and only one factor.

Factor analysis is a technique that requires a large sample size. Factor analysis is based
on the correlation matrix of the variables involved, and correlations usually need a large
sample size before they stabilize. Tabachnick and Fidell (2001, page 588) cite Comrey
and Lee’s (1992) advice regarding sample size: 50 cases is very poor, 100 is poor, 200
is fair, 300 is good, 500 is very good, and 1000 or more is excellent. As a rule of thumb,
a bare minimum of 10 observations per variable is necessary to avoid computational
difficulties.

If the factor analysis is being conducted on the correlations (as opposed to the
covariances), it is not much of a concern that the variables have very different means
and/or standard deviations (which is often the case when variables are measured on
different scales).

a. Mean – These are the means of the variables used in the factor analysis.

b. Std. Deviation – These are the standard deviations of the variables used in the factor
analysis.

c. Analysis N – This is the number of cases used in the factor analysis.

Determinant = .024
All we want to see in the Correlations table is that the determinant is not 0. If the determinant
is 0, then there will be computational problems with the factor analysis, and SPSS may
issue a warning message or be unable to complete the factor analysis.

KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy                 .704
Bartlett's Test of Sphericity      Approx. Chi-Square        503.079
                                   df                             28
                                   Sig.                         .000

a. Kaiser-Meyer-Olkin Measure of Sampling Adequacy – This measure varies
between 0 and 1, and values closer to 1 are better. A value of .6 is a suggested minimum.

b. Bartlett’s Test of Sphericity – This tests the null hypothesis that the correlation
matrix is an identity matrix. An identity matrix is a matrix in which all of the diagonal
elements are 1 and all off-diagonal elements are 0. You want to reject this null
hypothesis.
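
The Bartlett figures can be cross-checked against the reported determinant. The sketch below uses the standard approximation chi-square = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom; the function name is hypothetical, and the inputs (n = 140 cases, p = 8 variables, determinant = .024) are taken from this report.

```python
import math


def bartlett_sphericity(n_cases, n_variables, determinant):
    """Approximate chi-square and degrees of freedom for Bartlett's test of sphericity."""
    if not 0 < determinant <= 1:
        raise ValueError("determinant of a correlation matrix must lie in (0, 1]")
    chi_square = -(n_cases - 1 - (2 * n_variables + 5) / 6) * math.log(determinant)
    df = n_variables * (n_variables - 1) // 2
    return chi_square, df


chi_square, df = bartlett_sphericity(n_cases=140, n_variables=8, determinant=0.024)
print(round(chi_square, 1), df)
```

The degrees of freedom come out as 28, matching the table, and the chi-square is close to the reported 503.079. The small gap is plausibly due to the determinant being rounded to three decimals in the output.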

a. Communalities – This is the proportion of each variable’s variance that can be
explained by the factors (e.g., the underlying latent continua). It is also denoted h² and
can be defined as the sum of squared factor loadings for the variable.

b. Initial – With principal axis factoring, the initial values on the diagonal of the
correlation matrix are determined by the squared multiple correlation of the variable with
the other variables. For example, if you regressed items 14 through 24 on item 13, the
squared multiple correlation coefficient would be .564. With principal component
extraction, which is the method used in this analysis, the initial communalities are all 1.

c. Extraction – The values in this column indicate the proportion of each variable’s
variance that can be explained by the retained factors. Variables with high values are
well represented in the common factor space, while variables with low values are not well
represented. (In this example, we don’t have any particularly low values.) They are the
reproduced variances from the factors that you have extracted. You can find these values
on the diagonal of the reproduced correlation matrix.

Total Variance Explained

           Initial Eigenvalues                    Extraction Sums of Squared Loadings    Rotation Sums of Squared Loadings
Component    Total  % of Variance  Cumulative %     Total  % of Variance  Cumulative %     Total  % of Variance  Cumulative %
1            3.467         43.343        43.343     3.467         43.343        43.343     2.662         33.280        33.280
2            1.629         20.366        63.709     1.629         20.366        63.709     1.874         23.419        56.699
3            1.111         13.883        77.591     1.111         13.883        77.591     1.671         20.893        77.591
4             .582          7.277        84.868
5             .443          5.538        90.407
6             .342          4.281        94.688
7             .246          3.073        97.761
8             .179          2.239       100.000

a. Factor – The initial number of factors is the same as the number of variables used in
the factor analysis. However, not all 8 factors will be retained. In this example, only the
first three factors will be retained (as we requested).

b. Initial Eigenvalues – Eigenvalues are the variances of the factors. Because we
conducted our factor analysis on the correlation matrix, the variables are standardized,
which means that each variable has a variance of 1, and the total variance is equal to the
number of variables used in the analysis, in this case, 8.

c. Total – This column contains the eigenvalues. The first factor will always account for
the most variance (and hence have the highest eigenvalue), and the next factor will
account for as much of the left over variance as it can, and so on. Hence, each
successive factor will account for less and less variance.

d. % of Variance – This column contains the percent of total variance accounted for by
each factor.
e. Cumulative % – This column contains the cumulative percentage of variance
accounted for by the current and all preceding factors. For example, the third row shows
a value of 77.591%. This means that the first three factors together account for 77.591
% of the total variance.

f. Extraction Sums of Squared Loadings – The number of rows in this panel of the
table corresponds to the number of factors retained. In this example three factors were
retained, so there are three rows, one for each retained factor. Because principal
component analysis was used for extraction, the values in this panel are identical to the
initial eigenvalues of the three retained components. With common-factor methods such
as principal axis factoring, the values would instead be based on the common variance
and would be lower than the values in the left panel, because the common variance is
always smaller than the total variance.

g. Rotation Sums of Squared Loadings – The values in this panel of the table
represent the distribution of the variance after the varimax rotation. Varimax rotation tries
to maximize the variance of the squared loadings within each factor, so the total amount
of variance accounted for (77.591%) stays the same but is redistributed over the three
extracted factors.
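
The percentage columns of the Total Variance Explained table follow directly from the eigenvalues. The short Python sketch below recomputes them from the reported values; the helper names are hypothetical, and small differences in the third decimal are expected because the printed eigenvalues are rounded.

```python
# Values copied from the Total Variance Explained table.
eigenvalues = [3.467, 1.629, 1.111, 0.582, 0.443, 0.342, 0.246, 0.179]
rotated = [2.662, 1.874, 1.671]

# Each standardized variable contributes a variance of 1, so the total is 8.
total_variance = len(eigenvalues)


def percent_of_variance(values, total):
    """Share of the total variance carried by each factor, in percent."""
    return [100 * v / total for v in values]


def cumulative(values):
    """Running total of a list of numbers."""
    running, out = 0.0, []
    for v in values:
        running += v
        out.append(running)
    return out


pct = percent_of_variance(eigenvalues, total_variance)
cum_pct = cumulative(pct)
rot_cum_pct = cumulative(percent_of_variance(rotated, total_variance))

print([round(p, 2) for p in pct[:3]])
print(round(cum_pct[2], 2), round(rot_cum_pct[2], 2))
```

The first three factors add up to about 77.59% both before and after rotation, which shows that varimax only redistributes the explained variance among the retained factors.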
The scree plot graphs the eigenvalue against the factor number. You can see these
values in the first two columns of the table immediately above. From the third factor on,
you can see that the line is almost flat, meaning that each successive factor is accounting
for smaller and smaller amounts of the total variance.

a. Factor Matrix – This table contains the unrotated factor loadings, which are the
correlations between the variable and the factor. Because these are correlations,
possible values range from -1 to +1.

Component Matrix (a)

                                                                       Component
Variable                                                               1       2       3
Building quality control index (0-15)                              -.472    .234    .742
Ease of doing business index (1=easiest to 185=most difficult)      .857   -.098   -.227
Extent of disclosure index (0 to 10)                               -.475    .594    .267
Procedures required to start a business (number)                    .613   -.515    .430
Time required to start a business (days)                            .528   -.505    .477
Trade: Cost to import (US$ per container)                           .719    .467    .014
Trade: Time to import (days)                                        .723    .527    .150
Trade: Time to export (days)                                        .770    .443    .049

Extraction Method: Principal Component Analysis.
a. 3 components extracted.

Using the principal component analysis method, 3 components/factors were extracted.
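
The link between loadings, communalities and eigenvalues can be verified from the Component Matrix itself. In the Python sketch below the loadings are copied from that table (variable labels shortened); the dictionary and variable names are illustrative only.

```python
# Unrotated loadings from the Component Matrix (components 1, 2, 3).
loadings = {
    "Building quality control index": (-0.472, 0.234, 0.742),
    "Ease of doing business index": (0.857, -0.098, -0.227),
    "Extent of disclosure index": (-0.475, 0.594, 0.267),
    "Procedures required to start a business": (0.613, -0.515, 0.430),
    "Time required to start a business": (0.528, -0.505, 0.477),
    "Trade: Cost to import": (0.719, 0.467, 0.014),
    "Trade: Time to import": (0.723, 0.527, 0.150),
    "Trade: Time to export": (0.770, 0.443, 0.049),
}

# Communality of a variable: sum of its squared loadings across the components (row-wise).
communalities = {name: sum(value ** 2 for value in row) for name, row in loadings.items()}

# Sum of squared loadings per component (column-wise): the component's eigenvalue.
sums_of_squares = [sum(row[k] ** 2 for row in loadings.values()) for k in range(3)]

for name, h2 in communalities.items():
    print(f"{name}: {h2:.3f}")
print([round(s, 3) for s in sums_of_squares])
```

Summing down each column reproduces the eigenvalues 3.467, 1.629 and 1.111 up to rounding, while summing across each row gives the share of that variable's variance captured by the three components.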

b. Rotated Factor Matrix – This table contains the rotated factor loadings (factor pattern
matrix), which represent both how the variables are weighted for each factor and the
correlation between the variables and the factor. Because these are correlations,
possible values range from -1 to +1. Suppressing small loadings can make the output
easier to read by removing the clutter of low correlations that are probably not
meaningful anyway.
For orthogonal rotations, such as varimax, the factor pattern and factor structure matrices
are the same.

Rotated Component Matrix (a)

                                                                       Component
Variable                                                               1       2       3
Building quality control index (0-15)                              -.136    .068    .897
Ease of doing business index (1=easiest to 185=most difficult)      .559    .341   -.605
Extent of disclosure index (0 to 10)                                .056   -.447    .668
Procedures required to start a business (number)                    .161    .882   -.143
Time required to start a business (days)                            .108    .863   -.064
Trade: Cost to import (US$ per container)                           .847    .059   -.121
Trade: Time to import (days)                                        .901    .105    .010
Trade: Time to export (days)                                        .872    .120   -.125

Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.

c. Factor – The columns under this heading are the rotated factors that have been
extracted. As the Component Matrix footnote provided by SPSS shows, three factors were
extracted. These are the factors that analysts are most interested in and try to name.

After varimax rotation, we get a clearer picture, where:

Factor 1: Trade (Cost to import), Time to import, Time to export
Factor 2: Procedures required to start a business, Time required to start a business
Factor 3: Building quality control index
Variables not under any single component: Extent of disclosure index, Ease of doing
business index
Factor 1 deals with operational efficiency or the logistics side (related to
transportation of goods from one place to another) as well as the cash cycles and the
respective costs.
Factor 2 is related to the process of starting a business.
Factor 3 could reflect the presence of good infrastructure and technology in the
various countries.
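
The grouping above can be reproduced mechanically from the Rotated Component Matrix. The Python sketch below assumes a salience cutoff of 0.40 on the absolute loading; that cutoff is a common rule of thumb chosen here for illustration, not a value stated in this report, and the function name is hypothetical.

```python
# Rotated loadings from the Rotated Component Matrix (components 1, 2, 3).
rotated_loadings = {
    "Building quality control index": (-0.136, 0.068, 0.897),
    "Ease of doing business index": (0.559, 0.341, -0.605),
    "Extent of disclosure index": (0.056, -0.447, 0.668),
    "Procedures required to start a business": (0.161, 0.882, -0.143),
    "Time required to start a business": (0.108, 0.863, -0.064),
    "Trade: Cost to import": (0.847, 0.059, -0.121),
    "Trade: Time to import": (0.901, 0.105, 0.010),
    "Trade: Time to export": (0.872, 0.120, -0.125),
}


def assign_factors(loadings, threshold=0.40):
    """Assign each variable to a factor only if exactly one loading is salient.

    threshold is an assumed cutoff; variables with zero or several salient
    loadings are returned separately as unassigned.
    """
    assigned, unassigned = {}, []
    for name, row in loadings.items():
        salient = [k + 1 for k, value in enumerate(row) if abs(value) >= threshold]
        if len(salient) == 1:
            assigned[name] = salient[0]
        else:
            unassigned.append(name)
    return assigned, unassigned


assigned, unassigned = assign_factors(rotated_loadings)
for name, factor in assigned.items():
    print(f"Factor {factor}: {name}")
print("Unassigned:", unassigned)
```

Under this assumed cutoff the three trade variables fall under Factor 1, the two start-up variables under Factor 2 and the building quality control index under Factor 3, while the ease of doing business index and the extent of disclosure index each load on two factors and stay unassigned.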

Key Insights

a) All countries have different business environments. Starting a business may only
take half a day in some countries while the process may still be pending after
months in others.
b) Since countries do not stand on an equal footing in financial, political or
technological strength, some countries might dominate others when it comes
to import/export norms.
c) The output from the factor analysis runs parallel to the measure set by
Doingbusiness.org which says there are 5 major factors used to measure the
ease of doing business.
d) The three clusters obtained highlight the fact that there is a huge disparity
between developed and developing countries.
