Group – 11
Visalakshi Arunachalam
Saumil Badia
Chirag Gala
Pratik Kapadia
Introduction:
• Confidence:
Confidence is the ratio of the number of transactions that include all items in the consequent as well as the
antecedent (namely, the support) to the number of transactions that include all items in the antecedent.
The support is simply the number of transactions that include all items in the antecedent and consequent parts of
the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.)
• Lift:
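Lift is a measure of the performance of a rule. It compares the confidence of the rule with the baseline frequency
of the outcome (head) among all transactions, so a lift above 1 means the antecedent (body) makes the head more likely.
Lift = Confidence / (Frequency of head)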
• Association Rule:
Association rule mining finds interesting associations and/or correlation relationships among large sets of data
items. Association rules show attribute-value conditions that occur frequently together in a given dataset. A
typical and widely used example of association rule mining is Market Basket Analysis.
For example, data are collected using bar-code scanners in supermarkets. Such ‘market basket’ databases consist
of a large number of transaction records.
Association rules provide information of this type in the form of "if-then" statements. These rules are computed
from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.
In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two
numbers that express the degree of uncertainty about the rule. In association analysis the antecedent and
consequent are sets of items (called itemsets) that are disjoint (do not have any items in common).
• K-Means
The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center
is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension
separately over all the points in the cluster.
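The assign-then-average loop described above can be sketched in a few lines of Python. This is a minimal illustration only, not the XLMiner implementation: the function name k_means, the random choice of initial centers and the toy points are all our own assumptions.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def k_means(points, k, iterations=100, seed=0):
    """Plain k-means: assign points to the nearest center, then recompute the means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the per-dimension mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the old center of an empty cluster
        if new_centroids == centroids:  # no center moved, so the assignment is stable
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups of two-dimensional points.
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(toy_points, k=2)
print(centers, labels)
```

On real data the attributes would normally be normalized first, since a variable with a large range (such as money) otherwise dominates the distance.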
Association Rules:
Q1.
Rules Generated: 11
Data
Input Data project1_data!$M$1:$Q$2187
Data Format Item List
Minimum Support 150
Minimum Confidence % 30
# Rules 11
Overall Time (secs) 3
If customers buy from landsend.com, they tend to buy from llbean.com as well; the confidence of this
rule is 41.54%.
Comparing two association rules that have reverse body and head
A.
Rule#  Confidence (%)  Body => Head                   Support (Body)  Support (Head)  Support (Body & Head)  Lift Ratio
1      41.54           landsend.com => llbean.com     479             538             199                    1.688051
2      36.99           llbean.com => landsend.com     538             479             199                    1.688051
Rule 1:
Rule 2:
Rule#  Confidence (%)  Body => Head                          Support (Body)  Support (Head)  Support (Body & Head)  Lift Ratio
7      36.46           oldnavy.com => victoriassecret.com    820             825             299                    0.96617
8      36.24           victoriassecret.com => oldnavy.com    825             820             299                    0.96617
Rule 1:
Rule 2:
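The confidence and lift figures in the two rule tables above can be checked from the support counts alone. The short sketch below does this; the total of 2186 transactions is an assumption on our part, taken from the "Overall" row of the cluster tables and from the input range of 2187 rows including a header, and the function name rule_metrics is only illustrative.

```python
N_TRANSACTIONS = 2186  # assumption: total number of transactions in the dataset

def rule_metrics(support_body, support_head, support_both, n=N_TRANSACTIONS):
    """Return (confidence, lift) for the rule body => head."""
    confidence = support_both / support_body   # share of body transactions that also contain the head
    lift = confidence / (support_head / n)     # confidence relative to the baseline frequency of the head
    return confidence, lift

# Support counts copied from the rule tables: (body, head, body & head).
rules = {
    "landsend.com => llbean.com": (479, 538, 199),
    "llbean.com => landsend.com": (538, 479, 199),
    "oldnavy.com => victoriassecret.com": (820, 825, 299),
    "victoriassecret.com => oldnavy.com": (825, 820, 299),
}

for name, (body, head, both) in rules.items():
    confidence, lift = rule_metrics(body, head, both)
    print(f"{name}: confidence = {confidence:.2%}, lift = {lift:.4f}")
```

Swapping body and head changes the confidence but not the lift, because lift is symmetric in the two itemsets. This is why each pair of reversed rules shares one lift ratio.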
Minimum Support         Minimum Confidence
(No. of Transactions)   (%)                  No. of Rules   Time (secs)
100                     30                   14             1
100                     40                   5              1
100                     50                   2              2
100                     60                   0              0
150                     30                   11             1
150                     40                   4              1
150                     50                   1              1
200                     30                   4              1
200                     40                   1              1
200                     50                   1              1
250                     30                   3              1
250                     40                   0              0
Chart displaying the trend in Association Rules for different values of Support &
Confidence Level
Conclusion:
The number of association rules generated depends on both the minimum confidence and the minimum support. The higher
these two thresholds are set, the fewer rules are generated.
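The effect of the two thresholds can be illustrated with a small filter. The sketch below uses only the four rules listed in the rule tables as candidates, so its counts are not the XLMiner rule counts from the table above; the function name count_rules is our own.

```python
# Candidate rules: (body, head, support of body, support of body & head).
# Only the four rules shown in the rule tables are included here.
candidate_rules = [
    ("landsend.com", "llbean.com", 479, 199),
    ("llbean.com", "landsend.com", 538, 199),
    ("oldnavy.com", "victoriassecret.com", 820, 299),
    ("victoriassecret.com", "oldnavy.com", 825, 299),
]

def count_rules(rules, min_support, min_confidence_pct):
    """Count the rules that meet both the support and the confidence threshold."""
    kept = 0
    for _body, _head, support_body, support_both in rules:
        confidence_pct = 100 * support_both / support_body
        if support_both >= min_support and confidence_pct >= min_confidence_pct:
            kept += 1
    return kept

for min_support in (150, 200, 300):
    for min_confidence in (30, 40):
        n = count_rules(candidate_rules, min_support, min_confidence)
        print(f"support >= {min_support}, confidence >= {min_confidence}%: {n} rule(s)")
```

Raising either threshold can only remove rules from the kept set, never add them, which matches the trend in the table.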
Q4.
Cluster centers
Cluster    #Obs  Average distance in cluster    Cluster    #Obs  Average distance in cluster
Cluster-1  506   1.68                           Cluster-1  506   355.4019014
Cluster-2  704   1.477                          Cluster-2  704   404.8165039
Cluster-3  563   1.347                          Cluster-3  563   251.4218487
Cluster-4  413   1.868                          Cluster-4  413   842.0241951
Overall    2186  1.565                          Overall    2186  436.4733185
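The "Overall" row of this table appears to be the observation-weighted mean of the per-cluster average distances. The sketch below checks that reading against the reported figures; it is our own verification, not part of the XLMiner output, and the variable names are illustrative.

```python
# (#Obs, average distance in first table, average distance in second table)
clusters = [
    (506, 1.68, 355.4019014),
    (704, 1.477, 404.8165039),
    (563, 1.347, 251.4218487),
    (413, 1.868, 842.0241951),
]

total_obs = sum(n for n, _, _ in clusters)

# Weight each cluster's average distance by its number of observations.
overall_first = sum(n * d for n, d, _ in clusters) / total_obs
overall_second = sum(n * d for n, _, d in clusters) / total_obs

print(total_obs, round(overall_first, 3), round(overall_second, 4))
```

The second column reproduces the reported overall value closely; the first differs only in the third decimal place because its inputs are already rounded to three decimals.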
Elapsed Time
       Cluster 3                               Cluster 4
       Region  HH Size  Age  Income  Money     Region  HH Size  Age  Income  Money
       3       2        11   3       599.08    4       5        6    7       360.49
       2       2        9    5       363.89    3       5        7    7       475.95
       2       2        5    5       133.48    4       4        5    7       1976.27
       3       2        11   5       697.97    3       4        4    5       135.91
       3       2        8    4       377.35    2       6        6    6       377.32
       3       2        9    7       629.69    4       6        5    5       328.7
       4       3        6    4       303.48    2       5        8    5       259.82
       2       2        3    5       184       3       4        3    6       158.89
       4       2        4    4       131.89    3       6        7    6       1704.46
       3       3        10   5       287.85    4       4        9    4       1840.72
       3       2        9    4       83.87     2       6        8    5       87.44
       4       3        5    7       318.92    3       4        9    7       413.18
AVG    3       2        8    5       342.62    3       5        6    6       676.595
The clusters generated are clearly distinguishable. They are based on the attributes household size, age and income.
Cluster 1 members are called random buyers: their income is low, yet they still end up buying more. Cluster 2 members are
called luxury seekers: they are older people with high income who tend to spend more. Cluster 3 members are called careful
buyers: they have a middle income and a smaller family, and they plan their purchases and buy at a low frequency. Cluster 4
members are called big buyers: they are middle-aged people with bigger families and high income, so they tend to purchase
a lot and at high frequency.
Q6.
CASE-I (K=2):
Cluster Center:
Cluster    region    hhsz      age       income    child     race      connection  country   money
Cluster-1  2.25067   2.131368  6.689008  3.825738  0.229221  1.0563    0.852547    0.135389  501.111093
Cluster-2  2.277778  3.79375   7.0125    5.193055  0.996528  1.025694  0.964583    0.132639  621.830856
Distance between cluster centers
           Cluster-1    Cluster-2
Cluster-1  0            120.7418813
Cluster-2  120.7418813  0
Cluster    #Obs  Average distance in cluster    Cluster    #Obs  Average distance in cluster
For K = 2 we have only two clusters. Although the number of observations in each cluster is identical, the clusters barely
differ in most of the parameters, so we cannot characterize them appropriately. Hence K=2 is not an
optimum value for k-means clustering.
CASE-II (K=4):
Cluster Center:
Cluster region hhsz age income child race connection country money
Cluster-1 2.378882 3.062112 7.329194 4.10559 0.664596 1.024845 -0.000001 0.254658 453.632868
Cluster    #Obs  Average distance in cluster    Cluster    #Obs  Average distance in cluster
Cluster-1  161   2.568                          Cluster-1  161   325.6053482
Cluster-2  771   1.935                          Cluster-2  771   446.4283469
Cluster-3  523   2.175                          Cluster-3  523   373.9719234
Cluster-4  731   2.241                          Cluster-4  731   504.745063
Overall    2186  2.141                          Overall    2186  439.695642
CASE-IV (K=10):
Cluster Center:
Cluster region hhsz age income child race connection country money
Distance between cluster centers
            Cluster-1    Cluster-2    Cluster-3    Cluster-4    Cluster-5    Cluster-6    Cluster-7    Cluster-8    Cluster-9    Cluster-10
Cluster-1   0            25.14516654  3.95034679   142.2057919  47.20289491  2.76989798   57.21798693  3999.75895   73.11560377  46.65710299
Cluster-2   25.14516654  0            25.0839916   117.1253916  22.17393203  26.68810511  32.29608262  3974.667992  48.04070516  21.66266898
Cluster-3   3.95034679   25.0839916   0            141.8814229  46.79870162  2.86576461   56.87900439  3999.46828   72.84564216  46.33796747
Cluster-4   142.2057919  117.1253916  141.8814229  0            95.13242957  143.6746677  85.09045918  3857.605144  69.12764774  95.59415137
Cluster-5   47.20289491  22.17393203  46.79870162  95.13242957  0            48.60987357  10.5412827   3952.705562  26.14610541  2.10163247
Cluster-6   2.76989798   26.68810511  2.86576461   143.6746677  48.60987357  0            58.67013952  4001.25815   74.60411911  48.10445001
Cluster-7   57.21798693  32.29608262  56.87900439  85.09045918  10.5412827   58.67013952  0            3942.661021  16.26717947  10.76098261
Cluster-8   3999.75895   3974.667992  3999.46828   3857.605144  3952.705562  4001.25815   3942.661021  0            3926.676645  3953.178371
Cluster-9   73.11560377  48.04070516  72.84564216  69.12764774  26.14610541  74.60411911  16.26717947  3926.676645  0            26.56454999
Cluster-10  46.65710299  21.66266898  46.33796747  95.59415137  2.10163247   48.10445001  10.76098261  3953.178371  26.56454999  0
Cluster     #Obs  Average distance in cluster    Cluster     #Obs  Average distance in cluster
Cluster-1   270   1.603                          Cluster-1   270   308.507607
Cluster-2   214   1.631                          Cluster-2   214   297.5014998
Cluster-3   283   1.344                          Cluster-3   283   277.938553
Cluster-4   446   1.379                          Cluster-4   446   393.3044052
Cluster-5   226   1.44                           Cluster-5   226   348.2440924
Cluster-6   157   2.512                          Cluster-6   157   322.7320293
Cluster-7   247   1.339                          Cluster-7   247   326.2300767
Cluster-8   44    3.033                          Cluster-8   44    1502.255369
Cluster-9   57    3.485                          Cluster-9   57    383.4960612
Cluster-10  242   2.163                          Cluster-10  242   367.8065808
Overall     2186  1.685                          Overall     2186  360.4535118
For K=10 there are too many clusters, and for most of them the inter-cluster difference is minimal. Several clusters
also contain too few observations (Cluster-8 has 44 and Cluster-9 has 57), so a manager cannot differentiate marketing
initiatives across the market segments identified with 10 clusters, as the clusters are too similar.
CASE-III (K=6):
Cluster Center:
Cluster    region    hhsz      age       income    child     race      connection  country   money
Cluster-1  2.382165  3.050956  7.350318  4.127388  0.662421  1         -0.000001   0.261147  448.730053
Cluster-2  2.253363  2.042601  6.672646  4.242153  0         1         1           0         530.739044
Cluster-3  3.306195  3.582301  6.761062  4.920354  0.99646   1         1           0         611.6884
Cluster-4  1.414226  3.711297  7.129707  5.059972  0.998605  1         1           0         631.830167
Cluster-5  2.271186  2.966102  6.661017  3.983051  0.661017  2.338984  0.932203    0.152542  733.070973
Cluster-6  2.330578  3.318182  6.747934  4.747934  0.760331  1         1           0.999999  496.809785
Distance between cluster centers
           Cluster-1    Cluster-2    Cluster-3    Cluster-4    Cluster-5    Cluster-6
Cluster-1  0            82.02735741  162.9684465  183.1095957  284.3465278  48.10445001
Cluster-2  82.02735741  0            80.97986148  101.1176476  202.3397826  33.98039549
Cluster-3  162.9684465  80.97986148  0            20.23468212  121.4001728  114.8877857
Cluster-4  183.1095957  101.1176476  20.23468212  0            101.2635411  135.0288767
Cluster-5  284.3465278  202.3397826  121.4001728  101.2635411  0            236.2680567
Cluster-6  48.10445001  33.98039549  114.8877857  135.0288767  236.2680567  0
Cluster    #Obs  Average distance in cluster    Cluster    #Obs  Average distance in cluster
Cluster-1  157   2.512                          Cluster-1  157   322.7320293
Cluster-2  446   1.787                          Cluster-2  446   376.2517197
Cluster-3  565   1.737                          Cluster-3  565   470.6028622
Cluster-4  717   1.77                           Cluster-4  717   480.9731203
Cluster-5  59    3.647                          Cluster-5  59    664.8660267
Cluster-6  242   2.163                          Cluster-6  242   367.8065808
Overall    2186  1.912                          Overall    2186  437.9971766
Of K=4 and K=6, K=4 is the better choice for clustering, because it yields more disparate clusters.
For example, with K=4 there is a significant difference between clusters based on education level, besides other variables.
This distinction was not as evident with K=6. Also, with four clusters the number of persons per cluster is sizeable, and
since the inter-cluster difference is higher across more input variables, K=4 should be the optimal value for k-means
clustering.
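One way to put a number on how disparate a set of clusters is would be the mean pairwise distance between cluster centers, together with the closest pair of centers. The sketch below computes both for the K=6 distance matrix reported above. It is only an illustration of the idea, not the method used in this report, and it cannot settle the K=4 versus K=6 comparison by itself because no distance matrix is listed for K=4. The function name center_separation is our own.

```python
from itertools import combinations

# Distance between cluster centers for K=6, copied from the table above.
k6_distances = [
    [0, 82.02735741, 162.9684465, 183.1095957, 284.3465278, 48.10445001],
    [82.02735741, 0, 80.97986148, 101.1176476, 202.3397826, 33.98039549],
    [162.9684465, 80.97986148, 0, 20.23468212, 121.4001728, 114.8877857],
    [183.1095957, 101.1176476, 20.23468212, 0, 101.2635411, 135.0288767],
    [284.3465278, 202.3397826, 121.4001728, 101.2635411, 0, 236.2680567],
    [48.10445001, 33.98039549, 114.8877857, 135.0288767, 236.2680567, 0],
]

def center_separation(matrix):
    """Return (mean pairwise distance, closest pair of clusters, their distance)."""
    # Use each unordered pair once; cluster numbers are 1-based as in the report.
    distances = {(i + 1, j + 1): matrix[i][j]
                 for i, j in combinations(range(len(matrix)), 2)}
    mean_distance = sum(distances.values()) / len(distances)
    closest_pair = min(distances, key=distances.get)
    return mean_distance, closest_pair, distances[closest_pair]

mean_k6, closest_k6, closest_distance_k6 = center_separation(k6_distances)
print(f"K=6: mean center distance = {mean_k6:.2f}, "
      f"closest pair = {closest_k6} at {closest_distance_k6:.2f}")
```

A small closest-pair distance flags two clusters that a manager would find hard to tell apart, which is the concern raised for the larger values of K.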
Conclusion:
To look at other possible customer segments, we formed clusters with different values of K.
The following is the result of our analysis based on different values of K:
Value of K    Average Distance b/w cluster centers    Average Distance b/w cluster members    Interpretability
Landsend.com Llbean.com
Cluster centers
Cluster    region    hhsz      age       income    child      connection  money
Cluster-1  1.833333  3         8         4.5       0.5        -0.000001   797.148331
Cluster-2  1.117647  3.517647  7.188235  5.541177  1          1           949.315413
Cluster-3  2.128205  1.974359  7.435898  5.282051  -0.000002  1           619.080895
Cluster-4  2.855073  3.84058   8.478263  5.985508  1          1           910.932895
Cluster    #Obs  Average distance in cluster    Cluster    #Obs  Average distance in cluster
Cluster-1  6     2.206                          Cluster-1  6     567.8205094
Cluster-2  86    1.57                           Cluster-2  86    740.5398826
Cluster-3  39    1.625                          Cluster-3  39    401.5736355
Cluster-4  68    1.75                           Cluster-4  68    521.1593559
Overall    199   1.661                          Overall    199   593.9374922
Elapsed Time
Cluster Interpretation:
Cluster_1
region hhsz age income child connection money
1 3 7 6 1 1 863.69
1 4 10 6 1 1 723.31
1 3 10 6 1 1 666.65
1 2 7 5 0 0 1065.07
3 2 8 5 1 0 491.12
2 4 7 7 0 0 635.91
1 2 9 3 0 0 2232.66
3 5 8 6 1 0 298.95
1 3 9 1 1 0 59.18
3 8 6 Both No Connection
Cluster_2
region hhsz age income child connection money
1 3 7 6 1 1 863.69
1 4 10 6 1 1 723.31
1 3 10 6 1 1 666.65
1 4 4 5 1 1 1103.43
1 3 10 6 1 1 1239.79
1 4 4 5 1 1 757.2
1 3 7 6 1 1 1485.98
1 4 8 6 1 1 319.34
1 4 7 6 1 1 259.9
1 3 5 5 1 1 331.22
1 3 7 5 1 1 188
1 3 2 5 1 1 1146.05
1 3 10 5 1 1 1644.13
1 3 6 6 1 1 140.09
1 3 5 5 1 1 201.49
1 4 8 7 1 1 1094.31
1 4 6 7 1 1 1024.22
1 4 6 7 1 1 1058.38
1 4 9 7 1 1 904.3
1 4 5 7 1 1 1043.41
1 4 5 7 1 1 1110.67
1 3 5 7 1 1 1030.66
2 4 9 5 1 1 921.19
1 3 10 6 1 1 137.76
1 3 7 5 1 1 1845.37
1 4 5 to 7 5 to 7 Have Children Have Connection 850
Cluster_3
Region hhsz age income child connection money
2 2 7 6 0 1 391.91
2 2 11 5 0 1 870.21
3 2 5 5 0 1 370.31
2 3 6 6 0 1 208.97
3 2 6 6 0 1 156
2 2 8 7 0 1 215.32
1 2 7 5 0 1 568.45
1 2 9 5 0 1 752.38
2 2 6 7 0 1 1182.52
1 2 6 6 0 1 458.29
1 2 6 6 0 1 436.83
3 1 6 6 0 1 730.34
1 2 11 5 0 1 987.32
2 1 4 7 0 1 485.91
3 2 10 7 0 1 854.27
3 2 10 7 0 1 373.19
3 2 11 7 0 1 591.92
1 2 7 5 0 1 1532.77
3 2 5 7 0 1 161.87
3 2 8 7 0 1 1270.35
3 2 5 7 0 1 62.45
1 2 6 7 0 1 249.43
1 2 9 7 0 1 144
1 2 11 7 0 1 343.32
3 2 9 3 0 1 208
1,2,3 2 7 to 11 7 no child Have Connection 500
Cluster_4
region hhsz age income child connection money
3 4 8 7 1 1 604.5
3 3 11 6 1 1 906.17
3 3 9 6 1 1 303
3 3 8 7 1 1 806.9
3 4 4 7 1 1 882.37
3 3 9 7 1 1 738.83
3 4 5 7 1 1 1363.64
3 4 6 5 1 1 304.83
3 3 10 7 1 1 766.99
3 3 6 7 1 1 831.93
3 3 6 7 1 1 611.97
2 4 9 5 1 1 921.19
2 4 9 7 1 1 842.53
3 3 9 7 1 1 1457.28
3 3 5 7 1 1 679.29
3 3 9 5 1 1 260.82
2 4 6 6 1 1 270.83
3 5 7 5 1 1 610.05
2 4 7 7 1 1 460.74
3 3 3 6 1 1 398
3 3 5 6 1 1 1729.28
3 5 9 7 1 1 1394.86
2 3 9 7 1 1 999.94
3 5 6 7 1 1 471.83
2 5 9 6 1 1 779.66
2,3 4 7 to 9 5 to 7 Have Children Have Connection 800
Conclusion
Compared with the clusters generated in Part 2, these clusters are more similar to one another and are defined by fewer
attributes. Because fewer attributes define them, we can narrow down the customer segments more easily than with the
clusters in Part 2. Hence these clusters help in concentrating on the specific segments of interest to the company chosen
in the association rule, and they pave the way for better business intelligence strategies. As can be seen from the cluster
interpretation, members of these clusters are middle-aged or older people with a good income level, and the
differentiating factors are household size, children and internet connection.
Q8.
Case i
Data
Input Data Cluster2!$N$1:$R$87
Data Format Item List
Minimum Support 15
Minimum Confidence % 30
# Rules 12
Overall Time (secs) 1
If customers buy from landsend.com, they tend to buy from llbean.com also, and the confidence of this rule is 100%.
Rule #  Conf. %  Antecedent (a)  Consequent (c)  Support(a)  Support(c)  Support(a U c)  Lift Ratio
Case ii
Data
Input Data Sheet1!$M$1:$R$87
Data Format Item List
Minimum Support 25
Minimum Confidence % 60
# Rules 2
Overall Time (secs) 1
If customers buy from landsend.com, they tend to buy from llbean.com also, and the confidence of this rule is 100%.
Rule #  Conf. %  Antecedent (a)  Consequent (c)  Support(a)  Support(c)  Support(a U c)  Lift Ratio
Case iii
Data
Input Data Sheet1!$M$1:$R$87
Data Format Item List
Minimum Support 20
Minimum Confidence % 35
# Rules 7
Overall Time (secs) 1
If customers buy from landsend.com, they tend to buy from llbean.com also, and the confidence of this rule is 100%.
Rule #  Conf. %  Antecedent (a)  Consequent (c)  Support(a)  Support(c)  Support(a U c)  Lift Ratio
The association rules generated have a 100% confidence level irrespective of the support level, with a constant lift
ratio of 1. This was not the case for the rules generated in Part 1. This implies that we are looking at rules that are
sensible and pertain to the company of our interest, which is Landsend.com. It appears to be losing its customers
to llbean.com. Also, these rules are based on one specific company, whereas in Part 1 the rules generated were
based on a huge dataset with no reference to a specific company; they just denoted the buying behavior of customers in
general.
Q9.
The shop landsend.com might be losing its customers to llbean.com. This is based on the XLMiner
report that we have generated. The good thing is that Landsend.com is also slowly gaining
customers from oldnavy.com. These are some of the BI recommendations for Landsend.com.