
Market Basket Analysis of Database Table References Using R:
An Application to Physical Database Design



Jeffrey Tyzzer (jefftyzzer AT sbcglobal DOT net)


Summary

Market basket analysis, together with statistical and informetric methods, is applied to a
population of database queries (SELECT statements) to better understand table usage and
co-occurrence patterns and to inform table placement on physical media.

Introduction

In 1999 Sally Jo Cunningham and Eibe Frank of the University of Waikato published a
paper titled "Market Basket Analysis of Library Circulation Data" [8]. In it the authors
apply market basket analysis (MBA) to library book circulation data, models of which are
a staple of informetrics, the application of statistical and mathematical methods in
Library and Information Sciences [10]. Their paper engendered the ideas that led to the
present one, which concerns the application of MBA and of statistical and informetric
analyses to a set of database queries, i.e., SELECT statements, to better understand table
usage and co-occurrence patterns.

Market Basket Analysis

Market basket analysis is a data mining technique that applies association rule analysis, a
method of uncovering connections among items in a data set, to supermarket purchases,
with the goal of finding items (i.e., groceries) having a high probability of appearing
together. For instance, a rule induced by MBA might be "in 85% of the baskets where
potato chips appeared, so did root beer." In the Cunningham and Frank paper, the baskets
were the library checkouts and the groceries were the books. In this paper, the baskets are
the queries and the groceries are the tables referenced in the queries.

MBA was introduced in the seminal paper "Mining Association Rules between Sets of
Items in Large Databases" by Agrawal et al. [1] and is used by retailers to guide store
layout (for example, placing products having a high probability of appearing in the same
purchase closer together to encourage greater sales) and promotions (e.g., buy one and
get the other half off). The output of MBA is a set of association rules and attendant
metadata in the form {LHS => RHS}, where LHS means left-hand side and RHS means
right-hand side. These rules are interpreted as "if LHS, then RHS," with the LHS
referred to as the antecedent and the RHS referred to as the consequent. For the potato
chip and root beer example, we'd have {Chips => Root beer}.

The Project

Two questions directed my investigation:

1. Among the tables, are there a "vital few" [9] that account for the bulk of the table
references in the queries? If so, which ones are they?
2. Which table pairings (co-occurrences) are most frequent within the queries?

The answers to these questions can be used to:

- Steer the placement of tables on physical media, both within a given medium and
among media with different performance characteristics, e.g., tiering storage between
disk and solid-state drives (SSD) [15]
- Justify denormalization decisions
- Inform the creation of materialized views, table clusters, and aggregates
- Guide partitioning strategies to achieve collocated joins and reduce inter-node
data shipping in distributed databases
- Identify missing indexes to support frequently joined tables (adding indexes to support
SELECTs may come at the expense of increased INSERT, UPDATE, and DELETE costs,
so the full complement of CRUD operations against a table must be considered; the
analysis discussed here is easily extended to encompass other DML statements as well)
- Direct the scope, depth, frequency, and priority of table and index statistics
gathering
- Contribute to an organization's overall corpus of operational intelligence

The data at the focus of this study are metadata for queries executed against a population
of 494 tables within an OLTP database. The queries were captured over a four-day
period. There is an ad hoc query capability within the environment, but such queries are
run against a separate data store, so the system under study was effectively closed with
respect to random, external queries.

I wrote a Perl program to iterate over the compressed system-generated query log files,
272 in all, cull the tables from each of the SELECT statements within them, and, for
those referencing at least one of the 494 tables, output detail and summary data,
respectively, to two files. Of the 553,139 total statements read, 373,372 met this criterion
(the remainder, a sizable number, were metadata-type statements, e.g., data dictionary
lookups and variable instantiations).

The summary file lists each table and the number of queries it appears in; its structure is
simply {table, count}. The detail file lists {query, table, hour} triples, which were then
imported into a simple table consisting of three corresponding columns: query identifies
the query in which the table is referenced, tabname is the name of the table, and hour
designates the hour the query snapshot was taken, in the range 0-23.
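
For readers who would rather stay in R, the same summary counts can be derived from the
detail file directly. The sketch below is mine, not the author's code; the file name and
comma delimiter are assumptions, and the Perl program remains the actual source of the files.

# A sketch: derive the {table, count} summary from the detail file
detail <- read.csv("query_table_detail.csv", header = FALSE,
                   col.names = c("query", "tabname", "hour"),
                   stringsAsFactors = FALSE)

# Count the number of distinct queries each table appears in
pairs <- unique(detail[, c("query", "tabname")])
tblCounts <- aggregate(query ~ tabname, data = pairs, FUN = length)
names(tblCounts)[2] <- "count"
tblCounts <- tblCounts[order(-tblCounts$count), ]
head(tblCounts)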


The Vital Few

The 80/20 principle describes a phenomenon in which 20% of a population or group can
explain 80% of an effect [9]. This principle is widely observed in economics, where it's
generally referred to as the Pareto principle, and in informetrics, where it's known as
Trueswell's 80/20 rule [22]. Trueswell argued that 80% of a library's circulation is
accounted for by 20% of its circulating books [9]. This behavior has also been observed
in computer science contexts, where it's been noted that 80 percent of the transactions on
a file are applied to the 20 percent most frequently used records within it [14; 15].
I wanted to see if this pattern of skewed access applies to RDBMSs as well, i.e., whether
20% of the tables in the database might account for 80% of all table references within
queries.

Figure 1 plots, in ascending (rank) order, the number of queries each of the 494 tables is
referenced in (space prohibits a table listing the frequencies of the 494 tables), showing a
characteristic reverse-J shape. Figure 2 presents this data as a Lorenz curve, which was
generated using the R package ineq. As [11] puts it, the straight line represents the
expected distribution if all tables were queried an equal number of times, with the
curved line indicating the observed distribution. As the figure shows, 20% of the tables
account for a little more than 85% of the table references. Clearly, a subset of tables, the
"vital few," account for the majority of table references in the queries. These tables get
the most attention query-wise and therefore deserve the most attention performance-wise.
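
The paper names the ineq package but not the call; below is a minimal sketch of how
figure 2 could be produced, assuming the per-table counts are in tblCounts$count (a
hypothetical object from the earlier sketch).

library(ineq)

# Lorenz curve of table-reference concentration; Lc() computes the cumulative
# shares and plot() draws them against the line of perfect equality.
lorenz <- Lc(tblCounts$count)
plot(lorenz, main = "Cumulative Table Query Percentages")
Gini(tblCounts$count)   # single-number summary of the skew (0 = even, near 1 = concentrated)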

Figure 1 - Plot of query-count-per-table frequency


Figure 2 - Lorenz curve illustrating the 80/20 rule for table references

The Market Basket Analysis

The 373,372 statements mentioned earlier are the table baskets from which the subset of
transactions against the 25 most-queried tables is derived (see Appendix B for a sample
size formula if a population of table baskets is not already available). As a first step
toward uncovering connections among the tables in the database using MBA, I used the R
package diagram to create a web plot, or annulus [21], of the co-reference relationships
among these 25 tables, shown in figure 3. Note that line thickness in the figure is
proportional to the frequency of co-occurrence. As can be seen, there is a high level of
interconnectedness among these tables. Looking at these connections as a graph, with the
tables as nodes and their co-occurrence in queries as edges, I computed the graph's
clustering coefficient, the number of actual connections between the nodes divided by the
total possible number of connections [3]. It turned out to be 0.75, an unsurprisingly high
value given what figure 3 illustrates.
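
The paper does not show how this figure was computed; one way to reproduce the
actual-to-possible-connections ratio is with arules' crossTable() and the igraph package,
as sketched below (the package choice and variable names are mine, and tblTrans is the
transactions object built later in the paper).

library(arules)
library(igraph)

# Pairwise co-occurrence counts among all items, restricted to the 25 most-queried tables
co    <- crossTable(tblTrans)
top25 <- names(sort(itemFrequency(tblTrans, type = "absolute"), decreasing = TRUE))[1:25]
adj   <- (co[top25, top25] > 0) * 1   # an edge wherever two tables co-occur at least once

g <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
edge_density(g)   # actual connections / possible connections; the text reports 0.75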



Figure 3 - Web plot of the co-references between the 25 most-queried tables: ADDRESS, BIRTH_INFORMATION, CASE, CASE_ACCOUNT, CASE_ACCOUNT_SUMMARY, CASE_ACCOUNT_SUMMARY_TRANSACTION, CASE_COURT_CASE, CASE_PARTICIPANT, CHARGING_INSTRUCTION, COMBINED_LOG_TEMP_ENTRY, COURT_CASE, EMPLOYER, EMPLOYER_ADDRESS, INTERNAL_USER, LEGAL_ACTIVITY, LOGICAL_COLLECTION_TRANSACTION, ORG_UNIT, PARTICIPANT, PARTICIPANT_ADDRESS, PARTICIPANT_EMPLOYER, PARTICIPANT_NAME, PARTICIPANT_PUBLIC_ASSISTANCE_CASE, PARTICIPANT_RELATIONSHIP, SOCIAL_SECURITY_NUMBER, SUPPORT_ORDER

To mine the tables' association rules, I used R's arules package. Before the table
basket data could be analyzed it had to be read into an R data structure, which is done
using the read.transactions() function. The result is a transactions object, an
incidence matrix of the transaction items. To see the structure of the matrix, type the
name of the variable at the R prompt:

> tblTrans
transactions in sparse format with
110071 transactions (rows) and
154 items (columns)

Keep in mind that the number of transactions (queries) and items (tables) shown here
differs from the respective numbers listed previously because I limited the analysis to just
those table baskets containing at least two of the 25 most-queried tables. Looking at the
output, that's 110,071 queries and 154 tables (the top 25 along with 129 others they
appear with).
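
The call that produced tblTrans is not shown in the paper; a minimal sketch follows,
assuming a comma-separated file of {query, table} pairs (the file name, column layout,
and the prior filtering down to baskets with at least two of the top-25 tables are all
assumptions).

library(arules)

# "single" format: each line holds one (transaction id, item) pair;
# column 1 is assumed to be the query id and column 2 the table name.
tblTrans <- read.transactions("query_table_pairs.csv",
                              format = "single",
                              sep = ",",
                              cols = c(1, 2))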

To see the contents of the tblTrans transaction object, the inspect() function is
used (note I limited inspect() to the first five transactions, as identified by the ASCII
ordering of the transactionID):


> inspect(tblTrans[1:5])
items transactionID
1 {PARTICIPANT,
PARTICIPANT_NAME} 1
2 {ADDRESS,
BIRTH_INFORMATION,
PARTICIPANT,
PARTICIPANT_ADDRESS,
PARTICIPANT_PHONE_NUMBER,
PARTICIPANT_PHYSICAL_ATTRIBUTES,
SOCIAL_SECURITY_NUMBER} 10
3 {CASE_COURT_CASE,
COURT_CASE,
LEGAL_ACTIVITY,
MEDICAL_TERMS,
SUPPORT_ORDER,
TERMS} 100
4 {CASE,
CASE_ACCOUNT,
CASE_ACCOUNT_SUMMARY,
CASE_COURT_CASE} 1000
5 {ADDRESS,
PARTICIPANT,
PARTICIPANT_ADDRESS} 10000

The summary() function provides additional descriptive statistics concerning the make-up
of the table transactions (output format edited slightly to fit):

> summary(tblTrans)
transactions as itemMatrix in sparse format with
110071 rows (elements/itemsets/transactions) and
154 columns (items) and a density of 0.02457888

most frequent items:
CASE CASE_PARTICIPANT PARTICIPANT COURT_CASE CASE_COURT_CASE (Other)
51216 38476 35549 21519 21421 248454

element (itemset/transaction) length distribution:
sizes
2 3 4 5 6 7 8 9 10 11 12 13 14 15 17
33573 29883 15367 12456 9879 3825 1899 603 1064 775 128 38 490 89 2

Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 3.785 5.000 17.000

includes extended item information - examples:
labels
1 ACCOUNT_HOLD_DETAIL
2 ADDRESS
3 ADJUSTMENT

includes extended transaction information - examples:
transactionID
1 1
2 10
3 100

There's a wealth of information in this output. Note, for instance, that the minimum item
(table) count is two and the maximum is seventeen, and that there are 33,573
transactions with two items and two transactions with seventeen. While I'm loath to
assign a limit to the maximum number of tables that should ever appear in a query, a
DBA would likely be keen to investigate the double-digit queries for potential tuning
opportunities.
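
Those double-digit baskets are easy to pull out for review with arules' size() accessor;
a sketch (variable names are mine):

# Table baskets referencing ten or more tables, candidates for tuning scrutiny
bigBaskets <- tblTrans[size(tblTrans) >= 10]
bigBaskets                 # prints how many such transactions there are
inspect(bigBaskets[1:3])   # examine a few of them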

Lastly, we can use the itemFrequencyPlot() function to generate an item
frequency distribution. Frequencies can be displayed as relative (percentages) or absolute
(counts). Note that for readability I limited the plot to the 25 most frequent items among
the query baskets by specifying a value for the topN parameter. The command is below,
and the plot is shown in figure 4.

> itemFrequencyPlot(tblTrans, type = "absolute", topN = 25, main
= "Frequency Distribution of Top 25 Tables", xlab = "Table Name",
ylab = "Frequency")

Figure 4 - Item frequency bar plot among the top 25 tables

With the query baskets loaded, it was then time to generate the table association rules.
The R function within arules that does this is apriori(). apriori() takes up to
four arguments but I only used two: a transactions object, tblTrans, and a list of two
parameters that specify the minimum values for two rule interestingness criteria,
generality and reliability [12; 16] (these measures are the attendant metadata mentioned
in the Market Basket Analysis section). Confidence is a measure of reliability, and
specifies how often the rule is true when the LHS is true, i.e.,

    confidence = countOfBasketsWithLHSandRHSItems / countOfBasketsWithLHSItems.

The other parameter is support, which corresponds to the generality criterion and
specifies the proportion of all baskets in which the rule is true, i.e.,

    support = countOfBasketsWithLHSandRHSItems / totalCount.

A third interestingness criterion, lift, is also useful for evaluating rules and figures
prominently in the output of the R extension package arulesViz (see figure 5).
Paraphrasing [18], lift is the ratio of the confidence of a rule to the expected confidence
that the second table will be queried given that the first table was:

    lift = Confidence(Rule) / Support(RHS),

with Support(RHS) calculated as

    Support(RHS) = countOfBasketsWithRHSItem / totalCount.

Lift indicates the strength of the association over its random co-occurrence. When lift is
greater than 1, the rule is better than guessing at predicting the consequent.

For confidence I specified .8 and for support I specified .05. The command I ran was

> tblRules <- apriori(tblTrans, parameter = list(supp= .05, conf
= .8))

which generated 71 rules. To get a high-level overview of the rules, you can call the
overloaded summary() function on the output of apriori():

> summary(tblRules)
set of 71 rules

rule length distribution (lhs + rhs):sizes
2 3 4 5
11 28 25 7

Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 3.000 3.394 4.000 5.000

summary of quality measures:
support confidence lift
Min. :0.05239 Min. :0.8037 Min. :1.728
1st Qu.:0.05678 1st Qu.:0.8585 1st Qu.:2.460
Median :0.06338 Median :0.9493 Median :4.533
Mean :0.07842 Mean :0.9231 Mean :4.068
3rd Qu.:0.08271 3rd Qu.:0.9870 3rd Qu.:5.066
Max. :0.28343 Max. :1.0000 Max. :7.204

mining info:
data ntransactions support confidence
tblTrans 110071 0.05 0.8

To see the rules, execute the inspect() function (note that I'm only showing the first and
last five, as sorted by confidence):

> inspect(sort(tblRules, by = "confidence"))
lhs rhs support confidence lift
1 {SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.11643394 1.0000000 5.847065
2 {CASE_COURT_CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.07112682 1.0000000 5.847065
3 {COURT_CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.08326444 1.0000000 5.847065
4 {CASE_PARTICIPANT,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.05753559 1.0000000 5.847065
5 {CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.06406774 1.0000000 5.847065
<snip>
67 {CASE_COURT_CASE,
COURT_CASE,
SUPPORT_ORDER} => {CASE} 0.05678153 0.8088521 1.738347
68 {CASE_COURT_CASE,
COURT_CASE,
LEGAL_ACTIVITY,
SUPPORT_ORDER} => {CASE} 0.05678153 0.8088521 1.738347
69 {CASE_COURT_CASE,
COURT_CASE} => {CASE} 0.13003425 0.8044175 1.728816
70 {CASE_COURT_CASE,
LEGAL_ACTIVITY} => {CASE} 0.07938512 0.8041598 1.728262
71 {CASE,
CASE_COURT_CASE} => {COURT_CASE} 0.13003425 0.8037399 4.111179


Let's look at the first and last rules and interpret them. The first rule says that over the
period during which the queries were collected, the SUPPORT_ORDER table appeared in
11.64% of the queries and that when it did it was accompanied by the
LEGAL_ACTIVITY table 100% of the time. The last rule, the 71st, says that during this
same period CASE and CASE_COURT_CASE appeared together in 13% of the queries and that
they were accompanied by COURT_CASE 80.37% of the time.
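
As a quick arithmetic check of the lift definition given earlier, rule 71's lift can be
recovered from its confidence and the consequent's frequency in the summary(tblTrans)
output shown above:

# lift = confidence / support(RHS); COURT_CASE appears in 21,519 of the 110,071 baskets
0.8037399 / (21519 / 110071)   # ~= 4.111, matching the reported lift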

While it's not visible from the subset shown, all 71 of the generated rules have a single-
item consequent. This is fortunate, and not always the case; such rules are the most
actionable in practice compared to rules with compound consequents [4].

Figure 5, generated using arulesViz, is a scatter plot of the support and confidence of
the 71 rules generated by arules. Here we see that the majority of the rules are in the
0.05-0.15 support range, meaning between 5% and 15% of the 110,071 queries analyzed
contain all of the tables represented in the rule.

Figure 5 - Plot of the interestingness measures for the 71 generated rules

An illuminating visualization is shown in figure 6, also generated by the arulesViz
package. This figure plots the rule antecedents on the x-axis and their consequents on the
y-axis. To economize on space, the table names aren't displayed; instead, they are
numbered, with the numbering corresponding to output that accompanies the graph in the
main R window.

Looking at the plot, two things immediately stand out: the presence of four large rule
groups, and that only nine tables (the y-axis) account for the consequents in all 71 rules.
These nine tables are the nuclear tables around which all the others orbit, the most vital
of the "vital few."
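
Figures 5 and 6 come from arulesViz's plot() method for rules; a sketch of the likely
calls follows (the exact parameter choices are assumptions, not the author's code).

library(arulesViz)

# Figure 5: scatter plot of the 71 rules, support vs. confidence, shaded by lift
plot(tblRules, measure = c("support", "confidence"), shading = "lift")

# Figure 6: matrix-style plot with antecedents on the x-axis and consequents
# on the y-axis, shaded by lift
plot(tblRules, method = "matrix", measure = "lift")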


Figure 6 - Plot of table prevalence among the rules

Another Way: Odds Ratios

As the final step, I computed the odds ratios between all existing pairings occurring
among the top-25 tables, which numbered 225. Odds is the ratio of the probability of an
event's occurrence to the probability of its non-occurrence, and the odds ratio is the ratio
of the odds of two events (e.g., two tables co-occurring in a given query vs. each table
appearing without the other) [19]. To compute the odds ratios, I used the cc() function
from the epicalc R package, which, when given a 2x2 contingency table (see table 1,
generated with the R CrossTable() function), outputs the following (the counts
shown are for the pairing of the PARTICIPANT and CASE_PARTICIPANT tables):

FALSE TRUE Total
FALSE 408021 50902 458923
TRUE 56963 21745 78708
Total 464984 72647 537631

OR = 3.06
Exact 95% CI = 3.01, 3.12
Chi-squared = 15719.49, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0

For these two tables, the odds ratio is 3.06, with a 95% confidence interval of 3.01 to
3.12. An odds ratio greater than 1 indicates a positive association between the two tables,
and the larger the ratio, the stronger the association; a confidence interval that excludes 1,
as this one does, indicates that the association is statistically significant.

Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|


Total Observations in Table: 537631


| CASE_PARTICIPANT
PARTICIPANT | FALSE | TRUE | Row Total |
--------------------------------------|-----------|-----------|-----------|
FALSE | 408021 | 50902 | 458923 |
| 310.961 | 1990.337 | |
| 0.889 | 0.111 | 0.854 |
| 0.877 | 0.701 | |
| 0.759 | 0.095 | |
--------------------------------------|-----------|-----------|-----------|
TRUE | 56963 | 21745 | 78708 |
| 1813.123 | 11605.065 | |
| 0.724 | 0.276 | 0.146 |
| 0.123 | 0.299 | |
| 0.106 | 0.040 | |
--------------------------------------|-----------|-----------|-----------|
Column Total | 464984 | 72647 | 537631 |
| 0.865 | 0.135 | |
--------------------------------------|-----------|-----------|-----------|
Table 1 - 2x2 contingency table
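
The odds ratio and its tests can also be reproduced from the counts in table 1 with base R
alone; a sketch (the author's actual tool was epicalc's cc()):

# 2x2 contingency table for PARTICIPANT vs. CASE_PARTICIPANT, from table 1
ct <- matrix(c(408021, 50902,
                56963, 21745),
             nrow = 2, byrow = TRUE,
             dimnames = list(PARTICIPANT      = c("FALSE", "TRUE"),
                             CASE_PARTICIPANT = c("FALSE", "TRUE")))

(ct[1, 1] * ct[2, 2]) / (ct[1, 2] * ct[2, 1])   # cross-product odds ratio, ~3.06
chisq.test(ct)    # chi-squared test of independence
fisher.test(ct)   # Fisher's exact test, which also reports an OR and 95% CI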

I tabulated the odds ratios of the 225 pairings, and table 2 shows the top 25, ordered by
odds ratio in descending order. The top four odds ratios show as Inf (infinite) because
either the FALSE/TRUE or the TRUE/FALSE cell in their respective 2x2 contingency table
was 0, indicating that among the 537,631 observations (rows) analyzed, the one table
never appeared without the other.

The results shown in table 2 differ from what figure 3 depicts. The latter shows the
strongest link between the CASE and CASE_PARTICIPANT tables, whereas in the
former it's SUPPORT_ORDER and SOCIAL_SECURITY_NUMBER. This is because the
line thicknesses in figure 3 are based solely on the count of co-references (analogous to
the TRUE/TRUE cell in the 2x2 contingency table), whereas odds ratios consider the pair
counts relative to each other, i.e., they take the other three cells (FALSE/FALSE,
FALSE/TRUE, and TRUE/FALSE) into account as well.

First Table Second Table OR
SUPPORT_ORDER SOCIAL_SECURITY_NUMBER Inf
SOCIAL_SECURITY_NUMBER PARTICIPANT_PUBLIC_ASSISTANCE_CASE Inf
CASE_ACCOUNT CASE Inf
BIRTH_INFORMATION ADDRESS Inf
PARTICIPANT_EMPLOYER EMPLOYER_ADDRESS 209.014
EMPLOYER_ADDRESS EMPLOYER 54.854
ORG_UNIT INTERNAL_USER 53.583
PARTICIPANT_EMPLOYER EMPLOYER 37.200
SOCIAL_SECURITY_NUMBER PARTICIPANT_NAME 36.551
PARTICIPANT_RELATIONSHIP PARTICIPANT_NAME 34.779
CHARGING_INSTRUCTION CASE_ACCOUNT 27.309
SOCIAL_SECURITY_NUMBER PARTICIPANT_RELATIONSHIP 26.885
CHARGING_INSTRUCTION CASE_ACCOUNT_SUMMARY 25.040
PARTICIPANT LOGICAL_COLLECTION_TRANSACTION 23.281
COMBINED_LOG_TEMP_ENTRY CHARGING_INSTRUCTION 22.357
PARTICIPANT_NAME PARTICIPANT 9.739
SOCIAL_SECURITY_NUMBER PARTICIPANT 8.799
PARTICIPANT_PUBLIC_ASSISTANCE_CASE COMBINED_LOG_TEMP_ENTRY 8.578
CASE_COURT_CASE CASE 7.913
PARTICIPANT_ADDRESS PARTICIPANT 7.869
PARTICIPANT COMBINED_LOG_TEMP_ENTRY 7.861
COURT_CASE CASE_COURT_CASE 7.525
PARTICIPANT_NAME PARTICIPANT_EMPLOYER 7.454
SOCIAL_SECURITY_NUMBER PARTICIPANT_ADDRESS 7.420
PARTICIPANT_NAME ORG_UNIT 7.346
Table 2 - Top 25 table-pair odds ratios

Conclusion

In this paper, I've described a holistic process for identifying the most-queried "vital
few" tables in a database, uncovering their usage patterns and interrelationships, and
guiding their placement on physical media.

First, I captured query metadata and parsed it for further analysis. I then established that
there are a "vital few" tables that account for the majority of query activity. Finally, I
used MBA, supplemented with other methods, to understand the co-reference patterns
among these tables, which may in turn inform their layout on storage media.

My hope is that I've described what I did in enough detail that you're able to adapt it,
extend it, and improve it, to the betterment of the performance of your databases and
applications.

Appendix A - Data Placement on Physical Media

Storage devices such as disks maximize throughput by minimizing access time [5], and a
fundamental part of physical database design is the allocation of database objects to such
physical media: deciding where schema objects should be placed on disk to maximize
performance by minimizing disk seek time and rotational latency. The former is a
function of the movement of a disk's read/write head arm assembly, and the latter is
dependent on its rotations per minute. Both are electromechanical absolutes, although
their speeds vary from disk to disk.

As is well known, disk access is several orders of magnitude slower than RAM access
(estimates range from four to six orders of magnitude [ibid.]), and this relative disparity is
no less true today than it was when the IBM 350 Disk Storage Unit was introduced in 1956.
So while this topic may seem like a bit of a chestnut in the annals of physical database
design, it remains germane. The presence of such compensatory components and strategies
as bufferpools, defragmenting, table reorganizations, and read-ahead prefetching in the
architecture of modern RDBMSs underscores this point [20]. The fact is that read/write
heads can only be in one place on the disk platter at a time. Solid-state drives (SSD) offer
potential relief here, but data volumes are rising at a rate much faster than that at which
SSD prices are falling.

Coupled with this physical reality is a fiscal one: it's been estimated that anywhere
between 16% and 40% of IT budget outlay is committed to storage [6; 23]. In light of such
costs, it makes good financial sense for an organization to be a wise steward of this
resource and seek its most efficient use.

If one is designing a new database as part of a larger application development initiative,
then such tried-and-true tools as CRUD (create, read, update, delete) matrices and entity
affinity analysis can assist with physical table placement, but such techniques quickly
become tedious, and therefore error-prone, and these early placement decisions are at best
educated guesses. What would be useful is an automated, holistic approach that helps refine
the placement of tables as the full complement of queries comes on line and, later, as it
changes over the lifetime of the application, without incurring extra storage costs. The
present paper is, of course, an attempt at such an approach.

As to where to locate data on disk media in general, the rule of thumb is to place the
high-use tables on the middle tracks of the disk, given that this location has the smallest
average distance to all other tracks. In the case of disks employing zone-bit recording
(ZBR), as practically all now do, the recommendation is to place the high-frequency
tables, say the "vital few" 20%, on the outermost cylinders, as the raw transfer rate is
higher there because the bits are more densely packed. This idea can be extended further by
placing tables typically co-accessed in queries in the outermost zones of separate disk
drives [2], minimizing read/write head contention and enabling query parallelism. If
zone-level placement specificity is not an option, separating co-accessed "vital few"
tables onto separate media is still a worthwhile practice.

Appendix B - How many baskets?

For this analysis, performed on modest hardware, I used all of the snapshot data I had
available. Indeed, one of the precepts of the burgeoning field of data science [17] is that
with today's commodity hardware, computing power, and addressable memory sizes, we
no longer have to settle for samples. Plus, when it comes to the data we've been
discussing, sampling risks overlooking its long-tailed aspect [7]. Nonetheless, there may
still be instances where analyzing all of the snapshots at your disposal isn't practicable, or
where you want to know at the outset of an analysis how many statements you'll need to
capture to get a good representation of the query population (and therefore greater
statistical power) because you don't have a ready pool of snapshots from which to draw. In
either case, you need to know how large your sample must be for a robust analysis.

Sample size formulas exist for more conventional hypothesis testing, but Zaki et al. [24]
give a more suitable sample size formula, one specific to market basket analysis:

    n = -2 ln(c) / (τ ε²),

where n is the sample size, c is 1 - α (α being the confidence level), ε is the acceptable
level of inaccuracy, and τ is the minimum required support [13]. Using this equation,
with 80% confidence (c = .20), 95% accuracy (ε = .05), and 5% support (τ = .05), the
sample size recommendation is 25,751.
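
Wrapped in a small R function, the calculation is easy to rerun for other parameter
choices (the function name is mine, not part of any package):

# Zaki et al. sample-size estimate for association rule mining
mbaSampleSize <- function(c, epsilon, tau) {
  -2 * log(c) / (tau * epsilon^2)
}

mbaSampleSize(c = 0.20, epsilon = 0.05, tau = 0.05)   # ~25,751, as above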

Using this sample size, I ran the sample() function from the arules package against
the tblTrans transactions object and used the result as input to the apriori()
function to generate a new set of rules. This time, 72 rules were generated. Figure 7
shows the high degree of correspondence between the relative frequencies of the tables in
the sample (bars) and in the population (line).


Figure 7 - Relative frequencies of the tables in the sample vs. in the population. The bars, left to right: CASE, CASE_PARTICIPANT, PARTICIPANT, COURT_CASE, CASE_COURT_CASE, CASE_ACCOUNT_SUMMARY, LEGAL_ACTIVITY, PARTICIPANT_NAME, CASE_ACCOUNT, SUPPORT_ORDER, ADDRESS, COMBINED_LOG_TEMP_ENTRY, ORG_UNIT, CASE_ACCOUNT_SUMMARY_TRANSACTION, BIRTH_INFORMATION, PARTICIPANT_ADDRESS, SOCIAL_SECURITY_NUMBER, PARTICIPANT_EMPLOYER, LOGICAL_COLLECTION_TRANSACTION, INTERNAL_USER, CHARGING_INSTRUCTION, EMPLOYER, PARTICIPANT_PHYSICAL_ATTRIBUTES, TERMS, PARTICIPANT_RELATIONSHIP.

Figure 8 plots the regression lines of the confidence and support of the rules generated
from the sample (black) and the population (grey). Again, notice the high degree of
correspondence.


Figure 8 - Correspondence between the support and confidence of the sample and population rules

References

[1] Agrawal, Rakesh, et al. Mining Association Rules Between Sets of Items in Large
Databases. Proceedings of the 1993 ACM SIGMOD International Conference on
Management of Data. pp. 207-216.

[2] Agrawal, Sanjay, et al. Automating Layout of Relational Databases. Proceedings of
the 19th International Conference on Data Engineering (ICDE '03). (2003): 607-618.

[3] Barabási, Albert-László and Zoltán N. Oltvai. Network Biology: Understanding the
Cell's Functional Organization. Nature Reviews Genetics. 5.2 (2004): 101-113.

[4] Berry, Michael J. and Gordon Linoff. Data Mining Techniques: For Marketing, Sales,
and Customer Support. New York: John Wiley & Sons, 1997.

[5] Blanchette, Jean-François. A Material History of Bits. Journal of the American
Society for Information Science and Technology. 62.6 (2011): 1042-1057.

[6] Butts, Stuart. How to Use Single Instancing to Control Storage Expense. eWeek 03
August 2009. 22 December 2010 <http://mobile.eweek.com/c/a/Green-IT/How-
to-Use-Single-Instancing-to-Control-Storage-Expense/>.

[7] Cohen, Jeffrey, et al. MAD Skills: New Analysis Practices for Big Data. Journal
Proceedings of the VLDB Endowment. 2.2 (2009): 1481-1492.

[8] Cunningham, Sally Jo and Eibe Frank. Market Basket Analysis of Library
Circulation Data. Proceedings of the Sixth International Conference on Neural
Information Processing (1999). Vol. II, pp. 825-830.

[9] Eldredge, Jonathan D. The Vital Few Meet the Trivial Many: Unexpected Use
Patterns in a Monographs Collection. Bulletin of the Medical Library
Association. 86.4 (1998): 496-503.

[10] Erar, Aydin. Bibliometrics or Informetrics: Displaying Regularity in Scientific
Patterns by Using Statistical Distributions. Hacettepe Journal of Mathematics
and Statistics. 31 (2002): 113-125.

[11] Fu, W. Wayne and Clarice C Sim. Aggregate Bandwagon Effect on Online Videos'
Viewership: Value Uncertainty, Popularity Cues, and Heuristics. Journal of the
American Society For Information Science and Technology. 62.12 (2011): 2382-
2395.

[12] Geng, Liqiang and Howard J. Hamilton. Interestingness Measures for Data Mining:
A Survey. ACM Computing Surveys. 38.3 (2006): 1-32.

[13] Hahsler, Michael, et al. Introduction to arules: A Computational Environment for
Mining Association Rules and Frequent Item Sets. CRAN 16 March 2010.
<http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf>.

[14] Heising, W.P. Note on Random Addressing Techniques. IBM Systems Journal. 2.2
(1963): 112-6.

[15] Hsu, W.W., A.J. Smith, and H.C. Young. Characteristics of Production Database
Workloads and the TPC Benchmarks. IBM Systems Journal. 40.3 (2001): 781-
802.

[16] Janert, Philipp K. Data Analysis with Open Source Tools. Sebastopol: O'Reilly,
2010.

[17] Loukides, Mike. What is Data Science? O'Reilly Radar. 2 June 2010.
<http://radar.oreilly.com/2010/06/what-is-data-science.html>.

[18] Nisbet, Robert, John Elder, and Gary Miner. Handbook of Statistical Analysis and
Data Mining. Burlington, MA: Academic Press, 2009.

[19] Ott, R. Lyman, and Michael Longnecker. An Introduction to Statistical Methods and
Data Analysis. 6th ed. Belmont, CA: Brooks/Cole, 2010.

[20] Pendle, Paul. Solid-State Drives: Changing the Data World. IBM Data
Management Magazine. Issue 3 (2011): 27-30.

[21] Spence, Robert. Information Visualization. 2nd ed. London: Pearson, 2007.

[22] Trueswell, Richard L. Some Behavioral Patterns of Library Users: the 80/20 Rule.
Wilson Library Bulletin. 43.5 (1969): 458-461.

[23] Whitely, Robert. Buyer's Guide to Infrastructure: Three Steps to IT
Reorganisation. ComputerWeekly.com 01 September 2010. 22 December 2010
<http://www.computerweekly.com/feature/Buyers-Guide-to-infrastructure-Three-
steps-to-IT-reorganisation>.

[24] Zaki, Mohammed Javeed, et al. Evaluation of Sampling for Data Mining of
Association Rules. Proceedings of the 7th International Workshop on Research
Issues in Data Engineering (RIDE '97) High Performance Database Management
for Large-Scale Applications. (1997): 42-50.


Copyright 2014 by Jeffrey K. Tyzzer
