Market Basket Analysis of Database Table References Using R: An Application to Physical Database Design
Jeffrey Tyzzer¹
Summary
Market basket and statistical and informetric analyses are applied to a population of database queries (SELECT statements) to better understand table usage and co-occurrence patterns and inform placement on physical media.
Introduction
In 1999 Sally Jo Cunningham and Eibe Frank of the University of Waikato published a paper titled Market Basket Analysis of Library Circulation Data [8]. In it the authors apply market basket analysis (MBA) to library book circulation data, models of which are a staple of informetrics, the application of statistical and mathematical methods in Library and Information Sciences [10]. Their paper engendered the ideas that led to this one, which concerns the application of MBA and statistical and informetric analyses to a set of database queries, i.e., SELECT statements, to better understand table usage and co-occurrence patterns.
Market Basket Analysis
Market basket analysis is a data mining technique that applies association rule analysis, a method of uncovering connections among items in a data set, to supermarket purchases, with the goal of finding items (i.e., groceries) having a high probability of appearing together. For instance, a rule induced by MBA might be "in 85% of the baskets where potato chips appeared, so did root beer." In the Cunningham and Frank paper, the baskets were the library checkouts and the groceries were the books. In this paper, the baskets are the queries and the groceries are the tables referenced in the queries.
MBA was introduced in the seminal paper Mining Association Rules between Sets of Items in Large Databases, by Agrawal et al. [1], and is used by retailers to guide store layout (for example, placing products having a high probability of appearing in the same purchase closer together to encourage greater sales) and promotions (e.g., buy one and get the other half off). The output of MBA is a set of association rules and attendant metadata in the form {LHS => RHS}. LHS means left-hand side and RHS means right-hand side. These rules are interpreted as "if LHS then RHS," with the LHS referred to as the antecedent and the RHS referred to as the consequent. For the potato chip and root beer example, we'd have {Chips => Root beer}.
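As a toy illustration of this form (the baskets below are invented, not this paper's data), the arules package used later in this paper can mine such a rule directly:

library(arules)

# Four invented shopping baskets
baskets <- list(c("Chips", "Root beer"),
                c("Chips", "Root beer", "Salsa"),
                c("Chips", "Root beer"),
                c("Bread", "Milk"))
trans <- as(baskets, "transactions")

# Rules holding in at least half the baskets and true at least 80% of the time
rules <- apriori(trans, parameter = list(support = 0.5, confidence = 0.8))
inspect(rules)   # includes {Chips} => {Root beer} with confidence 1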
¹ jefftyzzer AT sbcglobal DOT net

The Project
Two questions directed my investigation:
1. Among the tables, are there a vital few [9] that account for the bulk of the table references in the queries? If so, which ones are they?
2. Which table pairings (co-occurrences) are most frequent within the queries?
The answers to these questions can be used to:
- Steer the placement of tables on physical media²
- Justify denormalization decisions
- Inform the creation of materialized views, table clusters, and aggregates
- Guide partitioning strategies to achieve collocated joins and reduce inter-node data shipping in distributed databases
- Identify missing indexes to support frequently joined tables³
- Direct the scope, depth, frequency, and priority of table and index statistics gathering
- Contribute to an organization's overall corpus of operational intelligence
The data at the focus of this study are metadata for queries executed against a population of 494 tables within an OLTP database. The queries were captured over a four-day period. There is an ad hoc query capability within the environment, but such queries are run against a separate data store; thus the system under study was effectively closed with respect to random, external queries.
I wrote a Perl program to iterate over the compressed system-generated query log files, 272 in all, cull the tables from each of the SELECT statements within them, and, for those referencing at least one of the 494 tables, output detail and summary data, respectively, to two files. Of the 553,139 total statements read, 373,372 met this criterion (the remainder, a sizable number, were metadata-type statements, e.g., data dictionary lookups and variable instantiations).
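The Perl program itself is not reproduced here, but the culling step can be sketched in R as below. This stand-in assumes stmt holds one SELECT statement and tables494 the list of the 494 table names; comma-separated FROM lists and subqueries would need additional handling.

# A simplified stand-in for the Perl culling step (not the author's program)
extract_tables <- function(stmt, tables494) {
  # capture the identifier following each FROM or JOIN keyword
  m    <- gregexpr("(?i)\\b(from|join)\\s+([A-Za-z0-9_$#.]+)", stmt, perl = TRUE)
  hits <- regmatches(stmt, m)[[1]]
  refs <- toupper(sub("(?i)^(from|join)\\s+", "", hits, perl = TRUE))
  intersect(unique(refs), tables494)
}

extract_tables("SELECT * FROM case c JOIN participant p ON ...",
               c("CASE", "PARTICIPANT"))   # "CASE" "PARTICIPANT"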
The summary file lists each table and the number of queries it appears in; its structure is simply {table, count}. The detail file lists {query, table, hour} triples, which were then imported into a simple table consisting of three corresponding columns. query identifies the query in which the table is referenced, tabname is the name of the table, and hour designates the hour the query snapshot was taken, in the range 0-23.
² Both within a given medium as well as among media with different performance characteristics, e.g., tiering storage between disk and solid-state drives (SSD) [15].
³ Adding indexes to support SELECTs may come at the expense of increased INSERT, UPDATE, and DELETE costs. When adding indexes to a table, the full complement of CRUD operations against it must be considered. The analysis discussed here is easily extended to encompass (other) DML statements as well.

The Vital Few
The 80/20 principle describes a phenomenon in which 20% of a population or group can explain 80% of an effect [9]. This principle is widely observed in economics, where it's generally referred to as the Pareto principle, and in informetrics, where it's known as Trueswell's 80/20 rule [22]. Trueswell argued that 80% of a library's circulation is accounted for by 20% of its circulating books [9]. This behavior has also been observed in computer science contexts, where it's been noted that 80 percent of the transactions on a file are applied to the 20 percent most frequently used records within it [14; 15]. I wanted to see if this pattern of skewed access could apply to RDBMSs as well, i.e., if 20% of the tables in the database might account for 80% of all table references within queries.
Figure 1 plots in ascending (rank) order the number of queries each of the 494 tables is referenced in (space prohibits a table listing the frequencies of all 494 tables), showing a characteristic reverse-J shape. Figure 2 presents these data as a Lorenz curve, which was generated using the R package ineq. As [11] puts it, "[t]he straight line represents the expected distribution" if all tables were queried an equal number of times, with the curved line indicating the observed distribution. As the figure shows, 20% of the tables account for a little more than 85% of the table references. Clearly, a subset of tables, the vital few, account for the majority of table references in the queries. These tables get the most attention query-wise and therefore deserve the most attention performance-wise.
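Both plots can be reproduced along the following lines; this sketch assumes counts holds the per-table query counts from the summary file described earlier, and the file name is hypothetical.

library(ineq)

counts <- read.csv("table_counts.csv")$count    # hypothetical file name

plot(sort(counts), type = "h",                  # cf. figure 1
     xlab = "Table rank", ylab = "Queries referencing table")

plot(Lc(counts))                                # cf. figure 2: Lorenz curve
Gini(counts)                                    # the inequality as one number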
Figure 1 - Plot of query-count-per-table frequency
Figure 2 - Lorenz curve illustrating the 80/20 rule for table references
The Market Basket Analysis
The 373,372⁴ statements mentioned earlier are the table baskets from which the subset of transactions against the 25 most-queried tables is derived. As a first step toward uncovering connections among the tables in the database using MBA, I used the R package diagram to create a web plot, or annulus [21], of the co-reference relationships among these 25 tables, shown in figure 3. Note that line thickness in the figure is proportional to the frequency of co-occurrence. As can be seen, there is a high level of interconnectedness among these tables. Looking at these connections as a graph, with the tables as nodes and their co-occurrence in queries as edges, I computed the graph's clustering coefficient, the number of actual connections between the nodes divided by the total possible number of connections [3]; it turned out to be 0.75, a not surprisingly high value given what figure 3 illustrates (a sketch of the computation appears after the figure).
⁴ See Appendix B for a sample size formula if a population of table baskets is not already available.
Figure 3 - Web plot of the co-references between the 25 most-queried tables
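The clustering coefficient reported above can be computed with the igraph package (a package not otherwise used in this paper; this is a sketch), assuming pairs is a data frame whose first two columns name the co-occurring tables:

library(igraph)

g <- graph_from_data_frame(pairs, directed = FALSE)
edge_density(g)    # actual / possible connections, per the definition above
transitivity(g)    # igraph's global clustering coefficient, for comparison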
To mine the table association rules, I used R's arules package. Before the table basket data could be analyzed, it had to be read into an R data structure, which is done using the read.transactions() function. The result is a transactions object, an incidence matrix of the transaction items.
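A call of the following shape produces that object; it assumes the detail file has been exported as a headerless CSV of {query, table} pairs, and the file name is hypothetical.

library(arules)

tblTrans <- read.transactions("detail.csv",
                              format = "single",  # one row per (query, table)
                              sep    = ",",
                              cols   = c(1, 2))   # transaction id, item

To see the structure of the matrix, type the name of the variable at the R prompt: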
> tblTrans
transactions in sparse format with
 110071 transactions (rows) and
 154 items (columns)
Keep in mind that the number of transactions (queries) and items (tables) shown here differs from their respective numbers listed previously because I limited the analysis to just those table baskets with at least two of the 25 most-queried tables in them. Looking at the output, that's 110,071 queries and 154 tables (the top 25 along with 129 others they appear with).
To see the contents of the tblTrans transactions object, the inspect() function is used; note I limited inspect() to the first five transactions, as identified by the ASCII ordering of the transactionID.
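The call itself is simply

> inspect(tblTrans[1:5])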
The summary() function provides additional descriptive statistics concerning the make-up of the table transactions (output format edited slightly to fit):
> summary(tblTrans)
transactions as itemMatrix in sparse format with
 110071 rows (elements/itemsets/transactions) and
 154 columns (items) and a density of 0.02457888

most frequent items:
           CASE CASE_PARTICIPANT      PARTICIPANT       COURT_CASE
          51216            38476            35549            21519
CASE_COURT_CASE          (Other)
          21421           248454

element (itemset/transaction) length distribution (summary):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   3.785   5.000  17.000

includes extended item information - examples:
               labels
1 ACCOUNT_HOLD_DETAIL
2             ADDRESS
3          ADJUSTMENT

includes extended transaction information - examples:
  transactionID
1             1
2            10
3           100
There's a wealth of information in this output. Note, for instance, that the minimum item (table) count is two and the maximum is seventeen, and that there are 33,573 transactions with two items and two transactions with seventeen. While I'm loath to assign a limit to the maximum number of tables that should ever appear in a query, a DBA would likely be keen to investigate the double-digit queries for potential tuning opportunities.
Lastly, we can use the itemFrequencyPlot() function to generate an item frequency distribution. Frequencies can be displayed as relative (percentages) or absolute (counts). Note that for readability I limited the plot to the 25 most frequent items among the query baskets by specifying a value for the topN parameter. The command is below, and the plot is shown in figure 4.
> itemFrequencyPlot(tblTrans, type = "absolute", topN = 25, main = "Frequency Distribution of Top 25 Tables", xlab = "Table Name", ylab = "Frequency")
Figure 4 - Item frequency bar plot among the top 25 tables

With the query baskets loaded, it was then time to generate the table association rules. The R function within arules that does this is apriori(). apriori() takes up to four arguments, but I only used two: a transaction object, tblTrans, and a list of two parameters that specify the minimum values for the two rule interestingness criteria,⁵
generality and reliability [12; 16]. Support, the measure of generality, specifies the proportion of all baskets in which the rule is true, i.e.,

    support = countOfBasketsWithLHSandRHSItems / totalCount.

Confidence, the measure of reliability, specifies how often the rule is true when the LHS is true, i.e.,

    confidence = countOfBasketsWithLHSandRHSItems / countOfBasketsWithLHSItems.

A third interestingness criterion, lift, is also useful for evaluating rules and figures prominently in output from the R-extension package arulesViz (see figure 5). Paraphrasing [18], lift is the ratio of a rule's confidence to the expected confidence that the second table will be queried given only its overall frequency:

    lift = confidence(rule) / support(RHS),

with support(RHS) calculated as

    support(RHS) = countOfBasketsWithRHSItem / totalCount.

⁵ This was the attendant metadata I mentioned in the Market Basket Analysis section.
Lift indicates the strength of the association over its random co-occurrence: when lift is greater than 1, the rule is better than guessing at predicting the consequent. For example (with hypothetical numbers), a rule with confidence 0.90 whose consequent appears in 18% of all baskets has lift 0.90 / 0.18 = 5; the antecedent makes the consequent five times more likely than its base rate.
For confidence I specified .8 and for support I specified .05. The command I ran was

> tblRules <- apriori(tblTrans, parameter = list(support = 0.05, confidence = 0.8))
which generated 71 rules. To get a high-level overview of the rules, you can call the overloaded summary() function against the output of apriori():
> summary(tblRules)
set of 71 rules

rule length distribution (lhs + rhs):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   3.000   3.000   3.394   4.000   5.000

summary of quality measures:
    support          confidence          lift
 Min.   :0.05239   Min.   :0.8037   Min.   :1.728
 1st Qu.:0.05678   1st Qu.:0.8585   1st Qu.:2.460
 Median :0.06338   Median :0.9493   Median :4.533
 Mean   :0.07842   Mean   :0.9231   Mean   :4.068
 3rd Qu.:0.08271   3rd Qu.:0.9870   3rd Qu.:5.066
 Max.   :0.28343   Max.   :1.0000   Max.   :7.204

mining info:
     data ntransactions support confidence
 tblTrans        110071    0.05        0.8
To see the rules themselves, execute the inspect() function; note I'm only showing the first and last five, as sorted by confidence.
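With the rules in tblRules as above, those listings come from calls like

> inspect(head(sort(tblRules, by = "confidence"), 5))
> inspect(tail(sort(tblRules, by = "confidence"), 5))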
Let's look at the first and last rules and interpret them. The first rule says that over the period during which the queries were collected, the SUPPORT_ORDER table appeared in 11.64% of the queries, and that when it did it was accompanied by the LEGAL_ACTIVITY table 100% of the time. The last rule, the 71st, says that during this same period CASE and CASE_COURT_CASE appeared together in 13% of the queries, and that they were accompanied by COURT_CASE 80.37% of the time.
While it's not visible from the subset shown, all 71 of the generated rules have a single-item consequent. This is fortunate, and is not always the case, as such rules are the most actionable in practice compared to rules with compound consequents [4].
Figure 5, generated using arulesViz, is a scatter plot of the support and confidence of the 71 rules generated by arules. Here we see the majority of the rules are in the 0.05-0.15 support range, meaning between 5% and 15% of the 110,071 queries analyzed contain all of the tables represented in the rule.
Figure 5 - Plot of the interestingness measures for the 71 generated rules
An illuminating visualization is shown in figure 6, also generated by the arulesViz package. This figure plots the rule antecedents on the x-axis and their consequents on the y-axis. To economize on space, the table names aren't displayed but rather are numbered, corresponding to output accompanying the graph that's displayed in the main R window.
Looking at the plot, two things immediately stand out: the presence of four large rule groups, and that only nine tables (the y-axis) account for the consequents in all 71 rules. These nine tables are the nuclear tables around which all the others orbit, the most vital of the vital few.
Figure 6 - Plot of table prevalence among rules
Another Way: Odds Ratios
As the final step, I computed the odds ratios between all existing pairings occurring among the top-25 tables, which numbered 225. Odds is the ratio of the probability of an event's occurrence to the probability of its non-occurrence, and the odds ratio is the ratio of the odds of two events (e.g., two tables co-occurring in a given query vs. each table appearing without the other) [19]. To compute the odds ratios, I used the cc() function from the epicalc R package, which, when given a 2x2 contingency table (see table 1, generated with the R CrossTable() function), outputs the following (the counts shown are of the pairing of the PARTICIPANT and CASE_PARTICIPANT tables):
          FALSE    TRUE   Total
FALSE    408021   50902  458923
TRUE      56963   21745   78708
Total    464984   72647  537631
OR = 3.06
Exact 95% CI = 3.01, 3.12
Chi-squared = 15719.49, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0
For these two tables, the odds ratio is 3.06, with a 95% confidence interval of 3.01 to 3.12. An odds ratio greater than 1 indicates the tables co-occur more often than they would if they were queried independently, and the association is statistically significant when, as here, the confidence interval excludes 1; the higher the ratio, the stronger the association.
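These figures can be cross-checked in base R from the table 1 counts alone; the following is a verification sketch, not the epicalc call used above.

# 2x2 table from table 1: rows are PARTICIPANT absent/present,
# columns are CASE_PARTICIPANT absent/present
m <- matrix(c(408021, 56963, 50902, 21745), nrow = 2)

(m[2, 2] * m[1, 1]) / (m[1, 2] * m[2, 1])   # sample odds ratio, ~3.06
chisq.test(m, correct = FALSE)              # cf. chi-squared = 15719.49, 1 d.f.
fisher.test(m)                              # exact test, OR and 95% CI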
I tabulated the odds ratios of the 225 pairings, and table 2 shows the first 25, ordered by odds ratio in descending order. The top four odds ratios show as Inf (infinite) because either the FALSE/TRUE or TRUE/FALSE cell in their respective 2x2 contingency tables was 0, indicating that among the 537,631 observations (rows) analyzed, the one table never appeared without the other.
The results shown in table 2 differ from what figure 3 depicts. The former shows the strongest link between the CASE and CASE_PARTICIPANT tables, whereas for the latter it's SUPPORT_ORDER and SOCIAL_SECURITY_NUMBER. This is because the line thicknesses in figure 3 are based solely on the count of co-references (analogous to the TRUE/TRUE cell in the 2x2 contingency table), whereas odds ratios consider the pair counts relative to each other, i.e., they take into account the other three cells (FALSE/FALSE, FALSE/TRUE, and TRUE/FALSE) as well.
Conclusion

In this paper, I've described a holistic process for identifying the most-queried vital few tables in a database, uncovering their usage patterns and interrelationships, and guiding their placement on physical media.
First I captured query metadata and parsed it for further analysis. I then established that there are a vital few tables that account for the majority of query activity. Finally, I used MBA supplemented with other methods to understand the co-reference patterns among these tables, which may in turn inform their layout on storage media.
My hope is that I've described what I did in enough detail that you're able to adapt it, extend it, and improve it to the betterment of the performance of your databases and applications.
Appendix A - Data Placement on Physical Media
Storage devices such as disks maximize throughput by minimizing access time [5], and a fundamental part of physical database design is the allocation of database objects to such physical media: deciding where schema objects should be placed on disk to maximize performance by minimizing disk seek time and rotational latency. The former is a function of the movement of a disk's read/write head arm assembly, and the latter is dependent on its rotations per minute. Both are electromechanical absolutes, although their speeds vary from disk to disk.
As is well known, disk access is several orders of magnitude slower than RAM access (estimates range from four to six [ibid.]), and this relative disparity is no less true today than it was when the IBM 350 Disk Storage Unit was introduced in 1956. So while this topic may seem like a bit of a chestnut in the annals of physical database design, it remains germane. The presence of such compensatory components and strategies as bufferpools, defragmenting, table reorganizations, and read-ahead prefetching in the architecture of modern RDBMSs underscores this point [20]. The fact is, read/write heads can only be in one place on a disk platter at a time. Solid-state drives (SSDs) offer potential relief here, but data volumes are rising at a rate much faster than that at which SSD prices are falling.
Coupled with this physical reality is a fiscal one, as it's been estimated that anywhere between 16% and 40% of IT budget outlay is committed to storage [6; 23]. In light of such costs, it makes good financial sense for an organization to be a wise steward of this resource and seek its most efficient use.
If one is designing a new database as part of a larger application development initiative, then such tried-and-true tools as CRUD (create, read, update, delete) matrices and entity affinity analysis can assist with physical table placement, but such techniques quickly become tedious, and therefore error-prone, and these early placement decisions are at best educated guesses. What would be useful is an automated, holistic approach that helps refine the placement of tables as the full complement of queries comes on line, and later as that complement changes over the lifetime of the application, without incurring extra storage costs. The present paper is, of course, an attempt at such an approach.
As to where to locate data on disk media in general, the rule of thumb is to place the high-use tables on the middle tracks of the disk given this location has the smallest average distance to all other tracks. In the case of disks employing zone-bit recording (ZBR), as practically all now do, the recommendation is to place the high-frequency tables, say, the vital few 20%, on the outermost cylinders, as the raw transfer rate is higher there since the bits are more densely packed. This idea can be extended further by placing tables typically co-accessed in queries in the outermost zones on separate disk drives [2], minimizing read/write head contention and enabling query parallelism. If zone-level disk placement specificity is not an option, separating co-accessed vital few tables onto separate media is still a worthwhile practice.
Appendix B - How many baskets?
For this analysis, performed on modest hardware, I used all of the snapshot data I had available. Indeed, one of the precepts of the burgeoning field of data science [17] is that with today's commodity hardware, computing power, and addressable memory sizes, we no longer have to settle for samples. Plus, when it comes to the data we've been discussing, sampling risks overlooking its long-tailed aspect [7]. Nonetheless, there may still be instances where analyzing all of the snapshots at your disposal isn't practicable, or you may want to know at the outset of an analysis how many statements you'll need to capture to get a good representation of the query population (and therefore greater statistical power), if you don't have a ready pool of snapshots from which to draw. In either case, you need to know how large your sample must be for a robust analysis.
Sample size formulas exist for more conventional hypothesis testing, but Zaki et al. [24] give a more suitable sample size formula, one specific to market basket analysis:

    n = -2 ln(c) / (θ ε²),

where n is the sample size, c is 1 - α (α being the confidence level), ε is the acceptable level of inaccuracy, and θ is the minimum required support [13]. Using this equation, with 80% confidence (c = .20), 95% accuracy (ε = .05), and 5% support (θ = .05), the sample size recommendation is 25,751.
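The arithmetic is easily checked in R:

-2 * log(0.20) / (0.05 * 0.05^2)   # 25751.01, i.e., about 25,751 baskets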
Using this sample size, I drew a sample from the tblTrans transaction object with the sample() function provided by the arules package, the results of which I then used as input to the apriori() function to generate a new set of rules. This time, 72 rules were generated. Figure 7 shows the high degree of correspondence between the relative frequencies of the tables in the sample (bars) and the population (line).
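That step can be sketched as follows; the seed and result names are assumptions, and itemFrequencyPlot()'s population argument draws the population frequencies as the line in figure 7.

set.seed(1)                            # assumed; any reproducible seed
tblSample   <- sample(tblTrans, 25751) # arules' sample() for transactions
sampleRules <- apriori(tblSample,
                       parameter = list(support = 0.05, confidence = 0.8))

# cf. figure 7: sample frequencies (bars) vs. the population (line)
itemFrequencyPlot(tblSample, population = tblTrans, topN = 25)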
Figure 7 - Relative frequencies of the tables in the sample vs. in the population. In rank order, the 25 tables plotted are: CASE, CASE_PARTICIPANT, PARTICIPANT, COURT_CASE, CASE_COURT_CASE, CASE_ACCOUNT_SUMMARY, LEGAL_ACTIVITY, PARTICIPANT_NAME, CASE_ACCOUNT, SUPPORT_ORDER, ADDRESS, COMBINED_LOG_TEMP_ENTRY, ORG_UNIT, CASE_ACCOUNT_SUMMARY_TRANSACTION, BIRTH_INFORMATION, PARTICIPANT_ADDRESS, SOCIAL_SECURITY_NUMBER, PARTICIPANT_EMPLOYER, LOGICAL_COLLECTION_TRANSACTION, INTERNAL_USER, CHARGING_INSTRUCTION, EMPLOYER, PARTICIPANT_PHYSICAL_ATTRIBUTES, TERMS, PARTICIPANT_RELATIONSHIP.
Figure 8 plots the regression lines of the confidence and support of the rules generated from the sample (black) and the population (grey). Again, notice the high degree of correspondence.
Figure 8 - Correspondence between the support and confidence of the sample and population rules
References
[1] Agrawal, Rakesh, et al. Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. pp. 207-216.
[2] Agrawal, Sanjay, et al. Automating Layout of Relational Databases. Proceedings of the 19th International Conference on Data Engineering (ICDE '03). (2003): 607-618.
[3] Barabási, Albert-László and Zoltán N. Oltvai. Network Biology: Understanding the Cell's Functional Organization. Nature Reviews Genetics. 5.2 (2004): 101-113.
[4] Berry, Michael J. and Gordon Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: John Wiley & Sons, 1997.
[5] Blanchette, Jean-François. A Material History of Bits. Journal of the American Society for Information Science and Technology. 62.6 (2011): 1042-1057.
[6] Butts, Stuart. How to Use Single Instancing to Control Storage Expense. eWeek 03 August 2009. 22 December 2010 <http://mobile.eweek.com/c/a/Green-IT/How- to-Use-Single-Instancing-to-Control-Storage-Expense/>.
[7] Cohen, Jeffrey, et al. MAD Skills: New Analysis Practices for Big Data. Proceedings of the VLDB Endowment. 2.2 (2009): 1481-1492.
[8] Cunningham, Sally Jo and Eibe Frank. Market Basket Analysis of Library Circulation Data. Proceedings of the Sixth International Conference on Neural Information Processing (1999). Vol. II, pp. 825-830.
[9] Eldredge, Jonathan D. The Vital Few Meet the Trivial Many: Unexpected Use Patterns in a Monographs Collection. Bulletin of the Medical Library Association. 86.4 (1998): 496-503.
[10] Erar, Aydin. Bibliometrics or Informetrics: Displaying Regularity in Scientific Patterns by Using Statistical Distributions. Hacettepe Journal of Mathematics and Statistics. 31 (2002): 113-125.
[11] Fu, W. Wayne and Clarice C. Sim. Aggregate Bandwagon Effect on Online Videos' Viewership: Value Uncertainty, Popularity Cues, and Heuristics. Journal of the American Society for Information Science and Technology. 62.12 (2011): 2382-2395.
[12] Geng, Liqiang and Howard J. Hamilton. Interestingness Measures for Data Mining: A Survey. ACM Computing Surveys. 38.3 (2006): 1-32.
[13] Hahsler, Michael, et al. Introduction to arules: A Computational Environment for Mining Association Rules and Frequent Item Sets. CRAN 16 March 2010. <http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf>.
[14] Heising, W.P. Note on Random Addressing Techniques. IBM Systems Journal. 2.2 (1963): 112-6.
[15] Hsu, W.W., A.J. Smith, and H.C. Young. Characteristics of Production Database Workloads and the TPC Benchmarks. IBM Systems Journal. 40.3 (2001): 781-802.
[16] Janert, Philipp K. Data Analysis with Open Source Tools. Sebastopol: O'Reilly, 2010.
[17] Loukides, Mike. What is Data Science? O'Reilly Radar. 2 June 2010. <http://radar.oreilly.com/2010/06/what-is-data-science.html>.
[18] Nisbet, Robert, John Elder, and Gary Miner. Handbook of Statistical Analysis and Data Mining. Burlington, MA: Academic Press, 2009.
[19] Ott, R. Lyman, and Michael Longnecker. An Introduction to Statistical Methods and Data Analysis. 6th ed. Belmont, CA: Brooks/Cole, 2010.
[20] Pendle, Paul. Solid-State Drives: Changing the Data World. IBM Data Management Magazine. Issue 3 (2011): 27-30.
[22] Trueswell, Richard L. Some Behavioral Patterns of Library Users: the 80/20 Rule. Wilson Library Bulletin. 43.5 (1969): 458-461.
[23] Whitely, Robert. Buyers Guide to Infrastructure: Three Steps to IT Reorganisation. ComputerWeekly.com 01 September 2010. 22 December 2010 <http://www.computerweekly.com/feature/Buyers-Guide-to-infrastructure-Three-steps-to-IT-reorganisation>.
[24] Zaki, Mohammed Javeed, et al. Evaluation of Sampling for Data Mining of Association Rules. Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97). (1997): 42-50.