
The Application of Apriori-Gen Algorithm in the Association Study in Type 2 Diabetes

Weidong Mao, Department of Mathematics & Computer Science, Virginia State University, Petersburg, VA 23806, USA. Email: wmao@vsu.edu
Jinghe Mao, Department of Biology, Tougaloo College, Tougaloo, MS 39174, USA. Email: jmao@tougaloo.edu

Abstract: Studies show that type 2 diabetes is a genetic disease, and evidence of a statistical interaction among several SNPs has been reported. Gene-gene interaction analysis has been used to identify common disease susceptibility genes. This paper explores the application of the Apriori-Gen algorithm to the association study among SNPs. The algorithm is applied to SNP data for type 2 diabetes. The result shows an interaction among 5 SNPs with a support s of 50% and a confidence of 60%. The relative risk (RR) and odds ratio (OR) are 2.14 and 2.92, respectively. The experimental results indicate that the interaction among those SNPs is associated with the disease.

I. INTRODUCTION
Type 2 diabetes is the most common form of diabetes. In type 2 diabetes, either the body does not produce enough insulin or the cells ignore the insulin. Insulin is necessary for the body to be able to use glucose for energy. When you eat food, the body breaks down all of the sugars and starches into glucose, which is the basic fuel for the cells in the body. Insulin takes the sugar from the blood into the cells. When glucose builds up in the blood instead of going into cells, it can cause two problems: 1) right away, your cells may be starved for energy; 2) over time, high blood glucose levels may hurt your eyes, kidneys, nerves or heart.
While diabetes occurs in people of all ages and races, some groups have a higher risk of developing type 2 diabetes than others. Research shows that type 2 diabetes is caused by a complicated interplay of genes, environment, insulin abnormalities, increased glucose production in the liver, increased fat breakdown, and possibly defective hormonal secretions in the intestine. The recent dramatic increase indicates that lifestyle factors (obesity and a sedentary lifestyle) may be particularly important in triggering the genetic elements that cause this type of diabetes [1].
Although type 2 diabetes cannot easily be treated, it can be avoided if people at high risk change their lifestyle, such as their diet. But how can we assess a person's susceptibility to the disease before symptoms appear, and help them make informed decisions about their health? With the development of DNA microarray techniques, it is possible to access human genetic information related to specific diseases. Assessing the association between DNA variants and disease has been used widely to identify regions of the genome and candidate genes that contribute to disease [2].

99.9% of one individual's DNA sequence is identical to that of another person. Over 80% of the 0.1% difference consists of single nucleotide polymorphisms (SNPs), which promise to significantly advance our ability to understand and treat human disease. A SNP is a single base substitution of one nucleotide with another. Each individual has many single nucleotide polymorphisms that together create a unique DNA pattern for that person. It is important to study SNPs because they represent genetic differences among human beings. Genome-wide association studies require knowledge about common genetic variations and the ability to genotype a sufficiently comprehensive set of variants in a large patient sample [3]. High-throughput SNP genotyping technologies make massive genotype data, covering a large number of individuals, publicly available. The accessibility of genetic data makes genome-wide association studies for complex diseases possible.
Success stories in dealing with diseases caused by a single SNP or gene, sometimes called monogenic diseases, have been reported [4]. However, most complex diseases, such as diabetes, are characterized by a non-Mendelian, multifactorial genetic contribution with a number of susceptibility genes interacting with each other [5]. A fundamental issue in the analysis of SNP data is to define the unit of genetic function that influences disease risk. Is it a single SNP, a regulatory motif, an encoded protein subunit, a combination of SNPs in a combination of genes, an interacting protein complex, or a metabolic or physiological pathway [6]? In general, it may be impossible to associate a single SNP or gene with a disease, because a disease may be caused by completely different modifications of alternative pathways, and each gene makes only a small contribution. This makes the identification of genetic factors difficult. Multi-SNP interaction analysis is more reliable, but an exhaustive search among multi-SNP combinations is computationally infeasible even for a small number of SNPs. Furthermore, there are no reliable tools applicable to large genome ranges that could rule out or confirm association with a disease. It is therefore important to search for informative SNPs among a huge number of SNPs. These informative SNPs are assumed to be associated with genetic diseases. Tag SNPs generated by the multiple linear regression based method [7] are good

978-1-4244-2902-8/09/$25.00 2009 IEEE

informative SNPs, but they are reconstruction-oriented instead of disease-oriented. Although the combinatorial search method [8] for finding disease-associated multi-SNP combinations gives better results, the exhaustive search is still very slow. The Apriori algorithm is the best-known association rule algorithm and is used in most commercial products, but the number of samples is usually very small. This paper explores the possibility of applying it to a disease association study, which has a large data set, and tries to find the association among multiple SNPs that may be responsible for the disease. The association of a multi-SNP combination can be measured by the risk ratio and the odds ratio.
The goal of a disease association study is to assess accumulated information in order to find interactions of multiple SNPs that are associated with complex diseases, with significantly high accuracy and statistical power. The proposed method is applied to analyze the genetic data of type 2 diabetes. It may also be applied to disease prevention and control in the near future. For example, after training on the available case-control genome data, we can find the significant SNPs that are well associated with the disease. When a patient arrives and we obtain his or her genetic data, we do not need to check the whole sequence, but only the disease-associated SNPs. This saves a lot of money and time in diagnosis and can be done before the onset of disease. Therefore, treatment could start earlier to prevent or delay the occurrence of the disease.

II. APRIORI ALGORITHM IN ASSOCIATION STUDY
In this section we first introduce association rules and their support and confidence (strength), then we describe large itemsets and the Apriori algorithm. Finally, we illustrate how to apply the algorithm in the association study.

A. Introduction of Association Rules
Association rules are used to show the relationships between data items.
A database in which an association rule is to be found is viewed as a set of tuples, where each tuple contains a set of items [9]. The association rule is defined in Definition 1.
Definition 1: Given a set of items $I = \{I_1, I_2, \ldots, I_m\}$ and a database of tuples $D = \{t_1, t_2, \ldots, t_n\}$ where $t_i = \{I_{i1}, I_{i2}, \ldots, I_{ik}\}$ and $I_{ij} \in I$, an association rule is an implication of the form $X \Rightarrow Y$ where $X, Y \subset I$ are sets of items called itemsets and $X \cap Y = \emptyset$.
We are generally not interested in all implications but only in those that are important. Here importance is usually measured by two key features called support and confidence, as defined in Definition 2 and Definition 3, respectively.
Definition 2: The support ($s$) for an association rule $X \Rightarrow Y$ is the percentage of tuples in the database that contain $X \cup Y$.
Definition 3: The confidence or strength ($\alpha$) for an association rule $X \Rightarrow Y$ is the ratio of the number of tuples that contain $X \cup Y$ to the number of tuples that contain $X$.
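Definitions 2 and 3 can be illustrated with a short Python sketch. The function names and the four-tuple toy database below are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
from typing import FrozenSet, Iterable, List

Tuple_ = FrozenSet[str]  # one database tuple: the set of items it contains


def support(itemset: Iterable[str], database: List[Tuple_]) -> float:
    """Fraction of tuples in the database that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in database if items <= t) / len(database)


def confidence(x: Iterable[str], y: Iterable[str], database: List[Tuple_]) -> float:
    """Number of tuples containing X u Y divided by the number containing X."""
    x_items = frozenset(x)
    xy_items = x_items | frozenset(y)
    n_x = sum(1 for t in database if x_items <= t)
    n_xy = sum(1 for t in database if xy_items <= t)
    return n_xy / n_x if n_x else 0.0


# Hypothetical toy database (assumption: not data from the paper).
toy_db = [frozenset(t) for t in (
    {"I1", "I2", "I3"},
    {"I1", "I2"},
    {"I2", "I3"},
    {"I1", "I3"},
)]

rule_support = support(["I1", "I2"], toy_db)           # 2 of 4 tuples
rule_confidence = confidence(["I1"], ["I2"], toy_db)   # 2 of the 3 tuples with I1
```

For the rule I1 => I2 in this toy database, the support is 2/4 and the confidence is 2/3, so the rule would pass thresholds of s = 50% and a confidence of 60%.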

B. Large Itemsets
The most common approach to finding association rules is to break up the problem into two parts:
1) Find large itemsets as defined in Definition 4.
2) Generate rules from frequent itemsets.
An itemset is any subset of the set of all items, $I$.
Definition 4: A large (frequent) itemset is an itemset whose number of occurrences is above a threshold $s$.
We use the notation $L$ to indicate the complete set of large itemsets and $l$ to indicate a specific large itemset. Once the large itemsets have been found, we know that any interesting association rule $X \Rightarrow Y$ must have $X \cup Y$ in this set of frequent itemsets. Note that any subset of a large itemset is also large. When all large itemsets are found, generating the association rules is straightforward. Algorithm 1 outlines this technique.

Algorithm 1:
Input:
    $I$  // Items
    $L$  // Large itemsets
    $s$  // Support
    $\alpha$  // Confidence
Output:
    $R$  // Association rules satisfying $s$ and $\alpha$
Algorithm:
    $R = \emptyset$;
    for each $l \in L$ do
        for each $x \subset l$ such that $x \neq \emptyset$ do
            if $\mathrm{support}(l) / \mathrm{support}(x) \geq \alpha$ then
                $R = R \cup \{x \Rightarrow (l - x)\}$;

C. Apriori Algorithm
The Apriori algorithm uses the following property, which we call the large itemset property: any subset of a large itemset must be large. The large itemsets are also said to be downward closed, because if an itemset satisfies the minimum support requirement, so do all of its subsets. The basic idea of the Apriori-Gen algorithm is to generate candidate itemsets of a particular size and then scan the database to count them and see whether they are large. During scan $i$, candidates of size $i$, denoted $C_i$, are counted. Only those candidates that are large are used to generate candidates for the next pass; that is, $L_i$ is used to generate $C_{i+1}$. An itemset is considered as a candidate only if all of its subsets are also large. To generate candidates of size $i + 1$, joins are made of the large itemsets found in the previous pass.

D. Data Set and Preparation
According to the Human Genome Project [10], over 99.9% of one individual's DNA sequence is identical to that of another person.
Over 80% of the 0.1% difference consists of single nucleotide polymorphisms, which promise to significantly advance our ability to understand and treat human disease. In short, a SNP is a single base substitution of one nucleotide with another. Both substitutions have to be

Fig. 1. Genotype graph $X = \{H, G\}$ (legend: case, control).

observed in the general population at a frequency greater than 1%. For example, if individual A has the sequence GAACCT while individual B has the sequence GAGCCT, the polymorphism is A/G. Each individual has many single nucleotide polymorphisms that together create a unique DNA pattern for that person.
Recent work has suggested that SNPs in human populations are not inherited independently; rather, sets of adjacent SNPs are present on alleles in a block pattern, a so-called haplotype. Many haplotype blocks in humans have been transmitted through many generations without recombination. This means that although a block may contain many SNPs, it takes only a few SNPs to identify or tag each haplotype in the block. A genome-wide haplotype would comprise half of a diploid genome, including one allele from each allelic gene pair. The genotype is the descriptor of the genome, which is the set of physical DNA molecules inherited from the organism's parents. A pair of haplotypes constitutes a genotype. SNPs are bi-allelic; an allele is denoted 0 if it is the major allele and 1 otherwise. If both haplotypes carry the same allele, the corresponding genotype is homozygous and is represented as 0 or 1. If the two haplotypes differ, the genotype is represented as 2.
The case-control sample population consists of $N$ individuals, each represented by a genotype with $M$ SNPs. Each SNP attains one of the three values 0, 1, or 2. The sample $G$ is a (0, 1, 2)-valued $N \times M$ matrix, where each row corresponds to an individual and each column corresponds to a SNP. Based on the matrix we construct a genotype graph $X = \{H, G\}$, where the vertices $H$ are distinct haplotypes and the edges $G$ are genotypes, each connecting its two haplotypes (vertices). There are two types of edges in the graph: case and control. The construction of the graph for a case-control sample is illustrated in Fig. 1.
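The 0/1/2 genotype encoding described above can be sketched in a few lines of Python. The function name and the two example haplotypes are hypothetical and serve only to illustrate the rule: equal alleles keep their value, differing alleles become 2.

```python
from typing import List, Sequence


def genotype_from_haplotypes(h1: Sequence[int], h2: Sequence[int]) -> List[int]:
    """Combine two 0/1 haplotypes into one 0/1/2 genotype, SNP by SNP."""
    if len(h1) != len(h2):
        raise ValueError("both haplotypes must cover the same SNPs")
    # Same allele on both haplotypes -> keep it (0 or 1); different -> 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Hypothetical haplotype pair over four SNPs (assumption: illustrative values).
hap_a = [1, 0, 0, 0]
hap_b = [1, 0, 0, 1]
genotype = genotype_from_haplotypes(hap_a, hap_b)  # [1, 0, 0, 2]
```

In the genotype graph, such a genotype would be the edge joining the two vertices hap_a and hap_b; stacking one genotype per individual gives the rows of the sample matrix.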
Constructing a complete human haplotype map is helpful when associating complex diseases with their related SNPs. Unfortunately, the number of SNPs is very large and it is costly to sequence many individuals. Therefore, it is desirable to reduce the number of SNPs that should be sequenced to a small number of informative representatives

called tag SNPs. On the other hand, these important tag SNPs, or subsets of genotypes/haplotypes, are probably responsible for diseases. We want to minimize the number of tag SNPs without losing any disease information. The greedy algorithm for finding disease-tagging SNPs [11] is applied to $X$: we drop certain SNPs (or, equivalently, keep only certain tag SNPs). Dropping a SNP may result in the collapse of certain vertices in $X$, i.e., different vertices become identical. Collapsing vertices may also result in collapsing certain edges (genotypes). Dropping a SNP is not allowed if it collapses edges from the case and control populations, but collapsing edges from the same population is allowed. A simple greedy strategy consists of (1) traversing all the SNPs and (2) dropping a SNP if it is allowed; this results in a minimal subset of SNPs that does not collapse genotypes from opposite populations. Our experiments show that we are left with 19 tag SNPs out of 92 for the type 2 diabetes data set.

E. Apriori Algorithm in Association Study
Each of the 19 SNPs obtained from the tagging method described previously can be viewed as an item $I_i$, and the itemset is $I = \{I_1, I_2, \ldots, I_{19}\}$. Because each SNP has 3 possible values, 0, 1 or 2, each item could be divided into 3 items. From observation we found that usually only the major allele 0 is a large item, so we can ignore the other 2 values. We have 125 individual genotypes, each of which is a tuple, so the data set is $D = \{t_1, t_2, \ldots, t_{125}\}$, where $t_i = \{I_{i1}, I_{i2}, \ldots, I_{i19}\}$ and $I_{ij} \in I$. Now the itemset has 19 items and the database has 125 tuples, and we can start the first scan to find large item candidates with $s = 50\%$ and $\alpha = 60\%$. In the first scan, we found that 11 of the 19 candidates are large items. During the second scan, we need to combine each of these 11 items with all the others, which gives $C_{11}^{2} = \binom{11}{2} = 55$ candidates. Of these, 7 candidates are large.
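One way to read the item encoding above is sketched below, under the assumption that item I_j is present in a tuple exactly when SNP j has genotype 0 for that individual. The helper name and the small genotype matrix are hypothetical; the last line only checks the candidate count for the second scan.

```python
from math import comb
from typing import FrozenSet, List, Sequence


def genotypes_to_tuples(matrix: Sequence[Sequence[int]]) -> List[FrozenSet[str]]:
    """Turn each genotype row into the set of items I_j whose SNP j equals 0.

    Assumption: only the value 0 is kept as an item; values 1 and 2 are ignored.
    """
    return [
        frozenset(f"I{j + 1}" for j, g in enumerate(row) if g == 0)
        for row in matrix
    ]


# Hypothetical 3 x 3 genotype matrix (assumption: not the study's data).
toy_matrix = [
    [0, 1, 0],
    [0, 0, 2],
    [2, 0, 0],
]
tuples = genotypes_to_tuples(toy_matrix)

# Number of 2-item candidates that can be formed from 11 large items.
pair_candidates = comb(11, 2)  # 55
```

With this encoding, counting how many tuples contain a given item, divided by the number of tuples, gives the support used in the first scan.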
When we apply Apriori-Gen at this level, we join any set with another set that has one item in common. For example, $\{I_2, I_3\}$ is joined with $\{I_2, I_7\}$, but not with $\{I_7, I_9\}$. When two sets are joined, the new item is added to the set. After the join, we have 7 candidates for the next scan. There are 4 large itemsets after scan 3. After the join, we have 4 candidates for scan 4, and only one large itemset, $\{I_2, I_9, I_{11}, I_{13}, I_{17}\}$, is left after scan 4. The scan process is shown in Table I.

F. Relative Risk & Odds Ratio
To show that the interaction of these 5 SNPs is associated with the disease, we need to compute the relative risk and the odds ratio. The relative risk (RR) is the risk of an event (or of developing a disease) relative to exposure. Relative risk is the ratio of the probability of the event occurring in the exposed group to that in the non-exposed group. In our study, the relative risk is the ratio of the probability of the large itemset $\{I_2, I_9, I_{11}, I_{13}, I_{17}\}$ occurring in the case group to that in the control group:

$$RR = \frac{p_{\text{case}}}{p_{\text{control}}} \tag{1}$$
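As a small worked illustration, the sketch below computes the relative risk of Eq. (1) together with the odds ratio that is defined later in this section. The carrier counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not the counts behind the reported values of 2.14 and 2.92.

```python
def relative_risk(p_case: float, p_control: float) -> float:
    """RR: probability of the itemset in the case group over the control group."""
    return p_case / p_control


def odds_ratio(p: float, q: float) -> float:
    """OR: odds of the itemset in the case group over odds in the control group."""
    return (p / (1 - p)) / (q / (1 - q))


# Hypothetical counts (assumption): individuals carrying the large itemset.
case_carriers, case_total = 30, 50
control_carriers, control_total = 15, 50

p = case_carriers / case_total            # 0.6
q = control_carriers / control_total      # 0.3

rr = relative_risk(p, q)   # 0.6 / 0.3, about 2.0
odds = odds_ratio(p, q)    # (0.6 / 0.4) / (0.3 / 0.7), about 3.5
```

Both measures exceed 1 in this toy example, which is the pattern that indicates the itemset is more frequent among cases than among controls.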

TABLE I
APRIORI-GEN ALGORITHM SCAN

Scan 1
Candidates: I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, I12, I13, I14, I15, I16, I17, I18, I19
Large itemsets: I1, I2, I3, I4, I7, I9, I11, I12, I13, I15, I17

Scan 2
Candidates: {I1, I2}, {I1, I3}, {I1, I4}, {I1, I7}, {I1, I9}, {I1, I11}, {I1, I12}, {I1, I13}, {I1, I15}, {I1, I17}, {I2, I3}, {I2, I4}, {I2, I7}, {I2, I9}, {I2, I11}, {I2, I12}, {I2, I13}, {I2, I15}, {I2, I17}, {I3, I4}, {I3, I7}, {I3, I9}, {I3, I11}, {I3, I12}, {I3, I13}, {I3, I15}, {I3, I17}, {I4, I7}, {I4, I9}, {I4, I11}, {I4, I12}, {I4, I13}, {I4, I15}, {I4, I17}, {I7, I9}, {I7, I11}, {I7, I12}, {I7, I13}, {I7, I15}, {I7, I17}, {I9, I11}, {I9, I12}, {I9, I13}, {I9, I15}, {I9, I17}, {I11, I12}, {I11, I13}, {I11, I15}, {I11, I17}, {I12, I13}, {I12, I15}, {I12, I17}, {I13, I15}, {I13, I17}, {I15, I17}
Large itemsets: {I2, I3}, {I2, I9}, {I2, I17}, {I9, I11}, {I11, I13}, {I11, I17}, {I13, I17}

Scan 3
Candidates: {I2, I3, I9}, {I2, I3, I17}, {I2, I7, I11}, {I2, I9, I17}, {I2, I11, I17}, {I7, I11, I13}, {I7, I11, I15}
Large itemsets: {I2, I3, I9}, {I2, I9, I11}, {I2, I13, I17}, {I11, I13, I17}

Scan 4
Candidates: {I2, I3, I9, I11}, {I2, I3, I9, I13, I19}, {I2, I9, I11, I13, I17}, {I2, I11, I13, I17}
Large itemsets: {I2, I9, I11, I13, I17}
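The level-wise scan summarized in Table I can be sketched as follows. This is a minimal illustration of the general Apriori-Gen idea (join large itemsets that differ in one item, prune candidates with a non-large subset, then count), not the authors' implementation. The function names and the toy database are hypothetical, and treating "support at least s" as large is an assumption.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def apriori_gen(prev_large: Set[Itemset]) -> Set[Itemset]:
    """Build size k+1 candidates from the large itemsets of size k."""
    candidates: Set[Itemset] = set()
    for a, b in combinations(prev_large, 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue  # join only sets that share all but one item
        # Prune: every subset of size k must itself be large.
        if all(frozenset(sub) in prev_large
               for sub in combinations(union, len(a))):
            candidates.add(union)
    return candidates


def apriori(database: List[Itemset], min_support: float) -> Dict[int, Set[Itemset]]:
    """Return the large itemsets found in each scan, keyed by itemset size."""
    n = len(database)
    candidates = {frozenset([item]) for t in database for item in t}
    large_by_size: Dict[int, Set[Itemset]] = {}
    size = 1
    while candidates:
        # One scan of the database: count each candidate of the current size.
        large = {c for c in candidates
                 if sum(1 for t in database if c <= t) / n >= min_support}
        if not large:
            break
        large_by_size[size] = large
        candidates = apriori_gen(large)
        size += 1
    return large_by_size


# Hypothetical toy database (assumption: not the study's 125 genotypes).
toy_db = [frozenset(t) for t in (
    {"I1", "I2", "I3"},
    {"I1", "I2", "I3"},
    {"I1", "I2"},
    {"I3", "I4"},
)]
result = apriori(toy_db, min_support=0.5)
```

In this toy run, I4 is dropped in the first scan, all three pairs of the remaining items are large in the second, and the single triple {I1, I2, I3} survives the third scan, after which no further candidates can be generated.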

A relative risk of 1 means there is no difference in risk between the two groups. An RR less than 1 means the event is less likely to occur in the experimental group than in the control group. An RR greater than 1 means the event is more likely to occur in the experimental group than in the control group.
The odds ratio (OR) is defined as the ratio of the odds of an event occurring in the case group to the odds of it occurring in the control group. If the probabilities of the large itemset in the two groups are $p$ (case group) and $q$ (control group), then the odds ratio is:

$$OR = \frac{p/(1-p)}{q/(1-q)} \tag{2}$$

An odds ratio of 1 indicates that the condition or event under study is equally likely in both groups. An odds ratio less than 1 indicates that the condition or event is less likely in the first group, and an odds ratio greater than 1 indicates that it is more likely in the first group. The relative risk of the large itemset is 2.14 and the odds ratio is 2.92, which means the large itemset $\{I_2, I_9, I_{11}, I_{13}, I_{17}\}$ is well associated with type 2 diabetes.

III. CONCLUSIONS
In this paper, we discuss the potential of applying the Apriori-Gen algorithm to the association study for type 2 diabetes. An interaction of multiple SNPs is found with a support $s$ of 50% and a confidence $\alpha$ of 60%, and it is shown to be associated with type 2 diabetes, with a relative risk RR of 2.14 and an odds ratio OR of 2.92. In future work we will continue to validate the proposed method and improve its performance.

REFERENCES
[1] Type 2 Diabetes, http://ezinearticles.com/?Type-2-Diabetes
[2] Cardon, L.R., Bell, J.I.: Association Study Designs for Complex Diseases. Nature Reviews Genetics, Vol. 2 (2001), 91-98.
[3] Hirschhorn, J.N., Daly, M.J.: Genome-wide Association Studies for Common Diseases and Complex Diseases. Nature Reviews Genetics, Vol. 6 (2005), 95-108.
[4] Merikangas, K.R., Risch, N.: Will the Genomics Revolution Revolutionize Psychiatry? The American Journal of Psychiatry 160 (2003), 625-635.
[5] Botstein, D., Risch, N.: Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease. Nature Genetics 33 (2003), 228-237.
[6] Clark, A.G., Boerwinkle, E., Hixson, J., Sing, C.F.: Determinants of the Success of Whole-Genome Association Testing. Genome Res. 15 (2005), 1463-1467.
[7] He, J., Zelikovsky, A.: Tag SNP Selection Based on Multivariate Linear Regression. Proc. of International Conference on Computational Science (2006), LNCS 3992, 750-757.
[8] Brinza, D., He, J., Zelikovsky, A.: Combinatorial Search Methods for Multi-SNP Disease Association. Proc. of International Conference of the IEEE Engineering in Medicine and Biology (2006), 5802-5805.
[9] Dunham, M.H.: Data Mining: Introductory and Advanced Topics. Prentice Hall, ISBN 0-13-088892-3.
[10] The Human Genome Project, http://www.ornl.gov.
[11] Mao, W.: A Haplotype Based Method for Genetic Disease Susceptibility Prediction. The 2nd International Conference on Bioinformatics and Biomedical Engineering (ICBBE2008), 2008, 478-481.
