Susan Wei

RESEARCH STATEMENT
Susan Wei (susanwe@live.unc.edu)
Ninety percent of the data stored in the world today has been created in the past two years. This data revolution, which some have dubbed Big Data, has opened up many exciting areas in statistics. The challenge of Big Data is not simply that it is big. While classical statistics deals with lists of numbers, modern data comes to us as texts, videos, images, and tweets. My research is in the development of statistical methodology which brings together techniques from machine learning, high dimensional low sample size asymptotics, and empirical processes to tackle the challenges presented by modern data types. Below I discuss my dissertation research and ongoing work, followed by an outline of my longer-term agenda.

The Blessing of Dimensionality


Many modern datasets are characterized by high dimensionality. In a typical gene expression study, for instance, the number of genes far exceeds the number of subjects on which they are measured. Dimensionality poses a serious challenge because the amount of data needed to support classical methods of establishing statistical significance grows very rapidly with the number of variables involved. This is the so-called curse of dimensionality. Over the last few decades, many innovative statistical methods have been proposed for high dimensional settings. One general approach is to fix classical techniques. Another approach is to instead leverage the quirks of high dimensionality. For instance, many binary classification problems involve data that are not linearly separable, yet a mapping to a higher dimensional space can induce separability. This is the flip side of the curse: the blessing of dimensionality.
In my dissertation work, I studied this aspect of high dimensionality in the context of a two-sample hypothesis testing framework called DiProPerm (Wei et al. 2013). DiProPerm is a powerful non-parametric procedure that can be used to test for equal distributions or equal means. The method relies on the high accuracy of binary linear classifiers in high dimensional low sample size (HDLSS) settings. Binary linear classifiers are also easy to interpret: the coefficients in the linear classifier can be loosely interpreted as indicating a variable's importance. DiProPerm provides a way to assess when such an interpretation can be relied upon. The methodology has been successful in offering useful insight in various biological applications (Clement 2012; Miedema et al. 2012; Shen 2012; Segall et al. 2010; Bradford et al. 2011).
As DiProPerm is especially designed for HDLSS data, the most appropriate asymptotic regime to consider for its limiting behavior is one where dimension goes to infinity for fixed sample size. This is to be contrasted with the standard asymptotic regime, which requires sample size to go to infinity for fixed dimension. The asymptotic regime in random matrix theory, on the other hand, requires both sample size and dimension to go to infinity. All three asymptotic regimes provide useful perspectives, but far less work has been done on HDLSS asymptotics. I show that certain DiProPerm tests, under rather natural settings, are inconsistent under the HDLSS asymptotic regime (Wei et al. 2013). The result offers guidance on the use of DiProPerm in practice. It is also interesting because it runs contrary to the canon of hypothesis testing, under which most reasonable tests have power converging to one as sample size goes to infinity.
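As a rough illustration of the direction, projection, and permutation steps that give DiProPerm its name, the following Python sketch may help. It is not the implementation from Wei et al. (2013). The mean-difference direction (standing in for a binary linear classifier), the difference of projected means as the test statistic, the add-one p-value, and the function name `diproperm_sketch` are all assumptions chosen for brevity.

```python
import numpy as np


def diproperm_sketch(x, y, n_perm=1000, seed=0):
    """Two-sample permutation test along a data-driven direction.

    x, y: arrays of shape (n_x, d) and (n_y, d).
    Returns the observed statistic and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.vstack([x, y])
    n_x = x.shape[0]

    def statistic(a, b):
        # Direction: normalized mean difference (an assumed, simple choice
        # of linear classifier direction).
        direction = a.mean(axis=0) - b.mean(axis=0)
        norm = np.linalg.norm(direction)
        if norm == 0.0:
            return 0.0
        direction = direction / norm
        # Projection: reduce each sample to a one-dimensional score, then
        # summarize by the gap between the two projected group means.
        return float((a @ direction).mean() - (b @ direction).mean())

    observed = statistic(x, y)

    # Permutation: shuffle the group labels, then recompute the direction
    # and the statistic from scratch for every relabeling.
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        perm_stats[i] = statistic(pooled[idx[:n_x]], pooled[idx[n_x:]])

    # Add-one correction keeps the p-value strictly positive.
    p_value = float((1 + np.sum(perm_stats >= observed)) / (n_perm + 1))
    return observed, p_value
```

Calling `diproperm_sketch(x, y)` on two samples with far more columns than rows mimics the HDLSS setting. The point of recomputing the direction inside each permutation is that the direction is fitted to the labels, so the null distribution has to reflect that fitting step as well.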

The Doctor Can See You Now. Like, Right Now.


Telemedicine is the use of telecommunication and information technologies to provide health care at a distance. An important application of telemedicine is the management of chronic diseases such as diabetes. One tool for this purpose is the diabetes diary, now widely available as a smartphone application, which allows the patient to easily record blood glucose readings, insulin dose, carbohydrate intake, and exercise throughout the day. The idea is that the patient's healthcare team should periodically examine the records to make appropriate adjustments. The benefits of using a diabetes diary include a reduction in the number of unnecessary hospital visits and the ability to receive real-time feedback from physicians remotely. Unfortunately, patients and doctors lack the analytic tools to associate blood glucose trends with factors such as diet, exercise, and insulin administration. I am currently collaborating with a group of researchers at the Tromso Telemedicine Laboratory in Norway to develop such tools. An intermediate goal of my work is to develop a method that alerts patients to aberrant blood glucose readings. A longer-term goal of the project is to implement a personalized regimen for each patient that provides suggestions about food, exercise, and insulin dosage to keep blood glucose in an acceptable range. This work will bring together techniques from machine learning, time series, and reinforcement learning, and will motivate the invention of new statistical methodology.
My current work in telemedicine is in many ways a natural extension of my dissertation work in the area of personalized medicine. Personalized medicine is a medical model based on the notion that an individual's genetic information, along with other characteristics, can be used to customize healthcare decisions. In fact, some have called telemedicine "personalized medicine at a distance". In my dissertation, I studied the discovery and estimation of treatment effect heterogeneity in personalized medicine. This is the common phenomenon in medicine where the same treatment can have different effects on different patients. To discover subgroups of patients who respond differently to treatment, I considered a model where a linear hyperplane in the covariate space separates the subjects into two groups with different treatment responses (Wei and Kosorok 2013c). For example, suppose the outcome of interest is survival and the covariates are gene expressions. The method I developed can be applied to discover a classification rule based on gene expression that divides the population into two subgroups for whom treatment has different effects on survival. The hyperplane gives a clear classification rule that can be used in future studies. Furthermore, the coefficients of the linear classifier roughly indicate a gene's importance.
The work on discovering and estimating treatment effect heterogeneity is an extension of a machine learning concept I developed called Latent Supervised Learning (Wei and Kosorok 2013a). Latent Supervised Learning is the task of inferring a binary classification rule from input data where only a continuous surrogate variable is available for training. This type of learning falls in the middle of the machine learning spectrum that has clustering (unsupervised learning) at one end and classification (supervised learning) at the other. I first developed Latent Supervised Learning in the context of continuous surrogate variables that are Gaussian distributed (Wei and Kosorok 2013a). A follow-up paper extends the original methodology to the context of censored survival outcomes (Wei and Kosorok 2013b). The method for estimating treatment effect heterogeneity builds upon the tools used in these previous papers and further extends the possible distributions of the outcome variable to include any distribution in the exponential family.

http://www.swei.web.unc.edu/

Tools from empirical processes provide an elegant approach for studying the theoretical properties of general machine learning objects. Despite the wide applicability of this theoretical branch of statistics, the barrier to entry remains high. In my work on Latent Supervised Learning, tools from empirical processes have proved crucial in establishing the consistency properties of various estimators. Work on establishing weak convergence results is currently in progress. The extension of Latent Supervised Learning to ultra-high dimensional settings is another future research goal.
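To make the Latent Supervised Learning setup in this section more concrete, here is a toy Python sketch under strong simplifying assumptions. It is not the estimator from Wei and Kosorok (2013a). It assumes a single covariate, so that the separating hyperplane reduces to a threshold, and a Gaussian surrogate with a common variance and a group-specific mean, in which case maximizing the likelihood over the split is equivalent to minimizing the within-group sum of squares. The function name `fit_threshold_rule` and the `min_group` parameter are hypothetical.

```python
import numpy as np


def fit_threshold_rule(x, y, min_group=2):
    """Infer a binary rule x > threshold from a continuous surrogate y.

    x: (n,) covariate values; y: (n,) surrogate values.
    Returns (threshold, mean_low, mean_high), where the means are the
    surrogate averages on each side of the threshold.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)

    best = None
    for k in range(min_group, n - min_group + 1):
        if xs[k - 1] == xs[k]:
            continue  # no threshold can separate tied covariate values
        low, high = ys[:k], ys[k:]
        # Within-group sum of squares: the Gaussian negative log-likelihood
        # up to constants when both groups share one variance.
        sse = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            threshold = 0.5 * (xs[k - 1] + xs[k])
            best = (sse, threshold, float(low.mean()), float(high.mean()))

    if best is None:
        raise ValueError("not enough distinct covariate values to split")
    return best[1], best[2], best[3]
```

The group labels are never observed here; only the surrogate `y` guides the choice of threshold, which is what places this kind of task between clustering and classification.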

Putting a Face to the Name


The incidence of malignant melanoma has been increasing at an alarming rate over the last few decades, and the disease causes approximately 8,000 deaths per year in the United States. Furthermore, it affects many more young individuals than other solid tumors do. Even more unfortunate, early detection is difficult due to the overlap in clinical and histologic appearances of melanomas with highly prevalent benign melanocytic nevi (moles). In the past decade, cancer genome projects have been very successful at identifying rich sets of oncogenesis-related genes. The recent discovery of several key genetic mutations commonly found in malignant melanoma holds promise for better targeted therapy. Less clear, however, is the path from cancer genotype to cancer phenotype. In other words, how does a certain mutation manifest itself visually? Malignant melanoma is well suited for this type of study as there is a wealth of genomic and imaging data for the disease. Based on the tenet that cancer genotype and phenotype are intricately linked, my research seeks to develop a unified framework for the integrative study of malignant melanoma by drawing upon recent advances in pathology, cancer genomics, medical imaging analysis, and statistics.
One major component of this research is the extraction of relevant features from melanoma images. For this I have considered traditional histologic images as well as dermoscopic images, a low-cost alternative to the former. Histopathology, the microscopic examination of tissue, continues to be considered the gold standard for cancer diagnosis. Despite this, histopathology is also notorious for high inter-observer variability. This has given rise to computer-assisted analytical approaches powered by tools from image analysis. In my dissertation work I collaborated with the Lineberger UNC Melanoma Group on the development of image-derived histologic features. My work contributed to an automatic feature extraction process capturing features related to color, geometry, and texture, which were shown to have high power in discriminating between different types of melanocytic nevi (Miedema et al. 2012). In collaboration with researchers at the University of Tromso in Norway, I have also started developing feature extraction techniques for dermoscopic images. Dermoscopy is a recent advancement in the diagnosis of melanomas. It is a non-invasive technique that involves the examination of skin lesions with a dermoscope, a hand-held optical device that typically consists of a magnifying lens and a light source, and it can reveal more global structures of melanoma than histologic images.
The three datatypes involved in this research (genetic data, histology images, and dermoscopic images) reveal different aspects of the same underlying melanoma. Current and future work will focus on the creation of new statistical methodologies which can mine common patterns across all datatypes while finding unique patterns within each. The statistical tools motivated by this application have great potential for wide-ranging applications to other cancers and beyond.

Long-term goals
My long-term research goals, beyond continuing to work on the aforementioned extensions and applications, are to develop statistical models for modern data types. Thus far statisticians have been very successful at modeling entities on the real line. However, salient models for modern data types, such as medical images, are still lacking. This is not surprising given that we have few distributions to describe even multivariate data besides the multivariate Gaussian distribution. My plan is to invent new paradigms of statistical modeling, either mathematical or simulated in nature, that can be used for a wide range of modern scientific inquiries.
Another long-term goal I have is to contribute to the advancement of statistics by elevating the statistical rigor of scientific applications. I will accomplish this through better statistical outreach and dissemination of research results. For the former, I will offer consulting not only on statistical data analysis but also on how to use statistics to plan studies. For the latter, I will focus on developing good strategies for data management and study reproducibility in my collaborations. This includes making my software freely available and publishing all code needed to reproduce my results.

References
Bradford, Blair U, Lock, Eric F, Kosyk, Oksana, Kim, Sungkyoon, Uehara, Takeki, Harbourt, David, DeSimone, Michelle, Threadgill, David W, Tryndyak, Volodymyr, Pogribny, Igor P, Bleyle, Lisa, Koop, Dennis R, and Rusyn, Ivan (2011). Interstrain differences in the liver effects of trichloroethylene in a multistrain panel of inbred mice. Toxicological Sciences 120 (1), pp. 206-17.
Clement, Meagan E. (2012). Analysis Techniques for Diffusion Tensor Imaging Data. PhD thesis. The University of North Carolina at Chapel Hill.
Miedema, Jayson, Marron, James Stephen, Niethammer, Marc, Borland, David, Woosley, John, Coposky, Jason, Wei, Susan, Reisner, Howard, and Thomas, Nancy E (2012). Image and statistical analysis of melanocytic histology. Histopathology 61 (3), pp. 436-444.
Segall, S.K., Nackley, A.G., Diatchenko, L., Lariviere, W.R., Lu, X., Marron, J.S., Grabowski-Boase, L., Walker, J.R., Slade, G., Gauthier, J., Bailey, J.S., Steffy, B.M., Maynard, T.M., Tarantino, L.M., and Wiltshire, T. (2010). Comt1 genotype and expression predicts anxiety and nociceptive sensitivity in inbred strains of mice. Genes, Brain, and Behavior 9 (8), pp. 933-46.
Shen, Dan (2012). Sparse PCA asymptotics and analysis of tree data. PhD thesis. University of North Carolina.
Wei, Susan and Kosorok, Michael R. (2013a). Latent Supervised Learning. Journal of the American Statistical Association 108 (503), pp. 957-970.
Wei, Susan and Kosorok, Michael R. (2013b). Latent Supervised Learning for Survival Data. Submitted. url: http://biostats.bepress.com/uncbiostat/art38.
Wei, Susan and Kosorok, Michael R. (2013c). Latent Supervised Learning for Treatment Effect Heterogeneity.
Wei, Susan, Lee, Chihoon, Wichers, Lindsay, and Marron, J.S. (2013). Direction-Projection-Permutation for High Dimensional Hypothesis Tests. url: http://arxiv.org/abs/1304.0796.
