
Evaluating Data Mining Tools for Large Memory, High Performance Computing

Paul Rodriguez, prodriguez@sdsc.edu
Nicole Wolter, nickel@sdsc.edu
Adam Jundt, ajundt@sdsc.edu

Project Overview
Increasingly, scientific breakthroughs will depend on advanced computing capabilities that help researchers manipulate and explore massive datasets. The trouble is that science and society produce so much data (internet documents, brain images, climate measurements, etc.), and this data often needs to be analyzed as one big interdependent mass instead of as small local pieces. SDSC is currently deploying special large, flash-storage, parallel computer architectures that can handle such data-intensive analysis. One type of analysis is pattern discovery, which falls under the rubric of data mining. The goal of this research project is to evaluate data mining tools and compare the performance of select algorithms on a data-intensive high performance computing (HPC) platform versus other traditional HPC platforms. The student will be involved in the analysis, evaluation, and comparison of data mining in these environments, and will be exposed to data mining algorithms and their potential application to a variety of data sources.

Proposal
The student will be part of a team of researchers and programmers located at the San Diego Supercomputer Center (SDSC). The major learning activities for the student will be to investigate the implementations and tools associated with data mining and to become familiar with the HPC environment. The major research activity for the student will be to find examples of large scientific data, run one or two data mining algorithms against the designated data set, and compare the performance across a well-defined set of parameters (system, scaling, algorithms, and data setup).
The learning activity will involve independent research and guided exercises pertaining to data mining and parallel computing. The student will be expected to provide written or oral synopses of scientific papers and experimental findings. Initial background research, as well as system and tool evaluation, is estimated to last 2-4 weeks. Research and system familiarization are expected to continue throughout the project. The data mining tool(s) will be available and implemented on the systems prior to the student's internship; however, the parameters for executing the tool will have to be explored. As such, the student will take a lead role in learning the tools (i.e. relevant algorithms) and the manner in which the algorithms are executed.
The research activity will involve working with the PIs to find a large scientific data set that exists in the public domain and to understand the basic kinds of questions that are addressed with such data. The student will then apply the data mining algorithms to the scientific data. This application will address basic issues such as how much data to use, what kind of preprocessing is necessary in an HPC environment (over and above a non-HPC one), and what kind of performance results can be delivered over some range of HPC parameters and algorithm parameters. Specific examples of data will come from fields such as computational genomics, text mining, and biomedical image processing, as well as additional fields that the student may be interested in. Within the chosen field, the student will get a topical overview of the specific algorithms for pattern analysis, e.g. matrix factorization, sequence identification, data clustering, rule discovery, etc.
Specific resources the student can access will be the three supercomputers available at SDSC: Dash, a large virtual shared memory + flash storage system; Triton, a large memory system; and Trestles, a flash storage system. Since each system has a different architecture, the tradeoffs between the architectures and how they affect the performance of the software will be explored.
We envision that this research will be suitable for a poster presentation or as part of a larger analysis our group will present at a future conference.
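
The comparison across a well-defined set of parameters (system, scaling, algorithms, and data setup) could be organized as a small timing grid. The following is only a minimal sketch under stated assumptions: it uses synthetic data and a plain-Python k-means as a stand-in for whichever data mining tool is actually installed, and the names `make_points`, `kmeans`, and `run_grid` are hypothetical, not part of any SDSC tool.

```python
import random
import time
from itertools import product


def make_points(n, dim, seed=0):
    """Generate n synthetic points in [0, 1)^dim (a stand-in for real data)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def run_grid(sizes, ks, dim=4):
    """Time k-means for every (data size, k) pair and return result rows."""
    rows = []
    for n, k in product(sizes, ks):
        points = make_points(n, dim)
        start = time.perf_counter()
        kmeans(points, k)
        elapsed = time.perf_counter() - start
        rows.append({"n_points": n, "k": k, "seconds": elapsed})
    return rows


if __name__ == "__main__":
    # Illustrative grid only; real runs would vary system and data setup too.
    for row in run_grid(sizes=[1000, 4000], ks=[2, 8]):
        print(f"n={row['n_points']:>6}  k={row['k']:>2}  {row['seconds']:.3f} s")
```

Running the same grid on each system would give one table per machine, and those tables could then be compared side by side. Timings depend on the machine, so no expected output is shown.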

Details
We anticipate this project will support one student for a single quarter/semester. The project will be overseen by PI Paul Rodriguez, with additional support from Nicole Wolter and Adam Jundt (SDSC User Support team). The student will be working with full-time SDSC staff on this project. There are no other students currently involved. The research experience team will hold a weekly meeting to share findings and evaluate the progress and direction of the project. Individual mentors will be available for consultation as needed.

Prerequisites
Solid background in quantitative and analytical skills
Experience with programming and scripting languages (C, C++, Fortran, PERL, Matlab)
Strong communication skills in written and spoken English
Strong problem solving capabilities

Relevant URLs
http://www.sdsc.edu/us/resources/
www.sdsc.edu/Events/gcdid2010
