Last Updated: September 4, 2013 DRAFT PROPOSAL Ph.D. in Analytics and Data Science Page 2
Institution: Kennesaw State University
Institutional Contact: Daniel Papp, President
Date: XXXX
School/Division: School of Science and Mathematics
Department: Department of Mathematics and Statistics
Departmental Contact: Jennifer Lewis Priestley
Name of Proposed Program: Ph.D. in Analytics and Data Science
Degree: Ph.D.
CIP Code: XXXX
Executive Summary
The McKinsey Global Institute has estimated that demand for deep analytical talent in the United States will outpace supply by almost 200,000 people within three years. In response, the White House has launched a Big Data Research and Development Initiative to expand the workforce needed to develop and use Big Data technologies. This theme is echoed by Thomas Davenport's recent article in Harvard Business Review titled "Data Scientist: The Sexiest Job of the 21st Century." These studies and many others point to the need for universities to educate and train Data Scientists to address this demand. However, no university in the country currently has a degree program in Data Science defined as the intersection of Statistics, Mathematics and Computer Science.
Kennesaw State University is proposing a Ph.D. in Analytics and Data Science, one of the first of its kind in the country. The degree will train individuals to translate large, unstructured, complex data into information to improve decision making. The curriculum will include programming, data mining, statistical modeling, and the mathematical foundations that support these concepts. Importantly, it will also emphasize communication skills, both oral and written, as well as application and the tying of results to business and research problems. Because this degree is a Ph.D. (rather than a Doctorate in Data Science), it creates flexibility for the student. Graduates can either pursue a position in the private or public sector as a practicing Data Scientist, where demand is expected to greatly outpace supply, or pursue a position within academia, where they would be uniquely qualified to teach these skills to the next generation.
Kennesaw State University is well positioned to launch this degree. This is evidenced by the success of the MS in Applied Statistics, whose graduates are in great demand and continue to have 100% placement, and by the Minor in Applied Statistics, which draws undergraduates from every college across the university and has similarly strong placement. The Minor attracts approximately 200 undergraduates every semester, making it the most successful and sought-after Minor in the history of KSU. The Ph.D. in Analytics and Data Science will not only help to close the talent gap in Data Science, but will also continue KSU's trajectory of regional and national recognition in applied analytics.
"There are three ways you can get to the top of a tree: 1) sit on an acorn 2) make friends with a bird 3) climb it." - Anonymous
SECTION 1. PROGRAM DESCRIPTION AND OBJECTIVES:
a. Objectives of the program
The Ph.D. in Analytics and Data Science at Kennesaw State University is an advanced degree that will prepare individuals to work as Data Scientists in the private or public sector or, secondarily, to work in academia within a department focused on the application of data analysis. This Ph.D. will take a multidisciplinary approach, with emphasis on Statistics, Mathematics, Computer Science and a content discipline such as Biology, Chemistry, Finance, Physics or Political Science.
b. Needs the program will meet
"I skate to where the puck is going to be, not where it has been." - Wayne Gretzky
The United States Federal Government recently issued a press release addressing what it sees as a growing, critical shortage of data analysts and, on March 29, 2012, launched the Big Data Research and Development Initiative. One of the main purposes of the initiative is to expand the workforce needed to develop and use Big Data technologies. The term "Big Data" is increasingly included in descriptions of required skill sets across a wide variety of disciplines and sectors of the economy. While the accepted definition of Big Data continues to evolve, there is no question about the prevalence of related concepts and their expanding role in the future. According to The Economist, unmanned American military aircraft (i.e., drones) flying over Iraq and Afghanistan in a single year (2009) produced approximately 24 years' worth of video surveillance footage. Every year, Google acquires an amount of data equivalent to the entire Library of Congress.
These astonishing facts highlight at least four major points about how data is collected, analyzed, and used:
1. Extraordinary, previously unimaginable amounts of data are being collected and stored for subsequent analysis. These data contain potentially significant and meaningful information for the private and public sectors and for society at large.
2. It is not feasible to manually review and/or analyze such massive data in a timely manner using traditional methods. Computer-assisted semi- or fully-automated processes using new computational and data mining methods are needed in order to extract useful information from massive data sources in a timely manner.
3. In addition to massive amounts of traditional structured data (i.e., tabular data), extraordinary amounts of unstructured, non-traditional data such as video footage, audio recordings, and unstructured text are being collected and stored. Increasingly, these two very different types of data must be merged in systematic ways in order to obtain useful information.
4. Unlike in the past, data collection and analysis is no longer a purely academic endeavor. Data gathering and analysis to obtain useful information, most often in support of decision making, now takes place in almost every field and sector imaginable, including the sciences, public health, the healthcare industry, all aspects of business and finance (including retail, insurance, marketing, the service industry, the credit industry, fraud detection, the communications industry, etc.), psychology, education, public policy agencies, government elections, and, critically, national security and defense.
From these four points it follows that the next generation of statisticians will face very different challenges and issues than previous generations. As a result, the next generation of statisticians requires a new set of knowledge and skills in order to effectively serve the data analysis needs of the 21st century. These skills will incorporate more emphasis on applied mathematics and on computer programming than has historically been the case, even for applied statisticians.
A recent article in Significance magazine made this point clear: "Traditional data is numbers. (Emerging data) is not. It is digital, but it is generated by all kinds of hardware and software. It is text, it is videos, it is tweets and Facebook pages; it is about transactions and interconnections by the billions. Data is the raw material of statistics, but traditional statistical disciplines will not cope; this new data needs new ways of handling it, of analyzing it, of thinking about it and using it." A large number of new and well-trained 21st century statisticians are needed in every sector of almost every society in the world, not just America, and not just developed countries.
The U.S. business community is also aware of this need. Hal Varian, Ph.D., the Chief Economist at Google, Inc., states simply, "Data are available; what is scarce is the ability to extract wisdom from them." Beyond the recognition of the talent shortage evidenced by the White House Big Data Research and Development Initiative, the Big Data Report from the McKinsey Global Institute (MGI) estimates that the demand for data analysts could exceed the current supply by 140,000 to 190,000 positions by the year 2018 (see Figure 1). Figure 1 illustrates that there are 440,000 to 490,000 total data analyst positions projected for 2018, with only 300,000 trained analysts to fill those positions. In other words, the demand for big data analysts could be roughly 50 to 60% greater than its projected supply by 2018.
FIGURE 1: The Talent Gap for Big Data Analysts
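As a quick check on the Figure 1 numbers cited above, the short Python sketch below recomputes the shortfall from the demand and supply figures quoted in the text. The function name talent_gap and the constant names are illustrative only and do not come from the MGI report; the sketch assumes the percentage is measured against the projected supply of 300,000.

```python
# Figures as quoted in the text from the MGI report (projections for 2018).
DEMAND_LOW = 440_000
DEMAND_HIGH = 490_000
SUPPLY = 300_000


def talent_gap(demand, supply):
    """Return (absolute shortfall, shortfall as a percentage of supply)."""
    shortfall = demand - supply
    return shortfall, 100.0 * shortfall / supply


if __name__ == "__main__":
    for label, demand in (("low", DEMAND_LOW), ("high", DEMAND_HIGH)):
        shortfall, pct = talent_gap(demand, SUPPLY)
        print(f"{label} estimate: {shortfall:,} positions short, {pct:.1f}% above supply")
```

Under that assumption the shortfall is 140,000 to 190,000 positions, or about 47% to 63% of supply, which is consistent with the rounded range of 50 to 60% given in the text.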
The MGI Big Data report also predicts that the gains from big data will differ across sectors. According to MGI, finance and government (Cluster B in Figure 2) are expected to benefit strongly from big data use in the future, whereas the computer and electronic products sector and the information sector (Cluster A in Figure 2) have already experienced, and will continue to experience, substantial benefits from the use of big data.
FIGURE 2: Differential Potential Gains of Big Data by Sector
A brief survey of the diverse disciplines which have recognized the role of Big Data and the changing role of analytics includes:
Business
Customer relationship management (CRM) is one of the most innovative and profitable ways in which businesses use big data. CRM is essentially the business practice of analyzing customer-centric big data to discover trends, and using that information to customize or personalize offers and communications with customers in order to optimize business. CRM was once used only by Fortune 500 companies. Now, with the proliferation of big data and the reduced cost of collecting and storing it, all types of companies are using it. In one example of a typical CRM application, a U.S. bank used big data analytics to predict which product offer was most likely to be accepted by a particular customer, and customized the next on-line product offered to that customer in an effort to cross-sell to existing customers (Berry & Linoff, 2000). This CRM initiative resulted in substantial gains in cross-selling, and therefore in profits to the bank well above the cost of implementation. This is just one of many examples of big data analytics in business. Others include fraud identification, service rate estimation, predicting product failure, and optimizing direct mailing campaigns. By all accounts, the main hindrance in CRM is a lack of qualified data analysts (The Economist, 2010; Significance, 2012).
Healthcare & Public Health
The proper use of digitized medical records has the potential to revolutionize the healthcare industry. Proper analysis of these records may be used to detect unwanted drug interactions and/or side effects, identify best practices in care (e.g., identify the most effective drug therapies), and even predict the onset of certain diseases before patients themselves are aware of symptoms (The Economist, 2012). In one example, medical doctors and data analysts in Alabama developed automated infection surveillance software that helps hospitals identify changes in nosocomial (i.e., hospital-acquired) infection rates, using massive data from Blue Cross/Blue Shield of Alabama together with statistical and data mining methods (Putman, 2003). It has been estimated that nosocomial infections add as much as nine days to a patient's hospital stay, leading to more than $4 million per year in additional expense. The infection surveillance software provides early warning to hospitals and allows them to intervene in a timely manner. This is only one of many possible examples where non-traditional statistical work involving big data has produced a substantial improvement in healthcare quality and substantial savings to society.
Government
According to the National Science Foundation document entitled "Core Techniques and Technologies for Advancing Big Data Science & Engineering" (NSF 12-499; NSF, 2012), big data is causing a paradigm shift in scientific and biomedical investigation that is transforming the missions of a number of U.S. Federal Government agencies: "Today, US government agencies recognize that the scientific, biomedical and engineering research communities are undergoing a profound transformation with the use of large-scale, diverse, and high-resolution data sets that allow for data-intensive decision-making, including clinical decision making, at a level never before imagined. New statistical and mathematical algorithms, prediction techniques, and modeling methods, as well as multidisciplinary approaches to data collection, data analysis and new technologies for sharing data and information are enabling a paradigm shift in scientific and biomedical investigation. Advances in machine learning, data mining, and visualization are enabling new ways of extracting useful information in a timely fashion from massive data sets, which complement and extend existing methods of hypothesis testing and statistical inference. As a result, a number of agencies are developing big data strategies to align with their missions."
These examples and countless others highlight three common emerging themes:
1. Big Data is ubiquitous. All disciplines. All sectors of the economy.
2. Data is no longer considered a necessary cost to be managed down, but rather an asset to be mined and leveraged.
3. All sectors are increasingly finding a dearth of analytical talent to support their nascent but explosive analytical needs, particularly as they relate to Big Data.
In response, Kennesaw State University is proposing the development of a Ph.D. in Analytics and Data Science. It is our position that the Data Scientist will be uniquely positioned to fill the talent shortage outlined above. It is critical to note that this is a proposal for a Ph.D. program in Analytics and Data Science rather than in Statistics. A great deal of attention in the field of analytics is turning toward the role of the Data Scientist:
From IBM: "A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong application acumen, coupled with the ability to communicate findings ... in a way that can influence how an organization approaches a business challenge."
From Thomas Davenport, Senior Managing Partner at Accenture and author of Competing on Analytics: "(Data Scientists) are not typical scientists ... but rather hybrids of science and computation. Somewhere along their career journeys they became interested in, and good at, the manipulation of data. In fact, many of them really have 'computational' in front of their scientific specialties: computational biology, computational ecology, etc. If you want some evidence of this hybrid specialization, look at your favorite data scientist's profile on LinkedIn -- the home, by the way, of some of the best data scientists around -- and check out the skills they say they have. You'll see analytics (quantitative analysis, statistical modeling, predictive analytics, social network analysis, data mining, etc.) listed, of course. But you are also likely to see SQL, Java, C, Python, R, distributed databases, and so forth. All of these skills actually are found in one individual, and he seems typical of the breed ... to my knowledge, no universities have programs yet in big data analytics (though some are talking about them -- universities typically don't move too hastily)."
From Daniel Tunkelang, Chief Data Scientist at LinkedIn: "Strong analytical skills are a given: above all, a data scientist needs to be able to derive robust conclusions from data. But a data scientist also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve, the will to create value for users and the drive to improve business decisions. Communication is essential, because data scientists work in horizontal roles and partner with groups across the entire organization. At LinkedIn, data scientists collaborate with every other product group, as well as with sales and finance. Strong communication skills are a must-have."
From Steve Hillion, VP of Analytics at GreenPlum, as quoted in Forbes: "I'm sure in 30 years' time, there will be lots and lots of degrees in data science and that's where [data scientists will] come from, but right now it's coming from all these different buckets (math, computer science, economics) ... And, just as the early days of computing were born in the garages of Silicon Valley do-it-yourself-ers, data science is likely to develop first in an ad-hoc, hands-on way."
It is our position that the intersection of skills outlined in multiple ways above is brought together under the description of "Data Scientist" in a way that does not occur in a traditional Statistics curriculum. This term has emerged as the moniker for an individual with strong computational and programming skills who also possesses business/content acumen, enabling clear and meaningful communication. As can be seen below, the term "data scientist" is emerging as a dominant search term in job search engines.
FIGURE 3: Job Trends in Data Science
From Michael Rappa, Director of the Institute for Advanced Analytics at NC State: "The future of data science in the enterprise will be extremely bright, if a few key things happen: First, the right kinds of partnerships must be formed between data-rich companies and forward-thinking academic institutions. Second, institutions and employers need to encourage and reward the right set of data-science skills."
Are statisticians going away? No. There will always be a need for traditional Statistics. Disciplines such as psychology, nursing, marketing research and medical research will always need the traditional skills associated with hypothesis testing and model development. Data Scientists are different. They embody skills which traditional statisticians don't have. While data scientists must have strong skills in statistical testing and modeling, they are also strong in computational mathematics, data architecture, the process of ETL (extract, transform, load), and programming (e.g., SAS, Java, C++, Hadoop), and they typically have some content knowledge (e.g., Chemistry, Biology, Finance).
The proposed Ph.D. in Analytics and Data Science at Kennesaw State University directly addresses the national talent shortage in this space, as evidenced by movements such as the Big Data Research and Development Initiative of 2012, by effectively and thoroughly training, and thereby expanding, the workforce available to develop and use Big Data technologies.
Furthermore, this degree program would transform teaching and learning in the field of Big Data technology, another major objective of the White House Initiative. Consequently, we believe the degree will directly and/or indirectly accelerate the pace of discovery in science and engineering, strengthen U.S. national security, and increase the quality of life for the average American citizen. With respect to the shortage of big data analysts and their training, the MGI Big Data report states, "we believe that the constraint on this type of talent will be global, with the caveat that some regions may be able to produce the supply that can fill talent gaps in other regions." It is our strongly held position that, with this proposed Ph.D. degree, we can make Georgia one of these key regions which produce Big Data analytical leadership for the world. A Ph.D. program in Analytics and Data Science has the potential to define Kennesaw State University, the University System of Georgia, and the State of Georgia as cutting-edge innovators in the methods and technologies that will shape and see us through the 21st century.
So why do we think KSU can deliver a Ph.D. in Analytics and Data Science? The main reason is that we are already moving in that direction. This is evidenced by the successes of both the Minor in Applied Statistics and Data Analysis and the Master of Science in Applied Statistics.
The Minor in Applied Statistics, more than any other Minor field of study in the history of KSU, is a flagship of interdisciplinary success. Students are required to complete 15 hours (five courses) in Statistics at the 3000 level or above to qualify for a Minor in Applied Statistics and Data Analysis. In any given semester, the Minor serves the needs of over 200 students from almost every college across the university. Statistics draws the most diverse cross section of majors in 3000 or 4000 level courses of any course of study. Where most upper division courses are populated by students from a single major, the statistics courses (all STAT courses are above 3000) are consistently populated with students from Biology and Chemistry, Finance and Economics, Psychology, Mathematics, Sociology, and even Theater (see Figure 4).
Why do undergraduates at KSU seek out a series of five upper division electives in Statistics? We believe that three primary reasons have created this demand:
1. Statistics at KSU has an inherently interdisciplinary faculty, the same faculty which will power the Ph.D. in Analytics and Data Science. Most of the Statistics faculty have experience in the private sector, including Ford Motor Company, The Children's Hospital of Cincinnati, The Cancer Center at MD Anderson in Houston, TX, MasterCard International, VISA EU (London), AT&T/BellSouth (Brazil), Thomson Reuters, The Southern Company and ChoicePoint. Most students can find a faculty member who has applied Statistics outside the classroom in a way aligned with their career aspirations. These experiences are brought into the classroom, and students respond.
FIGURE 4: Distribution of Minors in Applied Statistics and Data Analysis by Declared Major (Fall 2012)
[Bar chart, axis scale 0 to 90. Majors shown: Exercise & Health Science, Computer Science, Criminal Justice, Environmental Studies, Interdisciplinary Studies, Math Education, Political Science, Theatre, Business, Finance, Geographic Info Sciences, Information Systems, Marketing, Sociology, Accounting, Communications, Nursing, International Affairs, Biochemistry, Chemistry, Mathematics, Biology, Psychology.]
2. Statistics is the process through which data is converted into meaningful information to support decision making. But, as outlined above, while data is increasingly ubiquitous and cheap and easy to capture and store, it is difficult to translate. Students recognize that whether they are studying Finance or Psychology, Biology or Political Science, they will have to understand how to translate data into information. This knowledge is no longer a differentiator but rather an ante to play. Since all disciplines work with data, in some form, all disciplines of study need to have some integration of Statistics for their graduates to be marketable.
3. Jobs. Jobs. Jobs. Students are increasingly turning to Statistics as a way to position themselves in the marketplace. Undergraduates with Minors in Statistics are having great success with job placement after graduation. Statistics students from KSU are recruited for positions across a wide variety of organizations, including The Home Depot, Coca-Cola, The Southern Company, Link Analytics, Aspen Marketing Services, The Internal Revenue Service, Ultimate Software, IBM, Assurant, CompuCredit, The CDC, and Equifax.
The Master of Science in Applied Statistics (MSAS) has a similar story. Since the launch of the degree in 2006, very few of the applicants have had undergraduate degrees in Statistics. MSAS applicants come from Engineering, Business, Medicine, and Education. A defining characteristic of the MSAS program is its fluid alignment with the needs of the market: the faculty works with its Advisory Board to keep the curriculum consistently aligned with the market. As a result, the MSAS is proud of its effective 0% unemployment rate amongst students without work restrictions.
Statistics emerged as a distinct discipline at KSU in Fall 2006; all of this success has occurred in less than six years. In an effort to limit duplication with other successful initiatives in Statistics within the University System, such as the programs at the University of Georgia and at Georgia Tech, KSU pursued a strongly applied orientation, meaning that course materials were focused on leveraging faculty experience outside the classroom and on applying Statistics the way practitioners apply Statistics. From the beginning, KSU elected to place less emphasis on theoretical Statistics. This meant that KSU would need strong integration of statistical software into the curriculum. In response, we looked to the dominant software/language in the marketplace. This was, without question, SAS, which is used by 95% of the Fortune 500, including all of the top companies in our regional footprint. As a result, all of our students, at both the undergraduate and graduate levels, learn strong SAS programming skills as a complement to their statistics skills. It is this dimension of programming skills, combined with a strong mathematical foundation and deep and broad instruction in statistical modeling, which has already positioned the program well to offer a Ph.D. in Analytics and Data Science.
c. Brief explanation of how the program is to be delivered
"Opportunities are usually disguised as hard work, so most people don't recognize them." - Ann Landers
While additional resources will be required to make the Ph.D. in Analytics and Data Science successful (see Section 4 below), much of the basic delivery infrastructure is in place. The general structure of the program will include three stages:
Stage 1: Course Work
"If you only have a hammer, you tend to see every problem as a nail." - Abraham Harold Maslow
The Ph.D. in Analytics and Data Science will begin with 48 hours of core course work/instruction, spread over (expected) four years of study, plus six hours of electives and 24 (minimum) hours of dissertation and internship (78 total hours). In response to the market needs and skill gaps as outlined above, the Ph.D. in Analytics and Data Science will have a strong interdisciplinary and application orientation. Generally, coursework will be structured as provided in Figure 5.
[Diagram: the three program stages: Coursework, Application, Research]
A full listing of the proposed courses, and a sample program of study can be found in SECTION 5 below. The logic supporting this interdisciplinary approach is that the curriculum would be aligned with the needs of the marketplace as evidenced in Section 1b above. Students will be required to complete a comprehensive examination of their course materials before they are considered to have completed this stage. The comprehensive examination will cover materials from all of the three areas of study listed above.
Stage 2: Application
The Ph.D. in Analytics and Data Science is, at its core, an applied program. Ph.D. students would be required to engage in one year of application, for a total of 15 credit hours. This application will take one of two forms. The first form of application is private or public sector work experience. The Statistical Advisory Board has agreed, in principle, to hire Ph.D. students on a contract basis for a minimum of one year, after they have completed their coursework but prior to completing their dissertation.
FIGURE 5: Proposed Approximate Distribution of Credit Hours for the Ph.D. in Analytics and Data Science
[Pie chart: Statistics 35%, "Content Application" 30%, Computer Science 20%, Mathematics 15%]
This would accomplish three objectives:
- The hiring firm would cover one year (minimum) of doctoral student stipend ($25K to $30K).
- The Ph.D. student would apply concepts and skills learned in the classroom in a real environment.
- The experience has the potential to become a source of dissertation research.
This integration with the companies represented by the Advisory Board also represents an important endorsement of the proposed program, as well as an extension of the engagement with the business community which has been the trademark of the Statistics programs to date. Letters of Intent and Support from the Statistical Advisory Board can be found in Section A.1.
The second form would be to engage in applied research/scholarship projects with KSU faculty. Examples would include analytical work in the areas of Finance, Chemistry, Economics, Biology, Marketing, etc. The intention is that this second form of application would take place within the context of a grant, which would fund the student and provide much needed analytical support for faculty engaged in grant-funded scholarship and research. Either form of these 15 hours of application would be expected to segue into the third stage of the program: dissertation research.
Stage 3: Dissertation Research
A Ph.D. in Analytics and Data Science would require a formal Dissertation process involving an interdisciplinary committee comprised of faculty from Statistics, Computer Science, and Mathematics. Depending upon the application path pursued above, a faculty member from a content discipline (e.g., Marketing, Finance, Chemistry, Biology, Economics) would be included, and possibly an external committee member as appropriate.
d. Prioritization within the institution's strategic plan
The Vision statement for the university: "Kennesaw State University will be a nationally prominent university recognized for excellence in education, engagement, and innovation." The Ph.D. program in Analytics and Data Science supports all three of these sources of recognition. As evidenced above, there is a widely recognized current and emerging talent shortage in the areas of Analytics and Data Science. In addition, no university in the country has an advanced degree/curriculum which formally addresses the specific skill sets defining the Data Scientist. KSU has the opportunity to establish the model for such a degree. Finally, the Program is truly engaged: engaged across academic disciplines, as well as with public and private sector partners who support this initiative.
SECTION 2. Description of the program's fit with the institutional mission and nationally accepted trends in the discipline.

Noted above.

SECTION 3. Description of how the program demonstrates demand and a justification of need in the discipline and geographic area (region, state, and nation) and is not unnecessary program duplication.

The market demand for the skills which are aligned with a Ph.D. in Analytics and Data Science is evidenced above in Section 1b. Not only does the proposed Program not duplicate any program currently in existence in the State of Georgia, it would be the first of its kind in the country. A brief outline of the most closely related Ph.D. programs in the State of Georgia is provided in Table 1 below.

TABLE 1: Comparison of related Ph.D. programs in the University System of Georgia

Institution: Georgia Institute of Technology
Name of Program: Ph.D. in Industrial Engineering with a Specialization in Statistics
Stated Objectives: The Ph.D. in (Industrial and Systems Engineering) is a research degree...students have the opportunity to pursue work at virtually any of the points across the applied/theoretical spectrum...
Notes on Curriculum: Courses incorporate strong mathematics, with methods courses aligned with manufacturing and engineering. Requirements include five core courses, two theory courses, three methods courses, one elective course (11 courses total). No requirement for internships or co-op.
Program Housed: College of Engineering, H. Milton Stewart School of Industrial and Systems Engineering.

Institution: Georgia Institute of Technology
Name of Program: Ph.D. in Industrial Engineering with a Specialization in Computational Science and Engineering
Stated Objectives: Georgia Tech's CSE Ph.D. degree will prepare students for a variety of positions in industry, government and academia that emphasize research and development. Students will be well prepared for positions in industry and in government. Graduates may pursue work in software and systems for modeling and simulation, systems integration, data mining and visualization, high performance computing, and computational modeling. Academic career possibilities include research and education in departments concerned with advancing the state-of-the-art in the development and application of computational models in engineering, the sciences and computing.
Notes on Curriculum: The program emphasizes the integration and application of principles from mathematics, science, engineering and computing to create computational models. Courses include 6 core courses in computational mathematics and in high performance computing, three elective courses which must go beyond using computers to deepen understanding of computational methods, preferably in the context of some application domain, and three elective courses in an application domain (12 courses total). No requirement for internships or co-op.
Program Housed: College of Engineering, H. Milton Stewart School of Industrial and Systems Engineering.

Institution: University of Georgia
Name of Program: Ph.D. in Statistics
Stated Objectives: Heavy theoretical emphasis; placement is exclusively in academic positions.
Notes on Curriculum: Program includes a minimum of 10 courses including four statistical theory courses (core), two subcore electives, one of which is a statistical computing course, and four unspecified STAT electives.
Program Housed: College of Arts and Science, Statistics Department.

Institution: Georgia State University
Name of Program: Ph.D. in Mathematics and Statistics
Stated Objectives: The Ph.D degree program in Mathematics and Statistics includes concentrations in bioinformatics, biostatistics, and mathematics. These concentrations address the critical need for mathematics faculty and the need for highly trained specialists in the areas of bioinformatics and biostatistics...(the program) will graduate individuals with a broad background in applied areas for direct placement in business, industry, governmental institutions and research universities.
Notes on Curriculum: Heavy emphasis on mathematics. The four core courses include Real Analysis, Matrix Analysis, Theory of Probability and Linear Statistical Analysis. Remaining courses vary based upon selected concentration. The Concentration in Bioinformatics incorporates three computer science courses. Eighteen courses required.
Program Housed: College of Arts and Sciences, Department of Mathematics and Statistics.
These are all excellent programs which have achieved recognition across a variety of contexts. However, none of these programs are aligned with the skills defining the Data Scientist.
SECTION 4. Brief description of institutional resources that will be used specifically for the program (e.g., personnel, library, equipment, laboratories, supplies & expenses, capital expenditures at program start-up and when the program undergoes its first comprehensive program review).

The Department of Mathematics and Statistics does many things well: it has developed the strongest, most popular Minor in the history of the University (the Minor in Applied Statistics and Data Analysis). The MS in Applied Statistics has an effective 0% unemployment rate, with an ever-increasing number of applications each year. While the philosophy, the will, the drive and the orientation to make a Ph.D. in Analytics and Data Science an unqualified success at the level of the other programs are present without question, several gaps in the faculty skill set and in the physical infrastructure have been identified, which must be mitigated in advance of any formal establishment of a Ph.D. in Analytics and Data Science.

Faculty/Personnel requirements

a. Director, Ph.D. in Analytics and Data Science. The Statistics faculty at KSU is comprised of talented, relatively young individuals who, while having worked in capacities outside of academia, only have teaching experience at KSU. No individual within this faculty has worked within an academic department which grants Ph.D.s (outside of work completed during their own Ph.D. programs). Therefore, an experienced individual is required to help develop and lead a flagship doctoral program in Analytics and Data Science. This individual will have had experience either leading or cultivating Ph.D. programs in Statistics, Applied Mathematics, Engineering, or other similar applied quantitative disciplines. A brief survey of the salaries of individuals who lead the programs listed above within the University System of Georgia is outlined in Table 2.
TABLE 2: Relevant Salary Comparisons for Individuals leading Major analytical programs in the USG.

Position | Institution | Salary not including travel (2011)
Associate Chair for Graduate Studies and Professor | Georgia Institute of Technology, H. Milton Stewart School of Industrial and Systems Engineering | $140,167
Chair, Statistics Department | University of Georgia, Department of Statistics | $177,290
Director, Graduate Programs in Mathematics | Georgia State University, Department of Mathematics and Statistics | $91,525
This individual would have the experience and authority of a department chair or associate dean. In addition, this individual would need to possess the intersection of skills which define a Data Scientist: Statistics, Computer Science and Mathematics. As referenced above, these individuals are in strong demand in the marketplace. While the salary points for this skill set vary depending upon the discipline and the economic sector, the average for 5-7 years of experience outside of academia (which may also be a source of recruitment) ranges from $120,000 to $200,000+. In the short term, this individual would be responsible for launching the Ph.D. program, ensuring that the program is appropriately accredited, attracting first-class students into the program, coordinating and facilitating external partnerships, and communicating/marketing the program to external rating agencies (e.g., US News and World Report, Fortune). The projected salary (without travel) is $150,000.

b. Associate Professor in Statistics. The demand for statistics courses at the undergraduate level and the MS level continues to grow, as can be seen in Figure 6.
[FIGURE 6: Number of Courses taught with STAT prefix since 2006 (Undergraduate and Graduate series, 2006 through 2012)]
While the faculty base has doubled over a six-year period (from five in 2006 to 11 in 2012), the number of courses (not sections, but different courses/preps) has more than doubled, from seventeen courses in 2006 (5 undergraduate and 12 graduate) to thirty-seven DIFFERENT courses in 2012 (12 undergraduate and 25 graduate). Many of these courses now offer multiple sections, taking the teaching load for 11 faculty (and two lecturers) to over 50 course sections. Even without a proposed Ph.D. in Analytics and Data Science, the Statistics faculty is already heaving under the weight of its own popularity. In addition to teaching their own courses, they are frequently sought out by other departments around campus to teach graduate courses in Statistics. Examples include NURS9100 and NURS9200, INCM9102 and the COLES DBA Program. Therefore, there is a strong need for two faculty members who can help share the teaching load, particularly as the number of different courses offered through the program continues to expand. Courses in advanced and emerging topics such as Neural Network modeling, Affinity Analysis, Social Network Analysis and Text Mining, as well as formal certification preparation courses for the Actuarial Exams, the SAS Programming Exams and the Six Sigma Exams, have been considered but not pursued because of the heavy teaching loads currently assumed by even senior faculty. The ideal individual would also have some experience teaching doctoral students. While the average ten-month salary of an Associate Professor in Statistics at KSU is about $80,000, this is well below the academic industry median, and about half of the private sector average. To ensure a pool of competitive candidates, this position is proposed at a salary of $90,000 for ten months (not including travel). A second position for an Assistant Professor of Statistics at KSU is proposed at a salary of $80,000 for ten months (not including travel).
c. Associate Professor in Mathematics. The Ph.D. Program will require, at minimum, three additional courses in theoretical mathematics: Advanced Discrete Mathematics/Combinatorics, Graph Theory and Theory of Linear Models. While the average ten-month salary of an Associate Professor in Mathematics at KSU is about $60,000, this is below the industry median. To ensure a pool of competitive candidates, this position is proposed at a salary of $70,000 for ten months (not including travel).

d. Associate Professor in Computer Science. The Computer Science department at KSU is also experiencing an increased demand in courses at all levels (consistent with the demands and trends in the market as outlined above). With organic increased demand, combined with new demands from this proposed Ph.D., the CS department will require at least two additional faculty to accommodate the increased demand for courses. The first incremental CS faculty member, hired specifically to help support the needs of the proposed program, would be expected to have some previous experience teaching Ph.D. students, combined with teaching experience (or industry experience) with practical application. Consistent with the current Associate Professor position in Computer Science, the expected ten-month salary is approximately $90,000.
e. Assistant Professor in Computer Science. The same considerations apply as for the Associate Professor position above. The Assistant Professor candidate may be a freshly minted Ph.D. graduate, with strong skills in Big Data software/languages. Consistent with current Assistant Professors in the Computer Science department, the expected ten-month salary for this position is approximately $80,000.

f. Office Manager. The Ph.D. in Analytics and Data Science would need a dedicated Office Manager to support the Director, as well as the doctoral students. Consistent with Office Managers supporting other doctoral programs at KSU, the expected twelve-month salary for this position is approximately $50,000.

g. Dedicated IT Professional. Given the centrality of High Performance Computing to the success of the degree, combined with the unique and complex requirements of a Big Data environment, the Ph.D. in Analytics and Data Science would require a dedicated IT professional to support the affiliated faculty and the doctoral students. Consistent with IT Professionals currently engaged in system support for the College of Science and Mathematics, the twelve-month salary is expected to be $55,000.

h. Stipend for Doctoral Students. Data regarding funding for doctoral students at other universities is highly varied by program. Because the Program is full time, students will have little opportunity to work outside of their Program-aligned internship. Therefore, the stipend will need to be substantial as well as competitive in order to attract high-quality candidates. It is our initial position that the stipend for students should range from $20,000 to $30,000 annually, with the discretion of the Director to award higher stipends for more promising students. Doctoral students would also receive a tuition waiver and would be required to teach or assist with research every semester.

TABLE 7: Summary of incremental Faculty/Personnel for the Ph.D. in Analytics and Data Science

Position | Home Department | Expected Annual Salary | Notes
Director, Ph.D. in Analytics and Data Science | Department of Mathematics and Statistics | $150,000 | Twelve Month position starting YR Before Program (YR -1).
Associate Professor of Applied Statistics | Department of Mathematics and Statistics | $90,000 | Ten Month position starting in YR1.
Assistant Professor of Applied Statistics | Department of Mathematics and Statistics | $80,000 | Ten Month position starting in YR2.
Associate Professor of Mathematics | Department of Mathematics and Statistics | $70,000 | Ten Month position starting in YR2.
Associate Professor of Computer Science | Department of Computer Science with appointment to the Department of Mathematics and Statistics | $90,000 | Ten Month position starting in YR1.
Assistant Professor of Computer Science | Department of Computer Science with appointment to the Department of Mathematics and Statistics | $80,000 | Ten Month position starting in YR Before Program (YR -1).
Office Manager | Department of Mathematics and Statistics | $50,000 | Twelve Month position starting in YR1.
IT Support | College of Science and Mathematics | $55,000 | Twelve Month position starting in YR1.
Doctoral Student Stipend | Department of Mathematics and Statistics | $150,000 | $30,000 Per Student; estimate assumes five students in YR1.
TOTAL | | $815,000 |
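As a quick arithmetic check on Table 7, the short Python sketch below re-adds the listed annual figures. The dictionary labels are shorthand introduced here for illustration only; the dollar amounts are transcribed from the table, and the stipend line assumes five students at $30,000 each, as stated in the table notes.

```python
# Annual figures transcribed from Table 7 (US dollars).
# Keys are illustrative shorthand, not official position titles.
personnel_costs = {
    "Director": 150_000,
    "Associate Professor, Applied Statistics": 90_000,
    "Assistant Professor, Applied Statistics": 80_000,
    "Associate Professor, Mathematics": 70_000,
    "Associate Professor, Computer Science": 90_000,
    "Assistant Professor, Computer Science": 80_000,
    "Office Manager": 50_000,
    "IT Support": 55_000,
    # Stipend estimate from the table notes: five students at $30,000 each.
    "Doctoral Student Stipends": 5 * 30_000,
}

total_personnel = sum(personnel_costs.values())
print(f"Total incremental personnel cost: ${total_personnel:,}")
```

The computed sum agrees with the TOTAL row of the table. Note that the positions start in different years (YR -1, YR1, YR2), so the figure is a steady-state annual total rather than a first-year cost.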
Software Requirements

A defining skill of the Data Scientist is the ability to work with Big Data. While the operational definitions of Big Data continue to evolve, most people agree that this new and increasingly ubiquitous type of data has three components:

1. Size. In this context, "big" is defined as data that cannot be handled by traditional software packages. For example, data files with a billion records are increasingly common. Traditional analysis packages like Microsoft Excel or even SPSS do not scale to accommodate this kind of data.
2. Type. Big Data often comes in forms other than traditional rows and columns. Text, tweets, even video, are not well accommodated by traditional analytical software.
3. Velocity. Because data has become cheap and easy to capture, organizations capture data. Every day. Every hour. Every minute. Consider the Southern Company. This organization captures usage on the power grid continuously and feeds this data back into their systems. This velocity of data (the constant updating of files) is, again, not accommodated by traditional analytical software.

Big Data, with its issues of size, type and velocity, requires a different type of software solution for analysis. The SAS Institute is one of the few organizations to have developed a software solution for this problem: SAS High Performance Analytics (HPA). SAS HPA uses in-memory and distributed processing to analyze billions of records in seconds. By way of comparison, a current student course project from STAT8330 requires students to model 17 million records. With 20 students using a dedicated server (no other users permitted), the process takes about 4 days.

In part because of the long-standing relationship between the SAS Institute and Kennesaw State University, the consistent integration of Base SAS software across almost all KSU STAT courses, and the recognition that this Ph.D. program would become the first of its kind in the country, the SAS Institute has agreed to make a multi-million dollar gift to KSU in the form of a copy of this software. This gift will accomplish three very important objectives as the new Program unfolds:
1. Because KSU will be one of the first universities in the country to have this software, it will not only differentiate KSU, but provide important validation and enhanced credibility for the Program.
2. It will continue to make KSU the center for analytics in the Metropolitan Atlanta area, where small, medium and large companies can come to the KSU Data Science Lab to test Big Data Analytics on a unique piece of software.
3. As organizations around the country integrate this software into their own systems, KSU students will be uniquely trained and qualified to work within an HPA environment.

Data Science Lab requirements

The multi-million dollar software grant from SAS is unprecedented in KSU history. This powerful software will help to propel the program into immediate national recognition as a leader in Big Data analytics. However, the software requires a level of hardware infrastructure which has never existed at KSU. Working with Erik Bowe, Chief Data Officer, and Don Hayes, founder of DLL Consulting (a SAS Alliance Partner Affiliate) and member of the Statistics Advisory Board, a team of faculty is evaluating multiple options to optimize the final hardware configuration to support a high performance analytics infrastructure. At a minimum, a high bandwidth storage area network (SAN) with at least 6 TB of storage with 7,500 RPM drives will be needed, with options for scalability. As part of this option, the program would begin with 20+ Dell/Teradata blades with 8 core chips along with 256 GB of RAM at 1600 MHz. More detailed hardware infrastructure conversations will be required, but research to date indicates that the hardware solution options will range from $700,000 to $1,000,000. For the purposes of planning, an estimate of $800,000 will be used to fund the one-time hardware requirements needed to accommodate the SAS high performance analytics software.
It should be noted that this hardware infrastructure would also serve the needs of Computer Science which has also launched a Certificate program aligned with Big Data.
TABLE 8: Summary of incremental Hardware/Physical Infrastructure for the Ph.D. in Analytics and Data Science
Item | Home Department | Expected Costs (one time unless noted otherwise) | Notes
New Servers | | $800,000 | One time cost YR1
Annual Support from SAS | | $50,000 | Annual cost
TOTAL | | $850,000 |
A detailed summary of the estimated budget for the program can be found in Section 14 below.
SECTION 5. Curriculum: List the entire course of study required and recommended to complete the degree program. Provide a sample program of study that would be followed by a representative student.

The proposed program of study for the Ph.D. in Analytics and Data Science presented in the pages below assumes that an entry-level student has completed the following courses or their equivalents at the master's degree level (see Table 10). If any of these courses (or their professional equivalents) have not been completed by a candidate prior to admission to the Ph.D. program, then the student will be required to complete the appropriate course(s) in addition to the requirements that follow for the Ph.D. Table 10 below displays the course acronym, title, credit hours, status (either an existing course or a new one that must be developed for purposes of the Ph.D. program), and associated prerequisites for eight foundational master's degree-level program prerequisites. In addition, Section A.2 in the Appendix provides the course descriptions for these classes.
TABLE 10: Previous Master's Degree-Level Required Coursework for the Ph.D. in Analytics and Data Science

Prefix | Course Name | Credit Hours | Status | Pre-requisites | Notes
STAT 7010 | Mathematical Statistics I | 3-0-3 | Existing | | Requires Calc 2, and preferably Calc 3.
STAT 7100 | Statistical Methods | 3-0-3 | Existing | STAT 7010 | Research Methods and Testing. This would be standard content in almost any MS program in Social Sciences, Hard Sciences or Statistics.
STAT 7020 | Statistical Computing and Simulation | 3-0-3 | Existing | STAT 7100 | Must include SAS and basic programming.
STAT 8210 | Applied Regression Analysis | 3-0-3 | Existing | STAT 7100 and STAT 7020 | Standard for most Statistics Curricula.
STAT 8310 | Applied Categorical Data Analysis | 3-0-3 | Existing | STAT 8210 |
STAT 8320 | Applied Multivariate Data Analysis | 3-0-3 | Existing | STAT 8120 and STAT 8210 | Standard for most Statistics Curricula.
ACS 7010 | C++ and Data Structures | 3-0-3 | Existing | | Standard for most CS Curricula.
ACS 7030 | Relational Database Systems | 3-0-3 | Existing | |
Because this is an applied degree program, for applicants who do not have these specific (or similar) courses on their academic transcript, practical experience would be a viable substitute. To ensure that applicants have the necessary depth of skills, a passing score on a basic skills exam would be required in lieu of a completed course.
Students pursuing a Ph.D. in Analytics and Data Science would be required to take 48 course hours plus 6 hours of electives spread over four years, plus dissertation research (12 hour minimum) and internship (12 hour minimum). In total, this degree is expected to require a minimum of 78 credit hours of courses, internship and dissertation.

TABLE 11: Core Required Courses for the Ph.D. in Analytics and Data Science

Prefix | Course Name | Credit Hours | Status | Pre-requisites* | Notes
STAT 8240 | Data Mining I | 3-0-3 | Existing | STAT 8210 |
STAT 8020 | Advanced Programming in SAS | 3-0-3 | Existing | STAT 7100 and 7020 | After the completion of this course, students will sit for the Advanced Programming Certification Exam, sponsored by SAS.
STAT 8330 | Applied Binary Classification | 3-0-3 | Existing | STAT 8120 |
STAT 8250 | Data Mining II | 3-0-3 | New | STAT 8240 | This course may need to be taught by or at least in conjunction with the CS faculty. We anticipate heavy emphasis on Hadoop.
STAT 8260 | Segmentation Models | 3-0-3 | New | STAT 8240 | After the completion of this course (and 8240 and 8250), students will sit for the Data Mining Certification Exam, sponsored by SAS.
STAT 8270 | Production-Level Modeling | 3-0-3 | New | STAT 8240 and STAT 8020 |
STAT XXXX | Statistics Elective | 3-0-3 | | |
STAT XXXX | Statistics Elective | 3-0-3 | | |
STAT CORE: 24 HOURS

MATH 8020 | Graph Theory | 3-0-3 | New | |
MATH 8030 | Discrete Mathematics | 3-0-3 | New | | Includes study of Combinatorics
MATH 8010 | Theory of Linear Models | 3-0-3 | New | |
MATH CORE: 9 HOURS

ACS 7410 | Parallel and Distributed Computing | 3-0-3 | New | ACS 7010 and ACS 7030 |
ACS 7510 | HPC Infrastructure | 3-0-3 | New | ACS 7010 |
ACS 7420 | Algorithm Design for Big Data | 3-0-3 | New | ACS 7410 |
ACS 8310 | Data Warehousing | 3-0-3 | New | ACS 7030 |
ACS XXXX | ACS Elective | 3-0-3 | | |
CS CORE**: 15 HOURS

TOTAL CORE REQUIREMENTS: 48 HOURS

DS 9900 | Dissertation Research (min 2 semesters) | 6-0-6 | New | Completion of Coursework | Can be completed simultaneously with Internship
DS 9700 | Internship/Application (min 2 semesters) | 6-0-6 | New | Completion of Coursework | Can be completed simultaneously with Dissertation Research
XXXX | Free Elective (Statistics, Mathematics, or Computer Science or Content area) | 3-0-3 | New | |
XXXX | Free Elective (Statistics, Mathematics, or Computer Science or Content area) | 3-0-3 | New | |

TOTAL PROGRAM REQUIREMENTS: 78 HOURS

* From transcript or from professional experience.
** All CS courses listed above will be cross listed with the MS in Applied Computer Science (ACS). These courses will be taught by a combination of Dr. Ying Xie and two incremental faculty to be hired via a search committee comprised of CS, Math and Stat faculty.
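The hour totals stated for Table 11 can be re-derived with a few lines of Python. This is only a sketch for checking the arithmetic: the variable names are introduced here for illustration, and the hour values are taken from the core subtotals, the two free electives, and the stated 12-hour minimums for dissertation research and internship.

```python
# Core subtotals as stated in Table 11 (credit hours).
core_hours = {
    "STAT core": 8 * 3,  # six named STAT courses plus two STAT electives
    "MATH core": 3 * 3,  # Graph Theory, Discrete Mathematics, Theory of Linear Models
    "CS core": 5 * 3,    # four named ACS courses plus one ACS elective
}

# Remaining minimum requirements outside the core.
other_hours = {
    "Free electives": 2 * 3,
    "Dissertation research (minimum, 2 semesters x 6)": 12,
    "Internship/application (minimum, 2 semesters x 6)": 12,
}

total_core = sum(core_hours.values())
total_program = total_core + sum(other_hours.values())

print(f"Core requirements: {total_core} hours")
print(f"Minimum program requirements: {total_program} hours")
```

Both results match the totals given in the table (48 core hours and a 78-hour program minimum).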
Students are welcome to take additional hours of elective courses (or substitutes with approval) from any department on campus. Because few departments currently offer Ph.D. or M.S. level courses, electives outside of the three disciplines outlined above would most likely take the form of directed studies. These directed studies could segue into the application phase of the program. It would be expected that, where appropriate, any professor who engages in a substantive directed study with a Ph.D. student from this program would also serve on the dissertation committee. Consequently, only tenured Associate and Full Professors are invited to engage in directed studies with these Ph.D. students. Of the elective courses listed below from the Statistics curriculum, several have been considered for years as electives for inclusion in the MS in Applied Statistics. With additional faculty resources, these courses can now be developed and cross listed between the MS in Applied Statistics and the Ph.D. in Analytics and Data Science.

TABLE 12: Elective Courses in Statistics for the Ph.D. in Analytics and Data Science

Prefix | Course Name | Credit Hours | Status | Pre-requisites | Notes
STAT 8110 | Quality Control and Process Improvement | 3-0-3 | Existing | STAT 7100 & STAT 7020 | Black Belt Preparation
STAT 8140 | Six Sigma Problem Solving | 3-0-3 | Existing | STAT 8110 & STAT 8120 | Black Belt Preparation
STAT 8350 | Social Network Analysis | 3-0-3 | New | STAT 8240 |
STAT 8360 | Applied Sampling Methods | 3-0-3 | New | STAT 8240 & STAT 8020 |
STAT 8370 | Affinity Analysis | 3-0-3 | New | STAT 8240 |
STAT 8380 | Churn Modeling | 3-0-3 | New | STAT 8240 |
STAT 7900 | Programming in R | 3-0-3 | Existing | STAT 7020 or CS 8530 | STAT 7900
STAT 8399 | Design and Analysis of Massive Survey Data | 3-0-3 | New | STAT 8240 & STAT 8020 | STAT 8399
ACS 8430 | Text and Web Mining | 3-0-3 | New | ACS 8310 and ACS 7420 |
ACS 8510 | Cloud Computing | 3-0-3 | New | ACS 7410 and ACS 7510 |
DS XXXX | CSAS Consulting | 3-0-3 | New | Center Director Approval | Option to work in the Consulting Center
While an ultimate program of study could take many different forms, an example of a typical program of study for a student in this program can be found in Table 13 below.
TABLE 13: Sample Program of Study for the Ph.D. in Analytics and Data Science

Course | Course Name | Pre Req | Hours | Notes

FALL 2014
STAT 8020 | Advanced Programming in SAS | STAT 7100 & 7020 (or BASE SAS Programming Certification) | 3 |
STAT 8240 | Data Mining I | STAT 8210 | 3 |
MATH 8010 | Theory of Linear Models | | 3 |
ACS 7410 | Parallel and Distributed Computing | ACS 7010 and ACS 7030 | 3 |
SEMESTER = 12

SPRING 2015
STAT 8330 | Applied Binary Classification | STAT 8210 | 3 |
MATH 8030 | Discrete Mathematics | | 3 |
ACS 7510 | HPC Infrastructure | ACS 7010 | 3 |
SEMESTER = 9

SUMMER 2015
MATH 8020 | Graph Theory | | 3 |
STAT 8360 | Applied Sampling Methods | STAT 8240 and STAT 8020 | 3 | STAT Elective
SEMESTER = 6

FALL 2015
STAT 8250 | Data Mining II | STAT 8240 | 3 |
STAT 8260 | Segmentation Models | STAT 8240 | 3 |
ACS 8310 | Data Warehousing | ACS 7030 | 3 |
SEMESTER = 9

SPRING 2016
STAT 8270 | Production-Level Modeling | STAT 8240 and STAT 8020 | 3 |
STAT 8350 | Social Network Analysis | STAT 8240 | 3 | STAT Elective
ACS 7420 | Algorithm Design for Big Data | ACS 7410 | 3 |
SEMESTER = 9

SUMMER 2016
ACS 8430 | Text and Web Mining | ACS 8310 and ACS 7420 | 3 | ACS Elective
DS 9700 | Internship/Application | Completion of Coursework | 3 |
SEMESTER = 6

FALL 2016
DS 9900 | Dissertation Research | Completion of Coursework | 6 |
DS 9700 | Internship/Application | Completion of Coursework | 6 |
SEMESTER = 12

SPRING 2017
DS 9900 | Dissertation Research | Completion of Coursework | 6 |
DS 9700 | Internship/Application | Completion of Coursework | 6 |
SEMESTER = 12

SUMMER 2017
DS 9900 | Dissertation Research | Completion of Coursework | 6 |
SEMESTER = 6

PROGRAM = 81
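The semester totals in the sample program can be tallied with the short Python sketch below. The list name and tuple layout are illustrative choices made here; the hours per semester are copied from Table 13.

```python
# (semester, hours) pairs transcribed from the sample program in Table 13.
semester_hours = [
    ("Fall 2014", 12),
    ("Spring 2015", 9),
    ("Summer 2015", 6),
    ("Fall 2015", 9),
    ("Spring 2016", 9),
    ("Summer 2016", 6),
    ("Fall 2016", 12),
    ("Spring 2017", 12),
    ("Summer 2017", 6),
]

program_hours = sum(hours for _, hours in semester_hours)
print(f"Sample program total: {program_hours} hours")
```

The sum agrees with the PROGRAM = 81 line, which is above the 78-hour minimum because the sample student registers for more than the minimum dissertation and internship hours.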
e. Append materials available from national accrediting agencies or professional organizations as they relate to curriculum standards for the proposed program.

Data Science represents a nascent, but trending, confluence of disciplines. The degree incorporates Mathematics, Statistics and Computer Science. No national accrediting agency exists to oversee degrees in Data Science. Therefore, we will look to external entities for feedback and endorsement of our degree program. These external entities include:

1. The SAS Institute: one of the primary clearinghouses for developments in the area of applied analytics (it should be noted that the SAS Institute is an engaged member of our advisory board and has been instrumental in guiding our thinking throughout the development of this proposal).
2. The KSU Statistics Advisory Board: as referenced above, the Advisory Board is comprised of managers from organizations in the Metropolitan Atlanta area who are currently engaged in issues related to Big Data Analytics, including The Home Depot, The Southern Company, The Centers for Disease Control, CompuCredit and several consulting firms and large insurance firms. Again, they have guided the principles incorporated in this proposal.
3. Other Universities. Several universities have MS and/or undergraduate programs aligned with applied statistics. Several professors from these institutions have taken the time to review our proposed curriculum and have provided their comments. Summaries of the comments can be found in Section A.3.
f. Indicate ways in which the proposed program is consistent with national standards. As indicated above, there are no national standards in this space. However, as outlined in Section 5.e, several knowledgeable experts have been solicited for their input and endorsement of the proposed curriculum.
It is worth noting that Thomas Davenport, a recognized national expert and prolific author in the area of applied analytics, prescribed the following skills for a Data Scientist in his article "Data Scientist: The Sexiest Job of the 21st Century" (see footnote 1):

- The most basic universal skill is the ability to code
- Communicate in a language that their stakeholders can understand
- Posit and test hypotheses
- Bring structure to large quantities of formless data

g. If internships or field experiences are required as part of the program, provide information documenting internship availability as well as how students will be assigned and supervised.

As provided in Section A.1, the Department of Mathematics and Statistics has received letters from our Statistical Advisory Board, stating that they would hire our Ph.D. students for a minimum of one year, after they have completed their course requirements. These students would be supervised at the hiring firm, but would also require a faculty mentor from the Department of Mathematics and Statistics.

h. Indicate the adequacy of foundation course offerings to support the new program.

Students would be admitted in a cohort. Courses which are unique to the Ph.D. program (not part of an MS curriculum) would be offered once a year.
1 http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/pr
SECTION 6: Admissions criteria. Please include required minimum scores on appropriate standardized tests, grade point averages, and master's level graduate degree attainment.

Admission into the Ph.D. in Analytics and Data Science Program would require the following:

1. Minimum GRE Score: top 20th percentile.
2. Minimum GPA: 3.0 in MS program.
3. Letters of Recommendation (academic and professional). Applicants would need to provide three letters of recommendation, including two from academic references and one from an applied reference (i.e., a manager or colleague).
4. At least one year of previous work experience outside of academia (preferred).
5. Interview. All applicants would be required to have either an on-campus interview (preferred) or a video interview with the Program faculty.
SECTION 7: Availability of assistantships.

As evidenced by the Letters of Intent (see Section A.1) from the Statistical Advisory Board, external assistantships/internships should be readily available for students who complete their courses and pass their comprehensive exam. In addition, as the emphasis on research and scholarship continues to increase at KSU, these doctoral students, probably more than any other students on campus, would be expected to be in demand for grant/research support. Currently the Center for Statistics and Analytical Services employs two MS level GRAs every semester. These GRAs are engaged in multiple analytical projects across campus. Given the existing level of demand for MSAS students, it would be logical to anticipate sufficient demand for Data Science doctoral students to accommodate at least two GRAs from this Program every semester.
SECTION 8. Student learning outcomes and other outcomes of the proposed program.

The success of the Ph.D. in Analytics and Data Science will be measured through four instruments:

1. Successful completion of comprehensive examinations. At the completion of their required coursework, students will take a comprehensive examination covering each of the three major areas of study. Students who do not pass their comprehensive examinations will be mentored into alternative options. These alternatives are outlined at the end of this section.

2. External Certifications. The SAS Institute has a strong certification program which serves as a de facto accreditation process for the applied analytics industry [2]. These certifications range from base programming to data management, predictive modeling and data mining. At present, most MS students in Applied Statistics, and increasingly more undergraduate minors in Applied Statistics, are earning these certifications. It would be our expectation that the Ph.D. students in Analytics and Data Science would continue this trend. With additional training in Computer Science and in Mathematics, these students should earn, at the least, SAS Certifications in Base Programming, Advanced Programming, and Statistical Business Analyst: Regression and Modeling. These certifications have clear and universally recognized value in the marketplace.

3. Job Placement. As referenced above, the MS in Applied Statistics has a 100% placement rate, with many students having multiple job offers in advance of graduation. An important measure of success of the Ph.D. in Analytics and Data Science will be the demand for the graduates. Not only do we expect these graduates to experience a placement rate as strong as that of the MS students, but we expect that there will be competition in the market for these students.

4. Pre/Interim/Post competency survey. In conjunction with our Advisory Board, we will develop a required skills inventory, including the skills which are required in job postings within their own firms when advertising for Data Scientists. Students will be evaluated against this inventory three times. The pre score will result from their assessment after acceptance into the program. The interim score will result from their assessment after completion of the program. Their post score will come after one year in their post-graduation position.

Students who do not pass the comprehensive exams will NOT simply earn an MS in Applied Statistics or an MS in Applied Computer Science. If failing Ph.D. students have a particular aptitude in one discipline and not in another, they may be guided to apply to one of these two programs, but acceptance is not guaranteed. A scenario exists where unsuccessful students may walk away empty-handed, although we expect this scenario to be rare.

[2] http://support.sas.com/certify/
SECTION 9. Administration of the program: The Ph.D. in Analytics and Data Science will be housed in the College of Science and Mathematics, with primary administrative responsibility in the Department of Mathematics and Statistics.
SECTION 10. Waiver to Degree-Credit Hour (if applicable): If the program exceeds the total credit hours normally associated with similar programs offered both within and outside of the system, provide the institution's rationale for increased credit hour requirements. Not Applicable.
SECTION 11. Accreditation: Describe disciplinary accreditation requirements associated with the program (if applicable). As described in Section 5.e above, while no accreditation agency exists for Data Science, the curriculum has been informed by several knowledgeable individuals and organizations.
SECTION 12. Projected enrollment for the program (especially during the first three years of implementation). Please indicate whether enrollments will be cohort-based.

The Ph.D. in Analytics and Data Science will enroll five students a year for the first four years on a cohort basis. After four years, the program is expected to ramp up towards a steady state of 10 new students a year.

                                    YR 1  YR 2  YR 3  YR 4  YR 5  YR 6
Enrollment Projections                 5    10    15    20    21    22

Course Sections Satisfying Program Requirements
  Previously Existing                  7    16    18    20    20    20
  New                                  9     2     2     0     0     0
  Total Sections                      16    18    20    20    20    20

Credit Hours Generated by Those Courses
  Existing                            21    48    54    66    78    78
  New                                 27     6   12*   12*     0     0
  Total Credit Hours                  48    54    66    78    78    78

Degrees Awarded                        0     0     0     0     4     4

* Dissertation hours and Internship hours will be duplicated
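As a quick consistency check on the projection figures above, the following minimal Python sketch confirms that the existing and new rows add up to the stated totals in every year. The variable and function names are illustrative only and are not part of the proposal.

```python
# Figures copied from the Section 12 projections (YR 1 through YR 6).
previously_existing_sections = [7, 16, 18, 20, 20, 20]
new_sections = [9, 2, 2, 0, 0, 0]
total_sections = [16, 18, 20, 20, 20, 20]

existing_credit_hours = [21, 48, 54, 66, 78, 78]
new_credit_hours = [27, 6, 12, 12, 0, 0]
total_credit_hours = [48, 54, 66, 78, 78, 78]


def rows_add_up(first, second, total):
    """Return one boolean per year: does first + second equal the stated total?"""
    return [a + b == t for a, b, t in zip(first, second, total)]


sections_ok = rows_add_up(previously_existing_sections, new_sections, total_sections)
credit_hours_ok = rows_add_up(existing_credit_hours, new_credit_hours, total_credit_hours)

print("Sections consistent in every year:", all(sections_ok))
print("Credit hours consistent in every year:", all(credit_hours_ok))
```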
SECTION 13: Faculty Provide an inventory of faculty directly involved with the administration of the program. On the list below, indicate which persons are existing faculty and which are new hires.
Faculty Name        Rank                 Highest Degree  Degrees Earned       Academic Discipline  Current Workload
Bradley Barney      Assistant Professor  Ph.D.           B.A., M.S., Ph.D.    Statistics           3/3
Marla Bell          Professor            Ph.D.           B.S., M.S., Ph.D.    Statistics           1/1
Nicole Ferguson     Assistant Professor  Ph.D.           B.S., M.S., Ph.D.    Statistics           3/3
Joe DeMaio          Professor            Ph.D.           B.S., M.S., Ph.D.    Mathematics          2/2
Victor Kane         Professor            Ph.D.           B.S., M.S., Ph.D.    Statistics           3/3
Philippe Lavalle    Associate Professor  Ph.D.           B.S., M.S., Ph.D.    Mathematics          3/3
Louise Lawson       Professor            Ph.D.           B.S., M.P.H., Ph.D.  Statistics           3/3
Sherri Ni           Associate Professor  Ph.D.           B.S., M.S., Ph.D.    Statistics           3/3
Jennifer Priestley  Associate Professor  Ph.D.           B.S., MBA, Ph.D.     Statistics           2/2
Herman (Gene) Ray   Assistant Professor  Ph.D.           B.S., M.S., Ph.D.    Statistics           3/3
Lewis VanBrackle    Professor            Ph.D.           B.S., M.S., Ph.D.    Statistics           2/2
Ying Xie            Associate Professor  Ph.D.           B.S., M.S., Ph.D.    Computer Science     2/3
Daniel Yanosky      Associate Professor  Ph.D.           B.S., M.S., Ph.D.    Statistics           3/3
New Hire TBD        Associate Professor  Ph.D.                                Statistics           NA
New Hire TBD        Assistant Professor  Ph.D.                                Statistics           NA
New Hire TBD        Associate Professor  Ph.D.                                Mathematics          NA
New Hire TBD        Assistant Professor  Ph.D.                                Computer Science     NA
New Hire TBD        Associate Professor  Ph.D.                                Computer Science     NA
New Hire TBD        Professor            Ph.D.                                TBD                  NA
SECTION 15: Fiscal, Facilities, Enrollment Impact and Budget

                                   YR 1       YR 2       YR 3       YR 4       YR 5       YR 6
ENROLLMENT PROJECTIONS
  Doctoral Students                   5         10         15         20         21         22

GRAND TOTAL COSTS             1,839,500  1,224,500  1,369,500  1,369,500  1,249,500  1,129,500

* includes the Assistant Professor in Computer Science from YR -1, the Associate Professor in Statistics in YR1 and the Associate Professor in Computer Science in YR1.
+ includes the faculty from YR1 and the Assistant Professor in Statistics in YR2 and the Associate Professor in Mathematics in YR2.
                                         YR 1       YR 2       YR 3       YR 4       YR 5       YR 6
REVENUE SOURCES
Source of Funds
  Reallocation of existing funds
  New Tuition
  Federal Funds
  Other Grants                              -    385,000    420,000    500,000    500,000    500,000
  Student Fees
  Other (Consulting Center)                       25,000     50,000    100,000    100,000    100,000
  Other (Corporate Internships)                                        150,000    300,000    450,000
  New state allocation requested for budget hearing
Nature of Funds
  Base Budget
  One Time funds

GRAND TOTAL REVENUES                        0    410,000    470,000    750,000    900,000  1,050,000

Net Total Program Cost              1,839,500    814,500    899,500    619,500    349,500     79,500
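The net program cost in each year is the grand total cost minus the grand total revenue. As a quick arithmetic check, the short Python sketch below recomputes that difference from the figures above and also sums the three populated revenue lines. The variable names are illustrative only, and the year-by-year alignment of the consulting and internship revenues (starting in YR 2 and YR 4 respectively) is an assumption inferred from the flattened layout. Under that assumption, the components sum exactly to the stated grand total revenues.

```python
# Figures copied from the Section 15 budget (YR 1 through YR 6).
grand_total_costs = [1_839_500, 1_224_500, 1_369_500, 1_369_500, 1_249_500, 1_129_500]
grand_total_revenues = [0, 410_000, 470_000, 750_000, 900_000, 1_050_000]

# Revenue components. ASSUMPTION: blank cells are treated as zero, and the
# consulting and internship lines are aligned to the later years.
other_grants = [0, 385_000, 420_000, 500_000, 500_000, 500_000]
consulting_center = [0, 25_000, 50_000, 100_000, 100_000, 100_000]
corporate_internships = [0, 0, 0, 150_000, 300_000, 450_000]

# Sum the revenue components year by year.
revenue_from_components = [
    g + c + i
    for g, c, i in zip(other_grants, consulting_center, corporate_internships)
]

# Net cost per year = grand total costs - grand total revenues.
net_total_program_cost = [
    cost - revenue for cost, revenue in zip(grand_total_costs, grand_total_revenues)
]

for year, net in enumerate(net_total_program_cost, start=1):
    print(f"YR {year}: net cost {net:,}")
```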
SECTION A.1: Letters of Support from Statistical Advisory Board
SECTION A.2: COURSE DESCRIPTIONS FROM SECTION 5

From Table 10: Previous Master's Degree-Level Required Coursework for the Ph.D. in Analytics and Data Science. Note that these courses are currently available in the MS in Applied Statistics and/or the MS in Applied Computer Science curricula.

STAT 7010: Mathematical Statistics I
Fundamental concepts of probability, random variables and their distributions; review of sampling distributions; theory and methods of point estimation and hypothesis testing, interval estimation, nonparametric tests, introduction to linear models.

STAT 7020: Statistical Computing and Simulation
Topics covered in STAT 7020 will include stochastic modeling, random number generators based on probability distributions, discrete-event simulation approaches, simulated data analysis, nonparametric analysis and sampling techniques. Given the importance of the SAS software to these types of applications, students will necessarily refine and improve their SAS programming skills. The class will utilize real-world datasets from a variety of disciplines, including finance, manufacturing and medicine.

STAT 7100: Statistical Methods
STAT 7100 is designed to give students the foundation in statistical methods necessary for further study in the Master of Science in Applied Statistics program. The course begins with a study of statistical distributions (binomial, Poisson, uniform, exponential, gamma, chi-square and normal), descriptive statistics, the Central Limit Theorem, t-tests (one-sample, two-sample and paired) and confidence intervals. The course then moves on to more advanced techniques including categorical data analysis (chi-square tests), correlation, simple linear regression analysis and one-way analysis of variance.

STAT 8120: Applied Experimental Design
Methods for constructing and analyzing designed experiments are considered. The concepts of experimental unit, randomization, blocking, replication, error reduction and treatment structure are introduced. The design and analysis of completely randomized, randomized complete block, incomplete block, Latin square, split-plot, repeated measures, factorial and fractional factorial designs will be covered.

STAT 8210: Applied Regression Analysis
Topics include simple linear regression, inferences, diagnostics and remedies, matrix representations, multiple regression models, generalized linear model, multicollinearity, polynomial models, qualitative predictor variables, model selection and validation, identifying outliers and influential observations, diagnostics for multicollinearity, and logistic regression.

STAT 8310: Applied Categorical Data Analysis
This course will cover methods of contingency table analysis, including data categorization, dose-response and trend analysis, and calculation of measures of effect and association. The students will learn to use generalized linear regression models including logistic, polychotomous logistic, Poisson and repeated measures (marginal and mixed models), and apply these appropriately to real-world data. Statistical software packages such as JMP, MINITAB, and/or SAS will be used.

STAT 8320: Applied Multivariate Data Analysis
Survey course in statistical analysis techniques. Through a combination of textbook and real-world data sets, students will gain hands-on experience in understanding when and how to utilize the primary multivariate methods: data reduction techniques (including Principal Components Analysis and Common Factor Analysis), ANOVA/MANOVA/MANCOVA, Cluster Analysis, Survival Analysis and Decision Trees.

ACS 7010: C++ & Data Structures
A study of the C++ programming language and its capabilities, with a study of computing data structures.

ACS 7030: Relational Database Systems
A study of database systems for data science and analytics, including SQL and working with very large data sets.
From Table 11: Required Courses in Mathematics and Statistics for the Ph.D. in Analytics and Data Science

STAT 8240: Data Mining I
Data Mining is an information extraction activity whose goal is to discover hidden facts contained in databases and perform prediction and forecasting through interaction with the data. The process includes data selection, cleaning and coding, using statistical pattern recognition and machine learning techniques, and reporting and visualizing the generated structures. The course will cover all these issues and will illustrate the whole process by examples of practical applications.
STAT 8020: Advanced Programming in SAS
This course will cover advanced programming techniques using the SAS system for data management and statistical analysis. The topics covered include macro programming, using SQL with SAS and optimizing SAS programs. Upon completion of this course students will be prepared to take and pass the certification test and obtain the Advanced Programmer for SAS 9 certification.
STAT 8330: Applied Binary Classification
Binary classification is a heavily used concept in statistical modeling. Common applications include credit worthiness and the associated development of a FICO-esque credit score, fraud detection, and the identification of manufacturing units which fail inspection. Students will learn how to use logistic regression, odds, ROC curves, and maximization functions to apply binary classification concepts to real-world datasets. This course will heavily use SAS software, and students are expected to have a strong working knowledge of SAS.
STAT 8250: Data Mining II
This is the second course in a two-course sequence on data mining. It emphasizes advanced concepts and techniques for data mining and their application to massive data sets. Building on the knowledge and skills introduced in Data Mining I, this course covers mining patterns from temporal, sequence, and graph data, as well as semi-supervised learning, active learning, boosting and distributed data mining. In addition, support vector machines (SVM), multivariate adaptive regression splines (MARS), and recursive partitioning and its extensions (e.g., bagging, boosting, and random forests) are covered.
STAT 8260: Segmentation Models
This class begins by reviewing classical clustering methods introduced in the data mining sequence. These methods are studied in greater depth and their application in massive data classification and market segmentation endeavors is explored. The second half of this course introduces the use of probabilistic models for segmentation, including mixture and latent class models, among others, and explores their utility and strengths. Segmentation using both continuous and categorical inputs with these methods is stressed. Further emphasis is placed on practical application of these methods when applied to massive data sources and appropriate and accurate reporting of results.

STAT 8270: Production-Level Modeling
This course focuses on the practical use of statistical and data mining models in production-level, massive-data applications. The course focuses on the circular, continuous nature of the model life cycle by studying the planning, development, implementation, assessment, monitoring, and retirement/replacement phases of production-level modeling.

MATH 8010: The Theory of Linear Models
This course provides a solid foundation of the theory behind linear statistical models for continuous responses. Students will learn to conceptualize linear statistical models using matrix algebra. The course begins with a review of the calculus sequence, linear algebra, probability theory, the multivariate normal distribution, and quadratic forms. Some of the topics covered include: simple and multiple regression, parameter estimation and interpretation, hypothesis testing, prediction, model diagnostics, model comparison, and variable selection.

MATH 8030: Applied Discrete & Combinatorial Mathematics for Data Analysts
This course covers applied discrete mathematics and combinatorial tools for data analysts. Topics covered include principles of counting, fundamentals of logic, set theory, mathematical induction, functions, and graph theory. Examples using applied data analysis and associated computing are used throughout.

MATH 8020: Graph Theory
This course introduces standard graph theoretic terminology, theorems and algorithms necessary to the study of large data networks. Topics include graphs, trees, paths, cycles, isomorphisms, routing problems, independence, domination, centrality, and coloring problems. Data structures for representing large graphs and corresponding algorithms for searching and optimization purposes accompany these topics.

ACS 7410: Parallel & Distributed Computing
This course covers different forms of parallelism, including shared memory parallelism (OpenMP), distributed memory parallelism (MPI), and MapReduce (the major distributed framework used in cloud computing), in solving a variety of complex problems.
ACS 7420: Algorithm Design for Big Data
This course covers advanced topics in algorithm design. The focus is on the design of advanced algorithms that scale to big data in a distributed computing environment.
ACS 7510: HPC Infrastructure
A study of high performance computing technologies, including supercomputing, grid computing, cloud computing and other technologies. The course also includes discussion of issues around the design and management of a large data center, data integrity, and data reliability.
ACS 8310: Data Warehousing
Data warehousing and mining are indispensable components of Business Intelligence, which aims to enhance business competitiveness by providing business people with information and tools that are necessary to make critical business decisions. This course will cover major techniques of data warehousing and mining including: dimensional modeling, extraction-transformation-load (ETL), online analytical processing (OLAP), association mining, clustering, classification, and other business intelligence applications.
SECTION A.3: RESPONSES TO PROPOSAL FROM OTHER UNIVERSITIES

1. Dr. Goutam Chakraborty, Professor (Marketing), Founding Director of Graduate Certificate in Business Data Mining, Spears School of Business, Oklahoma State University
I read the document and I like it! You have done an outstanding job to make the case for a formal program in Data Science, which is much needed to meet the demands of the market. I have two main points of feedback:
a) The director's salary needs to be much higher, if you are going to get the right type of person with the skill sets that you identified.
b) I suggest that you consider the degree a DDS (Doctorate in Data Science) rather than PhD. I say this because as soon as you say PhD, everyone will start asking about a more theoretical, research-oriented degree.
2. Shesh N. Rai, Ph.D., Director, Biostatistics Shared Facility, JG Brown Cancer Center; Professor and Wendell Cherry Chair in Clinical Trial Research, Dept. of Bioinformatics & Biostatistics, School of Public Health & Information Sciences, University of Louisville

I think overall the proposal is very good. It will fill an emerging need which is currently addressed by individuals with mathematics, computer science, and statistics backgrounds augmented with interest and experience. The data-driven approach to solving practical problems that you have proposed is, in my view, a novel one.

I am concerned about the employment opportunities for the individuals that complete the program. It seems they will be at a disadvantage to well-trained statisticians, mathematicians, and computer scientists who have the skills the program will build plus the additional skills expected from the specific discipline. The program requires a bit more advanced statistical training, such as an Inference Course and a Theory of Generalized Linear Models (or other appropriate foundational statistical course).

3. Satish Nargundkar, Ph.D., Assistant Professor of Managerial Sciences, Robinson College of Business, Georgia State University
I read the document, and it is quite a thorough job. There is certainly a need for this kind of program in the market. The key question, though, is about the end product: the document mentions the job market that has been favorable to MS students with skills in this area. However, for a PhD, I see the academic world as representing the job market as much as industry. The program should perhaps be a doctorate, like Executive Doctorates in some universities, or an MS in data sciences. If it is to be a true Ph.D., there needs to be a little more emphasis on research methods. Overall, definitely an idea worth pursuing.