MSc in Digital Anthropology
Making of the data scientist profession:
provisional selves, career transitions and the boundaries
between techne and episteme in data analysis roles
Łukasz Alwast
Dissertation submitted in partial fulfilment of the requirements for the degree of MSc in
Digital Anthropology (UCL) of the University of London in 2014
Word Count: 14 254
UNIVERSITY COLLEGE LONDON

DEPARTMENT OF ANTHROPOLOGY
Note: This dissertation is an unrevised examination copy for consultation only and it
should not be quoted or cited without the permission of the Chairman of the Board of
Examiners for the MSc in Digital Anthropology (UCL)
Abstract
Over the past XX years, the term data science has swiftly moved into the
vernacular of scientific and technological vocabulary. As this happened, it signified
a larger phenomenon that is taking place in the sciences and society at large,
namely digitization and datafication of many of the aspects of the world that had
not been quantified and digitized before. This trend seems to have its own, new
acolytes - data scientists. Heralded by the media as the "high-priests of
algorithms" and "the sexiest job of the XXI century", the phenomenon unravels a
more deeply grounded conversation about the establishment of a new profession
in the public milieu, the making of science and scientists, and the evolving nature
of handling and understanding data. Drawing on contributions from science and
technology studies (STS), organizational studies, anthropology, and Internet
studies, this work frames the research around the self-identity of a professional
and group perception of the authenticity and competence of interdisciplinary, XXI
century quantitative analysts.
#datascience #makingofscience #scientists #newprofessions
#selfidentity #provisionalselves #BigData #machinelearning
Table of contents
1. Introduction
2. Methodology
3. Research questions
4. Limitations of the study
5. Framing the research - literature review
- Locating and grounding the term data
- A historical trajectory of the statistics discipline and the data-analyst role
- Big Data - the next frontier
- Provisional selves
- Communities of practice
6. Unpacking the key themes - research and analysis
- Computerization and digitization of sciences
- The data scientist training
- Transitioning from academia to industry
- Interdisciplinary work practice
- Tools of practice
- Evolving nature of the data analyst role
7. Discussion
8. Closing words
9. Bibliography
Acknowledgements
I would like to thank my supervisor Stefana Broadbent for her guidance, patience
and confidence in setting this piece of research on track. I always found her
enthusiasm contagious, which made this creative endeavour much more
invigorating, intriguing, and within my reach.
I am also grateful to Haidy Geismar, my course co-convenor and personal tutor,
whom I could always count on for thoughtful advice and a critical eye.
I also appreciate the help of Ciara Green, my course peer, who dedicated her time
to listening to my rants on data science and proved to be a good, critical listener.
Then there are of course my informants, with whom a number of in-depth
interviews allowed me to investigate my questions in sufficient depth.
And finally, thanks to my mom and dad, for always supporting me in whatever I
decided to pursue.
I. Introduction
Data, as it stands, surrounds us. For computational systems, we, as human
beings, are carriers, herders and interpreters of it. Data, after all, is the
foundation for deriving information - the particular means of insight and
intelligence that enables us to make informed, individual and collective
decisions. Or so we believe.
Some profound changes in this area have been happening over the past
30 years. With the advancement of computational technologies our ability to
collect, share and analyse data (and therefore, information) has changed to a
degree that is historically unprecedented. In fact, according to one of the
corporations that helped set up the infrastructure for this transition, IBM (2013),
"90% of the world's data has been produced in the last two years - and we are
yet to recognize how to harness its potential".
There is little doubt that, amongst other phenomena, technology shapes
our lives (Bijker et al., 1989), and that we, in turn, shape technology (Mackenzie &
Wajcman, 1985). After all, technology, the outcome of making
something (techne), and science, the outcome of thoughtfully pursuing and
understanding (episteme), are inherently linked to one another (Parry, 2014).
As Thomas Kuhn (1963) asserted many years ago, science is inherently about
data, and so, by extension, should be technology. And if science is the initiator of a new
way of understanding the world, it also creates opportunities for doing things
differently. This, unfortunately, often translates in popular discourse into the
simplification that scientific opportunities = money, or data = money, and
there are a number of larger and smaller pitfalls in seeing the world through
such a lens. This is, however, often the reality of technology and business
narratives, and this is why the fairly new concept of data science, and its
acolytes - data scientists - appear so worthy of investigation.
There are, of course, some limitations to what only a few months of
research can capture in trying to unpack such a large phenomenon. This is why
this research aspired to become an ethnographic snapshot, on the level of the
day-to-day craftsmen, of the individuals who are expected to deliver on the
expectations associated with the proliferation of Big Data, ubiquitous sensing
and evidence-driven decision making. A few years back, hardly anyone had heard
of this profession as it now stands.
But circumstances have changed. Since the 2012 Harvard Business Review
article (Davenport & Patil, 2012) proclaimed the data scientist to be "the
sexiest job of the 21st century", the term has become a household name.
As a consequence, one could argue we are witnessing a profession in the making;
and as it is in the making, there seems to be a fair degree of dubiety around it,
showcasing intriguing aspects of how a professional identity is being shaped, how
communities of practice could be formed, what seems to be the spirit of the times
in academic and technology research, and what could be the implications for the
individual, the organization one works for, and society at large.
This dissertation follows a classical structure for unpacking
subsequent questions. The very beginning depicts the genesis behind choosing
this particular topic and approaching it from this - and not another - angle. It is
important to make it clear that investigating a community which is ill-defined,
dispersed, hard to access and, above all, diverse would be challenging
for a four-month long, single-location (London) ethnographic study. With
this in mind, I have strived to ground the research and analysis in its historical
context, and conducted a number of in-depth interviews with individuals pursuing
the role of the data scientist and individuals on the boundaries of the
profession. In addition, I participated in a number of meetings, which were
taken to be the best reflection of the seed of a data scientists' community
(precisely, the Data Science London meet-ups), and analysed online conversations
and media accounts. The research also draws strongly on my experiences,
observations and thinking as an individual who had the opportunity to work with
and amongst people who acquired the professional title of data scientist. It
seemed inappropriate, however, to pursue an ethnographic study of an
organization I was actively a part of, so as not to impact its internal dynamics.
Preliminary consideration of these circumstances led me to believe that a
good way to frame the research would be through the lens of: (i) data as the
foundation of modern day decision-making (and to a degree, society constructed
around this assumption), (ii) the data analyst as the historical key-bearer to
deriving insights from data, (iii) Big Data seen as the next step to a new
paradigm of tools and methodologies associated with data analysis, (iv) inherent
links between science, academia and advanced data analysis training, (v) the
making of science and scientists in the midst of digitization, (vi) establishing
provisional selfhood, self-identification of professional identity, authenticity of
competence, and (vii) the process of establishing a community of practice.
This research was an interdisciplinary endeavor that involved inherently
cutting through a number of academic disciplines to better understand what
might be the forces at play. The sections on data, Big Data and the data analyst
profession are informed by research stemming from historical accounts of
statistical and computing disciplines, information sciences and Internet studies.
The latter sections, the making of science and the data scientists and their
toolkits, are strongly rooted in the tradition of science and technology studies
(STS), philosophy of science, sociology of expertise and material culture
anthropology. The deliberations on the provisional selfhood, professional
self-identification and communities of practice have strong links with social identity
theory, organizational studies and social and cognitive anthropology.
Following this, the study then becomes more insights-grounded and
analytical in terms of using the recognized findings for informing the argument.
Ethnographic analysis led to a far larger number of interesting areas of
investigation than this study could possibly have covered. It was therefore a
deliberate decision to distil and link the emergent themes around three
underlying research questions.
What constitutes a professional self-identity of a data scientist?
How does the job of the data scientist fit into the larger picture of
the making of science and the evolution of the data analyst role?
Where do data scientists sit along the changing nature (and
understanding) of knowledge associated with digitization of data
and Big Data?
To answer these questions, it seemed critical to start with a commentary
on the proliferation of computerization and digitization within sciences. This has
been something that all of the involved informants, as well as expert
commentaries and literature, recognized to be a phenomenon transforming the
world of science, in particular through the introduction of machine-learning
methodologies and tailor-made, computational data analysis tools. Secondly, it
needed to be underlined that the data scientist is, in the opinion of many, still a
scientist; therefore academia - mainly quantitative and computational scientific
training - seemed to be a key part of nurturing the skills and capabilities which
would then allow one to fit the role.
It has been observed that this process of training has experienced
external winds of change. Academia is often not best suited to provide all of the
required skills and knowledge, hence a myriad of alternative sources of training
are increasingly emerging - especially through online learning platforms and
industry-linked fellowships. Precisely for that reason, the transition from
academia, and the decision on the new, more industrial / tech-entrepreneurship
career path, seems to be an important choice and part of the data scientist's
professional identity. The job tasks, the tools and the methodologies the data
scientists pursue are part of this picture.
Data scientists, however, are also an element of a larger organizational
puzzle; the data scientist is often expected to leverage the organization's data
capabilities and play an important role in establishing good practices around
data-literacy. This opens the lid to a rich conversation around the evolution of
the data analyst role, its place in modern society and implications on how it is
designed. Naturally, this is merely a snapshot of a much larger conversation
which is already happening.
Finally, as in any formal piece of academic research, the analytical part is
followed by a discussion that picks, unravels and comments on the key themes of
this study. This required: (i) bringing together the arguments on the establishment of an
interdisciplinary profession happening on the boundaries of quantitative and
computing sciences, (ii) recognizing the importance of training in pursuing the
role of a scientist within an organization (often associated with objectivity and
an evidence-led approach to solving challenges), and above all, (iii) understanding
the role of data, computing and epistemological education in translating the
expectations placed on Big Data into day-to-day tools, processes and
practices.
All of this takes place within an ongoing conversation and surrounding
semantic tension around the terms "data science" and "data scientist", and the
organizational and institutional changes this entails for the future of work and
decision-making - making it highly pertinent for the growing body of knowledge
within the field of digital anthropology.
II. Methodology
Research for this study did not start with a pre-assumed research
question. It began with an iterative process of probing and exploring what would be an
appropriate angle to unravel one's interests, recognize compelling questions and
seek out relevant, transferable knowledge. Building on those interests (the
processes of long-term socio-technical development, broadly understood
innovation, and social perception (and expectations) of scientific and
technological change), I was looking for a phenomenon that would merge those
themes together - and data science and data scientists appeared as timely and
worthy candidates.
For the requirements of an ethnographic endeavour, however, this was not
going to be an easy task. Data scientist roles in organizations are still fairly
scarce, isolated, and highly industry-specific. Companies and recruiters are
outstripping each other in trying to acquire talent, and, if they are successful,
those individuals often work on some of the more critical aspects of
organizations' processes, in many cases highly confidential and sensitive. It was
therefore very difficult to convince any of the individuals I had in my network,
or in their networks, to let me pursue an organizational ethnography and investigate
their organization as a research field. Limiting the research to one organization
would also be a danger in itself; therefore a decision was made to conduct the
research with a number of informants from different organizations, focus on
them as individuals (rather than their organizations), and collate ethnographic
insight from an accessible field site, for which Data Science London appeared to
be a good candidate.
Inspiration for pursuing the research through such an approach emerged
from anthropological accounts of researchers who historically also tried to follow
either scientific or ICT-heavy communities, from Latour and Woolgar (1986
(1979)), Levy (1984), Latour (1987), Miller and Slater (2001), Biao (2006) and Kelty
(2008), to Coleman (2013). Advised by my supervisor, I started exploring the
issue by having explorative conversations with people involved in the field: a
researcher exploring how people interact with technologies in health and life
sciences, and a data scientist working in one of London's top data science
agencies. This was accompanied by participation in meetings of the Data
Science London community, taking part in hackathons and networking events.

These first encounters made me feel confident that, as a researcher, I would be
able to gather adequately diverse accounts of the topic, to allow the study to be
rich in insightful content. A result of this was a working hypothesis that data
science was an emergent profession, but due to its interdisciplinary character
and tech-pushed origins, quite a nebulous term by its nature - also for those
claiming to be data scientists themselves.
The key group of informants for the study were eight individuals with
whom in-depth semi-structured interviews were pursued in Spring and early
Summer 2014. The first group was composed of individuals who had acquired job
titles of data scientists just months before the study began. Another group of
informants was composed of individuals who were data analysts in different
organizational settings, e.g. a post-doctoral astrophysicist, a statistician with 15
years of industrial experience, or an economic research fellow. The third group
were individuals actively engaged in shaping (Data Science London) or
researching the community.
During the period of this study - and throughout the digital anthropology
master's program - I worked in an organization that was actively hiring and
building a data science team. However, due to the nature of being an actor in
an organizational setting, and the fact that the act of pursuing ethnographic
research might change the dynamics of my social position and relationships
within such an environment (Berg & Lune, 2011), I made a deliberate decision that
this organization would not be a field of the study. Without doubt, however, my
experiences and observations during that time helped inform how the
research was framed and which themes would be selected for deeper
investigation, thus influencing the discussion and reflection on this subject.
Thirdly, my participation in Data Science London meet-ups led to a
number of conversations, observed practices and behaviours that were very
informative for understanding how a community self-organizes, how individuals
perceive their membership in such a community, how they self-identify in their
relationship to this community and what motivated them to invest their time
and attention into it. At certain points, I was tempted to conduct a more
in-depth analysis of the community itself; however, due to the dispersed regularity
of the meetings - once every 1.5 to 2 months - the analysis would not have had
adequate depth. Here, rather than focusing on the self-identity of data
scientists, I deliberately chose to focus more on the meet-ups as events
expressing a particular, local manifestation of an aspiring community of practice,
rather than as a larger picture of a profession.
Finally, the study was also complemented by an on-going literature review
and analysis, both of historical accounts and media stories as they were
emerging. It is worth keeping in mind that this field attracted significant
attention throughout the year of this study (2013-2014). A number of new public
and private institutions legitimizing the term data science emerged both in the
UK and the US (e.g. Imperial College Data Science Institute and Data Science at
NYU) and led to a number of discussions and conversations around this topic - for
example, a meeting at Imperial College titled "A Data Scientist is a statistician
who lives in Shoreditch (?)" (Data Science Institute, 2014).
This study might also have benefited from applying ethnomethodological
approaches to the subject; however, due to the confidentiality of
the work pursued by some of the questioned data scientists, and their limited
pool, this approach had to be abandoned and the study restricted to a number of
semi-structured interviews. For the purpose of a more in-depth study, on a larger
group of informants, ethnomethodology would be highly recommended for
triangulation purposes.
III. Research questions

Direct exposure to the phenomenon in its making, the literature review and
emergent framework of analysis led to the following research questions:
What constitutes a professional self-identity of a data scientist?
How does the role of the data scientist fit the larger picture of the making of
science and the evolution of the data analyst role?
Where do data scientists sit along the changing nature (and understanding) of
knowledge associated with digitization of data and Big Data?
IV. Limitations of the study

Exploring the complexities of data analysis roles, data scientists themselves,
and data science as a phenomenon, one could easily recognize a microcosm of
trends and tensions that reflect the dynamics of society at large. However, the
limited scope of this study did not allow for investigation of the relationship
between those issues and data science in more detail - these are some of the
exemplars worth highlighting:
STEM (science, technology, engineering and mathematics) disciplines and
the gender gap - participation in data science meet-ups, conversations with
data scientists and statistical data on female participation in STEM
professions (U.S. Department of Commerce, 2011) re-affirm that women are
under-represented in this profession. This links
closely with the issue of biased group inclusivity, in-group favouritism and the
gender-biased perception of competence (Moss-Racusin et al., 2012) and
might also have impacted this study.
Values, beliefs and legacies of the open and free software movements - the
popularity of data science can be associated with the proliferation of tools
developed in the spirit of open software, which allowed the tackling of
increasingly sophisticated data questions. In fact, the Data Science London
organizers included in their mission statement the following claim:
"dedicated to free, open, dissemination of data science and promotion of
open source and open data" (Data Science London, 2014). This is resonant of
the concept of recursive publics and the maintenance of affinity introduced,
amongst others, by Christopher Kelty (2008) and sits closely to the notion of
geek cultures that data science clearly overlaps with. The relationship of
the two would be worthy of a separate investigation that this study could
unfortunately not address.
Pro-innovation bias - as data science thrived alongside the proliferation of
the concept of Big Data, there is a danger that it might become a victim of
technological hype and a not yet well-established body of critical literature - a
phenomenon named by Rogers (2010 (1962)) as pro-innovation bias. Critical
studies of Big Data (boyd & Crawford, 2013) are in themselves a theme that should
have strongly impacted the nature of this research. Although this critique is
acknowledged throughout this study, it does not constitute the core avenue of
argumentation.
Monetization of data - there seems to be an on-going argument from a
number of technology enthusiasts that more data equals more money,
more innovation and better policy (Gartner, 2011). This is often
accompanied by a rush for immediate extraction of financial value from
any data, which authors such as David Harvey (2007) would likely argue
to be linked with the persistence of economic liberalism within modern
political economies. Big Data indeed opens opportunities for economic
benefits; however, the fact that data science seems to have emerged from
within the American tech-industry bubble also has substantial implications for
its wider perception and interpretation that this study needs to acknowledge.
Keeping these few examples in mind, it is worth emphasising that this study was
also constrained by its short time scale (4 months), single location (London), and
limited access to accounts of the informants' tasks and routines (due to
the often-confidential nature of their work). These are, however, limitations
well known to qualitative and ethnographic research (Berg & Lune, 2011);
therefore, where possible, historical and expert insights were used to
complement this picture.
V. Framing the research - literature review
At an early stage of the research process it became clear that the nature
of the research would require insights from a range of the sciences, not
restricted solely to anthropology, nor solely social sciences. For that reason the
analytic framework had to build on some more classical aspects of the body of
knowledge derived from theories of socialization (Mead, 1934; Tajfel & Turner,
1986), studies of professional identity (Becker & Carper, 1956; Wilensky, 1964;
Abbott, 1988), and situated learning (Lave & Wenger, 2008 (1991)) - all of which,
throughout the years, attracted attention from social and cognitive
anthropology, psychology and organizational behaviour studies. STS
literature (Latour 1979, 1987; Poovey 1988) and the anthropology of policy and
bureaucracy (Shore, 1997; Riles, 2010) also proved to be supportive in thinking
about the constitution of new fields of expertise and new kinds of fact.
Without doubt, due to the nature of the analysed community and
overlapping boundaries between different disciplines, it was necessary to refer to
perspectives from quantitative disciplines - statistics, mathematics, economics -
and computing disciplines - computer sciences, information retrieval sciences,
machine learning and artificial intelligence (examples including: Cleveland,
2001; Friedman, 2001; Varian, 2014). The larger picture of domains where data
science is applied, i.e. the life sciences and data analysis sectors such as
finance, business and quantitative policy (Mattman, 2013; Pentland, 2014), also
seemed vital. Needless to say, the topic of Big Data, which developed
alongside data science, also attracted significant and very interesting scholarly
work from researchers in Internet studies (Mayer-Schönberger & Cukier, 2013),
communications studies (Parks, 2014), digital anthropology (Boellstorff, 2013),
digital humanities (Manovich, 2011) and sociology (Ruppert, 2013).
This is why the literature review will be cross-cutting in its nature,
pointing to contributions and sources available amongst the different areas
discussed above. Deeper commentary will be given only to those positions that
have been identified and judged to be important for supporting the research
questions and revealing the larger picture of conversations that take place
within this topic.
The first part of the literature review will briefly introduce the term
"data", and show an interesting historical trajectory of two disciplines, namely
statistics and computer science, which have always tackled data from a
perspective relevant for the profession of the data scientist. The latter part of
the literature review will introduce the body of literature on professional
self-identity, organizational socialization and provisional selves. This will correspond
closely with the section on the characteristics and dynamics of communities of
practice, and the forms of practices, tools and behaviours that make a group and
the individuals within it both socialized and distinctive.
V. I. Locating and grounding the term data.

The way the term data is used by different groups varies widely; words
change their semantics due to a confluence of social, cultural and linguistic
factors (Puschmann and Burgess, 2014: 1692), and almost every discipline and
disciplinary institution has its own norms and standards for the imagination of
data (Gitelman & Jackson, 2013: 12). Historically, the word data was derived
from the Latin data, the plural of datum, which, in association with the verbal
form dare, translates into something given. A thoughtful investigation of the etymology of
the term conducted by Puschmann and Burgess (2014) suggests that the earliest
uses of the English word data in theoretical and mathematical context were in
the 17th century, in reference to mathematical variables and descriptions of
historical events. The principal sense of data shifted during the 18th century
from anything widely accepted as given, granted, or generally known, to a result
of experimentation, discovery or collection (Rosenberg, 2013). The usage of the
word increased over the 18th and 19th centuries, establishing itself firmly in
economics and administration beyond its earlier use only in mathematics and
natural philosophy. The term became entrenched in science, business and administration
in the 19th and 20th centuries, when both its frequency and contexts of use expanded
significantly (Puschmann and Burgess, 2014: 1692). It was in the 1940s that the
earlier uses were supplemented with the use of the word data to describe any
information used and stored in the context of computing. With the shift from
paper record to digital information, data was increasingly used to refer to digital
objects that could be manipulated using a computer rather than generally
accepted facts or outcomes of experimentation or observation. As computing
matured, data increasingly left laboratories and offices to play a role in new,
domestic and public environments.
An interesting argument elaborated on by Puschmann and Burgess (2014:
1693) suggests that data stored as a piece of digital information marks a
departure from previous understandings of the term. In its past meaning, the
processes of giving and interpreting appeared to be highlighted, whereas in the
more recent meaning, data seems to come into being by acts of recording. As a
result of this shift, the most pronounced difference between the two is the
aspect of agency in data creation. In the past, data was mostly associated with
the role of the statistician, or sometimes more broadly, the data analyst, and
today it is much more grounded in the design and operations of computational
systems.
V. II. A historical trajectory of the statistics discipline and the data analyst role
Individuals and institutions have gathered data for centuries. Among the
best documented examples in ancient history are Herod's gathering of census data in
Palestine and the sophisticated tax-collection system imposed by the Roman
Empire. However, the modern institutionalization of data gathering and data
analysis is associated with the emergence of national statistical offices (18th
century - mostly UK, Sweden, Netherlands, Prussia) and of the methodological
science that diverged from mathematics - statistics (Rosling, 2010). Leaping forward
to 1979, the American Statistical Association (ASA) - the leading US
association for statisticians - organized a conference focused on "The Analysis of
Large Complex Data Sets" to address the issue of increasingly larger data sets
and the lack of appropriate tools and knowledge to tackle those new volumes of
data. Almost twenty years later, history repeated itself; in 1997 another ASA symposium
was themed "Data Mining and the analysis of large data sets", and today (in 2013
and 2014), industry conferences such as O'Reilly Strata choose as their main
theme "Making Data work - Big Data" (O'Reilly, 2014).
Despite these clear historical loops, a number of respectable scholars and
pundits (Friedman, 2001; Anderson, 2008; Manovich, 2012; Mayer-Schönberger &
Cukier, 2013) argue that there has been a considerable change in the nature of
thinking about and dealing with data in recent years. Back in the 1970s and
1980s, large and complex data sets were rare and little need was seen to analyse
those few that did exist. Data was collected manually, and the cost of collecting
it was closely associated with its volume, resulting in data collection being,
throughout the whole process, fairly expensive. This changed as
computerization entered the space, and the cost of setting up
data collecting infrastructures decreased to a point that gave new
opportunities to entities that were not able to use these types of solutions
before. Needless to say, in extreme cases - like the NSA or Google data farms
(Forbes, 2014) - even today's data infrastructures can be very expensive to
operate.
Not surprisingly, individuals in the field of statistics have repeatedly been
asking themselves the question: what is the role of statistics in the data
revolution? Friedman himself, a statistics Professor at Stanford, argued over a
decade ago that the idea of learning from data has been around for a long time,
but the interest in analysing these large and complex data sets has only
recently [2000s] become so intense (Friedman, 2001: 5). He associated this
with the development of novel database management systems where large
quantities of data resided and which, as a result, provided fertile ground for data
mining approaches: the processes of analysing data for purposes other than those for
which it was collected (Friedman, 2001: 6), shifting the usual application of data
from transaction processing to decision-support.
This argument, dating back to 2001, is worth remembering when looking
at current conversations around Big Data, including a question Friedman posed
about data mining: although data mining appears to be a viable commercial
enterprise, one can ask whether or not it qualifies as an intellectual discipline
(Friedman, 2001: 7).
In Friedman's words, data mining is not yet an
intellectual discipline, but in the future "almost certainly [it will be]", and one
can predict a big intellectual and academic future for the new data mining
methodologies that will emerge (2001: 7). At that time, data mining packages were
already incorporating well-known procedures from the fields of machine
learning, pattern recognition, neural networks and data visualization. And of
course, some questions remained unanswered - should statistics stick to what
it is good at (i.e. probabilistic inference based on mathematics), or ought it be
concerned with a set of problems, rather than tools? An important remark in
Friedman's argument was that statisticians will first and foremost have to
make peace with computing, as this is where the data is. If computing is
to become one of the fundamental research tools, then the community will
have to teach, or be sure that students learn, the relevant Computer Science
topics, and some basic paradigms of the field will have to be modified
(Friedman, 2001: 9). This thought neatly corresponds with what Hal Varian
(2014), Professor at the iSchool at Berkeley and Chief Economist at Google, argues
about the modern training of economists - in particular, econometricians - and
the type of skills and tools they need to start acquiring from their computer
science comrades.
This observation leads to conversations about career prospects in data
analysis roles. Friedman argued (2001: 9) that up until around the 2000s, if
someone was interested in data analysis, then statistics was one of the very few
(even remotely) appropriate fields to work in. In 2013/2014, this is no longer the
case. There are many other exciting data-orientated sciences that are
competing [with statistics] for customers, students, jobs and even [our own]
statisticians (Friedman, 2001: 9). Even prominent statisticians are becoming
more interested in researching problems embraced by other fields, and prefer
to work or publish in other areas (Friedman, 2001: 9). This is
a very important issue for locating the data scientist's profession in the larger
data analyst perspective. For Friedman, this brain drain of students and
researchers away from statistics represented the most serious threat to the
future of the discipline, requiring a profound re-examination of its place amongst
the information sciences.
The entirety of Friedman's argument, made in the midst of the
Internet bubble in 2001, corresponds very well with the thinking of William
Cleveland, Statistics Research Fellow at Bell Labs, who also in 2001 called
for "An Action Plan for Expanding the Technical Areas of the Field of
Statistics", under the new label of data science (Cleveland, 2001). Cleveland
expected that soon computer science [will] join mathematics as an area of
competency for the field of data science, enlarging its intellectual foundations,
and, more importantly, will carry statistical thinking to subject matter
disciplines. This resembles, one could argue, a re-branding of more
sophisticated data analyst roles, from now on requiring computational skills to
transcend other analysis-intensive disciplines.
For this reason - explored later in the research and analysis section - it is
worthwhile to inquire how the data scientists themselves perceive this
phenomenon, and what causes them to believe what they do.
V. III. Big Data - the next frontier

A number of respected individuals and organizations argue that the
potential benefits and costs of using large volumes of data - e.g. for analysing
genetic sequences, social media interactions, health records, phone records,
government records, and other digital traces left by people - are significant, but
still not sufficiently explored (boyd & Crawford, 2013: 663; Mayer-Schönberger &
Cukier, 2013). According to Manovich (2011), the term Big Data has been used,
mostly in the sciences, to refer to data sets that are large enough that they
require supercomputers. However, what once required customized machines can
now be analysed on desktop computers with freely available software. This fact
reiterates that it is not the hard infrastructure (as it often was in the past) that
defines Big Data, but the practices, skills and tools used on more mundane
levels of interaction. The question is, however, whether mundane devices
are, as material objects, enough to conduct the analysis required for Big Data.
boyd & Crawford argue (2013: 663) that Big Data is less about data than it is
about the capacity to search, aggregate, and cross-reference large data sets.
They define Big Data as a cultural, technological and scholarly
phenomenon that rests on the interplay of:

Technology - maximizing computation power and algorithmic accuracy to gather, analyse, link
and compare large data sets.
Analysis - drawing on large data sets to identify patterns in order
to make economic, social, technical and legal claims.
Mythology - the widespread belief that
large data sets offer a higher form of intelligence and knowledge that can generate insights
that were previously impossible, with the aura of truth, objectivity and accuracy.
Source: boyd & Crawford, 2013: 663
Looking at the etymology of the term, according to Steve Lohr's
investigation on behalf of the New York Times (Lohr, 2012), 2012 was the
breakout year for Big Data as an idea, in the marketplace, and as a term,
though its origins had hardly been explored before. Working with Francis
Diebold, an economist from the University of Pennsylvania, Lohr traced
the first reference to Big Data to a 2003 paper titled "Big Data
Dynamic Factor Models for Macroeconomic Measurement and
Forecasting" (Diebold, 2012). However, after some further investigation, it
appeared that the term Big Data probably originated in the lunch-table
conversations at Silicon Graphics, a high-performance computing manufacturer,
in the mid-1990s, in which John Mashey, its chief scientist, figured prominently
(Diebold, 2012), and has since, with a significant uptake from 2007, gained
traction within the computer-software industry.
This is re-affirmed in the research by
Puschmann and Burgess (2014) who argue that the genesis of the term Big Data
lies firmly in the business world. Although the early discussions on data
processing technologies in business closely reflected this necessity (for new
tools, allowing companies to deliver faster search results or store larger volumes
of customer data) it has since evolved into a conversation centred around using
collected information for analytical purposes, specifically for predictive
modelling (Puschmann and Burgess, 2014: 1691).
According to boyd & Crawford (2013: 663), computerized
databases, the core of what Big Data stands for, are not new. The US Bureau of
the Census deployed the world's first automated processing equipment in 1890,
with relational databases not emerging until the 1960s. It was personal computing
and the Internet that made it possible for a wider range of people - including
scholars, marketers, governmental agencies, educational institutions and
motivated individuals - to produce, share, interact with and organize data. One
could argue that this increased computing power, and the development of
appropriate tools in the 2000s, were the factors that contributed to the
diffusion of this argument.
Big Data, as boyd & Crawford also argue (2013: 664), is associated with a
possible change to the definition of knowledge. The introduction of Henry
Ford's manufacturing system of mass production in the 20th century - using
specialized machinery and standardized products - quickly became the dominant
vision of technological progress, whilst Fordism produced a new understanding of
labour, the human relationship to work, and society at large. According to boyd
& Crawford (2012: 665), Big Data has emerged as a system of knowledge that is
already changing the objects of knowledge, while also having the power to inform
how we understand human networks and community.
Change the instruments, and you will change the entire social theory that goes with
them
Source: Latour, 2009 in boyd & Crawford, 2012: 665
This change is mostly experienced at the layers of epistemology and
ethics - Big Data is said to reframe key questions about the constitution of
knowledge, the processes of research, how we should engage with information,
and the nature and categorization of reality (boyd & Crawford, 2012: 665). This
follows a tendency to assume that massive amounts of data, along with
applied mathematical and computational applications, will replace every other
tool that might be brought to bear (Anderson, 2008 in boyd & Crawford, 2012).
In a way, the question that arises from this is: how might the harvesters of Big
Data change the meaning of learning, and what new possibilities and new
limitations may come with these systems of knowing? With little doubt, this is
one of the kinds of questions data scientists are expected to cope with on a
recurring basis.
V. IV. Provisional selves

Becoming a data scientist is a fairly recent phenomenon, blossoming in
the last 3-4 years, and increasingly acquiring wider public attention. This leads
into questions about the nature of self-identity and self-adaptation to a
professional role, or as Herminia Ibarra (1999: 764), a Harvard organizational
psychologist, calls it - "provisional selves".
Professional identities are defined as relatively stable and enduring
constellations of attributes, beliefs, values, motives and experiences in terms of
which people define themselves in a professional role (Schein, 1978 in Ibarra,
1999: 765). This process of becoming a professional was well described in
sociological literature throughout the 70s and 80s (Hall, 1968; Wilensky, 1964;
Krause, 1971; Montagna, 1977 in Adams & Kowalski, 1980). Professional
identities form over time with varied experiences and feedback that allow
people to gain insight about their central and enduring preferences, talents and
values. For some professions - for example engineers, doctors, or architects -
official professional association is granted as a result of certification, restricted
membership or educational accreditation. For the time being, what makes a
data scientist a Data Scientist remains nebulous. There is no professional
association of data scientists, no accredited certification, and still few -
although a growing number over the last few years - educational programmes
leading to a degree in data science.
This argument corresponds interestingly with the work of Gil Eyal (2013),
sociologist at Columbia University, who argues that the sociology of professions
is in fact a story of the past and the sociology of expertise is a more timely and
comprehensive way to capture the changes of today's world. In his words, the
sociology of expertise maintains an analytical distinction between experts and
expertise as two irreducible models of analysis, treating expertise neither as an
attribution, nor a set of skills, but as a network connecting actors, instruments,
statements and institutional arrangements (Eyal, 2013).
Following from that, a straightforward question seems to be whether it is
the "data" or the "scientist" part of the role which has a stronger influence on
professional identification, or if it is something different. Ibarra suggests that in
assuming new roles, people not only acquire new skills but also adopt the social
norms and rules that govern how they should conduct themselves (Schein, 1978 in
Ibarra, 1999: 765). Practices and social norms of scientists in a lab setting were
already well investigated in a seminal study by Latour and Woolgar in 1979
(1986). However, it is quite clear that today's labs, due to digitization and a
number of other social processes and institutional changes, represent a very
different environment. The question "what is the field of the data scientist?"
is also very poignant - a perspective that might be different depending on who
asks the question.
Data science, as a relatively new and undefined job title, often pushes
individuals into situations that require new skills, behaviours, attitudes and
patterns of interaction that can produce fundamental changes to an individual's
self-definition (Ibarra, 1999: 766).
Not surprisingly, this phenomenon is
particularly relevant for data scientists as these individuals often transition from
academic/research backgrounds into industry, where the dynamics and
challenges of the environment require different practices. Identities have long
been seen as constructed and negotiated in social interaction (Mead, 1934;
Goffman, 1959) and socialization is not a unilateral process imposing conformity
on the individual. It is a negotiated adaptation by which people strive to improve
the fit between themselves and their work environment.
People often make
identity claims by conveying images that signal how they view themselves or
hope to be viewed by others, but it is unclear to what degree data scientists
remain part of their past, scientific role, and to what degree they take on a new
role of data analyst/lead scientist/consultant, as is often expected of them. Not
without significance is the self-perception of authenticity; that is, the degree of
congruence between what one feels and communicates in public behaviour
about his or her character or competence (McIntosh 1989, in Ibarra 1999: 778).
For the data scientists, this is an area where the concept of situated learning
and communities of practice falls neatly into place.
V. V. Communities of practice
Communities of practice (CoP) is a concept developed at the beginning of
the 1990s by Jean Lave and Etienne Wenger, who proposed a new model of
learning, described at the time as situated learning theory (Lave & Wenger,
1991). The concept was a critique of earlier cognitivist theories of learning:
knowledge was said to be primarily not abstract and symbolic, but provisional,
mediated and socially constructed (Berger and Luckmann, 1966; Blackler, 1995).
Situated learning theory positions communities of practice as the context in
which an individual develops the practices - values, norms, relationships - and
identities appropriate to that community. It differs in some aspects from
theories of socialization (Vygotsky, 1978) as it calls attention to the possibilities
for variation and intra-community conflict. Following this, learning is described
as an integral and inseparable aspect of social practice that involves the
construction of identity through changing forms of participation in communities
of practice - based mostly on processes of participation, identity-construction,
and practices (Handley et al., 2006).
As Wenger argued, participation refers not just to local events of
engagement in certain activities with certain people, but to a more
encompassing process of being active participants in the practices of social
communities and constructing identities in relation to these
communities (Wenger, 1998: 4). Therefore, participation is not seen just as a
physical action or event, but it involves both action (taking part) as well as
connection (Wenger, 1998: 55).
This implies the possibility of mutual
recognition and the ability to negotiate meaning, but does not necessarily entail
equality, respect or collaboration.
A particularly intriguing aspect of participation is how members of a
community gain status within it. In their early works, Lave and Wenger
(1991) suggested that there is a distinction between a core and peripheries, and
that it is through continuous participation that one gains recognition or moves to
the centre. They have, however, deviated slightly from this opinion since then and
acknowledged that participation may involve learning trajectories which do not
lead to full participation (Handley et al., 2006: 644). This is
an important point to note in respect to the interviews from this study.
Another important aspect of CoP is identity. The concept of identity rests
on critical readings of social identity theory (Handley et al., 2006: 644); but,
according to Lave and Wenger (1991), learning is not simply about developing
one's knowledge and practice, but also involves a process of understanding who
we are and in which communities of practice we belong and are accepted. Two
main processes of identity construction in a workplace are identity-regulation
and identity-work. According to Handley et al. (2006: 644), the first process
refers to regulation originating from the organization (e.g. recruitment, induction
and promotion policies) and the employees' individual responses. The second
process of identity-work refers to employees' efforts to form, repair, maintain or
revise their perceptions of self, and this involves a negotiation between the
organization's efforts at identity-regulation (which the employee may, or may
not, internalize) and the employee's sense of self, derived from current work as
well as other identities (Handley et al., 2006: 645) - all highly relevant for data
scientists in their working environments.
The third and final aspect of CoP is practice, which according to
Brown and Duguid (2001: 203) is an undertaking or engaging fully in a task, job
or profession. After all, by participating in a community, newcomers develop an
awareness of that community's practice. They come to understand and engage
with - or adopt and transform - various tools, language, role-definitions and
other explicit artefacts and implicit relations. For data science, as an
intersection between science, computing and data analysis, this is particularly
interesting, as tools can be very defining of a profession, allowing for formal and
informal coordination and exchange of knowledge to be identified.
Finally, it is critical to note that communities of practice are not
homogeneous, but differ across several dimensions - geographic spread, lifecycle
and pace of evolution. Individuals may participate to a different degree in loose
networks of practice both across and beyond organizational boundaries, but
according to Handley et al. (2006: 646), it is in relation to these communities
and networks that individuals develop their identities and practices through
processes such as role modelling, experimentation and identity-construction. An
individual's continual negotiation of self within and across multiple
communities of practice may generate intra-personal instabilities within the
community. An example of this is a scenario where a newcomer experiences a
conflict of identity in relation to a role or practice he or she is expected to
adopt (Ashforth and Humphrey, 1993) - a case that data scientists are
particularly exposed to, as they enter new organizational environments with
often inflated expectations about the nature of their work.
VI. Unpacking the key themes - research and analysis
VI. I. Computerization and digitization of sciences
I believe data science is more about the science bit, than the data
- Data Scientist [1], working in industry
The last two decades have seen significant changes to the way we use
computational tools in modern workspaces (Brynjolfsson & McAfee, 2014;
Pentland, 2014). The Internet, email, cloud services and mobile phones are only
a few manifestations of this phenomenon. Alongside consumer products, big changes
have also been happening in the world of artistic and scientific crafts - namely,
the worlds of design and science.
A good example for design are architects, who are less and less
trained to be proficient at the drawing board, and instead master the use of
design software such as AutoCAD, Adobe Suite and the like. Another example
are surgeons, who increasingly need to become proficient in using tools that
allow them to conduct remote, robotic surgeries. These changes are also
reaching the sciences. The basic tools of scientific practice have changed too -
in many cases, today, a laptop connected to the Internet and appropriate
research software is enough to pursue multiple scientific inquiries. This has
an impact on conducting both qualitative and quantitative research, in the majority
of disciplines ranging from the (digital) humanities to (digital) sociology and
Internet studies, to molecular biology, nanomedicine and computational
statistics. This phenomenon, at its macro-level, seems to be key to
understanding the context in which data science has come to life, and how it
reflects the current spirit of the times.
Data is, after all, at the core of all sciences. The scientific pursuit is
perceived as one of the most rigorous ways of conducting research, as it
crunches and critically evaluates data. It is therefore difficult to imagine data
without science, or science without data. Especially, as scientific processes and
practices are increasingly taking place within the sphere of the digital, it is
increasingly difficult to see both science and data outside of a digitized context.
As a consequence, computational literacy is increasingly becoming a key factor
for scientific careers - either by making the research more sophisticated, or
leveraging scientific communication and engagement. It is therefore not surprising
that a number of disciplines are increasingly considering whether they should
improve their own, intra-disciplinary tools, or reach out to other disciplines to
borrow, apply and build on their tools.
This process of continual learning, swapping and experimenting with tools
seems to be at the heart of data science. In a number of public discussions that
are happening around data science, there is an agreement that it involves
matching skills from the computational sciences with statistical and
mathematical inference, and applying them to certain domain challenges
(Rauser, 2011). There are some limitations to this approach. Proficient training
in scientific domains - e.g. chemistry, regenerative medicine, fluid dynamics,
economics - is already in itself resource-consuming, with additional
computational training adding to this complexity. To some degree, this is why
computer science and machine learning skills - usually associated with data
scientists - are blending into other scientific tool kits, often perceived as
additional data-crunching resources. This phenomenon is well illustrated by
one of the informants:
in cosmology there are enormous amounts of data to tackle; it's not a controlled
environment and there's pressure on taking as much data as there is possible - that's where
machine learning becomes useful, when you don't have enough information, or when you're
dealing with noisy information (...) but there's also a danger, sometimes machine learners are
satisfied with answers which are good enough, which is OK for industry - e.g. a web search in
Google - but often not enough for science
- Astrophysicist, dealing with Big Data from space measurements
It is a convincing argument that the influence of computerization and
digitization on the digital literacy of the workforce (and, as a result,
decision-making) might have far-reaching organizational and institutional consequences
(Simon, 1965; Mayer-Schönberger & Cukier, 2013; Brynjolfsson & McAfee, 2014).
At the same time, this phenomenon reveals interesting tensions between making
(techne) and understanding (episteme) in data analysis jobs. The recurring
question seems to be how technically or scientifically literate the members of
the group, or their decision makers, need to be to make well-informed and
well-understood decisions. There is something in how this literacy/expertise is
constituted as a kind of political and organizational process. It could be an
expression of a wider, historically well-known trend of bringing
scientification into working environments - be it in the form of data-driven or
evidence-based decision making - and seems to sit in parallel to the ongoing
process of automation of work. This suggests data science is part of a larger
conversation around data literacy, expertise and skills that might at one point
become a requirement for life-long learning. One could imagine
data scientists playing an active role not only as skilled technical
craftsmen but also as the digital champions or educators of this transition in
skills training.
VI. II. The data scientist's training

you're a scientist first of all - there is a process of how you see the world, design,
experiment - and then you need to have a combination of maths, computing and domain
knowledge
Along with the extensive digitization of data storage,
the interpretation of vast amounts of data meant that a new breed of researchers
familiar with both science and advanced computing needed to emerge. In the
words of Mattman (2013), who was the principal lead on big data initiatives at the
Jet Propulsion Laboratory in California, to solve Big Data challenges researchers
need skills both in science and computing, and this opinion strongly resonates
with the standpoint of the informants:
from my experience, people with whom I work [bio-informaticians, bio-engineers]
and whom you would consider data scientists need to know programming to be able to do high
performance computing (...) there is the cry for more data and computing literacy (...) and it's
quite difficult to find people with those type of skills
- Researcher, exploring how people interact with technologies in health and life sciences
what would make a data scientist? exceptional programming skills, use of common
statistical software and an academic background in physical sciences or statistics
- Statistician / Data Analyst, with 15 years of industry experience
This perspective remains consistent among different interviewees, job
advertisements and the opinions of data scientists themselves and individuals
working in the wider field. In the past, the equivalent of a data scientist's training
would likely be a degree in a quantitative discipline - maths, statistics, physics
or economics. However, as computing skills arrive at the forefront of market
and organizational transformations, this pool is enlarged to include computer
scientists and engineers. In fact, most of the research informants who associated
themselves with data science had advanced academic training at a PhD level in
scientific disciplines: computational modelling of biological systems,
astrophysics, computer science or statistics. Not surprisingly, having a degree in
data science is today still rare, and it is only in the last two to three years that
academic institutions have launched, or are planning to launch, certificates and
degree programmes to address this educational demand. It is also clear that
academic training alone is insufficient for acquiring a job in data science, which
is often the defining event that transforms a scientist into a data scientist.
data scientists work less on the data collection side, or infrastructure (...) data science
is more about analysis - the front end between insight and data - that refers to services and
data
- Research Fellow in Economics
they [data scientists] have to program, do statistics, follow the digital trail and know
what to do at the end of it (...) there is much about experience - humility about the data
- Data Scientist [2], working in industry
With this in mind, academic training is often perceived merely as a filter,
a first step in shaping the profile of the data scientist. To be successful in an
organizational setting - helping to address data challenges - one needs to be
able to adjust to the changing circumstances and demands of the role. This
entails different types of skills, behaviours and experiences that one would not
be exposed to in academia:
in academia you're getting points for being clever, and in industry it works out what you do, or
it doesn't, it's important that you can do things quickly (...) in academia if you show how you did
it, they'll say umm... that's just linear regression
- Data Scientist [2], working in industry
when you leave academia you start learning C++ and Python (...) there seems to be a big
community movement to change tools - catch up with the great tools of the outside world
There is a discrepancy between the availability of well-trained scientists with
computing and analytical skills and the market demand for them. A McKinsey
report argues that by 2018, the United States alone could face a shortage of
140,000 to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts with the know-how to use the analysis of big data to
make effective decisions (McKinsey, 2011). For that reason, a number of new
initiatives have emerged, including the Insight Data Science Programme, backed
by a consortium of Silicon Valley tech companies: a 6-week
training programme for post-doctoral, quantitative graduates to bridge the gap
between a career in academia and data science (...) and enable scientists to learn
the industry-specific skills to work in the growing field of big data at leading
companies (Insight Data Science Program, 2014). The training programme
illustrates well what data science training means for industry:
1. Intro to Data Science - a round table discussion introducing concepts of
data science, a big-picture overview of what the field is and what makes
a great data scientist.
2. Data project - a 3 week exercise to showcase existing data analysis skills
in a context that companies are familiar with, while forcing participants to learn
the technical skills and technologies that are standard in industry, including:
software engineering best practice, storing and retrieving data,
statistical analysis and machine learning, visualizing and communicating
results.
Source: Insight Data Science Fellows (2014)
This, and other training programmes in data science - e.g. those led by
General Assembly - are part of a larger system of training packages
complemented by earlier established on-line education courses promoted by
academic power-houses such as MIT and Stanford (with a number of on-line or
distance courses in Data Science, Machine Learning and Data Visualization). This
strongly affirms that data science is a career path that did not exist in the past
(at least, not under such a name). It is only in recent years that traditional
organizations of power and educational credibility - organizations such as iSchool
at Berkeley University, NYC and Imperial College London - have opened research
programmes in data science and entered this growing field.
VI. III. Transitioning from academia to industry
in particle physics, cosmology, data science has been done for years (...) after a PhD
there's a gap to become a lecturer and many people want to become data scientists, as it's
about the technical craft
One cannot look at the emergence of the data scientist role without the
context and ongoing evolution of the labour market and the novel employment
opportunities associated with socio-economic and technological change. For a
long time, some of the best and financially most rewarding career paths for
students graduating in mathematics, physics, economics and engineering - in
general, quantitative degrees - were in big technology, engineering or financial
organizations. An alternative to this were academic or corporate research
centres. It goes without saying that post-graduate education - which still remains
at the core of the data scientist's training - has also changed over the years. More
and more PhD training opportunities have been offered at academic institutions
that were later not matched with further post-doc or tenured opportunities in
academia. CERN, the Switzerland-based research centre best known for the
empirical backing of the Higgs boson, has in itself been home to hundreds of PhD
and post-doctoral scientists in fields ranging from physics through to engineering
and computer science. However, after the researchers' term of practice is
finished, they may have to consider moving into industry due to the limited
opportunities at other research institutions. This is said to be one of the reasons
why the label data science has found fertile ground - scientists needed to
re-brand themselves for the purpose of industry roles.
Amongst the informants of the study were both experienced individuals
who have gone down the path to becoming a data scientist and individuals
working at the gateways of this role. This provided the study with an interesting
perspective on where and when the transition into data science began, and
whether it could be a sustainable label for self-identification in a working
environment.
I heard about data science for the first time as a PhD student, during an industrial placement
I did at a *major web-company*
I heard for the first time about data science when I was working at *major tech company* 4-5
years ago and tech-companies were starting recruiting data scientists
I heard about data science for the first time when I was doing a PhD,
and people were leaving for industry
I heard about data science probably for the first time with regards to Patil's &
Davenport's article in Harvard Business Review [the sexiest job of the 21st century]
Research Fellow in Economics
There is, however, a significant difference between the training happening in
academia and the type of expectations that data scientists are often
supposed to match when acquiring a role at an organization driven by
commercial dynamics. As one of the informants highlighted, recognition and
reputation building in academia - which is an important part of establishing a
professional identity - is, unlike in industry, often related more to the robustness
and sophistication of getting to a certain output than to the results themselves.
academia is pressured for novelty (...) that's why each life scientist writes his own
code, because everybody in academia needs to come with their own solution, and industry is
more about finding the code that is the most efficient for the given task - and finance is
absolutely the best at it, and has for years been hiring some of the best of the best physicists
Research, exploring how people interact with technologies in health and life sciences
It also seems that the term "data scientist" has strong origins in the
tech-industry, in particular in places such as San Francisco and Seattle (home to the
largest tech-companies). To some degree, this should not be surprising, as data
science emerged from the recruiting practices of companies that really could
pursue Big Data - Microsoft, IBM, Facebook, Twitter, Google etc. - and it was (and
still often is) through the endeavours of these organizations that the term has
received so much attention. These companies are also some of the main
recruiters on academic campuses, hiring both young graduates and
experienced scientists to run their more sophisticated streams of work. They
are the most involved in the transition of scientists into tech jobs, which also
fits the narrative of the larger campaigns for STEM education.
VI. IV. Interdisciplinary work practice

we have computer scientists being hired into biological projects and roles (...) because
the data is so vast, machine learning techniques make it so much easier
The type of projects which data scientists get involved in - unless they are
in strictly domain-specific areas such as banking, insurance or biological mapping
- often entail complexity that reaches far beyond what a simple computational
system could frame, and sit within the interplay of a number of dynamic
socio-technical systems. Data-driven decision making, which seems to be at the heart
of data science for such areas as public health, transport or public services,
requires expert knowledge from a range of disciplines and standpoints, often
also taking into consideration social, political, scientific, usability
and aesthetic aspects of the developed solutions. However, interdisciplinary
work comes at a price. Different backgrounds and practices of problem inquiry
require the usage of different language, processes and practices of work that are
not always complementary and mutually understood.
I believe data science is more about the science bit, than the data (...) for example,
designers use data from a design perspective, but what they really do is design
data in itself is useless, unless it can be used as a tool to solve business or research questions (...)
that's why data scientists need to express not only technical proficiency and a data-driven approach, but
also soft skills: team working, storytelling, engaging communication
Research Fellow in Economics
As a result, there is a large emphasis on the ability to work across
different organizational and domain boundaries and to adjust to the technical
knowledge of the audience. Communication in itself is a useful capability and
often on its own requires a separate skillset - e.g. the practices and language to
speak are different for a design organization, a science organization and a policy
institution. However, these are the environments that data scientists need
to operate within, as they have to be able to work not only on a level of
technical expertise (of handling data and applying computational techniques)
but also across these different audiences. This is one of the reasons a
comprehensive skillset of a data scientist is so
rare, and difficult to capture within one individual. Each of the fields of training
- science, computing, communication, domain knowledge and business acumen -
is in itself an area which requires substantial attention in achieving
proficiency.
This takes us back to another larger conversation about the specialization
of labour and the wide array of skills that the modern economy requires. More and
more emphasis is placed on the collaborative output (data-analysis, synthesis,
communication, framing and delivery), which is often attributed to
interdisciplinary teams. Depending on the complexity of the issue - e.g. a fairly
simple web-study, or sequencing of the whole human genome - this can range
from several to several hundred individuals in different organizational
constellations. Mitigating this complexity of interactions often requires
appropriate management and organizational culture. The difficulty with
interdisciplinary work is well captured by the informants' comments:
A statistician's way of thinking is being comfortable with uncertainty, and that's often
quite opposite of programmers, who in how they work need proof of correctness
I can plot results in R, MatLab, Gnuplot, when I speak to other statisticians - but
with designers, they need to be briefed in other ways (...) I met creatives who have no idea about
data science and just see it through visualization, but completely lose the science parts, and
that's a completely lost perspective
In fact, an interesting perspective on how data-challenges can be
perceived is through the lens of data-science hackathons - events where data
scientists and other data-wonks gather for 24 or 48 hours to tackle
data-challenges.
[Data Science London] This is a meetup for data scientists, data miners, statisticians, data
analysts, data engineers, data architects, data visualizers, data journalists, data science
practitioners, data consultants, academics, researchers, people from science and social
sciences, and in general people directly involved in data projects.
- Meet-up website
The formula of these events is mostly built around a set of data provided
either by a third party or by the organizers themselves. This data is, in many
cases, unstructured and requires a certain level of ingenuity and expertise to
make use of it. The data is then made available to teams of hackers - composed of
data scientists, software engineers, web developers, graphic designers and
others - to devise, in most cases, a prototype for a data-product that would
address a need in a novel and impactful way. This is the space where
interdisciplinary work takes place at its most extreme. Participants often do not
know each other before the event and have to form teams through conversation
at the beginning of the meeting. At many of these events, it is reiterated that
group-work leads to the best results, requiring an effort to reach out to people
who are usually outside of one's disciplinary background.
I'm a technologist. I spent 20 years writing software, building infrastructure, using
technology to answer hard questions. Maybe the hardest thing I learned in those 20 years is: in
order to do great work, you can't limit yourself to only knowing technical things. (...) you
need to know people who are very different to yourselves, and sadly, a tech education does not
prepare you for this very well
- PhD candidate in Computer Science, one of the meetup presenters
But there is a difference between declarations and practice.
Hackathons entail an element of competition, either for an award, satisfaction,
or the joy of play. The spirit of competition impacts how teams are
formed - a kind of speed-dating process takes place, where individuals talk to
each other, recognize whether they are mutually interested in a given problem
and what each of them can bring to the table in terms of experience, skills and
tools. The boundary of who is considered a credible team member varies, but
ultimately rests on a combination of perceived educational training, exposure
to appropriate industry experience and a universal ability to apply a
context-agnostic toolset to certain data-problems. These were ultimately the kind of
features that were respected. Individuals from different fields had to make an
effort to be accepted into the group, particularly if they did not have the data
literacy and ability to use the appropriate data science tools.
VI. V. Tools of practice

Tools often play a key role in determining a professional occupation. While
proficiency in using the Adobe Suite might entitle someone to call themselves a
graphic designer, the mere use of a drawing board or AutoCAD does not make
someone an architect, at least in the view of formal institutions. This interplay
between professionalization, accreditation, regulation and use of vocational
skills is likely to be as old as the development of craftsman guilds. For a
nebulous term such as data science, this becomes even more confusing. It is
only in the last few years, if not months, that certain institutional frameworks for
recognition and accreditation have begun to be established (e.g. the Insight Data
Science Fellowships, Coursera or edX courses), and it seems quite clear that the
expectation for data science skills flourishes on the demand side of the market,
rather than the supply side. For that reason, it was worth exploring what were
the skills and tools so much desired by industry recruiters:
Facebook (Data Scientist)
Fluency with at least one scripting language - Python or PHP,
familiarity with relational databases and SQL, expert knowledge
of an analysis tool such as R, Matlab or SAS, experience working
with large data sets - Map/Reduce, Hadoop, Hive (and a PhD in a
technical discipline)

Linkedin (Data Scientist)
Experience programming in an object orientated language (Java,
C++, etc.), knowledge of scripting languages - Ruby or Python,
comfortable in data analysis & visualization using tools like R,
Matlab, or SciPy (and a MSc/PhD in a quantitative field, with a
strong background in machine learning, statistics or information
retrieval).

BAE Systems (Data Scientist)
Hands-on analytical experience in technologies such as SAS and R,
appreciation and understanding of relational databases, ETL
principles, and platforms such as Hadoop or MongoDB (no
particular education indicated)

WGSN (Senior Data Scientist)
Hands-on experience with big data technologies (Hadoop, Elastic
Search, Solr, Java, Pig, Map Reduce), expert SQL skills, exposure
and understanding of development tools such as Java; predictive
analytics and machine learning packages (and BA/BS in maths/
statistics/machine learning or equivalent)
Source: Linkedin search results (2014)
One can easily recognize similarities with the requirements described
earlier in the section on the training of the data scientist - a wide range of tools
covering programming languages (C++, Python, Java), data retrieval and
operation tools (Map/Reduce, Hadoop, Hive) and analytical software (R, MatLab,
SAS). However, many of the study informants underlined on several occasions
that it is not the knowledge of these tools, per se, that is key to the data
scientist doing his or her job correctly, but the ability to apply them to a data
problem. That is precisely what distinguishes a skilled data scientist from an
aspiring one.
many people come into data science, or machine learning meet-ups, they're provided
with data, they'll run a simple algorithm and say this is the result; I'm a data scientist (...)
that's not how this works
good ones [data scientists] know how to use proper tools for a given context,
others are just enthusiasts
It is important to underline the origins of these tools, as they reveal yet
another link between how leading technology companies are influencing the
drivers of change on the labour market and how academic curricula are changing
accordingly. After all, companies running computer-mediated transactions (that, in
many cases, go into the billions) found it difficult to analyse this data with
conventional databases. They felt it was necessary to develop systems that
would allow them to manage this data on their own. Once this happened, the
tools were released and labelled as Big Data tools, and gave space for the
development of data science.
There is a difference between conventional and non-conventional data science (...) and that
depends probably on the type of data source you're working on (...) tech companies - Google,
Microsoft, Facebook - they get it right, they have the data, and can use it to do data
science (...) they're the people who hit the problems first, and develop the tools
VI. VI. The evolving nature of the data analyst role

a few years ago the requirements for the role were: familiarity with maths skills,
mostly Bayesian techniques, and these days it's machine learning, and more engineering
backgrounds
An underlying theme in the discussions about the data scientist profession
is the evolving nature of jobs associated with data analysis, its identification,
computerization and merger with usability and data-aesthetics. This process, as
any other happening in a society heavily impacted by technological change,
implies some notable consequences. More and more data analysis
is now conducted by autonomous software, and sophisticated education and
training is required to be able to actively participate in the design and use of
those systems. As in any past evolution of the labour market, this means that
some individuals who do not catch up - or do not have the ability or will to
receive appropriate training - are being left behind. A countermeasure to this
seems to be the emergence of online-learning platforms, which are (at least to
some degree) striving to address this polarization. However, the real training of
the new adepts of data science is still, as shown by the informants' insights, an
intensive and time-consuming process, which also requires a certain degree of
quantitative literacy to remain in the process.
data is now on everyone's mind (...) it's a bit of a frenzy (...) and if you don't use data you're
not innovating
because it's a newish thing [data science], it seems attractive for executive staff (...) like,
wow, he does data science means he's doing innovation
a lot of people are trying to brand themselves as data scientists e.g. a quant wants to find a
job, or an excel analyst tries to sell himself (...) there is a sense of tribalism, you know I want
to brand myself as a data-scientist, these are the good network, and things I can pick-up and
progress in my career
As shown in the literature review, this might not be a completely
unexpected turn of events. It is the increasing degree of automation and
autonomy of computer systems that take over the decision-making process from
human beings that makes the difference. Advances in machine learning and
neural networks, in some cases, result in the so-called "wisdom of Big Data"
being ultimately more valued within organizations than more contemporary
analytical methods - that is, in organisations with low or mediocre data
literacy. As a result, an open question arises: does the centre of gravity of
data analysis remain with the analyst, or with the algorithm? Will the new
generation of analysts be composed of context-agnostic specialists with machine
learning and software engineering skills - as has already happened on some
occasions - or will these skills simply become a casual part of science or MBA
training? The question remains whether such a breed of analysts will be able to
tackle the kind of complex problems that stand in front of us today (e.g.
public health, environmental pollution, ageing infrastructure, energy constraints)
if the emphasis is given more to the software agent, data visualization and the
story than to contextual and methodological rigour.
VII. Discussion
Research and analysis conducted during this short piece of study lead to
a convincing assertion that we might be witnessing the establishment of a new
profession - emerging on the boundaries of engineering, computing and statistics
- that sits within a longer lasting tradition of the evolution of the data analyst
role. The need for a new breed of researchers and data analysts has been
expressed on both sides of the market - amongst the scientific community and
the industry, spearheaded by technology companies from the technology hubs of the
United States. This phenomenon fits into the picture of a more universal change
happening in society, that is, the digitization and scientification of work practice.
This process might indeed be concealed under the messages of increasing
"datafication" of products, services and policy interventions, marked by
additional slogans of "data-driven analytics" or "evidence-based decision
making". It is, however, a manifestation of technological change and the
increasing consequences that the ICT revolution is having on subsequent areas of
our lives.
This is also a good example for observing and evaluating how professional
identity is evolving within the community of people calling themselves data
scientists. This has much to do with the attributes, beliefs, motives and
experiences that they express and the identities they construct in social
interactions. It is particularly interesting for the data scientist profession, as
it is in the process of making. By some it is viewed with scepticism, by
others eagerly taken on, and for many it still remains nebulous. Additionally, the
term has an inbuilt semantic and linguistic conflict between its parts - at what point was
there ever science without data? As this profession forms out of a stream of
scientific training, there is also the larger conversation about the making of
science, of a scientist, of science communication, and scientific management.
This leaves us with two underlying questions: how does a social group create a
new role, and how does the self-perception of authenticity accord with what one
feels and communicates around his or her competencies?
For the self-perceived data scientist and for the industry recruiters who
also shape the perception of the profession, the role is strongly associated with
an advanced (Masters or PhD) degree in applied quantitative disciplines or
computer science. This is because the role of the data scientist seems to assume
a blend of computing and quantitative analysis skills, backed with
practice and experience in conducting scientific work. This educational
background is still asserted by the labels of higher education; however,
respective on-line and industry-led programmes have been made available for
enriching the training base. This training is mostly focused on acquiring
knowledge about the use of Big Data tools, and their appropriate use for a
changing organizational context. In many cases the tools developed by industry
in the last few years are the ones mostly associated with the data scientist's role
- Hadoop, Cassandra, Map/Reduce, Hive and Pig are all part of the new generation of Big
Data tools. So too are programming languages, depending on the context (C++, Python,
Java), and analytical tools (MatLab, Stata and R). These are the technical
craftsmanship tools that are expected of data scientists by the labour market,
and to some degree, by the data scientists themselves. In addition, as research
and analysis suggests, it is not the pure knowledge of these tools that makes one
a respected data scientist amongst peers, but the ability to independently
choose the appropriate tools for the given context and the aptitude to skilfully
interpret and communicate the findings. Pure knowledge of these tools,
training, or education does not yet seem to make one a data scientist. This is,
rather, a consequence of the type of role one is expected to pursue at the
workplace and its title. For example, there are a number of individuals
possessing the above characteristics who pursue the data scientist tasks, but are
not labelled as data scientists.
Due to this, the phenomenon corresponds closely with the question of
how the job of the data scientist fits the larger picture of the making of science
and the evolution of data analysis roles. There is a push in the public narrative,
supported by the tech-industry and backed by some policy decisions, that
society experiences a lack of appropriate training in scientific roles (STEM), which
puts a strain on the employment market. It is also an expression of how
computing progressively penetrates subsequent areas of our lives, including
scientific conduct and the making and communication of science. The impact of
information communication technologies and computing also reflects that the
way we work in organizations changes over time - be it academic, industrial,
public or third-sector. And "data science" and "data scientists" seem to be universal
labels to capture data- and computing-literate individuals who are comfortable
applying the novel tools that the technology sector (and academia) create to
tackle the increasingly complex environment of data-generating
instruments.
A particularly interesting aspect over the last few years, even in the role
of the data scientists and data science, is the increasing emphasis on applying
machine learning methodologies to vast data streams. This reflects yet another
phenomenon increasingly taking place in organizations, namely the automation
of work and the replacement either of human physical labour with robotics, or of
human cognitive labour with algorithms. Taking this forward to data analysis
roles, one can recognize that there is a tendency towards substituting
organisational resources of data analysts (both in science and in industry) with
computational solutions, and data scientists are often, due to their training, the
ones left bearing the torch. However, due to the complexity of the issues at the
centre of the analysis - e.g. medical records and public health, environmental
sensors and pollution monitoring, mobility patterns and crisis management - a
purely computational approach is often scientifically misleading. As in many
other cases, domain knowledge is necessary in order to recognize appropriate
questions, frame the design of the research and soundly interpret the results.
For that reason, there is increasing emphasis on interdisciplinary skill-sets,
creating teams with different skills and competences that include different
attributes, values, practices and experiences associated with their roles. This
can in some cases lead to enhanced work outputs, but in the process creates
tensions resulting from different approaches to addressing, interpreting and
solving problems. Data scientists, due to their interdisciplinary profile and
exposure to some of the most sophisticated data problems, are often at the
forefront of these conversations.
As a result, data scientists sit alongside the changing (or rather
evolving) nature of techne and episteme deriving from the introduction of Big
Data approaches to subsequently new areas of data analysis. They bear much
responsibility for how the interpretation of novel data analysis tools and
approaches will be translated into the fabric of the organizations that they are
working for, or the cause they are impacting. As with the use of machine
learning without a thorough understanding of the investigated data, Big Data
methods can likewise exclude certain observations. Data scientists are therefore
playing dual-natured roles in the organizational context that they are operating
within. They are the source of insights, of research and of scientific rigour for the
pursued data-problem. But as is already well documented in philosophy of
science and STS (e.g. Thomas Kuhn, 1963 and Donna Haraway, 1988), there is
rarely (if ever) an objective truth or neutral agenda, and so objectivity is
situated and historically specific, particularly within an organisational setting.
This in some way circles the conversation back towards the notions of
education - data, computational and epistemological literacy - and the
implications for the systems of knowing and the meaning of learning. The ways
in which data scientists establish their professional identities -
attributes, beliefs, values, motives and experiences - as the profession grows
might have substantial implications for how decision making is conducted in
industry, business or the public sphere. For that reason, it is critical to make
sure the process of educating data scientists is comprehensive enough to
overcome the interpretative, socio-technical and political limitations of Big Data,
machine learning, and whatever comes next. And this is what makes data
science, data scientists and the making of the next generation of data analyst
roles so important for further research.
VII. Closing words

This piece of research has been a genuine, short attempt at recognizing,
capturing and unleashing some of the most interesting conversations currently
taking place around the still nascent term of data science. The study
investigated how a new profession of advanced data analysts (data scientists) is
emerging out of the statistical and computational sciences, and how it
proliferates to other domains of labour as data and Big Data become the bedrock
of science and decision-making. The research also investigated what constitutes the
professional identity of a data scientist. It recognized that education, training
and adaptation to a professional role are associated with the move from
academia to industry. It also stressed the role of the perception of authenticity
and competences linked to the ability to use certain tools (techne) and the
ability to use them for the right context (episteme), and how it impacted the
process of self-identification. The study also touched upon the place of data
scientists in the making of science influenced by digitization, computerization and
interdisciplinarity of work, and how this corresponded with the evolution of the
data analyst role, which increasingly requires sophisticated and advanced
computational and quantitative training.
Finally, the research, analysis and subsequent interpretation marked how
data scientists sit alongside the debates about the changing nature (and
understanding) of knowledge associated with the introduction of Big Data
methodologies, and what kind of responsibilities linked with beliefs, values,
motives, data-literacy and science communication lie ahead for individuals
moving into this role. In itself, the study also proved to be well suited for
applying a combination of ethnographic work, in-depth interviews and historical
analysis that led to insightful observations, helping to unpack a phenomenon
happening in front of our eyes as the macro trends (digitization, datafication
and automatization) drive and influence the socio- and techno-economic
environment we are currently experiencing. Above all, it also adds a small brick
to the body of knowledge of digital anthropology with additional, empirical
insights on the dynamics and forces that are shaping the relationships between
individuals, communities, decision-making agents and digital-era technology.
Bibliography
Abbott, A. (1988). The System of Professions: An Essay on Division of Expert Labour.
Chicago: The University of Chicago Press.
Adams, M., & Kowalski, G. (1980). Professional Self-Identification Among Art Students.
Studies in Art Education vol. 21 no. 3, 31-39.
Anderson, C. (2008, 23 June). The End of Theory: The Data Deluge Makes the Scientific
Method Obsolete. Retrieved from Wired Magazine: http://archive.wired.com/
science/discoveries/magazine/16-07/pb_theory
Ashforth, B., & Humphrey, R. (1993). Emotional Labor in Service Roles: The Influence of
Identity. Academy of Management Review vol. 18 no. 1, 88-115.
Bandura, A. (1977). Self-efficacy: Toward a Unifying Theory of Behavioural Change.
Psychological Review vol. 84, no. 2, 191-215.
BBC. (2010). Joy of Stats (with Prof. Hans Rosling) [Motion Picture].
Becker, S., & Carper, J. (1956). The Development of Identification with an Occupation.
The American Journal of Sociology, 289-298.
Berger, L., & Luckmann, T. (1966). The Social Construction of Reality. New York:
Penguin Books.
Biao, X. (2006). Global "Body Shopping": An Indian Labour System in the Information
Technology Industry. Princeton, NJ: Princeton University Press.
Bijker, W., & Law, J. (2012). Shaping Technology/Building Society: Studies in
Sociotechnical Change. MIT.
Blackler, F. (1995). Knowledge, Knowledge Work and Organizations: An Overview and
Interpretation. Organization Studies, 1021-1046.
Boellstorff, T. (2013). Making big data, in theory. First Monday vol. 18, nr. 10.
boyd, d., & Crawford, K. (2012). Critical Questions For Big Data: Provocations for a
cultural, technological and scholarly phenomenon. Information, Communication
& Society vol. 15, iss. 5, 662-679.
Brown, J., & Duguid, P. (2001). Knowledge and Organization: A Social-Practice
Perspective. Organization Science 12(2), 198-213.
Brynjolfsson, E., & McAfee, A. (2014). The Second Machine Age: Work, Progress, and
Prosperity in a Time of Brilliant Technologies. New York: W.W. Norton & Company.
Burkholder, L. (1992). Philosophy and the Computer. Boulder, San Francisco and
Oxford: Westview Press.
Cleveland, W. S. (2001). Data Science: an Action Plan for Expanding the Technical Areas
of the Field of Statistics. International Statistical Review, 21-26.
Coleman, G. (2013). Coding Freedom: The Ethics and Aesthetics of Hacking. Princeton:
Princeton University Press.
Data Science Institute. (2014, September 5). Data Science Institute - Events. Retrieved
from Imperial College London: http://www3.imperial.ac.uk/data-science/events
Data Science London. (2014, September 5). About @DS_LDN. Retrieved from Data
Science London: http://datasciencelondon.org/data-science-london/
Davenport, T., & Patil, D. J. (2012). Data Scientist: The Sexiest Job of the 21st Century.
Harvard Business Review.
Diebold, F. (2012). "On the Origin(s) and Development of the Term Big Data". Working
Paper - Penn Economics.
Eyal, G. (2013). For a Sociology of Expertise: The Social Origins of the Autism Epidemic.
American Journal of Sociology vol. 118, no. 4, 863-907.
Forbes. (2014, September 5). Article: Blueprints Of NSA's Ridiculously Expensive Data
Center In Utah Suggest It Holds Less Info Than Thought. Retrieved from Forbes:
http://www.forbes.com/sites/kashmirhill/2013/07/24/blueprints-of-nsa-datacenter-in-utah-suggest-its-storage-capacity-is-less-impressive-than-thought/
Friedman, J. H. (2001). The Role of Statistics in the Data Revolution? International
Statistical Review 69(1), 5-10.
Gillespie, T. (2014). The relevance of algorithms. In T. Gillespie, P. Boczkowski, & K. Foot (Eds.), Media
technologies: Essays on communication, materiality, and society (pp. 167-194).
Cambridge, MA: MIT Press.
Gitelman, L., & Jackson, V. (2013). Introduction. In L. Gitelman (Ed.), "Raw Data" Is an
Oxymoron (pp. 9-23). Cambridge, MA: The MIT Press.
Goffman, E. (1959). The Presentation of Self in Everyday Life. Anchor Books.
Google. (2014, September 5). Company Overview. Retrieved from Google Company:
https://www.google.com/about/company/
Hall, R. (1968). Professionalization and bureaucratization. American Sociological Review,
92-104.
Handley, K., Sturdy, A., Fincham, R., & Clark, T. (2006). Within and beyond
communities of practice: Making sense of learning through participation, identity
and practice. Journal of Management Studies, 641-653.
Haraway, D. (1988). Situated Knowledges: The Science Question in Feminism and the
Privilege of Partial Perspective. Feminist Studies vol. 14, no. 3, 575-599.
Harvey, D. (2007). A Brief History of Neoliberalism. Oxford: Oxford University Press.
Ibarra, H. (1999). Provisional selves: Experimenting with image and identity in
professional adaptation. Administrative Science Quarterly vol. 44 iss. 4, 764-791.
IBM. (2014, September 5). Apply new analytics tools to reveal new opportunities.
Retrieved from IBM Smarterplanet: http://www.ibm.com/smarterplanet/us/en/business_analytics/article/it_business_intelligence.html
Insight Data Science Program. (2014). White Paper. San Francisco: Insight Data Science
Program.
Kelty, C. (2008). Two Bits - The Cultural Significance of Free Software. Durham and
London: Duke University Press.
Krause, E. (1971). The Sociology of Occupations. Boston: Little, Brown and Company.
Kuhn, T. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago
Press.
Latour, B. (1987). Science in Action: How to Follow Scientists and Engineers through
Society. Cambridge, MA: Harvard University Press.
Latour, B., & Woolgar, S. (1986 (1979)). Laboratory Life: The Construction of Scientific
Facts. Princeton, NJ: Princeton University Press.
Latour, B., Jensen, P., Venturini, T., Grauwin, S., & Boullier, D. (2012). "The whole is
always smaller than its parts": a digital test of Gabriel Tarde's monads. The
British Journal of Sociology vol. 63, iss. 4, 590-615.
Lave, J., & Wenger, E. (2008 (1991)). Situated Learning: Legitimate Peripheral
Participation. Cambridge: Cambridge University Press.
Levy, S. (1984). Hackers: Heroes of the Computer Revolution. New York: Nerraw
Manijaime/Doubleday.
Lohr, S. (2012, August 11). How Big Data Became So Big. Retrieved from The New York
Times: http://www.nytimes.com/2012/08/12/business/how-big-data-becameso-big-unboxed.html?pagewanted=all&_r=0
Manovich, L. (2011). Trending: The Promises and the Challenges of Big Social Data. In
M. K. Gold (Ed.), Debates in the Digital Humanities. Minneapolis: University of
Minnesota Press.
Mattmann, C. A. (2013). A vision for data science. Nature vol. 493, 473-475.
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Eamon Dolan/Houghton Mifflin
Harcourt.
McIntosh, P. (1989). Feeling like a fraud: Part II. Stone Center Working Paper no. 37,
Wellesley College.
McKinsey. (2011). Big data: The next frontier for innovation, competition, and
productivity. McKinsey.
Mead, G. H. (1934). Mind, Self, and Society. Chicago: University of Chicago Press.
Miller, D., & Slater, D. (2001). The Internet: An Ethnographic Approach. London:
Bloomsbury Academic.
Moss-Racusin, C., Dovidio, J. F., Brescoll, V., Graham, M., & Handelsman, J. (2012).
Science faculty's subtle gender biases favor male students. Proceedings of the
National Academy of Sciences of the United States of America (vol. 109 no. 41),
16474-16479.
O'Reilly. (2013, September 5). Retrieved from Strata Conference: http://strataconf.com/
Parks, M. (2014). Big Data in Communication Research: Its Contents and Discontents.
Journal of Communication vol. 64, iss. 2, 355-360.
Parry, R. (2014, September 5). Episteme and Techne. Retrieved from The Stanford
Encyclopedia of Philosophy (Fall 2014 Edition): http://plato.stanford.edu/archives/fall2014/entries/episteme-techne/
Pentland, A. (2014). Social Physics: How Good Ideas Spread: The Lessons from a New
Science. New York: The Penguin Press.
Poovey, M. (1988). A History of the Modern Fact: Problems of Knowledge in the
Sciences of Wealth and Society. Chicago: The University of Chicago Press.
Puschmann, C., & Burgess, J. (2014). Metaphors of Big Data. International Journal of
Communication, 1690-1709.
Rauser, J. (2014, September 5). Strata New York 2011: John Rauser, "What is a Career
in Big Data?". Retrieved from Youtube: https://www.youtube.com/watch?v=0tuEEnL61HM
Riles, A. (2010). Collateral Expertise: Legal Knowledge in the Global Financial Markets.
Current Anthropology 51(6), 795-818.
Rogers, E. (2010 (1962)). Diffusion of Innovations. New York: Free Press.
Rosenberg, D. (2013). Data before the fact. In L. Gitelman (Ed.), "Raw Data" Is an Oxymoron
(pp. 15-40). Cambridge, MA: MIT Press.
Ruppert, E. (2013). Rethinking Empirical Social Sciences. Dialogues in Human
Geography, 268-273.
Schein, E. (1978). Career Dynamics: Matching Individual and Organizational Need.
Reading, MA: Addison-Wesley.
Simon, H. (1965). The Shape of Automation for Men and Management. New York:
Harper and Row.
Shore, C. (1997). Anthropology of Policy: Perspectives on Governance and Power.
London & New York: Routledge.
Tajfel, H., & Turner, J. (1986). The social identity theory of intergroup behaviour. In S.
Worchel, & W. Austin (Eds.), Psychology of Intergroup Relations (pp. 7-24). Chicago:
Nelson-Hall.
U.S. Department of Commerce. (2011). Women in STEM: A Gender Gap to Innovation.
Washington D.C.: U.S. Department of Commerce.
Varian, H. (2014). Big Data: New Tricks for Econometrics. Journal of Economic
Perspectives 28(2), 3-28.
Vygotsky, L. (1978). Mind in Society. Cambridge, MA: Harvard University Press.
Wenger, E. (1998). Communities of Practice: Learning, Meaning, and Identity.
Cambridge: Cambridge University Press.
Wilensky, H. (1964). The Professionalization of Everyone? American Journal of
Sociology, 137-158.