Łukasz Alwast
Dissertation submitted in partial fulfilment of the requirements for the degree of MSc in
Digital Anthropology (UCL) of the University of London in 2014
Word Count: 14 254
Note: This dissertation is an unrevised examination copy for consultation only and it
should not be quoted or cited without the permission of the Chairman of the Board of
Examiners for the MSc in Digital Anthropology (UCL)
Abstract
Over the past XX years, the term data science has swiftly moved into the
vernacular of scientific and technological vocabulary. As this happened, it signified
a larger phenomenon that is taking place in the sciences and society at large,
namely digitization and datafication of many of the aspects of the world that had
not been quantified and digitized before. This trend seems to have its own, new
acolytes - data scientists. Heralded by the media as the "high-priests of
algorithms" and "the sexiest job of the XXI century", the phenomenon unravels a
more deeply grounded conversation about the establishment of a new profession
in the public milieu, the making of science and scientists, and the evolving nature
of handling and understanding data. Drawing on contributions from science and
technology studies (STS), organizational studies, anthropology, and Internet
studies, this work frames the research around the self-identity of a professional
and group perception of the authenticity and competence of interdisciplinary, XXI
century quantitative analysts.
Table of contents
1. Introduction
2. Methodology
3. Research questions
- Provisional selves
- Communities of practice
- Tools of practice
7. Discussion
8. Closing words
9. Bibliography
Acknowledgements
I would like to thank my supervisor Stefana Broadbent for her guidance, patience
and confidence in setting this piece of research on track. I always found her
enthusiasm contagious, which made this creative endeavour much more
invigorating, intriguing, and within my reach.
I am also grateful to Haidy Geismar, my course co-convenor and personal tutor,
whom I could always count on for thoughtful advice and a critical eye.
I also appreciate the help of Ciara Green, my course peer, who dedicated her time
to listen to my rants on data science and proved to be a good, critical listener.
Then there are of course my informants, with whom a number of in-depth
interviews allowed me to investigate my questions in sufficient depth.
And finally, thanks to my mom and dad, for always supporting me in whatever I
decided to pursue.
I. Introduction
Data, as it stands, surrounds us. For computational systems, we, as human
beings, are carriers, herders and interpreters of it. Data, after all, is the
foundation for deriving information - the particular means of insight and
intelligence that enables us to make informed, individual and collective
decisions. Or so we believe.
Some profound changes in this area have been happening over the past
30 years. With the advancement of computational technologies our ability to
collect, share and analyse data (and therefore, information) has changed to a
degree that is historically unprecedented. In fact, according to one of the
corporations that helped set up the infrastructure for this transition, IBM (2013),
"90% of the world's data has been produced in the last two years - and we are
yet to recognize how to harness its potential".
There is little doubt that, amongst other phenomena, technology shapes
our lives (Bijker et al., 1989), and that we, in turn, shape technology (Mackenzie &
Wajcman, 1985). After all, technology - the outcome of making
something (techne) - and science - the outcome of thoughtfully pursuing and
understanding (episteme) - are inherently linked to one another (Parry, 2014).
As Thomas Kuhn (1963) asserted many years ago, science is inherently about
data, and so, in as much, should technology be. And if science is the initiator of a new
way of understanding the world, it also creates opportunities for doing things
differently. This, unfortunately, often translates in the popular discourse into a
simplification that "scientific opportunities = money", or "data = money", and
there are a number of larger and smaller pitfalls in seeing the world through
such a lens. This is, however, often the reality of technology and business
narratives, and this is why the fairly new concept of data science, and its
acolytes - data scientists - appear so worthy of investigation.
There are, of course, some limitations to what only a few months of
research can capture in trying to unpack such a large phenomenon. This is why
this research aspired to become an ethnographic snapshot, on the level of its
followed by a discussion that picks out, unravels and comments on the key themes of
this study. This required: (i) colliding the arguments on the establishment of an
interdisciplinary profession happening on the boundaries of quantitative and
computing sciences, (ii) recognizing the importance of training in pursuing the
role of a scientist within an organization (often associated with objectivity and
an evidence-led approach to solving challenges), and above all, (iii) understanding
the educational role of data, computing and epistemology in translating the
expectations placed on Big Data into day-to-day tools, processes and
practices.
All of this takes place within an ongoing conversation and surrounding
semantic tension around the term data science, data scientists and the
organizational and institutional changes this entails for the future of work and
decision-making - making it highly pertinent for the growing body of knowledge
within the field of digital anthropology.
II. Methodology
Research for this study did not start with a pre-assumed research question.
It began with an iterative process of probing and exploring what would be an
appropriate angle to unravel one's interests, recognize compelling questions and
seek out relevant, transferable knowledge. Building on those interests (the
processes of long-term socio-technical development, broadly understood
innovation, and social perceptions and expectations of scientific and
technological change), I was looking for a phenomenon that would merge those
themes together - and data science and data scientists appeared as timely and
worthy candidates.
For the requirements of an ethnographic endeavour, however, this was not
going to be an easy task. Data scientist roles in organizations are still fairly
scarce, with organizations outstripping each other in trying to acquire talent, and, if they are successful,
those individuals often work on some of the more critical aspects of
organizations' processes, which are in many cases highly confidential and sensitive. It was
therefore very difficult to convince any of the individuals I had in my network,
or in their networks, to let me pursue an organizational ethnography and investigate
their organization as a research field. Limiting the research to one organization
would also be a danger in itself. A decision was therefore made to conduct the
research with a number of informants from different organizations, focus on
them as individuals (rather than their organizations), and collate ethnographic
insight from an accessible field site, for which Data Science London appeared to
be a good candidate.
Inspiration for pursuing the research through such an approach emerged
from anthropological accounts of researchers who historically also tried to follow
either scientific, or ICT-heavy communities, from Latour and Woolgar (1986
[1979]), Levy (1984), Latour (1987), Miller and Slater (2001), Biao (2006), Kelty
titles of data scientists, just months before the study began. Another group of
informants was composed of individuals who were data analysts in different
organizational settings e.g. a post-doctoral astrophysicist, a statistician with 15
years of industrial experience, or an economic research fellow. The third group
were individuals actively engaged in shaping (Data Science London) or
researching the community.
During the period of this study - and throughout the digital anthropology
master's programme - I worked in an organization that was actively hiring and
building a data science team. However, due to the nature of being an actor in
an organizational setting, and the fact that the act of pursuing ethnographic
research might change the dynamics of my social position and relationships
within such an environment (Berg & Lune, 2011), I made a deliberate decision that
this organization would not be a field of the study. Nevertheless, my
experiences and observations during that time undoubtedly helped inform how the
research was framed and which themes were selected for deeper
investigation, thus influencing the discussion and reflection on this subject.
attention throughout the year of this study (2013-2014). A number of new public
and private institutions legitimizing the term "data science" emerged both in the
UK and the US (e.g. Imperial College Data Science Institute and Data Science at
NYU) and led to a number of discussions and conversations around this topic - for
example, a meeting at Imperial College titled "A Data Scientist is a statistician
who lives in Shoreditch (?)" (Data Science Institute, 2014).
This study might also have benefited from applying ethnomethodological approaches to the subject. However, due to the confidentiality of
the work pursued by some of the data scientists interviewed, and their limited
pool, this approach had to be abandoned and the research restricted to a number of semi-structured interviews. For a more in-depth study of a larger
group of informants, ethnomethodology would be highly recommended for
triangulation purposes.
This links
closely with the issue of biased group inclusivity, in-group favouritism and the
gender-biased perception of competence (Moss-Racusin et al., 2012) and
might also have impacted this study.
Values, beliefs and legacies of the open and free software movements - the
popularity of data science can be associated with the proliferation of tools
developed in the spirit of open software, which allowed increasingly
sophisticated data questions to be tackled. In fact, the Data Science London
organizers included in their mission statement the following claim:
alongside data science, also attracted significant and very interesting scholarly
work from researchers in Internet studies (Mayer-Schönberger & Cukier, 2013),
communications studies (Parks, 2014), digital anthropology (Boellstorff, 2013),
digital humanities (Manovich, 2011) and sociology (Ruppert, 2013).
This is why the literature review will be, by its nature, cross-cutting,
pointing to contributions and sources available amongst the different areas
discussed above. Deeper commentary will be given only to those positions that
have been identified and judged to be important for supporting the research
questions and revealing of the larger picture of conversations that take place
within this topic.
The first part of the literature review will briefly introduce the term
data, and show an interesting historical trajectory of two disciplines, namely
statistics and computer science, which have always tackled data from a
perspective relevant for the profession of the data scientist. The latter part of
the literature review will introduce the body of literature on professional self-identity, organizational socialization and provisional selves. This will correspond
closely with the section on the characteristics and dynamics of communities of
practice, and the forms of practices, tools and behaviours that make a group and
the individuals within it both socialized and distinctive.
significantly (Puschmann and Burgess, 2014: 1692). It was in the 1940s that the
earlier uses were supplemented with the use of the word data to describe any
information used and stored in the context of computing. With the shift from
paper record to digital information, data was increasingly used to refer to digital
objects that could be manipulated using a computer rather than generally
accepted facts or outcomes of experimentation or observation. As computing
matured, data increasingly left laboratories and offices to play a role in new,
domestic and public environments.
An interesting argument elaborated on by Puschmann and Burgess (2014:
1693) suggests that data stored as a piece of digital information marks a
departure from previous understandings of the term. In its past meaning, the
processes of giving and interpreting appeared to be highlighted, whereas in the
more recent meaning, data seems to come into being by acts of recording. As a
result of this shift, the most pronounced difference between the two is the
aspect of agency in data creation. In the past, data was mostly associated with
the role of the statistician, or sometimes more broadly, the data analyst, and
today it is much more grounded in the design and operations of computational
systems.
Large Complex Data Sets to address the issue of increasingly larger data sets
and inappropriate tools and knowledge to tackle those new volumes of
data. Twenty years later, history repeated itself: in 1997 another ASA symposium
was themed "Data Mining and the analysis of large data sets", and today (in 2013
and 2014), industry conferences such as O'Reilly Strata choose as their main
theme "Making Data work - Big Data" (O'Reilly, 2014).
Despite these clear historical loops, a number of respectable scholars and
pundits (Friedman, 2001; Anderson, 2008; Manovich, 2012; Mayer-Schönberger &
Cukier, 2013) argue that there has been a considerable change in the nature of
thinking about and dealing with data in recent years. Back in the 1970s and
1980s, large and complex data sets were rare and little need was seen to analyse
those few that did exist. Data was collected manually, and the cost of collecting
it was closely associated with its volume, resulting in data collection being,
throughout the whole process, fairly expensive.
This changed as computerization entered the space and, to some degree, the cost of setting up
data collecting infrastructures decreased to a point that gave new
opportunities to entities that had not been able to use these types of solutions
before. That said, in extreme cases like the NSA or Google data farms
(Forbes, 2014), even today's data infrastructures can be very expensive to
operate.
Not surprisingly, individuals in the field of statistics have repeatedly been
asking themselves the question - what is the role of statistics in the data
revolution? Friedman himself, a statistics Professor at Stanford, argued over a
decade ago that the idea of learning from data has been around for a long time,
but the interest in analysing these large and complex data sets has only
recently [2000s] become so intense (Friedman, 2001: 5). He associated this
with the development of novel database management systems where large
quantities of data resided and which, as a result, gave fertile ground to data
mining approaches: the processes of analysing data for purposes other than those for
which it was collected (Friedman, 2001: 6), shifting the usual application of data
from transaction processing to decision-support.
intellectual discipline, but in the future, almost certainly [it will be] and one
can predict a big intellectual and academic future for new data mining
methodologies will emerge (2001: 7). At that time, data mining packages were
already incorporating well-known procedures from the fields of machine
learning, pattern recognition, neural-networks and data visualization. And of
course, some questions remained unanswered - should statistics remain at what
it's good at (i.e. probabilistic inference based on mathematics), or ought it be
concerned with a set of problems, rather than tools? An important remark in
Friedman's argument was that statisticians will first and foremost have to
make peace with computing, as this is where the data is. For if computing is
to become one of the fundamental research tools, then the community will
have to teach, or be sure that students learn, the relevant Computer Science
topics, and some basic paradigms of the field will have to be modified
(Friedman, 2001: 9). This thought neatly corresponds with what Hal Varian
(2014), Professor at iSchool at Berkeley and Chief Economist at Google, argues
about the modern training of economists - in particular, econometricians and
the type of skills and tools they need to start acquiring from their computer
science comrades.
This observation leads to conversations about career prospects in data
analysis roles. Friedman argued (2001: 9) that up until around the 2000s, if
someone was interested in data analysis, then statistics was one of the very few
(even remotely) appropriate fields to work in. In 2013/2014, this is no longer the
case. There are many other exciting data orientated sciences that are
competing [with statistics] for customers, students, jobs and even [our own]
statisticians (Friedman, 2001: 9). Even prominent statisticians are becoming
more interested in researching problems embraced by other fields, and prefer
to work or publish in other areas (Friedman, 2001: 9). Having said that, this is
a very important issue for locating the data scientist's profession in the larger
data analyst perspective. For Friedman, this "brain drain" of students and
researchers away from statistics represented the most serious threat to the
future of the discipline, requiring profound re-examination of its place amongst
the information sciences.
The entirety of Friedman's argument, made in the midst of the
Internet bubble in 2001, corresponds very well with the thinking of William
Cleveland, Statistics Research Fellow at Bell Labs, who also in 2001 called
for "An Action Plan for Expanding the Technical Areas of the Field of
Statistics", under the new label of data science (Cleveland, 2001).
Cleveland
defined Big Data, but the practices, skills and tools used on more mundane
levels of interactions. The question is, however, whether the mundane devices
are, as material objects, enough to conduct the analysis required for Big Data?
boyd & Crawford argue (2013: 663) that Big Data is less about data than it is
about the capacity to search, aggregate, and cross-reference large data sets and
define it as an interplay between a cultural, technological and scholarly
phenomenon (boyd & Crawford, 2013: 663).
appeared that the term Big Data probably originated in the lunch-table
conversations at Silicon Graphics, a high-performance computing manufacturer,
in the mid-1990s, in which John Mashey, its chief scientist, figured prominently (Diebold,
2012), and has since, with a significant uptake from 2007, gained traction within
the computer-software industry.
Puschmann and Burgess (2014) who argue that the genesis of the term Big Data
lies firmly in the business world. Although the early discussions on data
processing technologies in business closely reflected this necessity (for new
tools, allowing companies to deliver faster search results or store larger volumes
of customer data) it has since evolved into a conversation centred around using
Change the instruments, and you will change the entire social theory that goes with
them
Source: Latour, 2009 in boyd & Crawford, 2012: 665
fact a story of the past and sociology of expertise is a more timely and
comprehensive way to capture the changes of today's world. In his words, sociology of
expertise maintains an analytical distinction between experts and expertise as
two irreducible models of analysis, treating expertise neither as an attribution,
nor a set of skills, but as a network connecting actors, instruments, statements
and institutional arrangements (Eyal, 2013).
Coming from that, a straightforward question seems to be whether it is
the "data" or the "scientist" part of the role which has a stronger influence on
professional identification, or if it is something different. Ibarra suggests that in
assuming new roles, people not only acquire new skills but also adopt the social
norms and rules that govern how they should conduct themselves (Schein, 1978 in
Ibarra, 1999: 765). Practices and social norms of scientists in a lab setting were
already well investigated in a seminal study by Latour and Woolgar in 1979
(1986).
particularly relevant for data scientists as these individuals often transition from
academic/research backgrounds into industry, where the dynamics and
challenges of the environment require different practices. Identities have long
been seen as constructed and negotiated in social interaction (Mead, 1934;
Goffman, 1959) and socialization is not a unilateral process imposing conformity
on the individual. It is a negotiated adaptation by which people strive to improve
the fit between themselves and their work environment.
identity claims by conveying images that signal how they view themselves or
hope to be viewed by others, but it is unclear to what degree they remain part
of their past, scientific role, and to what degree part of a new form of a data
V. V. Communities of practice
Communities of practice (CoP) is a concept developed at the beginning of
the 1990s by Jean Lave and Etienne Wenger, who proposed a new model of
learning, described at the time as situated learning theory (Lave & Wenger,
1991). The concept was a critique of earlier cognitivist theories of learning, as
knowledge was said to be primarily not abstract and symbolic, but provisional,
mediated and socially constructed (Berger and Luckmann, 1966; Blackler, 1995).
Situated learning theory positions communities of practice as the context in
which an individual develops the practices - values, norms, relationships - and
identities appropriate to that community.
recognition and the ability to negotiate meaning, but does not necessarily entail
equality, respect or collaboration.
A particularly intriguing aspect of participation is how members of a
community gain status within it.
(1990) suggested that there is a distinction between a core, and peripheries, and
it is through continuous participation that one gains recognition or moves to the
centre. They have, however, deviated slightly from this opinion since then and
acknowledged that participation may involve learning trajectories which do not
lead to a comprehensive full participation (Handley et al., 2006: 644). This is
an important point to note in respect to the interviews from this study.
Another important aspect of CoP is identity. The concept of identity rests
on critical readings of social identity theory (Handley et al., 2006: 664); but,
according to Lave and Wenger (1991), learning is not simply about developing
one's knowledge and practice, but also involves a process of understanding who
we are and in which communities of practice we belong and are accepted. Two
main processes of identity construction in a workplace are identity-regulation
and identity-work. According to Handley et al. (2006: 644), the first process refers to
regulation originating from the organization (e.g. recruitment, induction and
promotion policies) and the employees' individual responses. The second process
of identity-work refers to employees' efforts to form, repair, maintain or revise
their perceptions of self, and this involves a negotiation between the
organization's efforts at identity-regulation (which the employee may, or may
not internalize) and the employee's sense of self, derived from current work as
well as other identities (Handley et al., 2006: 645) - all highly relevant for data
scientists in their working environments.
The third and final aspect of CoP is indeed practice, which according to
Brown and Duguid (2001: 203) is an undertaking or engaging fully in a task, job
or profession. After all, by participating in a community, newcomers develop an
awareness of that community's practice. They come to understand and engage
with - or adopt and transform - various tools, language, role-definitions and
other explicit artefacts and implicit relations.
interesting, as tools can be very defining of a profession, allowing for formal and
informal coordination and exchange of knowledge to be identified.
Finally, it is critical to note that communities of practice are not
homogeneous, but differ across several dimensions - geographic spread, lifecycle
and pace of evolution. Individuals may participate to a different degree in loose
networks of practice both across and beyond organizational boundaries, but
according to Handley et al. (2006: 646), it is in relation to these communities
and networks that individuals develop their identities and practices through
processes such as role modelling, experimentation and identity-construction. An
individual's continual negotiation of self within and across multiple
communities of practice may generate intra-personal instabilities within the
community. An example of this is a scenario where a newcomer experiences a
conflict of identity in relation to a role or practice he or she is expected to
adopt (Ashforth and Humphrey, 1993) - a case that data scientists are
particularly exposed to, as they enter new organizational environments with,
often, inflated expectations of the nature of their work.
The last two decades have seen significant changes to the way we use
computational tools in modern workspaces (Brynjolfsson & McAfee, 2014;
Pentland, 2014). The Internet, email, cloud services and mobile phones are only
a few manifestations of this phenomenon. Alongside consumer products, big changes
have also been happening in the world of artistic and scientific crafts - namely,
the worlds of design and science.
A good example from design is architects, who are less and less being
trained to be proficient at the drawing board, and instead master the use of
design software such as AutoCAD, Adobe Suite and the like. Another example
is surgeons, who increasingly need to become proficient in using tools that
allow them to conduct distant, robotic surgeries. These changes are also
reaching the sciences. The basic tools of scientific practice have changed too -
in many cases, today, a laptop connected to the Internet and appropriate
research software is enough to pursue multiple scientific inquiries.
This has
in cosmology there are enormous amounts of data to tackle; it's not a controlled
environment and there's pressure on taking as much data as there is possible - that's where
machine learning becomes useful, when you don't have enough information, or when you're
The recurring
scientists playing an active role in not only being the skilled technical
craftsman but also the digital champion or educator of this transition in skills
training.
- Researcher, exploring how people interact with technologies in health and life sciences
what would make a data scientist? exceptional programming skills, use of common
statistical software and an academic background in physical sciences or statistics
data scientists work less on the data collection side, or infrastructure (...) data science
is more about analysis - the front end between insight and data that refers to services and
data
they [data scientists] have to program, do statistics, follow the digital trail and know
what to do at the end of it (...) there is much about experience - humility about the data
in academia you're getting points for being clever, and in industry it works out what you do, or
it doesn't, it's important that you can do things quickly (...) in academia if you show how you did
it, they'll say umm... that's just linear regression
when you leave academia you start learning C++ and Python (...) there seems to be a big
community movement to change tools - catch-up with the great tools of the outside world
This, and other training programmes in data science - e.g. those led by
General Assembly - are part of a larger system of training packages
complemented by earlier established on-line education courses promoted by
academic power-houses such as MIT and Stanford (with a number of on-line or
distance courses in Data Science, Machine Learning and Data Visualization). This
strongly affirms that data science is a career path that did not exist in the past
(at least, not under such a name). It is only in recent years that traditional
organizations of power and educational credibility - organizations such as iSchool
at Berkeley University, NYU and Imperial College London - have opened research
programmes in data science and entered this growing field.
One cannot look at the emergence of the data scientist role without the
context, and ongoing evolution of the labour market and novel employment
opportunities associated with socio-economic and technological change. For a
long time, some of the best and financially most rewarding career paths for
students graduating in mathematics, physics, economics and engineering - in
general, quantitative degrees - were in big technology, engineering or financial
organizations.
centres. It goes without saying that post-graduate education - which still remains
at the core of the data scientist training - has also changed over the years. More
and more PhD training opportunities have been offered at academic institutions
that were later not matched with further post-doc or tenured opportunities in
academia.
empirical backing of the Higgs Boson, in itself was home to hundreds of PhD and
post-doctoral scientists in fields ranging from physics, through to engineering
and computer science.
finished, they may have to consider moving into industry due to the limited
opportunities at other research institutions. This is said to be one of the reasons
why the label "data science" has found fertile ground - scientists needed to
re-brand themselves for the purpose of industry roles.
Amongst the informants of the study were both experienced individuals
who have gone down the path to becoming a data scientist and individuals
working at the gateways of this role. This provided the study with an interesting
perspective on where and when the transition between data science began, and
whether it could be a sustainable label for self-identification in a working
environment.
I heard about data science for the first time as a PhD student, during an industrial placement
I did at a *major web-company*
- Data Scientist [1], working in industry
I heard for the first time about data science when I was working at *major tech company* 4-5
years ago and tech-companies were starting recruiting data scientist
I heard about data science for the first time when I was doing a PhD,
and people were leaving for industry
academia is pressured for novelty (...) that's why each life scientist writes his own
code, because everybody in academia needs to come with their own solution, and industry is
more about finding the code that is the most efficient for the given task and finance is
absolutely the best at it, and has for years been hiring some of the best of the best physicists
- Researcher, exploring how people interact with technologies in health and life sciences
It also seems that the term data scientist has strong origins in the tech industry, in particular in places such as San Francisco and Seattle (home to the
largest tech-companies). To some degree, this should not be surprising, as data
science emerged from the recruiting practices of companies that really could
pursue Big Data - Microsoft, IBM, Facebook, Twitter, Google etc. - and it was (and
still often is) through the endeavours of these organizations that the term seems to be
receiving so much attention.
They
are the most involved in the transition of scientists into tech jobs, which also
fits the larger campaigns narrative for STEM education.
- Researcher, exploring how people interact with technologies in health and life sciences
The type of projects which data scientists get involved in - unless they are
in strictly domain-specific areas such as banking, insurance or biological mapping -
often entail complexity that reaches far beyond what a simple computational system could frame, and are within the interplay of a number of dynamic socio-technical systems. Data-driven decision making, which seems to be at the heart
of data science for such areas as public health, transport, or public
services, requires expert knowledge from a range of disciplines and standpoints, often also taking into consideration social, political, scientific, usability
and aesthetic aspects of the developed solutions.
However, inter-disciplinary
I believe data science is more about the science bit, than the data (...) for example,
designers use data from a design perspective, but what they really do is design
data in itself is useless, unless it can be used as a tool to solve business or research questions (...)
that's why data scientists need to express not only technical proficiency and a data-driven approach, but
also soft skills: team working, storytelling, engaging communication
Difficulty with
A statistician's way of thinking is being comfortable with uncertainty, and that's often
quite the opposite of programmers, who in how they work need proof of correctness
I can plot results in R, MatLab, Gnuplot, when I speak to other statisticians, but
with designers, they need to be briefed in other ways (...) I met creatives who have no idea about
data science and just see it through visualization, but completely lose the science parts, and
that's a completely lost perspective
[Data Science London] This is a meetup for data scientists, data miners, statisticians, data
analysts, data engineers, data architects, data visualizers, data journalists, data science
practitioners, data consultants, academics, researchers, people from science and social
sciences, and in general people directly involved in data projects.
- Meet-up website
The formula of these events is mostly built around a set of data provided
either by a third party or by the organizers themselves. This data is, in many
cases, unstructured and requires a certain level of ingenuity and expertise to
make use of it. The data is then made available to teams of hackers - composed of
data scientists, software engineers, web developers, graphic designers and
others - who devise, in most cases, a prototype for a data-product that in some
way would address a need in a novel and impactful way. This is the space where
interdisciplinary work takes place at its most extreme. Participants often do not
know each other before the event and have to form teams through conversation
at the beginning of the meeting. At many of these events, it is reiterated that
group-work leads to the best results, requiring an effort to reach out to people
who are usually outside of one's disciplinary background.
features that were respected. Individuals from different fields had to make an
effort to be accepted into the group, particularly if they did not have the data
literacy and ability to use the appropriate data science tools.
For a
It is
only in the last few years, if not months, that certain institutional frameworks for
recognition and accreditation have begun to be established (e.g. the Insight Data
Science Fellowships, Coursera or edX courses), and it seems quite clear that the
expectation for data science skills flourishes on the demand side of the market,
rather than the supply side. For that reason, it was worth exploring what were
the skills and tools so much desired by industry recruiters:
Search, Solr, Java, Pig, Map Reduce), expert SQL skills, exposure
and understanding of development tools such as Java; predictive
analytics and machine learning packages (and BA/BS in maths/
statistics/machine learning or equivalent)
Source: LinkedIn search results (2014)
many people come into data science, or machine learning meet-ups, they're provided
with data, they'll run a simple algorithm and say this is the result; I'm a data scientist (...)
that's not how this works
good ones [data scientists] know how to use proper tools for a given context,
others are just enthusiasts
data is now on everyone's mind (...) it's a bit of a frenzy (...) and if you don't use data you're
not innovating
- Researcher, exploring how people interact with technologies in health and life sciences
because it's a newish thing [data science], it seems attractive for executive staff (...) like,
"wow, he does data science" means he's doing innovation
a lot of people are trying to brand themselves as data scientists, e.g. a quant wants to find a
job, or an Excel analyst tries to sell himself (...) there is a sense of tribalism, you know, I want
to brand myself as a data-scientist, these are the good networks, and things I can pick up and
progress in my career
VII. Discussion
Research and analysis conducted during this short study lead to
a convincing assertion that we might be witnessing the establishment of a new
profession - emerging on the boundaries of engineering, computing and statistics
- that sits within a longer lasting tradition of the evolution of the data analyst
role. The need for a new breed of researchers and data analysts has been
expressed on both sides of the market - amongst the scientific community and
the industry - spearheaded by technology companies from technology hubs of the
United States. This phenomenon fits into the picture of a more universal change
happening in society, that is, digitization and scientification of work practice.
This process might indeed be concealed under the messages of increasing
datafication of products, services and policy interventions, marked by
additional slogans of "data-driven analytics" or "evidence-based decision
making".
new role, and how does the self-perception of authenticity accord with what one
feels and communicates around his or her competencies?
For the self-perceived data scientist and for the industry recruiters who
also shape the perception of the profession, the role is strongly associated with
an advanced (Master's or PhD) degree in applied quantitative disciplines or
computer science. This is because the role of the data scientist seems to assume
a blend of computing and quantitative analysis skills, backed with
practice and experience in conducting scientific work. This educational
background is still asserted by the labels of higher education; however,
relevant online and industry-led programmes have been made available to
enrich the training base.
knowledge about the use of Big Data tools, and their appropriate use for a
changing organizational context. In many cases the tools developed by industry
in the last few years are the ones mostly associated with the data scientist's role:
Hadoop, Cassandra, Map/Reduce, Hive and Pig are all part of the new generation of Big
Data tools.
Java; and analytical tools MatLab, Stata and R, too. These are the technical
craftsmanship tools that are expected of data scientists by the labour market,
and to some degree, by the data scientists themselves. In addition, as research
and analysis suggest, it is not the pure knowledge of these tools that makes one
a respected data scientist amongst peers, but the ability to independently
choose the appropriate tools for the given context and the aptitude to skilfully
interpret and communicate the findings. Pure knowledge of these tools,
training, or education does not yet seem to make one a data scientist. This is,
rather, a consequence of the type of role one is expected to pursue at the
workplace and its title. For example, there are a number of individuals
possessing the above characteristics who pursue the data scientist's tasks, but are
not labelled as data scientists.
Due to this, the phenomenon corresponds closely with the question of
how the job of the data scientist fits the larger picture of the making of science
and the evolution of data analysis roles. There is a push in the public narrative,
supported by the tech-industry and backed by some policy decisions, that the
As in many
exposure to some of the most sophisticated data problems, are often at the
forefront of these conversations.
As a result, data scientists sit alongside the changing (or rather
evolving) nature of techne and episteme deriving from the introduction of Big
Data approaches to successively new areas of data analysis. They bear much
responsibility for how the interpretation of novel data analysis tools and
approaches will be translated into the fabric of the organizations that they are
working for, or the cause they are impacting. As with the use of machine
learning without a thorough understanding of the investigated data, Big Data
methods can likewise exclude certain observations. Data scientists are therefore
playing dual-natured roles in the organizational context that they are operating
within. They are the source of insights, research and of scientific rigour to the
pursued data-problem.
science and STS (e.g. Thomas Kuhn, 1962 and Donna Haraway, 1988) there is
rarely (if ever) an objective truth or neutral agenda and so objectivity is
situated and historically specific, particularly within an organisational setting.
This in some way circles the conversation back towards the notions of
education - data, computational and epistemological literacy -
and the
implications for the systems of knowing, and the meaning of learning. The ways
in which data scientists will be establishing their professional identities - attributes,
beliefs, values, motives and experiences - as the profession grows,
might have substantial implications for how decision making is conducted in
industry, business or the public sphere. For that reason, it is critical to make
sure the process of educating data scientists is comprehensive enough to
overcome the interpretative socio-technical and political limitations of Big Data,
machine learning, and whatever comes next.
science, data scientists and the making of the next generation of data analyst
roles so important for further research.
The study
Bibliography
Abbott, A. (1988). The System of Professions: An Essay on Division of Expert Labour.
Chicago: The University of Chicago Press.
Adams, M., & Kowalski, G. (1980). Professional Self-Identification Among Art Students.
Studies in Art Education vol. 21 no. 3, 31-39.
Anderson, C. (2008, 23 June). The End of Theory: The Data Deluge Makes the Scientific
Method Obsolete. Retrieved from Wired Magazine: http://archive.wired.com/
science/discoveries/magazine/16-07/pb_theory
Ashforth, B., & Humphrey, R. (1993). Emotional Labor in Service Roles: The Influence of
Identity. Academy of Management Review vol. 18 no. 1, 88-115.
Bandura, A. (1977). Self-efficacy: Toward a Unifying Theory of Behavioural Change.
Psychological Review vol. 84, no. 2, 191-215.
BBC. (2010). Joy of Stats (with Prof. Hans Rosling) [Motion Picture].
Becker, S., & Carper, J. (1956). The Development of Identification with an Occupation.
The American Journal of Sociology, 289-298.
Berger, L., & Luckmann, T. (1966). The Social Construction of Reality. New York:
Penguin Books.
Biao, X. (2006). Global "Body Shopping": An Indian Labour System in the Information
Technology Industry. Princeton, NJ: Princeton University Press.
Bijker, W., & Law, J. (2012). Shaping Technology/Building Society: Studies in
Sociotechnical Change. MIT.
Blackler, F. (1995). Knowledge, Knowledge Work and Organizations: An Overview and
Interpretation. Organization Studies, 1021-1046.
Boellstorff, T. (2013). Making big data, in theory. First Monday vol. 18, nr. 10.
boyd, d., & Crawford, K. (2012). Critical Questions For Big Data: Provocations for a
cultural, technological and scholarly phenomenon. Information, Communication
& Society vol. 15, iss. 5, 662-679.
Brown, J., & Duguid, P. (2001). Knowledge and Organization: A Social-Practice
Perspective. Organization Science 12(2), 198-213.
Brynjolfsson, E., & McAfee, A. (2014). The Second Machine Age: Work, Progress, and
Prosperity in a Time of Brilliant Technologies. New York: W.W. Norton & Company.
Burkholder, L. (1992). Philosophy and the Computer. Boulder, San Francisco and
Oxford: Westview Press.
Cleveland, W. S. (2001). Data Science: an Action Plan for Expanding the Technical Areas
of the Field of Statistics. International Statistical Review, 21-26.
Coleman, G. (2013). Coding Freedom: The Ethics and Aesthetics of Hacking. Princeton:
Princeton University Press.
Data Science Institute. (2014, September 5). Data Science Institute - Events. Retrieved
from Imperial College London: http://www3.imperial.ac.uk/data-science/events
Data Science London. (2014, September 5). About @DS_LDN. Retrieved from Data
Science London: http://datasciencelondon.org/data-science-london/
Davenport, T., & Patil, D. J. (2012). Data Scientist: The Sexiest Job of the 21st Century.
Harvard Business Review.
Diebold, F. (2012). "On the Origin(s) and Development of the Term Big Data". Working
Paper - Penn Economics.
Eyal, G. (2013) For a Sociology of Expertise: The Social Origins of the Autism Epidemic.
AJS vol. 118, no 4., 863-907
Forbes. (2014, September 5). Article: Blueprints Of NSA's Ridiculously Expensive Data
Center In Utah Suggest It Holds Less Info Than Thought. Retrieved from Forbes:
http://www.forbes.com/sites/kashmirhill/2013/07/24/blueprints-of-nsa-datacenter-in-utah-suggest-its-storage-capacity-is-less-impressive-than-thought/
Friedman, J. H. (2001). The Role of Statistics in the Data Revolution? International
Statistical Review 69, (1), 5-10.
Gillespie, T. (2014). The relevance of algorithms. In T. Gillespie, & B. P., Media
technologies: Essays on communication, materiality, and society (pp. 167-194).
Cambridge, MA: MIT Press.
Gitelman, L., & Jackson, V. (2013). Introduction. In L. Gitelman (Ed.), "Raw Data" Is an
Oxymoron (pp. 9-23). Cambridge, MA: The MIT Press.
Goffman, E. (1959). The Presentation of Self in Everyday Life. Anchor Books.
Google. (2014, September 5). Company Overview. Retrieved from Google Company:
https://www.google.com/about/company/
Hall, R. (1968). Professionalization and bureaucratization. American Sociological Review,
92-104.
Handley, K., Sturdy, A., Fincham, R., & Clark, T. (2006). Within and beyond
communities of practice: Making sense of learning through participation, identity
and practice. Journal of Management Studies, 641-653.
Haraway, D. (1988). Situated Knowledges: The Science Question in Feminism and the
Privilege of Partial Perspective. Feminist Studies vol. 14, no. 3, 575-599.
Harvey, D. (2007). A Brief History of Neoliberalism. Oxford: Oxford University Press.
Ibarra, H. (1999). Provisional selves: Experimenting with image and identity in
professional adaptation. Administrative Science Quarterly vol. 44 iss. 4, 764-791.
IBM. (2014, September 5). Apply new analytics tools to reveal new opportunities.
Retrieved from IBM Smarterplanet: http://www.ibm.com/smarterplanet/us/en/
business_analytics/article/it_business_intelligence.html
Insight Data Science Program. (2014). White Paper. San Francisco: Insight Data Science
Program.
Kelty, C. (2008). Two Bits - The Cultural Significance of Free Software. Durham and
London: Duke University Press.
Krause, E. (1971). The Sociology of Occupations. Boston: Little, Brown and Company.
Kuhn, T. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago
Press.
Latour, B. (1987). Science in Action: How to Follow Scientists and Engineers through
Society. Cambridge, MA: Harvard University Press.
Latour, B., & Woolgar, S. (1986 (1979)). Laboratory Life: The Construction of Scientific
Facts. Princeton, NJ: Princeton University Press.
Latour, B., Jensen, P., Venturini, T., Grauwin, S., & Boullier, D. (2012). The whole is
always smaller than its parts: a digital test of Gabriel Tarde's monads. The
British Journal of Sociology vol. 63, iss. 4, 590-615.
Lave, J., & Wenger, E. (2008 (1991)). Communities of Practice: Learning, Meaning, and
Identity. Cambridge University Press.
Levy, S. (1984). Hackers: Heroes of the Computer Revolution. New York: Nerraw
Manijaime/Doubleday.
Lohr, S. (2012, August 11). How Big Data Became So Big. Retrieved from The New York
Times: http://www.nytimes.com/2012/08/12/business/how-big-data-becameso-big-unboxed.html?pagewanted=all&_r=0
Manovich, L. (2011). Trending: The Promises and the Challenges of Big Social Data. In
M. K. Gold, Debates in Digital Humanities. The University of Minnesota Press:
Minneapolis.
Mattmann, C. A. (2013). A vision for data science. Nature vol. 493, 473-475.
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Eamon Dolan/Houghton Mifflin
Harcourt.
McIntosh, P. (1989). Feeling like a fraud: Part II. Stone Center Working Paper no. 37,
Wellesley College.
McKinsey. (2011). Big data: The next frontier for innovation, competition, and
productivity. McKinsey.
Mead, G. H. (1934). Mind, Self, and Society. Chicago: University of Chicago Press.
Miller, D., & Slater, D. (2001). The Internet: An Ethnographic Approach. London:
Bloomsbury Academic.
Moss-Racusin, C., Dovidio, J. F., Brescoll, V., Graham, M., & Handelsman, J. (2012).
Science faculty's subtle gender biases favor male students. Proceedings of the
National Academy of Sciences of the United States of America (vol. 109 no. 41),
16474-16479.
O'Reilly. (2013, September 5). Retrieved from Strata Conference: http://
strataconf.com/
Parks, M. (2014). Big Data in Communication Research: Its Contents and Discontents.
Journal of Communication vol. 64, iss. 2, 355-360.
Parry, R. (2014, September 5). Episteme and Techne. Retrieved from The Stanford
Encyclopedia of Philosophy (Fall 2014 Edition): http://plato.stanford.edu/
archives/fall2014/entries/episteme-techne/
Pentland, A. (2014). Social Physics: How Good Ideas Spread - The Lessons From a New
Science. New York: The Penguin Press.