
Privacy Snooper: IOT

Arnab Kumar1, Harishma Dayanidhi1 and Vijay Kumar KS1

{arnabk, hdayanid, vkanlanji}@andrew.cmu.edu

1 Carnegie Mellon School of Computer Science, Pittsburgh, USA

Abstract. In various ML-as-a-service cloud systems, the process of performing machine learning over
the data is treated almost as a black box: the user feeds in their data, knows the model used, and the
system outputs the required insights. In this work, we explore the idea of predicting sensitive attributes
associated with a database, given that the adversary has access to a few quasi-identifiers associated
with it. We use the inversion attack as the theoretical foundation for our attack and implement it for
our database. We run this attack against different variants of classification-based algorithms, namely
the classification tree and the regression tree. We follow this up by analysing the accuracy of our attack
for each of these machine learning algorithms across different training-set sizes. We end our work by
trying to identify what we call the most impactful attribute, by selectively removing the data pertaining
to an attribute and checking the corresponding effect on the inversion attack. We hope our work in this
domain pushes future batches of this class to explore this question even further, and to look into whether
Differential Privacy solves this problem.
Keywords: Inversion Attack, Black Box, Classification Tree, Regression Tree

1 Introduction

The Internet of Things (IoT) is the network of physical objects or things embedded with electronics, software, sensors, and network connectivity, enabling these objects to collect and exchange data. British entrepreneur
Kevin Ashton first coined the term in 1999 while working at Auto-ID Labs (originally called Auto-ID Centers), referring to a global network of RFID-connected objects. The possibilities of such an inter-connected world are
truly mind-blowing. The Internet of Things allows objects to be sensed and controlled remotely across existing network infrastructure, creating opportunities for more direct integration between the physical world
and computer-based systems. Such an inter-connected network of devices results in improved efficiency,
accuracy and economic benefit. Experts estimate that the IoT will consist of almost 50 billion objects by
2020.
The term IoT has been used to refer not just to the advanced connection of devices, typically machine-to-machine (M2M) communications, but also to a wide variety of protocols, domains, and applications. A thing,
in the IoT sense, can refer to a wide variety of devices such as heart monitoring implants, biochip transponders on farm animals, electric clams in coastal waters, automobiles with built-in sensors, or field operation
devices that assist firefighters in search and rescue operations. These devices collect useful data with the
help of various existing technologies and then autonomously flow the data between other devices. Current
market examples include smart thermostat systems and washer/dryers that use Wi-Fi for remote monitoring.
The growing influence of IoT has now extended to surveillance devices as well. Surveillance is the
monitoring of the behaviour, activities, or other changing information, usually of people, for the purpose of
influencing, managing, directing, or protecting them. This can include observation from a distance by means
of electronic equipment (such as CCTV cameras), or interception of electronically transmitted information
(such as Internet traffic or phone calls); and it can include simple, relatively no- or low-technology methods
such as human intelligence agents and postal interception. The wide variety of information collected by
different types of surveillance now ranges from video and audio to heat-sensing and biometric data. Such
surveillance by the government is justified on the grounds that it is used for intelligence gathering, the prevention of
crime, the protection of a process, person, group or object, or for the investigation of crime. Such justifications
have only been further accentuated by 9/11 and other terrorist attacks in recent times.
Such continual surveillance, by public and private entities alike, has raised several privacy concerns. Added
to that, the plethora of new application areas for the Internet has caused previously incomprehensible amounts
of data to be generated. The development of sophisticated machine learning algorithms has now allowed
extensive analysis of this data, enabling insights that blatantly invade the privacy of the individuals whose
data is collected, without those individuals being given any choice or notice that such data collection is taking place. As students of Privacy Engineering, and having faced such privacy harms in our own lives, we felt that a solution for this is highly imperative.
One of the most obvious, and foolish, solutions would be to ask the government and private companies to stop all forms of surveillance. Several public advocacy organisations, companies from across the
political spectrum, and initiatives like StopWatching.us have asked the government to completely cut down
on surveillance. However, we believe that such a solution would not just be impractical to implement,
but would pose huge concerns for national safety and security. After giving extensive thought to several
probable solutions to this problem, we believe that perhaps the best possible way forward is to
properly notify an individual of the existence of all such surveillance devices. How individuals react to and
proceed in such circumstances is at their own discretion. We aspired to come up with a solution that
would give them the ability to be well informed of the existence of surveillance devices in their surroundings,
and of their type (audio, video, biometric etc.), so that an individual may react accordingly. They might
also choose to avoid a route containing such surveillance devices if they deem it uncomfortable.
Our project aspires first to build this intuition, providing an introduction to the inversion attack
in Section 3. We then extend the inversion attack to different classification-based algorithms
- the classification tree and the regression tree - and lay the foundation for understanding which classification-based
machine learning algorithm is more susceptible to inversion attacks, also in Section 3. We then
describe the dataset used for the purpose of our research in Section 4. Section 5 explores the application of the
inversion attack to our dataset, examining the variation in the accuracy of the attack across
different training-set sizes and trying to determine the optimal training-set size. We
also try to understand the impact different attributes have on the inversion algorithm, investigating how
their presence or absence increases or decreases the probability of accurately predicting the sensitive attribute.
We then present some highlighted observations and the conclusion of our work in Section 6. We wrap up
our work by proposing avenues of future work and a few words of thanks to our mentors in Section 7
and the Acknowledgements respectively.

2 Related Work and Literature Review

In our initial proposal for the semester project, we proposed an idea combining two ideas of secure machine
learning: Differential Privacy and machine learning on encrypted databases. We believed that combining
these two seemingly orthogonal problems would give us what we considered complete security for machine
learning. Our idea aimed to address privacy concerns protecting data contributors from adversaries under
different threat models: the case where the adversary is the analyst (differential privacy) and the case where
the adversary is a person trying to access the original database (machine learning on encrypted databases).
Streamlining our idea required us to read a variety of papers, both from the perspective of how machine
learning could leak information about the model being used and whether combining this idea with Nina Taft's work
on machine learning on encrypted databases [?] would be practical for the purpose of this project. Sifting
through multiple papers on both these topics led us to the conclusion that it would be pragmatic
for us to concentrate on the former: understanding how machine learning models could leak information.
This led us to look into the work by Dr. Fredrikson on inversion attacks, which enable an adversary to backtrack
and predict sensitive attributes associated with an individual from partial knowledge of quasi-identifiers and
the output of the machine learning model [?] [?]. A basic introduction to the idea behind inversion attacks
is given in the next section.
Work on staging inversion attacks on machine learning models is still in its nascent stages. The paper [?] shows how machine learning now enables doctors to predict the
recommended dosage of warfarin based on a sensitive attribute like genetic information and other quasi-identifiers
like geo-information. However, the authors then proceed to propose a black-box inversion
attack that enables an attacker to predict the genetic information of an individual in the database from
his/her quasi-identifiers and recommended dosage of warfarin. What was most appalling about this work
was the accuracy with which a highly sensitive attribute like genetic data could be predicted. Such
methods were then extended in the paper [?], where the authors apply the algorithm developed in
[?] to cases such as facial recognition.

3 Sequence

With a topic as broad as Privacy in the realm of Internet of Things, the journey from initial idea conceptualisation to final implementation involved introspection and intensive research at each and every step.


This section guides the reader through each of these steps, and clearly elucidates every decision taken by
us, along with the corresponding reason for doing so.
3.1 Idea Conceptualisation

With severe security concerns prevailing worldwide, continual surveillance by public and private entities has
emerged as perhaps the only way to counter such problems, or at least to provide sufficient evidence for their
investigation. This solution, though, has raised several privacy concerns. The combination of previously
incomprehensible amounts of data and sophisticated machine learning algorithms has now led to extensive
yet highly privacy-invasive analysis of this data. Perhaps the scariest aspect of all this is that neither choice
nor notice is provided to data providers when such data collection occurs. As students of Privacy Engineering, we gave extensive thought to several probable solutions to this problem.
Some contemplation led us to believe that perhaps the best yet simplest possible way of going about this
is to properly notify the user of the existence of surveillance devices around him/her. This, according to
our expectations, would at the very least lead individuals to become more privacy-aware, without us taking
any responsibility for how individuals react to surveillance notifications. We shall illustrate two of our most
important ideas:

1. To build a static map based application that presents the user with the location of all surveillance
devices. This would however require initial crowdsourcing by several users to identify all surveillance
devices even within the perimeter of the university.
2. To build an application that could dynamically detect the presence of surveillance devices and notify the
user about them via standard notifications. Notifications could be customised for different types of
surveillance, e.g. audio, video and biometric. Further detailed information about the surveillance could
be provided via a Privacy Statement associated with these notifications. The Privacy Statement is
a twist on the conventional privacy policy, as it addresses the questions regarding the surveillance that
the users of the application found most useful, in a manner that users find concise and
easily understandable. Though this would require a mechanism by which surveillance devices could
communicate with our mobile devices, it could be replicated via beacons positioned alongside these
surveillance devices.
After a lot of thought put into each idea, and discussion with several PhD students at Cylab, we narrowed
our focus to building an application that could dynamically detect the presence of surveillance devices and
notify the user about them via standard notifications. Some of the key points that convinced us to work
on an iOS application for the latter idea are as follows:
1. The scenario associated with the latter idea is something we could see becoming the norm in the next
decade. Though current surveillance devices do not possess the capability to communicate with other
devices, we could easily replicate the same in a university or college setting via beacons positioned
alongside them.
2. It represents a dynamic scenario of being notified whenever a person nears a surveillance device. This
has much more utility than a person having to consult a map or continuously pay attention to it while
he/she traverses the campus.
3. It enables us to address another one of the broken systems in privacy and security these days: the privacy
policy. Most of the privacy policies associated with data collection are extremely long and convoluted.
The Privacy Statement provided within our application is concise, to the point, and addresses the questions
that users are most concerned about.
3.2 Basic Technical Decisions

Once the foundational idea for the project was set, we moved towards illustrating the nuances of how to
implement it for the purpose of this project. This involved a variety of decisions, from choosing
whether we would want to develop an Android or iOS application, to figuring out which beacons we should use.
All of our decisions were taken with not just the technical capabilities of our team in mind, but also
other factors that might affect the scalability of this application for wider use in the future:


Operating System We chose to build an iOS application for this project. The reasons
for this choice are as follows:
1. Several studies like *cite* have shown that iOS users are more concerned about security and privacy than
Android users. This fits in well with the initial demographic that we target, namely privacy-savvy
individuals.
2. The documentation provided by Apple for iOS application development is highly detailed, with a highly
active developer network and online presence.
3. Our team possessed more experience with developing applications on the iOS platform.

iBeacon OR Eddystone As mentioned earlier, our application requires a mechanism that allows data collection devices such as video cameras, audio recorders etc. to transmit information about themselves
to our mobile devices. Unfortunately, surveillance devices do not currently possess this ability. This
requires us to validate the utility of our app through a workaround. We plan to do so via Bluetooth
Low Energy (B.L.E.) devices like beacons accompanying all the surveillance devices.
Bluetooth is a wireless technology standard for exchanging data over short distances. This technology
was developed by Ericsson Corporation in 1994. It allows a range of devices to speak to each other using
radio waves. Bluetooth was designed to eliminate the need for connecting computer peripheral devices such
as mice and keyboards with wires. Since its inception it has become a standard for transmitting information
in the form of short-range radio waves over distances of up to 30 feet, and is used to connect and exchange
information between digital devices.
Bluetooth Low Energy is a wireless personal area network technology used for transmitting data over
short distances. As the name implies, it is designed for low energy consumption and cost. More details about
B.L.E. can be found in the Bluetooth 4.0 core specification [1]. B.L.E. maintains a communication range similar
to that of its predecessor, Classic Bluetooth [2]. All major mobile operating systems, including iOS, Android and
Blackberry, support B.L.E. technology. It finds varied applications, ranging from basic health care to
sports and fitness.
Different vendors have different implementations of the Bluetooth 4.0 communication protocol. For instance,
Apple Inc. has come up with iBeacon; similarly, Google has come up with the Eddystone protocol. iBeacon is
officially supported on iOS devices only, whereas Eddystone has support for both iOS and Android.
One major limitation of iBeacon is that there is no way for a beacon to send useful data back to
the nearest mobile device. The iBeacon hardware transmits data that is used by the device to determine
proximity to a unique iBeacon hardware device, but it does not provide any mechanism to transmit custom
data from the iBeacon to the mobile application. This seriously limits the utility of the iBeacon. Different
companies have built beacons adhering to these standards. In our implementation, we explored a variety of
Bluetooth beacons, such as GemTot, XY Find It, BlueCats and Estimote. Out of these, we selected Estimote for the following reasons:
1. Estimote supports Eddystone, which means both Android and iOS devices are supported.
2. Using Estimote, we can send a U.R.L. back to the mobile device, which is a significant advantage.
3. There is good documentation and active support for developers in the Estimote forum.
4. Estimote beacons are easy to use and can be configured using a mobile device.
Model Inversion attack was first described in a case study of linear classifiers in personalised medicine
by Fredrikson et al. The basic idea behind the attack is that an adversary with access to a machine learning
model abuses this privilege to learn sensitive information about individuals, given that the adversary would
be aware of a few quasi-identifiers associated with the individual whose data he/she is trying to compromise.
To explain the concept, we shall be taking an example of the warfarin dosage case study as discussed by
Fredrikson et al. in [?].
In our project, we assume an adversary who employs an inference algorithm A to discover the genotype
of a target individual α. The adversary has access to a linear model f trained over a dataset D drawn i.i.d.
from an unknown prior distribution p. D has domain X × Y, where X = X_1 × ... × X_d is the domain of possible
attributes and Y is the domain of the response. α is represented by a single row in D, (x^α, y^α), and the
attribute learned by the adversary is referred to as the target attribute x_t.


In addition to f, the adversary has access to the marginals p_{1,...,d,y} of the joint prior p, the dataset domain
X × Y, α's stable dosage y^α of warfarin, some information about f's performance, and either of the following
subsets x_K of α's attributes:
1. Basic demographics: a subset of α's demographic data, including age (binned into eight groups by the
IWPC), race, height, and weight. Note that this corresponds to the subset {age, race, height, weight} of the non-genetic
attributes in D.
2. All background: all of α's attributes except the genotype.
The adversary has black-box access to f. Unless it is clear from the context, we will specify whether f is the output of a DP mechanism, and
which type of background information is available.
The authors then devised algorithms for inferring the genotype from a model designed to
predict warfarin dosing. Based on the known priors, and on how well the model's output on a given candidate row
coincides with A's known response value, the candidate rows are weighted. The target attribute value with the
greatest weight, computed by marginalising over the other attributes, is returned. A mathematical formulation
of the algorithm is given in [?]. The authors derive each step by showing how to compute the least-biased
estimate of the target attribute's likelihood, which the model inversion algorithm maximises to form a prediction.
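As a rough sketch (our paraphrase of the estimator described in [?], not the authors' exact formulation), the attack can be read as a maximum a posteriori estimate over candidate values of the target attribute:

\[
\hat{x}_t \;=\; \operatorname*{arg\,max}_{v \in X_t} \; p_t(v) \sum_{x' \,:\, x'_K = x_K,\; x'_t = v} \pi\big(y^{\alpha}, f(x')\big)\, \prod_{i \notin K \cup \{t\}} p_i(x'_i)
\]

Here \(\pi\) weights a candidate row by how closely the model's output \(f(x')\) matches the known response \(y^{\alpha}\), and the priors \(p_i\) marginalise the attributes the adversary does not know.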
For the purpose of this report, we shall not focus much on the mathematical
foundation on which the black-box inversion attack is built, since it is not our
primary contribution. Readers interested in the details are recommended
to read the paper [?] by Fredrikson et al., where the authors go to great lengths to explain the
mathematical intuition and formulation of the concept. Our code for implementing the black-box
inversion attack has been attached along with our submission. As expected, we shall
lay more focus on the results we obtained by applying the black-box inversion attack to various
machine learning models.

Algorithm 1: Algorithm for Inversion Attack

Input: A dataset with one or more sensitive attributes
Result: A graph of the accuracy of the attack for each training-set size
begin
    l = 0;
    for each dataset do
        for each classification-based algorithm in {classification tree, regression tree} do
            Pick a set of splits for the training and testing sets;
            for every training and testing split do
                Train and acquire the model;
                for each row in the dataset do
                    Given the model and the output value for the row, run the inversion attack on the model by substituting all possible values for the sensitive attribute and check whether it is predicted correctly;
                Find the average accuracy of the inversion attack over all rows;
            Plot a graph of the accuracy of the attack for each training split;
        Compare the split values with the highest inversion-attack accuracy for each of the algorithms;
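To make the inner loop of Algorithm 1 concrete, the sketch below shows one way it could be implemented in Python, assuming a scikit-learn-style model and a pandas DataFrame of label-encoded attributes. The simple distance-based score is our simplification; the full attack in [?] also weights candidates by their marginal priors.

```python
import numpy as np

def invert_row(model, row, y_true, sensitive_col, candidate_values):
    """Guess the sensitive attribute of one row via model inversion.

    The adversary knows every attribute of `row` except `sensitive_col`,
    knows the true response `y_true`, and has black-box access to `model`.
    Each candidate value is scored by how closely the model's prediction
    on the completed row matches the known response.
    """
    best_value, best_score = None, -np.inf
    for v in candidate_values:
        candidate = row.copy()
        candidate[sensitive_col] = v
        y_hat = model.predict(candidate.values.reshape(1, -1))[0]
        score = -abs(y_hat - y_true)  # closer prediction => higher weight
        if score > best_score:
            best_value, best_score = v, score
    return best_value

def inversion_accuracy(model, X, y, sensitive_col, candidate_values):
    """Average fraction of rows whose sensitive attribute is recovered."""
    hits = sum(
        invert_row(model, X.iloc[i], y.iloc[i], sensitive_col, candidate_values)
        == X.iloc[i][sensitive_col]
        for i in range(len(X))
    )
    return hits / len(X)
```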

4 Dataset Used

The dataset used for the purpose of this work is a dataset named steak-risk-survey.csv downloaded from
fivethirtyeight.com [?]. Participants in this survey were asked questions about their preferences for steak
preparation. The survey also involved participants answering some personal questions. Following are the
exact questions that were asked in the survey.
1. Consider the following hypothetical situations: In Lottery A, you have a 50 percent chance of success,
with a payout of 100 dollars. In Lottery B, you have a 90 percent chance of success, with a payout of 20
dollars. Assuming you have 10 dollars to bet, would you play Lottery A or Lottery B?
2. Do you ever smoke cigarettes?
3. Do you ever drink alcohol?
4. Have you ever been skydiving?

5. Do you ever drive above the speed limit?
6. Have you ever cheated on your significant other?
7. Do you eat steak?
8. How do you like your steak prepared?
9. What is your Gender?
10. What is your Age?
11. What is your Household Income?
12. What is your Education?
13. What is your Location?

Personal data was anonymised and then published. The machine learning algorithm is used to
predict how the individual likes their steak prepared. The inversion attack then tries to predict a sensitive attribute, namely
whether the person has ever cheated on his/her significant other. The adversary has access to the anonymised data,
the model, the result, and the other quasi-identifiers in the data set. Moving ahead, we also try to identify what
we call the most impactful attribute: we remove the data pertaining to a specific attribute (assuming
the adversary has zero knowledge of it) and check the corresponding effect on the accuracy of the inversion attack.
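For concreteness, the following is a minimal sketch of how the survey could be loaded and label-encoded before training. The shortened column names are our own and would need to be mapped to the actual headers in steak-risk-survey.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("steak-risk-survey.csv")
# Hypothetical renaming: the raw CSV uses the full survey questions as headers.
df = df.rename(columns={
    "How do you like your steak prepared?": "steak_prep",
    "Have you ever cheated on your significant other?": "cheated",
})
df = df.dropna(subset=["steak_prep", "cheated"])

# Encode every categorical answer as an integer so tree learners can use it.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col].astype(str))

y = df["steak_prep"]                 # model output: preferred steak preparation
X = df.drop(columns=["steak_prep"])  # quasi-identifiers plus the sensitive attribute
```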

5 Inversion Attack - Results

5.1 Regression Tree

Decision tree learning uses a decision tree as a predictive model which maps observations about an item
to conclusions about the item's target value. Decision trees where the target variable can take continuous
values (typically real numbers) are called regression trees [?]. In our initial analysis, we have used regression
trees to predict how the individual likes their steak prepared, e.g. Rare, Medium Rare or Medium. Given
the model, the result derived (the steak preparation the individual likes) and all the remaining attributes except
whether the person has cheated on his/her significant other, the inversion attack tries to predict whether
the individual has ever cheated on his/her significant other. The ability to predict such a sensitive attribute might
be considered extremely privacy-invasive.

Our analysis is not restricted to a single training size. In fact, we have analysed the
accuracy of the inversion attack on the dataset, using a regression tree as the machine learning model, for different
sizes of the training and testing sets as well. This enables us to draw conclusions regarding the
optimal size of the training set for future analysis.
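A sketch of that sweep, reusing inversion_accuracy() from the earlier sketch and the encoded X and y from the dataset sketch; the split fractions are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

fractions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
candidates = sorted(X["cheated"].unique())
accuracies = []

for frac in fractions:
    # Hold out (1 - frac) of the rows; train the model on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=frac, random_state=0)
    model = DecisionTreeRegressor().fit(X_train, y_train)
    accuracies.append(
        inversion_accuracy(model, X_test, y_test, "cheated", candidates))

plt.plot([f * 100 for f in fractions], accuracies, marker="o")
plt.xlabel("Training set size (%)")
plt.ylabel("Average inversion attack accuracy")
plt.show()
```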
Our analysis raised a new question: does the presence or absence of different attributes (or quasi-identifiers)
increase or decrease the accuracy with which a sensitive attribute can be predicted/backtracked
via the inversion attack? Such an insight has immense repercussions for future analysis. Let us give an example
to explain the idea:
Our experiments begin with us trying to predict a specific output from the attributes of a dataset, with the
model being used known. This can be likened to an ML-as-a-service cloud system, where the user supplies
the service with a dataset, is aware of the data model being used, and receives the desired output from
the system. An adversary trying to figure out a sensitive attribute associated with the database might
use the output released by the service, knowledge of the data model being used, and the other quasi-identifiers
in the database. For the purpose of this research, we have so far assumed that the adversary is
aware of all quasi-identifiers except the sensitive attribute, along with the output of the model and the
data model used. However, let us now assume that the adversary is unaware of two attributes:
the sensitive attribute and one of the quasi-identifiers. We then try to predict the sensitive
attribute for this database given the data we possess. The accuracy with which this prediction can be made
gives us what we call the weight of the attribute, i.e. the accuracy of the backtracking
given the absence of a specific quasi-identifier. Qualitatively, this tells us which attribute should,
at all costs, not be released or made publicly known. This would ensure that even if the adversary tries
to predict a sensitive attribute in such a scenario, he/she would not be very effective at it. A formal definition of
the algorithm used to analyse this question is as follows:

Algorithm 2: Algorithm for Inversion Attack

Input: A dataset with one or more sensitive attributes
Result: A graph of the accuracy of the attack for each missing attribute
begin
    l = 0;
    for each dataset do
        Pick a fixed training and testing dataset split;
        for each classification-based algorithm in {classification tree, regression tree} do
            Train and acquire the model;
            for each choice of a secondary missing attribute do
                for each row in the dataset do
                    Given the model and the output value for the row, run the inversion attack by ignoring the value of the missing attribute, substituting all possible values for the sensitive attribute, and check whether it is predicted correctly;
                Find the average accuracy of the inversion attack over all rows;
            Plot a graph of the accuracy of the attack for each missing attribute;
Algorithm for computing the accuracy of the inversion attack when the value of a given attribute is ignored in the model.
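A minimal Python sketch of Algorithm 2, again assuming the helpers and encoded data from the earlier sketches. Here the dropped quasi-identifier is marginalised by simply taking the best-matching fill value; the formulation in [?] instead weights the fills by their marginal priors.

```python
import numpy as np

def inversion_accuracy_without(model, X, y, sensitive_col, dropped_col,
                               candidate_values, fill_values):
    """Inversion accuracy when one quasi-identifier is also unknown."""
    hits = 0
    for i in range(len(X)):
        row, y_true = X.iloc[i], y.iloc[i]
        best_v, best_score = None, -np.inf
        for v in candidate_values:
            for fill in fill_values:  # marginalise the dropped attribute
                cand = row.copy()
                cand[sensitive_col], cand[dropped_col] = v, fill
                y_hat = model.predict(cand.values.reshape(1, -1))[0]
                score = -abs(y_hat - y_true)
                if score > best_score:
                    best_v, best_score = v, score
        hits += int(best_v == row[sensitive_col])
    return hits / len(X)

# "Weight" of each quasi-identifier = attack accuracy when that column is withheld.
weights = {
    col: inversion_accuracy_without(model, X_test, y_test, "cheated", col,
                                    candidates, sorted(X[col].unique()))
    for col in X.columns if col != "cheated"
}
```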

5.2 Classification Tree

The classification tree is just like the regression tree, except that the target variable can take only a finite
set of values [?]. Here, we have used classification trees to predict how the individual likes their
steak prepared, e.g. Rare, Medium Rare or Medium. Given the model, the result derived (the steak preparation the individual
likes) and all the remaining attributes except whether the person has cheated on his/her significant other,
the inversion attack tries to predict whether the individual has ever cheated on his/her significant other. This
presents us with a situation identical to the one in the previous subsection, except
that we replicate everything for the case of a classification tree.
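The experiment differs from the regression-tree one only in the learner, as in the sketch below (with categorical outputs, an exact-match score inside invert_row may be preferable to the distance-based one).

```python
from sklearn.tree import DecisionTreeClassifier

# Same pipeline as the regression-tree experiment; only the learner changes.
model = DecisionTreeClassifier().fit(X_train, y_train)
accuracy = inversion_accuracy(model, X_test, y_test, "cheated", candidates)
```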

We have again extended our analysis of the most impactful attribute to the classification-tree setting.


6 Highlighted Observations and Conclusion

Our observations for the experiments we have run are multifaceted. We start by presenting our observations
for the accuracy of inversion attacks on the regression and classification trees:
1. We expect the efficiency of the inversion attack to increase as the training data size increases for the regression tree.
2. The regression tree fits this pattern.
3. The classification tree shows anomalous behaviour for small training data sizes (10 and 20 percent in the graph).
4. We suspect this behaviour is due to overfitting of the data.
5. Our measure of the accuracy of the inversion attack is the average of its accuracy over the different rows.
Overfitting might lead to a skewed distribution of accuracy, with the accuracy on training-set rows being very
high (see the sketch below).
6. Conclusion: average accuracy is not an effective measure of inversion attack efficiency.
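One simple way to test this hypothesis, sketched below using the earlier helpers, is to compare the attack's accuracy on rows the model was trained on against held-out rows; a large gap would suggest overfitting is inflating the average.

```python
# A large train/test gap indicates the averaged accuracy is being skewed
# by rows the model has effectively memorised.
train_acc = inversion_accuracy(model, X_train, y_train, "cheated", candidates)
test_acc = inversion_accuracy(model, X_test, y_test, "cheated", candidates)
print(f"inversion accuracy - training rows: {train_acc:.2f}, held-out rows: {test_acc:.2f}")
```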
We now present our observations for the accuracy of our inversion attack given the absence of specific quasi-identifiers.
1. Inversion attack accuracy is lowest in the absence of the attributes Household Income and Location
(Census Region). This implies that if the data administrator can ensure that quasi-identifiers
like household income and location are either removed or prevented from falling into the hands
of the adversary, the accuracy with which he/she can predict a sensitive attribute, such as whether a person has
cheated on their significant other, is decreased.
2. Inversion attack accuracy is highest in the absence of the attributes Education and Gender.
3. Conclusion: To prevent inferences on whether a person has cheated on their significant
other, it might be useful to suppress data on the attributes Household Income and the Region
to which the person belongs.

7 Future Work

As described above, household income has a significant impact on the accuracy of the inversion attack.
While there might be many implications of this correlation, we are interested in determining how it can
be used to preserve the privacy of a user. In the context of the dataset used in this paper, we plan to run
an experiment in which the initial machine learning algorithm is trained without the household income
attribute, effectively suppressing that attribute from the training and test datasets. The purpose of this
experiment is to determine how this new machine learning model compares in accuracy to the one which
was trained on the dataset that contained the household income attribute. From this experiment we plan to
determine the effect a given attribute has on the usability of the machine learning model, that is, the accuracy
with which it can predict the kind of steak a given participant prefers. If the accuracy of the new model
varies only within a predetermined threshold, then we can effectively remove that attribute from the
entire dataset and use only the new machine learning model for prediction, making it more
privacy preserving.
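A sketch of the proposed experiment, assuming the encoded data from the earlier sketches and a hypothetical shortened column name household_income for the corresponding survey question:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_sup = X.drop(columns=["household_income"])  # hypothetical column name

# Identical row split for both models (same random_state) so the comparison is fair.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
Xs_train, Xs_test, _, _ = train_test_split(X_sup, y, train_size=0.7, random_state=0)

full_model = DecisionTreeClassifier().fit(X_train, y_train)
suppressed_model = DecisionTreeClassifier().fit(Xs_train, y_train)

loss = full_model.score(X_test, y_test) - suppressed_model.score(Xs_test, y_test)
print(f"predictive accuracy lost by suppressing household income: {loss:.3f}")
```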

Acknowledgements
We would like to sincerely thank Dr. Fredrikson for his guidance and for sharing his prior expertise in tackling
this problem. We are also highly thankful to Dr. Anupam Dutta and Mr. Amit Datta for their continuous
feedback on our work.
