Interdisciplinary
Privacy Course,
June 2010
Web Mining
and
Privacy
Bettina
Berendt
K.U. Leuven,
Belgium
www.berendt.de
2
What is Web Mining?
And who am I?
Knowledge discovery
(aka Data mining):
"the non-trivial process of
identifying valid, novel,
potentially useful, and
ultimately understandable
patterns in data."
Web Mining:
the application of data mining
techniques on the content,
(hyperlink) structure, and usage
of Web resources.
Web mining areas: Web content mining, Web structure mining, and Web usage mining (connected through navigation, queries, and content access & creation).
3
Why Web / data mining?
4
Agenda
a simple view
6
1. Behaviour on the Web (and elsewhere)
Data
7
2. Web (and other data) mining
Data
Privacy
problems!
8
Technical background of the problem:
Lindamood et al. 09
&
Heatherly et al. 09
12
3. Cryptographic privacy solutions
Data
not all!
13
4. "Privacy-preserving data mining"
Data
not all!
14
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
16
Reminder: The use of AR mining for store layout
(Amazon, earlier: Wal-Mart, ...)
Where to put:
spaghetti,
butter?
17
Data (transactions, fragment):
TID 2: spaghetti, bread
TID 4: bread, butter
22
The apriori principle and the pruning of the search tree
an example of "the data mining approach"
Spaghetti, Tomato sauce, Bread, Butter
25
Example:
If {spaghetti} then {tomato sauce}
Support: s = 2/5 (40%)
Confidence: c = 2/3 (≈67%)
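The apriori pruning and the support/confidence computation can be sketched in a few lines of Python. Only transactions 2 and 4 appear on the slides; the transactions marked "illustrative" are invented here so that the example's numbers (s = 2/5, c = 2/3) come out:

```python
from itertools import combinations

# Toy basket data; TIDs 2 and 4 are from the slide, the rest are illustrative.
transactions = [
    {"spaghetti", "tomato sauce", "bread"},   # illustrative
    {"spaghetti", "bread"},                   # slide, TID 2
    {"spaghetti", "tomato sauce"},            # illustrative
    {"bread", "butter"},                      # slide, TID 4
    {"tomato sauce", "butter"},               # illustrative
]

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def apriori(min_support):
    """Level-wise frequent-itemset mining with apriori pruning."""
    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
    result, k = list(frequent), 2
    while frequent:
        # Join step, then the apriori principle: a k-itemset can only be
        # frequent if every one of its (k-1)-subsets is frequent.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        result += frequent
        k += 1
    return result

freq = apriori(min_support=0.4)
confidence = support({"spaghetti", "tomato sauce"}) / support({"spaghetti"})
```

With this data, {spaghetti, tomato sauce, bread} is pruned without a database scan, because its subset {tomato sauce, bread} is already infrequent.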
26
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
27
Privacy-preserving data mining (PPDM)
[Diagram: reconstruct distribution of Age, reconstruct distribution of Salary, ... → data mining algorithms → model]
30
Reconstruction Problem
[Figure: axis V from 0 to 90 (Age); original distribution for Age vs. probabilistic estimate of the original value of V]
32
Intuition (Reconstruct single point)
[Figure: axis V from 0 to 90 (Age); original distribution for Age vs. probabilistic estimate of the original value of V for a single point]
33
Reconstructing the Distribution
Combine estimates of where point came from for all the points:
Gives estimate of original distribution.
[Figure: estimated original distribution over Age, 0 to 90]
34
Reconstruction: Bootstrapping
f_X^{j+1}(a) = (1/n) Σ_{i=1..n} [ f_Y((x_i + y_i) - a) · f_X^j(a) ] / [ ∫ f_Y((x_i + y_i) - z) · f_X^j(z) dz ]
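A discretized sketch of this bootstrapping iteration, on invented data (not the slides'): true ages form two clusters, each disclosed value is x_i + y_i with uniform noise, and the iteration starts from a uniform estimate. The bin grid, noise range, and cluster parameters are all assumptions for illustration:

```python
import random

random.seed(0)

# Illustrative data: two age clusters; disclosed values w_i = x_i + y_i,
# with y_i uniform on [-20, 20].
ages = [random.gauss(25, 3) for _ in range(500)] + \
       [random.gauss(55, 3) for _ in range(500)]
w = [x + random.uniform(-20, 20) for x in ages]

bins = list(range(0, 91, 5))            # candidate age values a (5-year grid)

def f_Y(v):
    """Known density of the uniform noise."""
    return 1 / 40 if -20 <= v <= 20 else 0.0

# Iterate f^{j+1}(a) = (1/n) sum_i f_Y(w_i - a) f^j(a) / sum_z f_Y(w_i - z) f^j(z),
# i.e. the bootstrapping rule above with the integral replaced by a sum over bins.
f = [1 / len(bins)] * len(bins)         # start from the uniform distribution
for _ in range(20):
    new = [0.0] * len(bins)
    for wi in w:
        denom = sum(f_Y(wi - b) * fb for b, fb in zip(bins, f))
        if denom > 0:
            for j, (b, fb) in enumerate(zip(bins, f)):
                new[j] += f_Y(wi - b) * fb / denom
    f = [v / len(w) for v in new]
```

After a few iterations the estimate `f` recovers the two clusters around 25 and 55 even though each individual disclosed value is perturbed by up to ±20 years.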
35
Seems to work well!
[Chart: Number of People (0 to 1200) vs. Age (20 to 60), comparing the Original, Randomized, and Reconstructed distributions]
36
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
37
What is collaborative filtering?
Idea: People who agreed in the past are likely to agree again
To predict a user's opinion for an item,
use the opinion of similar users
Similarity between users is decided by looking at their overlap
in opinions for other items
User 1: 8 1 ? 2 7
User 2: 2 ? 5 7 5
User 3: 5 4 7 4 7
User 4: 7 1 7 3 8
User 5: 1 7 4 6 5
User 6: 8 3 8 3 7
40
User 1: 8 1 ? 2 7
User 2: 2 ? 5 7 5
User 4: 7 1 7 3 8

Similarity measures:
Cosine-based similarity
Adjusted-cosine-based similarity
Correlation-based similarity
44
Algorithm 2: k-nearest-neighbour
Aggregation function: often a weighted sum, where the weight depends on the similarity to the neighbour.
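A minimal sketch of cosine- and correlation-based similarity and the k-NN weighted-sum prediction, using the rating matrix from the slides (None marks a missing rating). The choice of Pearson correlation for neighbour weighting and k = 2 are assumptions for illustration:

```python
from math import sqrt

# Ratings from the slides; None marks a missing rating ("?").
ratings = {
    "User 1": [8, 1, None, 2, 7],
    "User 2": [2, None, 5, 7, 5],
    "User 3": [5, 4, 7, 4, 7],
    "User 4": [7, 1, 7, 3, 8],
    "User 5": [1, 7, 4, 6, 5],
    "User 6": [8, 3, 8, 3, 7],
}

def co_rated(u, v):
    """Pairs of ratings on the items both users rated."""
    return [(a, b) for a, b in zip(ratings[u], ratings[v])
            if a is not None and b is not None]

def cosine_sim(u, v):
    pairs = co_rated(u, v)
    num = sum(a * b for a, b in pairs)
    den = sqrt(sum(a * a for a, _ in pairs)) * sqrt(sum(b * b for _, b in pairs))
    return num / den if den else 0.0

def pearson_sim(u, v):
    # Correlation-based similarity: centre each user's ratings on
    # their mean over the co-rated items.
    pairs = co_rated(u, v)
    ma = sum(a for a, _ in pairs) / len(pairs)
    mb = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - ma) * (b - mb) for a, b in pairs)
    den = sqrt(sum((a - ma) ** 2 for a, _ in pairs)) * \
          sqrt(sum((b - mb) ** 2 for _, b in pairs))
    return num / den if den else 0.0

def predict(user, item, k=2):
    """k-NN prediction: similarity-weighted average of the k most
    similar users who have rated the item."""
    neighbours = sorted(
        (v for v in ratings if v != user and ratings[v][item] is not None),
        key=lambda v: pearson_sim(user, v), reverse=True)[:k]
    num = sum(pearson_sim(user, v) * ratings[v][item] for v in neighbours)
    den = sum(abs(pearson_sim(user, v)) for v in neighbours)
    return num / den if den else None

p = predict("User 1", 2)   # User 1's missing rating for the third item
```

The prediction for User 1 ends up between the ratings of the two nearest neighbours, weighted towards the more similar one.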
45
Outlook:
Model-based collaborative filtering
Instead of using ratings directly ("memory-based collaborative
filtering"),
develop a model of user ratings
Use the model to predict ratings for new items
To build the model:
Bayesian network (probabilistic)
Clustering (classification)
Rule-based approaches (e.g., association rules between co-purchased items)
46
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
47
Collaborative filtering: idea
and architecture
Basic idea of collaborative filtering: "Users who liked this also
liked ..." generalize from "similar profiles"
Standard solution:
At the community site / centralized:
Compute, from all users and their ratings/purchases, etc., a global
model
To derive a recommendation for a given user: find "similar
profiles" in this model and derive a prediction
Mathematically: depends on simple vector computations in the
user-item space
48
Distributed data mining / secure multi-party computation:
The principle explained by secure sum
Given a number of values x1, ..., xn belonging to n entities,
compute S = Σi xi
such that each entity knows ONLY its own input and the result of the computation (the aggregate sum of the data).
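A minimal sketch of one common ring-based variant of secure sum: the initiator masks the running total with a random value, each party adds its private value modulo m, and the initiator removes the mask at the end. The modulus and party order are assumptions, and this is only secure for non-colluding, semi-honest parties with Σ xi < m:

```python
import random

def secure_sum(values, m=10**6):
    """Ring-based secure-sum sketch.  Every intermediate total a party
    sees is uniformly distributed mod m, so it reveals nothing about
    the other parties' individual inputs (assuming sum(values) < m
    and no collusion)."""
    R = random.randrange(m)      # initiator's secret mask
    total = R
    for v in values:             # each party in turn adds its own value
        total = (total + v) % m
    return (total - R) % m       # initiator removes the mask
```

In the real multi-party protocol the masked running total is passed from party to party around a ring; the loop above simulates that pass in one process.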
49
Canny: Collaborative filtering with privacy
Each user starts with their own preference data and knowledge of who their peers are in their community.
By running the protocol, users exchange various encrypted messages.
At the end of the protocol, every user has an unencrypted copy of the linear model of the community's preferences.
They can then use this to extrapolate their own ratings.
At no stage does unencrypted information about a user's preferences leave their own machine.
Users outside the community can request a copy of the model from any community member and derive recommendations for themselves.
Canny (2002), Proc. IEEE Symp. Security and Privacy; Proc. SIGIR
50
Privacy-preserving data publishing (PPDP)
Problem solved?
52
A second look at ...
privacy
(very much influenced by joint work with Seda Gürses; see her presentation in this course)
53
1. Privacy as confidentiality:
"the right to be let alone" and to hide data
Data
Is this all
there is
to privacy?
54
2. Privacy as control:
informational self-determination
Data
Don't do THIS!
Data
56
3. Privacy as practice:
Identity construction and the societal re-negotiation of
the public/private divide
"privacy negotiations"
(incl. work by/with
Teltzrow, Preibusch,
Spiekermann)
57
A second look at ...
Privacy notions (columns): P. as confidentiality, P. as control, P. as practice
Phases (rows): Business / application understanding, Data understanding, Data preparation, Modelling, Evaluation, Deployment
For each cell: challenges, opportunities, solution approaches
(In the following, a simplification: treatment by phase; cell differentiation only in selected cases.)
65
Business understanding:
Business models based on personal-data-as-value
66
Business understanding:
Business models based on avoiding data collection
67
Data understanding, in particular data collection
Threats:
Data collection may in itself be intrusive
Opportunities:
New forms of data collection (e.g. anonymous incident reporting)
Solution approaches:
Anonymisation technology
Use of pseudonyms
Other PETs that lead to fewer data being collected
68
Data preparation: data selection and data integration
Threats:
Data selection and integration can lead to record linkage and therefore inferences;
control via purpose limitations becomes essential.
... threat or opportunity?:
69
Data integration: an example
[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR '06]
Generalisation with more robust de-anonymization attacks and different data:
[Narayanan A, Shmatikov V (2009) De-anonymizing social networks.
In: Proc. 30th IEEE Symposium on Security and Privacy 2009]
70
Merging identities: the computational problem
Given a target user t from the forum users, find similar users (in
terms of which items they related to) in the ratings dataset
Rank these users u by their likelihood of being t
Evaluate:
If t is in the top k of this list, then t is k-identified
Count percentage of users who are k-identified
E.g. measure likelihood by TF.IDF (m: item)
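A hypothetical sketch of this k-identification procedure. The toy data and the simplified IDF-style weight (rare items count more) are assumptions standing in for the slide's TF.IDF suggestion, not Frankowski et al.'s exact scoring:

```python
from math import log

# Hypothetical toy data: which items each ratings-dataset user rated.
ratings_users = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "D"},
    "u3": {"B", "C", "E"},
}

def idf(item):
    """IDF-style weight: rarely rated items are more identifying."""
    df = sum(item in items for items in ratings_users.values())
    return log(len(ratings_users) / df) if df else 0.0

def score(mentioned, u):
    """Likelihood that ratings-dataset user u is the forum target
    who publicly mentioned the given items."""
    return sum(idf(m) for m in mentioned & ratings_users[u])

def is_k_identified(true_user, mentioned, k):
    """The target is k-identified if they rank in the top k."""
    ranked = sorted(ratings_users,
                    key=lambda u: score(mentioned, u), reverse=True)
    return true_user in ranked[:k]
```

Counting the fraction of users for whom `is_k_identified` holds then gives the percentage of k-identified users reported in the evaluation.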
72
Results
73
What do you think helps?
74
Data preparation: data construction
- Definition and examples
"constructive data preparation operations such as the production of
derived attributes, entire new records or transformed values for
existing attributes."
May involve usage and/or prediction of values for attributes such as
gender, age
ethnicity, skin colour, sexual orientation
"people who are nostalgic for the former East German State"
[http://www.sociovision.de/loesungen/sinus-milieus.html]
"terror risk score"
(cf. Pilkington E (2006) Millions assigned terror risk score on trips to the US. The
Guardian, 2 Dec. 2006. http://www.guardian.co.uk/usa/story/0,,1962299,00.html )
75
Data preparation: data construction
- Analysis
Threats: The construction and naming of new attributes may create
controversial psychological or social categories. The intentional or
unintentional reification produces a social category or norm that may be
offensive per se and/or lend itself to abuses such as further privacy-relevant
activities (privacy as practice).
At the same time an opportunity? (imagine categories like "prolific donors to online free-speech causes" and, during modelling, findings that they do "good" things)
Solution approaches:
anything that avoids such inferences (→ all PPDM/PPDP)?
However, with the focus of PPDM on (i) data utility, (ii) avoiding
inferences on / damages to individuals, creation of new attributes and
profiling are explicitly not addressed.
76
Modelling
- Definition and Analysis (threats and opportunities)
Identification of interesting patterns
global characterizations of the modelled set of instances (e.g. clusters)
local characterizations of a subset of all instances (e.g., association rules)
Threats
KD result patterns may be descriptions of or ascriptions to unwished-for social
categories (s.a.)
may also have implications on the public-private divide:
"[A system in which individuals and groups determine which description best fits them] also addresses the second sense of privacy: that of public participation in the definition of the public/private divide. One of the most insidious aspects of market profiling is that the social models thus produced are private property [e.g., trade secrets]. ... When this private social modeling is combined with the private persuasive techniques of targeted marketing, the result is an anti-democratic [...] process of social shaping." [Phillips (2004), p. 703]
Opportunities
Controversial relationships as a possible starting point of liberating debates that
further privacy as practice.
Example abortion?! [Donohue and Levitt(2001)]
77
Modelling
- Analysis: Solution approaches from PPDM
Modifications to the data (so that "private data remain private") (s.a.)
AND
Modifications to the results (so that "private knowledge remains private").
Rule hiding example: discrimination-aware data mining [Pedreschi et al. (2008)].
discriminatory attributes (US law): race, religion, pregnancy status, ...
Discriminatory classification rules: propose a decision (e.g., whether to
give a loan) based on a discriminatory attribute in a
direct way (appearing in the rule premise) or
indirect way (appearing in an associated rule).
The authors propose metrics to control for such discrimination.
78
Evaluation
the step at which to ascertain that the results of the previous stages
"properly achieve[...] the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached."
review all the previously raised problems to make sure that the
deployment will be as privacy-protecting as possible (or as desired).
Look at unexpected results!
Example discrimination-aware data mining:
aware of pre-defined discriminatory categories / mining patterns
But what about newly found categories?
79
Deployment
- Definition
the gained insight is used, for example in
real-time personalization of Web page delivery and design
decision processes: what contract to offer or deny a customer,
whether to search a traveller at the border or not, ...
80
Deployment
- Analysis
Threats: These operational steps may
be intrusive per se: e.g., searching someone at an airport in response to a high "terror risk score", searching their home and/or computer,
contribute to the knowledge about a data subject and thus be similar to more
data being collected and stored,
install social categories and norms as facts with all the consequences of such
redefinitions of reality: less consumer choice, heightened social inequalities,
more people treated as criminals, etc.
be wrongly applied due to the inherently statistical nature of patterns:
error margins (e.g. misclassification errors)
Inconvenience (and worse) from false positives!
Survey of incidents: e.g. Daten-speicherung.de (2010) Fälle von Datenmissbrauch und -irrtümern [cases of data abuse and errors].
http://daten-speicherung.de/wiki/index.php?title=F%C3%A4lle_von_Datenmissbrauch_und_-irrt%C3%BCmern&oldid=3639
Opportunities: for reverse patterns, see above
Solution approaches: economic pressure (loss of goodwill / public image)?!
See Facebook users' discussion (e.g. Gürses, Rizk & Günther, 2008)
81
Discussion item: What is this an example of?
Tracing anonymous edits in Wikipedia http://wikiscanner.virgil.gr/
82
[Method: Attribute matching]
83
Results (an example)
84
An outlook:
Thank you!
89
References I
(in the order in which they appear in the slides)
Barbaro M, Zeller T (9 August 2006) A face is exposed for AOL searcher no. 4417749. New York Times
Owad T (2006) Data mining 101: Funding subversives with Amazon wishlists. http://www.applefritter.com/bannedbooks
Lindamood J, Heatherly R, Kantarcioglu M, Thuraisingham BM (2009) Inferring private information using social network data. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, ACM, pp 1145-1146
Heatherly R, Kantarcioglu M, Thuraisingham B (2009) Social network classification incorporating link type values. In: IEEE International Conference on Intelligence and Security Informatics 2009
Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. SIGMOD Record 33(1):50-57
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: SIGMOD Conference, ACM, pp 439-450
Canny J (2002) Collaborative filtering with privacy. In: Proceedings of the 2002 IEEE Symposium on Security and Privacy, May 12-15, 2002, p 45
Canny JF (2002) Collaborative filtering with privacy via factor analysis. In: SIGIR 2002, pp 238-245
Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: A survey on recent developments. ACM Computing Surveys 42(4)
90
References II
(in the order in which they appear in the slides)
Berendt B (accepted) More than modelling and hiding: Towards a comprehensive view of Web mining
and privacy. Data Mining and Knowledge Discovery
Frankowski D, Cosley D, Sen S, Terveen L, Riedl J (2006) You are what you say: Privacy risks of public mentions. In: Proc. SIGIR '06
Narayanan A, Shmatikov V (2009) De-anonymizing social networks. In: Proc. 30th IEEE Symposium on Security and Privacy 2009
Phillips D (2004) Privacy policy and PETs: The influence of policy regimes on the development and social implications of privacy enhancing technologies. New Media & Society 6(6):691-706
Donohue J, Levitt S (2001) The impact of legalized abortion on crime. Quarterly Journal of Economics 116(2):379-420
Gürses S, Rizk R, Günther O (2008) Privacy design in online social networks: Learning from privacy breaches and community feedback. In: Proc. of the Twenty-Ninth International Conference on Information Systems
Norman D (1993) Things That Make Us Smart: Defending Human Attributes in the Age of the Machine. Perseus Books
Gürses S, Berendt B (2010) The social web and privacy: Practices, reciprocity and conflict detection in social networks. In: Ferrari E, Bonchi F (eds) Privacy-Aware Knowledge Discovery: Novel Applications and New Techniques. Chapman & Hall/CRC Press
Fang L, LeFevre K (2010) Privacy wizards for social networking sites. In: WWW 2010, pp 351-360
91
Sources: I have re-used slides and pictures ...
(thanks to the Web community!)
Slides 10-11 are from
http://www.utdallas.edu/~bxt043000/cs7301_s10/Lecture25.ppt