Interdisciplinary
Privacy Course,
June 2010
Web Mining
and
Privacy
Bettina
Berendt
K.U. Leuven,
Belgium
www.berendt.de
2
What is Web Mining?
And who am I?
Knowledge discovery
(aka Data mining):
"the non-trivial process of
identifying valid, novel,
potentially useful, and
ultimately understandable
patterns in data."
Web Mining:
the application of data mining
techniques on the content,
(hyperlink) structure, and usage
of Web resources.
Web mining areas: Web content mining, Web structure mining, and Web usage mining (connected through navigation, queries, and content access & creation).
3
Why Web / data mining?
4
Agenda
a simple view
6
1. Behaviour on the Web (and elsewhere)
Data
7
2. Web (and other data) mining
Data
Privacy
problems!
8
Technical background of the problem:
Lindamood et al. 09
&
Heatherly et al. 09
12
3. Cryptographic privacy solutions
Data
not all!
13
4. "Privacy-preserving data mining"
Data
not all!
14
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
16
Reminder: The use of AR mining for store layout
(Amazon, earlier: Wal-Mart, ...)
Where to put:
spaghetti,
butter?
17
Data (transactions, fragment):
TID 2: spaghetti, bread
TID 4: bread, butter
22
The apriori principle and the pruning of the search tree
an example of "the data mining approach"
Spaghetti, Tomato sauce, Bread, Butter
25
Example:
If {spaghetti} then {tomato sauce}
Support: s = 2/5 (40%)
Confidence: c = 2/3 (≈67%)
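The apriori pruning and the support/confidence computation can be sketched in a few lines of Python. Only transactions 2 and 4 appear on the slides; the transactions marked "illustrative" are invented here so that the example's numbers (s = 2/5, c = 2/3) come out:

```python
from itertools import combinations

# Toy basket data; TIDs 2 and 4 are from the slide, the rest are illustrative.
transactions = [
    {"spaghetti", "tomato sauce", "bread"},   # illustrative
    {"spaghetti", "bread"},                   # slide, TID 2
    {"spaghetti", "tomato sauce"},            # illustrative
    {"bread", "butter"},                      # slide, TID 4
    {"tomato sauce", "butter"},               # illustrative
]

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def apriori(min_support):
    """Level-wise frequent-itemset mining with apriori pruning."""
    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
    result, k = list(frequent), 2
    while frequent:
        # Join step, then the apriori principle: a k-itemset can only be
        # frequent if every one of its (k-1)-subsets is frequent.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        result += frequent
        k += 1
    return result

freq = apriori(min_support=0.4)
confidence = support({"spaghetti", "tomato sauce"}) / support({"spaghetti"})
```

With this data, {spaghetti, tomato sauce, bread} is pruned without a database scan, because its subset {tomato sauce, bread} is already infrequent.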
26
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
27
Privacy-preserving data mining (PPDM)
[Diagram: reconstruct distribution of Age, reconstruct distribution of Salary, ... → data mining algorithms → model]
30
Reconstruction Problem
[Figure: axis V from 0 to 90 (Age); original distribution for Age vs. probabilistic estimate of the original value of V]
32
Intuition (Reconstruct single point)
[Figure: axis V from 0 to 90 (Age); original distribution for Age vs. probabilistic estimate of the original value of V for a single point]
33
Reconstructing the Distribution
Combine estimates of where point came from for all the points:
Gives estimate of original distribution.
[Figure: estimated original distribution over Age, 0 to 90]
34
Reconstruction: Bootstrapping
f_X^{j+1}(a) = (1/n) Σ_{i=1..n} [ f_Y((x_i + y_i) - a) · f_X^j(a) ] / [ ∫ f_Y((x_i + y_i) - z) · f_X^j(z) dz ]
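A discretized sketch of this bootstrapping iteration, on invented data (not the slides'): true ages form two clusters, each disclosed value is x_i + y_i with uniform noise, and the iteration starts from a uniform estimate. The bin grid, noise range, and cluster parameters are all assumptions for illustration:

```python
import random

random.seed(0)

# Illustrative data: two age clusters; disclosed values w_i = x_i + y_i,
# with y_i uniform on [-20, 20].
ages = [random.gauss(25, 3) for _ in range(500)] + \
       [random.gauss(55, 3) for _ in range(500)]
w = [x + random.uniform(-20, 20) for x in ages]

bins = list(range(0, 91, 5))            # candidate age values a (5-year grid)

def f_Y(v):
    """Known density of the uniform noise."""
    return 1 / 40 if -20 <= v <= 20 else 0.0

# Iterate f^{j+1}(a) = (1/n) sum_i f_Y(w_i - a) f^j(a) / sum_z f_Y(w_i - z) f^j(z),
# i.e. the bootstrapping rule above with the integral replaced by a sum over bins.
f = [1 / len(bins)] * len(bins)         # start from the uniform distribution
for _ in range(20):
    new = [0.0] * len(bins)
    for wi in w:
        denom = sum(f_Y(wi - b) * fb for b, fb in zip(bins, f))
        if denom > 0:
            for j, (b, fb) in enumerate(zip(bins, f)):
                new[j] += f_Y(wi - b) * fb / denom
    f = [v / len(w) for v in new]
```

After a few iterations the estimate `f` recovers the two clusters around 25 and 55 even though each individual disclosed value is perturbed by up to ±20 years.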
35
Seems to work well!
[Chart: Number of People (0 to 1200) vs. Age (20 to 60), comparing the Original, Randomized, and Reconstructed distributions]
36
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
37
What is collaborative filtering?
Idea: People who agreed in the past are likely to agree again
To predict a user's opinion for an item,
use the opinion of similar users
Similarity between users is decided by looking at their overlap
in opinions for other items
User 1: 8 1 ? 2 7
User 2: 2 ? 5 7 5
User 3: 5 4 7 4 7
User 4: 7 1 7 3 8
User 5: 1 7 4 6 5
User 6: 8 3 8 3 7
40
User 1: 8 1 ? 2 7
User 2: 2 ? 5 7 5
User 4: 7 1 7 3 8

Similarity measures:
Cosine-based similarity
Adjusted-cosine-based similarity
Correlation-based similarity
44
Algorithm 2: k-nearest-neighbour
Aggregation function: often a weighted sum, where the weight depends on the similarity to the neighbour.
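A minimal sketch of cosine- and correlation-based similarity and the k-NN weighted-sum prediction, using the rating matrix from the slides (None marks a missing rating). The choice of Pearson correlation for neighbour weighting and k = 2 are assumptions for illustration:

```python
from math import sqrt

# Ratings from the slides; None marks a missing rating ("?").
ratings = {
    "User 1": [8, 1, None, 2, 7],
    "User 2": [2, None, 5, 7, 5],
    "User 3": [5, 4, 7, 4, 7],
    "User 4": [7, 1, 7, 3, 8],
    "User 5": [1, 7, 4, 6, 5],
    "User 6": [8, 3, 8, 3, 7],
}

def co_rated(u, v):
    """Pairs of ratings on the items both users rated."""
    return [(a, b) for a, b in zip(ratings[u], ratings[v])
            if a is not None and b is not None]

def cosine_sim(u, v):
    pairs = co_rated(u, v)
    num = sum(a * b for a, b in pairs)
    den = sqrt(sum(a * a for a, _ in pairs)) * sqrt(sum(b * b for _, b in pairs))
    return num / den if den else 0.0

def pearson_sim(u, v):
    # Correlation-based similarity: centre each user's ratings on
    # their mean over the co-rated items.
    pairs = co_rated(u, v)
    ma = sum(a for a, _ in pairs) / len(pairs)
    mb = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - ma) * (b - mb) for a, b in pairs)
    den = sqrt(sum((a - ma) ** 2 for a, _ in pairs)) * \
          sqrt(sum((b - mb) ** 2 for _, b in pairs))
    return num / den if den else 0.0

def predict(user, item, k=2):
    """k-NN prediction: similarity-weighted average of the k most
    similar users who have rated the item."""
    neighbours = sorted(
        (v for v in ratings if v != user and ratings[v][item] is not None),
        key=lambda v: pearson_sim(user, v), reverse=True)[:k]
    num = sum(pearson_sim(user, v) * ratings[v][item] for v in neighbours)
    den = sum(abs(pearson_sim(user, v)) for v in neighbours)
    return num / den if den else None

p = predict("User 1", 2)   # User 1's missing rating for the third item
```

The prediction for User 1 ends up between the ratings of the two nearest neighbours, weighted towards the more similar one.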
45
Outlook:
Model-based collaborative filtering
Instead of using ratings directly ("memory-based collaborative
filtering"),
develop a model of user ratings
Use the model to predict ratings for new items
To build the model:
Bayesian network (probabilistic)
Clustering (classification)
Rule-based approaches (e.g., association rules between co-purchased items)
46
Two examples:
Association-rule mining
(& privacy-preserving AR mining)
Collaborative filtering
(& privacy-preserving collaborative filtering)
47
Collaborative filtering: idea
and architecture
Basic idea of collaborative filtering: "Users who liked this also
liked ..." generalize from "similar profiles"
Standard solution:
At the community site / centralized:
Compute, from all users and their ratings/purchases, etc., a global
model
To derive a recommendation for a given user: find "similar
profiles" in this model and derive a prediction
Mathematically: depends on simple vector computations in the
user-item space
48
Distributed data mining / secure multi-party computation:
The principle explained by secure sum
Given a number of values x1, ..., xn belonging to n entities,
compute S = Σi xi
such that each entity knows ONLY its own input and the result of the computation (the aggregate sum of the data).
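A minimal sketch of one common ring-based variant of secure sum: the initiator masks the running total with a random value, each party adds its private value modulo m, and the initiator removes the mask at the end. The modulus and party order are assumptions, and this is only secure for non-colluding, semi-honest parties with Σ xi < m:

```python
import random

def secure_sum(values, m=10**6):
    """Ring-based secure-sum sketch.  Every intermediate total a party
    sees is uniformly distributed mod m, so it reveals nothing about
    the other parties' individual inputs (assuming sum(values) < m
    and no collusion)."""
    R = random.randrange(m)      # initiator's secret mask
    total = R
    for v in values:             # each party in turn adds its own value
        total = (total + v) % m
    return (total - R) % m       # initiator removes the mask
```

In the real multi-party protocol the masked running total is passed from party to party around a ring; the loop above simulates that pass in one process.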
49
Canny: Collaborative filtering with privacy
Each user starts with their own preference data and knowledge of who their peers are in their community.
By running the protocol, users exchange various encrypted messages.
At the end of the protocol, every user has an unencrypted copy of the linear model of the community's preferences.
They can then use this to extrapolate their own ratings.
At no stage does unencrypted information about a user's preferences leave their own machine.
Users outside the community can request a copy of the model from any community member and derive recommendations for themselves.
Canny (2002), Proc. IEEE Symp. Security and Privacy; Proc. SIGIR
50
Privacy-preserving data publishing (PPDP)
Problem solved?
52
A second look at ...
privacy
(very much influenced by joint work with Seda Gürses; see her presentation in this course)
53
1. Privacy as confidentiality:
"the right to be let alone" and to hide data
Data
Is this all
there is
to privacy?
54
2. Privacy as control:
informational self-determination
Data
Don't do THIS!
Data
56
3. Privacy as practice:
Identity construction and the societal re-negotiation of
the public/private divide
"privacy negotiations"
(incl. work by/with
Teltzrow, Preibusch,
Spiekermann)
57
A second look at ...
Privacy notions (columns): P. as confidentiality, P. as control, P. as practice
Phases (rows): Business / application understanding, Data understanding, Data preparation, Modelling, Evaluation, Deployment
For each cell: challenges, opportunities, solution approaches
(In the following, a simplification: treatment by phase; cell differentiation only in selected cases.)
65
Business understanding:
Business models based on personal-data-as-value
66
Business understanding:
Business models based on avoiding data collection
67
Data understanding, in particular data collection
Threats:
Data collection may in itself be intrusive
Opportunities:
New forms of data collection (e.g. anonymous incident reporting)
Solution approaches:
Anonymisation technology
Use of pseudonyms
Other PETs that lead to fewer data being collected
68
Data preparation: data selection and data integration
Threats:
Data selection and integration can lead to record linkage and therefore inferences;
control via purpose limitations becomes essential.
... threat or opportunity?:
69
Data integration: an example
[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR '06]
Generalisation with more robust de-anonymization attacks and different data:
[Narayanan A, Shmatikov V (2009) De-anonymizing social networks.
In: Proc. 30th IEEE Symposium on Security and Privacy 2009]
70
Merging identities: the computational problem
Given a target user t from the forum users, find similar users (in
terms of which items they related to) in the ratings dataset
Rank these users u by their likelihood of being t
Evaluate:
If t is in the top k of this list, then t is k-identified
Count percentage of users who are k-identified
E.g. measure likelihood by TF.IDF (m: item)
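A hypothetical sketch of this k-identification procedure. The toy data and the simplified IDF-style weight (rare items count more) are assumptions standing in for the slide's TF.IDF suggestion, not Frankowski et al.'s exact scoring:

```python
from math import log

# Hypothetical toy data: which items each ratings-dataset user rated.
ratings_users = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "D"},
    "u3": {"B", "C", "E"},
}

def idf(item):
    """IDF-style weight: rarely rated items are more identifying."""
    df = sum(item in items for items in ratings_users.values())
    return log(len(ratings_users) / df) if df else 0.0

def score(mentioned, u):
    """Likelihood that ratings-dataset user u is the forum target
    who publicly mentioned the given items."""
    return sum(idf(m) for m in mentioned & ratings_users[u])

def is_k_identified(true_user, mentioned, k):
    """The target is k-identified if they rank in the top k."""
    ranked = sorted(ratings_users,
                    key=lambda u: score(mentioned, u), reverse=True)
    return true_user in ranked[:k]
```

Counting the fraction of users for whom `is_k_identified` holds then gives the percentage of k-identified users reported in the evaluation.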
72
Results
73
What do you think helps?
74
Data preparation: data construction
- Definition and examples
"constructive data preparation operations such as the production of
derived attributes, entire new records or transformed values for
existing attributes."
May involve usage and/or prediction of values for attributes such as
gender, age
ethnicity, skin colour, sexual orientation
"people who are nostalgic for the former East German State"
[http://www.sociovision.de/loesungen/sinus-milieus.html]
"terror risk score"
(cf. Pilkington E (2006) Millions assigned terror risk score on trips to the US. The
Guardian, 2 Dec. 2006. http://www.guardian.co.uk/usa/story/0,,1962299,00.html )
75
Data preparation: data construction
- Analysis
Threats: The construction and naming of new attributes may create
controversial psychological or social categories. The intentional or
unintentional reification produces a social category or norm that may be
offensive per se and/or lend itself to abuses such as further privacy-relevant
activities (privacy as practice).
At the same time an opportunity? (imagine categories like "prolific donors to online free-speech causes" and, during modelling, findings that they do "good" things)
Solution approaches:
anything that avoids such inferences (→ all PPDM/PPDP)?
However, with the focus of PPDM on (i) data utility, (ii) avoiding
inferences on / damages to individuals, creation of new attributes and
profiling are explicitly not addressed.
76
Modelling
- Definition and Analysis (threats and opportunities)
Identification of interesting patterns
global characterizations of the modelled set of instances (e.g. clusters)
local characterizations of a subset of all instances (e.g., association rules)
Threats
KD result patterns may be descriptions of or ascriptions to unwished-for social
categories (s.a.)
may also have implications on the public-private divide:
"[A system in which individuals and groups determine which description best fits them] also addresses the second sense of privacy: that of public participation in the definition of the public/private divide. One of the most insidious aspects of market profiling is that the social models thus produced are private property [e.g., trade secrets]. ... When this private social modeling is combined with the private persuasive techniques of targeted marketing, the result is an anti-democratic [...] process of social shaping." [Phillips (2004), p. 703]
Opportunities
Controversial relationships as a possible starting point of liberating debates that
further privacy as practice.
Example abortion?! [Donohue and Levitt(2001)]
77
Modelling
- Analysis: Solution approaches from PPDM
Modifications to the data (so that "private data remain private") (s.a.)
AND
Modifications to the results (so that "private knowledge remains private").
Rule hiding example: discrimination-aware data mining [Pedreschi et al. (2008)].
discriminatory attributes (US law): race, religion, pregnancy status, ...
Discriminatory classification rules: propose a decision (e.g., whether to
give a loan) based on a discriminatory attribute in a
direct way (appearing in the rule premise) or
indirect way (appearing in an associated rule).
The authors propose metrics to control for such discrimination.
78
Evaluation
the step at which to ascertain that the results of the previous stages
"properly achieve[...] the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached."
review all the previously raised problems to make sure that the
deployment will be as privacy-protecting as possible (or as desired).
Look at unexpected results!
Example discrimination-aware data mining:
aware of pre-defined discriminatory categories / mining patterns
But what about newly found categories?
79
Deployment
- Definition
the gained insight is used, for example in
real-time personalization of Web page delivery and design
decision processes: what contract to offer or deny a customer,
whether to search a traveller at the border or not, ...
80
Deployment
- Analysis
Threats: These operational steps may
be intrusive per se: e.g., searching someone at an airport in response to a high "terror risk score", searching their home and/or computer,
contribute to the knowledge about a data subject and thus be similar to more
data being collected and stored,
install social categories and norms as facts with all the consequences of such
redefinitions of reality: less consumer choice, heightened social inequalities,
more people treated as criminals, etc.
be wrongly applied due to the inherently statistical nature of patterns:
error margins (e.g. misclassification errors)
Inconvenience (and worse) from false positives!
Survey of incidents: e.g. Daten-speicherung.de (2010) Fälle von Datenmissbrauch und -irrtümern [cases of data abuse and errors].
http://daten-speicherung.de/wiki/index.php?title=F%C3%A4lle_von_Datenmissbrauch_und_-irrt%C3%BCmern&oldid=3639
Opportunities: for reverse patterns, see above
Solution approaches: economic pressure (loss of goodwill / public image)?!
See Facebook users' discussion (e.g. Gürses, Rizk & Günther, 2008)
81
Discussion item: What is this an example of?
Tracing anonymous edits in Wikipedia http://wikiscanner.virgil.gr/
82
[Method: Attribute matching]
83
Results (an example)
84
An outlook:
Thank you!
89
References I
(in the order in which they appear in the slides)
Barbaro M, Zeller T (9 August 2006) A face is exposed for AOL searcher no. 4417749. New York Times
Owad T (2006) Data mining 101: Funding subversives with Amazon wishlists. http://www.applefritter.com/bannedbooks
Lindamood J, Heatherly R, Kantarcioglu M, Thuraisingham BM (2009) Inferring private information using social network data. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, ACM, pp 1145-1146
Heatherly R, Kantarcioglu M, Thuraisingham B (2009) Social network classification incorporating link type values. In: IEEE International Conference on Intelligence and Security Informatics 2009
Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. SIGMOD Record 33(1):50-57
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: SIGMOD Conference, ACM, pp 439-450
Canny J (2002) Collaborative filtering with privacy. In: Proceedings of the 2002 IEEE Symposium on Security and Privacy, May 12-15, 2002, p 45
Canny JF (2002) Collaborative filtering with privacy via factor analysis. In: SIGIR 2002, pp 238-245
Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: A survey on recent developments. ACM Computing Surveys 42(4)
90
References II
(in the order in which they appear in the slides)
Berendt B (accepted) More than modelling and hiding: Towards a comprehensive view of Web mining
and privacy. Data Mining and Knowledge Discovery
Frankowski D, Cosley D, Sen S, Terveen L, Riedl J (2006) You are what you say: Privacy risks of public mentions. In: Proc. SIGIR '06
Narayanan A, Shmatikov V (2009) De-anonymizing social networks. In: Proc. 30th IEEE Symposium on Security and Privacy 2009
Phillips D (2004) Privacy policy and PETs: The influence of policy regimes on the development and social implications of privacy enhancing technologies. New Media & Society 6(6):691-706
Donohue J, Levitt S (2001) The impact of legalized abortion on crime. Quarterly Journal of Economics 116(2):379-420
Gürses S, Rizk R, Günther O (2008) Privacy design in online social networks: Learning from privacy breaches and community feedback. In: Proc. of the Twenty-Ninth International Conference on Information Systems
Norman D (1993) Things That Make Us Smart: Defending Human Attributes in the Age of the Machine. Perseus Books
Gürses S, Berendt B (2010) The social web and privacy: Practices, reciprocity and conflict detection in social networks. In: Ferrari E, Bonchi F (eds) Privacy-Aware Knowledge Discovery: Novel Applications and New Techniques. Chapman & Hall/CRC Press
Fang L, LeFevre K (2010) Privacy wizards for social networking sites. In: WWW 2010, pp 351-360
91
Sources: I have re-used slides and pictures ...
(thanks to the Web community!)
Slides 10-11 are from
http://www.utdallas.edu/~bxt043000/cs7301_s10/Lecture25.ppt