Abstract:
Many organizations collect large amounts of data about their clients and
customers. Initially, such data was used only for record keeping, but data mining can
extract valuable knowledge from it, and organizations can obtain better results by pooling
their data together. However, the collected data may contain sensitive or private
information about the organizations or their customers, and privacy concerns are
exacerbated when data is shared between multiple organizations. Distributed data mining
is concerned with the computation of models from data that is distributed among multiple
participants. Privacy-preserving distributed data mining seeks to allow the cooperative
computation of such models without the cooperating parties revealing any of their
individual data items to each other. Our project makes two contributions to privacy-
preserving data mining. First, we introduce the concept of arbitrarily partitioned data,
which is a generalization of both horizontally and vertically partitioned data. Second, we
provide an efficient privacy-preserving protocol for k-means clustering in the setting of
arbitrarily partitioned data.
We present a simple I/O-efficient k-clustering algorithm designed with the goal of
enabling a privacy-preserving version of the algorithm. Our experiments show that this
algorithm produces cluster centers that are, on average, more accurate than those
produced by the well-known iterative k-means algorithm. We use our new algorithm as
the basis for a communication-efficient privacy-preserving k-clustering protocol for
databases that are horizontally partitioned between two parties. Unlike existing privacy-
preserving protocols based on the k-means algorithm, this protocol does not reveal
intermediate candidate cluster centers. In this work, we also propose methods for
constructing the dissimilarity matrix of objects.
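A dissimilarity matrix such as the one mentioned above can be built from pairwise distances between records. Below is a minimal sketch, assuming numeric attributes and Euclidean distance; the class name, method names, and sample data are illustrative, not taken from the project code:

```java
// Builds a symmetric dissimilarity matrix of pairwise Euclidean distances.
public class DissimilarityMatrix {

    // Euclidean distance between two numeric records of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Entry (i, j) holds the dissimilarity between record i and record j.
    static double[][] build(double[][] records) {
        int n = records.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                m[i][j] = distance(records[i], records[j]);
                m[j][i] = m[i][j];   // the matrix is symmetric
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {3, 4}, {6, 8} };
        double[][] m = build(data);
        System.out.println(m[0][1]); // prints 5.0, the distance from (0,0) to (3,4)
    }
}
```

A matrix built this way can be fed to any clustering routine that works from pairwise dissimilarities rather than raw records.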
Existing System:
Earlier techniques include data perturbation for privacy-preserving construction of
classification models on centralized data, and techniques for privacy-preserving
association rule mining in distributed environments. Techniques from secure multiparty
computation form one approach to privacy-preserving data mining. Yao's general
protocol for secure circuit evaluation can be used to solve any two-party privacy-
preserving distributed data mining problem. However, since data mining usually involves
millions or billions of data items, the communication cost of this protocol renders it
impractical for these purposes. The k-means and recluster algorithms are used in turn for
efficiency.
Proposed System:
K-Means (figure): the dataset passes through k-means clustering to produce a clustered
dataset; the clustered dataset is then reclustered to produce the privacy-preserved dataset.
Modules:
1. Data collection
2. Formation of Cluster
3. Privacy preservation
4. Comparison evaluation of algorithms
1. Data collection:
Data collection is an important part of the project. The dataset is prepared by
integrating the organizations' details. Such data was initially used only for record keeping,
but these large collections of data can be "mined" for knowledge that can improve the
performance of the organization. While much data mining occurs on data within a single
organization, it is quite common to use data from multiple sources in order to yield more
precise or useful knowledge. However, privacy and secrecy considerations can prohibit
organizations from being willing or able to share their data with each other. Therefore we
apply a privacy-preserving data mining technique.
2. Formation of Cluster:
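This module groups each record with its nearest centroid and recomputes the centroids iteratively, as in the k-means algorithm described in the abstract. Below is a minimal sketch of the iterative step over numeric records; the class name, method names, and sample data are illustrative, not the project's actual code:

```java
import java.util.Arrays;

// Minimal iterative k-means: assign each point to its nearest centroid,
// then recompute each centroid as the mean of its assigned points.
public class KMeans {

    // Squared Euclidean distance (sufficient for nearest-centroid comparison).
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Returns the index of the nearest centroid for one record.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
        return best;
    }

    // Runs a fixed number of iterations and returns the final centroids.
    static double[][] cluster(double[][] points, double[][] centroids, int iters) {
        int k = centroids.length, dim = points[0].length;
        for (int t = 0; t < iters; t++) {
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                count[c]++;
                for (int d = 0; d < dim; d++) sum[c][d] += p[d];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)
                    for (int d = 0; d < dim; d++) centroids[c][d] = sum[c][d] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1.5, 2}, {8, 8}, {9, 9} };
        double[][] init = { {0, 0}, {10, 10} };   // k = 2 initial centroids
        double[][] fin = cluster(pts, init, 5);
        System.out.println(Arrays.deepToString(fin)); // prints the two final centroids
    }
}
```

In the project's setting, each assignment and update step would be carried out jointly by the two parties via the privacy-preserving protocol rather than over one party's plaintext data.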
3. Privacy preservation:
Privacy-preserving data mining solutions have been presented for both
horizontally and vertically partitioned databases, in which each party owns either
different data objects with the same attributes, or different attributes for the same data
objects, respectively. We introduce the notion of arbitrarily partitioned data, which
generalizes both horizontally and vertically partitioned data. In arbitrarily partitioned
data, different attributes for different items can be owned by either party. Although
extremely "patchworked" data is unlikely in practice, one advantage of considering
arbitrarily partitioned data is that protocols in this model apply to horizontally and
vertically partitioned data, as well as to hybrids that are mostly, but not completely,
vertically or horizontally partitioned.
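The privacy-preservation step of the proposed system (the DataEncryption and DataDecryption classes in the class diagram) can be sketched with the standard javax.crypto API. This is a minimal illustration using AES with a freshly generated key; the class and method names are illustrative, and real key management and cipher-mode choices are outside this sketch:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;

// Encrypts and decrypts a clustered-data record with a shared AES key.
public class DataCrypto {

    static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plain);
    }

    static byte[] decrypt(SecretKey key, byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return c.doFinal(cipherText);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        byte[] record = "cluster-1: 4.2, 7.9".getBytes(StandardCharsets.UTF_8);
        byte[] enc = encrypt(key, record);       // ciphertext sent to the other party
        byte[] dec = decrypt(key, enc);          // round trip restores the record
        System.out.println(new String(dec, StandardCharsets.UTF_8));
    }
}
```

The encrypted clustered data can then be transferred between System 1 and System 2 without exposing the underlying records in transit.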
Software constraints: privacy preserved using the protocol; number of clusters; grouping
based on minimum distance.
Data Flow Diagram (figure):
Level 0: System 1, problem definition and data collection.
Level 1: finding the centroids and grouping records by minimum distance.
Level 3: collection of the clustered data, privacy preservation (encryption and
decryption), and transfer to System 2.
Data Mining Concepts:
Although data mining is a relatively new term, the technology is not. Companies
have used powerful computers to sift through volumes of supermarket scanner data and
analyze market research reports for years. However, continuous innovations in computer
processing power, disk storage, and statistical software are dramatically increasing the
accuracy of analysis while driving down the cost.
For example, one Midwest grocery chain used the data mining capacity of Oracle
software to analyze local buying patterns. They discovered that when men bought diapers
on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that
these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays,
however, they only bought a few items. The retailer concluded that they purchased the
beer to have it available for the upcoming weekend. The grocery chain could use this
newly discovered information in various ways to increase revenue. For example, they
could move the beer display closer to the diaper display.
Data, Information, and Knowledge
Data
Data are any facts, numbers, or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data in different formats and
different databases. This includes:
• operational or transactional data, such as sales, cost, inventory, payroll, and
accounting
• nonoperational data, such as industry sales, forecast data, and macroeconomic
data
• meta data - data about the data itself, such as logical database design or data
dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data can yield
information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and future
trends. For example, summary information on retail supermarket sales can be analyzed in
light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a
manufacturer or retailer could determine which items are most susceptible to promotional
efforts.
While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two. Data mining software
analyzes relationships and patterns in stored transaction data based on open-ended user
queries. Several types of analytical software are available: statistical, machine learning,
and neural networks. Generally, the following types of relationships are sought:
• Classes: Stored data is used to locate data in predetermined groups. For example,
a restaurant chain could mine customer purchase data to determine when
customers visit and what they typically order. This information could be used to
increase traffic by having daily specials.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack
being purchased based on a consumer's purchase of sleeping bags and hiking
shoes.
• Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
Software Used
Java byte code can execute on the server instead of or in addition to the client,
enabling you to build traditional client/server applications and modern thin client Web
applications. Two key server side Java technologies are servlets and JavaServer Pages.
Servlets are protocol and platform independent server side components which extend the
functionality of a Web server. JavaServer Pages (JSPs) extend the functionality of
servlets by allowing Java servlet code to be embedded in an HTML file.
Features of Java
• Platform Independence
o The Write-Once-Run-Anywhere ideal has not been fully achieved (tuning
for different platforms is usually required), but Java comes closer to it
than other languages.
• Object Oriented
o Object oriented throughout - no coding outside of class definitions,
including main ().
o An extensive class library available in the core language packages.
• Compiler/Interpreter Combo
o Code is compiled to bytecodes that are interpreted by a Java virtual
machine (JVM).
o This provides portability to any machine for which a virtual machine has
been written.
o The two steps of compilation and interpretation allow for extensive code
checking and improved security.
• Robust
o Exception handling built-in, strong type checking (that is, all data must be
declared an explicit type), local variables must be initialized.
• Security
o No memory pointers
o Programs run inside the virtual machine sandbox.
o Array index limit checking
o Code pathologies reduced by
byte code verifier - checks classes after loading
Class loader - confines objects to unique namespaces. Prevents
loading a hacked "java.lang.SecurityManager" class, for example.
Security manager - determines what resources a class can access
such as reading and writing to the local disk.
• Dynamic Binding
o The linking of data and methods to where they are located is done at run-
time.
o New classes can be loaded while a program is running. Linking is done on
the fly.
o Even if libraries are recompiled, there is no need to recompile code that
uses classes in those libraries.
This differs from C++, which uses static binding. This can result in fragile
classes for cases where linked code is changed and memory pointers then
point to the wrong addresses.
• Good Performance
o Interpretation of byte codes slowed performance in early versions, but
advanced virtual machines with adaptive and just-in-time compilation and
other techniques now typically provide performance at 50% to 100% of
the speed of comparable C++ programs.
• Threading
o Lightweight processes, called threads, can easily be spun off to perform
multiprocessing.
o Can take advantage of multiprocessors where available
o Great for multimedia displays.
• Built-in Networking
o Java was designed with networking in mind and comes with many classes
to develop sophisticated Internet communications.
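The FTPServer and FTPClient components in the class diagram build on this networking support. Below is a minimal sketch of a round trip over a loopback socket using java.net; the class name and one-line echo protocol are illustrative stand-ins for the project's sendFile()/recieveFile() exchange, not its actual FTP code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// One-shot transfer: a server thread accepts a connection and echoes back
// the line it receives, standing in for a file send/receive exchange.
public class MiniTransfer {

    public static String roundTrip(String payload) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // 0 = any free port
            Thread t = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println(in.readLine());             // echo the payload back
                } catch (IOException ignored) { }
            });
            t.start();
            try (Socket client = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println(payload);                        // client sends its data
                return in.readLine();                        // and reads the reply
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("clustered-data-block"));
    }
}
```

In the project, the same socket pattern carries the encrypted clustered file from System 1 to System 2.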
TEST PROCEDURE
SYSTEM TESTING:
Testing is performed to identify errors; it is used for quality assurance and is an
integral part of the entire development and maintenance process. The goal of testing
during this phase is to verify that the specification has been accurately and completely
incorporated into the design, as well as to ensure the correctness of the design itself. Any
logic fault in the design must be detected before coding commences; otherwise the cost
of fixing the fault will be considerably higher. Detection of design faults can be achieved
by means of inspections as well as walkthroughs.
Testing is one of the important steps in the software development phase. Testing
checks for errors; as a whole, project testing involves the following kinds of analysis:
Static analysis is used to investigate the structural properties of the source code.
Dynamic testing is used to investigate the behavior of the source code by
executing the program on the test data.
1. UNIT TESTING:
Unit testing is conducted to verify the functional performance of each modular
component of the software. Unit testing focuses on the smallest unit of the software
design, i.e., the module. White-box testing techniques were heavily employed for
unit testing.
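As an illustration of unit testing at the module level, the sketch below tests a hypothetical nearest-centre helper (not the project's actual module) with plain assertions on nominal, boundary, and special values:

```java
// A plain-assertion unit test for a hypothetical nearest-centre helper.
public class NearestCentreTest {

    // Returns the index of the centre closest to x (ties go to the lower index).
    static int nearestCentre(double x, double[] centres) {
        int best = 0;
        for (int i = 1; i < centres.length; i++)
            if (Math.abs(x - centres[i]) < Math.abs(x - centres[best])) best = i;
        return best;
    }

    public static void main(String[] args) {
        double[] centres = { 0.0, 10.0 };
        // nominal value: 2 is closer to 0 than to 10
        if (nearestCentre(2.0, centres) != 0) throw new AssertionError("nominal case");
        // boundary value: 5 is equidistant, so the lower index wins
        if (nearestCentre(5.0, centres) != 0) throw new AssertionError("tie case");
        // special value: a single centre is always chosen
        if (nearestCentre(99.0, new double[]{ 7.0 }) != 0)
            throw new AssertionError("single centre");
        System.out.println("all unit tests passed");
    }
}
```

Each module of the project can be exercised the same way before the modules are combined for integration testing.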
FUNCTIONAL TESTS:
Functional test cases involved exercising the code with nominal input
values for which the expected results are known, as well as boundary values and special
values, such as logically related inputs, files of identical elements, and empty files.
Functional testing involves three types of tests:
Performance Test
Stress Test
Structure Test
PERFORMANCE TEST:
It determines the amount of execution time spent in various parts of the unit, as
well as program throughput, response time, and device utilization by the program unit.
STRESS TEST:
Stress tests are designed to intentionally break the unit. A great deal
can be learned about the strengths and limitations of a program by examining the manner
in which a program unit breaks.
STRUCTURED TEST:
Structure tests are concerned with exercising the internal logic of a program and
traversing particular execution paths. A white-box test strategy was employed to ensure
that the test cases guarantee that all independent paths within a module have been
exercised at least once.
2. INTEGRATION TESTING:
The major error faced during the project was a linking error: when all the modules
were combined, the links to the supporting files were not set properly, so we checked the
interconnections and links. Errors are localized to the new module and its
intercommunications. Product development can be staged, with modules integrated as
they complete unit testing. Testing is completed when the last module is integrated and
tested.
TESTING TECHNIQUES / TESTING STRATEGIES:
TESTING:
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design, and coding. Testing is the process of executing a
program with the intent of finding errors. A good test case design is one that has a high
probability of finding an as-yet-undiscovered error, and a successful test is one that
uncovers such an error. Any engineering product can be tested in one of two ways:
White box testing:
This testing is also called glass box testing. By knowing the internal operation of a
product, tests can be conducted to ensure that "all gears mesh", that is, that the internal
operation performs according to specification and that all internal components have been
adequately exercised. It is a test case design method that uses the control structure of the
procedural design to derive test cases. Basis path testing is a white-box technique.
Black box testing:
By knowing the specific functions that a product has been designed to perform, tests can
be conducted to demonstrate that each function is fully operational while simultaneously
searching for errors in each function. It fundamentally focuses on the functional
requirements of the software.
A software testing strategy provides a road map for the software developer. Testing is a
set of activities that can be planned in advance and conducted systematically. For this
reason, a template for software testing, a set of steps into which we can place specific test
case design methods, should be defined for the software process. Any testing strategy
should have the following characteristics:
Testing begins at the module level and works “outward” toward the
integration of the entire computer based system.
Different testing techniques are appropriate at different points in time.
The developer of the software and an independent test group conducts
testing.
Testing and Debugging are different activities but debugging must be
accommodated in any testing strategy.
PROGRAM TESTING:
The logical and syntax errors have been pointed out by program testing. A syntax
error is an error in a program statement that violates one or more rules of the language in
which it is written; improperly defined field dimensions or omitted keywords are
common syntax errors. These errors are shown through error messages generated by the
computer. A logic error, on the other hand, deals with incorrect data fields, out-of-range
items, and invalid combinations. Since the compiler will not detect logical errors, the
programmer must examine the output. Condition testing exercises the logical conditions
contained in a module. The possible types of elements in a condition include a Boolean
operator, a Boolean variable, a pair of Boolean parentheses, a relational operator, or an
arithmetic expression. The condition testing method focuses on testing each condition in
the program; its purpose is to detect not only errors in the conditions of a program but
also other errors in the program.
SECURITY TESTING:
Security testing attempts to verify that the protection mechanisms built into a system
will, in fact, protect it from improper penetration. The system must be tested for
invulnerability from frontal attack, and must also be tested for invulnerability from rear
attack. During security testing, the tester plays the role of an individual who desires to
penetrate the system.
References:
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann,
2001.
2. C. Clifton et al., "Tools for Privacy Preserving Distributed Data Mining," SIGKDD
Explorations, vol. 4, no. 2, 2003.
3. V.S. Verykios et al., "State-of-the-Art in Privacy Preserving Data Mining," SIGMOD
Record, vol. 33, no. 1, 2004.
4. L. Wang, S. Jajodia, and D. Wijesekera, "Securing OLAP Data Cubes against Privacy
Breaches," Proc. 25th IEEE Symp. Security and Privacy, IEEE Press, 2004.
Class Diagram:
(Figure)
System 1:
• ReclusterFrame: openFile(), partyOneClustered(), partyTwoClustered(),
showClusteredItem(), exit()
• RecursiveCluster: Split(), RecursiveClusterSub(), CheckAttributes()
• MainCluster: mainCheckAttributes(), mainRecluster(), hashTable()
• FTPServer: sendFile(), recieveFile(), run()
System 2:
• One: One(), Connection(), Exit()
• Start: initCompenents(), connectionPerformed()
• FTPClient: setFile(), sendFile(), recieveFile(), displayMenu()
• DataEncryption / DataDecryption: Encrypt(), Decrypt()
Sequence Diagram:
(Figure) System 1: choosing a file, encrypting the clustered file, then decrypting and
sending the file to the user.
Use case Diagram:
(Figure) System 1 use cases: file choosing, clustering the selected file, reclustering the
clustered file, encryption, and decryption.
Literature Survey:
We also propose a classification hierarchy that sets the basis for analyzing the
work which has been performed in this context. A detailed review of the work
accomplished in this area is given, along with the coordinates of each work in the
classification hierarchy. We address issues related to sharing information in a distributed
system consisting of autonomous entities, each of which holds a private database. Semi-
honest behavior has been widely adopted as the model for adversarial threats; however, it
substantially underestimates the capability of adversaries in reality.
Future Enhancement:
Most previous studies investigated the problem and proposed solutions based on
the assumption that all parties are honest or semi-honest. While sometimes useful, this
assumption substantially underestimates the capability of adversaries and thus does not
always hold in practical situations. We consider a space of more powerful adversaries
that includes not only honest and semi-honest adversaries but also those who are weakly
malicious and strongly malicious.