Abstract:
Many organizations collect large amounts of data about their clients and
customers. Initially, such data was used only for record keeping, but data mining can
extract valuable knowledge from it, and organizations can obtain better results by pooling
their data together. However, the collected data may contain sensitive or private
information about the organizations or their customers, and privacy concerns are
exacerbated when data is shared between multiple organizations. Distributed data mining
is concerned with the computation of models from data that is distributed among multiple
participants. Privacy-preserving distributed data mining seeks to allow the cooperative
computation of such models without the cooperating parties revealing any of their
individual data items to each other. Our project makes two contributions to privacy-
preserving data mining. First, we introduce the concept of arbitrarily partitioned data,
which is a generalization of both horizontally and vertically partitioned data. Second, we
provide an efficient privacy-preserving protocol for k-means clustering in the setting of
arbitrarily partitioned data.
We present a simple I/O-efficient k-clustering algorithm designed with the goal of
enabling a privacy-preserving version of the algorithm. Our experiments show that this
algorithm produces cluster centers that are, on average, more accurate than those
produced by the well-known iterative k-means algorithm. We use our new algorithm as
the basis for a communication-efficient privacy-preserving k-clustering protocol for
databases that are horizontally partitioned between two parties. Unlike existing privacy-
preserving protocols based on the k-means algorithm, this protocol does not reveal
intermediate candidate cluster centers. In this work, we also propose methods for
constructing the dissimilarity matrix of objects.
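A dissimilarity matrix such as the one mentioned above can be built from pairwise distances between records. Below is a minimal sketch, assuming numeric attributes and Euclidean distance; the class name, method names, and sample data are illustrative, not taken from the project code:

```java
// Builds a symmetric dissimilarity matrix of pairwise Euclidean distances.
public class DissimilarityMatrix {

    // Euclidean distance between two numeric records of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Entry (i, j) holds the dissimilarity between record i and record j.
    static double[][] build(double[][] records) {
        int n = records.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                m[i][j] = distance(records[i], records[j]);
                m[j][i] = m[i][j];   // the matrix is symmetric
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {3, 4}, {6, 8} };
        double[][] m = build(data);
        System.out.println(m[0][1]); // prints 5.0, the distance from (0,0) to (3,4)
    }
}
```

A matrix built this way can be fed to any clustering routine that works from pairwise dissimilarities rather than raw records.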
Existing System:
Earlier techniques include data perturbation for privacy-preserving construction of
classification models on centralized data, and techniques for privacy-preserving
association rule mining in distributed environments. Techniques from secure multiparty
computation form one approach to privacy-preserving data mining. Yao's general
protocol for secure circuit evaluation can be used to solve any two-party privacy-
preserving distributed data mining problem. However, since data mining usually involves
millions or billions of data items, the communication cost of this protocol renders it
impractical for these purposes. The k-means and recluster algorithms are used in turn for
efficiency.
Proposed System:
K-Means (figure): the dataset passes through k-means clustering to produce a clustered
dataset; the clustered dataset is then reclustered to produce the privacy-preserved dataset.
Modules:
1. Data collection
2. Formation of Cluster
3. Privacy preservation
4. Comparison evaluation of algorithms
1. Data collection:
Data collection is an important part of the project. The dataset is prepared by
integrating the organizations' details. Such data was initially used only for record keeping,
but these large collections of data can be "mined" for knowledge that can improve the
performance of the organization. While much data mining occurs on data within a single
organization, it is quite common to use data from multiple sources in order to yield more
precise or useful knowledge. However, privacy and secrecy considerations can prohibit
organizations from being willing or able to share their data with each other. Therefore we
apply a privacy-preserving data mining technique.
2. Formation of Cluster:
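This module groups each record with its nearest centroid and recomputes the centroids iteratively, as in the k-means algorithm described in the abstract. Below is a minimal sketch of the iterative step over numeric records; the class name, method names, and sample data are illustrative, not the project's actual code:

```java
import java.util.Arrays;

// Minimal iterative k-means: assign each point to its nearest centroid,
// then recompute each centroid as the mean of its assigned points.
public class KMeans {

    // Squared Euclidean distance (sufficient for nearest-centroid comparison).
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Returns the index of the nearest centroid for one record.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
        return best;
    }

    // Runs a fixed number of iterations and returns the final centroids.
    static double[][] cluster(double[][] points, double[][] centroids, int iters) {
        int k = centroids.length, dim = points[0].length;
        for (int t = 0; t < iters; t++) {
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                count[c]++;
                for (int d = 0; d < dim; d++) sum[c][d] += p[d];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)
                    for (int d = 0; d < dim; d++) centroids[c][d] = sum[c][d] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1.5, 2}, {8, 8}, {9, 9} };
        double[][] init = { {0, 0}, {10, 10} };   // k = 2 initial centroids
        double[][] fin = cluster(pts, init, 5);
        System.out.println(Arrays.deepToString(fin)); // prints the two final centroids
    }
}
```

In the project's setting, each assignment and update step would be carried out jointly by the two parties via the privacy-preserving protocol rather than over one party's plaintext data.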
3. Privacy preservation:
Privacy-preserving data mining solutions have been presented for both
horizontally and vertically partitioned databases, in which each party owns either
different data objects with the same attributes, or different attributes for the same data
objects, respectively. We introduce the notion of arbitrarily partitioned data, which
generalizes both horizontally and vertically partitioned data. In arbitrarily partitioned
data, different attributes for different items can be owned by either party. Although
extremely "patchworked" data is unlikely in practice, one advantage of considering
arbitrarily partitioned data is that protocols in this model apply to horizontally and
vertically partitioned data, as well as to hybrids that are mostly, but not completely,
vertically or horizontally partitioned.
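The privacy-preservation step of the proposed system (the DataEncryption and DataDecryption classes in the class diagram) can be sketched with the standard javax.crypto API. This is a minimal illustration using AES with a freshly generated key; the class and method names are illustrative, and real key management and cipher-mode choices are outside this sketch:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;

// Encrypts and decrypts a clustered-data record with a shared AES key.
public class DataCrypto {

    static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plain);
    }

    static byte[] decrypt(SecretKey key, byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return c.doFinal(cipherText);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        byte[] record = "cluster-1: 4.2, 7.9".getBytes(StandardCharsets.UTF_8);
        byte[] enc = encrypt(key, record);       // ciphertext sent to the other party
        byte[] dec = decrypt(key, enc);          // round trip restores the record
        System.out.println(new String(dec, StandardCharsets.UTF_8));
    }
}
```

The encrypted clustered data can then be transferred between System 1 and System 2 without exposing the underlying records in transit.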
Software constraints: privacy preserved using the protocol; number of clusters; grouping
based on minimum distance.
Data Flow Diagram (figure):
Level 0: System 1, problem definition and data collection.
Level 1: finding the centroids and grouping records by minimum distance.
Level 3: collection of the clustered data, privacy preservation (encryption and
decryption), and transfer to System 2.
Data Mining Concepts:
Although data mining is a relatively new term, the technology is not. Companies
have used powerful computers to sift through volumes of supermarket scanner data and
analyze market research reports for years. However, continuous innovations in computer
processing power, disk storage, and statistical software are dramatically increasing the
accuracy of analysis while driving down the cost.
For example, one Midwest grocery chain used the data mining capacity of Oracle
software to analyze local buying patterns. They discovered that when men bought diapers
on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that
these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays,
however, they only bought a few items. The retailer concluded that they purchased the
beer to have it available for the upcoming weekend. The grocery chain could use this
newly discovered information in various ways to increase revenue. For example, they
could move the beer display closer to the diaper display.
Data, Information, and Knowledge
Data
Data are any facts, numbers, or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data in different formats and
different databases. This includes:
• operational or transactional data, such as sales, cost, inventory, payroll, and
accounting
• nonoperational data, such as industry sales, forecast data, and macroeconomic
data
• meta data - data about the data itself, such as logical database design or data
dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data can yield
information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and future
trends. For example, summary information on retail supermarket sales can be analyzed in
light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a
manufacturer or retailer could determine which items are most susceptible to promotional
efforts.
While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two. Data mining software
analyzes relationships and patterns in stored transaction data based on open-ended user
queries. Several types of analytical software are available: statistical, machine learning,
and neural networks. Generally, the following types of relationships are sought:
• Classes: Stored data is used to locate data in predetermined groups. For example,
a restaurant chain could mine customer purchase data to determine when
customers visit and what they typically order. This information could be used to
increase traffic by having daily specials.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack
being purchased based on a consumer's purchase of sleeping bags and hiking
shoes.
• Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
Software Used
Java byte code can execute on the server instead of or in addition to the client,
enabling you to build traditional client/server applications and modern thin client Web
applications. Two key server side Java technologies are servlets and JavaServer Pages.
Servlets are protocol and platform independent server side components which extend the
functionality of a Web server. JavaServer Pages (JSPs) extend the functionality of
servlets by allowing Java servlet code to be embedded in an HTML file.
Features of Java
• Platform Independence
o The Write-Once-Run-Anywhere ideal has not been fully achieved (tuning
for different platforms is usually required), but Java comes closer to it
than other languages.
• Object Oriented
o Object oriented throughout - no coding outside of class definitions,
including main ().
o An extensive class library available in the core language packages.
• Compiler/Interpreter Combo
o Code is compiled to bytecodes that are interpreted by a Java virtual
machine (JVM).
o This provides portability to any machine for which a virtual machine has
been written.
o The two steps of compilation and interpretation allow for extensive code
checking and improved security.
• Robust
o Exception handling built-in, strong type checking (that is, all data must be
declared an explicit type), local variables must be initialized.
• Security
o No memory pointers
o Programs run inside the virtual machine sandbox.
o Array index limit checking
o Code pathologies reduced by
byte code verifier - checks classes after loading
Class loader - confines objects to unique namespaces. Prevents
loading a hacked "java.lang.SecurityManager" class, for example.
Security manager - determines what resources a class can access
such as reading and writing to the local disk.
• Dynamic Binding
o The linking of data and methods to where they are located is done at run-
time.
o New classes can be loaded while a program is running. Linking is done on
the fly.
o Even if libraries are recompiled, there is no need to recompile code that
uses classes in those libraries.
This differs from C++, which uses static binding. This can result in fragile
classes for cases where linked code is changed and memory pointers then
point to the wrong addresses.
• Good Performance
o Interpretation of byte codes slowed performance in early versions, but
advanced virtual machines with adaptive and just-in-time compilation and
other techniques now typically provide performance at 50% to 100% of
the speed of comparable C++ programs.
• Threading
o Lightweight processes, called threads, can easily be spun off to perform
multiprocessing.
o Can take advantage of multiprocessors where available
o Great for multimedia displays.
• Built-in Networking
o Java was designed with networking in mind and comes with many classes
to develop sophisticated Internet communications.
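The FTPServer and FTPClient components in the class diagram build on this networking support. Below is a minimal sketch of a round trip over a loopback socket using java.net; the class name and one-line echo protocol are illustrative stand-ins for the project's sendFile()/recieveFile() exchange, not its actual FTP code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// One-shot transfer: a server thread accepts a connection and echoes back
// the line it receives, standing in for a file send/receive exchange.
public class MiniTransfer {

    public static String roundTrip(String payload) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // 0 = any free port
            Thread t = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println(in.readLine());             // echo the payload back
                } catch (IOException ignored) { }
            });
            t.start();
            try (Socket client = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println(payload);                        // client sends its data
                return in.readLine();                        // and reads the reply
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("clustered-data-block"));
    }
}
```

In the project, the same socket pattern carries the encrypted clustered file from System 1 to System 2.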
TEST PROCEDURE
SYSTEM TESTING:
Testing is performed to identify errors; it is used for quality assurance and is an
integral part of the entire development and maintenance process. The goal of testing
during this phase is to verify that the specification has been accurately and completely
incorporated into the design, as well as to ensure the correctness of the design itself. Any
logic fault in the design must be detected before coding commences; otherwise the cost
of fixing the fault will be considerably higher. Detection of design faults can be achieved
by means of inspections as well as walkthroughs.
Testing is one of the important steps in the software development phase. Testing
checks for errors; as a whole, project testing involves the following kinds of analysis:
Static analysis is used to investigate the structural properties of the source code.
Dynamic testing is used to investigate the behavior of the source code by
executing the program on the test data.
1. UNIT TESTING:
Unit testing is conducted to verify the functional performance of each modular
component of the software. Unit testing focuses on the smallest unit of the software
design, i.e., the module. White-box testing techniques were heavily employed for
unit testing.
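As an illustration of unit testing at the module level, the sketch below tests a hypothetical nearest-centre helper (not the project's actual module) with plain assertions on nominal, boundary, and special values:

```java
// A plain-assertion unit test for a hypothetical nearest-centre helper.
public class NearestCentreTest {

    // Returns the index of the centre closest to x (ties go to the lower index).
    static int nearestCentre(double x, double[] centres) {
        int best = 0;
        for (int i = 1; i < centres.length; i++)
            if (Math.abs(x - centres[i]) < Math.abs(x - centres[best])) best = i;
        return best;
    }

    public static void main(String[] args) {
        double[] centres = { 0.0, 10.0 };
        // nominal value: 2 is closer to 0 than to 10
        if (nearestCentre(2.0, centres) != 0) throw new AssertionError("nominal case");
        // boundary value: 5 is equidistant, so the lower index wins
        if (nearestCentre(5.0, centres) != 0) throw new AssertionError("tie case");
        // special value: a single centre is always chosen
        if (nearestCentre(99.0, new double[]{ 7.0 }) != 0)
            throw new AssertionError("single centre");
        System.out.println("all unit tests passed");
    }
}
```

Each module of the project can be exercised the same way before the modules are combined for integration testing.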
FUNCTIONAL TESTS:
Functional test cases involved exercising the code with nominal input
values for which the expected results are known, as well as boundary values and special
values, such as logically related inputs, files of identical elements, and empty files.
Functional testing involves three types of tests:
Performance Test
Stress Test
Structure Test
PERFORMANCE TEST:
It determines the amount of execution time spent in various parts of the unit, as
well as program throughput, response time, and device utilization by the program unit.
STRESS TEST:
Stress tests are designed to intentionally break the unit. A great deal
can be learned about the strengths and limitations of a program by examining the manner
in which a program unit breaks.
STRUCTURED TEST:
Structure tests are concerned with exercising the internal logic of a program and
traversing particular execution paths. A white-box test strategy was employed to ensure
that the test cases guarantee that all independent paths within a module have been
exercised at least once.
2. INTEGRATION TESTING:
The major error faced during the project was a linking error: when all the modules
were combined, the links to the supporting files were not set properly, so we checked the
interconnections and links. Errors are localized to the new module and its
intercommunications. Product development can be staged, with modules integrated as
they complete unit testing. Testing is completed when the last module is integrated and
tested.
TESTING TECHNIQUES / TESTING STRATEGIES:
TESTING:
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design, and coding. Testing is the process of executing a
program with the intent of finding errors. A good test case design is one that has a high
probability of finding an as-yet-undiscovered error, and a successful test is one that
uncovers such an error. Any engineering product can be tested in one of two ways:
White box testing:
This testing is also called glass box testing. By knowing the internal operation of a
product, tests can be conducted to ensure that "all gears mesh", that is, that the internal
operation performs according to specification and that all internal components have been
adequately exercised. It is a test case design method that uses the control structure of the
procedural design to derive test cases. Basis path testing is a white-box technique.
Black box testing:
By knowing the specific functions that a product has been designed to perform, tests can
be conducted to demonstrate that each function is fully operational while simultaneously
searching for errors in each function. It fundamentally focuses on the functional
requirements of the software.
A software testing strategy provides a road map for the software developer. Testing is a
set of activities that can be planned in advance and conducted systematically. For this
reason, a template for software testing, a set of steps into which we can place specific test
case design methods, should be defined for the software process. Any testing strategy
should have the following characteristics:
Testing begins at the module level and works “outward” toward the
integration of the entire computer based system.
Different testing techniques are appropriate at different points in time.
The developer of the software and an independent test group conducts
testing.
Testing and Debugging are different activities but debugging must be
accommodated in any testing strategy.
PROGRAM TESTING:
The logical and syntax errors have been pointed out by program testing. A syntax
error is an error in a program statement that violates one or more rules of the language in
which it is written; improperly defined field dimensions or omitted keywords are
common syntax errors. These errors are shown through error messages generated by the
computer. A logic error, on the other hand, deals with incorrect data fields, out-of-range
items, and invalid combinations. Since the compiler will not detect logical errors, the
programmer must examine the output. Condition testing exercises the logical conditions
contained in a module. The possible types of elements in a condition include a Boolean
operator, a Boolean variable, a pair of Boolean parentheses, a relational operator, or an
arithmetic expression. The condition testing method focuses on testing each condition in
the program; its purpose is to detect not only errors in the conditions of a program but
also other errors in the program.
SECURITY TESTING:
Security testing attempts to verify that the protection mechanisms built into a system
will, in fact, protect it from improper penetration. The system must be tested for
invulnerability from frontal attack, and must also be tested for invulnerability from rear
attack. During security testing, the tester plays the role of an individual who desires to
penetrate the system.
References:
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann,
2001.
2. C. Clifton et al., "Tools for Privacy Preserving Distributed Data Mining," SIGKDD
Explorations, vol. 4, no. 2, 2003.
3. V.S. Verykios et al., "State-of-the-Art in Privacy Preserving Data Mining," SIGMOD
Record, vol. 33, no. 1, 2004.
4. L. Wang, S. Jajodia, and D. Wijesekera, "Securing OLAP Data Cubes against Privacy
Breaches," Proc. 25th IEEE Symp. Security and Privacy, IEEE Press, 2004.
Class Diagram:
(Figure)
System 1:
• ReclusterFrame: openFile(), partyOneClustered(), partyTwoClustered(),
showClusteredItem(), exit()
• RecursiveCluster: Split(), RecursiveClusterSub(), CheckAttributes()
• MainCluster: mainCheckAttributes(), mainRecluster(), hashTable()
• FTPServer: sendFile(), recieveFile(), run()
System 2:
• One: One(), Connection(), Exit()
• Start: initCompenents(), connectionPerformed()
• FTPClient: setFile(), sendFile(), recieveFile(), displayMenu()
• DataEncryption / DataDecryption: Encrypt(), Decrypt()
Sequence Diagram:
(Figure) System 1: choosing a file, encrypting the clustered file, then decrypting and
sending the file to the user.
Use case Diagram:
(Figure) System 1 use cases: file choosing, clustering the selected file, reclustering the
clustered file, encryption, and decryption.
Literature Survey:
We also propose a classification hierarchy that sets the basis for analyzing the
work which has been performed in this context. A detailed review of the work
accomplished in this area is given, along with the coordinates of each work in the
classification hierarchy. We address issues related to sharing information in a distributed
system consisting of autonomous entities, each of which holds a private database. Semi-
honest behavior has been widely adopted as the model for adversarial threats; however, it
substantially underestimates the capability of adversaries in reality.
Future Enhancement:
Most previous studies investigated the problem and proposed solutions based on
the assumption that all parties are honest or semi-honest. While sometimes useful, this
assumption substantially underestimates the capability of adversaries and thus does not
always hold in practical situations. We consider a space of more powerful adversaries
that includes not only honest and semi-honest adversaries but also those who are weakly
malicious and strongly malicious.