Vous êtes sur la page 1sur 45

S.I.E.

S College of Arts, Science and Commerce,


Sion(W), Mumbai 400 002.

A PROJECTED REPORT
ON
Fuzzy Keyword Search over Encrypted Data in Cloud
Computing

SUBMITED TO
UNIVERSITY OF MUMBAI
BY
VENNELA KUMARARAJA
M.Sc.(COMPUTER SCIENCE)
20132014

S.I.E.S College of Arts, Science and Commerce,


Sion(W), Mumbai 400 002.
CERTIFICATE

This is to certify that Mr / Miss. __________________________________


Seat-No:

_______________ has successfully completed the

necessary course of experiments in the subject of


__________________________________________during the academic
year 2013 2014 completing with the requirements of University
of Mumbai, for the course of MSC - II. Computer Science.

______________
Prof. In-Charge

___________________
Head of the Department

____________
Examiner
College Seal

ACKNOWLEDGEMENT
No project is ever complete without the guidance of those expert how have already traded this
past before and hence become master of it and as a result, our leader. So we would like to take
this opportunity to take all those individuals how have helped us in visualizing this project.
We express out deep gratitude to our project guide Mrs. for providing timely assistant to
our query and guidance that she gave owing to her experience in this field for past many year.
She had indeed been a lighthouse for us in this journey.
We would also take this opportunity to thank our project co-ordinate Mr. for his guidance
in selecting this project and also for providing us all this details on proper presentation of this
project.
We extend our sincerity appreciation to all our Professor form COLLEGE OF
ENGINEERING for their valuable inside and tip during the designing of the project. Their
contributions have been valuable in so many ways that we find it difficult to acknowledge of
them individual.
We also great full to our HOD Mrs. for extending her help directly and indirectly through
various channel in our project work.
.

Thanking You,
________________

ABSTRACT
Abstract As Cloud Computing becomes prevalent, more and more
sensitive information are being centralized into the cloud. For the
protection of data privacy, sensitive data usually have to be encrypted
before outsourcing, which makes effective data utilization a very
challenging task. Although traditional searchable encryption schemes
allow a user to securely search over encrypted data through keywords
and selectively retrieve files of interest, these techniques support only
exact keyword search. That is, there is no tolerance of minor typos and
format inconsistencies which, on the other hand, are typical user
searching behavior and happen very frequently. This significant
drawback makes existing techniques unsuitable in Cloud Computing as
it greatly affects system usability, rendering user searching experiences
very frustrating and system efficacy very low. In this paper, for the first
time we formalize and solve the problem of effective fuzzy keyword
search over encrypted cloud data while maintaining keyword privacy.
Fuzzy keyword search greatly enhances system usability by returning
the matching files when users searching inputs exactly match the
predefined keywords or the closest possible matching files based on
keyword similarity semantics, when exact match fails. In our solution,
we exploit edit distance to quantify keywords similarity and develop an

advanced technique on constructing fuzzy keyword sets, which greatly


reduces the storage and representation overheads. Through rigorous
security analysis, we show that our proposed solution is secure and
privacy-preserving, while correctly realizing the goal of fuzzy keyword
search.

INDEX
SR.N
O
1)
2)
3)
4)
5)
6)
7)
8)
9)

TITLE
INTRODUCTION
LITERATURE SURVEY
PROBLEM DEFINITION
REQUIREMENT ANALYSIS

PG.NO
1
5
8
11
13

PLANNING AND ESTIMATION


15
ALGORITHM
22
IMPLEMENTATION
ADVANTAGES & DISADVANTAGES

27
29

FUTURE MODIFICATIONS

10)

APPLICATION

11)

BIBLIOGRAPHY

12)

SCREENSHOTS

13)

SOURCE CODE

31
33
48

INTRODUCTION

INTRODUCTION
As Cloud Computing becomes prevalent, more and more sensitive
information are being centralized into the cloud, such as emails, personal
health records, government documents, etc. By storing their data into the
cloud, the data owners can be relieved from the burden of data storage
and maintenance so as to enjoy the on-demand high quality data storage
service. However, the fact that data owners and cloud server are not in
the same trusted domain may put the outsourced data at risk, as the
cloud server may no longer be fully trusted. It follows that sensitive data
usually should be encrypted prior to outsourcing for data privacy and
combating unsolicited accesses. However, data encryption makes
effective data utilization a very challenging task given that there could
be a large amount of outsourced data files. Moreover, in Cloud
Computing, data owners may share their outsourced data with a large
number of users. The individual users might want to only retrieve certain
specific data files they are interested in during a given session. One of
the most popular ways is to selectively retrieve files through keywordbased search instead of retrieving all the encrypted files back which is
completely impractical in cloud computing scenarios. Such keywordbased search technique allows users to selectively retrieve files of

interest and has been widely applied in plaintext search scenarios, such
as Google search [1]. Unfortunately, data encryption restricts users
ability to perform keyword search and thus makes the traditional
plaintext search methods unsuitable for Cloud Computing. Besides this,
data encryption also demands the protection of keyword privacy since
keywords usually contain important information related to the data files.
Although encryption of keywords can protect keyword privacy, it further
renders the traditional plaintext search techniques useless in this
scenario. To securely search over encrypted data, searchable encryption
techniques have been developed in recent years [2][10]. Searchable
encryption schemes usually build up an index for each keyword of
interest and associate the index with the files that contain the keyword.
By integrating the trapdoors of keywords within the index information,
effective keyword search can be realized while both file content and
keyword privacy are well-preserved. Although allowing for performing
searches securely and effectively, the existing searchable encryption
techniques do not suit for cloud computing scenario since they support
only exact keyword search. That is, there is no tolerance of minor typos
and format inconsistencies. It is quite common that users searching
input might not exactly match those pre-set keywords due to the possible
typos, such as Illinois and Illinois, representation inconsistencies, such
as PO BOX and P.O. Box, and/or her lack of exact knowledge about the
data. The naive way to support fuzzy keyword search is through simple

spell check mechanisms. However, this approach does not completely


solve the problem and sometimes can be ineffective due to the following
reasons: on the one hand, it requires additional interaction of user to
determine the correct word from the candidates generated by the spell
check algorithm, which unnecessarily costs users extra computation
effort; on the other hand, in case that user accidentally types some other
valid keywords by mistake (for example, search for hat by carelessly
typing cat), the spell check algorithm would not even work at all, as it
can never differentiate between two actual valid words. Thus, the
drawbacks of existing schemes signify the important need for new
techniques that support searching flexibility, tolerating both minor typos
and format inconsistencies. In this paper, we focus on enabling effective
yet privacy preserving fuzzy keyword search in Cloud Computing. To
the best of our knowledge, we formalize for the first time the problem of
effective fuzzy keyword search over encrypted cloud data while
maintaining keyword privacy. Fuzzy keyword search greatly enhances
system usability by returning the matching files when users searching
inputs exactly match the predefined keywords or the closest possible
matching files based on keyword similarity semantics, when exact match
fails. More specifically, we use edit distance to quantify keywords
similarity and develop a novel technique, i.e., an wildcard-based
technique, for the construction of fuzzy keyword sets. This technique
eliminates the need for enumerating all the fuzzy keywords and the

resulted size of the fuzzy keyword sets is significantly reduced. Based


on the constructed fuzzy keyword sets, we propose an efficient fuzzy
keyword search scheme. Through rigorous security analysis, we show
that the proposed solution is secure and privacy-preserving, while
correctly realizing the goal of fuzzy keyword search.

LITERATURE
SURVEY

Motivation
Plain text fuzzy keyword search.
Recently, the importance of fuzzy search has received attention in the
context of plaintext searching in information retrieval community [11]
[13]. They addressed this problem in the traditional informationaccess
paradigm by allowing user to search without using try-and-see approach
for finding relevant information based on approximate string matching.
At the first glance, it seems possible for one to directly apply these string
matching algorithms to the context of searchable encryption by
computing the trapdoors on a character base within an alphabet.
However, this trivial construction suffers from the dictionary and
statistics attacks and fails to achieve the search privacy
Searchable encryption.
Traditional searchable encryption [2][8], [10] has been widely studied
in the context of cryptography. Among those works, most are focused
on efficiency improvements and security definition formalizations. The
first construction of searchable encryption was proposed by Song et al.
[3], in which each word in the document is encrypted independently
under a special two-layered encryption construction. Goh [4] proposed
to use Bloom filters to construct the indexes for the data files. To
achieve more efficient search, Chang et al. [7] and Curtmola et al. [8]

both proposed similar index approaches, where a single encrypted


hash table index is built for the entire file collection. In the index table,
each entry consists of the trapdoor of a keyword and an encrypted set of
file identifiers whose corresponding data files contain the keyword. As
a complementary approach, Boneh et al. [5] presented a public-key
based searchable encryption scheme, with an analogous scenario to that
of [3]. Note that all these existing schemes support only exact keyword
search, and thus are not suitable for Cloud Computing.
Others.
Private matching [14], as another related notion, has been studied mostly
in the context of secure multiparty computation to let different parties
compute some function of their own data collaboratively without
revealing their data to the others. These functions could be intersection
or approximate private matching of two sets, etc. The private
information retrieval [15] is an often-used technique to retrieve the
matching items secretly, which has been widely applied in information
retrieval from database and usually incurs unexpectedly computation
complexity.

PROBLEM
DEFINITION

Problem statement

In this paper, we consider a cloud data system consisting of data owner,


data user and cloud server. Given a collection of n encrypted data files C
= (F1, F2, . . . , FN) stored in the cloud server, a predefined set of
distinct keywords W = {w1, w2, ...,wp}, the cloud server provides the
search service for the authorized users over the encrypted data C. We
assume the authorization between the data owner and users is
appropriately done. An authorized user types in a request to selectively
retrieve data files of his/her interest. The cloud server is responsible for
mapping the searching request to a set of data files, where each file is
indexed by a file ID and linked to a set of keywords. The fuzzy keyword
search scheme returns the search results according to the following
rules:
1) if the users searching input exactly matches the pre-set
keyword, the server is expected to return the files containing the
keyword1;
2) if there exist typos and/or format inconsistencies in the
searching input, the server will return the closest possible results
based on pre-specified similarity semantics
Aim
we formalize and solve the problem of effective fuzzy keyword search
over encrypted cloud data

Scope
We propose a system where Fuzzy keyword search greatly enhances
system usability by returning the matching files when users searching
inputs exactly match the predefined keywords or the closest possible
matching files based on keyword similarity semantics, when exact match
fails.

Existing System
For the protection of data privacy, sensitive data usually have to be
encrypted before outsourcing, which makes effective data utilization a
very challenging task. Although traditional searchable encryption
schemes allow a user to securely search over encrypted data through
keywords and selectively retrieve files of interest, these techniques
support only exact keyword search. That is, there is no tolerance of
minor typos and format inconsistencies which, on the other hand, are
typical user searching behavior and happen very frequently. This
significant drawback makes existing techniques unsuitable in Cloud
Computing as it greatly affects system usability, rendering user
searching experiences very frustrating and system efficacy very low.

Disadvantages of Existing System


1) Dependent on Exact Keyword technique
2) no tolerance of minor typos and format inconsistencies

Proposed System

Proposed System:
Main Modules:
1. Wildcard Based Technique
2. Gram - Based Technique

1. Wildcard Based Technique:


In the above straightforward approach, all the variants of the keywords
have to be listed even if an operation is performed at the same position.
Based on the above observation, we proposed to use an wildcard to denote
edit operations at the same position. The wildcard-based fuzzy set edits
distance to solve the problems.
For example, for the keyword CASTLE with the pre-set edit distance 1, its
wildcard based fuzzy keyword set can be constructed as
SCASTLE, 1 = {CASTLE, *CASTLE,*ASTLE, C*ASTLE, C*STLE, CASTL*E, CASTL*,
CASTLE*}.

Edit Distance:

a.
b.
c.

Substitution
Deletion
Insertion

a.

Substitution : changing one character to another in a word;

b.

Deletion : deleting one character from a word;

c.

Insertion: inserting a single character into a word.

2. Gram Based Technique:

Another efficient technique for constructing fuzzy set is based on


grams. The gram of a string is a substring that can be used as a
signature for efficient approximate search. While gram has been widely
used for constructing inverted list for approximate string search, we use
gram for the matching purpose. We propose to utilize the fact that any
primitive edit operation will affect at most one specific character of the
keyword, leaving all the remaining characters untouched. In other
words, the relative order of the remaining characters after the primitive
operations is always kept the same as it is before the operations.

For example, the gram-based fuzzy set SCASTLE, 1 for keyword


CASTLE can be constructed as

{CASTLE, CSTLE, CATLE, CASLE, CASTE, CASTL,


ASTLE}.
Algorithm Description:
The approximate string matching algorithms among them can be
classified into two categories: on-line and off-line. The on-line
techniques, performing search without an index, are unacceptable for
their low search efficiency, while the off-line approach, utilizing
indexing techniques, makes it dramatically faster. A variety of indexing
algorithms, such as suffix trees, metric trees and q-gram methods, have
been presented. At the first glance, it seems possible for one to directly
apply these string matching algorithms to the context of searchable
encryption by computing the trapdoors on a character base within an
alphabet. However, this trivial construction suffers from the dictionary
and statistics attacks and fails to achieve the search privacy. An instance
M of the data type string-matching is an object maintaining a pattern and
a string. It provides a collection of different algorithms for computation
of the exact string matching problem. Each function computes a list of
all starting positions of occurrences of the pattern in the string.
System Architecture:

Advantages of Proposed system


1) Improve the success rate of prediction.
2) Reduce incorrect file retrieval.
3) Not dependent on Exact Keyword technique

4) Tolerance of minor typos and format inconsistencies

HARDWARE &
SOFTWARE
REQUIREMENT

Hardware and Software requirements


Hardware:
1. Processor: Pentium 4
2. RAM: 512 MB or more
3. Hard disk: 16 GB or more
Software(pending student confirmation)

PLANNING AND
ESTIMATION

Project Methodology
Software development Life Cycle
The entire project spanned for duration of 6 months. In order to
effectively design and develop a cost-effective model the Waterfall model
was practiced.

Requirement gathering and Analysis phase:


This phase started at the beginning of our project, we had
formed groups and modularized the project. Important points of
consideration were

1 Define and visualize all the objectives clearly.


2 Gather requirements and evaluate them.
3 Consider the technical requirements needed and then collect
technical specifications of various peripheral components
(Hardware) required.
4 Analyze the coding languages needed for the project.
5 Define coding strategies.
6 Analyze future risks / problems.
7 Define strategies to avoid this risks else define alternate
solutions to this risks.
8 Check financial feasibility.
9 Define Gantt charts and assign time span for each phase.
By studying the project extensively we developed a Gantt
chart to track and schedule the project. Below is the Gantt chart
of our project.

TimeLine

Cost Estimation

Cost estimation is done using cocomo model

cost Drivers
Product attributes
Required software reliability
Size of application database
Complexity of the product
Hardware attributes
Run-time performance constraints
Memory constraints
Volatility of the virtual machine envi
ronment
Required turnabout time
Personnel attributes
Analyst capability
Applications experience
Software engineer capability
Virtual machine experience
Programming language experience
Project attributes

Very
Low
0.75
0.70

1.46
1.29
1.42
1.21
1.14

Ratings
Nomin
Low
al
High

Very Extra
High High

0.88
0.94
0.85

1.00
1.00
1.00

1.15
1.08
1.15

1.40
1.16
1.30

1.65

1.00
1.00

1.11
1.06

1.30
1.21

1.66
1.56

0.87

1.00

1.15

1.30

0.87

1.00

1.07

1.15

1.19
1.13
1.17
1.10
1.07

1.00
1.00
1.00
1.00
1.00

0.86
0.91
0.86
0.90
0.95

0.71
0.82
0.70

Use of software tools


Application of software engineering
methods
Required development schedule

1.24

1.10

1.00

0.91

0.82

1.24

1.10

1.00

0.91

0.83

1.23

1.08

1.00

1.04

1.10

The Intermediate Cocomo formula now takes the form:


E=ai(KLoC)(bi).EAF

Using above calculation we found that


The total time period of the project is around 6 months, the per month cost comes out to be
Rs.12,000 , so the total comes to be Rs.72,000

FEASIBILITY STUDY
Feasibility study is made to see if the project on completion will serve the purpose of the
organization for the amount of work, effort and the time that spend on it. Feasibility study lets
the developer foresee the future of the project and the usefulness. A feasibility study of a system
proposal is according to its workability, which is the impact on the organization, ability to meet
their user needs and effective use of resources. Thus when a new application is proposed it
normally goes through a feasibility study before it is approved for development.
The document provide the feasibility of the project that is being designed and lists various
areas that were considered very carefully during the feasibility study of this project such as
Technical, Economic and Operational feasibilities. The following are its features:

TECHNICAL FEASIBILITY
The system must be evaluated from the technical point of view first. The assessment of this
feasibility must be based on an outline design of the system requirement in the terms of input,
output, programs and procedures. Having identified an outline system, the investigation must go
on to suggest the type of equipment, required method developing the system, of running the
system once it has been designed.

Technical issues raised during the investigation are:


Does the existing technology sufficient for the suggested one?
Can the system expand if developed?
The project should be developed such that the necessary functions and performance are
achieved within the constraints. The project is developed within latest technology. Through the
technology may become obsolete after some period of time, due to the fact that never version of
same software supports older versions, the system may still be used. So there are minimal
constraints involved with this project. The system has been developed using Java the project is
technically feasible for development.
ECONOMIC FEASIBILITY
The developing system must be justified by cost and benefit. Criteria to ensure that effort is
concentrated on project, which will give best, return at the earliest. One of the factors, which
affect the development of a new system, is the cost it would require.
The following are some of the important financial questions asked during preliminary
investigation:

The costs conduct a full system investigation.


The cost of the hardware and software.
The benefits in the form of reduced costs or fewer costly errors.

Since the system is developed as part of project work, there is no manual cost to spend for the
proposed system. Also all the resources are already available, it give an indication of the system
is economically possible for development.
BEHAVIORAL FEASIBILITY
This includes the following questions:
Is there sufficient support for the users?
Will the proposed system cause harm?
The project would be beneficial because it satisfies the objectives when developed and
installed. All behavioral aspects are considered carefully and conclude that the project is
behaviorally feasible.

Design &
Implementation

Start

Input login credentials

If verified?
No

Yes
Select file to upload
Encrypt data with key

Upload encrypted file


Generate search index
Stop

1 LEVEL DFD

2 level DFD

APPLICATION

1) search semantics
2) search ranking
3) File retrieval

Conclusion

Conclusion

In this paper, for the first time we formalize and solve the problem of
supporting efficient yet privacy-preserving fuzzy search for achieving
effective utilization of remotely stored encrypted data in Cloud
Computing. We design an advanced technique (i.e., wildcard-based
technique) to construct the storage-efficient fuzzy keyword sets by
exploiting a significant observation on the similarity metric of edit
distance. Based on the constructed fuzzy keyword sets, we further
propose an efficient fuzzy keyword search scheme. Through rigorous
security analysis, we show that our proposed solution is secure and
privacy-preserving, while correctly realizing the goal of fuzzy keyword
search.

FUTURE MODIFICATION

FUTURE MODIFICATION

As our ongoing work, we will continue to research on security


mechanisms that support:
1) search semantics that takes into consideration conjunction of
keywords, sequence of keywords, and even the complex natural
language semantics to produce highly relevant search results; and
2) search ranking that sorts the searching results according to the
relevance criteria.

BIBILIOGRAPHY

Bibliography
[1] Google, Britney spears spelling correction, Referenced online at
http:
//www.google.com/jobs/britney.html, June 2009.
[2] M. Bellare, A. Boldyreva, and A. ONeill, Deterministic and
efficiently
searchable encryption, in Proceedings of Crypto 2007, volume 4622 of
LNCS. Springer-Verlag, 2007.
[3] D. Song, D. Wagner, and A. Perrig, Practical techniques for
searches
on encrypted data, in Proc. of IEEE Symposium on Security and
Privacy00, 2000.
[4] E.-J. Goh, Secure indexes, Cryptology ePrint Archive, Report
2003/216, 2003, http://eprint.iacr.org/.
[5] D. Boneh, G. D. Crescenzo, R. Ostrovsky, and G. Persiano, Public
key
encryption with keyword search, in Proc. of EUROCRYP04, 2004.
[6] B. Waters, D. Balfanz, G. Durfee, and D. Smetters, Building an
encrypted and searchable audit log, in Proc. of 11th Annual Network
and Distributed System, 2004.
[7] Y.-C. Chang and M. Mitzenmacher, Privacy preserving keyword

searches on remote encrypted data, in Proc. of ACNS05, 2005.


[8] R. Curtmola, J. A. Garay, S. Kamara, and R. Ostrovsky, Searchable
symmetric encryption: improved definitions and efficient constructions,
in Proc. of ACM CCS06, 2006.
[9] D. Boneh and B. Waters, Conjunctive, subset, and range queries on
encrypted data, in Proc. of TCC07, 2007, pp. 535554.
[10] F. Bao, R. Deng, X. Ding, and Y. Yang, Private query on encrypted
data in multi-user settings, in Proc. of ISPEC08, 2008.
[11] C. Li, J. Lu, and Y. Lu, Efficient merging and filtering algorithms
for
approximate string searches, in Proc. of ICDE08, 2008.
[12] A. Behm, S. Ji, C. Li, , and J. Lu, Space-constrained gram-based
indexing for efficient approximate string search, in Proc. of ICDE09.
[13] S. Ji, G. Li, C. Li, and J. Feng, Efficient interactive fuzzy keyword
search, in Proc. of WWW09, 2009.

Vous aimerez peut-être aussi