SEMINAR REPORT
On
PROTECTION OF BIG DATA PRIVACY
By
GAUTHAM KRISHNA
REG NO: TRV15IT027
November 2018
DEPARTMENT OF INFORMATION TECHNOLOGY
GOVERNMENT ENGINEERING COLLEGE BARTON HILL
TRIVANDRUM
THIRUVANANTHAPURAM-695035
CERTIFICATE
This is to certify that this seminar report entitled "Protection of Big Data Privacy" is a bona fide record of the work done by Gautham Krishna under our guidance towards partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Information Technology of APJ Abdul Kalam Technological University during the year 2018.
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this seminar would be incomplete without mention of the people who made it possible, without whose constant guidance and encouragement it would not have been possible to prepare this report. First, I would like to express my sincere gratitude and heartfelt indebtedness to our Principal, Dr. Rajasree M S, for providing all the necessary requirements for this work.
I also place on record my gratitude to my parents, friends and, above all, the Lord Almighty, who made my humble venture a successful one.
Place: Thiruvananthapuram
ABSTRACT
In recent years, big data has become a hot research topic. The increasing amount of big data also increases the chance of breaching the privacy of individuals. Since big data requires high computational power and large storage, distributed systems are used, and as multiple parties are involved in these systems, the risk of privacy violation is increased. A number of privacy-preserving mechanisms have been developed for privacy protection at different stages (e.g., data generation, data storage, and data processing) of the big data life cycle. This report provides a comprehensive overview of the privacy-preservation mechanisms in big data and presents the challenges faced by existing mechanisms. In particular, the report illustrates the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. The challenges and future research directions related to privacy preservation in big data are also discussed, and a comparative study of various recent techniques for big data privacy is included.
TABLE OF CONTENTS
1. INTRODUCTION
2. BIG DATA OVERVIEW
2.2 APPLICATIONS
6.1 ACCESS RESTRICTION
7. STORAGE ON CLOUD
7.2 INTEGRITY VERIFICATION OF BIG DATA STORAGE
8.1 PPDP
10. CONCLUSION
REFERENCES

LIST OF TABLES
7. ANONYMIZED DATABASE
1. INTRODUCTION
Due to recent technological developments, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other sources is increasing drastically day by day. The data generation rate is growing so rapidly that it is becoming extremely difficult to handle using traditional methods or systems. The enormous volume of data produced from various sources, in multiple formats and at very high speed, is referred to as big data. Big data, if captured and analyzed in a timely manner, can be converted into actionable insights of significant value, and it has been a very active research area for the last couple of years. Big data analytics is the term used to describe the process of examining massive amounts of complex data in order to reveal hidden patterns or identify hidden correlations. Although big data can be effectively utilized to better understand the world and to innovate in various aspects of human endeavor, the exploding amount of data has increased the potential for privacy breaches. A number of privacy-preserving mechanisms have been developed for privacy protection at different stages (for example, data generation, data storage, and data processing) of the big data life cycle. This report discusses the major concerns related to big data privacy.
2. BIG DATA OVERVIEW
Big data analytics is the process of examining massive amounts of complex data in order to reveal hidden patterns or identify hidden correlations. It largely involves collecting data from different sources, manipulating it so that it becomes available for consumption by analysts, and finally delivering data products useful to the organization's business.
a) Structured data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. This covers all data that can be stored in an SQL database in tables with rows and columns. Such data have relational keys and can easily be mapped into pre-designed fields.
b) Unstructured data
Any data with an unknown form or structure is classified as unstructured data. Unstructured data may have its own internal structure, but it does not fit neatly into a spreadsheet or database. Most business interactions, in fact, are unstructured in nature. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
c) Semi-structured data
Semi-structured data can contain elements of both of the above forms. It is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other data tables. For example, NoSQL databases are considered semi-structured.
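The three forms can be illustrated with a short sketch (the records and field names below are invented for illustration): structured data follows a fixed row-and-column schema, semi-structured data such as JSON carries its own flexible structure, and unstructured data is free text.

```python
import csv
import io
import json

# Structured: fixed schema with rows and columns, as in an SQL table.
structured = list(csv.DictReader(io.StringIO("id,name,age\n1,Alice,30\n2,Bob,25\n")))

# Semi-structured: self-describing JSON with nested, flexible fields,
# as found in NoSQL databases.
semi_structured = json.loads(
    '{"id": 3, "name": "Carol", "address": {"city": "Trivandrum"}}'
)

# Unstructured: free text with no predefined schema.
unstructured = "Meeting notes: discussed targets; photos and video attached."

print(structured[0]["name"])               # → Alice
print(semi_structured["address"]["city"])  # → Trivandrum
```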
2.2 APPLICATIONS
Big data has a wide range of applications. It helps gain instant insights from diverse data sources and has greatly improved business performance through real-time analytics. Big data technologies manage huge amounts of data, and proper risk analysis over such data helps mitigate risk and make smart decisions. Some important applications include:
Healthcare - Big data reduces the cost of treatment, since there are fewer chances of having to perform unnecessary diagnoses. It helps in predicting outbreaks of epidemics and in deciding what preventive measures could be taken to minimize their effects.
Banking and fraud detection - Big data is widely used for fraud detection in the banking sector. It detects illegal activities such as misuse of credit cards, misuse of debit cards, alteration of customer statistics, etc.
Transportation - Since the rise of big data, it has been used in various ways to make transportation more efficient and easy, including traffic control, route planning, intelligent transport systems, and congestion management by predicting traffic conditions.
Weather patterns - Weather-related data collected from different parts of the world can be used in weather forecasting, studying global warming, understanding the patterns of natural disasters so that necessary preparations can be made in times of crisis, predicting the availability of usable water around the world, and much more.
Data generation: Data can be generated from various distributed sources. The amount of data generated by humans and machines has exploded in the past few years. For example, 2.5 quintillion bytes of data are generated on the web every day, and 90 percent of the data in the world was generated in the past few years; Facebook alone generates 25 TB of new data every day. The data generated are usually large, diverse and complex, and are therefore hard for traditional systems to handle. The data generated are normally associated with a specific domain such as business, Internet, or research.
Data storage: This phase refers to storing and managing large-scale data sets. A data storage system consists of two parts, i.e., hardware infrastructure and data management. Hardware infrastructure refers to utilizing information and communications technology (ICT) resources for various tasks (such as distributed storage). Data management refers to the set of software deployed on top of the hardware infrastructure to manage and query large-scale data sets. It should also provide several interfaces to interact with and analyze the stored data.
Data processing: The data processing phase covers data collection, data transmission, pre-processing, and the extraction of useful information. Data collection is needed because data may come from diverse sources, i.e., sites that contain text, images and videos. In the data collection phase, data are acquired from a specific data production environment using dedicated data collection technology. In the data transmission phase, after collecting raw data from a specific data production environment, a high-speed transmission mechanism is needed to transmit the data into proper storage for various types of analytic applications. Finally, the pre-processing phase aims at removing meaningless and redundant parts of the data so that storage space is saved.
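As a minimal sketch of the pre-processing step just described (the record format is hypothetical), the function below drops empty entries and exact duplicates so that less storage is consumed:

```python
def preprocess(records):
    """Toy pre-processing: remove meaningless (empty) and redundant (duplicate) parts."""
    seen = set()
    cleaned = []
    for record in records:
        record = record.strip()
        if not record:        # meaningless part
            continue
        if record in seen:    # redundant part
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned

raw = ["temp=21C", "", "temp=21C", "   ", "humidity=40%"]
print(preprocess(raw))  # → ['temp=21C', 'humidity=40%']
```

Real pre-processing pipelines also normalize formats and filter noise, but the goal is the same: store less, keep the useful part.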
Outsourcing: To reduce capital and operational expenditure, organizations nowadays prefer to outsource their data to the cloud. However, outsourcing data to the cloud also means that customers lose physical control over their data. This loss of control has become one of the main causes of cloud insecurity. Outsourced data should therefore be verifiable to customers in terms of confidentiality and integrity.
Multi-tenancy: Virtualization has made it possible for multiple customers to share the same cloud platform. Data belonging to different cloud users may be placed on the same physical storage by some resource allocation policy.
Massive computation: Given the capability of cloud computing for handling massive data storage and intense computations, traditional mechanisms to protect an individual's privacy are not sufficient.
Big data are datasets which cannot be processed in conventional database ways due to their size. Although big data can be effectively utilized to better understand the world and to innovate in various aspects of human endeavor, the exploding amount of data has increased the potential for privacy breaches. For example, Amazon and Google can learn our shopping preferences and browsing habits. Social networking sites such as Facebook store all the information about our personal lives and social relationships. Popular video-sharing websites such as YouTube recommend videos to us based on our search history. With all the power driven by big data, the gathering, storing and reusing of our personal information for the purpose of commercial profit has posed a threat to our privacy and security. In 2006, AOL released 20 million search queries of 650,000 users for research purposes, removing only the AOL id and IP address. However, it took researchers only a couple of days to re-identify the users.
Personal information, when combined with external datasets, may lead to the inference of new facts about a user. Those facts may be secret and not supposed to be revealed to others. Personal information is sometimes collected and used to add value to a business; for example, an individual's shopping habits may reveal a lot of personal information. Moreover, sensitive data may be stored and processed in a location that is not properly secured, and data leakage may occur during the storage and processing phases.
Privacy - Information privacy is the right to have some control over how personal information is collected and used. It is the capacity of an individual or group to prevent information about themselves from becoming known to people other than those they give the information to. One serious user privacy issue is the identification of personal information during transmission over the Internet.
Security - Security is the practice of defending information and information assets, through the use of technology, processes and training, from unauthorized access, disclosure, disruption, modification, inspection, recording, and destruction.
Big data analytics attracts various organizations; however, many of them decide not to use these services because of the absence of standard security and privacy protection tools.
Confidentiality:
Confidentiality is the cornerstone of big data privacy and security. We need to protect data from leakage: a hacker who wants to obtain useful information from big data will attack the storage system to steal data. Confidentiality should be ensured during data collection, processing and management.
Efficiency:
Unlike traditional data, big data is characterized by velocity, volume and variety, and achieving efficiency requires high bandwidth. Efficiency is crucial in big data security considering these three Vs.
Authenticity:
Real-time data with veracity is needed to support wise decision making. Thus, authenticity is essential during the whole data lifetime, to ensure trusted data sources, reputable data processors and eligible data requesters. Authenticity helps avoid wrong analysis results.
Availability:
Big data should be available any time we need it; otherwise, it could lose its value, and the corresponding applications or services based on it could not work well. Therefore, availability should be ensured during the whole lifetime of big data.
Integrity:
To get valuable and accurate results, ensuring data integrity is essential. We cannot derive the right information from incomplete data, especially when the lost data are sensitive and useful. Therefore, integrity is required during the whole lifetime of big data.
1. Pre-Hadoop process validation: This step validates the data loading process. At this step, the privacy specifications characterize the sensitive pieces of data that can uniquely identify a user or an entity. Privacy terms can likewise indicate which pieces of data can be stored and for how long. Schema restrictions can take place at this step as well.
3. ETL process validation: Similar to step (2), the warehousing logic should be checked at this step for compliance with the privacy terms. Some data values may be aggregated anonymously or excluded from the warehouse if they indicate a high probability of identifying individuals.
4. Reports testing: Reports are another form of queries, possibly with higher visibility and a wider audience. Privacy terms that characterize purpose are fundamental to checking that sensitive data are not reported except for the specified uses.
5.3 COMMERCIAL TOOLS
IBM Threat Protection System is a robust and comprehensive set of tools and best practices, built on a framework spanning hardware, software and services, that addresses the intelligence, integration and expertise required for big data security and privacy.
HP ArcSight is another tool that can strengthen security intelligence; it can deliver advanced correlation, application protection, and network defenses to shield today's cloud IT infrastructure from sophisticated cyber threats.
Cisco's Threat Research, Analysis, and Communications (TRAC) tools are also efficient tools for providing security for big data.
6.1 ACCESS RESTRICTION
Anti-tracking extensions: When browsing the Internet, a user can utilize an anti-tracking extension to block trackers from collecting cookies. Popular anti-tracking extensions include Disconnect, DoNotTrackMe, Ghostery, etc. A major technology used for anti-tracking is Do Not Track (DNT), which enables users to opt out of tracking by websites they do not visit.
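At the HTTP level, DNT is simply a request header. As a minimal illustration using Python's standard library (the URL is a placeholder, and the request is built but never sent):

```python
import urllib.request

# Build, but do not send, a request carrying the Do Not Track opt-out header.
req = urllib.request.Request("https://example.com/", headers={"DNT": "1"})

# urllib stores header names with only the first letter capitalized.
print(req.get_header("Dnt"))  # → 1
```

Note that whether tracking actually stops depends entirely on the receiving site choosing to honor the header.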
Advertisement and script blockers: This type of browser extension can block advertisements on sites and kill scripts and widgets that send the user's data to unknown third parties. Example tools include AdBlock Plus, NoScript, FlashBlock, etc.
Encryption tools: To make sure a private online communication between two parties cannot be intercepted by third parties, a user can utilize encryption tools, such as MailCloak and TorChat, to encrypt emails, instant messages, or other types of web traffic. A user can also encrypt all of his Internet traffic by using a VPN (virtual private network) service.
Antivirus and anti-malware: Antivirus software usually deals with older, more established threats such as Trojans, viruses, and worms, protecting users from predictable yet still dangerous malware. Anti-malware software focuses on newer threats, such as polymorphic malware and malware delivered by zero-day exploits, protecting users from the latest, in-the-wild, and even more dangerous attacks.
6.2 FALSIFYING DATA
In some circumstances, it is not possible to prevent access to sensitive data. In that case, the data can be distorted using certain tools before they are fetched by a third party; if the data are distorted, the true information cannot easily be revealed. The following techniques are used by the data owner to falsify data:
Using a fake identity to create phony information: In 2012, Apple Inc. was granted a patent called "Techniques to pollute electronic profiling" which can help protect user privacy. This patent discloses a method for polluting the information gathered by "network eavesdroppers" by creating a false online identity of a principal agent, e.g. a service subscriber. The clone identity automatically carries out numerous online actions which are quite different from the user's true activities. When a network eavesdropper collects the data of a user who is utilizing this method, the eavesdropper is misled by the massive data created by the clone identity, and the real information about the user is buried under the manufactured phony information.
Using security tools to mask one's identity: When a user signs up for a web service or buys something online, he is often asked to provide information such as an email address, credit card number, or phone number. A browser extension called MaskMe, released by the online privacy company Abine, Inc. in 2013, can help the user create and manage aliases (or "Masks") of this personal information. Users can supply these aliases whenever such information is required, while the websites never obtain the real information. In this way, the user's privacy is protected.
7. STORAGE ON CLOUD
STORAGE PATH ENCRYPTION: This technique secures the storage of big data on clouds. The big data are first separated into many sequenced parts, and each part is stored on a different storage medium owned by a different cloud storage provider. To access the data, the different parts are first collected together from the different data centres and then restored into their original form before being presented to the data owner.
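The split-and-restore idea can be sketched as follows (a simplification that ignores the encryption and provider-placement details; the names are illustrative):

```python
def split_into_parts(data: bytes, n: int) -> list:
    """Split data into at most n sequenced parts, each notionally stored
    with a different cloud storage provider."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def restore(parts: list) -> bytes:
    """Collect the parts from the different data centres and restore the original."""
    return b"".join(parts)

original = b"sensitive big data record"
parts = split_into_parts(original, 3)
assert restore(parts) == original
print(len(parts))  # → 3
```

No single provider holds the whole record, so a breach at one data centre exposes only a fragment.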
7.2 INTEGRITY VERIFICATION OF BIG DATA STORAGE
Data owners can perform integrity verification themselves or delegate the task to a trusted third party. The basic framework of any integrity verification scheme consists of three participating parties: the client, the cloud storage server (CSS) and a third-party auditor (TPA). The client stores the data on the cloud, and the objective of the TPA is to verify the integrity of the data. The main life cycle of a remote integrity verification scheme consists of the following steps:
Setup and data upload: In order to verify the data without retrieving the actual file, the client needs to prepare verification metadata. The metadata are computed from the original data and are stored alongside the original data.
Authorization for the TPA: The TPA, who can verify data from the cloud server on the data owner's behalf, needs to be authorized by the data owner. There is a security risk if a third party can ask for indefinite integrity proofs over a certain dataset.
Challenge and verification of data storage: To verify the integrity of the data, a challenge message is sent to the server by the TPA on the client's behalf. The server computes a response based on the challenge message and sends it to the TPA. The TPA can then verify the response to find out whether the data are intact.
Data update: A data update occurs when some operations are performed on the data and the client needs to apply them to the cloud data storage. Common cloud data update operations include insert, delete, and modify.
Metadata update: After an update operation is performed on the data, the client will need to update the metadata accordingly with the existing keys. The metadata are updated in order to keep the data storage verifiable without retrieving all the data.
Verification of updated data: The client also needs to verify whether the data update was processed correctly, as the cloud cannot be fully trusted. This is an essential step to ensure that the updated data can still be verified correctly in the future.
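A heavily simplified sketch of the challenge-verify step: during setup the verifier precomputes (nonce, expected response) pairs as metadata, and the server later proves it still holds the data by hashing a fresh nonce together with its stored copy. Real schemes use homomorphic authenticators so that unlimited challenges are possible; here the number of challenges is fixed at setup.

```python
import hashlib
import os

def make_challenges(block: bytes, count: int = 3):
    """Setup: precompute (nonce, expected response) pairs as verification metadata."""
    challenges = []
    for _ in range(count):
        nonce = os.urandom(16)
        expected = hashlib.sha256(nonce + block).hexdigest()
        challenges.append((nonce, expected))
    return challenges

def server_response(stored_block: bytes, nonce: bytes) -> str:
    """The server proves possession by hashing the challenge nonce with its stored copy."""
    return hashlib.sha256(nonce + stored_block).hexdigest()

block = b"outsourced data block"
metadata = make_challenges(block)          # kept by the verifier (TPA)

nonce, expected = metadata[0]              # challenge message sent to the server
assert server_response(block, nonce) == expected        # intact data passes
assert server_response(b"tampered", nonce) != expected  # corruption is detected
print("verified")
```

Because each expected response is bound to a random nonce, the server cannot cache old answers; it must actually possess the block at challenge time.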
8.1 PPDP
During privacy-preserving data publishing (PPDP), the collected data may contain sensitive information about the data owner. Directly releasing the information for further processing may violate the privacy of the data owner; hence, the data must be modified in such a way that they do not disclose any personal information about the owner. On the other hand, the modified data should still be useful, so as not to violate the original purpose of data publishing. The original data are assumed to be sensitive and private and to consist of multiple records. Each record may consist of the following four types of attributes:
Identifier (ID): Attributes which can be used to uniquely identify a person, e.g., name, driving licence number, and mobile number.
Quasi-identifier (QID): Attributes that cannot uniquely identify a record by themselves but which, if linked with some external dataset, may be able to re-identify the records.
Sensitive attribute (SA): Attributes that a person may want to conceal, e.g., salary and disease.
Non-sensitive attribute (NSA): Attributes which, if disclosed, will not violate the privacy of the user. All attributes other than identifiers, quasi-identifiers and sensitive attributes are classified as non-sensitive attributes.
The data are anonymized by removing the identifiers and modifying the quasi-identifiers before publishing or storing them for further processing. As a result of anonymization, the identity of the data owner and the sensitive values are hidden from adversaries. How much the data should be anonymized mainly depends on how much privacy we want to preserve. De-identification is a traditional technique for privacy-preserving data mining. There are three privacy-preserving methods of de-identification, namely k-anonymity, l-diversity and t-closeness: k-anonymity is used to prevent record linkage, l-diversity to prevent attribute linkage and record linkage, and t-closeness to prevent probabilistic attacks and attribute linkage.
ANONYMIZATION TECHNIQUES:
To preserve privacy, one of the following anonymization operations is applied to the data:
Suppression: In suppression, some values are replaced with a special character (e.g., '*'), which indicates that the replaced value is not disclosed. Examples of suppression schemes include record suppression, value suppression, and cell suppression.
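Cell suppression can be sketched in a few lines on a hypothetical record:

```python
def suppress(record, fields):
    """Cell suppression: replace the chosen fields' values with the special character '*'."""
    return {f: ("*" if f in fields else v) for f, v in record.items()}

row = {"name": "Gautham", "age": 22, "disease": "flu"}
print(suppress(row, {"name"}))  # → {'name': '*', 'age': 22, 'disease': 'flu'}
```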
Table 7. Anonymized database with respect to the attributes 'Age', 'Gender' and 'State of domicile'.
PRIVACY-UTILITY TRADE-OFF:
A high level of data anonymization indicates that privacy is well protected. On the other hand, however, it also reduces the utility of the data, meaning that less value can be extracted from them. Therefore, balancing the trade-off between privacy and utility is very important in big data applications. The reduction in data utility is represented by information loss. Various methods have been proposed for measuring information loss; examples include minimal distortion, the discernibility metric, the normalized average equivalence class size metric, the weighted certainty penalty, and information-theoretic metrics. To balance privacy and utility, PPDP algorithms usually take a greedy approach: they generate multiple tables using the given metrics of privacy preservation and information loss, all of which satisfy the requirements of a specific privacy model during the anonymization process, and output the table with the minimum information loss.
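One of the metrics named above, the discernibility metric, is simple enough to sketch: each record is penalized by the size of the equivalence class (group of records sharing the same quasi-identifier values) it falls into, so the total is the sum of squared class sizes, and a smaller value means less information loss. The records below are illustrative:

```python
from collections import Counter

def discernibility(records, qid_fields):
    """Discernibility metric: sum of squared equivalence-class sizes
    over the quasi-identifier attributes."""
    classes = Counter(tuple(r[f] for f in qid_fields) for r in records)
    return sum(size * size for size in classes.values())

rows = [
    {"age": "2*", "state": "Kerala"},
    {"age": "2*", "state": "Kerala"},
    {"age": "3*", "state": "Kerala"},
]
print(discernibility(rows, ["age", "state"]))  # → 5  (2**2 + 1**2)
```

A greedy PPDP algorithm would compute such a score for each candidate anonymized table and keep the one with the lowest value that still satisfies the privacy model.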
Clustering is one of the popular data processing techniques, owing to its capability of analyzing unfamiliar data. The fundamental idea behind clustering is to separate unlabelled input data into several different groups.
Classification is a technique for identifying to which predefined group a new input datum belongs. Like clustering algorithms, classification algorithms are traditionally designed to work in centralized environments.
While clustering and classification try to group the input data, association rules are designed to find important relationships or patterns within the input data.
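To make the clustering idea concrete, here is a minimal one-dimensional k-means sketch on invented data: values are repeatedly assigned to their nearest center, and each center then moves to the mean of its group.

```python
def kmeans_1d(values, centers, iters=10):
    """Minimal 1-D k-means: separate unlabelled input data into groups."""
    for _ in range(iters):
        # Assignment step: each value joins its nearest center's group.
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(c - v))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # two groups, centered near 1.0 and 9.5
```

Privacy-preserving variants of this algorithm run the same two steps over distributed, encrypted or perturbed data so that no party sees the raw values.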
The adoption of big data in healthcare significantly increases security and patient privacy concerns. At the outset, patient information is stored in data centres with varying levels of security, and traditional security solutions cannot be directly applied to large and inherently diverse data sets. With the increase in popularity of healthcare cloud solutions, the complexity of securing massive, distributed Software as a Service (SaaS) solutions grows with the variety of data sources and formats. Hence, big data governance, real-time security analytics, privacy-preserving analytics, etc. are necessary before exposing data to analytics.
To ensure that data are only accessible by authorized users and are transferred securely end to end, access control methods and different encryption techniques, such as identity-based encryption (IBE), attribute-based encryption (ABE), and proxy re-encryption (PRE), are used. The main problem with encrypting large datasets using existing techniques is that the whole dataset has to be retrieved and decrypted before further operations can be performed. To solve this kind of problem, we need encryption techniques which allow data sharing between different parties without a decryption and re-encryption process.
Data are anonymized by removing personal details to preserve the privacy of users, the intent being that an individual cannot be identified from the anonymized data alone. Thus, we need to propose new privacy and utility metrics. Furthermore, data anonymization is a cumbersome process, and it needs to be automated to cope with the growing three Vs.
As our personal data are gradually collected and stored on centralized cloud servers over time, we need to understand the associated privacy risks. The concept of centralized collection and storage of personal data should be challenged. To adopt the view of data distribution, we need algorithms that are capable of working over extreme data distribution and of building models that learn in a big data context.
Machine learning and data mining should be adapted to unleash the full potential of the collected data. To protect privacy, machine learning algorithms such as classification, clustering and association rule mining need to be deployed in a privacy-preserving way.
10. CONCLUSION
Big data is a large amount of data which is unorganized and unstructured, and big data privacy is a very important issue in organizing it. Due to recent technological developments, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other sources is increasing drastically day by day. The amount of data grows every day, and it is impossible to imagine the next generation of applications without data-driven algorithms; the accompanying privacy and security concerns are also growing day by day. A number of techniques are therefore deployed to ensure the privacy and security of this huge amount of data, and more challenging areas need to be identified and solutions to the privacy problems found.
REFERENCES
[1] J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. Zürich, Switzerland: McKinsey Global Inst., Jun. 2011, pp. 1-137.
[2] B. Matturdi, X. Zhou, S. Li, and F. Lin, "Big data security and privacy: A review," China Commun., vol. 11, no. 14, pp. 135-145, Apr. 2014.
[3] J. Gantz and D. Reinsel, "Extracting value from chaos," in Proc. IDC iView, Jun. 2011, pp. 1-12.
[4] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Proc. IEEE Int. Conf. Contemp. Comput., Aug. 2013, pp. 404-409.
[5] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, "Information security in big data: Privacy and data mining," IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.
[6] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable systems for big data analytics: A technology tutorial," IEEE Access, vol. 2, pp. 652-687, Jul. 2014.
[7] Z. Xiao and Y. Xiao, "Security and privacy in cloud computing," IEEE Commun. Surveys Tuts., vol. 15, no. 2, pp. 843-859, May 2013.