SEMINAR REPORT
On
PROTECTION OF BIG DATA PRIVACY
By
GAUTHAM KRISHNA
REG NO: TRV15IT027
November 2018
DEPARTMENT OF INFORMATION TECHNOLOGY
GOVERNMENT ENGINEERING COLLEGE BARTON HILL
TRIVANDRUM
THIRUVANANTHAPURAM-695035
CERTIFICATE
This is to certify that this seminar report entitled "Protection of Big Data Privacy" is a bona fide record of the work done by Gautham Krishna under our guidance towards partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Information Technology of APJ Abdul Kalam Technological University during the year 2018.
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this seminar would be incomplete without mention of the people who made it possible, without whose constant guidance and encouragement it would not have been possible to prepare this report. First, I would like to express my sincere gratitude and heartfelt indebtedness to our Principal, Dr. Rajasree M S, for providing all the necessary requirements for this work.
I also place on record my gratitude to my parents, friends and, above all, the Lord Almighty, who made my humble venture a successful one.
Place: Thiruvananthapuram
ABSTRACT
In recent years, big data has become a hot research topic. The increasing amount of big data also increases the chance of breaching the privacy of individuals. Since big data requires high computational power and large storage, distributed systems are used, and as multiple parties are involved in these systems, the risk of privacy violation is increased. A number of privacy-preserving mechanisms have been developed for privacy protection at different stages (e.g., data generation, data storage, and data processing) of the big data life cycle. This report provides a comprehensive overview of the privacy-preservation mechanisms in big data and presents the challenges faced by existing mechanisms. In particular, the report illustrates the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. The challenges and future research directions related to privacy preservation in big data are also discussed, and a comparative study of various recent techniques for big data privacy is included.
TABLE OF CONTENTS
1. INTRODUCTION
2. BIG DATA OVERVIEW
2.2 APPLICATIONS
6.1 ACCESS RESTRICTION
7. STORAGE ON CLOUD
7.2 INTEGRITY VERIFICATION OF BIG DATA STORAGE
8.1 PPDP
10. CONCLUSION
REFERENCES

LIST OF TABLES
7. ANONYMIZED DATABASE
1. INTRODUCTION
Due to recent technological developments, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other sources is increasing drastically day by day. The data generation rate is growing so rapidly that it is becoming extremely difficult to handle using traditional methods or systems. The enormous volume of data produced from various sources, in multiple formats and at very high speed, is referred to as big data. Big data, if captured and analyzed in a timely manner, can be converted into actionable insights of significant value, and it has been a very active research area for the last couple of years. Big data analytics is the term used to describe the process of examining massive amounts of complex data in order to reveal hidden patterns or identify hidden correlations. Although big data can be effectively utilized to better understand the world and to innovate in various aspects of human endeavor, the exploding amount of data has increased the potential for privacy breaches. A number of privacy-preserving mechanisms have been developed for privacy protection at different stages (for example, data generation, data storage, and data processing) of the big data life cycle. This report discusses the major concerns related to big data privacy.
2. BIG DATA OVERVIEW
Big data analytics is the process of examining massive amounts of complex data in order to reveal hidden patterns or identify hidden correlations. It largely involves collecting data from different sources, manipulating it so that it becomes available for consumption by analysts, and finally delivering data products useful to the organization's business.
a) Structured data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. This covers all data that can be stored in an SQL database in tables with rows and columns. Such data have relational keys and can easily be mapped into pre-designed fields.
b) Unstructured data
Any data with an unknown form or structure is classified as unstructured data. Unstructured data may have its own internal structure, but it does not fit neatly into a spreadsheet or database. Most business interactions, in fact, are unstructured in nature. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
c) Semi-structured data
Semi-structured data can contain elements of both of the above forms. It is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other data tables. For example, NoSQL databases are considered semi-structured.
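The three forms can be illustrated with a short sketch (the records and field names below are invented for illustration): structured data follows a fixed row-and-column schema, semi-structured data such as JSON carries its own flexible structure, and unstructured data is free text.

```python
import csv
import io
import json

# Structured: fixed schema with rows and columns, as in an SQL table.
structured = list(csv.DictReader(io.StringIO("id,name,age\n1,Alice,30\n2,Bob,25\n")))

# Semi-structured: self-describing JSON with nested, flexible fields,
# as found in NoSQL databases.
semi_structured = json.loads(
    '{"id": 3, "name": "Carol", "address": {"city": "Trivandrum"}}'
)

# Unstructured: free text with no predefined schema.
unstructured = "Meeting notes: discussed targets; photos and video attached."

print(structured[0]["name"])               # → Alice
print(semi_structured["address"]["city"])  # → Trivandrum
```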
2.2 APPLICATIONS
Big data has a wide range of applications. It helps gain instant insights from diverse data sources and has greatly improved business performance through real-time analytics. Big data technologies manage huge amounts of data, and proper risk analysis over such data helps mitigate risk and make smart decisions. Some important applications include:
Healthcare - Big data reduces the cost of treatment, since there are fewer chances of having to perform unnecessary diagnoses. It helps in predicting outbreaks of epidemics and in deciding what preventive measures could be taken to minimize their effects.
Banking and fraud detection - Big data is widely used for fraud detection in the banking sector. It detects illegal activities such as misuse of credit cards, misuse of debit cards, alteration of customer statistics, etc.
Transportation - Since the rise of big data, it has been used in various ways to make transportation more efficient and easy, including traffic control, route planning, intelligent transport systems, and congestion management by predicting traffic conditions.
Weather patterns - Weather-related data collected from different parts of the world can be used in weather forecasting, studying global warming, understanding the patterns of natural disasters so that necessary preparations can be made in times of crisis, predicting the availability of usable water around the world, and much more.
Data generation: Data can be generated from various distributed sources. The amount of data generated by humans and machines has exploded in the past few years. For example, 2.5 quintillion bytes of data are generated on the web every day, and 90 percent of the data in the world was generated in the past few years; Facebook alone generates 25 TB of new data every day. The data generated are usually large, diverse and complex, and are therefore hard for traditional systems to handle. The data generated are normally associated with a specific domain such as business, Internet, or research.
Data storage: This phase refers to storing and managing large-scale data sets. A data storage system consists of two parts, i.e., hardware infrastructure and data management. Hardware infrastructure refers to utilizing information and communications technology (ICT) resources for various tasks (such as distributed storage). Data management refers to the set of software deployed on top of the hardware infrastructure to manage and query large-scale data sets. It should also provide several interfaces to interact with and analyze the stored data.
Data processing: The data processing phase covers data collection, data transmission, pre-processing, and the extraction of useful information. Data collection is needed because data may come from diverse sources, i.e., sites that contain text, images and videos. In the data collection phase, data are acquired from a specific data production environment using dedicated data collection technology. In the data transmission phase, after collecting raw data from a specific data production environment, a high-speed transmission mechanism is needed to transmit the data into proper storage for various types of analytic applications. Finally, the pre-processing phase aims at removing meaningless and redundant parts of the data so that storage space is saved.
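As a minimal sketch of the pre-processing step just described (the record format is hypothetical), the function below drops empty entries and exact duplicates so that less storage is consumed:

```python
def preprocess(records):
    """Toy pre-processing: remove meaningless (empty) and redundant (duplicate) parts."""
    seen = set()
    cleaned = []
    for record in records:
        record = record.strip()
        if not record:        # meaningless part
            continue
        if record in seen:    # redundant part
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned

raw = ["temp=21C", "", "temp=21C", "   ", "humidity=40%"]
print(preprocess(raw))  # → ['temp=21C', 'humidity=40%']
```

Real pre-processing pipelines also normalize formats and filter noise, but the goal is the same: store less, keep the useful part.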
Outsourcing: To reduce capital and operational expenditure, organizations nowadays prefer to outsource their data to the cloud. However, outsourcing data to the cloud also means that customers lose physical control over their data. This loss of control has become one of the main causes of cloud insecurity. Outsourced data should therefore be verifiable to customers in terms of confidentiality and integrity.
Multi-tenancy: Virtualization has made it possible for multiple customers to share the same cloud platform. Data belonging to different cloud users may be placed on the same physical storage by some resource allocation policy.
Massive computation: Given the capability of cloud computing for handling massive data storage and intense computations, traditional mechanisms to protect an individual's privacy are not sufficient.
Big data are datasets which cannot be processed in conventional database ways due to their size. Although big data can be effectively utilized to better understand the world and to innovate in various aspects of human endeavor, the exploding amount of data has increased the potential for privacy breaches. For example, Amazon and Google can learn our shopping preferences and browsing habits. Social networking sites such as Facebook store all the information about our personal lives and social relationships. Popular video-sharing websites such as YouTube recommend videos to us based on our search history. With all the power driven by big data, the gathering, storing and reusing of our personal information for the purpose of commercial profit has posed a threat to our privacy and security. In 2006, AOL released 20 million search queries of 650,000 users for research purposes, removing only the AOL id and IP address. However, it took researchers only a couple of days to re-identify the users.
Personal information, when combined with external datasets, may lead to the inference of new facts about a user. Those facts may be secret and not supposed to be revealed to others. Personal information is sometimes collected and used to add value to a business; for example, an individual's shopping habits may reveal a lot of personal information. Moreover, sensitive data may be stored and processed in a location that is not properly secured, and data leakage may occur during the storage and processing phases.
Privacy - Information privacy is the right to have some control over how personal information is collected and used. It is the capacity of an individual or group to prevent information about themselves from becoming known to people other than those they give the information to. One serious user privacy issue is the identification of personal information during transmission over the Internet.
Security - Security is the practice of defending information and information assets, through the use of technology, processes and training, from unauthorized access, disclosure, disruption, modification, inspection, recording, and destruction.
Big data analytics attracts various organizations; however, many of them decide not to use these services because of the absence of standard security and privacy protection tools.
Confidentiality:
Confidentiality is the cornerstone of big data privacy and security. We need to protect data from leakage: a hacker who wants to obtain useful information from big data will attack the storage system to steal data. Confidentiality should be ensured during data collection, processing and management.
Efficiency:
Unlike traditional data, big data is characterized by velocity, volume and variety, and achieving efficiency requires high bandwidth. Efficiency is crucial in big data security considering these three Vs.
Authenticity:
Real-time data with veracity is needed to support wise decision making. Thus, authenticity is essential during the whole data lifetime, to ensure trusted data sources, reputable data processors and eligible data requesters. Authenticity helps avoid wrong analysis results.
Availability:
Big data should be available any time we need it; otherwise, it could lose its value, and the corresponding applications or services based on it could not work well. Therefore, availability should be ensured during the whole lifetime of big data.
Integrity:
To get valuable and accurate results, ensuring data integrity is essential. We cannot derive the right information from incomplete data, especially when the lost data are sensitive and useful. Therefore, integrity is required during the whole lifetime of big data.
1. Pre-Hadoop process validation: This step validates the data loading process. At this step, the privacy specifications characterize the sensitive pieces of data that can uniquely identify a user or an entity. Privacy terms can likewise indicate which pieces of data can be stored and for how long. Schema restrictions can take place at this step as well.
3. ETL process validation: Similar to step (2), the warehousing logic should be checked at this step for compliance with the privacy terms. Some data values may be aggregated anonymously or excluded from the warehouse if they indicate a high probability of identifying individuals.
4. Reports testing: Reports are another form of queries, possibly with higher visibility and a wider audience. Privacy terms that characterize purpose are fundamental to checking that sensitive data are not reported except for the specified uses.
5.3 COMMERCIAL TOOLS
IBM Threat Protection System is a robust and comprehensive set of tools and best practices, built on a framework spanning hardware, software and services, that addresses the intelligence, integration and expertise required for big data security and privacy.
HP ArcSight is another tool that can strengthen security intelligence; it can deliver advanced correlation, application protection, and network defenses to shield today's cloud IT infrastructure from sophisticated cyber threats.
Cisco's Threat Research, Analysis, and Communications (TRAC) tools are also efficient tools for providing security for big data.
6.1 ACCESS RESTRICTION
Anti-tracking extensions: When browsing the Internet, a user can utilize an anti-tracking extension to block trackers from collecting cookies. Popular anti-tracking extensions include Disconnect, DoNotTrackMe, Ghostery, etc. A major technology used for anti-tracking is Do Not Track (DNT), which enables users to opt out of tracking by websites they do not visit.
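At the HTTP level, DNT is simply a request header. As a minimal illustration using Python's standard library (the URL is a placeholder, and the request is built but never sent):

```python
import urllib.request

# Build, but do not send, a request carrying the Do Not Track opt-out header.
req = urllib.request.Request("https://example.com/", headers={"DNT": "1"})

# urllib stores header names with only the first letter capitalized.
print(req.get_header("Dnt"))  # → 1
```

Note that whether tracking actually stops depends entirely on the receiving site choosing to honor the header.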
Advertisement and script blockers: This type of browser extension can block advertisements on sites and kill scripts and widgets that send the user's data to unknown third parties. Example tools include AdBlock Plus, NoScript, FlashBlock, etc.
Encryption tools: To make sure a private online communication between two parties cannot be intercepted by third parties, a user can utilize encryption tools, such as MailCloak and TorChat, to encrypt emails, instant messages, or other types of web traffic. A user can also encrypt all of his Internet traffic by using a VPN (virtual private network) service.
Antivirus and anti-malware: Antivirus software usually deals with older, more established threats such as Trojans, viruses, and worms, protecting users from predictable yet still dangerous malware. Anti-malware software focuses on newer threats, such as polymorphic malware and malware delivered by zero-day exploits, protecting users from the latest, in-the-wild, and even more dangerous attacks.
6.2 FALSIFYING DATA
In some circumstances, it is not possible to prevent access to sensitive data. In that case, the data can be distorted using certain tools before they are fetched by a third party; if the data are distorted, the true information cannot easily be revealed. The following techniques are used by the data owner to falsify data:
Using a fake identity to create phony information: In 2012, Apple Inc. was granted a patent called "Techniques to pollute electronic profiling" which can help protect user privacy. This patent discloses a method for polluting the information gathered by "network eavesdroppers" by creating a false online identity of a principal agent, e.g. a service subscriber. The clone identity automatically carries out numerous online actions which are quite different from the user's true activities. When a network eavesdropper collects the data of a user who is utilizing this method, the eavesdropper is misled by the massive data created by the clone identity, and the real information about the user is buried under the manufactured phony information.
Using security tools to mask one's identity: When a user signs up for a web service or buys something online, he is often asked to provide information such as an email address, credit card number, or phone number. A browser extension called MaskMe, released by the online privacy company Abine, Inc. in 2013, can help the user create and manage aliases (or "Masks") of this personal information. Users can supply these aliases whenever such information is required, while the websites never obtain the real information. In this way, the user's privacy is protected.
7. STORAGE ON CLOUD
STORAGE PATH ENCRYPTION: This technique secures the storage of big data on clouds. The big data are first separated into many sequenced parts, and each part is stored on a different storage medium owned by a different cloud storage provider. To access the data, the different parts are first collected together from the different data centres and then restored into their original form before being presented to the data owner.
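The split-and-restore idea can be sketched as follows (a simplification that ignores the encryption and provider-placement details; the names are illustrative):

```python
def split_into_parts(data: bytes, n: int) -> list:
    """Split data into at most n sequenced parts, each notionally stored
    with a different cloud storage provider."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def restore(parts: list) -> bytes:
    """Collect the parts from the different data centres and restore the original."""
    return b"".join(parts)

original = b"sensitive big data record"
parts = split_into_parts(original, 3)
assert restore(parts) == original
print(len(parts))  # → 3
```

No single provider holds the whole record, so a breach at one data centre exposes only a fragment.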
7.2 INTEGRITY VERIFICATION OF BIG DATA STORAGE
Data owners can perform integrity verification themselves or delegate the task to a trusted third party. The basic framework of any integrity verification scheme consists of three participating parties: the client, the cloud storage server (CSS) and a third-party auditor (TPA). The client stores the data on the cloud, and the objective of the TPA is to verify the integrity of the data. The main life cycle of a remote integrity verification scheme consists of the following steps:
Setup and data upload: In order to verify the data without retrieving the actual file, the client needs to prepare verification metadata. The metadata are computed from the original data and are stored alongside the original data.
Authorization for the TPA: The TPA, who can verify data from the cloud server on the data owner's behalf, needs to be authorized by the data owner. There is a security risk if a third party can ask for indefinite integrity proofs over a certain dataset.
Challenge and verification of data storage: To verify the integrity of the data, a challenge message is sent to the server by the TPA on the client's behalf. The server computes a response based on the challenge message and sends it to the TPA. The TPA can then verify the response to find out whether the data are intact.
Data update: A data update occurs when some operations are performed on the data and the client needs to apply them to the cloud data storage. Common cloud data update operations include insert, delete, and modify.
Metadata update: After an update operation is performed on the data, the client will need to update the metadata accordingly with the existing keys. The metadata are updated in order to keep the data storage verifiable without retrieving all the data.
Verification of updated data: The client also needs to verify whether the data update was processed correctly, as the cloud cannot be fully trusted. This is an essential step to ensure that the updated data can still be verified correctly in the future.
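A heavily simplified sketch of the challenge-verify step: during setup the verifier precomputes (nonce, expected response) pairs as metadata, and the server later proves it still holds the data by hashing a fresh nonce together with its stored copy. Real schemes use homomorphic authenticators so that unlimited challenges are possible; here the number of challenges is fixed at setup.

```python
import hashlib
import os

def make_challenges(block: bytes, count: int = 3):
    """Setup: precompute (nonce, expected response) pairs as verification metadata."""
    challenges = []
    for _ in range(count):
        nonce = os.urandom(16)
        expected = hashlib.sha256(nonce + block).hexdigest()
        challenges.append((nonce, expected))
    return challenges

def server_response(stored_block: bytes, nonce: bytes) -> str:
    """The server proves possession by hashing the challenge nonce with its stored copy."""
    return hashlib.sha256(nonce + stored_block).hexdigest()

block = b"outsourced data block"
metadata = make_challenges(block)          # kept by the verifier (TPA)

nonce, expected = metadata[0]              # challenge message sent to the server
assert server_response(block, nonce) == expected        # intact data passes
assert server_response(b"tampered", nonce) != expected  # corruption is detected
print("verified")
```

Because each expected response is bound to a random nonce, the server cannot cache old answers; it must actually possess the block at challenge time.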
8.1 PPDP
During privacy-preserving data publishing (PPDP), the collected data may contain sensitive information about the data owner. Directly releasing the information for further processing may violate the privacy of the data owner; hence, the data must be modified in such a way that they do not disclose any personal information about the owner. On the other hand, the modified data should still be useful, so as not to violate the original purpose of data publishing. The original data are assumed to be sensitive and private and to consist of multiple records. Each record may consist of the following four types of attributes:
Identifier (ID): Attributes which can be used to uniquely identify a person, e.g., name, driving licence number, and mobile number.
Quasi-identifier (QID): Attributes that cannot uniquely identify a record by themselves but which, if linked with some external dataset, may be able to re-identify the records.
Sensitive attribute (SA): Attributes that a person may want to conceal, e.g., salary and disease.
Non-sensitive attribute (NSA): Attributes which, if disclosed, will not violate the privacy of the user. All attributes other than identifiers, quasi-identifiers and sensitive attributes are classified as non-sensitive attributes.
The data are anonymized by removing the identifiers and modifying the quasi-identifiers before publishing or storing them for further processing. As a result of anonymization, the identity of the data owner and the sensitive values are hidden from adversaries. How much the data should be anonymized mainly depends on how much privacy we want to preserve. De-identification is a traditional technique for privacy-preserving data mining. There are three privacy-preserving methods of de-identification, namely k-anonymity, l-diversity and t-closeness: k-anonymity is used to prevent record linkage, l-diversity to prevent attribute linkage and record linkage, and t-closeness to prevent probabilistic attacks and attribute linkage.
ANONYMIZATION TECHNIQUES:
To preserve privacy, one of the following anonymization operations is applied to the data:
Suppression: In suppression, some values are replaced with a special character (e.g., '*'), which indicates that the replaced value is not disclosed. Examples of suppression schemes include record suppression, value suppression, and cell suppression.
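Cell suppression can be sketched in a few lines on a hypothetical record:

```python
def suppress(record, fields):
    """Cell suppression: replace the chosen fields' values with the special character '*'."""
    return {f: ("*" if f in fields else v) for f, v in record.items()}

row = {"name": "Gautham", "age": 22, "disease": "flu"}
print(suppress(row, {"name"}))  # → {'name': '*', 'age': 22, 'disease': 'flu'}
```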
Table 7. Anonymized database with respect to the attributes 'Age', 'Gender' and 'State of domicile'.
PRIVACY-UTILITY TRADE-OFF:
A high level of data anonymization indicates that privacy is well protected. On the other hand, however, it also reduces the utility of the data, meaning that less value can be extracted from them. Therefore, balancing the trade-off between privacy and utility is very important in big data applications. The reduction in data utility is represented by information loss. Various methods have been proposed for measuring information loss; examples include minimal distortion, the discernibility metric, the normalized average equivalence class size metric, the weighted certainty penalty, and information-theoretic metrics. To balance privacy and utility, PPDP algorithms usually take a greedy approach: they generate multiple tables using the given metrics of privacy preservation and information loss, all of which satisfy the requirements of a specific privacy model during the anonymization process, and output the table with the minimum information loss.
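One of the metrics named above, the discernibility metric, is simple enough to sketch: each record is penalized by the size of the equivalence class (group of records sharing the same quasi-identifier values) it falls into, so the total is the sum of squared class sizes, and a smaller value means less information loss. The records below are illustrative:

```python
from collections import Counter

def discernibility(records, qid_fields):
    """Discernibility metric: sum of squared equivalence-class sizes
    over the quasi-identifier attributes."""
    classes = Counter(tuple(r[f] for f in qid_fields) for r in records)
    return sum(size * size for size in classes.values())

rows = [
    {"age": "2*", "state": "Kerala"},
    {"age": "2*", "state": "Kerala"},
    {"age": "3*", "state": "Kerala"},
]
print(discernibility(rows, ["age", "state"]))  # → 5  (2**2 + 1**2)
```

A greedy PPDP algorithm would compute such a score for each candidate anonymized table and keep the one with the lowest value that still satisfies the privacy model.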
Clustering is one of the popular data processing techniques, owing to its capability of analyzing unfamiliar data. The fundamental idea behind clustering is to separate unlabelled input data into several different groups.
Classification is a technique for identifying to which predefined group a new input datum belongs. Like clustering algorithms, classification algorithms are traditionally designed to work in centralized environments.
While clustering and classification try to group the input data, association rules are designed to find important relationships or patterns within the input data.
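To make the clustering idea concrete, here is a minimal one-dimensional k-means sketch on invented data: values are repeatedly assigned to their nearest center, and each center then moves to the mean of its group.

```python
def kmeans_1d(values, centers, iters=10):
    """Minimal 1-D k-means: separate unlabelled input data into groups."""
    for _ in range(iters):
        # Assignment step: each value joins its nearest center's group.
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(c - v))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # two groups, centered near 1.0 and 9.5
```

Privacy-preserving variants of this algorithm run the same two steps over distributed, encrypted or perturbed data so that no party sees the raw values.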
The adoption of big data in healthcare significantly increases security and patient privacy concerns. At the outset, patient information is stored in data centres with varying levels of security, and traditional security solutions cannot be directly applied to large and inherently diverse data sets. With the increase in popularity of healthcare cloud solutions, the complexity of securing massive, distributed Software as a Service (SaaS) solutions grows with the variety of data sources and formats. Hence, big data governance, real-time security analytics, privacy-preserving analytics, etc. are necessary before exposing data to analytics.
To ensure that data are only accessible by authorized users and are transferred securely end to end, access control methods and different encryption techniques, such as identity-based encryption (IBE), attribute-based encryption (ABE), and proxy re-encryption (PRE), are used. The main problem with encrypting large datasets using existing techniques is that the whole dataset has to be retrieved and decrypted before further operations can be performed. To solve this kind of problem, we need encryption techniques which allow data sharing between different parties without a decryption and re-encryption process.
Data are anonymized by removing personal details to preserve the privacy of users, the intent being that an individual cannot be identified from the anonymized data alone. Thus, we need to propose new privacy and utility metrics. Furthermore, data anonymization is a cumbersome process, and it needs to be automated to cope with the growing three Vs.
As our personal data are gradually collected and stored on centralized cloud servers over time, we need to understand the associated privacy risks. The concept of centralized collection and storage of personal data should be challenged. To adopt the view of data distribution, we need algorithms that are capable of working over extreme data distribution and of building models that learn in a big data context.
Machine learning and data mining should be adapted to unleash the full potential of the collected data. To protect privacy, machine learning algorithms such as classification, clustering and association rule mining need to be deployed in a privacy-preserving way.
10. CONCLUSION
Big data is a large amount of data which is unorganized and unstructured, and big data privacy is a very important issue in organizing it. Due to recent technological developments, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other sources is increasing drastically day by day. The amount of data grows every day, and it is impossible to imagine the next generation of applications without data-driven algorithms; the accompanying privacy and security concerns are also growing day by day. A number of techniques are therefore deployed to ensure the privacy and security of this huge amount of data, and more challenging areas need to be identified and solutions to the privacy problems found.
REFERENCES
[1] J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. Zürich, Switzerland: McKinsey Global Inst., Jun. 2011, pp. 1-137.
[2] B. Matturdi, X. Zhou, S. Li, and F. Lin, "Big data security and privacy: A review," China Commun., vol. 11, no. 14, pp. 135-145, Apr. 2014.
[3] J. Gantz and D. Reinsel, "Extracting value from chaos," in Proc. IDC iView, Jun. 2011, pp. 1-12.
[4] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Proc. IEEE Int. Conf. Contemp. Comput., Aug. 2013, pp. 404-409.
[5] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, "Information security in big data: Privacy and data mining," IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.
[6] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable systems for big data analytics: A technology tutorial," IEEE Access, vol. 2, pp. 652-687, Jul. 2014.
[7] Z. Xiao and Y. Xiao, "Security and privacy in cloud computing," IEEE Commun. Surveys Tuts., vol. 15, no. 2, pp. 843-859, May 2013.