Data Leakage Detection

DATA LEAKAGE DETECTION
MAMTA SINGH¹, PRITI TRIPATHI² & RENUKA SINGH³

¹,²,³ Department of Computer Science and Engineering, Institute of Technology and Management, GIDA, Gorakhpur, India
Abstract- This paper presents the concept of data leakage, its causes, and techniques to prevent and detect it. Data is highly valuable, so it should not be leaked or altered. In the IT field, large databases are shared with many people at once, and this sharing creates a high risk of data exposure, leakage, or alteration. To address these problems, a data leakage detection system has been proposed. This paper gives a brief overview of data leakage detection and a methodology for identifying the parties who leak data. Data leakage is a major obstacle to data distribution. A distributor gives sensitive data to a set of supposedly trusted agents for their use. Some of that data is later leaked and found in unauthorized places. When data turns up somewhere other than with the agents who legitimately received it, the distributor needs to identify the guilty agents. Traditionally, data leakage is handled by watermarking, which requires modifying the data. In this paper, we analyze a guilt model that detects guilty agents using data allocation strategies, without any modification of the original data. A guilty agent is one who leaked a portion of the distributed data. The idea is to distribute the data intelligently to agents, based on sample and explicit data requests, in order to improve the chance of detecting guilty agents. The distributor's algorithm also uses fake objects, which further improve the chance of detecting guilty agents.
Keywords- Fake Object, Data Allocation Strategies, Data Leakage.
I. INTRODUCTION
Data leakage is the unauthorized transmission of data or information from within an organization to an external destination or recipient [8][12]. It may be defined as the accidental or intentional distribution of private or sensitive data to an unauthorized entity. The sensitive data of companies and organizations includes intellectual property, financial information, patient information, personal credit card data, and other information depending on the business and the industry. Furthermore, in many cases, sensitive data is shared among various stakeholders such as employees working from outside the organizational premises (e.g., on laptops), business partners, and customers. This increases the risk of confidential information falling into unauthorized hands [2].
In the course of doing business, data must sometimes be handed over to supposedly trusted third parties for enhancement or other operations. For example [11], a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so its data must be given to various other companies. The owner of the data is termed the distributor, and the supposedly trusted third parties are called agents. In this work, our goal is to identify the guilty agent when the distributor's sensitive data have been leaked
by some agents.
Perturbation and watermarking are techniques that can help in such situations. Perturbation is a very useful technique in which the data is modified and made less sensitive before being handed to agents. For example, one can add random noise to certain attributes, or replace exact values with ranges. However, in some cases it is important not to alter the original records. If an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients [12][15].
Traditionally, leakage detection is handled by watermarking: a unique code is embedded in each distributed copy, and if that copy is later found in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but watermarking is not suitable for all applications, for two reasons. First, it involves some modification of the original data. Second, watermarks can sometimes be destroyed if the data recipient is malicious.
In this paper, we develop data allocation strategies that improve the chances of identifying a leaker. We also consider the option of adding fake objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any original data. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.
Undergraduate Academic Research Journal (UARJ), ISSN: 2278 1129, Volume-1, Issue-3,4, 2012 31
II. PROBLEM SETUP AND NOTATION
A. Entities and Agents
Let the distributor own a set T = {t1, t2, ..., tm} of data objects, and let the agents be U1, U2, ..., Un [6][10]. The distributor gives a subset Ri of the objects in T to each agent Ui, based on the agent's request, which is either a sample request or an explicit request.
Sample request Ri = SAMPLE(T, mi): any subset of mi records from T can be given to Ui [1].
Explicit request Ri = EXPLICIT(T, condi): agent Ui receives all objects in T that satisfy condi [12][13].
The objects in T could be of any type and size; e.g., they could be tuples in a relation, or relations in a database. After giving objects to agents, the distributor discovers that a subset S of T has leaked. This means that some third party, called the target, has been caught in possession of S. For example [9], the target may be displaying S on its web site, or perhaps, as part of a legal discovery process, the target turned over S to the distributor. Since the agents U1, U2, ..., Un have some of the data, it is reasonable to suspect them of leaking it. However, the agents can argue that they are innocent, and that the S data was obtained by the target through other means.
B. Guilty Agents
Guilty agents are the agents who have leaked data. Suppose an agent Ui has leaked data, knowingly or unknowingly [1]. A notification is then sent automatically to the distributor, stating that agent Ui has leaked a particular set of records and specifying whether those records are sensitive or non-sensitive. Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to other sources [4][5][6].
C. Data Allocation Problem
The main focus of this paper is the data allocation problem: how can the distributor intelligently give data to agents in order to improve the chances of detecting a guilty agent? There are four instances of this problem, depending on the type of data requests made by agents (sample or explicit) and whether fake objects are allowed [1][3][6].
Figure 1. Leakage problem instances
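The two request types and the four problem instances described in this section can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names `sample_request` and `explicit_request`, the toy records in `T`, and the choice of a uniformly random subset for SAMPLE are all hypothetical and are not taken from the paper, which only requires that a sample request return some subset of mi objects.

```python
import random
from itertools import product
from typing import Callable, Hashable, Optional, Set


def sample_request(T: Set[Hashable], m: int,
                   rng: Optional[random.Random] = None) -> Set[Hashable]:
    """SAMPLE(T, m): return some subset of m objects from T.

    Assumption: the subset is drawn uniformly at random; the paper only
    says that any subset of m records may be given to the agent.
    """
    if m > len(T):
        raise ValueError("cannot sample more objects than T contains")
    rng = rng or random.Random()
    # random.sample needs a sequence, so the set is turned into a list first.
    return set(rng.sample(sorted(T, key=repr), m))


def explicit_request(T: Set[Hashable],
                     cond: Callable[[Hashable], bool]) -> Set[Hashable]:
    """EXPLICIT(T, cond): return all objects in T that satisfy cond."""
    return {t for t in T if cond(t)}


# Hypothetical distributor data: (record id, department) tuples.
T = {(1, "cardiology"), (2, "oncology"), (3, "cardiology"), (4, "neurology")}

R1 = sample_request(T, 2, random.Random(0))                 # sample request
R2 = explicit_request(T, lambda t: t[1] == "cardiology")    # explicit request

# The four problem instances: request type x whether fake objects are allowed.
instances = list(product(("explicit", "sample"), (True, False)))
```

Here `R2` contains exactly the two cardiology records, while `R1` is some two-element subset of `T` that depends on the random seed.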
D. Fake Objects
Fake objects are objects generated by the distributor that are not in the distributor's original set T. They are designed to look like real objects and are distributed to agents together with the T objects, in order to increase the chances of detecting agents that leak data [2][12].
III. RELATED WORK
The guilt detection approach we present is related to the data provenance problem: tracing the lineage of the objects in S essentially amounts to detecting the guilty agents. Good overviews of the research conducted in this field are given in [1][3]. The suggested solutions are domain specific, such as lineage tracing for data warehouses, and assume some prior knowledge of the way a data view is created out of data sources [4]. Our problem formulation with objects and sets is more general and simplifies lineage tracing, since we do not consider any data transformation from the Ri sets to S.
As far as the data allocation strategies are concerned, our work is most closely related to watermarking, which is used as a means of establishing original ownership of distributed objects. Watermarks were initially used in images, video, and audio data, whose digital representation includes considerable redundancy. Our approach and watermarking are similar in that both provide agents with some kind of receiver-identifying information [6]. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted, and methods that attach watermarks to the distributed data are not applicable.
Finally, there is also a large body of work on mechanisms that allow only authorized users to access sensitive data through access control policies [11][12]. Such approaches prevent data leakage, in some sense, by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents' requests.
IV. EXISTING SYSTEM
Leakage detection is handled by watermarking; e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks were initially used in images, video, and audio data, whose digital representation includes considerable redundancy [1]. Watermarking aims to identify a data owner and, hence, is subject to attacks in which a pirate claims ownership of the data or weakens a merchant's claims.
A. Watermarking Methodology
Digital assets such as software, images, video, audio, and text are frequently pirated, which is a serious concern for their owners. Protection schemes for such assets are based on inserting digital watermarks into them. In this process, particular objects or records in the data are selected for watermarking according to one criterion: the marks should have an insignificant impact on the usefulness of the data. The watermarking procedure introduces small errors into the object being watermarked. These intentional errors are called marks, and all the marks together constitute the watermark. The marks are chosen so that they have the least impact on the data, and are placed so that a malicious user cannot easily destroy them.
V. PROPOSED SYSTEM
The distributor's data allocation to agents has one constraint and one objective. The constraint is to satisfy agents' requests, by providing them with the number of objects they request or with all available objects that satisfy their conditions [13]. The objective is to be able to detect an agent who leaks any portion of the data. Specifically, the main objective is to maximize the chances of detecting a guilty agent that leaks all of his data objects. To this end, a model for assessing the guilt of agents is developed [1].
DATA ALLOCATION STRATEGIES
In this section, we describe allocation strategies that categorize normal and authorized agents based on the requests they make to the distributor. We deal with both the explicit data requests and the sample data requests of the agents [9][12].
A. Explicit Data Requests
An authorized agent sends a request for the available records, which contain both sensitive and non-sensitive data in the distributor-owned set, i.e., Ri = EXPLICIT({t1, t2, ..., tn}, condi); such a request is called an explicit data request [3][15]. The distributor cannot remove or alter the requested data Ri to decrease the overlap between the requests of different agents. Instead, the distributor adds fake objects to the requested data; these do not affect the condition in the agent's request, i.e., Ri = {t1, t2, ..., tn, f}. If the distributor is able to create more fake objects, he can further improve the objective. We present the algorithm for allocating explicit data requests, with agent selection for e-random and e-optimal, as follows [9][10].
Algorithm 1: Evaluation of Explicit Data Requests
1: Calculate the total number of fake records as the sum of the fake records allowed.
2: While total fake records > 0:
3:   Select the agent that will yield the greatest improvement in the sum objective, i.e.
     $i = \arg\max_{i'} \left( \frac{1}{|R_{i'}|} - \frac{1}{|R_{i'}| + 1} \right) \sum_{j} |R_{i'} \cap R_j|$
4:   Create a fake record.
5:   Add this fake record to the agent's set and also to the fake record set.
6:   Decrement the total number of fake records.
B. Sample Data Requests
With sample data requests, agents are not interested in particular objects. Hence, object sharing is not explicitly defined by their requests. The distributor is forced to allocate certain objects to multiple agents only if the total number of requested objects (the sum of the mi) exceeds the number of objects in set T [15]. The more data objects the agents request in total, the more recipients, on average, an object has; and the more objects are shared among different agents, the more difficult it is to detect a guilty agent [9].
Algorithm 2: Evaluation of Sample Data Requests
1: Initialize min_overlap ← 1, the minimum out of the maximum relative overlaps that the allocations of different objects to Ui yield.
2: for k ∈ {k | tk ∉ Ri} do
     Initialize max_rel_ov ← 0, the maximum relative overlap between Ri and any set Rj that the allocation of tk to Ui yields.
3:   for all j = 1, ..., n : j ≠ i and tk ∈ Rj do
       Calculate absolute overlap as abs_ov ← |Ri ∩ Rj| + 1
       Calculate relative overlap as rel_ov ← abs_ov / min(mi, mj)
4:     Find maximum relative overlap as max_rel_ov ← MAX(max_rel_ov, rel_ov)
     If max_rel_ov ≤ min_overlap then
       min_overlap ← max_rel_ov
       ret_k ← k
   Return ret_k
VI. CONCLUSION
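Referring back to Algorithms 1 and 2 of Section V, the following Python sketch shows one way their steps might be written out. It is a hedged illustration, not the authors' implementation. The function names, the dictionary-based inputs (`R` for agent sets, `budget` for fake records allowed per agent, `m` for requested sample sizes), and the `fake-N` placeholder strings are all hypothetical. Two further assumptions are made: the overlap sum in step 3 of Algorithm 1 is taken over the other agents only (j ≠ i), and step 1 is read literally, so the total is the sum of the per-agent allowances.

```python
from typing import Dict, List, Optional, Set, Tuple


def select_agent_e_optimal(R: Dict[str, Set[str]], candidates: List[str]) -> str:
    """Algorithm 1, step 3: agent whose next fake record helps the sum objective most."""
    def gain(i: str) -> float:
        size = len(R[i])
        if size == 0:
            return 0.0  # an empty set overlaps with nothing
        # Assumption: overlaps are summed over the other agents only (j != i).
        overlap = sum(len(R[i] & R[j]) for j in R if j != i)
        return (1 / size - 1 / (size + 1)) * overlap
    return max(candidates, key=gain)


def allocate_fake_records(R: Dict[str, Set[str]], budget: Dict[str, int]
                          ) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Algorithm 1 sketch; returns (updated agent sets, fake records per agent)."""
    R = {i: set(r) for i, r in R.items()}       # work on copies
    fakes: Dict[str, Set[str]] = {i: set() for i in R}
    remaining = dict(budget)
    total = sum(remaining.values())             # step 1
    created = 0
    while total > 0:                            # step 2
        candidates = [i for i in R if remaining[i] > 0]
        i = select_agent_e_optimal(R, candidates)   # step 3
        created += 1
        f = f"fake-{created}"                   # step 4 (placeholder fake record)
        R[i].add(f)                             # step 5
        fakes[i].add(f)
        remaining[i] -= 1
        total -= 1                              # step 6
    return R, fakes


def select_object(i: str, R: Dict[str, Set[str]], m: Dict[str, int],
                  T: Set[str]) -> Optional[str]:
    """Algorithm 2 sketch: object not yet held by agent i that keeps the
    maximum relative overlap with any other agent as small as possible."""
    min_overlap = 1.0
    ret_k = None
    for t in sorted(T - R[i]):                  # objects not yet in R_i
        max_rel_ov = 0.0
        for j in R:
            if j != i and t in R[j]:
                abs_ov = len(R[i] & R[j]) + 1
                rel_ov = abs_ov / min(m[i], m[j])
                max_rel_ov = max(max_rel_ov, rel_ov)
        if max_rel_ov <= min_overlap:
            min_overlap = max_rel_ov
            ret_k = t
    return ret_k


# Toy usage with hypothetical agents and objects.
chosen_agent = select_agent_e_optimal({"U1": {"t1", "t2"}, "U2": {"t1"}},
                                      ["U1", "U2"])          # "U2"
chosen_object = select_object("U1", {"U1": {"t1"}, "U2": {"t1", "t2"}},
                              {"U1": 2, "U2": 2}, {"t1", "t2", "t3"})  # "t3"
```

In the toy usage, the agent with the smaller set (`U2`) gains more from a fake record, and `t3` is preferred for `U1` because no other agent holds it. Because the comparison in `select_object` uses ≤, ties are resolved in favor of the object visited last.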
From the above study we conclude that, in a perfect world, there would be no need to hand over sensitive data to agents who may unknowingly or maliciously leak it. In practice, data must often be shared with agents who are not fully trusted. In spite of these difficulties, we have shown that it is possible to assess the likelihood that an agent is responsible for a leak, based on the probability that objects can be obtained by other means.
Data leakage is a silent type of threat. An employee, as an insider, can intentionally or accidentally leak sensitive information. This information can be distributed electronically via email, web sites, FTP, instant messaging, spreadsheets, databases, and any other electronic means available, all without the organization's knowledge. Two things are important in assessing the risk of distributing data: first, a data allocation strategy that distributes the tuples among agents with minimum overlap; and second, the calculation of an agent's guilt probability, which is based on the overlap of his data set with the leaked data set. The algorithms we have presented implement a variety of data distribution strategies that can improve the distributor's chances of identifying a leaker.
REFERENCES
[1] N. Sandhya, G. Haricharan Sharma, K. Bhima, "Exerting Modern Techniques for Data Leakage Problems Detect," International Journal of Electronics Communication and Computer Engineering (IJECCE), Vol. 3, Issue (1) NCRTCST, ISSN 2249-071X.
[2] Sandip A. Kale, Prof. S. V. Kulkarni, "Data Leakage Detection: A Survey," IOSR Journal of Computer Engineering (IOSRJCE), ISSN: 2278-0661, Vol. 1, Issue 6 (July-Aug 2012), pp. 32-35, www.iosrjournals.org.
[3] P. Saranya, "Online Data Leakage Detection And Analysis," IJART, Vol. 2, Issue 2, March 2012.
[4] P. Papadimitriou, H. Garcia-Molina, "Data Leakage Detection," technical report, Stanford University, 2008.
[5] Shivappa M. Metagar, Sanjaykumar J. Hamilpure, B. P. Savukar, "Water Marking Technique: An Unique Approach For Detecting The Data Leakage," Volume 2, Issue 5, Sept 2012.
[6] Archana Vaidya, Prakash Lahange, Kiran More, Shefali Kachroo & Nivedita Pandey, "Data Leakage Detection," Vol. 3, Issue 1, pp. 315-321.
[7] Jagtap N.P., Patil S.S. and Adhiya K.P., "Implementation Of Guilt Model With Data Watcher For Data Leakage Detection System," Volume 4, Issue 1, 2012.
[8] Rohit Pol, Vishwajeet Thakur, Ruturaj Bhise, Prof. Akash Kate, "Data Leakage Detection," International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 2, Issue 3, May-Jun 2012, pp. 404-410.
[9] Sujana Dommala & M. SreeDevi, "Data Leakage Detection Using Fake Objects," International Conference on Computer Science and Information Technology, ISBN: 978-93-81693-86-5, 10th June 2012, Tirupati.
[10] R. Arul Murugan, Kavitha E., Nivedha M., Subashini S., "Data Leakage Detection And Prevention Using Perturbation And Unobtrusive Analyzes," International Journal of Communications and Engineering, Volume 04, No. 4, Issue 03, March 2012.
[11] Naresh Bollam, Mr. V. Malsoru, "Review On Data Leakage Detection," International Journal of Engineering Research and Applications (IJERA), Vol. 1, Issue 3, pp. 1088-1091.
[12] Rupesh Mishra, D.K. Chitre, "Data Leakage and Detection of Guilty Agent," International Journal of Scientific & Engineering Research, Volume 3, Issue 6, June 2012.
[13] Unnati Kavali, Tejal Abhang, Mr. Vaibhav Narawade, International Journal of Engineering Research and Applications, Vol. 2, Issue 2, Mar-Apr 2012, pp. 1448-1452.
[14] Rudragouda G Patil, "Development of Data Leakage Detection Using Data Allocation Strategies," International Journal of Computer Applications in Engineering Sciences, Vol. I, Issue II, June 2011.
[15] Jayavarapu Karthik and Dr. P. Harini, "Data Leakage Detection," International Conference on Computing and Control Engineering (ICCCE 2012), 12 & 13 April, 2012.
