Académique Documents
Professionnel Documents
Culture Documents
(19) United States (12) Patent Application Publication (10) Pub. N0.: US 2013/0254838 A1
Ahuja et a].
(54) SYSTEM AND METHOD FOR DATA MINING
AND SECURITY POLICY MANAGEMENT
(57)
ABSTRACT
(63)
Continuation of application No. 12/410,875, ?led on Mar, 25, 2009, now Pat, NO_ 8,447,722
Publication Classi?cation
(51) Int. Cl. G06F 21/60
(2006.01)
embodiments, the method further includes generating a Cap ture rule that de?nes items the capture system should capture. The method also includes generating a discovery rule that de?nes objects the capture system should register. In still other embodiments, the method includes developing a policy
based on the rule, Where the policy identi?es hoW one or more documents are permitted to traverse a network.
DATA FRGM
RQUTER!
NETWO RK
31%
CAPTURE SYSTEM
31 0
USER lNTERFACE
w is V
A F
NETWQRK
lNTERFAQE
PACKET
QBJEQT
QEJ ECT
QEJECT
MQUULE
MQQULE
MODULE
MQDULE
MUDULE
/
300
/
302 304
x
306
\
308
US 2013/0254838 A1
LAN
1%
X
38
SWETCH
1136 <
i Q2
,1
ff ._ ENTERNET
i\ ii}
FEREWALL RQUTER
5&4 <
F 1G. 1
(PRIQR ART)
SERVER
US 2013/0254838 A1
LAN
21?
\
2@5\~ SWE'iCH
2Q<
O
o CAPTURE
SYSTEM
266
,1
m 282
J
mummy! v
\
([N ENTERNET *
21G
RQUTER
SERVER
204 < 2
[1113
mm
Q
FIG. 12?,
Q
SERVER
US 2013/0254838 A1
mm FRGM
ROUTER!
NETWORK
3112
\
CAFTURE $YSTEM
339
w
MQDULE
OEJ'ECT
USER iNTERFACE
w
PACKET 0mm"
A}?
9mm
NEMQRK
iNTERFACE
MQDULE
MQDULE
MGDULE
MGDULE
/'
3&0
/
3G2
/
3G4
i
3G5
\
368
FR}. 3
wii
was? ASSEMBLY MGDULE
/PACKtT$,
..
,1
a.
Paomcm
PRGTQCOL
7 6mm?
FEE. 4
US 2013/0254838 A1
5G5 \
\m
CQNTENT 5G4
STQRE '
TAG
2C, ......
DATABASE 592
Egg
pf,
$99
CAPTURE/REGESTRATMN SYSTEM
is
$532 \
NETWQRK
ElTERFACE
MQDULE
USER
<= a ENTERFACE
'
pf" 6&8
A
w
sh
w
$94 _\
GBJECT
QAPTURE =~
?nial-E3 $TGRE A
~11
REGiSTRATiGN w
ggg J
MGDULE
6i 2
US 2013/0254838 A1
mom
USER
we
F
LESER
as
REGESTRATEQN MODULE
WE
R
FRQM GBJECT
Ti E)
w /
.
T? 2
.1
f
CAPTURE
REGISTRAHQN
ESEARCH
N6 HFiCA'i'iON
MQDULEE
:NGINE
k ENGINE
MGDUL:
-v-~--v\
m CGNTENT
STQRE
61 2
DATABASE
H. '"2
START
START
-------------
"g
91Q\ EXTRACT/DECQDE
QQNTENT
%
U2G-\ SELECT N SPECGAL TOKENfS
92O"\
NQRMALEZE TEXT
i
% GENERATE SEGNATURES
L
GENERATE HA$H QF N
@630 /
SPECiAL "roams
Q48 J
FRGM TOKEN$
Fifi, i6
US 2013/0254838 A1
8Q2"\
CAPTURE GBJECT
v
%
8G5 ' CQMPARE @BJECT
"\
SiGNATLiRE$ m $i3NATL5RES
8%
ANY
MATCHES Mm
812-
f
%
EDENTEFY USEE WHQ
3i 4 J"
i
ALERT THE USER (3? THE iNTERCEPTEQN
K 81%
818/
Ah/3588i
YES GRANTE A
822/
TGWARBS DESTENATEON
US 2013/0254838 A1
US 2013/0254838 A1
@3611
'3@3"\
@EQSX
136?
139w
13; 1/
CREATED ENDEXES
US 2013/0254838 A1
TERM
KEYWCRDi CQNFEDENTEAL
KEYWQRDQ ENFQREVEATION
REFERENCHS}
REFERENCELREFERENCE3
REFERENCELREFEREECEZ
1215M
,...
m ETADATA
METADATA 1 MAIL FRQM
REFERENCES}
METADATA 2 3
1218/
....
15m
%
RETREEVE iNTERESECTED DOCUMENTS FELE iNFGRMATiGN
\- i553?
FIG. 15
US 2013/0254838 A1
METADATA
I
QBJECTES
1663
immm'
@505
RULE
US 2013/0254838 A1
....3
....i \\
C3 LU
------------------- n
Lu
iiiiiiiiiiiiiiiiiii H
g.
>@
<w?z
team Q m-*"~Jw
Qg?g
....i
4 E
i;
WW!
B2: C)
(.3
C)
QMUERY/NDAET
561 EM TAD TA 1
US 2013/0254838 A1
f. maw
mzaipwh 295.
.@E g :.
. g?o \\
\2... mw?3%gazW.x
.zapmzw
m x. .
m Em i mmpg
US 2013/0254838 A1
993
19%
@9@7\
1909f
191 1 f
MENE DAiABASES)
GENERATSOLAP CUBE
REC-ENE FiRAMETERS
FQR RULE 0E PQLECY
1913f
1915/
US 2013/0254838 A1
26501
CONTENT
STGRAGE
\v
2mg
uses:
SYSTEM
2303
\
2%65
DATABASE
2Q21
"m": A hi k 4 !
RULE
;
.
i
E;
A;
GTHER
MQDULES
CAPTURE SYSTEM
k
if i
W
HLTER M TRANSFGRM QLAP w PRESENTA'HGN A h RULE
"
MQDULE
" GENERATQR
MGDULE
' GENERATGR
2G1? / A
=1
292? r"!
2M3 \K
/
2015 K
1
201?
29s?
2%
<
QTHEREWMNG MQEEULES
AQMNESTRATEON SYSTEM
\\
2023
US 2013/0254838 A1
2164
CACHE \
21%?
~< =1 PROCE?OR<$> ./
A
210?
DESPLAY
A
W
SYSTEM
MEMORY
1
A
"
A
"
m
'
Mm
'
m
"
GRAPHECS
PRGCESSOR
/
2W3
2165/
\.
2W2
i?
\
2106
is
US 2013/0254838 A1
SYSTEM AND METHOD FOR DATA MINING AND SECURITY POLICY MANAGEMENT TECHNICAL FIELD OF THE INVENTION
[0017]
ing How;
[0018] FIG. 14 illustrates an example of a keyWord and metadata index at a particular point in time; [0019] FIG. 15 illustrates a simpli?ed example querying
[0001]
enterprises employ numerous took to control the dissemina tion of such information and many of these took attempt to
an effective data management system, capable of securing and controlling the movement of important information, pro vides a signi?cant challenge to security professionals, com ponent manufacturers, service providers, and system admin
istrators alike.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] Although the present system Will be discussed With reference to various illustrated examples, these examples
should not be read to limit the broader spirit and scope of the
[0004] The present invention is illustrated by Way of example, and not by Way of limitation, in the FIGURES of the accompanying draWings in Which like reference numerals
refer to similar elements and in Which: [0005] FIG. 1 is a block diagram illustrating a computer netWork connected to the Internet;
[0006]
tion;
[0008] FIG. 4 is a block diagram illustrating an object assembly module according to one embodiment of the
present invention;
[0009] FIG. 5 is a block diagram illustrating an object store module according to one embodiment of the present inven
tion;
[0010] FIG. 6 is a block diagram illustrating a document registration system according to one embodiment of the
present invention;
[0011] FIG. 7 is a block diagram illustrating a registration module according to one embodiment of the present inven
Unless speci?cally stated otherWise, it Will be appreciated that throughout the description of the present invention, use of terms such as processing, computing, calculating, determining, displaying, etc., refer to the action and pro
cesses of a computer system, or similar electronic computing
tion;
[0012] FIG. 8 illustrates an embodiment of the How of the
physical (electronic) quantities Within the computer system s registers and memories into other data similarly represented as physical quantities Within the computer system memories
or registers or other such information storage, transmission or
display devices.
[0028] FIG. 1 illustrates a simple prior art con?guration of
a local area netWork (LAN) 100 connected to Internet 102. Connected to LAN 100 are various components such as serv ers 104, clients 106, and sWitch 108. Numerous other net
Working components and computing devices may be con nected to the LAN 100. LAN 100 may be implemented using
US 2013/0254838 A1
Capture System
[0033] FIG. 3 shows an embodiment of a capture system 312 in detail and it includes a network interface module 300,
[0029]
the Internet 102 via a router 110. Router 110 may be used to implement a ?rewall. Firewalls are used to try to provide users
of LANS with secure access to the Internet as well as to
provide a separation of a public Web server (e. g., one of servers 104) from an internal network (e.g., LAN 100). Data
plicity, the capture system has been labeled as capture system 312. However, the discussion regarding capture system 312 is equally applicable to capture system 200.A network interface module 300 receives (captures) data, such as data packets,
from a network or router. Example network interface modules
packets.
[0030] FIG. 2 is a simpli?ed block diagram of a capture system 200, which includes a LAN 212, a set of clients 206,
a set of servers 204, a router 210, and an Internet element 202.
capture system.
[0034] This captured data is passed from network interface
module 300 to a packet capture module 302, which extracts
obtain packets. Additionally, packet data may be extracted from a packet by removing the packets header and check
sum.
[0031] Clients 206 are endpoints or customers wishing to affect or otherwise manage a communication in the proffered architecture. The term client may be inclusive of devices
used to initiate a communication, such as a computer, a per
[0035]
form the original, or reasonable equivalent of the, PDF from the captured packets associated with the PDF document. A complete data stream is obtained by reconstruction of mul tiple packets. The process by which a packet is created is beyond the scope of this application.
[0036]
appropriate algorithms and communication protocols that facilitate the data mining and policy management operations
detailed herein.
[0037] One or more tables may be included in these net
be of interest, including: network security reasons, intellec tual property concerns, corporate governance regulations, and other corporate policy concerns. Example documents
include, but are not limited to, Microsoft Of?ce documents
work appliances (or within capture system 200). In other embodiments, these tables may be provided externally to
these elements, or consolidated in any suitable fashion. The tables are memory elements for storing information to be
(such as Word, Excel, etc.), text ?les, images (such as JPEG, BMP, GIF, PRIG, etc.), Portable Document Format (PDF) ?les, archive ?les (such as GZIP, ZIP, TAR, JAR, WAR, RAR, etc.), email messages, email attachments, audio ?les, video
?les, source code ?les, executable ?les, etc.
US 2013/0254838 A1
electronic register, diagram, record, index, list, or queue. Alternatively, the tables may keep such information in any
suitable random access memory (RAM), read only memory
(ROM), erasable programmable ROM (EPROM), electroni cally erasable PROM (EEPROM), application speci?c inte
grated circuit (ASIC), softWare, hardWare, or in any other suitable component, device, element, or object Where appro priate and based on particular needs.
[0038] FIG. 4 illustrates a more detailed embodiment of
object assembly module 304 and FIG. 4 is discussed in con junction With some of the internal components of FIG. 3 for
purposes of explanation. This object assembly module includes a re-assembler 400, protocol demultiplexer (de mux) 402, and a protocol classi?er 404. Packets entering the
object assembly module 304 are provided to re-assembler 400. Re-assembler 400 groups (assembles) the packets into at
least one unique How. A TCP/IP ?oW contains an ordered
sequence of packets that may be assembled into a contiguous data stream by re-assembler 400. An example How includes
packets With an identical source IP and destination IP address and/or identical TCP source and destination ports. In other
may still contain multiple content objects depending on the protocol used. For example, a single ?oW using HTTP may
contain over 100 objects of any number of content types. To
In an embodiment, a state machine is maintained for each TCP connection, Which ensures that the capture system has a clear picture of content moving across every connection.
[0039]
properties and/or signature(s) of various documents to deter mine the content type of each object. For example, a Word
document has a signature that is distinct from a PoWerPoint document or an email. Object classi?cation module 306
vation of a starting packet. This starting packet is normally de?ned by the data transfer protocol being used. For example,
the starting packet of a TCP How is a SYN packet. The How terminates upon observing a ?nishing packet (e.g., a Reset or FIN packet in TCP/IP) or via a timeout mechanism if the ?nished packing is not observed Within a predetermined time constraint.
extracts each object and sorts them according to content type. This classi?cation prevents the transfer of a document Whose ?le extension or other property has been altered. For example, a Word document may have its extension changed from .doc
[0040] A How assembled by re-assembler 400 is provided to a protocol demultiplexer (demux) 402. Protocol demux
402 sorts assembled ?oWs using ports, such as TCP and/or
to .dock but the properties and/or signatures of that Word document remain the same and detectable by object classi? cation module 3 06. In other Words, object classi?cation mod
numbers With speci?ed protocols. For example, because Web Hyper Text Transfer Protocol (HTTP) packets, such as, Web tra?ic packets, are typically associated With TCP port 80,
packets that are captured over TCP port 80 are speculatively classi?ed as being HTTP. Examples of other Well-knoWn
signature; 2) grammar analysis; 3) statistical analysis; 4) ?le classi?cation; 5) document biometrics; and 6) concept maps.
Content signatures are used to look for prede?ned byte strings or text and number patterns (i.e., Social Security numbers, medical records, and bank accounts). When a signature is recogniZed, it becomes part of the classi?cation vector for
that content. While bene?cial When used in combination With
ports include TCP port 20 (File Transfer Protocol (FTP), TCP port 88 (Kerberos authentication packets), etc. Thus, protocol demux 402 separates the ?oWs by protocols.
[0041] A protocol classi?er 404 further sorts ?oWs. Proto col classi?er 404 (operating in either parallel or in sequence to
other metrics, signature matching alone may lead to a high number of false positives. [0046] Grammar analysis determines if an obj ects content
is in a speci?c language and ?lters accordingly based on this information, Various types of content have their oWn gram
mar or syntax. For example, C source code uses if/then
protocol demux 402) applies signature ?lters to a How to identify the protocol based solely on the transported data. Protocol classi?er 404 uses a protocols signature(s) (i.e., the characteristic data sequences of a de?ned protocol) to verify
[0047]
of the extensions applied to the ?le or compression. The ?le classi?cation mechanism looks for speci?c ?le markers instead of relying on normal telltale signs such as .xls or .pdf.
Document biometrics identi?es sensitive data even if the data
US 2013/0254838 A1
rich elements in ?les regardless of the order or combination in Which they appear. For example, a sensitive Word document
may be identi?ed even if text elements inside the document or
the ?le name itself have been changed. Excerpts of larger ?les, e.g., a single column exported from an Excel spread sheet containing Social Security numbers, may also be iden ti?ed. Document biometrics takes snapshots of protected
documents in order to build a signature set for protecting
[0048] Statistical analysis assigns Weights to the results of signature, grammar, and biometric analysis. That is, the cap
ture system tracks hoW many times there Was a signature, grammar, or biometric match in a particular document or ?le.
Destination IP address of object Source port number ofobject Destination port number of the object Protocol that carried the object Canonical count identifying object Within a protocol
Encoding
Size Timestamp OWner
author)
Con?guration Signature Tag Signature
Attribute
Capture rule directing the capture of object Hash signature of object Hash signature of all preceding tag ?elds
One or more attributes related to the object
system and, further, may include virtually any item (in addi
tion to those items discussed herein). The capture system may
also be made accessible to any netWork-connected machine through netWork interface module 300 and/ or user interface
310. In one embodiment, user interface 310 is a graphical user interface providing the user With easy access to the various
[0053] There are various other possible tag ?elds and some tag ?elds listed in Table 1 may not be used. In an embodiment, tag database 500 is not implemented as a database and another
data structure is used. The mapping of tags to objects may be obtained by using unique combinations of tag ?elds to con
struct an objects name. For example, one such possible com bination is an ordered list of the source IP, destination IP,
features of capture system 312. For example, user interface 310 may provide a capture rule-authoring tool that alloWs any capture rule desired to be Written. These rules are then applied
source port, destination port, instance, and timestamp. Many other such combinations, including both shorter and longer
names, are possible. A tag may contain a pointer to the storage
also provide pre-con?gured capture rules that the user selects from along With an explanation of the operation of such
location Where the indexed object is stored. The tag ?elds shoWn in Table 1 can be expressed more generally, to empha size the underlying information indicated by the tag ?elds in various embodiments. Some of the possible generic tag ?elds
are set forth in Table 2:
TABLE 2
De?nition
Device Identity Source Address Destination Address Source Port Destination Port Protocol Instance
Content
Identi?er of capture device Origination Address of object Destination Address of object Origination Port ofobject Destination Port of the object Protocol that carried the object Canonical count identifying object Within a protocol capable of carrying multiple data Within a single
connection
content store 502 are ?les 504 grouped by content type. For
Encoding
Size Timestamp OWner
(entirely or in part) using, for example, some netWork storage technique such as netWork attached storage (NAS), storage
area netWork (SAN), or other database.
Con?guration
User requesting the capture of object (rule author) Capture rule directing the capture of object
Signature
Tag Signature
Attribute
Signature of object
Signature of all preceding tag ?elds
One or more attributes related to the object
[0052]