Vous êtes sur la page 1sur 31

US 20130254838A1

(19) United States (12) Patent Application Publication (10) Pub. N0.: US 2013/0254838 A1
Ahuja et a].
(54) SYSTEM AND METHOD FOR DATA MINING
AND SECURITY POLICY MANAGEMENT

(43) Pub. Date:


(52) US. Cl.

Sep. 26, 2013

CPC .................................. .. G06F 21/606 (2013.01)


USPC ............................................................ .. 726/1

(71) ApplicantszRatinder Ahuja, Saratoga, CA (US);


Faizel Lakhani, Campbell, CA (U S)
(72) Inventors: Ratinder Ahuja, Saratoga, CA (US);
Faizel Lakhani, Campbell, CA (U S)

(57)

ABSTRACT

A method is provided in one example and includes generating


a query for a database for information stored in the database. The information relates to data discovered through a capture

(21) APP1~ NOJ 13/896,210


(22)
_ Flled:

system. The method further includes generating an Online


Analytical Processing (OLAP) element to represent informa tion received from the query. A rule based on the OLAP
element is generated and the rule affects data management for one or more documents that satisfy the rule. In more speci?c

May 161 2013


_ _ Related Us Apphcatlon Data

(63)

Continuation of application No. 12/410,875, ?led on Mar, 25, 2009, now Pat, NO_ 8,447,722

Publication Classi?cation
(51) Int. Cl. G06F 21/60
(2006.01)

embodiments, the method further includes generating a Cap ture rule that de?nes items the capture system should capture. The method also includes generating a discovery rule that de?nes objects the capture system should register. In still other embodiments, the method includes developing a policy
based on the rule, Where the policy identi?es hoW one or more documents are permitted to traverse a network.

DATA FRGM

RQUTER!

NETWO RK

31%
CAPTURE SYSTEM

31 0
USER lNTERFACE
w is V

A F

NETWQRK
lNTERFAQE

PACKET

QBJEQT

QEJ ECT

QEJECT

v CAPTU RE W ASSEMBLY m CLASSEFFGATEGN a! STGRE

MQUULE

MQQULE

MODULE

MQDULE

MUDULE

/
300

/
302 304

x
306

\
308

Patent Application Publication

Sep. 26, 2013 Sheet 1 0f 15

US 2013/0254838 A1

LAN

1%
X

38

SWETCH

1136 <

i Q2
,1

ff ._ ENTERNET
i\ ii}
FEREWALL RQUTER

5&4 <

F 1G. 1
(PRIQR ART)
SERVER

Patent Application Publication

Sep. 26, 2013 Sheet 2 0f 15

US 2013/0254838 A1

LAN

21?
\

2@5\~ SWE'iCH

2Q<

O
o CAPTURE

SYSTEM

266

,1
m 282
J

mummy! v
\

([N ENTERNET *

21G
RQUTER

SERVER
204 < 2

[1113

mm
Q

FIG. 12?,

Q
SERVER

Patent Application Publication

Sep. 26, 2013 Sheet 3 0f 15

US 2013/0254838 A1

mm FRGM

ROUTER!

NETWORK

3112
\
CAFTURE $YSTEM

339

w
MQDULE

OEJ'ECT

USER iNTERFACE

w
PACKET 0mm"

A}?
9mm

NEMQRK
iNTERFACE

-* CAPTURE w ASSEMBLY - CLASSEHGATEQN m smRE

MQDULE

MQDULE

MGDULE

MGDULE

/'
3&0

/
3G2

/
3G4

i
3G5

\
368

FR}. 3

wii
was? ASSEMBLY MGDULE

/PACKtT$,

..

,1

r REASbEMBLER + DEMUX m CLAssRFER / \ \


489 4E2 484

a.

Paomcm

PRGTQCOL

7 6mm?

Jim; """ '

FEE. 4

Patent Application Publication

Sep. 26, 2013 Sheet 4 0f 15

US 2013/0254838 A1

DATA FRQM OBJECT ASSEMBLY MGDULE

{DATA FRQM OBJECT CLASSEFECAWGN MOSULE

5G5 \
\m

CQNTENT 5G4
STQRE '

TAG

2C, ......

DATABASE 592

OBJECT STQRE MODULE


. 5

Egg

pf,

$99

CAPTURE/REGESTRATMN SYSTEM
is

$532 \

NETWQRK
ElTERFACE
MQDULE

USER
<= a ENTERFACE
'

pf" 6&8

A
w

sh
w

$94 _\

GBJECT
QAPTURE =~

?nial-E3 $TGRE A

~11

REGiSTRATiGN w

______ DATABASE iiiiiiiii WW iiiiii I

ggg J

MGDULE

6i 2

Patent Application Publication

Sep. 26, 2013 Sheet 5 0f 15

US 2013/0254838 A1

mom

USER

we
F

LESER
as

REGESTRATEQN MODULE

WE
R
FRQM GBJECT

Ti E)
w /
.

T? 2
.1
f

CAPTURE

REGISTRAHQN

ESEARCH

N6 HFiCA'i'iON

MQDULEE

:NGINE

k ENGINE

MGDUL:

-v-~--v\

m CGNTENT

STQRE

"W E" 5: SEJNATLRE

61 2

DATABASE

H. '"2
START
START
-------------

gm 0* sass? FSRST M TOKENS

"g

91Q\ EXTRACT/DECQDE
QQNTENT

%
U2G-\ SELECT N SPECGAL TOKENfS

FRGM THE M TOKENS

92O"\

NQRMALEZE TEXT
i
% GENERATE SEGNATURES

L
GENERATE HA$H QF N

@630 /

SPECiAL "roams

/ $K5P AHEAD P TQKENS 194$ FRGM FERST GF M mxaws

Q48 J

FRGM TOKEN$

Fifi, i6

Patent Application Publication

Sep. 26, 2013 Sheet 6 0f 15

US 2013/0254838 A1

8Q2"\

CAPTURE GBJECT
v

QALCULATE QBJECT SEGNATURES

%
8G5 ' CQMPARE @BJECT

"\

SiGNATLiRE$ m $i3NATL5RES

(3F REGESTEREQ nocumams

8%
ANY

MATCHES Mm

81 Q\ MALT DELEVERY OF CAPTURED QBJECT


r EDENTEFY REGiSTERED DQCUMENT

812-
f

FRQM MATCHENG SEGNATURE$

%
EDENTEFY USEE WHQ

3i 4 J"

REGESTEREU THE DOCUMENT

i
ALERT THE USER (3? THE iNTERCEPTEQN
K 81%

0F THE CAPTURED UBJECT


REQUEST PERMESSON T0 BELEVER

818/

(EAFTUREE) GBJECT TQ UESTENATEDN

Ah/3588i
YES GRANTE A

RGUTE CAPTUREB GBJECT

822/

TGWARBS DESTENATEON

Patent Application Publication

Sep. 26, 2013 Sheet 7 0f 15

US 2013/0254838 A1

Patent Application Publication

Sep. 26, 2013 Sheet 8 0f 15

US 2013/0254838 A1

@3611
'3@3"\

CAPTURE PACKET S'E'REEAM


ANALYZE PACKET STREAM

@EQSX

CGPYIMQVE GBJECT DATA


T0 $TORAGE DEVECE
W

, QREATE KEYWQRB iNDEX{E$)iENTRiE$

136?

FGR CAPTURED CQNTENT


QREATE METADATA arwaxqasyammas BASED 6N CAPTURE-5D CQNTENT
QUERY ONE {3R MGRE OF

139w

13; 1/

CREATED ENDEXES

Patent Application Publication

Sep. 26, 2013 Sheet 9 0f 15

US 2013/0254838 A1

TERM
KEYWCRDi CQNFEDENTEAL
KEYWQRDQ ENFQREVEATION

REFERENCHS}
REFERENCELREFERENCE3
REFERENCELREFEREECEZ

1215M

,...

m ETADATA
METADATA 1 MAIL FRQM

REFERENCES}

METADATA 2 3

LEQPQLD PDF HQ?

REFERENCE LREFERENCE 1_.REFERENCE 1,REFERENCE 4 3

1218/

....

QUERY ONE GR MORE KEYWORD iNDEXES FQR xezwvnmqs;

15m

QUERY QNE 0R MORE METADATA f h5g3 ENEDEXES FOR METADATA

iNTERSEC'E' RESULTS (3F KEYWQRD AND METADATA QUEREES X @565

%
RETREEVE iNTERESECTED DOCUMENTS FELE iNFGRMATiGN

\- i553?

FIG. 15

Patent Application Publication

Sep. 26, 2013 Sheet 10 0f 15

US 2013/0254838 A1

METADATA
I

QBJECTES
1663

immm'

mm? Fmmmmi EMWWWWW Vmmmww'i i'mmwmwi

@505
RULE

Patent Application Publication

Sep. 26, 2013 Sheet 11 0f 15

US 2013/0254838 A1

....3
....i \\

C3 LU

------------------- n

Lu

iiiiiiiiiiiiiiiiiii H

g.

>@

<w?z

team Q m-*"~Jw

Qg?g
....i

4 E

i;

WW!
B2: C)

(.3

C)

QMUERY/NDAET
561 EM TAD TA 1

Patent Application Publication

Sep. 26, 2013 Sheet 12 0f 15

US 2013/0254838 A1

f. maw

mzaipwh 295.

.@E g :.
. g?o \\

\2... mw?3%gazW.x

.zapmzw

m x. .

m Em i mmpg

Patent Application Publication

Sep. 26, 2013 Sheet 13 0f 15

US 2013/0254838 A1

19m \ CAPTURE/PROCESS DATAIQBdECTS

993
19%
@9@7\
1909f
191 1 f

MENE DAiABASES)
GENERATSOLAP CUBE
REC-ENE FiRAMETERS
FQR RULE 0E PQLECY

CREATE RU; QR PQLECY


STQRE RUE: QR POLECY

1913f
1915/

RUN RULEiR PQEEY

RECEWE upm'rin PARAMETERS


FQR RULE 0E POLECY

Patent Application Publication

Sep. 26, 2013 Sheet 14 0f 15

US 2013/0254838 A1

26501
CONTENT
STGRAGE

\v
2mg

uses:
SYSTEM

2303
\

2%65

DATABASE
2Q21
"m": A hi k 4 !

RULE

;
.
i

E;
A;

GTHER

MQDULES

CAPTURE SYSTEM
k
if i

W
HLTER M TRANSFGRM QLAP w PRESENTA'HGN A h RULE

"

MQDULE

" GENERATQR

MGDULE

' GENERATGR

2G1? / A
=1

292? r"!

2M3 \K
/

2015 K
1

201?

DATA MENENG Momma

29s?

2%
<

QTHEREWMNG MQEEULES

AQMNESTRATEON SYSTEM

\\
2023

Patent Application Publication

Sep. 26, 2013 Sheet 15 0f 15

US 2013/0254838 A1

2164
CACHE \

21%?
~< =1 PROCE?OR<$> ./
A

210?
DESPLAY
A

W
SYSTEM
MEMORY

1
A
"

A
"

m
'

Mm
'

m
"

GRAPHECS
PRGCESSOR

/
2W3
2165/

\.
2W2
i?

\
2106

is

US 2013/0254838 A1

Sep. 26, 2013

SYSTEM AND METHOD FOR DATA MINING AND SECURITY POLICY MANAGEMENT TECHNICAL FIELD OF THE INVENTION

[0017]

FIG. 13 illustrates an example indexing and search

ing How;
[0018] FIG. 14 illustrates an example of a keyWord and metadata index at a particular point in time; [0019] FIG. 15 illustrates a simpli?ed example querying

[0001]

The present invention relates to computer networks

and, more particularly, to a system and a method for data

?oW using metadata and keyWord indexing;


[0020] FIG. 16 illustrates an embodiment of the interac tions betWeen different aspects of a capture system and user activities for creating a rule or policy used for interpreting

mining and security policy management.


BACKGROUND OF THE INVENTION

[0002] Computer netWorks have become indispensable


took for modern business. Enterprises can use netWorks for communications and, further, can store data in various forms

captured or registered data;


[0021] FIG. 17 illustrates an embodiment of using a query of tag and historical databases to generate an online analytical

and at various locations. Critical information frequently


propagates over a netWork of a business enterprise. Modern

processing (OLAP) cube;


[0022] FIG. 18 graphically illustrates an embodiment of a

enterprises employ numerous took to control the dissemina tion of such information and many of these took attempt to

method for the generation of a cube;


[0023] FIG. 19 illustrates an embodiment of a method to generate a rule from an OLAP cube; [0024] FIG. 20 illustrates an embodiment of a system that uses OLAP to help a user de?ne a rule and/or policy; and [0025] FIG. 21 shoWs an embodiment of a computing sys
tem.

keep outsiders, intruders, and unauthoriZed personnel from


accessing valuable or sensitive information. Commonly, these took can include ?reWalls, intrusion detection systems,

and packet sniffer devices.


[0003] The ability to offer a system or a protocol that offers

an effective data management system, capable of securing and controlling the movement of important information, pro vides a signi?cant challenge to security professionals, com ponent manufacturers, service providers, and system admin
istrators alike.
BRIEF DESCRIPTION OF THE DRAWINGS

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

[0026] Although the present system Will be discussed With reference to various illustrated examples, these examples
should not be read to limit the broader spirit and scope of the

present invention. Some portions of the detailed description


that folloWs are presented in terms of algorithms and sym bolic representations of operations on data Within a computer

[0004] The present invention is illustrated by Way of example, and not by Way of limitation, in the FIGURES of the accompanying draWings in Which like reference numerals
refer to similar elements and in Which: [0005] FIG. 1 is a block diagram illustrating a computer netWork connected to the Internet;

memory. These algorithmic descriptions and representations


are the means used by those skilled in the art of computer science to most effectively convey the sub stance of their Work to others skilled in the art. An algorithm is generally con ceived to be a self-consistent sequence of steps leading to a

[0006]

FIG. 2 is a block diagram illustrating one con?gu

ration of a capture system according to one embodiment of

desired result. The steps are those requiring physical manipu

the present invention;


[0007] FIG. 3 is a block diagram illustrating the capture system according to one embodiment of the present inven

lations of physical quantities. Usually, though not necessarily,


these quantities take the form of electrical or magnetic signals

capable of being stored, transferred, combined, compared


and otherWise manipulated. [0027] It has proven convenient at times, principally for
reasons of common usage, to refer to these signals as bits,

tion;
[0008] FIG. 4 is a block diagram illustrating an object assembly module according to one embodiment of the

present invention;
[0009] FIG. 5 is a block diagram illustrating an object store module according to one embodiment of the present inven

values, elements, symbols, characters, terms, numbers, or the


like. Keep in mind, hoWever, that all of these and similar terms are to be associated With the appropriate physical quantities and are merely convenient labels applied to these quantities.

tion;
[0010] FIG. 6 is a block diagram illustrating a document registration system according to one embodiment of the

present invention;
[0011] FIG. 7 is a block diagram illustrating a registration module according to one embodiment of the present inven

Unless speci?cally stated otherWise, it Will be appreciated that throughout the description of the present invention, use of terms such as processing, computing, calculating, determining, displaying, etc., refer to the action and pro
cesses of a computer system, or similar electronic computing

tion;
[0012] FIG. 8 illustrates an embodiment of the How of the

device, that manipulates and transforms data represented as

operation of a registration module;


[0013] FIG. 9 is a How diagram illustrating an embodiment of a How to generate signatures; [0014] FIG. 10 is a How diagram illustrating an embodi

physical (electronic) quantities Within the computer system s registers and memories into other data similarly represented as physical quantities Within the computer system memories
or registers or other such information storage, transmission or

display devices.
[0028] FIG. 1 illustrates a simple prior art con?guration of
a local area netWork (LAN) 100 connected to Internet 102. Connected to LAN 100 are various components such as serv ers 104, clients 106, and sWitch 108. Numerous other net

ment of changing tokens into document signatures;


[0015] [0016] FIG. 11 illustrates an embodiment of a registration FIG. 12 illustrates an example embodiment of a

engine that generates signatures for documents;


netWork capture device;

Working components and computing devices may be con nected to the LAN 100. LAN 100 may be implemented using

US 2013/0254838 A1

Sep. 26, 2013

various wireline (e.g., Ethernet) or wireless technologies


(e.g., IEEE 802.1 1x). LAN 100 could also be connected to other LANs.

Capture System
[0033] FIG. 3 shows an embodiment of a capture system 312 in detail and it includes a network interface module 300,

[0029]

In this prior con?guration, LAN 100 is connected to

packet capture module 302, an object module 304, an object


classi?cation module 306, an object store module 308, and a user interface 310. A capture system (such as capture system
200 or 312) may also be referred to as a content analyZer, content/ data analysis system, or other similar name. For sim

the Internet 102 via a router 110. Router 110 may be used to implement a ?rewall. Firewalls are used to try to provide users
of LANS with secure access to the Internet as well as to

provide a separation of a public Web server (e. g., one of servers 104) from an internal network (e.g., LAN 100). Data

leaving LAN 100 and going to Internet 102 passes through


router 110. Router 120 simply forwards packets as is from LAN 100 to Internet 102. Router 110 routes packets to and from a network and the Internet. While the router may log that

plicity, the capture system has been labeled as capture system 312. However, the discussion regarding capture system 312 is equally applicable to capture system 200.A network interface module 300 receives (captures) data, such as data packets,
from a network or router. Example network interface modules

a transaction has occurred (i.e., packets have been routed), it


does not capture, analyze, or store the content contained in the

300 include network interface cards (NICs) (for example,


Ethernet cards). More than one NIC may be present in a

packets.
[0030] FIG. 2 is a simpli?ed block diagram of a capture system 200, which includes a LAN 212, a set of clients 206,
a set of servers 204, a router 210, and an Internet element 202.

capture system.
[0034] This captured data is passed from network interface
module 300 to a packet capture module 302, which extracts

packets from the captured data. Packet capture module 302


may extract packets from streams with different sources and/ or destinations. One such case is asymmetric routing where a packet sent from source A to destination B travels along a ?rst path and responses sent from destination B to source

Capture system 200 may be con?gured sequentially in front


of, or behind, router 210. In systems where a router is not

used, capture system 200 may be located between LAN 212


and Internet 202. Stated in other terms, if a router is not used, capture system 200 can operate to forward packets to Internet 202, in accordance with one example paradigm. In one
embodiment, capture system 200 has a user interface acces sible from a LAN-attached device such as client(s) 206.

A travel along a different path. Accordingly, each path


could be a separate source for packet capture module 302 to

obtain packets. Additionally, packet data may be extracted from a packet by removing the packets header and check
sum.

[0031] Clients 206 are endpoints or customers wishing to affect or otherwise manage a communication in the proffered architecture. The term client may be inclusive of devices
used to initiate a communication, such as a computer, a per

[0035]

When an object is transmitted, such as an email

attachment, it is broken down into packets according to vari


ous data transfer protocols such as Transmission Control

sonal digital assistant (PDA), a laptop or electronic notebook,


a cellular telephone, or any other device, component, ele

Protocol/Internet Protocol (TCP/IP), UDP, HTTP, etc. An


object assembly module 304 reconstructs the original or a

ment, or object capable of initiating voice, audio, or data

reasonably equivalent document from the captured packets.


For example, a PDF document broken down into packets before being transmitted from a network is reassembled to

exchanges within the proffered architecture. The endpoints


may also be inclusive of a suitable interface to the humanuser,
such as a microphone, a display, or a keyboard or other

terminal equipment. The endpoints may also be any device


that seeks to initiate a communication on behalf of another entity or element, such as a program, a database, or any other

component, device, element, or object capable of initiating a


voice or a data exchange within the proffered architecture. Data, as used herein in this document, refers to any type of numeric, voice, or script data, or any type of source or object code, or any other suitable information in any appropriate format that may be communicated from one point to another.

form the original, or reasonable equivalent of the, PDF from the captured packets associated with the PDF document. A complete data stream is obtained by reconstruction of mul tiple packets. The process by which a packet is created is beyond the scope of this application.

[0036]

In alternative embodiments, instead of being imple

mented in conjunction with (or included within) a router,


capture system 200 may be included as part of other network

appliances such as switches, gateways, bridges, loadbalanc


ers, servers, or any other suitable device, component, ele ment, or object operable to exchange information in a net

[0032] In operation, capture system 200 intercepts data


leaving a network [such as LAN 212]. In an embodiment, the

work environment. Moreover, these network appliances and/


or capture systems may include any suitable hardware,

capture system also intercepts data being communicated


internally to a network such as LAN 212. Capture system 200 can reconstruct documents leaving the network and store them in a searchable fashion. Capture system 200 is then used to search and sort through all documents that have left the network. There are many reasons why such documents may

software, components, modules, interfaces, or objects that


facilitate the operations thereof. This may be inclusive of

appropriate algorithms and communication protocols that facilitate the data mining and policy management operations
detailed herein.
[0037] One or more tables may be included in these net

be of interest, including: network security reasons, intellec tual property concerns, corporate governance regulations, and other corporate policy concerns. Example documents
include, but are not limited to, Microsoft Of?ce documents

work appliances (or within capture system 200). In other embodiments, these tables may be provided externally to
these elements, or consolidated in any suitable fashion. The tables are memory elements for storing information to be

(such as Word, Excel, etc.), text ?les, images (such as JPEG, BMP, GIF, PRIG, etc.), Portable Document Format (PDF) ?les, archive ?les (such as GZIP, ZIP, TAR, JAR, WAR, RAR, etc.), email messages, email attachments, audio ?les, video
?les, source code ?les, executable ?les, etc.

referenced by their corresponding network appliances. As


used herein in this document, the term table is inclusive of any suitable database or storage medium (provided in any

appropriate format) that is capable of maintaining informa


tion pertinent to the operations detailed herein in this Speci

US 2013/0254838 A1

Sep. 26, 2013

?cation. For example, the tables may store information in an

[0042] Protocol classi?cation helps identify suspicious


activity over non-standard ports. A protocol state machine is used to determine Which protocol is being used in a particular

electronic register, diagram, record, index, list, or queue. Alternatively, the tables may keep such information in any
suitable random access memory (RAM), read only memory

netWork activity. This determination is made independent of


the port or channel on Which the protocol is active. As a result,

(ROM), erasable programmable ROM (EPROM), electroni cally erasable PROM (EEPROM), application speci?c inte
grated circuit (ASIC), softWare, hardWare, or in any other suitable component, device, element, or object Where appro priate and based on particular needs.
[0038] FIG. 4 illustrates a more detailed embodiment of

the capture system recogniZes a Wide range of protocols and

applications, including SMTP, FTP, HTTP, P2P, and propri


etary protocols in client-server applications. Because proto col classi?cation is performed independent of Which port number Was used during transmission, the capture system
monitors and controls tra?ic that may be operating over non

object assembly module 304 and FIG. 4 is discussed in con junction With some of the internal components of FIG. 3 for

standard ports. Non-standard communications may indicate


that an enterprise is at risk from spyWare, adWare, or other malicious code, or that some type of netWork abuse or insider

purposes of explanation. This object assembly module includes a re-assembler 400, protocol demultiplexer (de mux) 402, and a protocol classi?er 404. Packets entering the
object assembly module 304 are provided to re-assembler 400. Re-assembler 400 groups (assembles) the packets into at
least one unique How. A TCP/IP ?oW contains an ordered

threat may be occurring.

[0043] Object assembly module 304 outputs each ?oW,

organiZed by protocol, representing the underlying objects


being transmitted. These objects are passed to object classi
?cation module 306 (also referred to as the content classi ?er) for classi?cation based on content. A classi?ed How

sequence of packets that may be assembled into a contiguous data stream by re-assembler 400. An example How includes
packets With an identical source IP and destination IP address and/or identical TCP source and destination ports. In other

may still contain multiple content objects depending on the protocol used. For example, a single ?oW using HTTP may
contain over 100 objects of any number of content types. To

Words, re-assembler 400 assembles a packet stream (?oW) by


sender and recipient. Thus, a How is an ordered data stream of
a single communication betWeen a source and a destination.

deconstruct the How, each object contained in the How is

individually extracted and decoded, if necessary, by object


classi?cation module 306. [0044] Object classi?cation module 306 uses the inherent

In an embodiment, a state machine is maintained for each TCP connection, Which ensures that the capture system has a clear picture of content moving across every connection.

[0039]

Re-assembler 400 begins a neW ?oW upon the obser

properties and/or signature(s) of various documents to deter mine the content type of each object. For example, a Word
document has a signature that is distinct from a PoWerPoint document or an email. Object classi?cation module 306

vation of a starting packet. This starting packet is normally de?ned by the data transfer protocol being used. For example,
the starting packet of a TCP How is a SYN packet. The How terminates upon observing a ?nishing packet (e.g., a Reset or FIN packet in TCP/IP) or via a timeout mechanism if the ?nished packing is not observed Within a predetermined time constraint.

extracts each object and sorts them according to content type. This classi?cation prevents the transfer of a document Whose ?le extension or other property has been altered. For example, a Word document may have its extension changed from .doc

[0040] A How assembled by re-assembler 400 is provided to a protocol demultiplexer (demux) 402. Protocol demux
402 sorts assembled ?oWs using ports, such as TCP and/or

to .dock but the properties and/or signatures of that Word document remain the same and detectable by object classi? cation module 3 06. In other Words, object classi?cation mod

ule 306 functions beyond simple extension ?ltering.


[0045] According to an embodiment, a capture system uses
one or more of six mechanisms for classi?cation: 1) content

UDP ports, by performing speculative classi?cation of the


?oWs contents based on the association of Well-knoWn port

numbers With speci?ed protocols. For example, because Web Hyper Text Transfer Protocol (HTTP) packets, such as, Web tra?ic packets, are typically associated With TCP port 80,
packets that are captured over TCP port 80 are speculatively classi?ed as being HTTP. Examples of other Well-knoWn

signature; 2) grammar analysis; 3) statistical analysis; 4) ?le classi?cation; 5) document biometrics; and 6) concept maps.
Content signatures are used to look for prede?ned byte strings or text and number patterns (i.e., Social Security numbers, medical records, and bank accounts). When a signature is recogniZed, it becomes part of the classi?cation vector for
that content. While bene?cial When used in combination With

ports include TCP port 20 (File Transfer Protocol (FTP), TCP port 88 (Kerberos authentication packets), etc. Thus, protocol demux 402 separates the ?oWs by protocols.
[0041] A protocol classi?er 404 further sorts ?oWs. Proto col classi?er 404 (operating in either parallel or in sequence to

other metrics, signature matching alone may lead to a high number of false positives. [0046] Grammar analysis determines if an obj ects content
is in a speci?c language and ?lters accordingly based on this information, Various types of content have their oWn gram
mar or syntax. For example, C source code uses if/then

protocol demux 402) applies signature ?lters to a How to identify the protocol based solely on the transported data. Protocol classi?er 404 uses a protocols signature(s) (i.e., the characteristic data sequences of a de?ned protocol) to verify

grammar. Legal documents, resumes, and earnings results


also have a particular grammar. Grammar analysis also enables an organiZation to detect the presence of non-English
language-based content on their netWork.

the speculative classi?cation performed by protocol demux


402. If protocol classi?er 404 determines that the speculative classi?cation is incorrect, it overrides it. For example, if an
individual or program attempted to masquerade an illicit

[0047]

File classi?cation identi?es content types regardless

communication (such as ?le sharing) using an apparently

benign port (for example, TCP port 80), protocol classi?er


404 Would use the HTTP protocol signature(s) to verify the

of the extensions applied to the ?le or compression. The ?le classi?cation mechanism looks for speci?c ?le markers instead of relying on normal telltale signs such as .xls or .pdf.
Document biometrics identi?es sensitive data even if the data

speculative classi?cation performed by protocol demux 402.

has been modi?ed, Document biometrics recogniZes content

US 2013/0254838 A1

Sep. 26, 2013

rich elements in ?les regardless of the order or combination in Which they appear. For example, a sensitive Word document
may be identi?ed even if text elements inside the document or

simply a place to deposit the captured objects. The indexing


of the objects stored in content store 502 is accomplished using tag database 500. Tag database 500 is a database data structure in Which each record is a tag that indexes an object
in content store 502 and contains relevant information about

the ?le name itself have been changed. Excerpts of larger ?les, e.g., a single column exported from an Excel spread sheet containing Social Security numbers, may also be iden ti?ed. Document biometrics takes snapshots of protected
documents in order to build a signature set for protecting

the stored object. An example of a tag record in tag database


500 that indexes an object stored in content store 502 is set forth in Table 1:

them, In an embodiment, document biometrics distinguishes


betWeen public and con?dential information Within the same document.
TABLE 1
Field Name
MAC Address Source IP

[0048] Statistical analysis assigns Weights to the results of signature, grammar, and biometric analysis. That is, the cap
ture system tracks hoW many times there Was a signature, grammar, or biometric match in a particular document or ?le.

De?nition (Relevant Information)


NIC MAC address Source IP address ofobject

This phase of analysis contributes to the systems overall


accuracy. [0049] Concept maps may be used to de?ne and track com

Destination IP Source Port Destination Port Protocol Instance

Destination IP address of object Source port number ofobject Destination port number of the object Protocol that carried the object Canonical count identifying object Within a protocol

plex or unique content, Whether at rest, in motion, or captured.


Concept maps are based on combinations of data classi?ca
Content

capable of carrying multiple data Within a single


TCP/IP connection

Content type of the object

tion mechanisms and provide a Way to protect content using

Encoding
Size Timestamp OWner

Encoding used by the protocol carrying object


Size ofobject Time that the object Was captured User requesting the capture of object (possibly rule

compound policies. Object classi?cation module 306 may


also determine Whether each object should be stored or dis carded. This determination is based on de?nable capture rules

used by object classi?cation module 306. For example, a


capture rule may indicate that all Web traf?c is to be dis carded. Another capture rule may indicate that all PoWerPoint documents should be stored except for ones originating from the CEOs IP address. Such capture rules are implemented as regular expressions or by other similar means. [0050] Capture rules may be authored by users of a capture

author)
Con?guration Signature Tag Signature
Attribute

Capture rule directing the capture of object Hash signature of object Hash signature of all preceding tag ?elds
One or more attributes related to the object

system and, further, may include virtually any item (in addi
tion to those items discussed herein). The capture system may
also be made accessible to any netWork-connected machine through netWork interface module 300 and/ or user interface
310. In one embodiment, user interface 310 is a graphical user interface providing the user With easy access to the various

[0053] There are various other possible tag ?elds and some tag ?elds listed in Table 1 may not be used. In an embodiment, tag database 500 is not implemented as a database and another

data structure is used. The mapping of tags to objects may be obtained by using unique combinations of tag ?elds to con
struct an objects name. For example, one such possible com bination is an ordered list of the source IP, destination IP,

features of capture system 312. For example, user interface 310 may provide a capture rule-authoring tool that alloWs any capture rule desired to be Written. These rules are then applied

source port, destination port, instance, and timestamp. Many other such combinations, including both shorter and longer
names, are possible. A tag may contain a pointer to the storage

by object classi?cation module 306 When determining


Whether an object should be stored. User interface 310 may

also provide pre-con?gured capture rules that the user selects from along With an explanation of the operation of such

location Where the indexed object is stored. The tag ?elds shoWn in Table 1 can be expressed more generally, to empha size the underlying information indicated by the tag ?elds in various embodiments. Some of the possible generic tag ?elds
are set forth in Table 2:

standard included capture rules. Generally, by default, the

capture rule(s) implemented by object classi?cation module


306 captures all objects leaving the netWork that the capture
system can access. If the capture of an object is mandated by one or more capture rules, object classi?cation module 306 may determine Where in object store module 308 the captured
Field Name

TABLE 2
De?nition

object should be stored.


[0051] FIG. 5 illustrates an embodiment of object store

module 308. According to this embodiment, the object store


includes a tag database 500 and a content store 502. Within

Device Identity Source Address Destination Address Source Port Destination Port Protocol Instance
Content

Identi?er of capture device Origination Address of object Destination Address of object Origination Port ofobject Destination Port of the object Protocol that carried the object Canonical count identifying object Within a protocol capable of carrying multiple data Within a single
connection

content store 502 are ?les 504 grouped by content type. For

example, if object classi?cation module 306 determines that


an object is a Word document that should be stored it can store

Content type of the object

Encoding
Size Timestamp OWner

Encoding used by the protocol carrying object


Size ofobject Time that the object Was captured

it in ?le 504 reserved for Word documents. Object store


module 506 may be internal to a capture system or external

(entirely or in part) using, for example, some netWork storage technique such as netWork attached storage (NAS), storage
area netWork (SAN), or other database.

Con?guration

User requesting the capture of object (rule author) Capture rule directing the capture of object

Signature
Tag Signature
Attribute

Signature of object
Signature of all preceding tag ?elds
One or more attributes related to the object

[0052]

In regards to the tag data structure, in an embodi

ment, content store 502 is a canonical storage location that is