Académique Documents
Professionnel Documents
Culture Documents
8,538,989 B1*
2002/0083039 A1
2004/0093321 A1
2006/0026147 A1
2007/0083497 A1 *
2007/0260594 A1
2008/0010273 A1
2008/0126143 A1
Notice:
(52)
3/2009
101341464 A
W0 0131479
1/2009
5/2001
(Continued)
(2006.01)
Marie-Catherine De Marneffe, Christopher D. Manning, Stanford
............... .. 707/737; 707/776
(Continued)
References Cited
5,666,502 A
9/1997 Capps
5,946,647 A
6,513,036 B2
6,832,218 B1
6,847,959 B1
6,944,612 B2
7,562,076 B2
7,565,139
7,761,414
7,818,324
7,895,196
7,933,900
8,005,720
8,086,604
(Continued)
OTHER PUBLICATIONS
USPC
(58)
4/2007
2009/0224867 A1
CN
W0
Int. Cl.
G06F 17/30
US. Cl.
(51)
9/2013
2009/0063500 A1*
(22) Filed:
US 8,954,438 B1
B2
B2
B1
B2
B2
B2
B2
(57)
ABSTRACT
7/2009 Kapur
?rst element list can be extracted from the one or more docu
7/2009
7/2010
10/2010
2/2011
4/2011
8/2011
12/2011
8,285,595 B2 *
10/2012
8,286,885 B1
8,316,029 B2
HEURISTIC OR
PATTERN BASED
EXTRACTION?
PATTERN
HEURISTIC
BASED
EXTRACT AN ELEMENT
EXTRACT AN ELEMENT
312
E14
51.0
US 8,954,438 B1
Page 2
(56)
2010/0057694
2010/0070448
2010/0241416
2011/0137883
2011/0184981
2011/0202493
2012/0101858
2012/0101901
2012/0150572
2012/0246153
2012/0330906
2013/0054542
2013/0110833
2013/0232147
2013/0268314
References Cited
OTHER PUBLICATIONS
WO 2006110480
WO 2010120925
10/2006
10/2010
2010,
available
at
http://blog.facebook.com/blog.php
?post:443390892130, pp. 1-2.
Dekang Lin, Dependency-based Evaluation of Minipar, In Work
shop on the Evaluation of Parsing Systems, May 1, 1998, 14 pages,
Granada, Spain.
Rion Snow, Daniel Jurafsky, Andrew Y. Ng, Learning Syntactic
Patterns for Automatic Hypernym Discovery, 2011, 8 pages.
Veselin Stoyanov, Claire Cardie, Nathan Gilbert, Ellen Riloff, David
Buttler, David Hysom, Reconcile: A Coreference Resolution
Research Platform, May 13, 2010, 14 pages.
nouncements/announcing-tripit-?rst-intelligent-travel-organizer
do-it-yourself-trip, pp. 1-2.
* cited by examiner
US. Patent
Sheet 1 0f7
FIG. 1
US 8,954,438 B1
US. Patent
US 8,954,438 B1
Sheet 2 0f 7
m?w@z5rk2exw:
NNN
@mmN
AREDeQNm
w29E0m53:
wmi?th. EOMm
E8\53
w .makn
@>m1e2sFXz
US. Patent
Sheet 4 0f7
US 8,954,438 B1
m
AEQEEE QNE (3R MGRE
DOCUMENTS
%
AcEEEE (ME OR MGRE
{DOCUMENTS
RENDER DGCUMENTS
%
STGRE RENDERED DGCUIVIENTS
EXTRACT A PLURALETY OF
ENTITY NAMES
FIG. 4
E
CLUSTER DGCUMENTS BY HOST
&
LIST
FIG. 5
m
EXTRACT ELEMENT LIST USING
HEURISTIC RULES
EXTRAGT A SECGNG
ELEMENT LIST USING
PATTERN BASED RULES
m
GENERATE ELEMENT LIST
PATTERN
%
EXTRACT ELEMENT LIE? USING
PATTERN BASED RULEE
FIG. 6
FIG. 7
US. Patent
Sheet 5 0f7
US 8,954,438 B1
HEURESTTC QR
PATTERN BAEED
EXTRACTECPN?
PATTERN
BASED
HEURTSTEC
EXTRACT AN ELEMENT
LTST USENG HEURESTEC
RULES
EXTRACT AN ELEMENT
LTST USENG PATTERN
BASED RULES
i
GENERATE ELEMENT LEST
PATTERN
FIG. 8
US. Patent
Sheet 6 0f7
US 8,954,438 B1
?
902
\\
http:/iwwwwmdbxsm/FunnyShaw/FSOGU5{episodes
_
.x
YI
Seascm 5
Season H
Samara 3E5
$eason iii
Epismii-z i
Episcade ii
Episode iii
Episacie 5V
Epigcacie V
Episude Vi
Episode Vii
Episede Vii!
Episacie 5X
Episode X
931:) A
<HTML ciass>
<head>
932
<titie>
FIG. 9
US. Patent
Sheet 7 0f 7
US 8,954,438 B1
166
WEB UQCUMENT
NQQE <head>
NOSE <body>
NQDE <titie>
NGUE <div>
NQQE Text:
NQEBE <ii>
|
NQBE Text:
Eiemem; List
FIG. 10
NOSE Attribute
US 8,954,438 B1
1
BACKGROUND
20
a plurality of entity names from the one or more documents 30 suitable media item or series thereof, or any combination
thereof. An entity can have one or more associated elements
from one of the plurality of ho sts using an entity name pattern,
such as, for example, an episode, a song, any other suitable
extract a ?rst element list from one or more documents based
ment list pattern based at least in part on the ?rst element list,
35
and extract a second element list from the one or more docu
40
45
50
implementations;
55
used to extract an entity name from each web page title. Web
page titles can be clustered by ho st, and then an entity name is
some implementations;
tations;
65
US 8,954,438 B1
3
able serial bus and/ or parallel bus, any other suitable compo
dered web pages 216. For example, server 110 can cluster
20
such as hard disks. In a further example, server 110 can 25 name. In some implementations, one or more components of
35
entity.
all of web servers 112, 114, and 116, any other suitable
device, or any combination thereof, via network 120. In some
implementations, network 120 can include several network
scales such as, for example, a local area network (LAN), a
wide area network (WAN), or other network, which may be
example, server 110 and web servers 112, 114, and 116 can
communicate using an intemet protocol such as transmission
mation from one or more of web servers 112, 114, and 116. In
some implementations, one or more of web servers 112, 114,
and 116 can be con?gured to push information to server 110.
In some implementations, server 110 can be con?gured as a
client, and one or more ofweb servers 112, 114, and 116 can
50
rendered web pages 302. Step 310 can include server 110
distinguish, sort and cluster web pages for entity XXX 312,
web pages for entityYYY 314, and web pages for any other
suitable entities, of rendered web pages 302 by entity name.
65
list for entity XXX 322, an element list for entityYYY 324, an
element list for any other suitable entity, or any combination
US 8,954,438 B1
5
or more documents. For example, server 110 can use the URL
tion 350, or both. For example, server 110 can perform heu
pattern
e.g., an element list for entity XXX 322 or an element list for
entity YYY 324, can include one or more elements, e. g.,
episodes, songs, or other elements, one or more groups of
regexp:Ahttp://www\.wmdb\.com/title/tt\+d/epi
dation 330 for the element list for entity XXX 322, the ele
ment list for entityYYY 324, any other suitable element list,
one document).
an element list, for the same entity, from one or more other
hosts. For example, server 110 can compare the element list
for entity XXX 322, derived from host ZZZ.com, with another
element list for entity XXX from another host uuu.com.
Cross-validation aids in preventing errors, inconsistencies,
and/or low con?dence in the generated element list for an
entity. For example, if an episode IX of entity XXX is deter
Step 406 includes server 110 storing the one or more ren
mined to appear in more than one element list from more than
thereof.
FIG. 5 is a ?ow diagram 500 of illustrative steps for extract
ing an entity name, in accordance with some implementations
of the present disclosure.
30
110 can cluster one or more web pages for host uuu.com and
110 can perform further processing on the web pages for each
aggregation 340.
45
order, steps 310, 320, 330, and 340. Server 110 can then
perform a second run by performing, in order, steps 310, 350,
55
example, for a particular host, server 110 can use the pattern
(.*)\(\d+\)-Episode List to extract the entity name for a
television show (e. g., in which the * character represents the
mentations, the entity name pattern can take the form PRE
FIX(.*)SUFFIX to distill an entity name from a web page,
especially when the same PREFIX and SUFFIX are used by
the host for multiple entities. The entity name pattern can be
US 8,954,438 B1
8
7
FIG. 6 is a ?ow diagram 600 of illustrative steps for extract
can search for a title node for each rendered web page using,
25
30
sode from each web page. For example, server 110 can use an
episode title of a television show as an identi?er.
35
server 110 can extract the episode title Death has a shadow
45
50
55
sure.
US 8,954,438 B1
9
10
example, server 110 can extract the episode title Death has a
20
25
that particular episode. Server 110 can search for a title node
than one element list at step 814. For example, server 110 can
50
55
tree path. It will be understood that web page 900 can have a
60
different layout than that shown in FIG. 9, and that the tech
niques disclosed herein can be applied to any suitable web
page, which can be rendered in any suitable manner.
65
US 8,954,438 B1
11
12
element node;
20
25
node are an attribute node and a text node including the list
text. Server 110 may use an element list pattern to extract the
30
35
40
9. A system comprising:
45
pattern;
heuristic rules;
rules;
validating the ?rst element list based at least in part on a
comparison of the ?rst element list with one or more
reference lists, wherein the ?rst element list and the one
1. A method comprising:
accessing, using processing equipment, one or more docu
65
US 8,954,438 B1
14
13
generate, using the processing equipment, an element
10
ments by language.
20