
Text Data Mining: A Case Study

Charles Wesley Ford, Chia-Chu Chiang, Hao Wu, Radhika R. Chilka, and John R. Talburt*

Department of Computer Science
University of Arkansas at Little Rock
2801 South University Avenue, Little Rock, Arkansas 72204-1099, USA
E-mail: cwford@ualr.edu

*Acxiom Corporation
#1 Information Way, P.O. Box 8180, Little Rock, Arkansas 72202-2289, USA
E-mail: John.Talburt@acxiom.com

Abstract

A vast amount of data is available on the web, and many companies have started to mine this data to augment datasets used in production. This paper presents an industry case study of one such text data mining effort. The study and analysis of the data extracted from the web can be used to help improve products and services. A prototype has been completed and benchmarked in the laboratory.

1. Introduction

Since the World Wide Web (WWW) is rapidly emerging as an important resource for information retrieval, the majority of human information is predicted to be available on the Web within ten years [6]. We would further argue that the accelerating development of technology in this area has shortened that time frame considerably. Unfortunately, the vast range of information content, quality, and format presents formidable challenges to the creation of automated processes for finding, extracting, and evaluating Web data.

This paper focuses on an important technique for evaluating the quality of Web data, in which the retrieved information is evaluated against a trusted information knowledgebase (called an Index). However, it is often the case that both the extracted Web data and the information recorded in the Index are incomplete and exhibit some degree of variation in both content and format. Moreover, the Index is typically very large, and a query against it, especially an approximate query, may return a very large result set that must be further analyzed in order to arrive at a final assessment of the correctness (vis-à-vis the Index) of the Web data upon which the query was based. One difficulty arises from the lack of a priori data with which one can evaluate the extracted data for correctness.

2. Related work

In order to extend the applicability of traditional databases, new techniques have been proposed to deal with unknown, uncertain, undefined, or imprecise query results. In [5], the authors address null values in an entity: an entity with null values is defined as carrying incomplete information in the database. A piece of incomplete information raises a processing problem, which may yield only partial results based on the incomplete information. Semantically meaningful extensions to the relational operators, such as projection, selection, union, and join, have been proposed to evaluate relational expressions over databases with null values in a semantically correct way.

The handling of uncertainty in databases has been recognized as a topic worthy of investigation. In [9, 10], a database is defined to have ordinary entity sets, each defined as a collection of fuzzy entities. A fuzzy entity may have uncertain or imprecise attribute values represented by possibility distributions; however, no membership degree is associated with the entities. In [1], a database is defined to contain fuzzy entity sets, and a membership degree in the range [0, 1] is assigned to each entity; however, no possibility distributions are assigned to the attribute values. In [7, 13], a fuzzy database consists of fuzzy entity sets, each containing a set of fuzzy entities; each fuzzy entity thus has both a membership degree and fuzzy attribute values represented by possibility distributions. The approach used in [7, 13] is more natural than those used in [1, 9, 10], since it defines databases by combining fuzzy set theory [14] with database technology.

In [14], Zadeh introduced the concept of a fuzzy set, a set whose boundary is not sharp or precise, in contrast to a crisp set, whose boundary is required to be precise. In a fuzzy set, a member is no longer definitely in or definitely out of the set; it may belong to the set to a greater or lesser degree. Because membership in a fuzzy set is not a matter of affirmation or denial (yes or no, 1 or 0), each member is assigned a grade of membership between 0 and 1, and a membership function may be defined to assign a membership degree to each member. Fuzzy set theory in databases is concerned with the uncertainty resulting from the imprecision of a concept expressed by a linguistic term in a query, such as "young" or "old". For example, querying a database for the "old" persons is an example of fuzziness.

Probability theory, which deals with the uncertainty of expectation, can be applied to fuzzy sets to derive an outcome. For example, we may be interested in the probability that a person named "John J. Doe" is the same person as "John James Doe". Here, the sense of uncertainty revolves around making a prediction about an event, which falls in the realm of probability theory. Discussions of the application of probability theory to databases can be found in [4, 8, 11, 12].
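
To make the idea of graded membership concrete, the following is a minimal illustrative sketch in Python (our own, not drawn from [14]) of a membership function for the linguistic term "old"; the breakpoints are arbitrary assumptions:

    def membership_old(age: float) -> float:
        """Illustrative membership function for the fuzzy set "old".

        Ages up to 50 are definitely not "old" (degree 0), ages of 70
        and above are definitely "old" (degree 1), and ages in between
        belong to the set to an intermediate degree. The breakpoints
        50 and 70 are arbitrary choices for illustration.
        """
        if age <= 50:
            return 0.0
        if age >= 70:
            return 1.0
        return (age - 50) / 20.0  # linear ramp between the breakpoints

    # A crisp set answers yes or no; the fuzzy set answers with a grade:
    assert membership_old(45) == 0.0
    assert membership_old(60) == 0.5
    assert membership_old(80) == 1.0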

3. System overview

The system for web data mining, shown in Figure 1, consists of five phases: web site identification, content selection, information retrieval, approximate data query, and text data statistical analysis.

[Figure 1. Overview of the system: web sites from the WWW pass through the web site identification, content selection, information retrieval, and approximate query phases; the text data statistical analysis phase evaluates the results against the Authoritative Source and the Trusted Index.]

In the web site identification phase, web sites corresponding to the subject matter are semi-automatically chosen and compiled into a list, using criteria such as the availability, relevance, and importance of the web sites to our goal. Keyword searching using existing web search engines is one tool used to compile the initial list of web sites. This list is then filtered.

The next step, content selection, is to determine which web pages are relevant to our subject. A web site usually contains many web pages, only some of which may contain relevant information. The selection of web pages is based on the kind of information we are looking for. For example, if we are retrieving a specific type of announcement, the system should be able to determine which pages contain data of interest.

For the information retrieval phase, a crawler has been designed to automatically extract information from the selected web pages. The web crawler reads in a list of web sites, downloads the web pages, and follows the hyperlinks to other pages. It keeps a copy of all visited pages in order to avoid repeatedly accessing the same pages and to identify dead links. Data of interest may be embedded in phrases, sentences, or paragraphs. The pattern matcher uses a pattern matching technique described in Hashemi [3] to locate the patterns containing the required data in the text and then retrieves the data from those patterns. The retrieved data are stored in a file for validation against the Index.
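
The crawler itself is not listed in the paper; the following is a minimal sketch of the behavior described above, written in Python with only the standard library. The breadth-first strategy, the page limit, and the timeout are our assumptions.

    import urllib.request
    import urllib.error
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collects the href targets of anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=100):
        """Breadth-first crawl returning fetched pages and dead links."""
        queue = deque(seed_urls)
        visited, pages, dead_links = set(), {}, []
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in visited:           # avoid repeatedly accessing
                continue                 # the same pages
            visited.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except (urllib.error.URLError, ValueError):
                dead_links.append(url)   # identify dead links
                continue
            pages[url] = html            # keep a copy of the visited page
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:    # follow hyperlinks to other pages
                queue.append(urljoin(url, link))
        return pages, dead_links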
In the approximate data query phase [2], the extracted data fragments are used to evaluate the quality of the Web data: the retrieved information is evaluated against a Trusted Index. However, it is often the case that both the extracted Web data and the information recorded in the Trusted Index are incomplete and exhibit some degree of variation in both content and format. Moreover, the Trusted Index is typically very large, and a query against it, especially an approximate query, may return a very large result set that must be further analyzed in order to arrive at a final assessment or confidence value. Therefore, for each record in the result set, the system assigns a confidence level indicating the match level and the implied confidence that the record is the one matching the actual record the input implied. In the text data statistical analysis phase, the extracted data fragments with confidence levels are evaluated against an Authoritative Source, which is presumed to contain complete and correct information. In this paper, we present an industry project for text data mining and show the results of the statistical analysis.

4. Case Study

The goal of this study is to extract information related to events that are published on public web sites. An event could be a marriage, an obituary, a birth, etc. This study focuses on obtaining obituary data intended for use in preventing fraud in which someone assumes the identity of a recently deceased individual.

The primary Authoritative Source for deceased information is the list of deceased individuals published by the Social Security Administration. However, it may take months from the time a death is reported until the time it is published, often too late for effective fraud prevention. On the other hand, information posted on web sites may be current within a few days. Therefore, for the purposes of this case study, web sites were selected with information that was at least one year old, so that enough time had elapsed for it to also appear in the Authoritative Source.

Obituary sites of suitable age were discovered on the web and classified into three categories: A, B, and C. Web sites in category A contain formatted pages, making them the easiest from which to extract data. For these web pages, the obituary announcements were collected programmatically, and a pattern matcher was then employed to extract the data with little loss. Category B sites are usually unformatted, requiring special text analysis techniques to ensure the relevant data are extracted, since the relevant information may appear anywhere in the web page. Category C contains web pages such as newsgroups, forums, and mailing lists. Currently, the system focuses only on Category A.

In many cases, each extracted data fragment consists of a first name, a middle name or initial, a last name, a city, and the date of death. These are usually not complete enough for application to fraud prevention because they lack a complete address. One way to complete the missing information is to use a Trusted Index of name and address information. The name and address fragments extracted from the text are matched against the Trusted Index to create a set of name and address candidates, where each candidate matches one or more of the data fragments. For example, the fragment "John Jones, age 80, of Anytown, Arkansas" might match to "J. Jones, age 75, of 123 Oak St"; "John A. Jones, age not given, of 567 Elm Street"; and "John B. Jones, age 82, of 99 Pine St", all in Anytown, Arkansas, based on the Trusted Index.
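
The production matching rules are not spelled out in the paper; the sketch below shows one way the fragment above could be matched against Trusted Index records. The field names, the initial-matching rule, and the five-year age tolerance are illustrative assumptions (the tolerance is chosen so that the example reproduces).

    def candidate_matches(fragment, index_records, age_tolerance=5):
        """Return Trusted Index records compatible with an obituary fragment."""
        def names_consistent(a, b):
            # Equal names, or one is an initial of the other (assumption).
            a, b = a.rstrip(".").lower(), b.rstrip(".").lower()
            return a == b or (a[:1] == b[:1] and (len(a) == 1 or len(b) == 1))

        candidates = []
        for rec in index_records:
            if rec["last"].lower() != fragment["last"].lower():
                continue
            if rec["city"].lower() != fragment["city"].lower():
                continue
            if not names_consistent(rec["first"], fragment["first"]):
                continue
            if rec.get("age") is not None and fragment.get("age") is not None:
                if abs(rec["age"] - fragment["age"]) > age_tolerance:
                    continue
            candidates.append(rec)
        return candidates

    # The "John Jones, age 80, of Anytown, Arkansas" example from the text:
    fragment = {"first": "John", "last": "Jones", "city": "Anytown", "age": 80}
    index = [
        {"first": "J", "last": "Jones", "city": "Anytown", "age": 75,
         "address": "123 Oak St"},
        {"first": "John", "middle": "A", "last": "Jones", "city": "Anytown",
         "age": None, "address": "567 Elm Street"},
        {"first": "John", "middle": "B", "last": "Jones", "city": "Anytown",
         "age": 82, "address": "99 Pine St"},
    ]
    print([c["address"] for c in candidate_matches(fragment, index)])
    # ['123 Oak St', '567 Elm Street', '99 Pine St']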
Since the result set contains uncertainties, each record in the set is assigned a confidence level between 0 and 100 inclusive. Pseudo-code for the confidence function is given in Figure 5.

Finally, the result set was compared to the Authoritative Source in order to determine which of the candidate records actually corresponded to the name and address of an individual reported as deceased, and to correlate these findings with the assigned confidence levels.

5. Results
In this study, we conducted five experiments for text mining from five different web sites. Figure 2 shows the number of announcements appearing on each web site and the number of those extracted successfully by the grabber. Each announcement contains information about a deceased person. For Run 5, N/A indicates that the exact number of announcements on the web site is unknown, since that site is updated constantly. Our grabber extracts the announcements from each web site and stores the extracted data fragments in files. The results show that the grabber can extract more than 90% of the announcements.

[Figure 2. Number of announcements grabbed: per-run bar chart of announcements grabbed versus not grabbed, Runs 1 through 4.]

Once the announcements of a web site have been grabbed into a file, we invoke the pattern matcher to extract the required information, such as the names and cities, and store it in another file. Figure 3 shows the results of the pattern matcher. Each data fragment contains the first name, the middle name, the last name, and the city of the deceased person. Our pattern matcher can extract the required data from approximately 95% or more of the grabbed announcements, except for Run 5.
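
The actual extraction follows the technique of Hashemi [3], which is not reproduced here. As a simple stand-in, the sketch below shows pattern-based extraction from a formatted (Category A) announcement; the announcement layout and the regular expression are assumptions for illustration.

    import re

    # Assumed layout of a formatted announcement; real Category A sites vary.
    ANNOUNCEMENT = "Jones, John A., 80, of Anytown, Arkansas, died March 3, 2003."

    # One illustrative pattern: "Last, First [M.], age, of City, State, ..."
    PATTERN = re.compile(
        r"(?P<last>[A-Z][a-z]+),\s+"
        r"(?P<first>[A-Z][a-z]+)"
        r"(?:\s+(?P<middle>[A-Z])\.)?,\s+"
        r"(?P<age>\d{1,3}),\s+of\s+"
        r"(?P<city>[A-Z][A-Za-z ]+?),\s+"
        r"(?P<state>[A-Z][A-Za-z ]+?),"
    )

    def extract_fragment(text):
        """Return a name/city data fragment, or None if no pattern matches."""
        m = PATTERN.search(text)
        return m.groupdict() if m else None

    print(extract_fragment(ANNOUNCEMENT))
    # {'last': 'Jones', 'first': 'John', 'middle': 'A', 'age': '80',
    #  'city': 'Anytown', 'state': 'Arkansas'}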

[Figure 3. Number of data fragments extracted: per-run bar chart of data fragments extracted versus not extracted, Runs 1 through 5.]

Since each data fragment is not complete enough for use, we group each data fragment of the deceased person into a query. We then supplement the data fragments by adding addresses to them through queries against the Trusted Index. Some queries return an empty set; others return multiple records. Figure 4 shows the results of this process, in which 85% or more of the queries return non-empty result sets. For these records, we assign confidence levels. The algorithm for assigning the confidence levels is shown in Figure 5.

[Figure 4. Number of data fragments supplemented: per-run bar chart of data fragments supplemented versus not supplemented, Runs 1 through 5.]

Confidence(record[])
{
    for each record i {
        if firstname, lastname, and city match the query data
            confidence[i] = 80%;
        if age matches
            confidence[i] += 10%;
        else if age missing
            confidence[i] -= 10%;
        else
            confidence[i] -= 20%;
        if middle initial matches
            confidence[i] += 5%;
        if middle initial missing
            confidence[i] += 5% * adjustment factor;
        if middle initial wrong
            confidence[i] -= 5%;
        if different records with different consumer links have the same confidence
            then reduce the confidence by 5% for all records with equal or lower confidence;
    }
    return confidence[];
}

Figure 5. Confidence level assignment algorithm

In Figure 5, if a query returns a result set containing multiple records with the same link (special values assigned by the organization), then all records with equal or lower confidence levels are reduced by 5%. For example, if a person has lived in several different locations, the result set may contain multiple records having the same link but different addresses. In this situation, their confidence levels are adjusted by 5%. The adjustment factor seen in the algorithm is 0 in this study.
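
A runnable Python rendering of Figure 5 is sketched below. The field names, the query representation, and our reading of the tie rule on consumer links are assumptions; the scoring itself follows the figure, with the adjustment factor set to 0 in this study.

    from collections import defaultdict

    def assign_confidence(records, query, adjustment_factor=0.0):
        """Sketch of the Figure 5 confidence assignment (field names assumed)."""
        confidence = []
        for rec in records:
            score = 0
            # Base score when first name, last name, and city match the query.
            if (rec["first"] == query["first"]
                    and rec["last"] == query["last"]
                    and rec["city"] == query["city"]):
                score = 80
            # Age: +10 match, -10 missing, -20 wrong.
            if rec.get("age") is not None and rec["age"] == query.get("age"):
                score += 10
            elif rec.get("age") is None:
                score -= 10
            else:
                score -= 20
            # Middle initial: +5 match, +5 * adjustment factor if missing
            # (0 in this study), -5 wrong.
            if rec.get("middle") is not None and rec["middle"] == query.get("middle"):
                score += 5
            elif rec.get("middle") is None:
                score += 5 * adjustment_factor
            else:
                score -= 5
            confidence.append(score)

        # Tie rule, as we read it: if records with different consumer links
        # end up with the same confidence, reduce every record at or below
        # that confidence by 5.
        links_at = defaultdict(set)
        for rec, c in zip(records, confidence):
            links_at[c].add(rec["link"])
        for value, links in links_at.items():
            if len(links) > 1:
                confidence = [c - 5 if c <= value else c for c in confidence]
        return confidence

Note that the maximum attainable score, 80 + 10 + 5 = 95, matches the top [95] category used in the analysis below.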

After confidence levels are assigned to the records, we classify the queries into different categories based on their confidence levels for statistical analysis. Figure 6 shows the categories of the queries. For example, a query is assigned to the 95 category if it returns at least one record with a 95 confidence level in its result set. The exact number of queries in each category for each run is shown in Appendix 1.

[Figure 6. Categories of queries: for each run, the percentage distribution of queries across the confidence categories 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, (0-50), and 0.]
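
Read this way, the categorization amounts to bucketing each query by the highest confidence level in its result set, with an empty result set falling into the [0] category; a minimal sketch, under that assumption:

    def categorize_query(result_confidences):
        """Assign a query to the category named by the highest confidence
        of any record in its result set; [0] for an empty result set."""
        if not result_confidences:
            return "[0]"
        best = max(result_confidences)
        if best < 50:
            return "(0-50)"
        # Snap to the category floors 95, 90, ..., 50 (scores from the
        # Figure 5 algorithm land on multiples of 5 anyway).
        return f"[{min(95, (best // 5) * 5)}]"

    assert categorize_query([]) == "[0]"
    assert categorize_query([95, 70]) == "[95]"
    assert categorize_query([42]) == "(0-50)"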
For each result set of each query, we remove duplicate records from the set and match the remaining records against the Authoritative Source. Figure 7 shows, for each run, the number of records at each confidence level that hit or did not hit the Authoritative Source. Refer to Appendix 1 for the exact number in each category for each run.

[Figures 7a-7e. Number of records hit in runs 1 through 5: for each confidence category from [95] down to (0-50), the number of records that hit versus did not hit the Authoritative Source.]

From Figure 7, it appears that only the records with confidence level 95 have more hits than non-hits. The records with confidence levels below 95 seem not to be very promising, owing to the uncertainties in those records. The hit rates of the runs are shown in Figure 8. The records with confidence level 95 have an approximate hit rate of 80%; that is, 80% of the records with the 95 confidence level appear in the Authoritative Source.

[Figure 8. Hit rates of the five runs: hit rate by confidence category, from [95] down to (0-50), for Runs 1 through 5.]
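
The evaluation just described, duplicate removal followed by matching against the Authoritative Source and tallying hit rates per confidence category, can be sketched as follows. Using the consumer link both as the duplicate key and as the lookup key into the Authoritative Source is an illustrative assumption.

    from collections import defaultdict

    def hit_rates(result_sets, authoritative_source):
        """Per-category hit rates over all queries' result sets.

        result_sets: one list of (record, confidence) pairs per query.
        authoritative_source: a set of keys identifying deceased persons.
        """
        hits = defaultdict(int)
        totals = defaultdict(int)
        for results in result_sets:
            seen = set()
            for rec, conf in results:
                if rec["link"] in seen:        # remove duplicate records
                    continue
                seen.add(rec["link"])
                category = f"[{conf}]" if conf >= 50 else "(0-50)"
                totals[category] += 1
                if rec["link"] in authoritative_source:
                    hits[category] += 1        # record found: a hit
        return {cat: hits[cat] / totals[cat] for cat in totals}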

6. Observations

In this project, we have shown the experimental results of text data mining from obituary web sites. Figure 8 indicates that announcements of deceased persons on the web sites may not appear in the Authoritative Source. Although extracted records may be incomplete, those that scored a confidence of 95 were validated by the Authoritative Source 80% of the time. In some cases where the data were complete, a hit did not occur. This could suggest that the Authoritative Source may not be as complete as one might hope. Another possibility is that the announcements on the web sites do not truly reflect actual deceased persons.

7. Summary

In this project, we mined text data from five obituary web sites publishing notices approximately one year old. The goal of this study was to evaluate how often and how accurately name and address fragments extracted from these notices developed into complete name and address information corresponding to the deceased individual. The fragments were used to select candidate records from a large Trusted Index of name and address information. The text data mined from the obituary web sites usually contains the name, city, and state of the deceased person; the street address is usually not given in the announcement. Records from the Trusted Index were used to try to complete the address fragment. However, the records from the Trusted Index can also add uncertainty to the mined text data, because several different records can match the same fragment. In order to classify the extent of the uncertainty, we assigned different confidence levels to candidate records based on the degree of match to the fragment and the amount of confirmation or contradiction among the candidates. In theory, the higher the confidence level assigned to a candidate record, the more likely it is to appear in the Authoritative Source. We then conducted experiments to show how well our confidence level assignments predicted this result. We have not yet determined how adjusting the weights of the various factors used to compute the confidence level would affect the results.

We used an Authoritative Source, which was presumed to contain complete and accurate information on deceased individuals, and used our mined text fragments to extract candidate name and address records from a Trusted Index. Statistical analysis of the results shows that candidate records with a confidence level of 95 are found in the Authoritative Source approximately 80% of the time; that is, even among the records whose names, addresses, and cities matched the data stored in our Index at the highest confidence level, only 80% matched the Authoritative Source, and the other 20% do not appear in it. We have concluded that additional attributes, when available, may help us refine the values assigned as confidence.

8. References

[1] J. F. Baldwin, "A Fuzzy Relational Inference Language for Expert Systems," Proceedings of the 13th IEEE International Symposium on Multiple-Valued Logic, pp. 416-423, 1983.
[2] C. W. Ford, C.-C. Chiang, H. Wu, R. R. Chilka, and J. R. Talburt, "Confidence on Approximate Query in Large Datasets," Proceedings of the International Conference on Information Technology: Coding and Computing (IEEE ITCC'04), pp. 480-484, 2004.
[3] R. Hashemi, C. W. Ford, T. Vamprooyen, and J. R. Talburt, "Extraction of Features with Unstructured Representation from HTML Documents," Proceedings of the IADIS International Conference WWW/Internet 2002, pp. 47-53, 2002.
[4] M. Henrion and M. J. Druzdzel, "Qualitative Propagation and Scenario-Based Approaches to Explanation of Probabilistic Reasoning," Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, Boston, Massachusetts, pp. 10-20, 1990.
[5] T. Imielinski and W. Lipski, Jr., "Incomplete Information in Relational Databases," Journal of the ACM, Vol. 31, No. 4, pp. 761-791, 1984.
[6] M. N. Garofalakis, R. Rastogi, S. Seshadri, and K. Shim, "Data Mining and the Web: Past, Present and Future," Proceedings of the Second International Workshop on Web Information and Data Management, Kansas City, Missouri, pp. 43-47, November 1999.
[7] H. Nakajima, T. Sogoh, and M. Arao, "Development of an Efficient Fuzzy SQL for Large Scale Fuzzy Relational Database," Proceedings of the Fifth International Fuzzy Systems Association World Congress '93, 1993.
[8] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, California, 1988.
[9] H. Prade and C. Testemale, "Generalizing Database Relational Algebra for the Treatment of Incomplete or Uncertain Information and Vague Queries," Information Sciences, Vol. 34, pp. 115-143, 1984.
[10] H. Prade and C. Testemale, "Fuzzy Relational Databases: Representational Issues and Reduction Using Similarity Measures," Journal of the American Society for Information Science, Vol. 38, No. 2, pp. 118-126, 1988.
[11] M. P. Wellman, "Fundamental Concepts of Qualitative Probabilistic Networks," Artificial Intelligence, Vol. 44, pp. 257-303, 1990.
[12] M. P. Wellman and M. Henrion, "Qualitative Intercausal Relations, or Explaining 'Explaining Away'," Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, Cambridge, Massachusetts, pp. 535-546, 1991.
[13] Q. Yang, W. Zhang, C. Liu, J. Wu, C. Yu, H. Nakajima, and N. D. Rishe, "Efficient Processing of Nested Fuzzy SQL Queries in a Fuzzy Database," IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, pp. 884-901, 2001.
[14] L. A. Zadeh, "Fuzzy Sets," Information and Control, Vol. 8, pp. 338-353, 1965.

Acknowledgements

This work was funded by a grant from the Acxiom Corporation in Little Rock, Arkansas.

Appendix 1.

For each run, the header line gives the number of announcements on the web site, the number grabbed by the grabber, the number of data fragments extracted by the pattern matcher, and the number supplemented by Flashlink. Each table then gives, per confidence category, the number of queries and their percentage of the total, followed by the counts after removal of duplicates: records, hits, and the hit rate (Hits / Records #), with N/A where no records apply.

Run 1.  Website: 423   Grabber: 392 (92%)   Pattern matcher: 371 (95%)   Flashlink: 343 (89%)

  Confidence   Queries #        %   Records #   Hits   Hits/Records #
  [95]                85   24.78%          85     68           80.00%
  [90]                23    6.71%          41     23           56.10%
  [85]                 2    0.58%           4      2           50.00%
  [80]                17    4.96%          25     13           52.00%
  [75]                26    7.58%          42     18           42.86%
  [70]                15    4.37%          47     11           23.40%
  [65]                 6    1.75%          28      6           21.43%
  [60]                 1    0.29%          16      2           12.50%
  [55]                 0    0.00%           5      1           20.00%
  [50]                 1    0.29%          18      0            0.00%
  (0-50)               0    0.00%           7      1           14.29%
  [0]                167   48.69%         N/A    N/A              N/A
  Total              343     100%         318    145           45.60%

Run 2.  Website: 790   Grabber: 783 (99%)   Pattern matcher: 758 (97%)   Flashlink: 746 (98.4%)

  Confidence   Queries #        %   Records #   Hits   Hits/Records #
  [95]               200   26.81%         200    150           75.00%
  [90]                80   10.72%         139     64           46.04%
  [85]                 8    1.07%          21      3           14.29%
  [80]                38    5.09%          54     25           46.30%
  [75]                58    7.77%         133     45           33.83%
  [70]                44    5.90%         173     26           15.03%
  [65]                18    2.41%         117     13           11.11%
  [60]                 6    0.80%          39      5           12.82%
  [55]                 0    0.00%          28      1            3.57%
  [50]                 1    0.13%          18      2           11.11%
  (0-50)               0    0.00%          29      0            0.00%
  [0]                293   39.28%         N/A    N/A              N/A
  Total              746     100%         951    334           35.10%

Run 3.  Website: 180   Grabber: 172 (96%)   Pattern matcher: 168 (98%)   Flashlink: 163 (97%)

  Confidence   Queries #        %   Records #   Hits   Hits/Records #
  [95]                51   31.29%          51     42           82.35%
  [90]                22   13.50%          33     14           42.42%
  [85]                 2    1.23%           6      1           16.67%
  [80]                15    9.20%          18      4           22.22%
  [75]                11    6.75%          27     13           48.15%
  [70]                10    6.13%          32      9           28.13%
  [65]                 3    1.84%          12      3           25.00%
  [60]                 0    0.00%           5      2           40.00%
  [55]                 0    0.00%           4      1           25.00%
  [50]                 0    0.00%           6      0            0.00%
  (0-50)               0    0.00%           2      0            0.00%
  [0]                 49   30.06%         N/A    N/A              N/A
  Total              163     100%         196     89           44.90%

Run 4.  Website: 154   Grabber: 150 (97%)   Pattern matcher: 150 (100%)   Flashlink: 136 (91%)

  Confidence   Queries #        %   Records #   Hits   Hits/Records #
  [95]                38   27.94%          38     31           81.58%
  [90]                13    9.56%          17      8           47.06%
  [85]                 0    0.00%           0      0              N/A
  [80]                 3    2.21%           8      0            0.00%
  [75]                12    8.82%          23      8           34.78%
  [70]                 9    6.62%          25      5           20.00%
  [65]                 1    0.74%          18      1            5.56%
  [60]                 2    1.47%           8      1           12.50%
  [55]                 0    0.00%           1      0            0.00%
  [50]                 0    0.00%           1      0            0.00%
  (0-50)               0    0.00%           0      0              N/A
  [0]                 58   42.65%         N/A    N/A              N/A
  Total              136     100%         139     54           38.80%

Run 5.  Website: N/A   Grabber: 273   Pattern matcher: 235 (86%)   Flashlink: 202 (85.2%)

  Confidence   Queries #        %   Records #   Hits   Hits/Records #
  [95]                35   17.33%          35     28           80.00%
  [90]                15    7.43%          25      9           36.00%
  [85]                 1    0.50%           3      0            0.00%
  [80]                 7    3.47%           8      1           12.50%
  [75]                21   10.40%          32     13           40.63%
  [70]                12    5.94%          23      6           26.09%
  [65]                 2    0.99%           9      0            0.00%
  [60]                 0    0.00%           5      2           40.00%
  [55]                 0    0.00%           0      0              N/A
  [50]                 0    0.00%           2      0            0.00%
  (0-50)               0    0.00%           1      0            0.00%
  [0]                109   53.96%         N/A    N/A              N/A
  Total              202     100%         143     59           41.30%
