Charles Wesley Ford, Chia-Chu Chiang, Hao Wu, Radhika R. Chilka, and *John R. Talburt

Department of Computer Science
University of Arkansas at Little Rock
2801 South University Avenue, Little Rock
Arkansas 72204-1099, USA
E-mail: cwford@ualr.edu

*Acxiom Corporation
#1 Information Way
P. O. Box 8180, Little Rock
Arkansas 72202-2289, USA
E-mail: John.Talburt@acxiom.com
Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC05)
0-7695-2315-3/05 $ 20.00 IEEE
realm of probability theory. Discussions on the application of databases can be found in [4, 8, 11, 12].

3. System overview

The system for web data mining, as shown in Figure 1, consists of five phases: web site identification, content selection, information retrieval, approximate data query, and text data statistical analysis.

[Diagram boxes: WWW; Web Sites Identification; Content Selection; Information Retrieval; Approximate Query; Text Data Statistical Analysis; Trusted Index; Auth. Source]
Figure 1. Overview of the system

In the web site identification phase, the web sites corresponding to the subject matter are semi-automatically chosen and compiled into a list, using criteria such as availability, relevance, and importance of web sites to our goal. Keyword searching using existing web search engines is one tool used to compile the initial list of web sites. This list is then filtered.

The next step is to determine which web pages are relevant to our subject. A web site usually contains many web pages, some of which may contain relevant information. The selection of web pages is based on the kind of information we are looking for. For example, if we are retrieving a specific type of announcement, the system should be able to determine which pages contain data of interest.

For the information retrieval phase, a crawler has been designed to automatically extract information from selected web pages. The web crawler reads in a list of web sites, downloads the web pages, and follows the hyperlinks to other pages. It keeps a copy of all visited pages, both to avoid accessing the same web pages repeatedly and to identify dead links. Data of interest may be embedded in phrases, sentences, or paragraphs. The pattern matcher uses a pattern matching technique described in Hashemi [3] to locate the patterns containing the required data in the text and then retrieve the data from the patterns. The retrieved data are stored in a file for validating the Index.

In the approximate data query phase [2], the extracted data fragments are used to evaluate the quality of Web data, in which the retrieved information is evaluated against a Trusted Index. However, it is often the case that both the extracted Web data and the information recorded in the Trusted Index are incomplete and exhibit some degree of variation in both content and format. Moreover, the Trusted Index is typically very large, and a query against it, especially an approximate query, may return a very large result set that must be further analyzed in order to arrive at a final assessment or confidence value. Therefore, for the result set, the system will assign a confidence level to each record indicating the match level and implied confidence that the record is the one that matches the actual record that the input implied. In the text data statistical analysis phase, the extracted data fragments with confidence levels are evaluated against an Authoritative Source, which is presumed to contain complete and correct information. In this paper, we present an industry project for text data mining and show the results of statistical analysis.

4. Case Study

The goal of this study is to extract information related to events that are published on public websites. An event could be a marriage, obituary, birth, etc. This study focuses on obtaining obituary data intended for use in preventing fraud where someone assumes the identity of a recently deceased individual.

The primary Authoritative Source for deceased information is the list of deceased individuals published by the Social Security Administration. However, it may take months from the time a death is reported until the time it is published, often too late for effective fraud prevention. On the other hand, information posted on web sites may be current within a few days. Therefore, for purposes of the case study, web sites were selected with information that was at least one year old so that enough time had elapsed for it to also appear in the Authoritative Source.

Obituary sites of suitable age were discovered from the web and classified into three categories, namely: A, B, and C. Web sites in category A contain formatted pages, making them the easiest ones from which data can be extracted. For these web pages, the obituary announcements were collected programmatically. A pattern matcher was then employed to extract the data with little loss. Category B sites are usually unformatted, thus requiring special text analysis techniques to make sure the relevant data are extracted, since the relevant information may appear anywhere in the web page. Category C contains web pages including newsgroups, forums, mailing lists, etc. Currently, the system focuses only on Category A.

In many cases, each extracted data fragment consists of first name, middle name or initial, last name, city, and the date of death. These are usually not complete enough for application to fraud prevention because they lack a complete address. One way to complete the missing information is to use a Trusted Index of name and address information. The name and address fragments extracted from the text are matched against the Trusted Index to create a set of name and address candidates where each candidate matches one or more of the data fragments. For example, the fragment John Jones, age 80, of Anytown, Arkansas might match to J. Jones, age 75, of 123 Oak St; John A. Jones, age not given, of 567 Elm Street; and John B. Jones, age 82, of 99 Pine St; all in Anytown, Arkansas based on the Trusted Index.

Since the result set contains uncertainties, each record in the set is assigned a confidence level between 0 and 100 inclusive. Pseudo-code for a confidence function is given in Figure 5.

Finally, the result set was compared to the Authoritative Source in order to determine which of the candidate records actually corresponded to the name and address of an
individual reported as deceased, and to correlate these findings with the assigned confidence level.

5. Results

In this study, we conducted five experiments for text mining from five different web sites. Figure 2 shows the number of announcements appearing in each web site and the number of those extracted successfully by the grabber.

[Chart: announcements grabbed and not grabbed per run; y-axis 0 to 800, x-axis Runs; data labels as extracted: 783, 392, 172, 150 and 12, 31, 8, 4]

[Chart: # of Data Fragments Extracted and # of Data Fragments not Extracted per run; y-axis 0 to 800, x-axis Runs 1 to 5; data labels as extracted: 758, 371, 235, 168, 150 and 25, 21, 38, 4, 0]
Figure 3. Number of data fragments extracted

Since each data fragment is not complete enough for use, we group the data fragments of each deceased person into a query. We further supplement the data fragments by adding addresses to them through queries against the Trusted Index. The results of some queries may be an empty set. Some queries may result in multiple records in the set. Figure 4 shows the results of this process, in which 85% or more of the queries return non-empty result sets. For the records in these sets, we assign confidence levels. The algorithm for assigning the confidence levels is shown in Figure 5.

    {
      for each record i {
        if firstname, lastname, and city match the query data
          confidence[i] = 80%;
        if age match
          confidence[i] += 10%;
        else if age missing
          confidence[i] -= 10%;
        reduce the confidence by 5% for all the records
          that have equal or lower confidence.
      }
      return confidence[]
    }

Figure 5. Confidence level assignment algorithm

In Figure 5, if a query returns a result set containing multiple records with the same link (special values assigned by the organization), then all records with equal or lower confidence levels are reduced by 5%. For example, if a person has lived in different locations before, the result set may contain multiple records having the same link with different addresses. In this situation, their confidence levels are adjusted by 5%. The adjustment factor seen in the algorithm is 0 in this study.

After confidence levels are assigned to the records, we then classify the queries into different categories based on
their confidence levels for statistical analysis. Figure 6 shows the categories of the queries. For example, a query is assigned to the 95 category if it returns at least one record with a 95 confidence level in the result set. The exact number of queries in each category for each run is shown in Appendix 1.

[Chart: share of queries (0% to 100%) per category for runs 1 to 5; categories 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, (0-50), 0]
Figure 6. Categories of queries

[...] records from the set and match the records against the Authoritative Source. Figure 7 shows the number of records with the confidence levels for each run taken against the Authoritative Source. Refer to Appendix 1 for the exact [...]

[Chart: Hits and Not hits by confidence category, [95] to (0-50)]

[Chart: Hits and Not Hits by confidence category, [95] to (0-50)]
Figure 7b. Number of records hit in run 2

[Chart: Hits and Not Hits by confidence category, [95] to (0-50)]
Figure 7c. Number of records hit in run 3

[Chart: Hits and Not Hits by confidence category, [95] to (0-50)]
Figure 7d. Number of records hit in run 4

[Chart: Hits and Not Hits by confidence category, [95] to (0-50)]
Figure 7e. Number of records hit in run 5

[Chart: Hit Rate (0.00% to 80.00%) by Confidence category, [95] to (0-50), one series each for run1 to run5]
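The confidence assignment of Figure 5 can be sketched in Python as follows. This is only an illustrative reading of the pseudo-code, not the authors' implementation. The `Candidate` class, its field names, the exact case-insensitive string comparison, the starting level of 0 for records that miss the name and city match, and the choice to leave the level unchanged on an age mismatch are all assumptions. The link rule is read here as "every record that shares its link with another record in the result set is reduced by `link_penalty`", which is one possible interpretation of the figure.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A name/address record; all field names are hypothetical."""
    first_name: str
    last_name: str
    city: str
    age: Optional[int]  # None when no age is given
    link: str           # organization-assigned value shared by related records


def assign_confidence(query: Candidate, results: list[Candidate],
                      link_penalty: int = 5) -> list[int]:
    """Return one confidence level in 0..100 for each record in results."""
    def same(a: str, b: str) -> bool:
        # Assumption: exact, case-insensitive comparison stands in for the
        # approximate matching used by the real system.
        return a.strip().lower() == b.strip().lower()

    link_counts = Counter(r.link for r in results)
    levels = []
    for r in results:
        conf = 0  # assumption: no name/city match leaves the record at 0
        if (same(r.first_name, query.first_name)
                and same(r.last_name, query.last_name)
                and same(r.city, query.city)):
            conf = 80
        if r.age is not None and r.age == query.age:
            conf += 10   # age match
        elif r.age is None:
            conf -= 10   # age missing
        # assumption: an age mismatch leaves the level unchanged
        if link_counts[r.link] > 1:
            conf -= link_penalty  # several records share this link
        levels.append(max(0, min(100, conf)))  # keep within 0..100
    return levels


# Usage with the John Jones example from the case study (middle initials omitted).
query = Candidate("John", "Jones", "Anytown", 80, "query")
results = [
    Candidate("J.", "Jones", "Anytown", 75, "L1"),
    Candidate("John", "Jones", "Anytown", None, "L2"),
    Candidate("John", "Jones", "Anytown", 82, "L3"),
]
print(assign_confidence(query, results))  # [0, 70, 80]
```

Passing `link_penalty=0` reproduces the setting in which the adjustment factor is 0.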
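The query categories of Figure 6 and the hit rate per confidence category can be computed along the following lines. This is a minimal sketch under stated assumptions: a query is placed in the category of the highest confidence level found in its result set, levels below 50 fall into the "(0-50)" category, an empty result set has no category, and a "hit" means that the candidate record was found in the Authoritative Source. The function names and the input format are hypothetical.

```python
from collections import defaultdict


def query_category(levels):
    """Category label for one query, given the confidence levels of its records."""
    if not levels:
        return None  # assumption: an empty result set is left uncategorized
    top = max(levels)
    return str(top) if top >= 50 else "(0-50)"


def hit_rates(records):
    """Hit rate per category from (category, was_hit) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, was_hit in records:
        totals[category] += 1
        if was_hit:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Usage: four of five records in the 95 category are found in the source.
sample = [("95", True)] * 4 + [("95", False)]
print(query_category([95, 70]))  # 95
print(hit_rates(sample))         # {'95': 0.8}
```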
5. Observations

In this project, we have shown the experimental results of text data mining from obituary web sites. Figure 8 indicates that the announcements of deceased persons on the web sites may not appear in the Authoritative Source. Although extracted records may be incomplete, those that scored 95 confidence were validated by the Authoritative Source 80% of the time. In some cases where the data was complete, a hit did not occur. This could suggest that the Authoritative Source may not be as complete as one might hope. Another possibility could be that the announcements on the web sites do not truly reflect actual deceased persons.

6. Summary

In this project, we mined text data from five obituary web sites publishing notices approximately one year old. The goal of this study was to evaluate how often and how accurately name and address fragments extracted from these notices developed into complete name and address information corresponding to the deceased individual. The fragments were used to select candidate records from a large Trusted Index of name and address information. For example, the text data mined from the obituary web sites usually contain the name, city, and state of the deceased person. The street address of the deceased person is usually not given in the announcement. Records from the Trusted Index were used to try to complete the address fragment. However, the records from the Trusted Index can also add uncertainties to the mined text data because there can be several different records that match the same fragment. In order to classify the extent of the uncertainties, we assign different confidence levels to candidate records based on the degree of match to the fragment and the amount of confirmation or contradiction among the candidates. In theory, the higher the confidence level assigned to a candidate record, the more likely it will appear in the Authoritative Source. We then conducted experiments to show how well our confidence level assignments predicted this result. We have not yet determined how adjustment of the weights of the various factors used to compute the confidence level would affect the results.

We used an Authoritative Source, which was presumed to contain complete and accurate information on deceased individuals. We took our mined text fragments to extract candidate name and address records from a Trusted Index. Statistical analysis of the results shows that the candidate records with confidence level 95 are found in the Authoritative Source approximately 80% of the time. This means that, of the records whose names, addresses, and cities matched the data stored in our Index, only 80% matched the Authoritative Source. The other 20% of the records do not appear in the Authoritative Source. We have concluded that additional attributes, when available, may help us refine the values assigned as confidence.

7. References

[1] J. F. Baldwin, A Fuzzy Relational Inference Language for Expert Systems, Proceedings of the 13th IEEE International Symposium on Multiple-Valued Logic, pp. 416-423, 1983.
[2] C. W. Ford, C.-C. Chiang, H. Wu, R. R. Chilka, and J. R. Talburt, Confidence on Approximate Query in Large Datasets, Proceedings of the International Conference on Information Technology: Coding and Computing (IEEE ITCC04), pp. 480-484, 2004.
[3] R. Hashemi, C. W. Ford, T. Vamprooyen, and J. R. Talburt, Extraction of Features with Unstructured Representation from HTML Documents, Proceedings of the IADIS International Conference WWW/Internet 2002, pp. 47-53, 2002.
[4] M. Henrion and M. J. Druzdzel, Qualitative Propagation and Scenario-Based Approaches to Explanation of Probabilistic Reasoning, Proceedings of the 6th Conference in Artificial Intelligence, Boston: Massachusetts, pp. 10-20, 1990.
[5] T. Imielinski and W. Lipski Jr., Incomplete Information in Relational Databases, Journal of the ACM, Vol. 31, Issue 4, pp. 761-791, 1984.
[6] M. N. Garofalakis, R. Rastogi, S. Seshadri, and K. Shim, Data Mining and the Web: Past, Present and Future, Proceedings of the Second International Workshop on Web Information and Data Management, November 1999, Kansas City: Missouri, pp. 43-47.
[7] H. Nakajima, T. Sogoh, and M. Arao, Development of an Efficient Fuzzy SQL for Large Scale Fuzzy Relational Database, Proceedings of the Fifth International Fuzzy Systems Assoc. World Congress 93, 1993.
[8] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo: California, Morgan Kaufmann, 1988.
[9] H. Prade and C. Testemale, Generalizing Database Relational Algebra for the Treatment of Incomplete or Uncertain Information and Vague Queries, Information Sciences, Vol. 34, pp. 115-143, 1984.
[10] H. Prade and C. Testemale, Fuzzy Relational Databases: Representational Issues and Reduction Using Similarity Measures, Journal of American Society for Information Science, Vol. 38, No. 20, pp. 118-126, 1988.
[11] M. P. Wellman, Fundamental Concepts of Qualitative Probabilistic Networks, Artificial Intelligence, Vol. 44, pp. 257-303, 1990.
[12] M. P. Wellman and M. Henrion, Qualitative Intercausal Relation, or Explaining Explaining Away, Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, Cambridge: Massachusetts, pp. 535-546, 1991.
[13] Q. Yang, W. Zhang, C. Liu, J. Wu, C. Yu, H. Nakajima, and N. D. Rishe, Efficient Processing of Nested Fuzzy SQL Queries in a Fuzzy Database, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, pp. 884-901, 2001.
[14] L. A. Zadeh, Fuzzy Sets, Information and Control, Vol. 8, pp. 338-353, 1965.

Acknowledgements

This work was funded by a grant from the ACXIOM Corporation in Little Rock, Arkansas.
Appendix 1.