Rima Hazra, Anomitra Saha, Shubhra Baran Deb and Debasis Mitra
Department of Information Technology
National Institute of Technology, Durgapur, India
Email: {rimahazra93, anomitra1993, shubhra.baran.deb, debasis.mitra}@gmail.com
Abstract: Scholarly digital libraries have become an important source of bibliographic records for scientific communities. Author name search is one of the most common queries exercised in digital libraries. The name ambiguity problem in the context of author search, arising from multiple authors sharing the same name, poses many challenges. A number of name disambiguation methods have been proposed in the literature, and they consider a variety of bibliographic attributes. However, hardly any effort has been made to assess the potential contribution of these attributes. We, for the first time, evaluate the potential strengths and weaknesses of these attributes through a rigorous course of experiments on a large data set. We also explore the potential utility of some attributes from a different perspective. A close look reveals that most of the earlier work requires one or more attributes that are difficult to obtain in practical applications. Based on this empirical study, we identify three very common and easy-to-access attributes and propose a two-step hierarchical clustering technique that solves name ambiguity using these attributes only. Experimental results on a data set extracted from a popular digital library show that the proposed method achieves a high level of accuracy (> 90%) for most of the instances.

I. INTRODUCTION
Nowadays, researchers mostly use scholarly digital libraries (DLs), such as DBLP, Google Scholar, Microsoft Academic Search, CiteSeer and PubMed, to perform literature reviews of relevant work [1]. These DLs usually offer millions of citation records, each having bibliographic attributes such as title, authors' names and publication venue. However, many of them cannot correctly index (distinguish) the publication records of different individuals because of the so-called name ambiguity problem. Such ambiguity mainly arises when multiple authors share exactly the same name. For example, more than 200 different authors share the name "Wei Wang" [2]. Name ambiguity may also arise when an author uses a short form of the name in some publications and the full name in others [3]. For example, an author named Mark Jones may appear as M. Jones in one publication and as Mark Jones in another.

Author name search is a very popular query in digital library databases [4]. Ambiguous names often lead to confusion and mistakes in identifying the citation records of authors' articles. A DL user searching for the publications of a particular author often receives a mixture of articles written by different authors sharing the same name. The problem can be solved by devising automatic methods to disambiguate all the citation records present in a DL.

A number of name disambiguation methods have been proposed in the literature so far [5], [6], [7]. Existing solutions are based on either a supervised or an unsupervised approach. Most of the supervised solutions use a learning function to train a model and apply generic machine learning algorithms to solve the problem. However, with the tremendous growth in the size of DL databases, the cost of obtaining training data becomes prohibitively large, making supervised approaches impractical. In the unsupervised approach, name disambiguation is usually treated as a clustering problem [8], such that all the articles authored by a unique individual are placed in one group (cluster). Among these methods, those using multiple levels of clustering (termed hierarchical clustering) are found to be the most efficient for real data [8]. In general, these methods compare bibliographic attributes (e.g. authors' names, title, publication venue, authors' institutional affiliations and email) of the ambiguous authors and use similarity functions to determine the clusters [1]. Although these methods achieve convincing results on many real-life data sets, many of them suffer from a serious drawback: they require extra attributes, such as the author's institutional affiliation, address or article abstract [9], which are difficult to obtain in practical applications [1].

In this work, we perform an empirical study to evaluate the potential strengths and weaknesses of the bibliographic attributes considered earlier by revisiting their potential utility from various perspectives. We also explore whether some attributes not considered so far might be worth considering. Finally, we select three very common and easily accessible attributes and propose a two-step hierarchical clustering technique that uses only these three attributes to solve the name ambiguity problem. In the first step, articles are grouped into fragments by utilizing the co-author relationship. If two articles authored by an ambiguous name have the same co-author, they are grouped in the same fragment. The strength of this step comes from the fact that there is very little chance that two different authors with the same name have the same co-author [7], [8]. In the second step, the fragments thus formed are further refined by considering the correlation between the articles with respect to article title and year of publication. The rationale for choosing these three attributes is presented in Section II. To evaluate the strength of the proposed technique, we have performed experiments with several instances of ambiguous names extracted from the DBLP database. Experimental results show that the proposed method achieves high accuracy for most of the instances.

The remainder of this paper is organised as follows. Section II presents a summary of prior work in the related area and
        else if title_sim = TS_MOD and actspan_sim = AS_TRUE then
            merge Fi, Fj
        end
    end
end
Algorithm 1: Merging fragments in 2nd level of clustering

have an author named aj. Note that the required graph can be constructed easily by referring to the matrix M. The graph for this example is shown in Fig. 3. A breadth-first traversal of the graph yields three connected components {F1, F2, F3}, where F1 = {c1, c3, c5}, F2 = {c2, c4}, and F3 = {c6}. The existence of these connected components may also be verified from Fig. 3.
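The fragment construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the matrix values are copied from Table II, while the function name build_fragments and the adjacency rule (two records are linked when they share at least one co-author column) are assumptions made for this sketch.

```python
from collections import deque

# Co-author matrix M from Table II: rows are citation records c1..c6,
# columns are co-authors a1..a8 (1 = the co-author appears on the record).
M = [
    [1, 1, 0, 0, 0, 0, 0, 0],  # c1
    [0, 0, 1, 1, 0, 0, 0, 0],  # c2
    [1, 0, 0, 0, 0, 0, 0, 0],  # c3
    [0, 0, 1, 0, 0, 0, 0, 0],  # c4
    [0, 1, 0, 0, 1, 0, 0, 0],  # c5
    [0, 0, 0, 0, 0, 1, 1, 1],  # c6
]


def shares_coauthor(matrix, i, j):
    """True when records i and j have at least one co-author in common."""
    return any(x and y for x, y in zip(matrix[i], matrix[j]))


def build_fragments(matrix):
    """Return the connected components of the record graph, found by BFS.

    Hypothetical helper: each component is a sorted list of row indices.
    """
    n = len(matrix)
    seen = [False] * n
    fragments = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in range(n):
                if not seen[j] and shares_coauthor(matrix, i, j):
                    seen[j] = True
                    queue.append(j)
        fragments.append(sorted(component))
    return fragments


fragments = build_fragments(M)
print([[f"c{i + 1}" for i in frag] for frag in fragments])
# prints [['c1', 'c3', 'c5'], ['c2', 'c4'], ['c6']]
```

The pairwise adjacency test makes the sketch quadratic in the number of records, which is adequate for a single ambiguous name; an index from co-author to records would be the natural refinement for larger inputs.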
      a1  a2  a3  a4  a5  a6  a7  a8
c1     1   1   0   0   0   0   0   0
c2     0   0   1   1   0   0   0   0
c3     1   0   0   0   0   0   0   0
c4     0   0   1   0   0   0   0   0
c5     0   1   0   0   1   0   0   0
c6     0   0   0   0   0   1   1   1
Table II
CO-AUTHOR MATRIX (M) FOR THE SET OF CITATION RECORDS CA OF TABLE I

[Figure: graph over the citation records c1 to c6]
Fig. 3. Graph representation of co-author relationship for the citation records of Table I

B. Clustering based on article title and year of publication
In this step, the set of fragments obtained above is further refined by applying one more level of clustering. Two fragments are merged into a single cluster if they satisfy a similarity criterion specially designed around the correlation between the articles with respect to article title and year of publication. If the similarity in titles of two articles (expressed

1) Computation of title_sim: As mentioned in Section II, it is unlikely that two authors sharing an ambiguous name work in the same research domain; hence, research domain may be a crucial factor in name disambiguation. The articles written by an author often contain keywords relevant to the author's domain of research. Accordingly, computing the similarity between the article titles of different fragments based on keyword matching may be useful.

The procedure for keyword matching involves a preprocessing step to eliminate stop words. Stop words are common words or phrases used to construct sentences in any language. For example, in English, prepositions, articles, verbs and conjunctions may be considered stop words. However, a word treated as a keyword in one application may be treated as a stop word in another. Table III shows the stop words and the keywords for the article title corresponding to record c3 of Table I.

Title string   Asymptotic Expansions of Moments of the Waiting Time in a Shared-Processor of an Interactive System.
Stop words     of, the, in, a
Keywords       Asymptotic, Expansion, Moments, Waiting, Time, Shared, Processor, Interactive, System.
Table III
LIST OF STOP WORDS AND KEYWORDS FOR AN EXAMPLE TITLE STRING

Once the stop words are eliminated, the amount of keyword matching (similarity) between each pair of articles (ci, cj), where ci ∈ Fi and cj ∈ Fj, is obtained with the Ratcliff/Obershelp pattern matching algorithm [25]. The algorithm uses the idea of gestalt (which describes how a person can identify a pattern as a whole, not merely as a collection of parts) to determine the similarity between the pair of strings.
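The preprocessing and matching steps above can be illustrated with Python's standard library. This is a hedged sketch, not the authors' code: difflib.SequenceMatcher is used as a stand-in because its documentation describes it as based on Ratcliff/Obershelp gestalt pattern matching, and its ratio() method returns 2*M/T (M matched characters, T total characters in both strings). The names remove_stopwords and sim_score follow the paper's identifiers, but their bodies, the tokenisation rule, the alphabetical sorting of keywords and the addition of "an" to the stop-word list of Table III are assumptions. No stemming is applied, so "Expansions" stays plural.

```python
import re
from difflib import SequenceMatcher

# Stop words of Table III, plus "an" (assumption: needed to reproduce the
# keyword list shown there for the example title).
STOP_WORDS = {"of", "the", "in", "a", "an"}

# Title of record c3 of Table I, as quoted in Table III.
TITLE = ("Asymptotic Expansions of Moments of the Waiting Time in a "
         "Shared-Processor of an Interactive System.")


def remove_stopwords(title):
    """Lower-case the title, split on non-letters and drop stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOP_WORDS]


def sim_score(title_i, title_j):
    """Percentage match between two titles after stop-word removal."""
    # Assumption: keywords are sorted alphabetically before matching.
    t_i = " ".join(sorted(remove_stopwords(title_i)))
    t_j = " ".join(sorted(remove_stopwords(title_j)))
    # ratio() is 2*M/T, scaled here to a percentage.
    return 100.0 * SequenceMatcher(None, t_i, t_j).ratio()


print(remove_stopwords(TITLE))
print(sim_score(TITLE, "Moments of the Waiting Time in an Interactive System"))
```

Identical titles score 100, and titles with no characters in common score 0; the thresholds applied to these scores are a separate design decision.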
The algorithm explores two sorted (in alphabetical order) strings ti and tj to locate the longest common contiguous subsequence (LCS) of characters. The LCS is considered as an anchor. The same procedure is applied recursively to the remaining characters (if any) on the left and on the right of the anchor until one of the strings is exhausted. The similarity score (sim_score) is expressed as the percentage match between the pair of strings. It is computed as twice the total number of common characters (the sum of the LCS lengths over all steps) divided by the total number of characters in the two strings.

Based on a detailed empirical study of the publications of a large number of individuals, we make the following observations. Usually, a typical author has at least one pair of articles whose titles include a significant number of common keywords (related to the author's research domain). Sometimes, however, no pair has a large number of common keywords, but two or more pairs share a smaller number of common keywords within their title strings. Keeping this in mind, we assign values to title_sim by considering three different scenarios (Algorithm 2). If there exists at least one pair of articles (ci, cj) (where ci ∈ Fi and cj ∈ Fj) such that the corresponding similarity score sim_score(ci, cj) is significantly high (say, 85%), title_sim is assigned the value TS_HIGH. The value TS_MOD is assigned to title_sim if no pair has such a high similarity score but there exist at least two pairs of articles (ci, cj) and (ci', cj') (where ci, ci' ∈ Fi and cj, cj' ∈ Fj) such that both similarity scores are moderately high (say, 65%). Otherwise, title_sim is assigned the value TS_LOW.

Input : {{Fi}, {Fj}}
Output: title_sim(Fi, Fj)
title_sim(Fi, Fj) ← None;
for Title_i in Fi do
    ti ← remove_stopwords(Title_i);
    for Title_j in Fj do
        tj ← remove_stopwords(Title_j);

fragment Fi is identified by AFi (Section IV-A). In order to measure actspan_sim between the authors AFi and AFj, we introduce two parameters, namely, actspan_overlap and actspan_distance. The actspan_overlap(AFi, AFj) for two authors AFi and AFj is computed as the overlap (in number of years), if any, between actspan(AFi) and actspan(AFj). The actspan_distance(AFi, AFj) for two authors AFi and AFj is computed as the difference (in number of years) between actpeak(AFi) and actpeak(AFj). Finally, actspan_sim for AFi and AFj is assigned the value AS_TRUE if actspan_overlap(AFi, AFj) is greater than Th_overlap and actspan_distance(AFi, AFj) is less than Th_distance, where Th_overlap and Th_distance are two predefined thresholds chosen empirically; otherwise it is assigned the value AS_FALSE. The procedure for computing actspan_sim(Fi, Fj) is outlined in Algorithm 3.

Input : actspan_overlap(AFi, AFj), actspan_distance(AFi, AFj)
Output: actspan_sim(AFi, AFj)
if actspan_overlap(AFi, AFj) > Th_overlap and actspan_distance(AFi, AFj) < Th_distance then
    actspan_sim(AFi, AFj) ← AS_TRUE
else
    actspan_sim(AFi, AFj) ← AS_FALSE
end
Algorithm 3: Computation of actspan_sim

V. EXPERIMENTAL RESULTS
In order to evaluate the proposed scheme, we have performed experiments on a large set of bibliographic data extracted from DBLP. Note that DBLP has already disambiguated a few ambiguous names by some means. We extract the citation records of thirteen such already disambiguated names. A summary of the data set is shown in the first three columns of Table IV. The data set contains samples with a large variety in terms of the number of true authors and the number of articles. A copy of the disambiguated data is kept as the golden data set [26] (to be used as a reference for evaluating the accuracy). The golden data corresponding to a typical name A is
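The two second-level tests of Section IV-B (the three title_sim scenarios and the actspan_sim rule of Algorithm 3) can be sketched as below. This is only an illustration under stated assumptions: the 85% and 65% levels are the example values quoted in the text, the use of ">=" for those levels is a guess, any two distinct article pairs count towards TS_MOD, and Th_overlap and Th_distance are passed in as arguments because their empirical values are not given in this section. The merge rule of Algorithm 1 is not reproduced, since only part of it is available.

```python
TS_HIGH, TS_MOD, TS_LOW = "TS_HIGH", "TS_MOD", "TS_LOW"


def title_sim(scores, high=85.0, moderate=65.0):
    """Classify the pairwise sim_score values (in percent) of two fragments.

    `scores` holds one value per article pair (ci, cj), ci in Fi, cj in Fj.
    """
    scores = list(scores)
    # Scenario 1: at least one pair with a significantly high score.
    if any(s >= high for s in scores):
        return TS_HIGH
    # Scenario 2: no such pair, but at least two moderately high pairs.
    if sum(s >= moderate for s in scores) >= 2:
        return TS_MOD
    # Scenario 3: everything else.
    return TS_LOW


def actspan_sim(overlap, distance, th_overlap, th_distance):
    """AS_TRUE (True) only when both threshold tests of Algorithm 3 pass."""
    return overlap > th_overlap and distance < th_distance


# Hypothetical scores for the article pairs of two fragments.
print(title_sim([72.0, 68.5, 40.0]))
# Hypothetical overlap/distance (years) and thresholds.
print(actspan_sim(overlap=5, distance=2, th_overlap=3, th_distance=4))
```

Keeping the thresholds as parameters makes it easy to repeat the empirical tuning the text mentions on a different data set.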
years, the median is considered as actpeak(ai).
individual author (having name A) associated with a

Recall that the pairwise measures mentioned above are computed by examining the number of article pairs assigned to the same individual by the proposed technique, as compared to that obtained from the corresponding golden data. If two articles belonging to the same individual (as per the golden data set) are put in the same cluster by the proposed method, the pair is considered a correct pair. Otherwise, the pair is considered an incorrect pair. Note that only the pairs belonging to the same cluster are considered during counting. The formulas used to compute the three evaluation metrics mentioned above are given below:

\[ \text{Pairwise Precision} = \frac{\#\text{PairsCorrectlyPredicted}}{\#\text{TotalPairsPredicted}} \]

\[ \text{Pairwise Recall} = \frac{\#\text{PairsCorrectlyPredicted}}{\#\text{TotalCorrectPairs}} \]

\[ \text{Pairwise F1} = \frac{2 \times \text{Pairwise Precision} \times \text{Pairwise Recall}}{\text{Pairwise Precision} + \text{Pairwise Recall}} \]

The last three columns of Table IV show the pairwise precision, recall and F1 scores for each of the ambiguous names under consideration. For better illustration, the F1 scores are also shown as a bar chart (Fig. 4). The proposed name disambiguation technique achieves a high level of accuracy for most of the cases.

Author Name          Number of      Number of   Precision (%)   Recall (%)   F1 Score (%)
                     True Authors   Articles
He Sun                5              46          100             80.63        89.27
Qiang Yang           13             457           98.98          98.96        98.97
Rui Zhang            17             859           39.18          34.56        36.73
Jianwei Zhang         9             286          100             98.57        99.28
Sourav Chakraborty    5              54          100             40           57.22
Arnab Roy             6              56          100             84.91        91.84
Tao Xie               6             321           97.34          64.82        77.82
Hui Fang              8             149           57.29          53           55.11
Rohit Singh           7              29          100             83.65        91.10
Anil K. Jain          3             595           99.20          76.36        86.29
Rakesh Agrawal        2             227          100             84.89        91.83
Feng Wang             4             488           73.37          10.11        17.77
Micheal Wagner       16             153          100             99.39        99.69
Table IV
PAIRWISE PRECISION, RECALL AND F1 SCORES OBTAINED BY THE PROPOSED SCHEME FOR VARIOUS AMBIGUOUS NAMES

[Figure: bar chart of the F1 score for each ambiguous name]
Fig. 4. F1 scores obtained by the proposed scheme for various names

VI. CONCLUSION
A two-step hierarchical clustering technique for name disambiguation has been proposed. The method removes the need for extra, difficult-to-obtain attributes. The use of three very common and easily accessible attributes makes the name disambiguation efficient. The level of accuracy is further improved by the use of the year of publication of articles. The strength of the proposed technique is evident from the high level of accuracy achieved on the data set extracted from DBLP.

REFERENCES
[1] A. A. Ferreira et al., “A brief survey of automatic methods for author name disambiguation,” SIGMOD Rec., vol. 41, pp. 15–26, 2012.
[2] S. Li et al., “Author name disambiguation using a new categorical distribution similarity,” in European Conference on Machine Learning and Knowledge Discovery in Databases, 2012, pp. 569–584.
[3] M. Khabsa et al., “Large scale author name disambiguation in digital libraries,” in International Conference on Big Data, 2014, pp. 41–42.
[4] J. Huang et al., “Efficient name disambiguation for large-scale databases,” in 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2006, pp. 536–544.
[5] A. Veloso et al., “Cost-effective on-demand associative author name disambiguation,” Inf. Process. Manage., vol. 48, pp. 680–697, 2012.
[6] H. Han et al., “A hierarchical naive Bayes mixture model for name disambiguation in author citations,” in ACM Symposium on Applied Computing, 2005, pp. 1065–1069.
[7] X. Fan et al., “On graph-based name disambiguation,” J. Data and Information Quality, vol. 2, pp. 10:1–10:23, 2011.
[8] Y. Liu et al., “A fast method based on multiple clustering for name disambiguation in bibliographic citations,” Journal of the Association for Information Science and Technology, vol. 66, pp. 634–644, 2015.
[9] D. A. Pereira et al., “Using web information for author name disambiguation,” in ACM/IEEE Conf. on Digital Libraries, 2009, pp. 49–58.
[10] H. Han et al., “Two supervised learning approaches for name disambiguation in author citations,” in 4th ACM/IEEE-CS Joint Conference on Digital Libraries, 2004, pp. 296–305.
[11] X. Yin et al., “Object distinction: Distinguishing objects with identical names,” in IEEE 23rd International Conference on Data Engineering, 2007, pp. 1242–1246.
[12] V. I. Torvik and N. R. Smalheiser, “Author name disambiguation in MEDLINE,” ACM Trans. Knowl. Discov. Data, vol. 3, pp. 11:1–11:29, 2009.
[13] G. S. Mann and D. Yarowsky, “Unsupervised personal name disambiguation,” in Seventh Conference on Natural Language Learning at HLT-NAACL, 2003, pp. 33–40.
[14] T. Arif et al., “Author name disambiguation using two stage clustering,” INROADS (Special Issue), vol. 3, pp. 340–345, 2014.
[15] H. Han et al., “A model-based k-means algorithm for name disambiguation,” in International Semantic Web Conference, 2003.
[16] D. M. Blei et al., “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[17] I. Bhattacharya and L. Getoor, “A latent Dirichlet model for unsupervised entity resolution,” in Sixth SIAM International Conference on Data Mining, 2006, pp. 47–58.
[18] M. Ester et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. of 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[19] J. Tang et al., “A unified probabilistic framework for name disambiguation in digital library,” IEEE Trans. on Knowl. and Data Eng., vol. 24, pp. 975–987, 2012.
[20] F. H. Levin and C. A. Heuser, “Evaluating the use of social networks in name disambiguation in digital libraries,” Journal of Information and Data Management, vol. 1, pp. 183–197, 2010.
[21] H. Tran et al., “Author name disambiguation by using deep neural network,” in Intelligent Information and Database Systems, 2014, vol. 8397, pp. 123–132.
[22] R. G. Cota et al., “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, pp. 1853–1870, 2010.
[23] N. Deo, Graph Theory with Applications to Engineering and Computer Science (Prentice Hall Series in Automatic Computation). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1974.
[24] S. S. Skiena, The Algorithm Design Manual, 2nd ed. Springer Publishing Company, Incorporated, 2008.
[25] J. W. Ratcliff and J. A. Obershelp, “Ratcliff/Obershelp pattern recognition,” Dictionary of Algorithms and Data Structures, 1983. [Online]. Available: http://nist.gov/dads/HTML/ratcliffObershelp.html
[26] Y. Qian et al., “Dynamic author name disambiguation for growing digital libraries,” Inf. Retr. Journal, vol. 18, 2015.
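As an illustration of the pairwise measures defined in Section V, the following sketch computes precision, recall and F1 from two clusterings of the same records. It is an assumption-laden example, not the evaluation script used for Table IV: the function names and the toy record identifiers are hypothetical, and a pair counts as correctly predicted when both clusterings place its two articles together.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Set of unordered article pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def pairwise_scores(predicted, golden):
    """Return (precision, recall, F1) as fractions in [0, 1]."""
    pred = same_cluster_pairs(predicted)   # #TotalPairsPredicted
    gold = same_cluster_pairs(golden)      # #TotalCorrectPairs
    correct = len(pred & gold)             # #PairsCorrectlyPredicted
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the golden data has two true authors; the prediction
# splits the first author's articles into two clusters.
golden = [{"c1", "c2", "c3"}, {"c4", "c5"}]
predicted = [{"c1", "c2"}, {"c3"}, {"c4", "c5"}]
print(pairwise_scores(predicted, golden))
```

In the toy example every predicted pair is correct, so precision is 1.0, while only two of the four golden pairs are recovered, so recall is 0.5. This is the same pattern as the rows of Table IV that combine 100% precision with lower recall.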