An Efficient Technique for Author Name Disambiguation

Rima Hazra, Anomitra Saha, Shubhra Baran Deb and Debasis Mitra
Department of Information Technology
National Institute of Technology, Durgapur, India
Email: {rimahazra93, anomitra1993, shubhra.baran.deb, debasis.mitra}@gmail.com

Abstract—Scholarly digital libraries have become an important source of bibliographic records for scientific communities. Author name search is one of the most common queries exercised in digital libraries. The name ambiguity problem in the context of author search in digital libraries, arising from multiple authors sharing the same name, poses many challenges. A number of name disambiguation methods have been proposed in the literature so far. A variety of bibliographic attributes have been considered in these methods. However, hardly any effort has been made to assess the potential contribution of these attributes. We, for the first time, evaluate the potential strengths and/or weaknesses of these attributes by a rigorous course of experiments on a large data set. We also explore the potential utility of some attributes from a different perspective. A close look reveals that most of the earlier works require one or more attributes which are difficult to obtain in practical applications. Based on this empirical study, we identify three very common and easy-to-access attributes and propose a two-step hierarchical clustering technique to solve name ambiguity using these attributes only. Experimental results on a data set extracted from a popular digital library show that the proposed method achieves a significantly high level of accuracy (> 90%) for most of the instances.

I. INTRODUCTION

Nowadays, researchers mostly use scholarly digital libraries (DLs), such as DBLP, Google Scholar, Microsoft Academic Search, CiteSeer and PubMed, to perform literature reviews of relevant works [1]. These DLs usually offer millions of citation records, with each record having bibliographic attributes such as title, authors' names, publication venue etc. However, many of them cannot correctly index (distinguish) publication records of various individuals because of the so-called name ambiguity problem. Such ambiguity mainly arises when multiple authors share exactly the same name. For example, there are more than 200 different individual authors sharing the name "Wei Wang" [2]. Name ambiguity may also arise when an author uses her short name in some publications and her full name in others [3]. For example, an author named Mark Jones may use M. Jones in one publication and Mark Jones in another.

Author name search is a very popular query used in digital library databases [4]. Ambiguous names often lead to confusion and mistakes in identifying citation records of authors' articles. A DL user, while searching for publications of a particular author, often receives a combination of articles written by different authors sharing the same name. The problem can be solved by devising automatic methods to disambiguate all the present citation records of a DL.

A number of name disambiguation methods have been proposed in the literature so far [5], [6], [7]. Existing solutions are based on either a supervised approach or an unsupervised approach. Most of the supervised solutions exploit a learning function to train the model and generic machine learning algorithms to solve the problem. However, with the tremendous growth in the size of the DL databases, the cost of obtaining training data becomes prohibitively large, thereby making supervised approaches impractical. In the unsupervised approach, name disambiguation is usually considered as a clustering issue [8] such that all the articles authored by a unique individual are placed in a group (cluster). Among these methods, the ones using multiple levels of clustering (termed hierarchical clustering) are found to be most efficient for real data [8]. In general, these methods compare bibliographic attributes (e.g. author's names, title, publication venue, author's institutional affiliations and email) of ambiguous authors and use some similarity functions to determine the clusters [1]. Although these methods achieve convincing results for many real-life datasets, many of them suffer from a serious drawback: the requirement of extra attributes, such as author's institutional affiliation, address, article abstract etc. [9], which are difficult to obtain in practical applications [1].

In this work, we perform an empirical study to evaluate the potential strengths and/or weaknesses of the bibliographic attributes considered earlier by revisiting their potential utility from various perspectives. We also explore whether some attributes, not considered yet, might be worth considering. Finally, we select three very common and easily accessible attributes and propose a two-step hierarchical clustering technique that uses only these three attributes to solve the name ambiguity problem. In the first step, articles are grouped into fragments by utilizing the co-author relationship. If two articles authored by an ambiguous name have the same co-author, they are grouped in the same fragment. The strength of this step is evident from the fact that there is very little chance that two different authors with the same name have the same co-author [7], [8]. In the second step, the fragments thus formed are further refined by considering the correlation between the articles with respect to article title and year of publication. The explanation behind choosing these three attributes is presented in Section II. In order to evaluate the strength of the proposed technique, we have performed experiments with several instances of ambiguous names extracted from the DBLP database. Experimental results show that the proposed method achieves significantly high accuracy for most of the instances.

The remainder of this paper is organised as follows. Section II presents a summary of prior work in the related area and

978-1-5090-1936-6/16/$31.00 ©2016 IEEE


motivation behind the current work. A formal presentation of the problem appears in Section III. Section IV describes the proposed technique. Experimental results are summarized in Section V. Finally, conclusions are drawn in Section VI.

II. PRIOR WORK AND MOTIVATION

A number of name disambiguation methods have been proposed in the literature in the last few decades. Existing work on this topic can broadly be discriminated by the types of methods and the types of features (attributes) used. The techniques using supervised methods exploit a learning function to train the model and generic machine learning algorithms to solve the problem [8], [10]. An associative technique proposed by Veloso et al. [5] is shown to reduce the amount of training data significantly. Yin et al. [11] develop a method called DISTINCT for author distinction, which takes a set of distinguishable articles as training data automatically. Torvik et al. [12] proposed a technique to disambiguate author names for the articles in MEDLINE. The technique is shown to be applicable to other bibliographic databases too.

To get rid of the increasing cost involved in obtaining training data with the tremendous growth in the size of the DL databases, a number of techniques based on unsupervised learning have been proposed. In the unsupervised approach, name disambiguation is usually considered as a clustering issue [8], [13], [14]. A technique using a variant of the famous K-means clustering algorithm was presented by Han et al. [15]. A hierarchical naive Bayes mixture model was used in [6] to further improve the accuracy. A technique based on an extension of the Latent Dirichlet Allocation model [16] coupled with a probabilistic model was proposed by Bhattacharya and Getoor [17]. Huang et al. [4] presented a blocking method to get candidate classes of authors and DBSCAN clustering [18] to group articles by author. Tang et al. [19] tried to formalize the problem in a unified framework. Hidden Markov Random Fields are used to model the relation between two articles and the Bayesian Information Criterion (BIC) is used to estimate the number of authors for each disambiguation step. A few clustering techniques [7], [8], utilizing the concepts of graph theory, have also been presented in the recent past. A technique using the relationship network among authors (author social network) was proposed in [20].

Existing techniques for author name disambiguation consider a variety of bibliographic attributes such as author's names, co-authors' names, article title, publication venue, author's institutional affiliations, author's research area, author's email etc. To the best of our knowledge, no effort has been made to judge the potential contribution of the above attributes in solving the name ambiguity problem. This motivates us to conduct rigorous experiments with a large set of real data already disambiguated (either manually, or by some existing methods) to identify the potential strengths and/or weaknesses of the various attributes mentioned above with respect to name disambiguation. We also put effort into identifying other attributes (if any) which have not been considered yet but might have the potential to contribute significantly.

As evident from many prior works [7], [11], our experiments also found the co-author relationship to be the most dominant contributor among all the attributes considered so far. However, the sole use of the co-author relationship can hardly achieve desirable accuracy for most of the cases. Our experiments also reveal that all other attributes considered so far suffer from at least one of the following shortcomings: (i) inability to contribute significantly for a variety of data sets [21], (ii) difficulty in obtaining the required information (e.g., author's curriculum vitae [9] or institutional affiliation [19] hardly appears in any citation record, and thereby requires reference to external sources). We further notice that the domain (area) of research of two different authors sharing an ambiguous name rarely matches. Hence, research domain may be a useful attribute to distinguish authors having an ambiguous name. Although it is almost impossible to obtain the research area directly from citation records, it can be indirectly extracted from the article titles, which are quite easy to obtain. Usually, most researchers work in a specific domain and publish articles in their respective domain. The research domain of an author is often reflected in the keywords used in the titles of the articles written by her.

We further observe a very interesting point related to the year of publication, which none of the earlier works did. On careful scrutiny of the histogram of the number of articles published in different years by various authors having ambiguous names (already disambiguated by any means), we find that in many cases the activity spans of two different individuals having the same name do not coincide in time. The activity span of an author is defined by the interval (start year, end year) during which she publishes. For example, as shown in Fig. 1, among four individual authors sharing the name "Sourav Chakraborty"¹, the activity spans of three (namely, Sourav Chakraborty 002, Sourav Chakraborty 003, Sourav Chakraborty 004) do not overlap. Similar observations hold for many other ambiguous names such as He Sun and Tao Xie.

Fig. 1. Histogram showing the activity span of various authors with an ambiguous name

¹ http://dblp.uni-trier.de/pers/hd/c/Chakraborty:Sourav

In a nutshell, the above observations motivate us to consider the following attributes of citation records in our proposed name
disambiguation technique: (i) co-author relationship, (ii) article title, (iii) year of publication. The manner in which these attributes are used to disambiguate the authors is described in Section IV.

III. PROBLEM FORMULATION

As mentioned in Section I, a digital library user, while searching for publications of a particular individual, often receives a combined set of articles written by different individuals sharing the same name. The objective of a name disambiguation method is to partition the set into a number of subsets where each subset contains (all and ideally only all) the articles written by an individual author. Formally, the problem of name disambiguation may be formulated as follows: Let $C_A = \{c_1, c_2, \ldots, c_k\}$ be a set of citation records such that one of the authors of each citation record $c_i$ has name $A$. The objective of author name disambiguation is to identify a set $R_A = \{r_1, r_2, \ldots, r_m\}$ of real individuals and partition the set $C_A$ into subsets $\{C_{r_1}\}, \{C_{r_2}\}, \ldots, \{C_{r_m}\}$, where each $r_i$ represents a reference to a real individual having name $A$ and $m$ denotes the total number of such individuals.

The problem of name disambiguation can be better illustrated using an example presented in Table I. There are a total of six articles, each of which has an author named Debasis Mitra ($A$). Apparently, it seems that all these six articles are authored by a single individual (Debasis Mitra). However, a careful scrutiny reveals that the set actually belongs to three distinct individuals (marked with different colors) all sharing the name Debasis Mitra. The articles $c_1$, $c_3$ and $c_5$ belong to Debasis Mitra ($r_1$), who is a professor at the Florida Institute of Technology, Florida, USA. The articles $c_2$, $c_4$ belong to Debasis Mitra ($r_2$), who is an Assistant Professor at National Institute of Technology, Durgapur, India, and the article $c_6$ belongs to Debasis Mitra ($r_3$), who worked at Bell Labs, NJ, USA. A proper name disambiguation technique should partition the set of records $\{c_1, c_2, c_3, c_4, c_5, c_6\}$ (Table I) into three subsets, viz. $\{c_1, c_3, c_5\}$, $\{c_2, c_4\}$ and $\{c_6\}$.

Articles
c1: Optimization and Design of Network Routing Using Refined Asymptotic Approximations.
    Authors: Debasis Mitra (A), John A. Morrison (a1), K. G. Ramakrishnan (a2)
c2: Certificate-based encoding of gate level description for secure transmission.
    Authors: Sandip Ghosal (a3), Debasis Mitra (A), Subhasis Bhattacharjee (a4)
c3: Asymptotic Expansions of Moments of the Waiting Time in a Shared-Processor of an Interactive System.
    Authors: Debasis Mitra (A), John A. Morrison (a1)
c4: Secure transmission of gate level description.
    Authors: Sandip Ghosal (a3), Debasis Mitra (A)
c5: The structure and management of service level agreements in networks.
    Authors: Eric Bouillet (a5), Debasis Mitra (A), K. G. Ramakrishnan (a2)
c6: Reconstruction of 4-D Dynamic SPECT Images from Inconsistent Projection using a Spline Initialized FADs Algorithm (SIFADS)
    Authors: Mohmoud Abdalah (a6), Rostyslav Boutchko (a7), Debasis Mitra (A), Grant T. Gullberg (a8)

Table I
A FEW CITATION RECORDS OF AMBIGUOUS AUTHORS

IV. PROPOSED WORK

As apparent from earlier research in the related area [8], [14], [22], hierarchical clustering techniques are found to be effective in solving name disambiguation. The applicability and/or efficiency of such techniques strongly depends on the bibliographic attributes considered at various levels of clustering. In this section, we present our proposed hierarchical clustering technique that uses only three easily accessible and effective attributes (as outlined in Section II), namely co-author relationship, article title and publication year. It may be noted that, given an author name $A$, the collection of bibliographic records extracted from a DL corresponding to all the articles where $A$ appears as one of the authors works as the input to the proposed name disambiguation method. The method consists of two steps of clustering. Initially, we assume that each single article forms a cluster (i.e., for each and every article, there is a unique individual author named $A$). In the first step, the initial single-article clusters are grouped into fragments by utilizing the co-author relationship. In the second step, the fragments thus formed are further refined by merging some of them based on article title and year of publication. The overview of the proposed technique is depicted in Fig. 2. The details of the various steps are described in subsequent subsections.

Fig. 2. Flowchart of the proposed name disambiguation technique

A. Clustering based on co-author relationship

As mentioned in Section II, the co-author relationship plays a major role in name disambiguation. Given the set $C_A = \{c_1, c_2, \ldots, c_k\}$ of single-article clusters, we first extract the list of authors for each cluster. The co-author relationship between various clusters is then represented as an undirected graph as follows: each individual cluster $c_i$ represents a node, and there is an undirected edge between two nodes $c_i$ and $c_j$ if article $c_i$ and article $c_j$ have at least one common co-author except $A$. Two articles $(c_i, c_j)$ are said to be directly correlated with respect to the co-author relationship if they share a common co-author $a_m$ (except $A$). Similarly, two articles $(c_i, c_j)$ are said to be transitively correlated with respect to the co-author relationship if $c_i$ shares a common co-author $a_p$ (except $A$) with another article $c_k$, which in turn shares another co-author $a_q$ (except $A$) with the article $c_j$, and so on. In other words, two articles $(c_i, c_j)$ are correlated (either directly or transitively) with respect to the co-author relationship if there exists a path from $c_i$ to $c_j$ in the underlying graph. It may be noted that in reality,
there is very little chance that two articles authored by two different individuals having the same name would be directly or transitively correlated with respect to the co-author relationship. Hence, all the articles forming a connected component [23] in the underlying graph may be considered to be authored by the same individual with name $A$. We use breadth first traversal [24] on the graph to find the connected components and use that information to merge the single-article clusters into a set of fragments $\{F_1\}, \{F_2\}, \ldots, \{F_n\}$, where $n \le k$. After this step, it is assumed that all the records belonging to a fragment $F_i$ are authored by a unique individual ($A_{F_i}$) with name $A$. The set of fragments thus obtained is fed as input to the second step described in the next subsection.

As an illustrative example, let us consider the set of citation records of Table I. The co-author information extracted from the set $C_A$ is summarized in a 2-D matrix $M$ (Table II). The rows of the matrix represent articles $c_1, c_2, \ldots, c_6$ and the columns represent authors $a_1, a_2, \ldots, a_8$ (except $A$). $M[i][j] = 1$ indicates that the article $c_i$ has an author named $a_j$. Similarly, $M[i][j] = 0$ indicates that the article $c_i$ does not have an author named $a_j$. Note that the required graph can be constructed easily by referring to the matrix $M$. The graph for this example is shown in Fig. 3. A breadth first traversal on the graph yields three connected components $\{F_1, F_2, F_3\}$ where $F_1 = \{c_1, c_3, c_5\}$, $F_2 = \{c_2, c_4\}$ and $F_3 = \{c_6\}$. The existence of the connected components may also be verified from Fig. 3.

      a1  a2  a3  a4  a5  a6  a7  a8
c1     1   1   0   0   0   0   0   0
c2     0   0   1   1   0   0   0   0
c3     1   0   0   0   0   0   0   0
c4     0   0   1   0   0   0   0   0
c5     0   1   0   0   1   0   0   0
c6     0   0   0   0   0   1   1   1

Table II
CO-AUTHOR MATRIX (M) FOR THE SET OF CITATION RECORDS $C_A$ OF TABLE I

[Figure: nodes c1 to c6, each labelled with author A; edges c1–c3 (shared co-author a1), c1–c5 (a2) and c2–c4 (a3); c6 is isolated.]

Fig. 3. Graph representation of co-author relationship for the citation records of Table I

B. Clustering based on article title and year of publication

In this step, the set of fragments obtained above is further refined by applying one more level of clustering. Two fragments are merged into a single cluster if they satisfy a similarity criterion specially designed based on the correlation between the articles with respect to article title and year of publication. If the similarity in titles of two articles (expressed as title_sim) from two different fragments is sufficiently high (TS_HIGH), the two fragments are merged without considering the publication year further. If title_sim is found to be moderate (TS_MOD), then the two fragments are merged if the similarity in the activity spans of the two authors corresponding to the two fragments (expressed as actspan_sim) is found to be true (AS_TRUE). Note that the activity span of the author corresponding to a fragment can be found by considering the publication years of the articles of that fragment. Two fragments are not considered for merging at all if title_sim is found to be low (TS_LOW). The overall procedure for possible merging of fragments is depicted in Algorithm 1. The detailed methods of computing title_sim and actspan_sim are described next.

    Input : Intermediate fragments: {F1, F2, ..., Fn}
    Output: Set of clusters: C_r1, C_r2, ..., C_rm
    for Fi in C_A do
        for Fj in C_A do
            if title_sim = TS_HIGH then merge Fi, Fj
            else if title_sim = TS_MOD and actspan_sim = AS_TRUE then merge Fi, Fj
            end
        end
    end

Algorithm 1: Merging fragments in 2nd level of clustering

1) Computation of title_sim: As mentioned in Section II, it is unlikely that two authors sharing an ambiguous name work in the same research domain and hence, research domain may be a crucial factor to be considered in name disambiguation. The articles written by an author often contain keywords relevant to their domain of research. Accordingly, computing the similarity between the article titles of various fragments based on keyword matching may be useful. The procedure for keyword matching involves a preprocessing step to eliminate stop words. Stop words typically refer to common words or phrases used to construct sentences in any language. For example, in the English language, prepositions, articles, verbs and conjunctions may be considered as stop words. However, sometimes a word considered a keyword for some applications may be considered a stop word for another application. Table III shows the stop words and the keywords for the article title corresponding to the record $c_3$ of Table I.

Title string : Asymptotic Expansions of Moments of the Waiting Time in a Shared-Processor of an Interactive System.
Stop words   : of, the, in, a
Keywords     : Asymptotic, Expansion, Moments, Waiting, Time, Shared, Processor, Interactive, System.

Table III
LIST OF STOP WORDS AND KEYWORDS FOR AN EXAMPLE TITLE STRING

Once the stop words are eliminated, the amount of keyword matching (similarity) between each pair of articles $(c_i, c_j)$, where $c_i \in F_i$ and $c_j \in F_j$, is obtained by applying the Ratcliff/Obershelp pattern matching algorithm [25]. The algorithm uses the idea of gestalt (which describes how a person can identify a pattern as a whole, not merely as a collection of parts) to determine the similarity between the pair of strings. It
explores two sorted (in alphabetical order) strings $t_i$ and $t_j$ to locate the longest common subsequence (LCS) of characters. The LCS is considered as an anchor. A similar procedure is applied recursively to the remaining characters (if any) on the left and on the right of the anchor until one of the strings is exhausted. The similarity score (sim_score) is expressed as the percentage match between the pair of strings and is computed as twice the total number of common characters (the summation of characters in the LCS at each step) between the two strings divided by the total number of characters in the two strings.

Based on a detailed empirical study on the publications of a large number of individuals, we have the following observations: Usually, there is at least one pair of articles authored by a typical author where the corresponding article titles include a significant number of common keywords (related to her research domain). However, sometimes there may be no pair having a large number of common keywords, but there are two or more pairs sharing a smaller number of common keywords within their title strings. Keeping this in mind, we assign values to title_sim by considering three different scenarios (Algorithm 2). If there exists at least one pair of articles $(c_i, c_j)$ (where $c_i \in F_i$ and $c_j \in F_j$) such that the corresponding similarity score (sim_score$(c_i, c_j)$) is significantly high (say, 85%), title_sim is assigned the value TS_HIGH. The value TS_MOD is assigned to title_sim if there exists no pair having such a high similarity score but there exist at least two pairs of articles $(c_i, c_j)$ and $(c_{i'}, c_{j'})$ (where $c_i, c_{i'} \in F_i$ and $c_j, c_{j'} \in F_j$) such that both similarity scores are moderately high (say, 65%). The value TS_LOW is assigned to title_sim otherwise.

    Input : {{Fi}, {Fj}}
    Output: title_sim(Fi, Fj)
    title_sim(Fi, Fj) ← None;
    for Title_i in Fi do
        t_i ← remove_stopwords(Title_i);
        for Title_j in Fj do
            t_j ← remove_stopwords(Title_j);
            sim_score ← Title_Similar(t_i, t_j);
            if sim_score ≥ TS_HIGH then count_high++;
            else if sim_score ≥ TS_MOD then count_mod++;
            else if sim_score ≥ TS_LOW then count_low++;
        end
    end
    if count_high ≥ 1 then title_sim(Fi, Fj) ← TS_HIGH;
    else if count_mod ≥ 2 then title_sim(Fi, Fj) ← TS_MOD;
    else title_sim(Fi, Fj) ← TS_LOW;

Algorithm 2: Computation of title_sim

2) Computation of actspan_sim: The activity span of an author $a_i$, denoted by actspan($a_i$), is expressed as the interval (start year, end year) during which $a_i$ publishes. The activity peak corresponding to an author $a_i$, denoted by actpeak($a_i$), is defined as the year in which $a_i$ publishes the maximum number of articles (say, denoted by $p_{max}^{a_i}$). In case $a_i$ publishes $p_{max}^{a_i}$ articles in multiple years, the median is considered as actpeak($a_i$). Recall that the individual author (having name $A$) associated with a fragment $F_i$ is identified by $A_{F_i}$ (Section IV-A). In order to measure actspan_sim between the authors $A_{F_i}$ and $A_{F_j}$, we introduce two parameters, namely actspan_overlap and actspan_distance. The actspan_overlap($A_{F_i}, A_{F_j}$) for two authors $A_{F_i}$ and $A_{F_j}$ is computed as the overlap (in number of years), if any, between actspan($A_{F_i}$) and actspan($A_{F_j}$). The actspan_distance($A_{F_i}, A_{F_j}$) for two authors $A_{F_i}$ and $A_{F_j}$ is computed as the difference (in number of years) between actpeak($A_{F_i}$) and actpeak($A_{F_j}$). Finally, actspan_sim corresponding to $A_{F_i}$ and $A_{F_j}$ is assigned the value AS_TRUE if actspan_overlap($A_{F_i}, A_{F_j}$) is found to be greater than $Th_{overlap}$ and actspan_distance($A_{F_i}, A_{F_j}$) is less than $Th_{distance}$, where $Th_{overlap}$ and $Th_{distance}$ are two predefined thresholds chosen empirically. The procedure for computing actspan_sim($F_i, F_j$) is outlined in Algorithm 3.

    Input : actspan_overlap(A_Fi, A_Fj), actspan_distance(A_Fi, A_Fj)
    Output: actspan_sim(A_Fi, A_Fj)
    if actspan_overlap(A_Fi, A_Fj) > Th_overlap then
        if actspan_distance(A_Fi, A_Fj) < Th_distance then
            actspan_sim(A_Fi, A_Fj) ← AS_TRUE
        end
    else
        actspan_sim(A_Fi, A_Fj) ← AS_FALSE
    end

Algorithm 3: Computation of actspan_sim

V. EXPERIMENTAL RESULTS

In order to evaluate our proposed scheme, we have performed experiments on a large set of bibliographic data extracted from DBLP. Note that DBLP has already disambiguated a few ambiguous names by some means. We extract citation records of thirteen such already disambiguated names. A summary of the data set is shown in the first three columns of Table IV. The dataset contains samples with a large variety in terms of number of true authors and number of articles. A copy of the disambiguated data is kept as the golden data set [26] (to be used as a reference for evaluating the accuracy). The golden data corresponding to a typical name $A$ is represented as a set $C_A^G = \{\{G_{r_1}\}, \{G_{r_2}\}, \ldots, \{G_{r_p}\}\}$, where $r_1, r_2, \ldots, r_p$ represent the true authors and each $\{G_{r_i}\}$ represents the set of articles authored by $r_i$. For each name, the sets of articles corresponding to all the disambiguated authors are merged to construct the input for the proposed technique. In other words, the input data corresponding to $A$ is represented as $C_A^I = \{\{G_{r_1}\} \cup \{G_{r_2}\} \cup \ldots \cup \{G_{r_p}\}\}$. The final result obtained by executing the proposed name disambiguation on the input data thus constructed for $A$ may be expressed as $C_A^P = \{\{C_{r_1}\}, \{C_{r_2}\}, \ldots, \{C_{r_m}\}\}$, where $r_1, r_2, \ldots, r_m$ represent the disambiguated authors and each $\{C_{r_i}\}$ represents the set of articles authored by $r_i$.

The level of accuracy achieved by the proposed scheme is measured by comparing $C_A^P$ with $C_A^G$ for each and every name ($A$) using standard evaluation metrics, namely pairwise precision, pairwise recall and pairwise F1 scores. It may be noted that these metrics have also been used in many earlier works [8], [11], [19]. The pairwise measures mentioned
above are computed by examining the number of article pairs assigned to the same individual by the proposed technique as compared to that obtained from the corresponding golden data. If two articles belonging to the same individual (as per the golden data set) are found to be put in the same cluster by the proposed method, then the pair is considered a correct pair. Otherwise, the pair is considered an incorrect pair. Note that only the pairs belonging to the same clusters are considered during counting. The formulas used to compute the three evaluation metrics mentioned above are given below:

$$\text{Pairwise Precision} = \frac{\#\text{PairsCorrectlyPredicted}}{\#\text{TotalPairsPredicted}}$$

$$\text{Pairwise Recall} = \frac{\#\text{PairsCorrectlyPredicted}}{\#\text{TotalCorrectPairs}}$$

$$\text{Pairwise F1} = \frac{2 \times \text{Pairwise Precision} \times \text{Pairwise Recall}}{\text{Pairwise Precision} + \text{Pairwise Recall}}$$

The last three columns of Table IV show the pairwise precision, recall and F1 scores for each of the ambiguous names under consideration. For better illustration, the F1 scores are also shown with the help of bar charts (Fig. 4). It is apparent that the proposed name disambiguation technique achieves a significantly high level of accuracy for most of the cases.

Fig. 4. F1 scores obtained by the proposed scheme for various names

Author Name          Number of      Number of   Precision   Recall   F1 Score
                     True Authors   Articles
He Sun                5              46         100         80.63    89.27
Qiang Yang           13             457          98.98      98.96    98.97
Rui Zhang            17             859          39.18      34.56    36.73
Jianwei Zhang         9             286         100         98.57    99.28
Sourav Chakraborty    5              54         100         40       57.22
Arnab Roy             6              56         100         84.91    91.84
Tao Xie               6             321          97.34      64.82    77.82
Hui Fang              8             149          57.29      53       55.11
Rohit Singh           7              29         100         83.65    91.10
Anil K. Jain          3             595          99.20      76.36    86.29
Rakesh Agrawal        2             227         100         84.89    91.83
Feng Wang             4             488          73.37      10.11    17.77
Micheal Wagner       16             153         100         99.39    99.69

Table IV
PAIRWISE PRECISION, RECALL AND F1 SCORES OBTAINED BY THE PROPOSED SCHEME FOR VARIOUS AMBIGUOUS NAMES

VI. CONCLUSION

A two-step hierarchical clustering technique for name disambiguation has been proposed. The method alleviates the need for extra and difficult-to-obtain attributes. The use of three very common and easily accessible attributes makes the name disambiguation efficient. The level of accuracy is further improved by the use of the year of publication of articles. The strength of the proposed technique is evident from the high level of accuracy achieved for the data set extracted from DBLP.

REFERENCES

[1] F. A. A. et al., "A brief survey of automatic methods for author name disambiguation," SIGMOD Rec., vol. 41, pp. 15–26, 2012.
[2] S. Li et al., "Author name disambiguation using a new categorical distribution similarity," in European Conference on Machine Learning and Knowledge Discovery in Databases, 2012, pp. 569–584.
[3] M. Khabsa et al., "Large scale author name disambiguation in digital libraries," in International Conference on Big Data, 2014, pp. 41–42.
[4] J. Huang et al., "Efficient name disambiguation for large-scale databases," in 10th European Conference on Principle and Practice of Knowledge Discovery in Databases, 2006, pp. 536–544.
[5] A. Veloso et al., "Cost-effective on-demand associative author name disambiguation," Inf. Process. Manage., vol. 48, pp. 680–697, 2012.
[6] H. Han et al., "A hierarchical naive bayes mixture model for name disambiguation in author citations," in ACM Symposium on Applied Computing, 2005, pp. 1065–1069.
[7] X. Fan et al., "On graph-based name disambiguation," J. Data and Information Quality, vol. 2, pp. 10:1–10:23, 2011.
[8] Y. Liu et al., "A fast method based on multiple clustering for name disambiguation in bibliographic citations," Journal of the Association for Information Science and Technology, vol. 66, pp. 634–644, 2015.
[9] D. A. Pereira et al., "Using web information for author name disambiguation," in ACM/IEEE Conf. on Digital Libraries, 2009, pp. 49–58.
[10] H. Hui et al., "Two supervised learning approaches for name disambiguation in author citations," in 4th ACM/IEEE-CS Joint Conference on Digital Libraries, 2004, pp. 296–305.
[11] X. Yin et al., "Object distinction: Distinguishing objects with identical names," in IEEE 23rd International Conference on Data Engineering, 2007, pp. 1242–1246.
[12] V. I. Torvik and N. R. Smalheiser, "Author name disambiguation in MEDLINE," ACM Trans. Knowl. Discov. Data, vol. 3, pp. 11:1–11:29, 2009.
[13] G. S. Mann and D. Yarowsky, "Unsupervised personal name disambiguation," in Seventh Conference on Natural Language Learning at HLT-NAACL, 2003, pp. 33–40.
[14] T. Arif et al., "Author name disambiguation using two stage clustering," INROADS (Special Issue), vol. 3, pp. 340–345, 2014.
[15] H. Han et al., "A model-based k-means algorithm for name disambiguation," in International Semantic Web Conference, 2003.
[16] D. M. Blei et al., "Latent dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[17] I. Bhattacharya and L. Getoor, "A latent dirichlet model for unsupervised entity resolution," in Sixth SIAM International Conference on Data Mining, 2006, pp. 47–58.
[18] M. Ester et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. of 2nd International Conference on Knowledge Discovery and, 1996, pp. 226–231.
[19] J. Tang et al., "A unified probabilistic framework for name disambiguation in digital library," IEEE Trans. on Knowl. and Data Eng., vol. 24, pp. 975–987, 2012.
[20] F. H. Levin and C. A. Heuser, "Evaluating the use of social networks in name disambiguation in digital libraries," Journal of Information and Data Management, vol. 1, pp. 183–197, 2010.
[21] H. Tran et al., "Author name disambiguation by using deep neural network," in Intelligent Information and Database Systems, 2014, vol. 8397, pp. 123–132.
[22] R. G. Cota et al., "An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations," Journal of the American Society for Information Science and Technology, vol. 61, pp. 1853–1870, 2010.
[23] N. Deo, Graph Theory with Applications to Engineering and Computer Science (Prentice Hall Series in Automatic Computation). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1974.
[24] S. S. Skiena, The Algorithm Design Manual, 2nd ed. Springer Publishing Company, Incorporated, 2008.
[25] J. W. Ratcliff and J. A. Obershelp, "Ratcliff/obershelp pattern recognition," Dictionary of Algorithms and Data Structures, 1983. [Online]. Available: http://nist.gov/dads/HTML/ratcliffObershelp.html
[26] Y. nan Qian et al., "Dynamic author name disambiguation for growing digital libraries," Inf. Retr. Journal, vol. 18, 2015.
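
As a supplementary illustration, the first clustering step (Section IV-A) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `coauthor_fragments` and the dictionary layout of the records are hypothetical, and the author lists are copied from Table I. Two records are linked when they share a co-author other than the ambiguous name, and a breadth first traversal collects each connected component as one fragment.

```python
from collections import defaultdict, deque


def coauthor_fragments(records, ambiguous_name):
    """Return fragments, i.e. connected components of the co-author graph.

    `records` maps a record id to its author list (hypothetical layout).
    """
    # Map every co-author (other than the ambiguous name) to the records listing them.
    by_coauthor = defaultdict(list)
    for rec_id, authors in records.items():
        for author in authors:
            if author != ambiguous_name:
                by_coauthor[author].append(rec_id)

    # Two records are adjacent if they share at least one such co-author.
    adjacency = defaultdict(set)
    for rec_ids in by_coauthor.values():
        for rec_id in rec_ids:
            adjacency[rec_id].update(r for r in rec_ids if r != rec_id)

    fragments, seen = [], set()
    for start in records:  # dict order keeps the output deterministic
        if start in seen:
            continue
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:  # breadth first traversal from `start`
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        fragments.append(component)
    return fragments


# Author lists of the six records in Table I.
TABLE_I = {
    "c1": ["Debasis Mitra", "John A. Morrison", "K. G. Ramakrishnan"],
    "c2": ["Sandip Ghosal", "Debasis Mitra", "Subhasis Bhattacharjee"],
    "c3": ["Debasis Mitra", "John A. Morrison"],
    "c4": ["Sandip Ghosal", "Debasis Mitra"],
    "c5": ["Eric Bouillet", "Debasis Mitra", "K. G. Ramakrishnan"],
    "c6": ["Mohmoud Abdalah", "Rostyslav Boutchko", "Debasis Mitra", "Grant T. Gullberg"],
}
fragments = coauthor_fragments(TABLE_I, "Debasis Mitra")
```

On the Table I records this groups c1, c3 and c5 together, c2 with c4, and leaves c6 on its own, matching the fragments F1, F2 and F3 described in Section IV-A. Exact string equality on author names is a simplification; real citation data would likely need name normalisation first.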

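
The second clustering step (Section IV-B, Algorithms 1 to 3) can be sketched along the same lines. Several points here are assumptions rather than details from the paper: Python's `difflib.SequenceMatcher` is used as a stand-in for the Ratcliff/Obershelp matcher (its documentation describes it as a close relative, so scores may differ from the authors' implementation), the stop-word list is illustrative, the overlap of two activity spans is taken as end year minus start year of the common interval, and the thresholds `th_overlap` and `th_distance` are left as parameters because the paper does not report their values. All function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from statistics import median

# Illustrative stop-word list (assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
TS_HIGH, TS_MOD = 0.85, 0.65  # the "say 85%" and "say 65%" levels of Section IV-B


def normalise(title):
    """Lower-case the title, drop stop words and sort the remaining keywords."""
    words = [w.strip(".,:;()").lower() for w in title.replace("-", " ").split()]
    return " ".join(sorted(w for w in words if w and w not in STOP_WORDS))


def sim_score(title_a, title_b):
    """Similarity in [0, 1]: twice the matched characters over the total characters."""
    matcher = SequenceMatcher(None, normalise(title_a), normalise(title_b), autojunk=False)
    return matcher.ratio()


def title_sim(titles_i, titles_j):
    """Classify a pair of fragments following Algorithm 2."""
    count_high = count_mod = 0
    for title_i in titles_i:
        for title_j in titles_j:
            score = sim_score(title_i, title_j)
            if score >= TS_HIGH:
                count_high += 1
            elif score >= TS_MOD:
                count_mod += 1
    if count_high >= 1:
        return "TS_HIGH"
    if count_mod >= 2:
        return "TS_MOD"
    return "TS_LOW"


def actpeak(years):
    """Year with the most articles; the median of the tied years if there are several."""
    counts = Counter(years)
    top = max(counts.values())
    return median(year for year, count in counts.items() if count == top)


def actspan_sim(years_i, years_j, th_overlap, th_distance):
    """True when the spans overlap by more than th_overlap and the peaks are close."""
    overlap = max(0, min(max(years_i), max(years_j)) - max(min(years_i), min(years_j)))
    distance = abs(actpeak(years_i) - actpeak(years_j))
    return overlap > th_overlap and distance < th_distance


def should_merge(frag_i, frag_j, th_overlap, th_distance):
    """Merge decision of Algorithm 1; a fragment is a list of (title, year) pairs."""
    level = title_sim([t for t, _ in frag_i], [t for t, _ in frag_j])
    if level == "TS_HIGH":
        return True
    if level == "TS_MOD":
        return actspan_sim([y for _, y in frag_i], [y for _, y in frag_j],
                           th_overlap, th_distance)
    return False
```

Sorting the keywords before matching mirrors the remark in the text that the two strings are explored in alphabetical order. A caller would apply `should_merge` to every pair of fragments produced by the first step and union the pairs for which it returns `True`.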
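
The pairwise metrics of Section V can also be made concrete with a short sketch. The function names are hypothetical, clusters are represented as sets of record ids, and returning 0.0 when a denominator is empty is a convention assumed here, since the paper does not discuss that case. A pair counts as predicted if both records fall in the same predicted cluster, and as correct if it also shares a cluster in the golden data.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered pairs of records that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(pair) for pair in combinations(sorted(cluster), 2))
    return pairs


def pairwise_scores(predicted, golden):
    """Return (precision, recall, F1) over same-cluster pairs."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(golden)
    correct = len(pred_pairs & gold_pairs)
    # Empty denominators map to 0.0 (assumed convention).
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the golden partition of Table I against a prediction
# that wrongly leaves c5 in a cluster of its own.
golden = [{"c1", "c3", "c5"}, {"c2", "c4"}, {"c6"}]
predicted = [{"c1", "c3"}, {"c5"}, {"c2", "c4"}, {"c6"}]
precision, recall, f1 = pairwise_scores(predicted, golden)
```

In this example both predicted pairs are correct, so precision is 1.0, but only two of the four golden pairs are recovered, so recall is 0.5 and F1 is about 0.67. High precision with lower recall is the pattern several rows of Table IV show when the true clusters end up split across fragments.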