
IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 5, NO. 2, APRIL-JUNE 2012

Semantics-Based Automated Service Discovery


Aabhas V. Paliwal, Student Member, IEEE, Basit Shafiq, Member, IEEE, Jaideep Vaidya, Member, IEEE, Hui Xiong, Senior Member, IEEE, and Nabil Adam, Senior Member, IEEE
Abstract: A vast majority of web services exist without explicit associated semantic descriptions. As a result, many services that are relevant to a specific user service request may not be considered during service discovery. In this paper, we address the issue of web service discovery given nonexplicit service description semantics that match a specific service request. Our approach to semantic-based web service discovery involves semantic-based service categorization and semantic enhancement of the service request. We propose a solution for achieving functional level service categorization based on an ontology framework. Additionally, we utilize clustering for accurately classifying the web services based on service functionality. The semantic-based categorization is performed offline at the Universal Description, Discovery and Integration (UDDI) registry. The semantic enhancement of the service request achieves a better matching with relevant services. The service request enhancement involves expansion with additional terms (retrieved from the ontology) that are deemed relevant for the requested functionality. An efficient matching of the enhanced service request with the retrieved service descriptions is achieved utilizing Latent Semantic Indexing (LSI). Our experimental results validate the effectiveness and feasibility of the proposed approach.

Index Terms: Web services publishing, web services discovery, services discovery process and methodology.

1 INTRODUCTION

A large number of web services structure a service-oriented architecture and facilitate the creation of distributed applications over the web. These web services offer various functionalities in the areas of communications, data enhancement, e-commerce, marketing, and utilities, among others. Some of the web services are published and invoked in-house by various organizations. These web services may be used for business applications, or in government and military. However, this requires careful selection and composition of appropriate web services. The web services within the service registry (UDDI) [16] have predefined categories that are specified by the service providers. As a result, similar services may be listed under different categories. Given the large number of web services and the distribution of similar services in multiple categories in the existing UDDI infrastructure, it is difficult to find services that satisfy the desired functionality. Such service discovery may involve searching a large number of categories to find appropriate services. Therefore, there is a need to categorize web services based on their functional semantics rather than based on the classifications of service providers. Semantic categorization of web services will facilitate service discovery by organizing similar services together. However, this is not sufficient to improve the selection and matching process. Most service descriptions that exist to date are syntactic in nature. Existing service discovery approaches often adopt keyword-matching technologies to locate the published web services. This syntax-based matchmaking returns discovery results that may not accurately match the given service request. As a result, only a few services that are an exact syntactical match of the service request may be considered for selection. Thus, the discovery process is also constrained by its dependence on human intervention for choosing the appropriate service based on its semantics.

Semantic web technology is a promising approach for automated service discovery and selection [23]. A majority of the current approaches for web service discovery call for semantic web services that have semantic tagged descriptions through various approaches, e.g., OWL-S, Web Services Description Language (WSDL)-S [24], [22]. However, these approaches have several limitations. First, it is impractical to expect all new services to have semantic tagged descriptions. Second, descriptions of the vast majority of already existing web services are specified using WSDL and do not have associated semantics. Also, from the service requestor's perspective, the requestor may not be aware of all the knowledge that constitutes the domain. Specifically, the service requestor may not be aware of all the terms related to the service request. As a result, many services relevant to the request may not be considered in the service discovery process.

In order to address the limitations of existing approaches, an integrated approach needs to be developed for addressing the two major issues related to automated service discovery: 1) semantic-based categorization of web services; and 2) selection of services based on semantic service description rather than syntactic keyword matching. Moreover, the approach needs to be generic and should not be tied to a specific description language. Thus, any given web service could be described using WSDL, OWL-S, or through other means. Furthermore, the approach should make no assumptions about the kinds of web services. In particular, we do not make any assumption about whether the web services are developed in-house or offered to users by third party service providers.

In this paper, we present a novel approach for semantic-based automated service discovery. Specifically, the proposed approach focuses on semantic-based service categorization and selection as depicted in Fig. 1. In our proposed approach, semantic-based categorization of web services is performed at the UDDI and involves semantics-augmented classification of web services into functional categories. The semantically related web services are grouped together even though they may be published under different categories within the UDDI. Service selection then consists of two key steps: 1) parameters-based service refinement; and 2) semantic similarity-based matching. The web service input and output parameters contain the underlying functional knowledge that is extracted for improving service discovery. Parameter-based service refinement exploits a combination of service descriptions and input and output parameters to narrow the set of appropriate services matching the service request, by combining semantics with the syntactic characteristics of a WSDL document. The refined set of web services is then matched against an enhanced service request as part of Semantic Similarity-based Matching. The service request is enhanced by adding relevant ontology concepts, which improves the matching of the service request with the web services. We now present a brief running example that is used throughout the paper to better explain the proposed approach.

Fig. 1. Semantics-based automated service discovery.

Example 1. Consider a user who requires information about the amount of rainfall in a particular region to estimate groundwater recharge for planning sustainable groundwater development. The user considers searching for an appropriate web service by specifying a keyword-based service request. Within the UDDI, service providers may use different terminology for the specification of web service categories. For example, the user requested web service (WS1) may be published by a service provider within the UDDI (public/organizational) under a "weather" category that provides weather information, while another service provider publishes a city information web service (WS2), listed under the "utilities" category, that outputs information about the amount of rainfall received. Thus, a standard text-based service discovery of the requested service will include WS1 within the predefined "weather" category; however, it will not include a potentially appropriate service WS2 within the "utilities" category. The service description for WS1 is "This web service returns historical weather information for a given US postal code, date, and time." with inputs PostalCode, Date, Time, and outputs Temperature, Humidity, Pressure, Precipitation. WS2 is described as "Describes city information for a specific US city and state." with input parameters City, State, and output parameters Population, Temperature, Wind, Precipitation.

In addition, the user formed service request may not include all the relevant keywords for discovering all the appropriate services within the UDDI. For example, the user may search for a web service stating "Find the temperature and rainfall based on zip code." However, there may be services published that provide relevant information based on regions, city names, or addresses. These services could be combined with other locator services to yield better results. Also, some of the published web services may provide relevant information grouped under the term "weather", or the user may not be aware of other parameters, e.g., precipitation, utilized by other web services providing the same resultant information.

Based on the above example, it is evident that for an efficient web service discovery 1) the user must be able to discover all appropriate web services within the UDDI irrespective of the predefined categories, and 2) all appropriate web services must be successfully discovered even if the user is not aware of all the relevant terms that cover these services.

The rest of the paper is organized as follows: Section 2 provides background material and Section 3 presents an overview of the proposed approach. Section 4 provides a detailed discussion on semantic categorization of web services in UDDI. The detailed description for parameters-based service refinement is presented in Section 5. Section 6 includes a discussion on semantic similarity-based matching. The implementation details of the proposed approach and our evaluations are presented in Section 7. We present the related work in Section 8. Finally, conclusion and future work are presented in Section 9.

A.V. Paliwal is with CIMIC, Rutgers University, 1110 Stony Brook Way, North Brunswick, NJ 08902. E-mail: aabhas@cimic.rutgers.edu.
B. Shafiq, J. Vaidya, H. Xiong, and N. Adam are with the MSIS Department and CIMIC, Rutgers University, 1 Washington Park, Newark, NJ 07102. E-mail: basit@cimic.rutgers.edu, jsvaidya@business.rutgers.edu, hxiong@rutgers.edu, adam@adam.rutgers.edu.
Manuscript received 26 July 2008; revised 4 Nov. 2008; accepted 11 Feb. 2010; published online 3 Aug. 2011.
For information on obtaining reprints of this article, please send e-mail to: tsc@computer.org, and reference IEEECS Log Number TSC-2008-07-0068.
Digital Object Identifier no. 10.1109/TSC.2011.19.
1939-1374/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.

2 BACKGROUND

In this section, we provide a brief background of the methodologies utilized for semantic categorization of web services, parameters-based service refinement, and semantic similarity-based matching. We briefly discuss the parameters for ranking semantic relationships in the context of semantic-based service categorization. We also briefly discuss the hyperclique pattern discovery technique used for service refinement. Finally, we provide an overview of LSI in the context of semantic similarity-based matching.

TABLE 1 Example Hyperclique Patterns

2.1 Ranking of Semantic Relationships

Semantic relationships among ontology concepts are generally ranked based on three parameters: relevance, specificity, and the span of the relationship [5]. Below, we describe these parameters.

Relevance (Rel). Concepts may be associated with each other with reference to multiple domains that are specific to user applications. The associated domain for a particular concept may be expressed as a high-level concept in an upper ontology. For example, the concepts "temperature" and "pressure" are associated in the atmospheric domain as well as in the chemical reactivity domain. These domains may be represented by the weather and chemical concepts in an upper ontology, respectively. Relevance comprises the associated domain concept specified by the user and is indicative of the contextual relationship between the concepts. We use the predicate Rel to specify the relevance between any two concepts $t_i$ and $t_j$. The predicate $Rel(t_i, t_j)$ evaluates true if the concepts $t_i$ and $t_j$ are linked to a common concept in the upper ontology.

Specificity (Sp). The concepts are classified based on their position in the concept hierarchy. Concepts in the lower level of the hierarchy are specific concepts, whereas the higher level concepts are termed generic concepts. For example, the entity location may be conveyed through the concepts address and postal code. Address is a generic concept whereas postal code is a specific concept. We use the predicate Sp to specify the specificity relationship between any two concepts $t_i$ and $t_j$. The predicate $Sp(t_i, t_j)$ evaluates true if there is a downward path (indicating specialization) from $t_i$ to $t_j$ in the ontology.

Span (S). The span of the relationships expressing the semantic association conveys the strength of linkage among concepts. The span, specified to restrict the scope of the user request, includes the coverage and the depth of the associated concepts. Coverage includes the concepts at the peer level of the considered concept, whereas the depth includes the levels of descendants to be included. If the concepts are linked within the specified span, the value of $S(t_i, t_j)$ is equal to 1; otherwise it is set to 0.

Ranking of the semantic association includes relevance, specificity, and span. For a given web service that includes concepts $\{t_1, t_2, \ldots, t_n\}$ describing the service, the overall rank $R(t_i, t_j)$ is expressed as

$$R(t_i, t_j) = k_1\, Rel(t_i, t_j) + k_2\, Sp(t_i, t_j) + k_3\, S(t_i, t_j),$$

where $0 < k_1, k_2, k_3 < 1$ and $k_1 + k_2 + k_3 = 1$. Here $k_1, k_2, k_3$ are user-specified weights associated with relevance, specificity, and span, respectively, to obtain the overall rank of the semantic association.

2.2 Hyperclique Patterns Discovery

In this paper, we apply hyperclique patterns [34] for web service discovery. Hyperclique patterns are based on the concepts of frequent item sets [2]. We first briefly review the concepts of frequent item sets and then describe the basic concepts of hyperclique patterns.

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of distinct items and let $T$ represent the set of vectors with elements corresponding to input/output parameters and service description terms. Each vector in $T$ is a subset of $I$. We call $X \subseteq I$ an item set. An item set with $k$ items is called a $k$-item set. The support of $X$, $supp(X)$, is the fraction of vectors containing $X$. If $supp(X)$ is no less than a user-specified minimum support, $X$ is called a frequent item set. The confidence of the association rule $X_1 \rightarrow X_2$ is defined as

$$conf(X_1 \rightarrow X_2) = \frac{supp(X_1 \cup X_2)}{supp(X_1)}.$$

It estimates the likelihood that the presence of a subset $X_1 \subseteq X$ implies the presence of the other item set $X_2 = X - X_1$ [10].

A hyperclique pattern [34] is a new type of association pattern that contains items that are highly affiliated with each other. Specifically, the presence of an item in one service description vector strongly implies the presence of every other item that belongs to the same hyperclique pattern. The h-confidence measure captures the strength of this association and, for an item set $P = \{i_1, i_2, \ldots, i_m\}$, is defined as the minimum confidence of all association rules of the item set with a left hand side of one item, i.e.,

$$hconf(P) = \min\{conf(\{i_1\} \rightarrow \{i_2, \ldots, i_m\}),\ conf(\{i_2\} \rightarrow \{i_1, i_3, \ldots, i_m\}),\ \ldots,\ conf(\{i_m\} \rightarrow \{i_1, \ldots, i_{m-1}\})\},$$

where conf follows the classic definition of association rule confidence [2]. An item set $P$ is a hyperclique pattern if $hconf(P) \geq h_c$, where $h_c$ is the minimum h-confidence threshold.

For example, consider an item set $P = \{A, B, C\}$. Assume that $supp(\{A\}) = 0.1$, $supp(\{B\}) = 0.1$, $supp(\{C\}) = 0.06$, and $supp(\{A, B, C\}) = 0.06$, where supp is the item set support. Then, $conf(\{A\} \rightarrow \{B, C\}) = supp(\{A, B, C\}) / supp(\{A\}) = 0.6$, $conf(\{B\} \rightarrow \{A, C\}) = 0.6$, and $conf(\{C\} \rightarrow \{A, B\}) = 1$. Hence,

$$hconf(P) = \min\{conf(\{A\} \rightarrow \{B, C\}),\ conf(\{B\} \rightarrow \{A, C\}),\ conf(\{C\} \rightarrow \{A, B\})\} = 0.6.$$

Table 1 shows some example hyperclique patterns identified from a real-world web services data set, which includes web service descriptions from various service categories, e.g., weather, financial, graphics, business, communication, and location. For example, the hyperclique pattern {mapurl, distanceunits, time, routeoptions} is from the location category.

2.3 LSI

As part of our approach, we utilize LSI over a set of WSDL documents and the terms in the service description and parameters. LSI, after analyzing a base set of web service documents, finds relations between web service terms including service description and parameters. Given a term
query, LSI translates it into concepts, and finds matching documents and corresponding web services.

LSI is a statistical approach used to capture term relationships and underlying domain semantics [9]. LSI extends the Vector Space Model (VSM) in accounting for the order and association between terms. The associations between terms and documents are calculated and utilized in LSI to reveal an underlying structure or pattern of word usage across service descriptions. LSI relies on Singular Value Decomposition (SVD), an important factorization of a rectangular real or complex matrix. The original matrix is approximated by a linear combination of a decomposition set of term to text-object association data. For example, a $t \times o$ matrix $X$ of terms and objects can be decomposed into the product of three matrices, $X = T_o S_o O_o^{\top}$, such that $T_o$ and $O_o$ have orthonormal columns and $S_o$ is a diagonal matrix. This is an SVD of $X$. Keeping only the $k$ largest singular values of $S_o$ with their corresponding columns in matrices $T_o$ and $O_o$ results in the matrix $X'$, where $X'$ is the unique matrix of rank $k$ that is closest to $X$, such that $X \approx X' = T S O^{\top}$.

One of the main challenges of LSI is the cost of computing and storing the SVD. Local LSI is an approach for dealing with this computationally intensive task which only considers the top-ranked service descriptions relevant to the service request. The results obtained with local LSI are applicable to our domain, as it shares several commonalities with the domain in which those results were obtained: 1) the set of services relevant to the service request; in our case it is the set of web service descriptions based on a common category that provide functionality for service requests; 2) low LSI dimensions, in their case one or two; this is important since in our case web service descriptions are short passages that result in low dimension vectors.

3 OVERVIEW OF THE PROPOSED APPROACH

Fig. 1 illustrates the key steps of the proposed approach for semantic-based service discovery. The first step of the proposed approach involves semantic categorization of the web services published in the UDDI. The next step deals with selection of web services for a given service request. This step involves two tasks: 1) refinement of the set of web services based on the input, output, and description parameters of the service. The purpose of this refinement is to select a set of services from the service categorization module representing the desired functionality in terms of the input and output service parameters; 2) enhancement of the web service request with relevant ontology terms, and the matching of this enhanced service request with the set of candidate web services for selecting the appropriate service.

Fig. 2 illustrates the main components of the overall system that performs the two key steps related to automated service discovery. The Service Categorization module serves as the back end of the system and is executed once, independently of individual service requests. On the other hand, the Service Selection process is executed for each service request and serves as the front end of the overall system.

Fig. 2. Automated service discovery components. The service categorization is performed offline on a regular basis and is independent of the service request. Service selection is executed online and in real time on a per request basis.

The ontology guided web services categorization, as illustrated in Steps 1 to 5 of Fig. 2, takes advantage of clustering. In our approach, as illustrated in Steps 1 and 2, individual web services are represented as a vector that comprises the terms of the service description and of the service's input and output parameters. We refer to this vector as the Service Description Vector (SDV). The initial task for service categorization involves improving the semantic content of the SDV. We achieve this by extending the service description vector with relevant ontology concepts and terms. The improvement of the semantic content is followed by the process of grouping services with similar service functionality that are published under different service categories. For the grouping of web services, we apply clustering to this web service data set, as illustrated in Step 3. The next step of our approach, i.e., Step 4, involves the proper labeling of each group of the clustered web services. The labeling of web service groups involves 1) determining the semantic category to which the member services belong based on the service functionality, and 2) the actual semantic categorization of the web services within the UDDI. We achieve this by associating an ontology concept with each cluster. Following this, we retrieve the web service entries to represent the semantic information in the UDDI by creating tModels in the registry. The tModel corresponds to concepts from the upper ontology, SUMO [15], representing functionality of the service in a relevant domain. The ontology is linked with the respective tModel using the overviewURL tag of the tModels.

The categorization of web services is followed by service selection from the relevant group of services. This is achieved by parameter-based service refinement as illustrated in Steps 5 to 7 of Fig. 2. Parameter-based service refinement includes narrowing the set of appropriate services matching the service request based on service parameters, i.e., input, output, and description. The refined set of web services is then matched against an enhanced service request as part of Semantic Similarity-based Matching, as illustrated in Steps 8 to 13 of Fig. 2. Parameter-based refinement of web services begins with a representation of the web service parameters as a vector in which each entry records the terms of the operation's input and output. The
set of related web services is represented by a collection of such vectors, as illustrated in Step 5. Next we mine this web service collection to find the frequent patterns that satisfy a given support level and confidence level [34], as illustrated in Step 6. The frequent patterns represent the combination of input and output service parameters related to the service request. The terms in the patterns discovered by mining the input/output term vectors may not be semantically related. Therefore, the set of discovered patterns is pruned based on the ranking of semantic relationships among the terms. Then, for each remaining pattern, we retrieve the web services that have the pattern expressed as part of the service description. The pruning of discovered patterns followed by retrieval of associated web services is illustrated in Step 7.

The refined set of web services is then matched against an enhanced service request as part of Semantic Similarity-based Matching, as illustrated in Steps 8 to 13 of Fig. 2. A key part of this process involves enhancing the service request. Step 9 demonstrates the initial processing of the service request and its transformation to a service request vector. Our approach for semantic similarity-based matching utilizes ontology linking to enhance the service request with relevant ontology terms. In Steps 10 and 11 we compute the semantic rank of the relevant terms from the ontology and utilize the semantic ranking to determine the inclusion of the ontology concept in the original service request. Step 12 indicates the formation of the enhanced service request based on certain techniques described in the latter part of this paper. For matching the enhanced service request with the refined set of web service description vectors, we employ the Latent Semantic Indexing (LSI) technique. Step 8 involves the conversion of the refined set of web services into the term document matrix for LSI.

In the proposed approach, both semantic-based service categorization and parameter-based service refinement depend on the service description in the WSDL file. Additionally, we consider keyword-based search for service discovery. The brief textual descriptions of web service functionality and the little documentation on how to invoke the services make keyword-based searches vulnerable to returning irrelevant search results; keyword-based search therefore serves only as a primitive means for discovering web services. Semantic annotation and matching of web services has been proposed to address the drawbacks of syntactic web service descriptions. However, existing web services on the web usually are not equipped with semantic descriptions [21]. A focused search for semantic service descriptions conducted by [20] with a specialized metasearch engine, Sousuo, found not more than about 100 semantic service descriptions in prominent formats like OWL-S, WSML, WSDL-S, and SAWSDL on the web. Klusch and Zhing [20] state that this quantity appears tiny compared to more than half a million RDF sources indexed by the semantic web search engine Swoogle, and several hundreds of validated web service descriptions in WSDL found by Sousuo on the web.

Semantic annotations aim to provide for richer specifications of web services. As a result, supplementing web services with a semantic description of their functionality will further improve their discovery and integration based on the proposed approach. With the goal of supporting efficient service discovery, and with a view to simplifying our approach, we currently validate it utilizing web services described with WSDL. However, our approach is not specific to a single way of describing web services and can be applied to syntactic web services, semantic web services, as well as a combined set of semantic and syntactic web services.

4 SEMANTIC CATEGORIZATION OF WEB SERVICES

In our approach we begin with the semantic categorization of UDDI, wherein we combine ontologies with an established hierarchical clustering methodology, following the service description vector building process. For each term in the service description vector, a corresponding concept is located in the relevant ontology. If there is a match, the concept is added to the description vector. Additional concepts are added and irrelevant terms are deleted based on semantic relationships between the concepts. The resulting set of service descriptions is clustered based on the relationship between the ontology concepts and service description terms. Finally, the relevant semantic information is added to the UDDI for effective service categorization. With respect to our running example, additional concepts from the weather ontology are added to the description vectors for WS1 and WS2. Following this, both WS1 and WS2 are grouped together utilizing hierarchical clustering. All the services within this cluster (including WS1 and WS2) are then associated with an upper ontology concept, "weather", as a category. Below is an outline of the key steps of our approach as illustrated in Steps 1 to 4 of Fig. 2:

1. Build the web service description vectors.
2. Append relevant ontology concepts and delete irrelevant terms based on the ranking of semantic relationships among the terms.
3. Mine the web service collection utilizing hierarchical clustering and associate an upper ontology concept with each cluster and the relevant ontology concept with the corresponding subcluster.

The details of each of the steps for the semantic categorization of UDDI are included in Sections 4.1 to 4.4.

4.1 Web Service Vector Formation

The WSDL file forms part of the initial WSDL set, and its corresponding description and associated parameters are parsed as follows: the WSDL document processing includes the extraction of the associated operation parameters by extracting all terms under the <element name> and <documentation> tags. The next step in the WSDL processing involves removal of markups and index entries, removal of punctuation, and using white space as term delimiters. A collection of individual web service vectors represents the entire data set, denoted as $WS = \{(ws_1, t'_1), \ldots, (ws_i, t'_i)\}$, where $t' = \{t_1, \ldots, t_k\}$ for all $ws \in WS$, the set of web services, and $t \in T$, the set of all different terms in $WS$. Web service vector formation is included in Step 1 of Fig. 2.

4.2 Web Service Vector Modification

Enhancing the service vectors with concepts from the core ontology resolves issues related to synonyms and induces
domain related concepts that provide the context. To achieve semantic enhancement we utilize an approach that augments the WordNet noun database with SUMO mappings [25], [15]. These mappings provide a natural language index to the ontology concepts, mediate between the structured concepts and free text, and validate ontology content. This facilitates modeling of domain elements with relevant ontology concepts by associating SUMO concepts with input nouns via WordNet synsets. Thus, for our running example, the service vector for WS1 is formed as "weather information US postal code date time temperature pressure humidity hour minute seconds zipcode rain address street city state month year snow wind precipitation". The WS2 service vector is "city information US state address latitude population male female longitude region description temperature wind precipitation weather pressure humidity rain snow wind".

The first step in this phase of our approach involves adding relevant ontology concepts to the initial service vector. Our approach considers all concepts for enhancing the web service description. The add step extends each service vector by additional WordNet elements. This is followed by the retrieval of the corresponding mapped SUMO concepts, represented by the set $C$. The modified web service vector is the union of $t'_i$ and $c_i$, where $c_i \in C$.

The next step of this phase involves deleting irrelevant terms based on the ranking of semantic relationships among the terms. The complex relationships are based on property sequences that link the two concepts in the semantic association. Two concepts $e_i$ and $e_j$ are semantically associated with each other if there exist one or more relationships $Rel_{ij}$ between the concepts $e_i$ and $e_j$, where $1 \leq i < n$ and $1 \leq j < n$, and $n$ is the number of terms in a web service. Next, for each of these concepts we find the relevance, specificity, and the user specified span. The user assigns weights ($k_1$, $k_2$, and $k_3$) for each of the parameters ($Rel$, $Sp$, $S$) as a threshold for concept selection. This also makes the ranking process more flexible. Our current approach assigns binary values to the ranking parameters. Assigning a range of specific values to these parameters is part of our future work.

To illustrate our approach, consider the following terms within the web service vector: {temperature, pressure, postal code}. $k_1, k_2, k_3$, as explained in Section 2.1, are user-specified weights associated with relevance, specificity, and span, respectively, to obtain the overall rank of the semantic association between the web service vector terms. The selection of the weight coefficients is a key challenge for relevant research. It is presently heuristics based and subjective to some extent. The aim is to put forward the optimal selection of weights used in the equation to minimize the variation bias. To overcome this difficulty, a range of weights were computed with reasonable assumptions given for the observed results and analyzed results. The objective weight coefficients were obtained in the minimum variance of the difference between the analyzed field and ideal field. For this phase of our approach we consider equal user-specified weights, i.e., $k_1 = 0.33$, $k_2 = 0.33$, and $k_3 = 0.33$; coverage = 2, and depth = 2. The three concepts are linked to the concept "weather" specified in the upper ontology. Thus $Rel = 1$. The concepts are located in the lower part of the concept hierarchy; this is indicative of greater specificity and as a result $Sp = 1$. Since the concepts fall within the specified span of the weather domain in the SUMO ontology, $S = 1$. The semantic rank score of the association pattern is calculated, rounded to an integer value, as $0.33 \cdot 1 + 0.33 \cdot 1 + 0.33 \cdot 1 \approx 1$. The associated semantic rank is utilized to determine the inclusion or deletion of the concept in the service description vector. The modifyServiceVector algorithm (Fig. 3) gives the details. Step 2 of Fig. 2 executes the modification of the web service vector.

Fig. 3. modifyServiceVector Algorithm.

4.3 Clustering and Ontology Concept Association

After enhancing the service description vector with relevant ontology concepts, clustering of service vectors is performed to group functionally similar services together. Hierarchical clustering facilitates classification of all the services, such that each subcluster and the combinations of subclusters create a hierarchy, a structure that is more informative than the unstructured set of clusters. This is the primary reason that we adopt hierarchical, group-average agglomerative clustering to group web services, since we want to have informative clusters of the web service descriptions. Also, the approach of Heß and Kushmerick [11], which uses the information contained in the service description to dynamically create the categories for service classification, illustrates that hierarchical clustering is the best clustering approach for service classification.

The step following the formation of clusters includes associating relevant SUMO ontology concepts. The association of concepts with each cluster facilitates web service discovery by mapping to functional categories. Each cluster $i$ is associated with a concept $c_j$, where $c_j$ is the corresponding ontology concept. The ontology concepts render semantics for web

Fig. 4. associateOntologyCluster Algorithm.

service categorization. Our approach utilizes the mapping of WordNet elements to SUMO concepts. We build a set which contains all concepts that exist in at least one service description and eliminate duplicate concepts. This is followed by locating the position of the remaining concepts in the concept hierarchy $H_c$. Each concept is checked for a subsumes or subsumed relationship with the elements of the set. The resultant superconcept is then mapped to the cluster. The mapping of the ontology concept to the cluster extends semantic information in UDDI. This is executed by the creation of tModels for the associated web services of the cluster in the registry. The associateOntologyCluster algorithm, as shown in Fig. 4, provides the details of our approach.

4.4 Web Service Registry: Reliance on UDDI
One question that may arise is the extent to which our approach is reliant on UDDI. UDDI is a platform-independent, open industry initiative: an XML-based registry enabling service providers to publish service listings, discover each other, and define how the services interact over the Internet. There are several UDDI business registries (UBRs) that provide the ability to locate the services matching the search criteria in an efficient manner. Some of these UBRs include Microsoft, SAP, and National Biological Information Infrastructure (NBII), among others. In addition, various web service search portals (e.g., RemoteMethods, Xmethods), search engines (e.g., Google, Yahoo, and Baidu), and web service crawler engines (e.g., Al-Masri [3]) have originated that are being used for service discovery. These portals and crawlers may not necessarily comply with the original and established web service standards such as UDDI. However, they also incorporate the storage of web services collected from various sources into a central data repository which can be queried by users. In this sense, our service discovery approach can be applied to all web services retrieved through search engines, portals, crawlers, and UBRs. In addition, there also exists the approach of publishing and discovering web services across multiple registries grouped into registry federations, e.g., [32], for enhancing the discovery process. In our approach, UDDI is used only as a base, to be compliant with the universally adopted standard for service discovery. Our proposed approach can, however, be extended to the various approaches mentioned above for discovering web services.

For supporting semantic-based service discovery, the proposed approach adds semantic-based service categorization and service request enhancement as separate layers on top of the UDDI. Addition of these layers affects the performance of the discovery process in terms of increased timing delays. Since service categorization is performed offline and only service request enhancement is performed at runtime, the increase in the timing delay will not be significant. While we do not directly measure the delay due to service request enhancement in our experiments, it can be measured indirectly based on the size of the original service request and the enhanced service request. Typically, a service request vector includes a maximum of 25 elements, and searching the ontology for an additional few tens of elements will not have a significant overhead given the efficient ontology search mechanisms (a search only requires $\log n$ time, where $n$ is the size of the ontology).

5 PARAMETERS-BASED SERVICE REFINEMENT
The next step is service selection from the relevant category of services using parameter-based service refinement. Web service parameters, i.e., input, output, and description, aid service refinement by narrowing the set of appropriate services matching the service request. The relationship between web service input and output parameters may be represented as statistical associations. These associations relay information about the operation parameters that are frequently associated with each other. To group web service input and output parameters into meaningful associations, we apply a hyperclique pattern discovery approach [10]. These associations, combined with the semantic relevance, are then leveraged to discover and rank web services.

For the running example, the first step of our approach for parameters-based service refinement is to build the service parameters association pattern item set for all services within the weather cluster (including WS1 and WS2) [28]. The next step involves pruning the association pattern based on concepts extracted from the domain ontology and a confidence threshold. This provides a set of ranked web services matching service functionality. Below is an outline of the key steps of our approach, as illustrated in Steps 5 to 8 of Fig. 2:

1. Retrieve associated parameters forming the association pattern item set.
2. Perform hyperclique pattern discovery on the association pattern item set.
3. Rank the semantic associations between the terms.
4. Prune the association patterns collection.

Sections 5.1 to 5.4 provide a detailed discussion of each of the steps for parameters-based service refinement.
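The parameter-refinement steps outlined for Section 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, h-confidence is assumed to follow its standard definition from the hyperclique literature (pattern support divided by the largest single-item support), and the semantic rank score is assumed to be the weighted sum $k_1 \cdot Rel + k_2 \cdot S_p + k_3 \cdot S$ used in the worked examples. Only the two running-example services are used as transactions, so the support and h-confidence values differ from those reported for the full weather cluster.

```python
def support(pattern, transactions):
    """Fraction of transactions (parameter sets) containing every item of `pattern`."""
    pattern = frozenset(pattern)
    return sum(pattern <= t for t in transactions) / len(transactions)


def h_confidence(pattern, transactions):
    """Assumed standard definition: supp(pattern) / max single-item support."""
    top = max(support({item}, transactions) for item in pattern)
    return support(pattern, transactions) / top if top else 0.0


def hyperclique_patterns(transactions, min_supp, min_hconf):
    """Breadth-first, level-wise search for patterns of size >= 2.

    A pattern that fails either threshold is never extended, which is safe
    because both measures are antimonotone.
    """
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items
             if support({i}, transactions) >= min_supp}
    found = []
    while level:
        candidates = {p | {i} for p in level for i in items if i not in p}
        level = {c for c in candidates
                 if support(c, transactions) >= min_supp
                 and h_confidence(c, transactions) >= min_hconf}
        found.extend(sorted(level, key=sorted))
    return found


def semantic_rank_score(rel, sp, span, k1, k2, k3):
    """Weighted sum of the binary relevance, specificity, and span values."""
    return k1 * rel + k2 * sp + k3 * span


# Input/output parameters of the two running-example services.
ws1 = frozenset("postalcode date time temperature humidity pressure precipitation".split())
ws2 = frozenset("city state population temperature wind precipitation".split())
transactions = [ws1, ws2]

patterns = hyperclique_patterns(transactions, min_supp=1.0, min_hconf=0.5)
score = semantic_rank_score(1, 1, 1, k1=0.3, k2=0.4, k3=0.3)
```

With both services required to contain a pattern (`min_supp=1.0`), the only surviving pattern is {temperature, precipitation}; lowering the support threshold admits patterns such as {temperature, pressure}, whose h-confidence on these two transactions is 0.5.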


5.1 Service Parameters Retrieval
As discussed earlier, the web service description is provided in the WSDL document. For retrieving the relevant service parameters, the corresponding WSDL document is processed to extract the associated operation parameters by retrieving all terms under the <element name> tag. The WSDL processing also includes stoplist removal and stemming to strip word endings.

5.2 Hyperclique Pattern Discovery
The process of searching hyperclique patterns can be viewed as the generation of a level-wise pattern tree. Every level of the tree contains patterns with the same number of nodes. If the level is increased by one, the pattern size (number of objects in the pattern) is also increased by one. Every pattern has a branch (subtree) which contains all the supersets of this pattern. Our algorithm for finding hyperclique patterns is breadth-first. We first check all the patterns at the first level. If a pattern does not satisfy the user-specified support and h-confidence thresholds, the whole branch corresponding to this pattern can be pruned without further checking. This is due to the antimonotone property of the support and h-confidence measures. For the h-confidence measure, the antimonotone property guarantees that the h-confidence value of a pattern is greater than or equal to that of any superset of this pattern. In this manner, the pattern tree grows level by level until all the patterns have been generated.

Accordingly, for our example the input and output parameters for WS1 are postalcode date time temperature humidity pressure precipitation and for WS2 are city state population temperature wind precipitation. The hyperclique patterns generated, written as {hyperclique pattern (support, h-confidence)}, are {temperature, pressure (9.52, 50 percent)}, {temperature, pressure, precipitation (14.29, 75 percent)}, and {temperature, pressure, city (6.4, 50 percent)}. This algorithm is very efficient for handling large-scale data sets [34]. These patterns indicate the support and the h-confidence levels of association. The patterns are selected on the basis of the h-confidence thresholds. For our approach, we set the h-confidence threshold to 50 percent.

5.3 Ranking Semantic Associations
The complex relationships are based on property sequences that link the two entities in the semantic association. The rankSemanticAssociations algorithm, as shown in Fig. 5, provides the details of our approach.

Fig. 5. rankSemanticAssociations Algorithm.

Two entities $e_i$ and $e_j$ are semantically associated with each other if there exist one or more relationships $Rel_{ij}$, where $1 \le i < n$ and $1 \le j < n$. Next, for each of these entities we find the relevance, specificity, and the user-specified span. The user assigns weights for each of the parameters to refine the request. This also makes the ranking process more flexible. Our current approach assigns binary values to the ranking parameters. Assigning a range of specific values to these parameters is part of our future work.

To illustrate our approach, consider the following association pattern: {temperature, pressure, postal code}. The user-specified weights are $k_1 = 0.3$, $k_2 = 0.4$, and $k_3 = 0.3$; coverage $= 2$, and depth $= 2$. The three concepts are linked to the concept weather specified in the upper ontology. Thus, $Rel = 1$. The concepts are located in the lower part of the concept hierarchy; this is indicative of greater specificity and as a result $S_p = 1$. Next we determine if the concepts fall within the specified span within the weather domain. As illustrated by the WeatherConcepts ontology, the concepts are included in the specified span, thus $S = 1$. The semantic rank score of the association pattern is calculated as $0.3 \times 1 + 0.4 \times 1 + 0.3 \times 1 = 1$. The associated semantic rank is utilized to sort the association pattern collection.

5.4 Association Pattern Collection Pruning
A large number of association patterns are generated in the association pattern mining phase. Patterns containing irrelevant information that would negatively influence the service discovery process need to be discarded. The pruning of the association pattern collection is based on [4]: 1) eliminate the association patterns that have a low semantic relationship ranking between their terms; 2) retain the generic patterns with high confidence. This is illustrated in the pruneAssociationPatterns algorithm in Fig. 6. Functions 1 and 2 are listed in Fig. 6.

Function 1. Given two patterns $X_1 \Rightarrow Y_1$ and $X_2 \Rightarrow Y_2$, the first pattern is eliminated if $\mathrm{ScoreSemRank}\{p(X_1, Y_1)\} < \mathrm{ScoreSemRank}\{p(X_2, Y_2)\}$.

Function 2. Given two patterns $X_1$ and $X_2$, $X_1$ is ranked higher than $X_2$ 1) if $X_1$ has higher confidence than $X_2$, i.e., $\mathrm{conf}(X_1) > \mathrm{conf}(X_2)$, or 2) if the confidences are equal and the support for $X_1$ exceeds that for $X_2$, i.e., $\mathrm{supp}(X_1) > \mathrm{supp}(X_2)$.

6 SEMANTIC SIMILARITY-BASED MATCHING
The parameter-based refined set of web services is then matched against an enhanced service request as part of Semantic Similarity-based Matching. A key part of this process involves enhancing the service request. Our


approach for web semantic similarity-based service selection employs ontology-based request enhancement and LSI-based service matching.

Fig. 6. pruneAssociationPatterns Algorithm.

The basic idea of the proposed approach is to enhance the service request with relevant ontology terms and then find the similarity measure of the semantically enhanced service request with the web service description vectors generated in the service refinement phase [27]. For evaluating this similarity, we employ an LSI-based technique that uses the cosine measure as the similarity metric.

A key issue in the discovery of web services is the query language utilized to form the web service request. The web service request can be formed in two ways, i.e., as a syntactic web service request or as a semantic web service request. The syntactic web service request, in its most basic form, utilizes simple text to form a web service request. Syntactic web query languages such as XQuery, XSLT, GQL, and Lucene, among others, which have been tailored specifically for declarative and efficient access and processing of web data, may also be utilized to form the web service request. The web service request may also be formed in a set of semantics-based XML languages, such as RDF and OWL, that rely on ontologies to explicitly specify the content of the tags used to annotate the service request. Most of the RDF query languages today are relational based, such as SPARQL, RQL, and TRIPLE, among others.

Compared with formal queries, keyword-based queries have the following advantages: 1) a simple syntax in terms of a list of keyword phrases, 2) open vocabularies wherein the users can use their own words to express their information requirement, and 3) the familiarity of the user with these interfaces due to their widespread usage. However, the fundamental disadvantages of a keyword-based web service request are the lack of precision and the lack of verifiability.

A new, semantics-based approach is necessary not only to reduce this information overload problem, but also to enable more effective and productive services over the web. Our research validates the limitations of keyword-based searching and provides an approach in which semantics-enhanced web service requests overcome these limitations. In this paper, we report on the experiment in which we evaluate the benefits and drawbacks of semantic enhancement of the web service request over a pure keyword-matching technique. Thus, though keyword-based web service discovery has proven its usefulness, applying semantics-based web service request strategies should greatly increase the resulting precision of searches and enable new types of web service requests to be formed.

For our running example, we use the weather web service request:

Service Request (SR). Find the temperature and rainfall based on zip code.

Below is an outline of the key steps of our approach (as illustrated in Steps 9 to 13 of Fig. 2), followed by a detailed discussion of each of the steps.

1. Preprocess the service request and determine the overall search category of web services for the search.
2. Index the web service description collection and retrieve relevant service descriptions.
3. Preprocess the service descriptions set and retrieve associated concepts related to the initial service request from the ontology framework.
4. Acquire the associated concepts related to the initial service request to expand the request. Transform the service description set into a term-document matrix.
5. Perform SVD on this matrix.
6. Project the description vectors and the request vector and utilize the cosine measure to determine similarity.

We now go into the details of each step.

6.1 Service Request Preprocessing
The service request is parsed and preprocessed. Preprocessing includes the removal of markup, translation of upper case characters into lower case, punctuation removal, use of white space as the term delimiter, stoplist removal, and stemming to strip word endings. The outcome of this preprocessing is a term vector of term frequencies. In our weather service example, the SR is transformed to {temperature, rain fall, zip code}. The SR terms are then searched in the upper ontology to extract the related upper concepts. These concepts are utilized to determine the category of the web services to be searched for discovering the most appropriate web service satisfying the requested functionality. The upper concepts are retrieved by extracting the root concepts of the concept hierarchy that have the SR terms as its leaf nodes. In our example, this results in {weather}.

6.2 Service Description Retrieval
The corresponding relevant service collection forms the categorized WSDL set. As shown in Fig. 2, Step 5 involves the selection of web service descriptions (WSDL files) that are categorized as weather services in the UDDI. The categoryBag, an optional element of tModels, is used for service categorization. A service can specify its position within the general classification scheme by, for example, an optional list of name-value pairs that are used to give taxonomy information, like industry, product, or geographic codes. These documents are then parsed and


Fig. 8. obtainReducedDimensionForm Algorithm.

processed to form the term-document matrix. The WSDL document processing includes the extraction of the text under the <documentation> tag. The extracted text forms the service description. Additionally, we consider the associated operation parameters by extracting all terms under the <element name> tag.

Fig. 7. generateEnhancedRequest Algorithm.

6.3 Ontology Concept Acquisition
The initial web service discovery process is not explicit, as most users are not entirely aware of the document collection or the domain information. It is, therefore, difficult to formulate a precise SR. This guides us toward iterative SR formulation. This part of our approach is based on the introduction of relevance feedback for information retrieval [30]. Our approach builds on the manual process to provide a semiautomated technique to expand the SR based on the existing terms that make up the SR. The primary objective is to extract associated concepts from the domain ontologies that are determined as relevant and enhance the existing SR. We developed and reused ontologies to form a domain ontology framework. The ontology concepts were extracted by ontology linking based on ontology-to-ontology mapping.

6.4 Service Request Expansion and Term-Document Matrix Formation
One of the assumptions in our experiments is related to simplification in modeling of the ontology framework. Currently, our approach for linking ontologies is based on retrieval of concepts traversing two links expressing an association [18]. This restricts us to gathering concepts across a single ontology at the same level. However, multiple iterations of the concept gathering function enable us to traverse one ontology at each step across the three broad levels. For example, corresponding to our request, our initial ontology modeled a weather forecast as having a set of features, with a specific feature having a set of characteristics. This, however, requires querying across two associations, e.g., weatherforecastWF hasParameter temperatureT, temperatureT has featureF, featureF is unitU. Since this is not feasible, we currently list the features (sky, station, temperature, visibility, wind) without modeling the details of each feature. Therefore, the associated concepts are represented as {weatherforecast|sky, station, temperature, visibility, wind}. The expanded request is thus a union of the original terms and the ontology concepts along with their concept hierarchies as mentioned above. The enhanced service request is represented as:

Enhanced Service Request (ESR): windchill heat humidity dewpoint wind pressure conditions visibility sunrise sunset state moonrise moonset precipitation temperature rainfall zip code region address city state latitude longitude postal code

The generateEnhancedRequest algorithm (Fig. 7) shows the main steps involved in the generation of the expanded SR. Currently, service request expansion is implemented by using windowing and information display. Specifically, the retrieved relevant terms are graphically displayed for the user. The terms chosen by the user are then included to reformulate the expanded service request.

The WSDL file forms part of the categorized WSDL set, and its corresponding description and associated parameters are parsed as explained above. The next step in the WSDL processing involves removal of markup and index entries, removal of punctuation, and use of white space as the term delimiter. WSDL processing also includes stoplist removal and stemming to strip word endings. The obtainReducedDimensionForm algorithm (Fig. 8) describes the procedure for establishing the term-document matrix. WSDL processing results in a term-document matrix wherein each cell entry indicates the frequency with which a term appears


in a document. Consequently, the term-document items are transformed using an ltc weighting: the log value of each individual cell item is calculated, each item for a term is multiplied by the IDF weight of the term, and the document length is then normalized.

Fig. 9. MappingRequest Algorithm.

6.5 SVD Transformation
The SVD program calculates the best reduced dimension approximation for the transformed term-document matrix. A reduced dimension vector for each term and each document and a vector of the singular values form the outcome of the SVD analysis. This reduced dimensional representation is used for determining the appropriate web services. The cosine similarity between term-term and request-description pairs is used as a measure of similarity for further analysis of this representation.

6.6 Service Request Projection
This step involves projecting the description vectors and the request vector and utilizing the cosine measure to determine similarity. This is followed by ranking the corresponding web services, with a higher similarity measure indicating a more appropriate service. See the MappingRequest algorithm (Fig. 9) for details.

7 EXPERIMENTAL EVALUATION
The effectiveness of our approach is shown by conducting three sets of experiments: 1) semantic categorization of the web services in the UDDI, where we evaluate the effectiveness of our results by utilizing the f-measure. The f-measure [6] is based on the precision and recall of each cluster C from a set of services with service categories preassigned by users manually; 2) semantic similarity-based matching, where we compute scores to rate the matching that are the average of a 10-point precision-recall curve. The average of the precision is evaluated at 6, 10, and 18 service descriptions retained and the average of recall is evaluated at 50-100 service descriptions retrieved; and 3) the overall time taken, measured in seconds, for service discovery. To be able to evaluate, we developed a prototype of our approach. The implementation and deployment details of our approach are described in [27].

7.1 Semantic Categorization of Web Services in UDDI
A total of 25 service requests and 800 service descriptions formed the collection of web services. The collection included web services compiled by the project described in [17]. We have also added additional WSDL files from xmethods [14] and from individual file search using search engines, e.g., Google. Data Set ($D_a$) comprises unlabeled web services and additional web services downloaded across various domains. In the data set $D_a$, service categories were preassigned by users manually. Data Set ($D_b$) represents the categorized services from [17]. Data Set ($D_{ab}$) represents the combined set of web services. This collection of web services is classified into 30 categories. The classified service descriptions support a large number of varied requests and provide a sufficient testbed for service discovery. For the experimental evaluation of semantic categorization of UDDI, the data set is represented in four versions, i.e., wherein the maximum number of services in a category is restricted to 5, 10, and 15, respectively. The categories that contain excess documents are not excluded; however, only the maximum number of documents in the particular version is considered. The min-3 max-5 version, however, disregards all categories that contain fewer than three web service instances.

For evaluating the proposed approach for semantic categorization of web services, we structure four preclustering techniques. The process of data analysis and cluster formation is preceded by a preprocessing step that includes stopword removal, stemming, and pruning to reduce the noise in the data. Additionally, we also consider addition of related concepts to the data using the ontology, and deletion of irrelevant terms with and without adding new concepts. In particular, we consider the following data setups for clustering:

1. Orig.: the initial setup is utilized to serve as a baseline for further comparisons. This setup includes all initial preprocessing techniques, i.e., stoplist, stemming, and pruning.
2. Add: this setup includes related concepts from the core ontology. This expansion of the service vector builds on the mapping of the WordNet lexical database to the SUMO ontology.
3. Delete: this involves the removal of the irrelevant terms from the service vectors. Irrelevant terms are determined based on the frequency of their occurrence. In particular, we delete all terms that appear fewer times than a preset threshold.
4. Add and Delete: this technique is a combination of add and delete.

Clustering is performed on each of the above cases. The clustering results are derived from a preassigned set of categorized web services. We present our results for each of the web service data sets in combination with the four techniques. Figs. 10a, 10b, 10c, and 10d plot the cluster size versus the average f-measure over all the data sets for each technique. For the experiment test runs, higher f-measure values indicate higher quality of the clusters formed.

7.1.1 Experiment 1: Orig. Setup
Fig. 10a depicts the results for the original setup. As observed in all the experiments, the f-measure values are far from 1. The experimental results in Fig. 10a serve as a baseline for comparing the results with other data setups.
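The f-measure used to score the clusters against the manually preassigned categories can be made concrete with a small sketch. The paper cites [6] without spelling out the formula, so the version below is an assumption: the class-size-weighted clustering f-measure that is commonly used for this purpose, in which each category is matched with the cluster that gives it the best harmonic mean of precision and recall. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Class-size-weighted f-measure of a clustering.

    labels[i]   : manually preassigned category of service i
    clusters[i] : cluster id assigned to service i
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu  # share of the cluster in this category
            recall = n_both / n_cls     # share of the category in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best     # weight by category size
    return total


labels = ["weather", "weather", "finance", "finance"]
perfect = clustering_f_measure(labels, [0, 0, 1, 1])
merged = clustering_f_measure(labels, [0, 0, 0, 0])
```

A clustering that reproduces the categories exactly scores 1, while lumping both categories into one cluster gives each category precision 0.5 and recall 1, for an overall value of 2/3.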


Fig. 10. (a) Experiment 1: Orig. Setup. (b) Experiment 2: Add Setup (addition of ontology concepts to relevant terms of the service description vector). (c) Experiment 3: Delete Setup. (d) Experiment 4: Add and Delete Setup (addition of ontology concepts to all terms of the service description vector).

7.1.2 Experiment 2: Add Setup
Fig. 10b shows the results for the Add setup. We observe that adding relevant terms from the ontology yields an improvement over experiments conducted with the original data sets, as illustrated in Fig. 10b. This leads us to the conclusion that adding relevant domain knowledge for all the terms is not all that helpful. The lack of high returns in results is on account of the generic nature of the SUMO ontology, which does not focus on a specific domain. This may be due to the fact that a large number of web service descriptions have overlapping categories. The addition of terms related to these overlapping domains creates additional noise which is not resolved by the clustering algorithm. A possible approach to overcome this effect would be to consider addition of concepts from the ontology to only the relevant terms, accounting for context. The ontology serves as a guide for clustering that incorporates domain knowledge and more focused information. We consider two criteria, viz., span and depth, to determine the coverage of the ontology concepts. The exact parameters determining the coverage aim to achieve the smallest set of additional ontology concepts while maintaining the best overall coverage within the smallest set.

7.1.3 Experiment 3: Delete Setup
Better results for cluster quality were observed with term reduction from the service description vectors, as illustrated in Fig. 10c. The term reduction involved pruning individual term vectors of irrelevant and low frequency terms, which increases the specificity of the services.

7.1.4 Experiment 4: Add and Delete Setup
This setup aims to maintain a balance between the generality and the specificity of terms in web service descriptions. This is achieved by expansion of the term vectors with relevant ontology concepts and subsequent reduction of terms from the web service descriptions. The results follow those observed in the add set of experiments. The technique wherein ontology concepts are added to all terms of web service descriptions, followed by pruning, results in increased generality. The best results (illustrated in Fig. 10d) compared to all techniques were observed in the technique wherein ontology concepts are added to relevant terms of web service descriptions, followed by pruning. This results in an increase in specificity and a reduction of generality of the terms in the web service description. The improved results may be explained by the generality-specificity balance achieved: the added semantics provide a good representative set for better categorization, and the noise added to the vector representations is reduced overall.

7.1.5 Summary of Results
We can see in all four data setups that the results improve (in terms of f-measure) with an increase in the number of clusters. These results validate the scalability of our approach. Also, it can be noted in Figs. 10a, 10b, 10c, and 10d that the clusters formed with all available service descriptions yield lower f-measures as compared to those formed in experiments with controlled cluster size. This may be explained by an increase in the purity of clusters with fewer service descriptions in comparison to that of a cluster with the maximum number of service descriptions for individual categories.

Another aspect of our evaluation deals with the frequency of service categorization for the entire UDDI. We perform service categorization on an incremental basis. We assume that the ontology is not perfect and that the ontology is updated to represent additional domain objects and their interrelationships. Then the categorization must be performed every time a newer service is added to the UDDI. However, periodic categorizations may be required if the service additions are frequent, as can be expected in real-life situations with large user and provider communities. In that case, we can update the service category by isolating the upper ontology concept that remains unchanged and then recategorizing all the services that fall in its child concepts. When evaluating the efficiency of our approach, there are a number of factors that affect the timings obtained, viz., the size of the underlying ontology and the number of services to be categorized. We found that the total processing time for the service categorization was 259 seconds for our test set of 800 web services with approximately 1,000 concepts of ontology data.
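The Add and Delete preprocessing setups compared in these experiments can be sketched as two small functions over service term vectors. This is a minimal illustration under stated assumptions, not the prototype's code: `concept_map` is a hypothetical stand-in for the WordNet-to-SUMO mapping, the concept names in it are placeholders rather than verified ontology entries, and the Delete step applies the frequency threshold over the whole collection.

```python
from collections import Counter


def add_concepts(vector, concept_map):
    """Add setup: append the mapped ontology concepts of each term (no duplicates)."""
    extra = [c for term in vector for c in concept_map.get(term, [])]
    return vector + [c for c in dict.fromkeys(extra) if c not in vector]


def delete_rare_terms(vectors, min_count):
    """Delete setup: drop terms occurring fewer than `min_count` times overall."""
    counts = Counter(term for vector in vectors for term in vector)
    return [[t for t in vector if counts[t] >= min_count] for vector in vectors]


# Placeholder mapping; real entries would come from the WordNet/SUMO mappings.
concept_map = {"rain": ["Raining", "WeatherProcess"], "snow": ["WeatherProcess"]}

vectors = [["temperature", "rain", "zipcode"], ["temperature", "snow", "city"]]

added = [add_concepts(v, concept_map) for v in vectors]  # Add
deleted = delete_rare_terms(vectors, min_count=2)        # Delete
added_then_deleted = delete_rare_terms(added, min_count=2)  # Add and Delete
```

In this toy collection the Delete setup keeps only the shared term `temperature`, whereas Add and Delete also keeps the shared concept `WeatherProcess`, which is the kind of generality-specificity balance the fourth setup aims for.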


TABLE 2 Service Scores for Individual Weather Services

For evaluating the analytical complexity of the proposed service categorization approach, let $n$ represent the total number of concepts that form the ontology and $m$ represent the total number of web services. For searching a specific concept in the ontology, $O(\log n)$ search operations need to be performed. The add operation for including the relevant concepts for each web service occurs in constant time. For this reason the standard representation of our approach for service categorization would be $O(m \log n + n)$.

7.2 Semantic Similarity-Based Matching

For evaluating our approach for semantic similarity-based service discovery we set out to discover relevant services for an average of ten service requests. For the purpose of this paper we report our results for the request "Find the temperature and rainfall based on a given zip code." The initial discovery is based on a smaller number of WSDL files with a focus on precision. The next discovery experiment examines a larger section of WSDL files with a focus on maximizing recall. In order to assess the impact of service request expansion with relevant terms from ontology concepts on service discovery, we compare the cosine measure-based similarity scores of the two different service selection methods; with enhanced service request and original service request. Table 2 shows LSI-based service selection scores for the six weather web services obtained from calculating the cosine measure-based similarity results between the service descriptions and the service request. Web services {W3 and W6} are the most appropriate matches for our example service request. With an expanded service request over categorized services we notice an improved result over the original service request. However, this is not observed when we exclude the related concepts derived from the ontology. Services {W3 and W6} have higher scores from a similar web service {W1} for ESR in comparison to SR. The higher scores are indicative of the appropriateness of the service in terms of the requested functionality. The expanded service requests, thus, facilitate improved differentiation between the appropriate services and the rest of the services on account of the higher score differences indicating a better match to the service request.

Further, the performance of LSI and expanded service request is measured by observing the precision and recall levels at 6, 10, and 18 services. The expanded service request has greater overall precision indicating it returns a higher percentage of relevant services over the three levels of services retained. Comparing the two methods for service discovery the service request expansion method is better as indicated by the recall. The experiments were conducted using categorized services that included 1) 50 web services that have news, financial, location, and graphics as their service category, and 2) 100 web services that have news, financial, location, graphics, games, business, flights, web, and music as their service category. The ranking of the services changes as more dimensions are added to the service collection under consideration. However, we notice that all the relevant services are retrieved in the top 20 percent of the number of services being considered. The experimental results also indicate that the categorization of services yields significantly better results in terms of the specific web services being returned for a particular service request.

7.3 Performance

We have compared the time taken to match a Service Request with a web service description (for Dab) within service sets that include 1) predefined categories, 2) semantic categorization, and 3) entire service set or the set of uncategorized services. The basis of this experiment is to validate our approach for an ontology guided web service Categorization. It was observed that the time taken for service matching within pre-defined categories, semantically categorized (our approach) and uncategorized services was 2.58, 3.65, and 406.8 seconds, respectively. We observe that our approach provides a balance in terms of quality of the service selected and also the time taken for matching of an appropriate service.

The observed time for service discovery seems acceptable, especially given that most of the time users will submit more incremental, and hence less time consuming requests. The time it takes to load the system though could be improved. In the future, we plan to further evaluate the scalability of our approach, along with detailed experimentation with actual users to fine tune the way in which our integrated functionality is presented and to eventually evaluate the full benefits of our approach from a performance and solution quality standpoint.

7.4 Deployment

In the existing architecture, the service provider/requestor accesses the UDDI through an application server. To deploy our approach, we need to enhance this by incorporating a semantic application server as well as an ontology repository. The Application Server now executes our approach to select the most suitable services based on semantics processed by the Semantic Application Server in conjunction with the ontology repository. The Semantic Application Server should include an ontology reasoner (e.g., Racer) that utilizes description logics to load and query ontologies to extract the relevant concepts for semantic categorization of web service descriptions and enhancement of service requests.

Since our proposed work considers semantic functionality of web services for service discovery and ranking, we do not explicitly address other QoS measures such as trust and reputation. However, this can be easily incorporated as follows: for example, a trust and reputation registry could be integrated with the UDDI server. Now, the selection of appropriate web services needs to incorporate the calculation of a weighted average of the functionality and the trust and reputation of a web service. Depending upon the number of web services and service requests, we may need to use XML gateway devices to offload the work of parsing and transformation of XML to reduce the computational burden.

Another issue with deployment is that of reflecting, cascading, and management of updates in ontologies within the associated web services concepts. Ontology updates must be carefully managed. There are three key tasks associated with this: first, the revised ontology needs to be assessed and evaluated to ensure logical consistency and check the level of axiomatization. The metadata of the linked web services may also need to be updated. This is particularly important if the ontology is a domain specific ontology as the updates to the concepts may result in a changed categorization of the associated web services. The frequency and scale of these changes will impact the execution and performance of our approach. Note however, that both assessment and recategorization can be done offline, while the existing ontology is still being utilized. The revised ontology can then serve as a drop-in replacement. Alternatively, the UDDI server may decide to adopt and maintain different versions of the ontology. In this case, the web service requestors need to be notified, in an intuitive manner, of the version changes to ontologies and web services. While these challenges must be considered, good design can ensure robust deployment of our approach in terms of computational complexities and overheads.

8 RELATED WORK

The challenges pertaining to automatic classification of web services have been addressed in prior work [7], [11], [8], [26]. In [11], Heß and Kushmerick propose an approach of using the information contained in the service description to dynamically create the categories for service classification, comparing five clustering algorithms. The classification process has similarities to our approach in terms of the construction of term vectors with relevant words and utilizing a hierarchical clustering approach for achieving the best results. Our approach builds on this by 1) including relevant semantic concepts based on semantic relationship ranking for expanding the domain coverage, 2) deletion of nonrelevant terms resulting in the reduction of noise and increase in the purity of the clusters.

Bruno et al. [7] propose a classification approach utilizing Support Vector Machines (SVM) to classify the term vectors. Bruno et al. [7] also make use of concept lattice created using Formal Lattice Analysis to identify concepts for a specific domain as well as the relationships between services belonging to a class. This approach is the closest to our approach. Our approach, however, is based on gleaning of semantic utilizing a domain ontology hierarchy. Additionally, from our point of view, this approach does not address the issue of SVM mapping training data to higher dimensional space, and then finding the maximal marginal hyperplane to separate the data.

One of the approaches for enhancing the training time of SVM, specifically when dealing with large data sets, recommends hierarchical clustering analysis. Also ontologies can be used to improve Formal Concept Analysis (FCA) applications. In standard FCA, the set of attributes does not carry any structure. By considering this set as a set of ontology concepts, we can model relations and dependencies between the attributes. Although this does not increase the complexity of the resulting lattices (as concept lattices cover, up to isomorphism, the whole class of complete lattices), it enriches the conceptual structure and provides new means of interaction and analysis. FCA may also complement our approach by facilitating ontology merge and linking to provide a better depth and span in terms of the domain concepts coverage.

In [26], Oldham et al. propose a framework to semiautomate the semantic annotation of web services for classification-based on matching web service data types and domain ontology concepts making use of schema matching. The main drawback of this is that it is not simple to find similarities with domain ontology concepts as no single domain concept contains the complete structure of a complex schema containing all service parameters.

Existing approaches to web service matching address either syntactic and/or semantic matching, e.g., Sajjanhar et al. [29] have studied LSI to acquire the semantic associations between short textual web service descriptions, Corella et al. [8] describe a heuristic approach for semiautomated web services classification based on a previously classified services corpus. These approaches utilize the initial web services descriptions advertised by service provider and functionality request specified by the service requestor. These initial descriptions do not include any semantic augmentation. Our approach extends this work by adding semantics to the service request. As validated by the experimental results, this helps us achieve improved results for appropriate service discovery.

The most widely used IR technique constitutes the Vector-Space Model [31]. VSM, however, considers the syntactic aspect of term association and does not account for the underlying semantic structure. Kokash et al. [12] address the inadequacy of VSM approach by expanding both the service query and the WSDL descriptions. A Hybrid matching approach is proposed that may combine various matching methods (e.g., syntactic and semantic) into a composite algorithm. This enables ad hoc composition of several (pre-existing) matching approaches based on predefined criteria. In principle, this is similar to our work. However, although it may provide flexibility, it also increases the human intervention for selection of a composite algorithm applicable to a set of services for specific application.

The usage of synonyms does not capture the overall semantics of the domain and application functionality. However, our approach utilizes concepts extracted from domain ontology. These extracted concepts account for relationships between the domain objects and provide a comprehensive coverage for the underlying semantics for both the domain and the application functionality. Our approach appends the syntactic service description with relevant semantic terms. This enables uniform combination of syntactic and semantic matching rendering our approach more generalizable for overall service matching and requiring minimal human interaction.


Our approach has similarities to existing approaches [33], [1], [2], [19] that apply natural language processing techniques to the text part of the challenge in content-based image retrieval (CBIR). These approaches, however, were used in isolation from one another. Our approach, on the other hand, combines both these techniques using concept lists, distance within an ontological structure, and latent semantic indexing. In [13], Sassen et al. describe the SeCSE approach for architecture-time service discovery that is based on ontologies that are validated, easy to use, complete, and widely accepted in domains. In contrast, our approach begins with the description of an ontology framework that includes upper ontologies, e.g., SUMO, and more descriptive domain and application related ontologies. We propose a linked ontology structure for a wide-ranging description of domain semantics. Our approach for service discovery initiates service request enhancement with concepts extracted from related domain ontologies and uses LSI to reduce the dimensions of the service request and WSDL specification term vectors.
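The effect of enhancing a service request with ontology concepts before cosine matching can be illustrated with a minimal sketch. It deliberately works on raw term-count vectors rather than in the reduced LSI space, since the dimension reduction would additionally require a singular value decomposition of the term-by-service matrix. The ontology fragment, the service names, and the term lists below are hypothetical toy data, not the data sets used in the experiments.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine measure between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_request(terms, ontology):
    """Enhanced service request: original terms plus related concepts."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(ontology.get(term, []))
    return expanded


def score(request_terms, services):
    """Similarity score of every service description for one request."""
    request = Counter(request_terms)
    return {name: cosine(request, Counter(desc)) for name, desc in services.items()}


# Hypothetical toy data (assumption, for illustration only).
ONTOLOGY = {
    "temperature": ["weather", "forecast"],
    "rainfall": ["precipitation", "weather"],
}
SERVICES = {
    "W_forecast": ["weather", "forecast", "zip", "precipitation"],
    "S_stock": ["stock", "quote", "zip", "price"],
}

if __name__ == "__main__":
    sr = ["temperature", "rainfall", "zip"]
    esr = expand_request(sr, ONTOLOGY)
    print("SR :", score(sr, SERVICES))
    print("ESR:", score(esr, SERVICES))
```

In this toy setting the original request matches both services only on the shared term "zip", so their scores tie, whereas the expanded request scores the weather service clearly higher. This mirrors, in a much simplified form, the improved differentiation reported for the expanded service request.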

9 CONCLUSION AND FUTURE WORK

In this paper, we present an integrated approach for automated service discovery. Specifically, the approach addresses two major aspects related to semantic-based service discovery: semantic-based service categorization and semantic-based service selection. For semantic-based service categorization, we propose an ontology guided categorization of web services into functional categories for service discovery. This leads to better service discovery by matching the service request with an appropriate service description. For semantic-based service selection, we employ ontology linking (semantic web) and LSI thus extending the indexing procedure from solely syntactical information to a semantic level. Our experiments show that this leads to increased precision levels, recall levels, and the relevance scores of the retrieved services.

In the future, we will extend our approach to allow service requests that are formed using specialized query languages. We can then match these requests to semiannotated services that are described using formats such as SAWSDL, OWL-S among others. We can also extend our work for web service composition. Typically, multiple services have to be discovered so that they together match a service request. It should be possible to utilize ontologies, and explicitly return the sequence of individual service invocations to be performed in order to achieve the desired composite service. When no full match is possible, a flexible matching approach could be created to return partial matches and/or suggest additional inputs that would produce a full match by capturing the dependencies among the matched services. This has several interesting research issues. Another avenue for future work is to create an interactive, intelligent service composer that is semantically guided to locate the target service components step by step. We also intend to extend our ontology framework and investigate additional mapping tools to better express a service request to search for relevant concepts. Finally, as part of the service discovery process we will explore associating semantic weights to the retrieved set of web services for effective semantic ranking of the results.

ACKNOWLEDGMENTS

This work was supported in part by the US National Science Foundation under grant IIS-0306838 and SAP Labs, LLC.

REFERENCES

[1] J. Adcock, A. Girgensohn, M. Cooper, T. Liu, L. Wilcox, and E. Rieffel, "FXPAL Experiments for TRECVID," Proc. TRECVID, 2004.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, 1993.
[3] E. Al-Masri and Q.H. Mahmoud, "Investigating Web Services on the World Wide Web," Proc. 17th Int'l Conf. World Wide Web (WWW '08), Apr. 2008.
[4] M.-L. Antonie and O.R. Zaïane, "Text Document Categorization by Term Association," Proc. IEEE Int'l Conf. Data Mining (ICDM '02), 2002.
[5] K. Anyanwu, A. Maduko, and A. Sheth, "SemRank: Ranking Complex Relationship Search Results on the Semantic Web," Proc. 14th Int'l Conf. World Wide Web (WWW '05), 2005.
[6] P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web, Probabilistic Methods and Algorithms, Wiley, 2003.
[7] M. Bruno, G. Canfora, M.D. Penta, and R. Scognamiglio, "An Approach to Support Web Service Classification and Annotation," Proc. IEEE Int'l Conf. E-Technology, E-Commerce and E-Service (EEE '05), 2005.
[8] M.A. Corella and P. Castells, "Semi-Automatic Semantic-Based Web Service Classification," Proc. Int'l Conf. Business Process Management Workshops (BPM '06), 2006.
[9] P.W. Foltz and S.T. Dumais, "Personalized Information Delivery: An Analysis of Information Filtering Methods," Comm. ACM, vol. 35, no. 12, pp. 51-60, 1992.
[10] E. Han, G. Karypis, and V. Kumar, "Scalable Parallel Data Mining for Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '97), 1997.
[11] A. Heß and N. Kushmerick, "Automatically Attaching Semantic Metadata to Web Services," Proc. IJCAI Workshop Information Integration on the Web, 2003.
[12] http://dit.unitn.it/~kokash/documents/WS_matching-hybrid.pdf, 2012.
[13] http://idcrue.dit.upm.es/biblioteca/mostrar.php?id=2154, 2012.
[14] XMethods, http://www.xmethods.net, 2012.
[15] http://reliant.teknowledge.com/DAML/SUMO.owl, 2008.
[16] http://www.uddi.org/specification.html, 2012.
[17] http://www.few.vu.nl/~andreas/projects/annotator/ws2003.html, 2012.
[18] H.L. Johnson, K.B. Cohen, W.A. Baumgartner Jr., Z. Lu, M. Bada, T. Kester, H. Kim, and L. Hunter, "Evaluation of Lexical Methods for Detecting Relationships Between Concepts from Multiple Ontologies," Proc. Pacific Symp. Biocomputing, 2006.
[19] M. Kherfi, D. Brahmi, and D. Ziou, "Combining Visual Features with Semantics for a More Efficient Image Retrieval," Proc. 17th Int'l Conf. Pattern Recognition (ICPR '04), 2004.
[20] M. Klusch and X. Zhing, "Deployed Semantic Services for the Common User of the Web: A Reality Check," Proc. IEEE Int'l Conf. Semantic Computing (ICSC), 2008.
[21] J. Lu, Y. Yu, D. Roy, and D. Saha, "Web Service Composition: A Reality Check," Proc. Eighth Int'l Conf. Web Information Systems Eng. (WISE '07), Dec. 2007.
[22] D. Martin, M. Paolucci, S. McIlraith, M. Burstein, D. McDermott, D. McGunneess, B. Barsia, T. Payne, M. Sabou, M. Solanki, N. Srinivasan, and K. Sycara, "Bringing Semantics to Web Services: The OWL-S Approach," Proc. First Int'l Workshop Semantic Web Services and Web Process Composition, July 2004.
[23] S. McIlraith, T. Son, and H. Zeng, "Semantic Web Services," IEEE Intelligent Systems, vol. 16, no. 2, pp. 46-53, Mar. 2001.
[24] S. McIlraith and D. Martin, "Bringing Semantics to Web Services," IEEE Intelligent Systems, vol. 18, no. 1, pp. 90-93, Jan. 2003.
[25] I. Niles and A. Pease, "Linking Lexicons and Ontologies: Mapping WordNet to the Suggested Upper Merged Ontology," Proc. IEEE Int'l Conf. Information and Knowledge Eng. (IKE '03), 2003.
[26] N. Oldham, C. Thomas, A. Sheth, and K. Verma, "METEOR-S Web Service Annotation Framework with Machine Learning Classification," Semantic Web Services and Web Process Composition, vol. 3387, pp. 137-146, Jan. 2005.
[27] A.V. Paliwal, N. Adam, and C. Bornhoevd, "Adding Semantics through Service Request Expansion and Latent Semantic Indexing," Proc. IEEE Int'l Conf. Services Computing (SCC), July 2007.
[28] A.V. Paliwal, N. Adam, H. Xiong, and C. Bornhoevd, "Web Service Discovery via Semantic Association Ranking and Hyperclique Pattern Discovery," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence, 2006.
[29] A. Sajjanhar, J. Hou, and Y. Zhang, "Algorithm for Web Services Matching," Proc. Asia-Pacific Web Conference (APWeb), pp. 665-670, 2004.
[30] G. Salton and C. Buckley, "Improving Retrieval Performance by Relevance Feedback," J. Am. Soc. for Information Science, vol. 41, no. 4, pp. 288-297, 1990.
[31] G. Salton, A. Wong, and C.S. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, pp. 613-620, Nov. 1975.
[32] K. Verma, K. Sivashanmugam, A. Sheth, A. Patil, S. Oundhakar, and J. Miller, "METEOR-S WSDI: A Scalable P2P Infrastructure of Registries for Semantic Publication and Discovery of Web Services," Information Technology and Management J., vol. 6, pp. 17-39, 2005.
[33] http://www.musclenoe.org/research/sci_deliv_pub/D5.1_WP5_SoA_RevisedVersion_sept05.pdf, 2012.
[34] H. Xiong, P. Tan, and V. Kumar, "Mining Strong Affinity Association Patterns in Data Sets with Skewed Support Distribution," Proc. IEEE Third Int'l Conf. Data Mining (ICDM), 2003.

Aabhas V. Paliwal received the bachelor of engineering degree in electronics and telecommunications from Mumbai University, India, the MS degree in computer engineering from Rutgers University, and the PhD degree in management, information technology from Rutgers University. He is currently a senior technical consultant at Mindlance LifeSciences. He is a coholder of a European issued patent and has two pending patent applications submitted to the US Patent and Trademark Office all related to web services. His research interests include service-oriented architecture, semantic web, semantic web services, and business process management. He is a student member of the IEEE.

Basit Shafiq received the BS degree in electronic engineering from Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Pakistan, the MS degree in electrical and computer engineering from Purdue University, and the PhD degree in computer engineering from the School of Electrical and Computer Engineering at Purdue University. He is currently a research assistant professor at the Center for Information Management, Integration and Connectivity (CIMIC), Rutgers University. His research interests include semantic web, web services, information systems security, and multimedia systems. He is a member of the IEEE.

Jaideep Vaidya received the BE degree in computer engineering from the University of Mumbai, India, and the MS and PhD degrees in computer science from Purdue University. He is currently an associate professor in the Management Science and Information Systems Department at Rutgers University. His research interests include data mining, data management, security, and privacy. He has published more than 60 technical papers in peer-reviewed journals and conference proceedings, and has received two best paper awards from the premier conferences in data mining and databases. He is also the recipient of a US National Science Foundation Career Award and is a member of the ACM and the IEEE.

Hui Xiong received the BE degree from the University of Science and Technology of China, the MS degree from the National University of Singapore, and the PhD degree from the University of Minnesota. He is currently an associate professor in the Management Science and Information Systems Department at Rutgers University. His research interests include data and knowledge engineering, with a focus on developing effective and efficient data analysis techniques for emerging data intensive applications. He has published more than 70 technical papers in peer-reviewed journals and conference proceedings. He is a coeditor of Clustering and Information Retrieval (Kluwer Academic, 2003) and a coeditor-in-chief of Encyclopedia of GIS (Springer, 2008). He is an associate editor of the Knowledge and Information Systems journal and has served regularly on the organization committees and the program committees of a number of international conferences and workshops. He is a senior member of the IEEE and a member of the ACM.

Nabil Adam is currently serving as a fellow at the Science and Technology Directorate of the US Department of Homeland Security. He is a professor of computers and information systems, the founding director of the Rutgers University Center for Information Management, Integration, and Connectivity (CIMIC), and cofounder and the past director of the Meadowlands Environmental Research Institute. He has published more than 100 technical papers covering such topics as information management, information security and privacy, data mining, web services, and modeling and simulation. He has coauthored/coedited 10 books. He is the cofounder and the executive-editor-in-chief of the International Journal on Digital Libraries and serves on the editorial board of a number of journals including the Journal of Management Information Systems and the Journal of Electronic Commerce. He is also the cofounder and the past chair of the IEEE Technical Committee on Digital Libraries. He is a senior member of the IEEE.
