
Text-similarity algorithm for the semantic annotation of web services (WS)


SWAP research group - 27 July 2010
Michele Filannino, @bronko85

Monday, 23 August 2010


Outline
The problem
Reference scenario
Similarity

SAWA
Word-to-word similarity
Text-to-text similarity

Experimental results
Quality of the results
Execution time

Future developments
Demo session



The problem
How can the similarity between two texts be measured?



Reference scenario
[Figure. Left: a WSDL file with natural language descriptions, loaded into the CODEArchitects Annotation Tool. Right: the CODEArchitects Annotation Tool, where the suggested annotations are approved or rejected, and the resulting SAWSDL file.]



Semantic similarity
Assigning a meaning-based similarity metric to a set of terms and/or documents;

Similarity ≠ relatedness;
“Bank” and “money” are related although they are not at all similar;

Similarity ⇒ relatedness;
“Girl” and “maiden” are similar and therefore also related.



Semantic similarity in SWOP

WS concepts: RequestOrder, Order, BillingInformation, ...

Ontological concepts: Order, OrderNumber, OrderID, BillID, BillReference, BusinessFirm, Product, Catalog, ...



Computational cost

Example:
Ontology with 1,200 concepts

WSDL with 15 annotations

1,200 x 15 = 18,000 runs of SAWA :(
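
The count above follows from comparing every WSDL annotation with every ontology concept. A minimal sketch of that pairing, where the function name and the placeholder labels are hypothetical and only the two sizes come from the slide:

```python
from itertools import product

def count_similarity_runs(annotations, concepts):
    """Count the text-to-text comparisons needed when every annotation
    is compared with every ontology concept (exhaustive pairing)."""
    return sum(1 for _ in product(annotations, concepts))

# Sizes taken from the example; the labels themselves are placeholders.
concepts = [f"concept_{i}" for i in range(1200)]
annotations = [f"annotation_{i}" for i in range(15)]

runs = count_similarity_runs(annotations, concepts)
print(runs)  # 18000
```

The cost grows with the product of the two sizes, which is why the number of calls to the similarity routine matters so much here.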



SAWA
Similarity Algorithm Wikipedia-bAsed



Word-to-word similarity

Given two words, determine how similar they are;

Types of algorithms for computing the similarity between words:
Corpus-based: pointwise mutual information, latent semantic analysis;

Hierarchy-based: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath;

Input: two words;

Output: a score between 0 and 1.



Lin's algorithm (1998)
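
The formula on this slide did not survive extraction. Assuming it refers to Lin's information-theoretic measure over a concept hierarchy, sim(a, b) = 2 log P(lcs(a, b)) / (log P(a) + log P(b)), where lcs is the lowest common subsumer, a minimal sketch could look as follows. The toy taxonomy and the probabilities are hypothetical, and this is not the DISCO implementation:

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "feline": "animal",
    "tiger": "feline",
    "lion": "feline",
    "dog": "animal",
}

# Hypothetical probability of meeting each concept (or one of its descendants).
PROB = {"entity": 1.0, "animal": 0.5, "feline": 0.2,
        "tiger": 0.05, "lion": 0.05, "dog": 0.2}

def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def lowest_common_subsumer(a, b):
    """Most specific concept that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    raise ValueError("concepts do not share a root")

def lin_similarity(a, b):
    """Lin-style similarity: 2 * log P(lcs) / (log P(a) + log P(b))."""
    denominator = math.log(PROB[a]) + math.log(PROB[b])
    if denominator == 0.0:  # both concepts are the root
        return 1.0
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * math.log(PROB[lcs]) / denominator

print(lin_similarity("tiger", "lion"))
print(lin_similarity("tiger", "dog"))
```

In this toy setup the score stays between 0 and 1, and two concepts that share a more specific ancestor score higher than two that only meet near the root.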



Word-to-word similarity tool

Library used: LinguaTools DISCO;

Uses Wikipedia as its concept hierarchy
202,578 concepts;

Updated to 1 January 2008

Uses Lin's algorithm to compute the similarity.



Examples

Tiger, lion = 90%
Doctor, nurse = 70%
Stock, market = 47%
Love, sex = 46%
FBI, investigation = 35%
Professor, cucumber = 0.006%



Quality of the algorithm
Corpus used to measure quality: WordSim353;
Pearson correlation coefficients:
Wikipedia: 0.574;

BNC: 0.415;

PubMed: 0.105;
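
As a reminder of what these coefficients measure, the sketch below computes a Pearson correlation between two lists of scores. The sample values are made up for illustration; they are not taken from WordSim353 or from DISCO.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)

# Made-up human ratings (0-10 scale) and algorithm scores (0-1) for five word pairs.
human_ratings = [7.4, 8.1, 6.3, 3.0, 0.5]
algorithm_scores = [0.90, 0.70, 0.47, 0.35, 0.01]

print(round(pearson(human_ratings, algorithm_scores), 3))
```

A value close to 1 means the algorithm ranks word pairs much like the human judges do; a value close to 0 means there is little linear agreement.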



Text-to-text similarity
Given two texts, determine how similar they are;
A suitable extension of word-to-word similarity algorithms;
Removal of stopwords, i.e. words with:
low discriminative power;

high frequency of occurrence;

Input: two texts;

Output: a score between 0 and 1.



Stopwords

“Returns the first and last name of each customer who is categorized as an individual consumer”

After stopword removal:

“name customer categorized individual consumer”
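
The filtering step above can be sketched in a few lines. The stop list below is hypothetical: it was chosen only so that the example sentence reduces to the result shown on the slide, and the list actually used by SAWA is not given in the source.

```python
# Hypothetical stop list, picked to reproduce the slide's example.
STOPWORDS = {
    "returns", "the", "first", "and", "last", "of",
    "each", "who", "is", "as", "an",
}

def remove_stopwords(text):
    """Lowercase the text, split on whitespace and drop every stopword."""
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in STOPWORDS)

description = ("Returns the first and last name of each customer "
               "who is categorized as an individual consumer")
print(remove_stopwords(description))
# name customer categorized individual consumer
```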



Corley & Mihalcea's algorithm (2005)
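
The formula on this slide did not survive extraction. The sketch below is a simplified version in the spirit of Corley & Mihalcea's measure: each word is matched with its most similar word in the other text, the matches are weighted by idf, and the two directions are averaged. It ignores the part-of-speech classes of the original paper. The `word_sim` callback, the `idf` table and the toy similarity values are placeholders, not the ones used in SAWA.

```python
def directional_similarity(source, target, word_sim, idf):
    """idf-weighted average of the best match each source word finds in target."""
    weighted_sum = 0.0
    total_weight = 0.0
    for word in source:
        best = max((word_sim(word, other) for other in target), default=0.0)
        weight = idf.get(word, 1.0)  # unknown words get a neutral weight
        weighted_sum += best * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

def text_similarity(tokens_a, tokens_b, word_sim, idf):
    """Symmetric score: mean of the two directional similarities."""
    return 0.5 * (directional_similarity(tokens_a, tokens_b, word_sim, idf)
                  + directional_similarity(tokens_b, tokens_a, word_sim, idf))

# Toy word-to-word similarity standing in for a real backend such as DISCO.
TOY_SIMILARITY = {frozenset(("customer", "consumer")): 0.8}

def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)

score = text_similarity(["name", "customer"], ["name", "consumer"], toy_word_sim, {})
print(round(score, 2))  # 0.9
```

Averaging the two directions keeps the score symmetric, so comparing a short text with a long one gives the same result whichever is passed first.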



Optimizations (v1.2)

Caching of the frequency of each term;

Caching of the similarities between terms;
Incremental learning;
Fewer accesses to DISCO;

Execution time reduced by up to about 10 times.
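
One way to cache term similarities and cut down on backend accesses is a small memoizing wrapper. The sketch below assumes that the underlying word-to-word measure is symmetric, so each unordered pair is stored once. The class name and the stand-in backend are hypothetical; the real DISCO calls are not shown in the source.

```python
class CachedSimilarity:
    """Memoize a symmetric word-to-word similarity function."""

    def __init__(self, backend):
        self._backend = backend   # the expensive lookup
        self._cache = {}
        self.backend_calls = 0    # how often the backend was really hit

    def __call__(self, a, b):
        # Normalize the pair so (a, b) and (b, a) share one cache entry.
        key = (a, b) if a <= b else (b, a)
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._backend(*key)
        return self._cache[key]

def slow_backend(a, b):
    """Stand-in for an expensive lookup; returns a dummy score."""
    return 1.0 if a == b else 0.5

similarity = CachedSimilarity(slow_backend)
similarity("tiger", "lion")
similarity("lion", "tiger")   # served from the cache
similarity("tiger", "lion")   # served from the cache
print(similarity.backend_calls)  # 1
```

Because the cache persists across calls, later descriptions reuse the pairs computed for earlier ones.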



Experimental results
Quality and execution time



SELECTED WSDL DOCUMENT DESCRIPTION:
"returns the first and last name of each customer who is categorized as an individual consumer"

RANKING OF SIMILAR ONTOLOGICAL CONCEPTS (with their scores):


*---------------------------------------------------------------------------------------------------------------*
| Description | Score |
*---------------------------------------------------------------------------------------------------------------*
| name: name of customer | 62,85% |
| customer: Current customer individual information | 56,91% |
| customeraddress: Customer address | 42,36% |
| customercredicard: Customer credit card information | 35,08% |
| salesreason: Reasons why a customer may purchase a particular product. | 30,35% |
| customerstore:Stores of our Company (customer and resellers). | 17,31% |
| salesorderdetail: Product details associated with a specific sales order. | 2,99% |
| productinventory: Product inventory information. | 2,59% |
| salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,39% |
| productlocation: Product manufacturing locations | 2,36% |
| salestaxrate: Sales Tax rate. | 2,36% |
| salesterritory: Sales territory. | 2,22% |
| employeeaddress: Employee information such as salary, department, and title. | 2,18% |
| product: Products sold or used in the manfacturing of sold products. | 2,12% |
| enterpricedepartment: Departments of Enterprise | 2,00% |
| salesspecialoffer: Sales Special Offer (discounts). | 1,99% |
| productlistpricehistory: Changes in the list price of a product over time. | 1,80% |
| shipmethod: Shipping methods. | 1,79% |
| salesorder: General sales order information (header). | 1,76% |
| productdocument: Product Document | 1,73% |
| productcosthistory: Changes in the cost of a product over time. | 1,68% |
| productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,61% |
| productmodel: Product model classification. | 1,48% |
| currencyrate: Currency exchange rates. | 1,40% |
| salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1,29% |
| productcategory: High-level product categorization. | 1,27% |
| addresstype: Types of addresses | 0,95% |
| unitmeasure: Unit of measure. | 0,80% |
| currency: Standard ISO currencies. | 0,51% |
| countryregion: ISO standard codes for countries and regions. | 0,51% |
| stateprovince: States and provinces | 0,12% |
*---------------------------------------------------------------------------------------------------------------*
Time elapsed: 9.4 seconds.



SELECTED WSDL DOCUMENT DESCRIPTION:
"lists the names and addresses of all individual customers"

RANKING OF SIMILAR ONTOLOGICAL CONCEPTS (with their scores):


*---------------------------------------------------------------------------------------------------------------*
| Description | Score |
*---------------------------------------------------------------------------------------------------------------*
| addresstype: Types of addresses | 51,77% |
| customer: Current customer individual information | 24,03% |
| customeraddress: Customer address | 10,83% |
| name: name of customer | 6,32% |
| productlistpricehistory: Changes in the list price of a product over time. | 4,91% |
| customercredicard: Customer credit card information | 4,47% |
| salesreason: Reasons why a customer may purchase a particular product. | 4,20% |
| customerstore:Stores of our Company (customer and resellers). | 3,21% |
| salesorder: General sales order information (header). | 2,72% |
| salesspecialoffer: Sales Special Offer (discounts). | 2,53% |
| salesorderdetail: Product details associated with a specific sales order. | 2,49% |
| salesterritory: Sales territory. | 2,14% |
| salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,08% |
| employeeaddress: Employee information such as salary, department, and title. | 1,81% |
| salestaxrate: Sales Tax rate. | 1,79% |
| productlocation: Product manufacturing locations | 1,78% |
| countryregion: ISO standard codes for countries and regions. | 1,64% |
| product: Products sold or used in the manfacturing of sold products. | 1,62% |
| productinventory: Product inventory information. | 1,60% |
| currencyrate: Currency exchange rates. | 1,46% |
| enterpricedepartment: Departments of Enterprise | 1,45% |
| productmodel: Product model classification. | 1,38% |
| shipmethod: Shipping methods. | 1,37% |
| salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1,36% |
| productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,32% |
| productdocument: Product Document | 1,27% |
| productcosthistory: Changes in the cost of a product over time. | 1,26% |
| productcategory: High-level product categorization. | 1,01% |
| currency: Standard ISO currencies. | 0,85% |
| stateprovince: States and provinces | 0,73% |
| unitmeasure: Unit of measure. | 0,71% |
*---------------------------------------------------------------------------------------------------------------*
Time elapsed: 4.177 seconds.



SELECTED WSDL DOCUMENT DESCRIPTION:
"returns the name of each customer that is categorized as a store"

RANKING OF SIMILAR ONTOLOGICAL CONCEPTS (with their scores):


*---------------------------------------------------------------------------------------------------------------*
| Description | Score |
*---------------------------------------------------------------------------------------------------------------*
| name: name of customer | 64,29% |
| customeraddress: Customer address | 43,83% |
| customer: Current customer individual information | 40,05% |
| customercredicard: Customer credit card information | 36,52% |
| salesreason: Reasons why a customer may purchase a particular product. | 31,74% |
| customerstore:Stores of our Company (customer and resellers). | 21,07% |
| employeeaddress: Employee information such as salary, department, and title. | 2,75% |
| salesorderdetail: Product details associated with a specific sales order. | 2,67% |
| productinventory: Product inventory information. | 2,52% |
| salestaxrate: Sales Tax rate. | 2,22% |
| salesterritory: Sales territory. | 2,19% |
| salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,09% |
| productlocation: Product manufacturing locations | 1,91% |
| enterpricedepartment: Departments of Enterprise | 1,87% |
| salesorder: General sales order information (header). | 1,84% |
| product: Products sold or used in the manfacturing of sold products. | 1,79% |
| salesspecialoffer: Sales Special Offer (discounts). | 1,72% |
| productlistpricehistory: Changes in the list price of a product over time. | 1,68% |
| productdocument: Product Document | 1,63% |
| salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1,61% |
| shipmethod: Shipping methods. | 1,52% |
| productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,47% |
| productcosthistory: Changes in the cost of a product over time. | 1,43% |
| productmodel: Product model classification. | 1,42% |
| currencyrate: Currency exchange rates. | 1,30% |
| productcategory: High-level product categorization. | 1,15% |
| addresstype: Types of addresses | 1,02% |
| unitmeasure: Unit of measure. | 0,93% |
| countryregion: ISO standard codes for countries and regions. | 0,45% |
| currency: Standard ISO currencies. | 0,44% |
| stateprovince: States and provinces | 0,12% |
*---------------------------------------------------------------------------------------------------------------*
Time elapsed: 1.245 seconds.



Execution time

#   Optimized   Not optimized
3   1.0 s       9.4 s
6   1.7 s       9.8 s
5   2.7 s       18.1 s
7   3.6 s       21.8 s
2   3.9 s       15.5 s
8   5.6 s       23.1 s
1   6.2 s       14.3 s
4   9.4 s       39.4 s
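
The speed-up for each row of the timings above can be worked out directly. The sketch below uses only the figures from the slide; the variable names are illustrative.

```python
# (optimized seconds, non-optimized seconds) keyed by the row label on the slide.
TIMINGS = {
    3: (1.0, 9.4),
    6: (1.7, 9.8),
    5: (2.7, 18.1),
    7: (3.6, 21.8),
    2: (3.9, 15.5),
    8: (5.6, 23.1),
    1: (6.2, 14.3),
    4: (9.4, 39.4),
}

# Speed-up factor = non-optimized time divided by optimized time.
speedups = {label: slow / fast for label, (fast, slow) in TIMINGS.items()}

for label in sorted(speedups):
    print(f"{label}: {speedups[label]:.1f}x")

best = max(speedups, key=speedups.get)
print(f"largest speed-up: row {best}")
```

On these figures the gain varies from row to row, with the largest factor on the row that has the shortest optimized run.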



Future developments
Near-term and longer-term



Future developments
Near-term:
Implementation of the Web Service interface

Implementation of the (free) Web interface

Implementation of the network interface

Scientific dissemination

Other:
Introduction of thresholds to improve performance

Release of the source code under an open-source license



Demo session
