CUSTOMER INNOVATION TEAM

Triples DB CAS Consumer


User's Guide

Version 3.3
VERSION: 3.3 TRIPLES DB CAS CONSUMER
DATE: 8 DECEMBER 2021 USER'S GUIDE

Revision History
Date               Version  Description                        Author
30 July 2009       0.1      Initial version                    Vesela Grigorova
03 August 2009     0.2      Reviewed                           Maya Malakova
04 August 2009     0.3      Updated (DB2 added)                Vesela Grigorova
04 August 2009     0.4      Reviewed                           Elena Stankova
04 August 2009     1.0      Reviewed                           Maya Malakova
07 August 2009     1.1      Updated                            Maya Malakova
12 August 2009     1.2      Updated                            Vesela Grigorova
12 August 2009     2.0      Reviewed                           Maya Malakova
01 September 2009  2.1      Updated                            Vesela Grigorova
02 September 2009  2.2      Updated                            Elena Stankova
02 September 2009  2.3      Reviewed                           Maya Malakova
28 October 2009    2.4      Updated – feature relations added  Maya Malakova
2 November 2009    2.5      Updated – Oracle support           Maya Malakova
9 November 2009    2.6      Updated – feature types support    Maya Malakova
17 December 2009   2.7      Updated – logging configuration    Maya Malakova
24 September 2010  2.8      Updated – MSSQL support            Blagoy Borisov
04 October 2010    2.9      Updated – Oracle support           Blagoy Borisov
07 October 2010    3.0      Updated                            Blagoy Borisov
13 October 2010    3.1      Updated                            Blagoy Borisov
27 October 2010    3.2      Updated – relationType added       Blagoy Borisov
18 November 2010   3.3      Updated                            Blagoy Borisov

IBM CONFIDENTIAL  CUSTOMER INNOVATION TEAM, 2011 PAGE 2 OF 15


Table of Contents
1. INTRODUCTION
2. PROJECT FEATURES AND FUNCTIONALITY
3. ABBREVIATIONS AND ACRONYMS
4. SOFTWARE PREREQUISITES
5. USING TRIPLES DB CAS CONSUMER
  5.1 DATABASE CONFIGURATION
    5.1.1 DB2 Configuration
      5.1.1.1 Creating Database and Tables
      5.1.1.2 Deploying the Stored Procedure
      5.1.1.3 Configuration File
      5.1.1.4 JDBC Driver
    5.1.2 MSSQL Configuration
      5.1.2.1 Creating Database and Tables
      5.1.2.2 Deploying the Stored Procedure
      5.1.2.3 Configuration File
      5.1.2.4 JDBC Driver
    5.1.3 Oracle Configuration
      5.1.3.1 Creating Schema and Tables
      5.1.3.2 Deploying the Stored Procedure
      5.1.3.3 Configuration File
      5.1.3.4 JDBC Driver
  5.2 UIMA ANNOTATION TYPES CONFIGURATION
  5.3 BAT FILE OVERVIEW
  5.4 LOGGING CONFIGURATION
  5.5 CPE CONFIGURATION
  5.6 CPE USAGE
  5.7 VIEWING THE PROCESSING RESULTS
6. APPENDIX: USING THE TRIPLES DB CAS CONSUMER WITH ORACLE 10G ENTERPRISE EDITION
  6.1 CREATE SCHEMA AND GRANT SYSTEM PRIVILEGES
  6.2 CREATE TRIPLE TABLES
  6.3 DEPLOY STORED PROCEDURES
  6.4 CONFIGURATION FILE

1. Introduction
This document describes the steps required to set up and use the Triples DB CAS Consumer.

2. Project Features and Functionality


The goal of the Triples DB CAS Consumer project is to repackage part of the functionality of the
pre-existing TestHarness lwets application as a separate, easily reusable CAS consumer.
The Triples DB CAS Consumer component extracts relationships between annotations in the form
of head-body-association triples and stores them in a database. Sample database configuration settings
for DB2, MSSQL, and Oracle are provided in the project.
Analysis engine and reader components are provided as part of the project so that a CPE can be
assembled to demonstrate the abilities of the Triples DB CAS Consumer component.

3. Abbreviations and Acronyms


• UIMA – Unstructured Information Management Architecture
• CAS – Common Analysis Structure
• CPE – Collection Processing Engine
• PEAR – Processing Engine Archive

4. Software Prerequisites
• Microsoft Windows XP SP2
• Java 1.5 or higher
• DB2, Microsoft SQL Server, or Oracle

5. Using Triples DB CAS Consumer


The project has three main parts:
• A UIMA collection reader: FileSystemCollectionReader
• A CAS consumer: TriplesDbConsumer
• A UIMA annotator installed as a PEAR: celebrity_crimes

5.1 Database Configuration


The database that will be used must be initialized, and the parameters for connecting to it must
be specified in a properties file. Sample initialization and configuration files for DB2, MSSQL, and
Oracle are provided in the project.

5.1.1 DB2 Configuration

5.1.1.1 Creating Database and Tables


Before using the consumer, create the database and tables that it will populate with the
extracted information.

To create the database and tables for storing the extracted triples:
• Open a DB2 Command Window.
• Create a new database (named TRIPLES in the sample configuration) by executing the following
  command:
    db2 create db <databaseName>
• Connect to the database by executing the following command:
    db2 connect to <databaseName> user <username> using <password>
• Navigate to the directory <install_package>/bin/db/db2, which contains the starSchemaDB2.sql file.
• Execute the following command:
    db2 -tf starSchemaDB2.sql
• Check for errors. You may receive several warnings saying that the statistics are inconsistent;
  these can be ignored.

5.1.1.2 Deploying the Stored Procedure


The Triples DB CAS Consumer uses a stored procedure which populates the triples data into several
interconnected tables. This stored procedure needs to be deployed to the DB2 database.
In order to do this:
• Open a DB2 Command Window.
• Connect to the database with the star schema:
    db2 connect to <dbName> user <username> using <password>
• Navigate to the directory <install_package>/bin/db/db2, which contains the
  deployStoredProcedureDB2.sql script.
• Type the following command:
    db2 -td@ -f deployStoredProcedureDB2.sql

5.1.1.3 Configuration File


To connect to the database that will store the triples, the following parameters need to be
specified in a properties file:
• dbDriver – full name of the database driver class for connecting to the database;
• dbUrl – URL of the database;
• dbUsername – username for connecting to the database;
• dbPassword – password for connecting to the database.
The location of this properties file is defined in the CAS consumer parameters.

A sample DB2 properties file is provided with the installation package:


<install_package>/bin/consumer/config/db2_TripleStore.properties

Note: If your database has a name other than TRIPLES, you need to change the name of the database
in the configuration file.
Note: You need to change the configuration file to match your database admin username and password.

dbDriver=com.ibm.db2.jcc.DB2Driver
dbUrl=jdbc:db2://localhost:50000/triples
dbUsername=db2admin
dbPassword=db2admin
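
Before starting the CPE it can be useful to confirm that the properties file defines all four connection parameters. The following is a minimal sketch in Python, not part of the installation package: the function names parse_properties and missing_keys are hypothetical, and the parser handles only plain key=value lines (no line continuations or escape sequences).

```python
# Hypothetical helper: check that a TripleStore properties file defines the
# four connection parameters described in this guide.
REQUIRED_KEYS = ("dbDriver", "dbUrl", "dbUsername", "dbPassword")


def parse_properties(text):
    """Parse plain key=value lines, skipping blank lines and '#' or '!' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        # Split on the first '=' only, so values such as JDBC URLs may contain '='.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


SAMPLE = """\
dbDriver=com.ibm.db2.jcc.DB2Driver
dbUrl=jdbc:db2://localhost:50000/triples
dbUsername=db2admin
dbPassword=db2admin
"""

print(missing_keys(parse_properties(SAMPLE)))  # prints []
```

To check a real file, read its content and pass the text to parse_properties; a non-empty result lists the parameters that still need to be filled in.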

5.1.1.4 JDBC Driver


The Triples DB CAS Consumer requires the DB2 JDBC driver to be available in the classpath. To add the
DB2 JDBC driver to the CPE classpath, perform the following steps:
• Copy <db2_dir>/java/db2jcc.jar to <install_package>/bin/consumer/lib
• Copy <db2_dir>/java/db2jcc_license_cu.jar to <install_package>/bin/consumer/lib

5.1.2 MSSQL Configuration

5.1.2.1 Creating Database and Tables


To create the database and tables for storing the extracted triples:
• Open a console.
• Create a new database (named TRIPLES in the sample configuration) by executing the following
  command:
    sqlcmd -U <username> -P <password> -Q "create database <databaseName>"
• Create the tables by importing the starSchema.sql file from <install_package>/bin/db/mssql:
    sqlcmd -U <username> -P <password> -d <databaseName> -i <install_package>/bin/db/mssql/starSchema.sql

5.1.2.2 Deploying the Stored Procedure


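The Triples DB CAS Consumer uses a stored procedure which populates the triples data into several
interconnected tables. This stored procedure needs to be deployed to the MSSQL database.
In order to do this:
• Open a console.
• Deploy the stored procedure by executing the following command:
    sqlcmd -U <username> -P <password> -d <databaseName> -i <install_package>/bin/db/mssql/deployStoredProcedure.sql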

5.1.2.3 Configuration File


To connect to the database that will store the triples, the following parameters need to be
specified in a properties file:
• dbDriver – full name of the database driver class for connecting to the database;
• dbUrl – URL of the database;
• dbUsername – username for connecting to the database;
• dbPassword – password for connecting to the database.
The location of this properties file is defined in the CAS consumer parameters.

A sample MSSQL properties file is provided with the installation package:


<install_package>/bin/consumer/config/mssql_TripleStore.properties

Note: If your database has a name other than TRIPLES, you need to change the name of the database
in the configuration file.

Note: You need to change the configuration file to match your database admin username and password.

dbDriver=com.microsoft.sqlserver.jdbc.SQLServerDriver
dbUrl=jdbc:sqlserver://127.0.0.1:1433;databaseName=triples;selectMethod=cursor
dbUsername=admin
dbPassword=admin

5.1.2.4 JDBC Driver


The Triples DB CAS Consumer requires the MSSQL JDBC driver to be available in the classpath. To add
the MSSQL JDBC driver to the CPE classpath, perform the following steps:
• Download the latest Microsoft SQL Server JDBC Driver from:
  http://msdn.microsoft.com/data/jdbc
• Extract the downloaded archive.
• Find the file sqljdbc4.jar inside the extracted folder.
• Copy sqljdbc4.jar to <install_package>/bin/consumer/lib

5.1.3 Oracle Configuration

5.1.3.1 Creating Schema and Tables


Before using the consumer, create the schema and tables that it will populate with the extracted
information.

To create the schema and tables for storing the extracted triples:
• Open a console.
• Start SQL*Plus by executing the following command:
    sqlplus system/<syspassword>@<oracle_sid>
  where <syspassword> is the password for the system account and <oracle_sid> is the Oracle
  System ID (usually orcl).
• Create a permanent tablespace by executing the following command:
    create tablespace <ptname> datafile '<ptpath>' size <ptsize> autoextend on;
  where <ptname> is the name of the permanent tablespace, <ptpath> is the path of the
  tablespace file (for example 'C:/triples.dbf'), and <ptsize> is the size of the tablespace (for
  example 5m for 5 megabytes, 1g for 1 gigabyte, etc.).
• Create a temporary tablespace by executing the following command:
    create temporary tablespace <ttname> tempfile '<ttpath>' size <ttsize> autoextend on;
  where <ttname> is the name of the temporary tablespace, <ttpath> is the path of the
  tablespace file (for example 'C:/triples_temp.dbf'), and <ttsize> is the size of the tablespace
  (for example 5m for 5 megabytes, 1g for 1 gigabyte, etc.).
• Create the schema (the user) by executing the following command:
    create user <username> identified by <password> default tablespace <ptname> temporary tablespace <ttname>;
• Grant the user privileges to connect and create tables:
    grant connect, resource to <username>;
• Exit SQL*Plus by executing the exit command:
    exit
• From the console, navigate to <install_package>/bin/db/oracle
• Start SQL*Plus with the newly created username by executing the following command:
    sqlplus <username>/<password>@<oracle_sid>
  where <username> is the new username, <password> is the password, and <oracle_sid> is the
  Oracle System ID (usually orcl).
• When connected to Oracle, execute the following command to create the tables:
    @starSchemaOracle.sql;
• Check for errors.

5.1.3.2 Deploying the Stored Procedure


The Triples DB CAS Consumer uses a stored procedure which populates the triples data into several
interconnected tables. This stored procedure needs to be deployed to the Oracle database.
In order to do this:
• If SQL*Plus is not already started, start it with the newly created username by executing the
  following command:
    sqlplus <username>/<password>@<oracle_sid>
  where <username> is the new username, <password> is the password, and <oracle_sid> is the
  Oracle System ID (usually orcl).
• Deploy the stored procedures by executing the following command:
    @deployStoredProcedure.sql;

5.1.3.3 Configuration File


To connect to the database that will store the triples, the following parameters need to be
specified in a properties file:
• dbDriver – full name of the database driver class for connecting to the database;
• dbUrl – URL of the database;
• dbUsername – username for connecting to the database;
• dbPassword – password for connecting to the database.
The location of this properties file is defined in the CAS consumer parameters.
A sample Oracle properties file is provided with the installation package:
<install_package>/bin/consumer/config/oracle_TripleStore.properties

Note: You need to change the configuration file to match your database username, password, and
Oracle System ID.

dbDriver=oracle.jdbc.driver.OracleDriver
dbUrl=jdbc:oracle:thin:@localhost:1521:orcl
dbUsername=triples
dbPassword=triples
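
The dbUrl values in the three sample files follow different patterns per database system. The sketch below, in Python, only restates those patterns as they appear in the samples in this guide; the function names are hypothetical and are not part of the package.

```python
# Hypothetical helpers that reproduce the dbUrl formats shown in the
# sample properties files of this guide.

def db2_url(host, port, database):
    """DB2 format, as in db2_TripleStore.properties."""
    return f"jdbc:db2://{host}:{port}/{database}"


def mssql_url(host, port, database):
    """MSSQL format, as in mssql_TripleStore.properties."""
    return f"jdbc:sqlserver://{host}:{port};databaseName={database};selectMethod=cursor"


def oracle_url(host, port, sid):
    """Oracle thin-driver format, as in oracle_TripleStore.properties."""
    return f"jdbc:oracle:thin:@{host}:{port}:{sid}"


print(db2_url("localhost", 50000, "triples"))
print(mssql_url("127.0.0.1", 1433, "triples"))
print(oracle_url("localhost", 1521, "orcl"))
```

With the arguments shown, the three calls produce the same dbUrl values as the sample files.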

5.1.3.4 JDBC Driver


The Triples DB CAS Consumer requires the Oracle JDBC driver to be available in the classpath. To add
the Oracle JDBC driver to the CPE classpath, perform the following step:

• Copy <oracle_dir>/product/XX.X.X/dbhome_1/jdbc/lib/ojdbcY.jar (where XX.X.X is the Oracle
  version and Y is the version of the driver, for example ojdbc5.jar) to
  <install_package>/bin/consumer/lib

5.2 UIMA Annotation Types Configuration


The mapping file which defines what relationships will be extracted as triples is found at
<install_package>/bin/consumer/config/TriplesMapping.xml. It has the following format:

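[Diagram: a triplesMapping root contains any number of annotation elements (attributes: name,
extractSentence, extractParagraph); each annotation contains any number of feature elements
(attributes: name, type) and relation elements (attributes: from, to, fromType, toType).]

Figure 1: Triples mapping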

It contains a set of annotation elements. Each of them has the following attributes:
• name – the name of an annotation type that will be extracted as the head of a triple;
• extractSentence – a flag that defines whether IN_SAME_SENTENCE relationships with annotations
  of the given type will be extracted. The default value is “false”;
  Note: IN_SAME_SENTENCE is a type of relationship between a Sentence annotation and an
  annotation that is contained inside its text.
• extractParagraph – a flag that defines whether IN_SAME_PARAGRAPH relationships with
  annotations of the given type will be extracted. The default value is “false”.
  Note: IN_SAME_PARAGRAPH is a type of relationship between a Paragraph annotation and
  an annotation that is contained inside the text of the paragraph.

The annotation elements can contain multiple feature sub-elements. Each of them names a feature
that will be used to extract AGGREGATE_FEATURE relationships, that is, relationships between an
annotation and the value of the feature with the given name. Each feature can have the following
attributes:
• name – the name of the feature. It can be a simple name such as “crimeDetails” or a feature
  path: a sequence of feature names separated by “:”, for example “crimeDetails:sentence”. Feature
  paths allow defining relationships not only with the direct child features of an annotation but
  also with features of their child features, and so on.
• type – an optional attribute that defines the name of a type that will be associated with
  the feature. If it is not specified, the consumer tries to retrieve the UIMA type of the
  feature; if the feature has no type (meaning it is a literal), the type associated with it is
  Text.

The annotation elements can also contain multiple relation elements. Each relation element defines
a relationship between two features of an annotation type and has the following attributes:
• from – specifies the feature path of the first feature;
• to – specifies the feature path of the second feature;
• fromType – optional attribute that specifies the name of the type that will be associated with
  the first feature;
• toType – optional attribute that specifies the name of the type that will be associated with
  the second feature;
• relationType – optional attribute that specifies the name of the relation between the first and
  second features.
By default, such relationships are associated with the FEATURE_RELATION association type.

Sample mapping documents are given below:

o With simple feature name


<?xml version="1.0" encoding="UTF-8"?>
<triples>
  <annotation name="com.ibm.langware.CelebrityCrime" extractSentence="true" extractParagraph="true">
    <feature name="crimeDetails"/>
    <feature name="name" type="com.ibm.langware.PersonName"/>
    <relation from="crimeDetails" fromType="com.ibm.langware.Details" to="name"/>
    <relation from="crimeDetails" to="realName" relationType="CRIMEDETAILS_REALNAME_RELATION"/>
  </annotation>
  <annotation name="com.ibm.langware.Person" extractSentence="true">
  </annotation>
</triples>

It defines that for the com.ibm.langware.CelebrityCrime annotation type the pipeline will extract
IN_SAME_SENTENCE and IN_SAME_PARAGRAPH relationships, AGGREGATE_FEATURE
relationships for the crimeDetails and name features, FEATURE_RELATION relationships between
the values of the crimeDetails and name features, and CRIMEDETAILS_REALNAME_RELATION
relationships between the values of the crimeDetails and realName features. The type associated
with the name feature will be com.ibm.langware.PersonName, and the type associated with the
crimeDetails feature in the declared feature relations will be com.ibm.langware.Details. For the
com.ibm.langware.Person annotation type, only IN_SAME_SENTENCE relationships will be extracted.

o With complex feature name


<?xml version="1.0" encoding="UTF-8"?>
<triples>
  <annotation name="com.ibm.langware.CelebrityCrime" extractSentence="true" extractParagraph="true">
    <feature name="crimeDetails:sentence"/>
    <feature name="name"/>
    <relation from="crimeDetails:sentence" to="name"/>
  </annotation>
</triples>

It defines that for the com.ibm.langware.CelebrityCrime annotation type the pipeline will extract
IN_SAME_SENTENCE and IN_SAME_PARAGRAPH relationships, AGGREGATE_FEATURE
relationships for the crimeDetails:sentence and name features and also FEATURE_RELATION
relationships between the values of the crimeDetails:sentence and name features.
Note: If another annotator is used, the mapping file has to be edited.
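
When editing the mapping file for another annotator, it can help to list which association types a given mapping declares. The following Python sketch parses a mapping document with the standard library and prints one line per declared rule. It reflects only the rules described in this section; the function name describe_mapping and the output layout are hypothetical, and the consumer's actual parsing logic is not shown in this guide.

```python
import xml.etree.ElementTree as ET

# First sample mapping from this section (XML declaration omitted).
MAPPING = """<triples>
  <annotation name="com.ibm.langware.CelebrityCrime" extractSentence="true" extractParagraph="true">
    <feature name="crimeDetails"/>
    <feature name="name" type="com.ibm.langware.PersonName"/>
    <relation from="crimeDetails" fromType="com.ibm.langware.Details" to="name"/>
    <relation from="crimeDetails" to="realName" relationType="CRIMEDETAILS_REALNAME_RELATION"/>
  </annotation>
  <annotation name="com.ibm.langware.Person" extractSentence="true">
  </annotation>
</triples>"""


def describe_mapping(xml_text):
    """Return (annotation type, association type, target) tuples declared by a mapping."""
    rules = []
    for ann in ET.fromstring(xml_text).findall("annotation"):
        head = ann.get("name")
        # Both flags default to "false" when the attribute is absent.
        if ann.get("extractSentence", "false") == "true":
            rules.append((head, "IN_SAME_SENTENCE", "Sentence"))
        if ann.get("extractParagraph", "false") == "true":
            rules.append((head, "IN_SAME_PARAGRAPH", "Paragraph"))
        for feat in ann.findall("feature"):
            rules.append((head, "AGGREGATE_FEATURE", feat.get("name")))
        for rel in ann.findall("relation"):
            # Assumption: relationType, when present, replaces FEATURE_RELATION.
            assoc = rel.get("relationType", "FEATURE_RELATION")
            pair = f'{rel.get("from").strip()} -> {rel.get("to").strip()}'
            rules.append((head, assoc, pair))
    return rules


for rule in describe_mapping(MAPPING):
    print(rule)
```

For the first sample mapping this lists seven rules: six for com.ibm.langware.CelebrityCrime and one IN_SAME_SENTENCE rule for com.ibm.langware.Person.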

5.3 Bat File Overview


The project can be executed via a CPE Configurator. The cpeGUI.bat file contains commands which
start a CPE Configurator GUI tool and initialize project specific variables and class path values.
Note: In order to work successfully with the GUI tool, the hierarchy of the installation package must
not be changed. If reordering the package folder is necessary, the cpeGUI.bat file must be edited to
match the new project structure.

5.4 Logging Configuration


The logging configuration for the project is contained in the
<install_package>/bin/consumer/config/Logging.properties file. It defines the location of the file in
which logs will be stored and the log detail level. The default log detail level in this file
is WARNING. To use a different level of log detail, change the following line in the configuration
file:

java.util.logging.FileHandler.level = WARNING
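
The level value must be one of the standard java.util.logging level names (OFF, SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL). As a minimal sketch, the Python function below rewrites the level line in the text of a logging properties file; the function name set_log_level is hypothetical, and the sample text is a one-line illustration rather than the content of the shipped Logging.properties file.

```python
import re

# Standard java.util.logging level names.
VALID_LEVELS = {"OFF", "SEVERE", "WARNING", "INFO", "CONFIG",
                "FINE", "FINER", "FINEST", "ALL"}
LEVEL_KEY = "java.util.logging.FileHandler.level"


def set_log_level(properties_text, level):
    """Return properties_text with the FileHandler level line set to `level`."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    # Match the whole line for the key, keeping the text up to and including '='.
    pattern = re.compile(
        r"^([ \t]*" + re.escape(LEVEL_KEY) + r"[ \t]*=[ \t]*).*$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + level, properties_text)
    if count == 0:
        raise KeyError(LEVEL_KEY)
    return new_text


sample = "java.util.logging.FileHandler.level = WARNING\n"
print(set_log_level(sample, "FINE"), end="")
```

Writing the returned text back to the file is left out, so the sketch never modifies anything on disk.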

5.5 CPE Configuration


Start the CPE Configurator by running the cpeGUI.bat file, or type <install_package>/bin/cpeGUI.bat
in the Windows Start → Run… dialog. A dialog will open:

Figure 2: Empty CPE Configurator

Note: Choose File → Clear All to remove any configuration remaining from previous runs of the CPE.

Figure 3: Clear previous CPE configuration

Set the values of the CPE components using the following steps:
• Collection Reader
o Descriptor – browse to the location of the descriptor file of the reader component:
<install_package>/bin/reader/desc/FileSystemCollectionReader.xml

After the file is selected, three additional fields appear:

Figure 4: Selecting a Collection Reader

o Input Directory – browse to the location of the directory containing the input files. Sample
data files are provided with the package in the <install_package>/bin/reader/data folder.
o Language – keep the field blank.
o Encoding – keep the field blank.

• Analysis Engines
Choose the Add… button. In the dialog that opens, browse to the location of the XML file of
the annotator component:
<annotator_dir>/com.ibm.dltj.ruleannotator/com.ibm.dltj.ruleannotator_pear.xml
Note: <annotator_dir> is the folder where the Celebrity Crimes annotator is installed.
The annotator appears as a tab in the Analysis Engines section.

Figure 5: Selecting an Annotator

• CAS Consumers
Choose the Add… button. In the dialog that opens, browse to the location of the descriptor
file of the consumer component:
<install_package>/bin/consumer/desc/TriplesDbConsumer.xml
The consumer will appear as a tab in the CAS Consumers section.

Figure 6: Selecting a CAS Consumer

The CAS Consumer has the following parameters:


o Database Properties File – the absolute path of the properties file that describes the connection
to the database that will be used for storing the triples. Sample files for the different DB
systems are provided with the package in the <install_package>/bin/consumer/config/
folder.
o Document Type – keep the default value.
o Crawler – the name of the crawler application that retrieved the input documents, for
example Omnifind.
o Mapping File – browse to the location of the mapping file that defines what triples will be
extracted. A sample mapping file is provided in the package at
<install_package>/bin/consumer/config/TriplesMapping.xml.
o Uri Feature – keep the default value.
o Document Source – a description of the source of the document, for example Local File
System.
o Store Doc Stats – whether to store in the database additional information about the files that
are parsed and the triples found in them. The information that is stored is:
crawler, crawled file, number of triples found, and parse date.

5.6 CPE Usage


NB: Before running the CPE, delete the content of all database tables that are used in the scenario.
When the configuration is done, choose the Start button on the Collection Processing Engine
Configurator to start the analysis.

Figure 7: Running the CPE

When the Stop button is chosen, a Performance Report window appears and displays information about
the document processing.

5.7 Viewing the Processing Results


The triples created by the Triples DB CAS Consumer are stored in database tables. The doc_table
table contains the input files:

Figure 8: Results in the doc_table table

The triple_store table shows the object annotations created for the input files’ content:

Figure 9: Results in the triple_store table

Note: If the annotation has the IN_SAME_SENTENCE association type, the BODY column of the
triple_store table contains a number which points to a row in the doc_sentences table. This table
stores the sentence in which the annotation text is found.

Figure 10: Results in the doc_sentences table
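
The pointer from the BODY column to the doc_sentences table can be resolved with a join. The Python sketch below demonstrates the idea on an in-memory SQLite database. Only the table names triple_store and doc_sentences, the BODY column, and the IN_SAME_SENTENCE association type come from this guide; the columns HEAD, ASSOCIATION, ID, and SENTENCE and the sample rows are hypothetical placeholders, so the real star schema should be checked before adapting the query.

```python
import sqlite3

# Assumed column names: HEAD, ASSOCIATION, ID, SENTENCE are placeholders.
QUERY = """
    SELECT t.HEAD, s.SENTENCE
    FROM triple_store AS t
    JOIN doc_sentences AS s ON s.ID = CAST(t.BODY AS INTEGER)
    WHERE t.ASSOCIATION = 'IN_SAME_SENTENCE'
    ORDER BY s.ID
"""


def sentences_for_annotations(conn):
    """Return (head, sentence) pairs for IN_SAME_SENTENCE triples."""
    return conn.execute(QUERY).fetchall()


# Illustrative in-memory data, not taken from a real run of the CPE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_sentences (ID INTEGER PRIMARY KEY, SENTENCE TEXT)")
conn.execute("CREATE TABLE triple_store (HEAD TEXT, ASSOCIATION TEXT, BODY TEXT)")
conn.execute("INSERT INTO doc_sentences VALUES (1, 'Example sentence mentioning a person.')")
conn.executemany(
    "INSERT INTO triple_store VALUES (?, ?, ?)",
    [("person", "IN_SAME_SENTENCE", "1"),
     ("person", "AGGREGATE_FEATURE", "details")],
)

rows = sentences_for_annotations(conn)
print(rows)
```

Only the IN_SAME_SENTENCE row is returned, because for other association types the BODY column does not point to a sentence row.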
