
TREX Architecture and functionality

Search Engine for SAP

TREX is the one search technology in SAP solutions.
TREX is deployed in over a dozen SAP products.
TREX searches and analyses both unstructured documents and structured business data.
In knowledge management, TREX provides search access to an extensible number of document repositories.
TREX will provide the backend technology for Enterprise Search.

TREX Architecture

SAP AG 2006, Title of Presentation / Speaker Name / 3

TREX Anatomy

TREX provides several client options:


Java client for communication via HTTP/XML in SAP EP
ABAP client for communication via RFC or ICM in SAP landscape
C++ and Python clients for internal calls and development

Inside TREX there are four main services:


Name server: manages the TREX landscape, allocates TREX services
Index server: indexing and retrieval
- Text-mining engine for classification and similarity search
- Text search engine for searching and indexing unstructured text
- Attribute/BIA engine for searching and indexing structured data
Queue server: manages asynchronous indexing
Preprocessor: document retrieval, filtering, linguistic processing


Name Server

TREX Name Server


Monitors the landscape (for high availability)
Maintains a list of all services and their status
Is called whenever one service seeks another
Distributes load

Example
When a service sends the name server the request GetServer (IndexServer, SearchMode, MyIndex), the name server answers with the address <host>:<port> of the index server to which the request should be sent.


Name Server: Initialization Files

The most important .ini files are:

topology.ini
Read by all name servers
Contains all index-relevant information
- To edit the file, use the TREX standalone admin tool

sapprofile.ini
Read by all TREX services and clients
Specifies:
- Port number of local name server
- Host and port numbers of all master name servers
- Amount of shared memory used by topology.ini data
- System ID
- Path information to where each service saves its data


Queue Server

TREX Queue Server


Collects indexing requests
- Sends them to the index server

Enables asynchronous indexing


- Scheduled
- Event triggered

Includes scheduler for replication


- Replication runs on index server

Stores snapshots for replication


Preprocessor 1

TREX Preprocessor
Delivers documents that the engines can use directly
Supports almost any data type
Gets documents via HTTP from source
Converts documents to HTML
Keeps the document structure

Extracts attributes
- Metadata from DOC, PDF, ...
- Names from a lexicon
- Application-specific attributes

Diagram: source files (.zip, .ppt, .pdf, .doc, and other types) are converted to HTML

Performs linguistic processing


- Tokenization
- Stemming
- Tagging (using third party products)


Preprocessor 2

TREX Preprocessor
Reduces workload on the other engines
Works independently of the indexes
Is stateless

Diagram labels: Java Client, ABAP Client, Index Server, Preprocessor, Name Server Client, HTTP Client, HTML Filter, Lexicon, Highlighting, Python Extensions


How Search Works: An Example

BooksOnline, an online bookstore, offers a range of books with the special feature that a customer can search the full text of the books online before purchase.

Auditor Jane wants to buy a book about invoice verification and decides to evaluate the suggestions offered by the BooksOnline search service.

The following slides describe how the SAP NetWeaver search service used by BooksOnline answers her search request.


Search Example 1
Jane enters invoice verification in the BooksOnline search field in the Web browser on her office desktop PC.
The business application forwards her search request, together with information about the kind of search and which index to use, as an HTTP/XML packet via the Java client to the Web server.

Diagram: the Java client, the Web server, and the TREX services (name server, preprocessor, queue server, and index server with its text mining, attribute, and text search engines and their indexes)
Diagram callout: Do a phrase search for invoice verification in the BooksOnline index

Search Example 2
The Web server converts the HTTP message into the format used inside TREX and asks the name server for the name and address of a service to handle the request.
The name server checks its list of available servers and tells the Web server the address of an index server that has received the fewest calls so far and can handle the request.

Diagram callouts:
- Web server: Where can I send this request?
- Name server: Send it to Index Server 1
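The selection rule in Search Example 2 (pick an available index server that has received the fewest calls so far) can be sketched as follows. This is a minimal illustration only, not the TREX implementation: the `IndexServer` record, the `get_server` function, and the host and port values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IndexServer:
    address: str         # "<host>:<port>" style address (made-up values below)
    active: bool = True  # status as tracked by the name server
    calls: int = 0       # number of requests routed to this server so far


def get_server(servers):
    """Return the address of the active server with the fewest calls."""
    candidates = [s for s in servers if s.active]
    if not candidates:
        raise LookupError("no active index server")
    chosen = min(candidates, key=lambda s: s.calls)
    chosen.calls += 1  # count this request against the chosen server
    return chosen.address


servers = [
    IndexServer("host1:30003", calls=5),
    IndexServer("host2:30003", calls=2),
    IndexServer("host3:30003", active=False),
]
first = get_server(servers)
second = get_server(servers)
```

Inactive servers are filtered out before the comparison, so a server that is down is never returned even if its call count is the lowest.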

Search Example 3
The Web server passes the search request to the index server as a TCP/IP packet.
The index server sees that the request is for a phrase search and therefore forwards the phrase to the preprocessor for language identification, tokenization, tagging, and stemming.

Diagram callouts:
- Do a phrase search for invoice verification in the BooksOnline index
- A phrase search: this means work for the preprocessor!
- The language of the search may be specified in advance

Search Example 4
The preprocessor performs linguistic processing.
It parses the phrase into two words, invoice and verification, tags them as nouns, reduces the words to their stem forms (in this case the words themselves), and sends the result back to the index server.

Diagram callouts:
- Index server: Please preprocess the phrase invoice verification
- Preprocessor: Done - two English nouns in stem form
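The three linguistic steps in Search Example 4 (tokenize, tag, stem) can be illustrated with a toy pipeline. This is a sketch under strong assumptions: the two-entry `LEXICON` stands in for a real tagger, and `stem` only strips a plural "s", which is far simpler than a real stemming algorithm. None of these names come from TREX.

```python
import re

# Toy lexicon standing in for a real part-of-speech tagger (assumption).
LEXICON = {"invoice": "NOUN", "verification": "NOUN"}


def tokenize(phrase):
    """Split a phrase into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", phrase.lower())


def stem(token):
    """Naive plural stripping only; not a real stemming algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(phrase):
    """Return one record per token with its stem and its tag."""
    result = []
    for token in tokenize(phrase):
        term = stem(token)
        result.append({"term": term, "tag": LEXICON.get(term, "UNKNOWN")})
    return result


result = preprocess("invoice verification")
```

For the phrase in the example, both words are already in stem form, so the stems equal the input tokens.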

Search Example 5
The index server sends the preprocessed request to the search engine for optimization and result retrieval.
The query optimizer in the search engine analyzes the query and builds the query tree, which in this case has three nodes: one for each word and one for AND.
It then optimizes the tree based on index statistics so that the term that appears less frequently is evaluated first.

Diagram callouts:
- This is a simple query - just a 2-word phrase
- The index listing for invoice is longer than the index listing for verification, so select verification first
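The ordering step in Search Example 5 can be sketched as a small function that builds an AND node and sorts its children so the rarest term comes first. The dictionary-based tree, the `build_query_tree` name, and the document counts are assumptions made for illustration.

```python
# Hypothetical index statistics: number of documents containing each term.
doc_frequency = {"invoice": 1200, "verification": 300}


def build_query_tree(terms, stats):
    """Build an AND node whose children are ordered rarest term first."""
    ordered = sorted(terms, key=lambda term: stats.get(term, 0))
    return {"op": "AND", "children": ordered}


tree = build_query_tree(["invoice", "verification"], doc_frequency)
```

Evaluating the rarest term first keeps the intermediate result set small, so the second lookup has fewer candidates to check.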

Search Example 6
The search engine finds the row for the term verification in the BooksOnline index and selects the set of books containing the term. It then checks this set of books against the row for the term invoice and selects just the books that contain both terms.
Next, it reads the addresses of the terms in each book, calculates rank values, sorts the results, and takes the top ten (or more).
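The retrieval steps above can be sketched against a small positional index. The layout (term, then book, then a list of word positions) and the sample data are assumptions; the source does not describe how TREX stores its index internally.

```python
# Hypothetical positional index: term -> {book id -> word positions}.
index = {
    "verification": {"book1": [4, 20], "book3": [7]},
    "invoice": {"book1": [3], "book2": [1], "book3": [9]},
}


def and_search(index, terms):
    """Intersect the book sets, starting from the first (rarest) term."""
    books = set(index.get(terms[0], {}))
    for term in terms[1:]:
        books &= set(index.get(term, {}))
    return books


def phrase_match(index, first, second, book):
    """True if `second` occurs directly after `first` somewhere in the book."""
    following = set(index[second][book])
    return any(pos + 1 in following for pos in index[first][book])


both = and_search(index, ["verification", "invoice"])
phrase_hits = sorted(
    book for book in both if phrase_match(index, "invoice", "verification", book)
)
```

Here both book1 and book3 contain the two terms, but only book1 has them at adjacent positions, so only book1 satisfies the phrase.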
Diagram steps:
1. Find set of books with verification
2. Find subset with invoice
3. Find addresses of both terms
Then: Calculate ranks and sort
Diagram callout: The rank of a document for a term is defined by TF*IDF ranking
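TF*IDF has several common variants, and the slides do not say which one TREX uses. The sketch below assumes one textbook form (term frequency relative to document length, multiplied by the natural logarithm of the inverse document frequency) and made-up counts.

```python
import math


def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    """One textbook TF*IDF variant: relative frequency times log inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(total_docs / docs_with_term)
    return tf * idf


# Hypothetical counts for the term "verification" in two books.
scores = {
    "book1": tf_idf(12, 1000, 10000, 100),
    "book3": tf_idf(3, 1000, 10000, 100),
}
top = sorted(scores, key=scores.get, reverse=True)[:10]
```

A term that occurs in every document gets an IDF of zero in this variant, so it contributes nothing to the rank.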

Search Example 7
The search engine reads all the requested attributes for the selected books, including titles, authors, and keys to the documents.
The engine uses the keys to load the document contents and scans the texts for the first occurrences of the search phrase (or linguistic variants of the phrase) to create a brief summary text.

Diagram callouts:
- The preprocessor extracted attributes during indexing
- Scans through the texts to find the first few sentences containing the phrase invoice verification
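The summary step in Search Example 7 can be approximated by cutting a window of text around the first occurrence of the phrase. This sketch matches the exact phrase only (case-insensitively); it does not handle the linguistic variants mentioned above, and `make_snippet` and its `width` parameter are hypothetical.

```python
import re


def make_snippet(text, phrase, width=40):
    """Return the text around the first occurrence of phrase, or None."""
    match = re.search(re.escape(phrase), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return text[start:end].strip()


text = "Chapter 3. Invoice verification is the next step in the process."
snippet = make_snippet(text, "invoice verification", width=15)
```

A production version would cut at sentence or word boundaries rather than at a fixed character offset.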

Search Example 8
The search engine passes the result set back via the index server for merging with results from any other engines (here none).
The index server passes the result set back via the Web server and the Java client to the graphical user interface.
Jane sees a ranked list of books about invoice verification less than a second after she launched the search.

Diagram callout: 73 books found in 0.14 seconds

Search: Results

A sample document from the result set (exact format depends on application settings):

Internal Auditing
by First Author, Second Author
Economic Publishers, New York
Invoice verification is the next step ... The invoice verification in the ...
375 pages
First edition
ISBN 0-3XX-XXXXX-X
Browse full text

Diagram labels: Document attributes, Link to document, Sample phrases with search terms highlighted

Results ranked by frequency of search terms
How many results are returned depends on application settings


How Indexing Works: An Example

BooksOnline worked hard to give Jane such a rewarding search experience.

Before Jane could see a ranked list of books about invoice verification and browse the books, BooksOnline had to index the full texts of all the books.

The following slides describe how the SAP NetWeaver search service used by BooksOnline indexes the full texts of the books on show on its website.


Indexing Example 1
The BooksOnline indexing administrator opens the SAP queue and index administration tool and sends a request to TREX to create an index called BooksOnline.
The ABAP client forwards the index request as a Remote Function Call via the SAP Gateway to the RFC server.

Diagram: the ABAP client, the Gateway, and the TREX services (RFC server, name server, preprocessor, queue server, and index server with its text mining, attribute, and text search engines and their indexes)
Diagram callouts:
- Create an index called BooksOnline
- Indexing can be done just as well via the Java Client

Indexing Example 2
The name server tells the RFC server the address of an index server that can create the index.
In a one-box implementation of TREX, this step is straightforward unless the index server is down for some reason.
The name server uses a round robin procedure to select an index server.

Diagram callouts:
- RFC server: I want to create a new index!
- Name server: So go to <host>:<port>
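A round robin procedure like the one mentioned in Indexing Example 2 can be sketched with a cyclic iterator that skips servers marked as down. The `RoundRobin` class and the addresses are hypothetical and only illustrate the general technique.

```python
from itertools import cycle


class RoundRobin:
    def __init__(self, addresses):
        self._count = len(addresses)
        self._ring = cycle(addresses)
        self.down = set()  # addresses currently marked unavailable

    def next_server(self):
        """Return the next available address in rotation."""
        # Look at each address at most once per call.
        for _ in range(self._count):
            address = next(self._ring)
            if address not in self.down:
                return address
        raise LookupError("no index server available")


rr = RoundRobin(["host1:30003", "host2:30003", "host3:30003"])
rr.down.add("host2:30003")
picks = [rr.next_server() for _ in range(4)]
```

With a single index server (the one-box case), every call returns the same address unless that server is down, in which case the lookup fails.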

Indexing Example 3
The RFC server sends the request to the index server.
The index server creates a new index called BooksOnline.
The new index is still empty, but any documents to be indexed can now be assigned to it.

Diagram callouts:
- RFC server: I want to create a new index called BooksOnline
- Index server: New index created successfully!

Indexing Example 4
The administrator sends a request to index the new books in a specified folder and write the results to the BooksOnline index.
The digital files for the books are in a variety of formats, but TREX can handle all standard formats, such as Microsoft Word (.doc), Adobe Portable Document Format (.pdf), and plain text (.txt).
The name server directs the request to an available queue server.

Diagram callouts:
- Please index all the books in folder <path_to_folder>
- Please put this indexing request in your queue and have the documents indexed as soon as TREX finds the time to do it
- Queueing is an option: indexing can also be done immediately

Indexing Example 5
The queue server receives the list of URLs for the documents from the specified folder and persists them in a queue for the index for as long as required until a preprocessor is available.
Indexing a large collection of documents can be a long job, so the administrator can hold or flush the queue manually at any time.

Diagram: documents of types .htm, .xls, .pdf, .doc, .ppt, and .txt
Diagram callouts:
- Queue server receives document URLs and adds them to the BooksOnline queue for indexing
- BooksOnline has all its books available in digital form (either as author files or scanned and OCR'd) ready for indexing and browsing
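The queueing behaviour in Indexing Example 5 can be sketched as a per-index FIFO queue with hold and flush operations. Several things here are assumptions: the queue lives in memory only (the real queue server persists its entries), "flush" is interpreted as sending everything that is pending right away, and the `IndexQueue` class and example URLs are invented for illustration.

```python
from collections import deque


class IndexQueue:
    def __init__(self, index_name):
        self.index_name = index_name
        self.pending = deque()  # document URLs in arrival order
        self.on_hold = False

    def add(self, urls):
        self.pending.extend(urls)

    def hold(self):
        """Stop handing out documents until the queue is flushed."""
        self.on_hold = True

    def next_document(self):
        """Hand out the next URL unless the queue is on hold or empty."""
        if self.on_hold or not self.pending:
            return None
        return self.pending.popleft()

    def flush(self, send):
        """Release the hold and send every pending URL; return the count."""
        self.on_hold = False
        sent = 0
        while self.pending:
            send(self.pending.popleft())
            sent += 1
        return sent


queue = IndexQueue("BooksOnline")
queue.add(["http://example.com/books/a.pdf", "http://example.com/books/b.doc"])
queue.hold()
held = queue.next_document()  # nothing is handed out while on hold
received = []
count = queue.flush(received.append)
```

The `send` callback stands in for whatever passes a document on to a free preprocessor.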

Indexing Example 6
The queue server sends the documents to a free preprocessor.
The preprocessor fetches documents via URLs, filters them from their original format to HTML, identifies their language, tokenizes them into sequences of terms, tags the terms with their part of speech (for example as nouns), and stems the terms as appropriate.
The preprocessed documents are then sent to the index server.

Diagram: documents of types .htm, .xls, .pdf, .doc, .ppt, and .txt pass through the preprocessor and reach the index server as HTML
Diagram callout: A lot of work for the preprocessor

Indexing Example 7
The index server forwards the documents to the search engine.
For each document, the search engine writes a list of all its terms, and for each term it writes a list of positions in the document where the term appears.
The engine merges the term list for each document into the existing term-document matrix that forms the BooksOnline index.

Diagram callout: Indexing data merged into existing matrix
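The indexing step in Indexing Example 7 can be sketched as building a per-document list of term positions and merging it into a shared structure. The nested dictionary used here (term, then document, then positions) is an assumed stand-in for the term-document matrix; the function names are hypothetical.

```python
from collections import defaultdict


def term_positions(tokens):
    """Map each term of one document to the positions where it appears."""
    positions = defaultdict(list)
    for pos, term in enumerate(tokens):
        positions[term].append(pos)
    return dict(positions)


def merge_document(index, doc_id, tokens):
    """Merge one document's term list into the existing index in place."""
    for term, positions in term_positions(tokens).items():
        index.setdefault(term, {})[doc_id] = positions


index = {}
merge_document(index, "book1", ["invoice", "verification", "invoice"])
merge_document(index, "book2", ["verification"])
```

Merging a document under an existing id replaces its earlier positions for the terms it still contains; removing stale terms on re-indexing is left out of this sketch.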

Indexing Example 8
The BooksOnline indexing administrator can use the TREX queue and index administration tool to display the status of the indexing process at any time during the process

Diagram callout: The tool lets you follow the progress of queued documents from left to right

TREX Administration Tools

The TREX administration tool is the place to:


Set up and configure a distributed landscape
Monitor and administer services, indexes, queues, replication, ...
Show trace files, configuration files, version info, ...

There are three flavors:

Standalone
- Richest feature set
- Requires full access to TREX host

ABAP
- Restricted feature set
- Easy access on customer systems

Java
- Highly restricted feature set
- Browser access via Portal


Landscape Example

Alert area


TREX Traces
Trace files are available under the following TREX path:
cd /usr/sap/SID/TRX02/ldtr01<sid>/trace


THANK YOU
