Appendix

Features and Capacity of the Lucene Search Component


Apache Lucene - Features
• Features
• Scalable, High-Performance Indexing
• Powerful, Accurate and Efficient Search Algorithms
• Cross-Platform Solution

Features
Lucene offers powerful features through a simple API:

Scalable, High-Performance Indexing


• over 20MB/minute on Pentium M 1.5GHz
• small RAM requirements -- only 1MB heap
• incremental indexing as fast as batch indexing
• index size roughly 20-30% the size of text indexed

Powerful, Accurate and Efficient Search Algorithms


• ranked searching -- best results returned first
• many powerful query types: phrase queries, wildcard queries, proximity queries,
range queries and more
• fielded searching (e.g., title, author, contents)
• date-range searching
• sorting by any field
• multiple-index searching with merged results
• allows simultaneous update and searching
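As a concrete illustration of the API these feature lists describe, here is a minimal index-and-search round trip. This is a hedged sketch against the Lucene 1.x API in use at the time of the benchmarks below (Field.Text, QueryParser.parse and Hits; later Lucene versions replaced all three). The field names and text are illustrative.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class LuceneSketch {
    // Index a single document in memory, then search it; returns the hit count.
    public static int indexAndSearch() {
        try {
            RAMDirectory dir = new RAMDirectory();
            // 'true' = create a new index rather than appending to an existing one
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Text("title", "Apache Lucene"));   // fielded searching
            doc.add(Field.Text("contents", "a high-performance, full-featured search library"));
            writer.addDocument(doc);   // incremental adds are as fast as batch adds
            writer.optimize();         // merge segments into one for faster searching
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            Query query = QueryParser.parse("search", "contents", new StandardAnalyzer());
            Hits hits = searcher.search(query);   // ranked: best results first
            int n = hits.length();
            searcher.close();
            return n;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

For an on-disk index, as in the benchmarks below, an FSDirectory would replace the RAMDirectory.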

Cross-Platform Solution
• Available as Open Source software under the Apache License which lets you use
Lucene in both commercial and Open Source programs
• 100%-pure Java
• Implementations in other programming languages are available that are index-compatible
Apache Lucene - Resources - Performance
Benchmarks
• Performance Benchmarks
• Benchmark Variables
• User-submitted Benchmarks
o Hamish Carpenter's benchmarks
o Justin Greene's benchmarks
o Daniel Armbrust's benchmarks
o Geoffrey Peddle's benchmarks

Performance Benchmarks
The purpose of these user-submitted performance figures is to give current and potential users of Lucene a sense of how well Lucene scales. If the requirements for an upcoming project are similar to an existing benchmark, you will also have something to work with when designing the system architecture for the application.

If you've conducted performance tests with Lucene, we'd appreciate it if you could submit these figures for display on this page. Post these figures to the lucene-user mailing list using this template.

Benchmark Variables
Hardware Environment

• Dedicated machine for indexing: Self-explanatory (yes/no)
• CPU: Self-explanatory (Type, Speed and Quantity)
• RAM: Self-explanatory
• Drive configuration: Self-explanatory (IDE, SCSI, RAID-1, RAID-5)
Software environment

• Lucene Version: Self-explanatory
• Java Version: Version of Java SDK/JRE that is run
• Java VM: Server/client VM, Sun VM/JRockIt
• OS Version: Self-explanatory
• Location of index: Is the index stored in a filesystem or database? Is it on the same server (local) or over the network?
Lucene indexing variables

• Number of source documents: Number of documents being indexed
• Total filesize of source documents: Self-explanatory
• Average filesize of source documents: Self-explanatory
• Source documents storage location: Where are the documents being indexed
located? Filesystem, DB, http, etc.
• File type of source documents: Types of files being indexed, e.g. HTML files, XML
files, PDF files, etc.
• Parser(s) used, if any: Parsers used for parsing the various files for indexing, e.g.
XML parser, HTML parser, etc.
• Analyzer(s) used: Type of Lucene analyzer used
• Number of fields per document: Number of Fields each Document contains
• Type of fields: Type of each field
• Index persistence: Where the index is stored, e.g. FSDirectory, SqlDirectory, etc.
Figures

• Time taken (in ms/s as an average of at least 3 indexing runs): Time taken to index
all files
• Time taken / 1000 docs indexed: Time taken to index 1000 files
• Memory consumption: Self-explanatory
• Query speed: average time a query takes, type of queries (e.g. simple one-term query,
phrase query), not measuring any overhead outside Lucene
Notes

• Notes: Any comments which don't belong in the above, special tuning/strategies, etc.

User-submitted Benchmarks
These benchmarks have been kindly submitted by Lucene users for reference purposes.

We make NO guarantees regarding their accuracy or validity.

We strongly recommend you conduct your own performance benchmarks before deciding
on a particular hardware/software setup (and hopefully submit these figures to us).

Hamish Carpenter's benchmarks


Hardware Environment

• Dedicated machine for indexing: yes
• CPU: Intel x86 P4 1.5 GHz
• RAM: 512 MB DDR
• Drive configuration: IDE 7200rpm Raid-1
Software environment

• Lucene Version: 1.3
• Java Version: 1.3.1, IBM JITC enabled
• Java VM:
• OS Version: Debian Linux 2.4.18-686
• Location of index: local
Lucene indexing variables

• Number of source documents: Random generator, set to make 1M documents in 2 x 500,000 batches
• Total filesize of source documents: > 1 GB if stored
• Average filesize of source documents: 1 KB
• Source documents storage location: Filesystem
• File type of source documents: Generated
• Parser(s) used, if any:
• Analyzer(s) used: Default
• Number of fields per document: 11
• Type of fields: 1 date, 1 id, 9 text
• Index persistence: FSDirectory
Figures

• Time taken (in ms/s as an average of at least 3 indexing runs):
• Time taken / 1000 docs indexed: 49 seconds
• Memory consumption:
Notes

A Windows client ran a random document generator which created documents based on some arrays of values and an excerpt (approx. 1 KB) from a text file of the Bible (King James Version).
These were submitted via a socket connection (open throughout the indexing process).
The index writer was not closed between index calls.
This created a 400 MB index in 23 files (after optimization).

Query details:

Set up a threaded class to start x simultaneous threads searching the index created above.

Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Teaser:plan*)
(Details:goo* Details:plan*)) -Cancel:y) +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
+EndDate:[mq3dj1uq0-ntlxuggw0]

This query counted 34,000 documents and I limited the returned documents to 5.
This is using Peter Halacsy's IndexSearcherCache, slightly modified to be a singleton that returns cached searchers for a given directory. This solved an initial problem with too many open files exhausting Linux file handles.
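The caching approach described here can be sketched as a singleton that hands all threads the same searcher for a given directory. This is a hypothetical reconstruction, not Peter Halacsy's actual class; a plain String stands in for org.apache.lucene.search.IndexSearcher so the pattern is visible on its own.

```java
import java.util.HashMap;
import java.util.Map;

public class SearcherCache {
    private static SearcherCache instance;
    private final Map<String, Object> searchers = new HashMap<>(); // directory -> searcher
    private int opens = 0;                                         // searchers actually opened

    public static synchronized SearcherCache getInstance() {
        if (instance == null) instance = new SearcherCache();
        return instance;
    }

    // In the real code this would return IndexSearcher and call
    // `new IndexSearcher(directory)` on a cache miss.
    public synchronized Object getSearcher(String directory) {
        Object s = searchers.get(directory);
        if (s == null) {
            opens++;
            s = "searcher:" + directory;   // stand-in for new IndexSearcher(directory)
            searchers.put(directory, s);
        }
        return s;
    }

    public synchronized int openCount() { return opens; }
}
```

Because every query thread receives the same cached searcher, the number of open index files stays constant no matter how many threads run, which is what resolves the file-handle exhaustion described above.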

Threads | Avg time per query (ms)
1       | 1009
2       | 2043
3       | 3087
4       | 4045
...     | ...
10      | 10091

I removed the two date range terms from the query and it made a HUGE difference
in performance. With 4 threads the avg time dropped to 900ms!

Other query optimizations made little difference.

Hamish can be contacted at hamish at catalyst.net.nz.

Justin Greene's benchmarks


Hardware Environment

• Dedicated machine for indexing: No, but nominal usage at time of indexing.
• CPU: Compaq Proliant 1850R/600 2 X pIII 600
• RAM: 1GB, 256MB allocated to JVM.
• Drive configuration: RAID 5 on Fibre Channel Array
Software environment

• Java Version: 1.3.1_06
• Java VM:
• OS Version: Windows NT 4 / SP6
• Location of index: local
Lucene indexing variables

• Number of source documents: about 60K
• Total filesize of source documents: 6.5GB
• Average filesize of source documents: 100K (6.5GB/60K documents)
• Source documents storage location: filesystem on NTFS
• File type of source documents:
• Parser(s) used, if any: Currently the only parser used is the Quiotix html parser.
• Analyzer(s) used: SimpleAnalyzer
• Number of fields per document: 8
• Type of fields: All strings, and all are stored and indexed.
• Index persistence: FSDirectory
Figures

• Time taken (in ms/s as an average of at least 3 indexing runs): 1 hour 12 minutes, 1
hour 14 minutes and 1 hour 17 minutes. Note that the # and size of documents
changes daily.
• Time taken / 1000 docs indexed:
• Memory consumption: JVM is given 256MB and uses it all.
Notes

We have 10 threads reading files from the filesystem, parsing and analyzing them, and then pushing them onto a queue, with a single thread popping them from the queue and indexing. Note that we are indexing email messages and are storing the entire plaintext of the message in the index. If a message contains an attachment and we do not have a filter for the attachment (i.e., we do not do PDFs yet), we discard the data.
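The pipeline described above (many parser threads feeding one indexing thread through a queue) can be sketched as follows. The queue capacity is arbitrary, and a counter stands in for the call to IndexWriter.addDocument, so the sketch runs without Lucene.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class IndexPipeline {
    // Start `parserThreads` producers and one indexing consumer; returns how
    // many documents the single indexer thread processed.
    public static int run(int parserThreads, int docsPerThread) {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger indexed = new AtomicInteger();
        int total = parserThreads * docsPerThread;

        // Single consumer: the IndexWriter of that era was driven from one thread.
        Thread indexer = new Thread(() -> {
            try {
                for (int i = 0; i < total; i++) {
                    Object doc = queue.take();   // pop a parsed document
                    indexed.incrementAndGet();   // stand-in for writer.addDocument(doc)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        // Producers: read, parse and analyze files, then push them onto the queue.
        Thread[] parsers = new Thread[parserThreads];
        for (int t = 0; t < parserThreads; t++) {
            parsers[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < docsPerThread; i++) {
                        queue.put("parsed-doc");   // stand-in for a parsed message
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            parsers[t].start();
        }

        try {
            for (Thread p : parsers) p.join();
            indexer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }
}
```

The bounded queue provides back-pressure: if the single indexing thread falls behind, the parser threads block on put() instead of exhausting memory.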

Justin can be contacted at tvxh-lw4x at spamex.com.

Daniel Armbrust's benchmarks


My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed, nor
was the total index built in one shot. The index was created on several different machines
(all with these specs, or very similar), with each machine indexing batches of 500,000 to 1
million documents per batch. Each of these small indexes was then moved to a much larger
drive, where they were all merged together into a big index. This process was done
manually, over the course of several months, as the sources became available.

Hardware Environment

• Dedicated machine for indexing: no - The machine had moderate to low load.
However, the indexing process was built single threaded, so it only took advantage of
1 of the processors. It usually got 100% of this processor.
• CPU: Sun Ultra 80 4 x 64 bit processors
• RAM: 4 GB Memory
• Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive
Software environment

• Lucene Version: 1.2
• Java Version: 1.3.1
• Java VM:
• OS Version: Sun 5.8 (64 bit)
• Location of index: local
Lucene indexing variables

• Number of source documents: 13,820,517
• Total filesize of source documents: 87.3 GB
• Average filesize of source documents: 6.3 KB
• Source documents storage location: Filesystem
• File type of source documents: XML
• Parser(s) used, if any:
• Analyzer(s) used: A home grown analyzer that simply removes stopwords.
• Number of fields per document: 1 - 31
• Type of fields: All text, though 2 of them are dates (20001205) that we filter on
• Index persistence: FSDirectory
• Index size: 12.5 GB
Figures

• Time taken (in ms/s as an average of at least 3 indexing runs): For 617271
documents, 209698 seconds (or ~2.5 days)
• Time taken / 1000 docs indexed: 340 Seconds
• Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so 1 GB of
memory was allotted to the indexer
Notes

The source documents were XML. The "indexer" opened each document one at a
time, ran an XSL transformation on them, and then proceeded to index the stream.
The indexer optimized the index every 50,000 documents (on this run) though
previously, we optimized every 300,000 documents. The performance didn't change
much either way. We did no other tuning (RAMDirectories, a separate process to pre-transform the source material, etc.) to make it index faster. When all of these individual indexes were built, they were merged together into the main index. That process usually took about a day.
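The reported figures are mutually consistent; a quick arithmetic check (no Lucene involved):

```java
public class ThroughputCheck {
    // Seconds per 1000 documents, derived from the reported totals.
    public static long secondsPer1000Docs(long totalSeconds, long totalDocs) {
        return Math.round(totalSeconds * 1000.0 / totalDocs);
    }

    // Total indexing time expressed in days.
    public static double days(long totalSeconds) {
        return totalSeconds / 86400.0;
    }
}
```

209,698 seconds over 617,271 documents works out to ~340 seconds per 1,000 documents and ~2.4 days, matching the figures above.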

Daniel can be contacted at Armbrust.Daniel at mayo.edu.

Geoffrey Peddle's benchmarks


I'm doing a technical evaluation of search engines for Ariba, an enterprise application
software company. I compared Lucene to a commercial C language based search engine
which I'll refer to as vendor A. Overall Lucene's performance was similar to vendor A and
met our application's requirements. I've summarized our results below.

Search scalability:
We ran a set of 16 queries in a single thread for 20 iterations. We report below the times for the last 15 iterations (i.e., after the system was warmed up). The 4 sets of results below are for indexes with between 50,000 and 600,000 documents. Although the times for Lucene grew faster with document count than Vendor A's, they were comparable.

Documents | Lucene (s) | Vendor A (s)
50K       | 5.2        | 7.2
200K      | 15.3       | 15.2
400K      | 28.2       | 25.5
600K      | 41         | 33

Individual Query times:
Individual Query times:
Total query times are very similar between the 2 systems but there were larger differences
when you looked at individual queries.

For simple queries with small result sets, Vendor A was consistently faster than Lucene. For example, a single query might take Vendor A 32 thousandths of a second and Lucene 64 thousandths of a second. Both times, however, are well within acceptable response times for our application.

For simple queries with large result sets, Vendor A was consistently slower than Lucene. For example, a single query might take Vendor A 300 thousandths of a second and Lucene 200 thousandths of a second. For more complex queries of the form (term1 OR term2 OR term3) AND (term4 OR term5 OR term6) AND (term7 OR term8), the results were more divergent. For queries with small result sets, Vendor A generally had very short response times, and sometimes Lucene had significantly larger response times. For example, Vendor A might take 16 thousandths of a second and Lucene might take 156. I do not consider it to be the case that Lucene's response time grew unexpectedly, but rather that Vendor A appeared to be taking advantage of an optimization which Lucene didn't have. (I believe there have been discussions on the dev mailing list on complex queries of this sort.)

Index Size:
For our test data, the size of both indexes grew linearly with the number of documents. Note that these sizes are compact sizes, not the maximum size during index loading. The numbers below are from running du -k in the directory containing the index data. The larger numbers below for Vendor A may be because it supports additional functionality not available in Lucene. I think it is the constant rate of growth, rather than the absolute amount, that is more important.

Documents | Lucene (KB) | Vendor A (KB)
50K       | 45516       | 63921
200K      | 171565      | 228370
400K      | 345717      | 457843
600K      | 511338      | 684913

Indexing Times:
Indexing Times:
These times are for reading the documents from our database, processing them, inserting them into the document search product, and compacting the index. Our data has a large number of fields/attributes. For this test I restricted Lucene to 24 attributes to reduce the number of files created. Doing this, I was able to specify a merge width for Lucene of 60. I found that, in general, Lucene indexing performance was very sensitive to changes in the merge width. Note also that our application does a full compaction after inserting every 20,000 documents. These times are just within our acceptable limits, but we are interested in alternatives to increase Lucene's performance in this area.

Documents | Lucene     | Vendor A
600K      | 81 minutes | 34 minutes
(I don't have accurate results for all sizes on this measure but believe that the indexing time
for both solutions grew essentially linearly with size. The time to compact the index
generally grew with index size but it's a small percent of overall time at these sizes.)
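The merge-width and periodic-compaction tuning described above can be sketched as follows, assuming the Lucene 1.x API in which mergeFactor is a public field on IndexWriter (the "merge width" of 60 in this benchmark). The in-memory directory and generated documents are illustrative.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TunedIndexing {
    // Index n generated documents with a given merge width, compacting
    // (optimizing) the index every `compactEvery` documents; returns the
    // number of documents in the finished index.
    public static int indexDocs(int n, int mergeFactor, int compactEvery) {
        try {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.mergeFactor = mergeFactor;   // wider merges: fewer, larger merge passes
            for (int i = 0; i < n; i++) {
                Document doc = new Document();
                doc.add(Field.Text("contents", "generated document number " + i));
                writer.addDocument(doc);
                if ((i + 1) % compactEvery == 0) {
                    writer.optimize();          // full compaction, as every 20,000 docs here
                }
            }
            int count = writer.docCount();
            writer.close();
            return count;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

A larger mergeFactor defers merging (faster indexing, more open files); frequent optimize() calls trade indexing time for compactness and search speed, which is the sensitivity the benchmark observes.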

Hardware Environment

• Dedicated machine for indexing: yes
• CPU: Dell Pentium 4 CPU 2.00 GHz, 1 CPU
• RAM: 1 GB Memory
• Drive configuration: Fujitsu MAM3367MP SCSI
Software environment

• Java Version: 1.4.2_02
• Java VM: JDK
• OS Version: Windows XP
• Location of index: local
Lucene indexing variables

• Number of source documents: 600,000
• Total filesize of source documents: from database
• Average filesize of source documents: from database
• Source documents storage location: from database
• File type of source documents: XML
• Parser(s) used, if any:
• Analyzer(s) used: small variation on WhitespaceAnalyzer
• Number of fields per document: 24
• Type of fields: 1 keyword, 1 big unindexed, the rest unstored and a mix of tokenized/untokenized
• Index persistence: FSDirectory
• Index size: 12.5 GB
Figures

• Time taken (in ms/s as an average of at least 3 indexing runs): 600,000 documents
in 81 minutes (du -k = 511338)
• Time taken / 1000 docs indexed: ~8 seconds (123 documents/second)
• Memory consumption: -ms256m -mx512m -Xss4m -XX:MaxPermSize=512M
Notes

• merge width of 60
• did a compact every 20,000 documents
Appendix

Features of Optical Character Recognition (OCR)
Automatic Reader ver8.0 SDK
Sakhr's Automatic Reader is the outcome of Sakhr's ongoing research in the fields of Arabic Natural Language Processing and Character Recognition technologies. Sakhr's Automatic Reader pioneered OCR programs for the Arabic language. OCR stands for Optical Character Recognition. When a text document is scanned, the computer recognizes this text as a graphical image. The user cannot manipulate, search, or edit the text in its image format. An OCR program reads this scanned text, recognizes it, and then converts the figures and characters into editable text.

Sakhr’s Automatic Reader – Sakhr OCR – transforms scanned images into a grid of millions of dots, optically
recognizes the characters found in them and ultimately converts them into text. The complex nature of the Arabic
language is evident in the cursives of the text, character overlapping, various character shapes, diacritics and the
variety of calligraphic Arabic fonts that exist. As a result, these specific Arabic language complexities present major technical challenges in the Arabic OCR industry. The Automatic Reader, backed by Sakhr's extensive experience in Arabic Natural Language Processing (NLP) technologies, addresses these challenges effectively, thus providing Arabic users with an award-winning, high-quality OCR solution.

Sakhr's Automatic Reader package offers great features for accuracy enhancement, employing NLP tools, supporting PDF and all popular image formats, and handling other script languages that have shapes similar to Arabic, such as Farsi, Urdu, Pashto and Jawi.
KEY FEATURES
Performance and Accuracy

 800 characters per second on PIII-based computers
 Up to 99% accuracy in recognizing Arabic books, newspapers, etc.
 Windows NT, 2000 and XP (Arabic Enabled).

Recognition Engines

 Supports Arabic, English, French and 16 other languages
 Supports other script languages: Farsi, Jawi, Pashto and Urdu (available optionally in the extra language pack)
 Recognizes bilingual documents: Arabic/English, Farsi/English and Arabic/French
 Supports both OMNI & Learning technologies to obtain higher accuracy in different fonts

Supported Formats

 Deals with all image formats (.bmp, .tiff, .pcx, etc.)
 Saves the output text in different formats such as .txt, .rtf, and .html
 Supports PDF formats

Supported Scanners

 Supports Twain, ISIS and KOFAX protocols
 Works with any type of scanner
 Supports simplex and duplex scanners

UNIQUE FEATURES

 Recognizes the diacritics in Arabic images
 Opens multiple documents at the same time
 Recognizes tables in scanned images
 Supports ill-formed tables
 Recognizes underlined words
 Recognizes broken and stuck-together characters
 Automatically detects font style (Regular or Bold)
 Uses Arabic linguistic rules with recognition (Artificial Intelligence)
 Supports non-rectangular frames
 Supports color documents
 Groups recognition attributes into pre-defined types of source documents

Powerful Imaging Tools

Automatic and manual image rotation and fixing

Software Developers Kit (SDK)

The OCR can be easily integrated with third-party applications using the SDK. The SDK supports scanning without opening an OCR application and enables the user to highlight a specific recognized word in the image. The SDK contains OLE Automation, OCX and DLL (standard APIs).

OTHER IMPORTANT FEATURES

 Provides program interface in both Arabic and English.
 Includes bilingual spellchecker.
 Supports both automatic and manual framing modes.
 Sends OCR results by e-mail.

System Requirements | Minimum              | Recommended
CPU                 | Pentium III, 700 MHz | Pentium IV, 2.0 GHz
Free Disk Space     | 65 MB                | 400 MB
RAM                 | 64 MB                | 128 MB
Operating System    | Win 2000, XP         | Win 2000, XP
OCR SDK
OCR Professional contains the same features as OCR Office but is distinguished by some extra features:

• Opening multiple image files simultaneously.
• Recognizing colored images, not only in Arabic, but also in other European languages.
• Recognizing Arabic/French images.
• Supporting Learning mode.
• Supporting four different types of batch jobs.
• Correcting, editing and spell checking the resultant text at the same time as any batch job.
• Supporting more than 16 European languages.
• Containing 26 Arabic font libraries.

The OCR, backed by Sakhr's experience in Natural Language Processing (NLP), can handle the unique characteristics of the Arabic language.

To begin with, Arabic is written cursively, where several characters are connected to form what we
call 'blocks of characters'. Arabic can also be written in many fonts, so that a 'block of characters'
has more than one base line. Additionally, Arabic uses many types of external objects such as dots,
'Hamza' and 'Madda'.

Diacritization adds a new set of external objects. Furthermore, Arabic characters can have more than
one shape according to their position inside the block of characters (initial, middle, final or
standalone block of characters).

Overlapping also makes it difficult to determine the spacing between blocks of characters and
words. Finally, Arabic font suppliers do not follow a common standard. Given the peculiarities of
Arabic fonts and the characteristics of the Arabic language, building an OMNI OCR becomes a
difficult undertaking.

Sakhr's OCR combines two main technologies: Omni technology, which depends on highly advanced research in artificial intelligence, and Training technology, which increases the accuracy of character recognition.

It can identify more than one language through Xerox TextBridge technology, one of the most popular OCR programs. The program can also identify both Arabic and English characters on the same page. The OCR can also distinguish between 26 TrueType Arabic fonts. The professional version enables the user to scan large numbers of documents, save them as graphic files, and classify them for later recognition in order to save time.

The program recognizes graphics and places them in their proper location on the page. The
program also saves page format without any modification to tables, columns, or graphics. It can
also identify diacritics and keep or remove them according to the user's preference.

The output can be saved to disk for use in a variety of applications, such as word processing. The program is compatible with many other programs and provides an SDK (Software Development Kit) to allow integration with other applications. The OCR SDK is available in both DLL and ActiveX formats.

The Sakhr OCR engine has been optimized using the Intel tools to better run on the latest Pentium
4 processors that support Hyper-Threading.

The PROFESSIONAL Version 8.0 - Automatic Reader Pro supports the following languages: Arabic, Farsi, French, English, Czech, Danish, Dutch, Finnish, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, and Turkish. This version also adds training technology to Omni technology to further raise accuracy levels. Users can access 4 batch modes, train the program on specific fonts, and use spell checkers, Arabic linguistic rules, and OLE and DDE features.

Available in Gold Version, and Platinum Version (includes SDK).