
TECHNOLOGY SECTION

An Introduction to

Document Image Processing &

Digital Document Management Systems

By

Tony Hendley, Technical Director,

Cimtech


CONTENTS:

1. Defining what we mean by document image processing & digital document management
1.1 What Is A Document?
1.2 What Is Document Image Processing?
   DIP As A Subsystem
   DIP As A System

2. The key hardware components of a document image processing system
2.1 Input
   Image Processing
2.2 Control
2.3 Storage
   Read Only Optical Disks
   Write Once Optical Disks
   Rewritable Optical Disks
   Jukeboxes
2.4 Output Devices
   Printers
   Monitors
2.5 Networking

3. Software components of a document image processing system
3.1 Operating Systems
3.2 Image Processing & System Level Software
3.3 Standard Application Software
   Input Subsystem
   Storage Subsystem
   Retrieval Subsystem
   Output Subsystem
3.4 System Management Software
3.5 File & Index Data Management
   File Management
   Database Management
   DIP & Database Management
   Extended Data Processing
   Document & Content Architecture
   Document Filing & Retrieval
3.6 Workflow Automation
3.7 Applications - What Do People Want To Do With The Documents?

4. Assessing your requirements
4.1 Paper Replacement
4.2 Improving Access To Documents/Document Contents
   Keyword Indexing For Transaction & Evidentiary Documents
   Reference Documents - Keyword Or Full Text & Hypertext/Hypermedia?
4.3 Workflow Automation
4.4 Applications

5. Conclusion


1. DEFINING WHAT WE MEAN BY DOCUMENT IMAGE PROCESSING & DIGITAL DOCUMENT MANAGEMENT

Defining what we mean by DIP and digital document management is more difficult than it may at first sight appear. The simplest way to do it is simply to describe the hardware and software components that go to make up a DIP system and this we do below. However, we preface this with a brief discussion on what we mean by a document.

1.1 "'hat Is A Document?

Ask most people what a document is and they will think in terms of paper documents and say that it can comprise anything from one page to many thousand pages and, if they think hard about it, they will say that it has a logical structure, i.e. it has a start and an end and can usually be divided into chapters or sections or paragraphs or other logical divisions. It is the logic of a document which enables us to divide a stack of pages into several discrete documents. Secondly, they will say that it has a layout structure - the content of the document is laid out on one or more pages following some set of logical rules again. Thirdly, they will say that the document is comprised of content types - text, line art or graphics and images.

Our view of a document is certainly clouded by the media used to store and distribute documents which has traditionally been paper. Even when users looked for a more compact alternative to paper they chose microfilm which, by its analogue nature, simply miniaturises existing pages of a document - thus preserving most of our hallowed concepts of what a document is.

In the computer industry our view of a document has been shaped by two factors - firstly our own inherited traditional view of what a document is and secondly our newly formed view of what a document is, based on how a document can be created, stored and processed on a computer - which until recently has been very limited. Within office systems and word processors a document is created as a string of ASCII or EBCDIC characters and stored as an unstructured text file. When it comes to laying out the contents of a document onto a screen for display or paper for hardcopy output then most word processing and office system users have a number of formatting commands at their disposal.

: "Izithelatterhalfoftlle 8.0·swe.hllYc~'arso,-~d.e great strides . _ . In' . the:. computer .mdustryin .our, a~_ility !~t: capture! create ....• vectorgraphics (line art.C;AD designs etc)3P.d.bit map raster _ , . images- (facsimile,' black: and . white, - greyscaJe and colour .~·-=:-scanneiSf and this has allowed us -to capture and create not

just text documents with limited formatting options but real world. documents that comprise text and vector and raster

graphics in a.'raIlgeof_£PIi1pinat!~~: ._ ." .

However, the sheer breadth of options has created a lot of difficulties and confusion for users who, having captured or created documents, want to be able to view them on a range of different terminals/workstations, print them on a range of different printers, often to copy and edit them on a range of different word processing and desktop publishing software, interchange them with users running a range of different systems and store them in digital form with some hope that when they come to retrieve them again they will still be able to view, print and, if necessary, edit them.

To meet this need the computer companies have developed and are still developing standard document architectures and interchange formats which will allow users to define the logical structure, layout structure and contents of their documents in a form which is intelligible to a range of systems and applications and which should, in an increasingly open environment, facilitate the interchange of these documents between a range of different systems. The whole area of digital document management is growing at a tremendous rate and the issues are becoming increasingly complex.

1.2 What Is Document Image Processing?

It is against this background that we must define what we mean by Document Image Processing. In its simplest form we are talking about systems which transform paper documents into digital images which then become the basis for a wide variety of computer processing applications.

Most DIP systems capture paper documents by means of a scanner which scans each page or side of a page in the case of double-sided documents and then digitises the resulting video image of each page and stores it as a raster image file. So far so good - we are still in the realms of pages and documents just as with paper and microfilm systems and facsimile systems.

The difficult concept to convey is that a DIP system can function in a number of different ways. There are two key options.

DIP As A Subsystem

Firstly, DIP can simply be a subsystem - a way of getting paper documents into a general Digital Document Management system. This option itself has a range of suboptions. Much depends on how flexible the digital document management system is. If the digital document management system is geared to managing all documents as text files then the DIP subsystem will comprise a scanner for capturing an image of a page and an Optical Character Recognition or Intelligent Character Recognition software package for converting the characters in the document into coded characters plus an editing package for correcting errors. Once captured and corrected the text will be input into the text management system.

If the digital document management system is geared to a publishing requirement then the options widen. The DIP system could be used to scan in pages of documents or photographs etc. The document would be analysed and broken up into text, image and vector graphic contents and then each would be held in a different way and manipulated and used to produce a new published document.


If the digital document management system was a compound document management system geared to the needs of a legal office or project management application then the DIP system would be used to scan in images of pages which would be held as images with the necessary application software to display and print them as images and they would be managed alongside text files and documents in other formats on one compound document management system.

If the digital document management system was geared to the needs of technical document management then the DIP system would be geared to scanning large format documents and might provide a raster to vector conversion option as well as a raster to ASCII conversion option.

The options are many and very varied and all one can say is that in the first major option DIP is a subsystem - a way of capturing the contents of documents and processing them so they can then be managed by whatever system is currently in place to manage digital documents.

DIP As A System

Alternatively, a DIP system can function as a standalone system where all documents are stored in "image" format. Such systems usually have two key input sources. The first is once again the scanning in of paper documents. The second one is the downloading of existing digital documents and data into the DIP system in "image form". Here we are talking about the wide range of electronically created documents that already exist in digital form (text files from WP; vector files from CAD; text, image and vector files from a desktop publishing system; structured data records from an EDI transaction etc). If the originator wants to manage them and make them accessible for long periods of time then in the past he would have printed them out from his originating system and stored them in paper format in filing cabinets or he might have output them directly to microfilm via a plotter and stored them on microfilm.

With many large scale DIP systems today users could opt to transmit their electronic documents to the DIP system for long term storage and management. The format of choice would tend to be a print image or raster plot file format - the formatted image of the document which they would normally send to a hardcopy printer or plotter when they wanted to make a reference copy of the document. Hence an image of a document does not necessarily need to be a scanned image of a paper document - it could be a print image generated from a word processing system or a plot file generated from a CAD system etc. Both these inputs are easily understood as they equate to how a paper or microfilm document management system works.

In this environment the DIP system would be linked to DP applications, to office applications, to CAD systems and to Desktop Publishing applications and even other image processing applications and its main role would be to hold a formatted version of all the final format documents created on these systems.


2. THE KEY HARDWARE COMPONENTS OF A DOCUMENT IMAGE PROCESSING SYSTEM

We will look first at the key hardware and software components that go to make up a DIP system, whether it is functioning as a standalone system or a subsystem. We will then go on to look at some of the key ways in which systems and subsystems can be differentiated.

Most standalone DIP systems we have come across are made up of some five discrete modules. These comprise the input system (scanning; data downloading etc); the control system (open and closed, distributed/centralised); the storage subsystem (magnetic, optical, microform etc); the retrieval/output system (display stations for soft copy; non-impact printers, COM etc for hardcopy) and the communications system. On top of that basic set of modules we will look briefly at some of the many options that are becoming available for data management and document filing and retrieval; the range of application software in cases where users want to manipulate the documents and, finally, the growing range of workflow automation software.

2.1 Input

Scanners form the front end and capture a video image of the image stored on paper/film which is then digitised. The dots that compose a digitised image are called pixels and an image represented in this form is called a bit-mapped image. The greater the number of pixels the higher the resolution of the image.

The resolution of an image is expressed as the number of dots per inch (dpi). Generally, resolutions of 200 dpi to 400 dpi are appropriate for business applications while professional printing applications will require much higher resolutions. The data content in an image is proportional to the square of the resolution. Hence a 200 dpi image contains 40,000 pixels per square inch, four times as many as a 100 dpi image.

As the amount of data in an image increases, the amount of disk storage needed for that image also increases, as does the time it takes to transfer that image between peripherals.

Document images take many forms. They can be black and white half tones; they can be grey scale images or continuous tones (colour or monochrome). Continuous tone images (photographs) are images with a full range of colour tones (or grey tones for black and white images). Grey scale images use a fixed number of grey tones. Halftone images are grey scale images using only two distinct grey levels - black and white. The bulk of printed text uses half tones.

Most scanners use a linear array of CCDs containing 1728, 2048, 3456 or 4096 photodetector elements, corresponding to 200, 240, 400 or 496 dpi resolutions for an 8.5 inch wide image. The photodetectors' analogue signals have to be converted into digital signals. Most scanners convert the signal into a discrete level of grey. Scanner grey scale levels range from 8 to 256. The low range is appropriate for most printed documents; the high range is used for high resolution applications such as medical X-rays.

The amount of data required to record each pixel depends on the amount of grey scale information. A single black and white pixel requires only 1 Bit of information. Hence an A4 document scanned at 200dpi would require a total of 3,740,000 Bits. To record 256 levels of grey 8 Bits per pixel is needed so an A4 document would require 3.74 MBytes.
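The arithmetic is easy to reproduce. The short sketch below - our own illustration rather than part of any DIP product - computes the raw size of a scanned page at a given resolution and bit depth; the A4 dimensions are nominal, so the results differ slightly from the rounded figures quoted above.

```python
# Raw (uncompressed) size of a scanned page image. A4 is roughly
# 8.27 x 11.69 inches; the figures quoted in the text are rounded,
# so treat these results as approximate.

def raw_image_bits(width_in, height_in, dpi, bits_per_pixel=1):
    """Pixel count scales with the square of the resolution."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel

A4_W, A4_H = 8.27, 11.69  # inches

for dpi in (100, 200, 400):
    bits = raw_image_bits(A4_W, A4_H, dpi)
    print(f"{dpi} dpi, 1 bit/pixel: {bits / 8 / 1e6:5.2f} MBytes")

# 256 grey levels need 8 bits per pixel - eight times the bilevel size.
bits = raw_image_bits(A4_W, A4_H, 200, bits_per_pixel=8)
print(f"200 dpi, 8 bits/pixel: {bits / 8 / 1e6:5.2f} MBytes")
```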

Image Processing

System performance is improved when grey scale information is eliminated by thresholding. All pixels above a threshold are deemed black and all below, white. The system then discards redundant grey scale information concerning each pixel and records a simple black and white representation of the pixel. Most DIP system printers do not print varying shades of grey so halftone images are used. They create a simulated grey scale image by varying the intensity and pattern of black dots. A simulated grey scale image is created while scanning by a process called dithering. The scanner converts the grey scale information of small areas of the image into an even pattern of black dots. If there are 8 levels of grey it will use 8 different dot patterns.
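To make the two techniques concrete, here is a minimal sketch - an illustration of the principle, not production scanner firmware - applying a fixed threshold and a simple 2x2 ordered dither to a toy grey scale image. The 0-255 grey range (0 = black) and the Bayer-style dither matrix are our own assumptions.

```python
# A toy 4x4 grey scale "image": 0 is black, 255 is white.
grey = [
    [ 10,  60, 120, 200],
    [ 30,  90, 150, 220],
    [ 50, 110, 180, 240],
    [ 70, 130, 210, 250],
]

def threshold(img, t=128):
    """Thresholding: every pixel becomes pure white (1) or black (0)."""
    return [[1 if p >= t else 0 for p in row] for row in img]

# 2x2 Bayer-style matrix: each cell flips to white at a different grey
# level (32, 96, 160, 224), spreading dots evenly over small areas.
BAYER = [[0, 2], [3, 1]]

def ordered_dither(img):
    """Simulate grey tones with patterns of black and white dots."""
    out = []
    for y, row in enumerate(img):
        out.append([])
        for x, p in enumerate(row):
            flip = (BAYER[y % 2][x % 2] + 0.5) * 64
            out[y].append(1 if p >= flip else 0)
    return out

print(threshold(grey))       # hard black/white - good for text areas
print(ordered_dither(grey))  # dot patterns - preserve tonal feel
```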

It is possible to get colour scanners for photographs, maps etc but they are much more expensive and require colour monitors and printers. In addition, they create at least three times as much data as a standard grey scale image. It is necessary to scan the image for each of the primary colours and use three sets of CCDs or three light sources.

The user may require to capture both textual and photographic information from the same page - e.g. magazine articles with photos. If these compound documents were scanned without grey scale the text would be clear but the photo would have too much contrast. If we scanned the page with grey scale the photo would look good but the text would be hard to read. To overcome this problem some scanners can automatically process these compound images using a threshold technique on textual areas and a dithering technique on photographic areas. However, such compound scanning takes longer to perform.

Image editing can also be done at the scanning stage. We can change the appearance of the image by using the following facilities - enlarge; invert; rotate or highlight - and we can remove parts of it; clean up poor quality images; adjust contrast levels; carry out edge enhancement and use filtering techniques.

Usually image processing is carried out in the scanner under computer control or in the control computer. The result is a data file containing a digitised image which is compressed before transmission/storage. The use of compression reduces storage and transmission overheads on the system and hence improves overall system performance but, of course, we must decompress before viewing or printing. The CCITT have developed their Group 3/4 facsimile standards for compression/expansion. Group 3 uses one dimensional Huffman coding to provide compression ratios of 5 - 15:1 and optionally two dimensional coding. Group 4 uses two dimensional READ coding to provide compression ratios of 15 - 30:1 for textual documents. The exact figures obtained depend on the original.
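The Group 3/4 ratios rest on the fact that a scanned page of text is mostly long runs of white pixels. The sketch below illustrates the underlying idea only - it produces the run lengths that the real CCITT one dimensional scheme would then encode with its modified Huffman tables, which we do not reproduce here.

```python
def run_lengths(scanline):
    """Collapse a bilevel scanline into (value, run length) pairs -
    the raw material to which Group 3 one dimensional coding assigns
    Huffman codes (shortest codes for the most common run lengths)."""
    runs, current, count = [], scanline[0], 0
    for pixel in scanline:
        if pixel == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = pixel, 1
    runs.append((current, count))
    return runs

# A 1728-pixel fax scanline: mostly white (0) with two black strokes.
line = [0] * 700 + [1] * 12 + [0] * 900 + [1] * 8 + [0] * 108
print(run_lengths(line))  # five runs describe all 1728 pixels
```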

There are as yet no fully agreed standards for compressing grey scale and colour images although the Joint Photographic Experts Group (JPEG) is close to finalising such a standard. Once a standard is agreed and chips are widely available to compress and decompress such images then we may well see a significant increase in the number of mainstream DIP systems offering colour image storage.

The CCITT standards also include additional information to describe the document and compression techniques. The compressed image file represents the image plus additional image descriptor and file format codes. When these additional codes are added, the resulting image file may be proprietary to the system. Proprietary image file formats may work well within one supplier's system and allow them to scan in and store documents in image format following the facsimile standards and in many cases to import and export Group 3 facsimile images but users of such systems may find themselves constrained should they wish to transfer their images to different systems and peripherals other than those supported by their supplier or other facsimile machines.

Any image file must hold enough information to enable the program you are using to view it to decode it. This typically comprises the image data plus information describing how the image data is to be interpreted which is typically stored as a header. Most image file formats are hardware specific - they are optimised for specific hardware. They typically use a positional structure where the location of specific data in the file determines what the data refers to. Such a structure is resistant to change but one good way of overcoming this is to place a tag at the start of each data field which tells the reading software exactly what the following data is. Aldus and Microsoft adopted this approach when they developed their Tagged Image File Format (TIFF) as a common format for scanner vendors and developers of desktop publishing software. All information relating to the images is stored in tag fields and the latest version of TIFF includes some 45 fields. The image file directory contains a list of the fields present in a TIFF file and TIFF can support multiple images in a single file and several compression schemes.

However, TIFF was not intended to be a printer language or page description language nor is it intended to be a general document interchange standard. The main aim was to facilitate the exchange of image data between application programs. Hence while TIFF can and should be used to store images and deliver them to an image editing program it does not contain all the additional logical and layout data needed in a document architecture or document interchange format.
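The tag structure is simple enough to inspect directly. The sketch below walks the first image file directory of a TIFF file and lists its fields; it follows the published TIFF layout (a 2 byte order marker, the magic number 42, then 12 byte directory entries) but is only a minimal illustration, not a full TIFF reader, and the example filename is invented.

```python
import struct

def list_tiff_tags(path):
    """Print the tag number of every field in a TIFF file's first IFD."""
    with open(path, "rb") as f:
        # Bytes 0-1: byte order ("II" little-endian, "MM" big-endian).
        endian = "<" if f.read(2) == b"II" else ">"
        # Bytes 2-3: magic number 42; bytes 4-7: offset of the first
        # image file directory (IFD).
        magic, ifd_offset = struct.unpack(endian + "HI", f.read(6))
        assert magic == 42, "not a TIFF file"
        f.seek(ifd_offset)
        # The IFD holds a 2 byte entry count, then 12 byte entries:
        # tag, field type, value count, value (or offset to it).
        (count,) = struct.unpack(endian + "H", f.read(2))
        for _ in range(count):
            tag, ftype, n, value = struct.unpack(endian + "HHII", f.read(12))
            print(f"tag {tag:5d}  type {ftype}  count {n}")

# Usage, assuming a scanned page saved as TIFF:
# list_tiff_tags("scanned_page.tif")
```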

As we said above, Optical Character Recognition (OCR) is an additional software facility which can be added to a DIP system if required by the application. The document is scanned and digitised as before and the digitised file is then processed by the OCR software and the textual content of the document is converted into an ASCII text file. However, OCR devices are relatively slow compared to scanners and are not 100% accurate today and they cannot capture signatures, logos, notes and handwritten material. Hence most DIP systems, if they use OCR at all, simply use OCR to automate the laborious task of indexing by reading the characters contained in specified segments of the document. Cheque processing/giro processing systems use DIP and OCR/ICR techniques for automated data entry and image management by bringing up the cheque image on a DIP screen alongside a data entry screen and transferring data from segments of the image to fields in the data entry screen.

Turning to the actual scanners, the scan speeds depend on a number of factors, including: the speed with which light and a mirror can be scanned across the face of a document; the time taken to process the bit-mapped image and the time taken to position a new page. The quoted scan speed for the scanner excludes factors such as the data rate of the compression circuitry, the control computer and the communications system. Typical scan speeds for office scanners range from 2 - 15 seconds per page.

Other key factors which can crucially influence throughput include time for document preparation; Quality Control and indexing.

Another key parameter for scanners is, of course, the size of document accepted. Apart from specialised cheque and voucher scanners, most can accept A4 size documents and a large percentage up to A3 size for business documents. For applications requiring the capture of larger documents (drawings, maps etc) there are specialist scanners. Such larger documents up to A0 size or more also require extensions to the Group 4 facsimile standards and the US military has sponsored the creation of raster graphics standards for large format documents.

Type 1 specifies an extension of the Group 4 facsimile standard for documents up to US K size and Type 2 specifies a tiled format where large documents are broken down into a series of tiles, each of 512 x 512 pixels and each tile is then compressed individually. Such a scheme means that large images can be viewed on standard PCs and workstations without the need for massive amounts of expanded memory.
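A quick calculation - ours, not the standard's - shows why tiling helps: an A0 drawing scanned at 400 dpi decomposes into nearly a thousand independently compressed 512 x 512 tiles, and a viewer need only decompress the handful of tiles under the current window.

```python
import math

def tile_grid(width_in, height_in, dpi, tile=512):
    """Number of tiles needed to cover a scanned drawing."""
    cols = math.ceil(width_in * dpi / tile)
    rows = math.ceil(height_in * dpi / tile)
    return cols, rows

# A0 is roughly 33.1 x 46.8 inches.
cols, rows = tile_grid(33.1, 46.8, 400)
print(f"{cols} x {rows} = {cols * rows} tiles")
# A 1024 x 768 viewing window overlaps at most 3 x 3 of those tiles.
```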

There are several types of scanner on the market. Firstly we have rotary scanners which, like rotary microfilm cameras, can be used for scanning single sheets of standard sizes at high speed. Secondly, we have flatbed scanners which resemble photocopiers and can be manually loaded or offer semiautomatic sheet feed facilities. Thirdly, for specialised markets we have planetary scanners which can accept bound volumes, stacks of documents and even in some cases three dimensional objects.

There are also a range of microform scanners on the market for most of the key microform formats. There are scanners that will accept 16mm and 35mm roll film, standard microfiche and aperture cards and all these can offer full or semi-automatic operation. They will work either with one roll of film or one fiche or one or a stack of aperture cards. There are also a number of jacket scanners on the market but, due to the non-standard formats used in most jacket systems, these require manual manipulation of the jacket transport and hence are slower to operate.


2.2 Control

Document Image Processing Systems, just like any form of computer system today, can be categorised as open or closed systems. Open systems offer greater flexibility to the user in the choice of hardware and software options. Typically, an open system vendor would provide users/third parties with access to internal components to add circuitry; would use standard internal and external data communications protocols; would provide support to system developers; support third party hardware peripherals and third party software. An example of an open system which offers great flexibility and a good expansion path is, of course, the IBM PC. Closed architecture systems, on the other hand, are shipped as a complete unit and changes are usually limited to customisation of the database application software or selection from a limited number of peripherals supported by the vendor.

Inevitably, it is usually a case of degrees of open or closed systems rather than the two extremes painted above. With DIP systems, often the control computers are open but the system is effectively closed if the vendor does not make provision for modifications. When a new area of technology emerges (e.g. Document Image Processing) closed systems are cheaper and faster to develop and they can perform well in standard application environments but they can prove more expensive to a user over time, particularly if the application expands dramatically and major changes have to be made to the system.

DIP systems can also be characterised by their architecture. A few years ago in the computing environment all systems adopted a centralised architecture where all the data processing tasks are handled by the control computer which is linked to a series of dumb peripherals. Increasingly, with the advent of low cost PCs and then workstations we have seen a growth in distributed computing.

With a distributed system the peripherals have their own intelligence and the DP tasks are distributed. An intelligent peripheral will have its own processor and communications interface. Increasingly the client server architecture is being adopted where applications run on the client workstations and one or more servers manage the index database and the application and the storage system.

A single workstation PC-based DIP system is, of course, a centralised system where the control microcomputer controls a scanner, printer, display and optical disk store directly and holds the application data and index database on magnetic disk.

Distributed DIP systems comprise multiple PCs/workstations connected across a local area network to one or more servers which control the index database, the application software and the image database on a mix of magnetic and optical disk storage. The workstations operate dedicated peripherals and functions (scanner; printer; display; editing workstation etc). Multitasking is much easier in these systems as the peripherals are independent and each workstation can be optimised for its task. In addition the systems can be easily expanded as the application grows by adding new servers dedicated to specific tasks.


A significant variation on the distributed system for users who wish to have a greater degree of integration with existing host based databases or who need to utilise the power of the host system is the instream or cooperative processing system which provides for separate processing of the image files away from the database information relating to the image. An instream system can be attached to an existing mainframe database and a full instream system would thus make use of three processors; a mainframe for the application and index database server; a minicomputer/workstation for the document image server and microcomputers for peripherals.

Control computers vary in processing power and speed, integration and expansion facilities. There are three main categories of control computer used in DIP systems - PCs and compatibles; other microcomputers and mini and mainframe based systems.

PCs are currently a popular basic component of DIP systems. However, the processing demands made by a DIP system mean that most of the PCs used are enhanced 80286/ 80386 machines.

Image processing (IP) functions operate on the individual pixels of the image data array created by scanning and digitisation and we have seen that such page images can be very large - 4 - 10 Mbits or more in uncompressed form. IP operations such as compression, scaling etc involve executing the same function repeatedly for each pixel involved and hence performance is critical. To carry out IP on the PC therefore most suppliers make use of an application accelerator - an add-on hardware card or subsystem that has its own processing capability independent of the host which is used to speed up the image processing operations.

Because of the limited number of IP functions required in most standard DIP applications, DIP suppliers employ dedicated logic accelerators with application specific integrated circuits, digital signal processors and proprietary logic on a board which then simply executes the specific set of functions required for an imaging application (compress; scale etc).

Leading vendors of such boards include Xionics, Kofax and Laserdata. Increasingly, as DIP systems join the mainstream so there is a demand for casual users to have their PCs "image enabled" at a low cost. Hence the above vendors and other specialist vendors are making available IP software that offers image decompression, display, scaling and rotation facilities at a much lower cost per workstation. Anyone with a VGA display will be able to view images using such software and on 386 and more powerful machines performance comes close to that achieved with boards. For viewing images on low resolution screens, Xionics have recently introduced a package that makes use of anti-aliasing, or employing grey scale to give the illusion of smoothing out the edges of pixels. The results are impressive on standard A4 document pages when viewed on VGA screens. The advent of Windows 3.0 from Microsoft has made the PC and DOS even more attractive as a platform for low cost DIP applications and most DIP vendors have launched or are about to launch Windows 3.0 based products.


Other microcomputer systems are also of considerable importance in the DIP field. Many companies offer microcomputer workstations based on the Motorola 68000 family of microprocessors and the RISC family. Sun, Hewlett Packard and DEC systems are used as host computers and as distributed processing workstations for several DIP systems. Their processing power and advanced graphics capabilities and use of UNIX make them well suited for a range of DIP applications and IBM's RS 6000 family is being used in a range of DIP systems too. The Apple Macintosh family are also used in DIP systems, most notably the MARS system from Micro Dynamics.

Mini- and mainframe based systems have not traditionally formed the basis of many turnkey DIP systems due to a perceived mismatch between centralised architectures and data intensive applications such as image processing, the fact that the mainframe vendors were not early into DIP and the fact that most early users were looking to solve departmental problems before looking at more ambitious enterprise-wide projects. However, they have been used in customised systems and will be more widely used in large corporate-wide DIP systems now they are supplied by the major computer companies, i.e. IBM's MVS/ESA ImagePlus. DIP workstations used in instream or mini/mainframe based systems are multifunction terminals which can emulate mainframe terminals and also function as DIP terminals using multiple windows providing a mainframe window; image window and usually a PC window as well for office automation facilities.

2.3 Storage

DIP systems require the use of a complete hierarchy of storage devices including RAM in the terminals; magnetic disk drives for storage of the index database, system software and temporary storage of image data; magnetic tape drives to backup the magnetic disks; WORM and rewritable optical disk drives for the storage of active image files; optical disk jukeboxes or automatic optical disk handling devices which can hold large numbers of writable optical disks containing semi-archival documents and automatically retrieve any disk and load it into one of a number of drives when the files contained on that disk are called for; offline optical disks that can hold archival images offline on shelves etc until they are called for by the system and mounted into a jukebox or drive by an operator. Hence the traditional magnetic storage hierarchy has been expanded to cater for the new and demanding storage requirements posed by image processing and largescale electronic document management.

There are a range of optical storage media on the marketplace today, all of which use laser beams to record and read back a range of information - both analogue and digital. They fall into three categories - read only; write once and rewritable.

Read Only Optical Disks

The read only optical disk systems are, by their nature, publishing systems and the consumer Compact Audio Disc and the CD-ROM used for database publishing and technical publishing applications are the best known. The CD-ROM (Compact Disc Read Only Memory) is based on the same physical and recording standards as the CD but also has an extra layer of error correction coding and more accurate data addressing facilities for use in the computing environment so CD-ROMs could be used to distribute databases and raw data to users with PCs, workstations and other computer systems. The CD-ROM was also given a standard volume and file structure so that we could adopt one interface between DOS and other key operating systems and the CD-ROM and Microsoft introduced their CD-ROM Extensions software so CD-ROM drives can be accessed as standard drives within DOS.

Following on from CD-ROM we have a range of derivatives including CD-ROM XA (Extended Architecture) and CD-I. All these systems are aimed at slightly different markets but all have in common the fact that they are publishing/ distribution media and hence only of peripheral interest to users considering introducing their own document management or document archiving facilities.

Write Once Optical Disks

When looking at DIP systems, we are primarily looking at systems which allow users to capture their own information and hence we are looking at the second and third categories of optical storage device - Write Once Read Many (WORM) disks and Rewritable Optical Disks. WORM disks allow users to record data on them as well as to read it back using one laser beam with multiple power settings or two separate laser beams - a write and a read beam.

The data is recorded by burning holes or pits into the surface of the disk or by creating chemical changes on the surface of the disk. These changes are irreversible and hence the disks are called WORM disks. However, this does not mean that all the data has to be written to the optical disk in one go. On the contrary, the user can continue to add data to unused sectors of the disk until the disk is full. Hence a WORM disk is rather like a notebook, you can keep adding data to it until it is full up but you cannot erase data and then rewrite over the top of it as you can with rewritable magnetic disks. This quality can make WORM disks very attractive to records managers and users looking for secure, permanent long term digital storage but it also can be seen as a major weakness for users maintaining volatile databases who, every time they changed the data, would have to write it to another sector of the disk.

_._" .. _ •••••• ' 'Y. _,_"

Tlie'iWC:;'rnaiil c~~~'~~~j~rW8RMdi~ forinats today are 5.25inch:and 12inch. The 5.25inch disks use CA V mode recording like magnetic disks as opposed to CD-ROM which

Copyright Cimtech 1991

uses the consumer CLV recording format. A typicaI5.25inch WORM disk has a storage capacity of 325 MBytes per side and is double sided although current drives can only read one side at a time. The 12 inch disks have a storage capacity of from 1 GByte to a staggering 4.5 GBytes per side and again are double sided. Most current drives can only read one side at a time but LMSI have launched a 12 inch drive that can read both sides of a 12 inch disk at once to provide very fast access times.

The key benefits which WORM disks offer include their high storage capacity; the fact that the WORM disks are removable and interchangeable compared to magnetic disks which are usually fixed; the fact that WORM disks are written to and read by a laser beam which can be focused on the disk from a distance and hence there is no danger of head crashes and wear during read/write operations; the fact that WORM disks are non erasable and hence offer a permanent audit trail and look set to be legally acceptable for the storage of valuable records and finally the low cost per MByte of storage. Many of these benefits are shared by paper and microfilm and many see WORM disks as being the digital paper of the future - something we write on and never erase.

WORM disks fit between magnetic disk and magnetic tape in the digital storage hierarchy. At present the WORM disks do not compete with magnetic disks on access speeds or data transfer speeds but compared to tape they offer random access, much faster access and far higher storage capacities. At present there are no standards for 12 inch WORM disks but considerable effort is being devoted to agreeing standards for 5.25 inch WORM disks and an international standard specifying the physical and recording standards for 5.25inch WORM disks has been issued although it is a weak standard and allows for several formats.

WORM disks are a niche market today with only a few hundred thousand drives sold worldwide. This leads many to predict that small format WORM drives and media may disappear in the medium term as they lose out to low cost mass produced rewritable optical disk drives and media. The jury is still out on that at present.

When looking at the storage capacities of the disks we must make the point that we are only at the beginning of the evolution of optical media. Today a 12 inch WORM disk offers 1 - 4.5 GBytes of storage per side. Depending on the resolution used, the content of the document, the compression algorithm used etc, between 15,000 and 90,000 A4 document images can be stored on one side of a 12 inch WORM disk in compressed bit-map form. Alternatively, over 1 million pages of ASCII coded text could be stored on the same disk side.
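Those page counts follow from the scanning and compression arithmetic given earlier. The sketch below reproduces them under our own illustrative assumptions (a bilevel A4 scan, Group 4 style ratios); vary the inputs and the 15,000 - 90,000 range quoted above falls out.

```python
def pages_per_side(side_gbytes, dpi, compression_ratio):
    """Approximate number of compressed A4 page images per disk side."""
    raw_bits = (8.27 * dpi) * (11.69 * dpi)  # bilevel A4 scan
    compressed_bytes = raw_bits / 8 / compression_ratio
    return int(side_gbytes * 1e9 / compressed_bytes)

print(pages_per_side(1.0, 300, 15))  # ~14,000 - near the low figure
print(pages_per_side(4.5, 200, 10))  # ~93,000 - near the high figure
```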

Rewritable Optical Disks

There are already a number of 5.25inch rewritable optical disk drives available from Sony, Ricoh, Canon, Panasonic, Maxoptix and others. They employ magneto-optic or phase change recording technology. On the horizon are 3.5 inch diameter rewritable optical disk drives too, with MOST supplying one and IBM showing one that can function as a rewritable or read only disk. Currently rewritable optical disk drives are being used as high density floppy disks in data intensive applications such as CAD, DTP and DIP. They are ideal for temporary storage of bulky work in progress files although the drives are still quite expensive.

What we will see increasingly over the next few years is multifunction 5.25inch drives capable of playing both WORM and rewritable optical disks, or rewritable disks that can be used as rewritable disks or used in write protect or WORM mode, so one drive could meet a user's backup and archiving requirements. Arguments rage again as to whether separate WORM and rewritable media are needed or whether write protect will meet the WORM requirement. The majority of vendors favour write protect as they say drives that accept one medium only will be cheaper to produce. Users are still not convinced when they come from a records management environment but price could be the deciding factor.

Jukeboxes

Because both WORM and rewritable optical media are removable there is a market for nearline storage devices such as disk jukeboxes or autochangers. These devices allow 10 - 288 disks to be held in a device with 1 - 11 drives and one or more picker mechanisms that can, under computer control, retrieve any disk, rotate it 180 degrees and insert it into a preselected drive where the required data can be read and transmitted to a cache.

The key benefit is low cost per MByte of data stored where high performance is not an essential requirement but totally automatic retrieval and random access is attractive compared to manual tape library systems. As part of the storage hierarchy jukeboxes have a valuable role to play but users must be aware of the performance issues involved in their use before committing all their active documents to a jukebox.

There are jukeboxes for WORM and rewritable 5.25inch and 12 inch disks but we are not aware of any CD-ROM jukeboxes currently in commercial production. The nearest thing to that is the Pioneer 6 disc caddy system.

2.4 Output Devices

Printers

The primary hardcopy output devices for DIP systems are non-impact printers including laser printers; LED; ink jet etc. These printers can simultaneously print both text and graphics on the same page. However, before an image can be printed by the laser print engine it must be converted back to an analogue video format.

Most systems can convert, transmit and print an A4 200 dpi image in approx 10 - 20 seconds with subsequent copies at 8 per minute. Some we have seen with hardware assist can print as many as 90 pages per minute. DIP systems used for drawings use plotters although A2 laser printers are now becoming available at a reasonable price and A1 laser printers have been shown. Another alternative again is to plot out to microfilm, i.e. microfiche or aperture cards.

Monitors

The availability of high resolution monitors capable of displaying a full document image is the key to the widespread acceptance and use of distributed DIP systems and the consequent reduction of paper in the office. There is a wide range of monitors designed to display document images plus conventional ASCII information. CRT displays are still the most common (99.9%). The resolution of the CRT depends on the number of horizontal scanning lines and the number of pixels per scanning line.

In addition to the CRT, high resolution monitors comprise a power supply, display controls and a case. Display resolution is a key issue. If you scan a document at 200 or 300 dpi it seems desirable to display it at the same resolution but currently this can prove expensive. A 200 dpi monitor needs 4X the display buffer storage of a 100 dpi monitor. There is therefore a range of monitors for casual or dedicated image users, ranging in resolution from 80 dpi to 300 dpi.

The monitor size is also an important consideration. The standard CRT has a 4:3 aspect ratio so it is wider than it is tall. DIP monitors come in landscape or portrait mode. Portrait is higher, for single page display in a vertical mode; landscape is wider, for engineering drawings etc or two A4 documents side by side or one image next to a database window. The size of a monitor is quoted as the diagonal dimension of the screen.

A number of image processing functions can be performed at the display as at the scanner, using a choice of cursor, function keys or mouse to control them. The user can reduce the size of an image to fit the display or crop for display. Typically, image reduction goes in 10% increments where for each 10% the system drops off one in ten pixels in the vertical and horizontal plane. Alternatively, it is possible to pan the image across the screen to see the original without reducing the size to fit the screen. Image expansion is also useful to examine part of a page on a low resolution monitor.
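That reduction step is trivial to express in code. A minimal sketch, assuming the simple drop-one-in-ten scheme just described:

```python
def reduce_image(img, steps=1):
    """Shrink a bitmap by roughly 10% per step by dropping every
    tenth pixel in both the horizontal and vertical planes."""
    for _ in range(steps):
        img = [[p for x, p in enumerate(row) if x % 10 != 9]
               for y, row in enumerate(img) if y % 10 != 9]
    return img

# A 100 x 100 page becomes 90 x 90 after one step, 81 x 81 after two.
page = [[0] * 100 for _ in range(100)]
print(len(reduce_image(page)), len(reduce_image(page)[0]))        # 90 90
print(len(reduce_image(page, 2)), len(reduce_image(page, 2)[0]))  # 81 81
```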

Image rotation can be achieved in 90 degree steps and most systems allow the user to invert images or sections of images and provide image editing software too in certain applications for document annotation plus draw and paint software, selective erasure etc. Specific hardware monitor facilities include swivel and tilt and the choice of a range of display colours although black on white is generally considered best for displaying office documents.

When DIP was first introduced many vendors offered dedicated DIP display monitors which offered high resolution and avoided the issues involved in window management where there was a requirement to view document images and access mainframe screens plus index data form screens simultaneously. In some cases there may be merit in this approach but it is less popular now we are seeing the widespread introduction of graphical user interfaces such as Windows 3.0 on MS-DOS; Presentation Manager on OS/2 and all the X-windows derivatives such as Open Look and Motif. We do not have space to go into detail here on GUIs but they offer a considerable amount of flexibility for suppliers developing DIP applications and many of the vendors of IP boards such as Xionics and Kofax now offer toolkits and simple application packages for developers working in Windows.

2.5 Networking

Networks provide the links that hold all these key modules together and, given the large amount of data taken up by even a single compressed image, high bandwidth networks are required in any medium to heavy use DIP application. Users of DIP systems will require local area networking plus, increasingly, wide area networking capabilities. For local area networks there is little doubt that the two dominant LANs used for DIP are Ethernet and Token Ring with Ethernet out in front at present. Ethernet offers 10 Mbits/sec over a single bus cable while Token Ring can operate at 4 Mbits or 16 Mbits/sec.

It is invidious to state categorically how many users can be supported on one Ethernet or Token Ring network as it is obviously dependent on how many retrievals are made, how many images are requested etc but clearly there is a limit beyond which performance will suffer. Most users looking at very large enterprise wide systems are looking to a future where a number of Ethernet or Token Ring networks will be linked to a very high speed backbone network which will comply with the new Fiber Distributed Data Interface (FDDI) standard and offer up to 100 Mbits/sec. For wide area networking most users are looking to ISDN.
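A rough feel for those limits can be had with some back-of-envelope arithmetic of our own; the 50 KByte compressed image size and the 40% usable bandwidth figure are illustrative assumptions, not measurements.

```python
def images_per_second(net_mbits, image_kbytes=50, efficiency=0.4):
    """Crude ceiling on the image retrievals a shared LAN can carry;
    `efficiency` allows for protocol overhead and contention."""
    usable_bits_per_sec = net_mbits * 1e6 * efficiency
    return usable_bits_per_sec / (image_kbytes * 1024 * 8)

for name, mbits in (("Ethernet", 10), ("Token Ring", 16), ("FDDI", 100)):
    print(f"{name:10s}: ~{images_per_second(mbits):5.1f} images/sec shared")
```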

For DIP and the open interchange of documents, LAN protocols are just as important as the physical data transfer rates and here we are seeing some uncertainty as what we might call the non-IBM school of suppliers support TCP/IP today and are looking increasingly towards ISO/OSI open systems with the seven layer reference model. Meanwhile the IBM world follows SNA today.


3. SOFTWARE COMPONENTS OF A DOCUMENT IMAGE PROCESSING SYSTEM

There are, of course, a number of levels of software employed in any document image processing (DIP) system starting with the operating system and system support software and moving on up through the basic foundation software - increasingly provided by the image processing board and software vendors in order to free developers to concentrate on the provision of end-user solutions - to a range of standard application software providing the functionality for document management, storage, retrieval and output with a wide range of facilities for configuring and tuning larger systems. Most systems share those facilities although of course the requirements placed on a single user PC system vary considerably from those placed on a very large enterprise-wide system.

On top of that we have to look at data management and document filing and retrieval, workflow management and a wide range of application software which is what primarily differentiates the systems on the market. Increasingly, as standards emerge and suppliers gain confidence we will be able to mix and match and run DIP software on a range of processors and link it with a range of database management systems and control it using a range of workflow software packages and integrate it with a range of application software running at the workstation.

3.1 Operating Systems

Traditionally, operating system software has been hardware specific with each computer system model requiring a different version of the operating system. With DEC's VMS and most noticeably with UNIX we have seen a move towards common operating systems that run on a wide range of computers so users can start small and port applications on to more powerful machines as the application grows. Many users of DIP systems are likely to be in this position, growing applications from workgroup to departmental to enterprise level systems.

If we go back to the three basic DIP architectures - standalone PC based; client and server and host based - then we can see the operating systems favoured in each case. Standalone PC DIP systems will be DOS based and increasingly will also use Windows 3.0. As we move up to distributed workgroup type LAN based DIP systems then the PC suppliers are tending to move away from the single tasking DOS towards Xenix on the one hand or OS/2 on the other for multi-tasking on powerful 386 and 486 PC servers - combining DOS and Windows workstations with Xenix or OS/2 servers.

For larger, departmental DIP systems then suppliers such as Plexus, Hewlett Packard, etc are offering DOS and Windows based client workstations and UNIX servers while suppliers such as Filenet are offering UNIX based client workstations and servers. Increasingly we are seeing a convergence here as Plexus, HP, Bull, NCR and others move towards supporting both DOS and Windows and OS/2 and UNIX workstations running X windows and UNIX servers and, on the other hand, Filenet open up to support DOS and Windows workstations as well as UNIX workstations.

At the minicomputer and mainframe end then suppliers such as Philips and Wang are offering DOS based workstations and minicomputers running their own operating systems but are also moving rapidly to support for UNIX servers as an alternative. IBM is offering DOS and OS/2 based workstations and minicomputer or mainframe hosts running OS/400 or MVS/ESA while DEC is working on workstations and X terminals and PCs and their wide range of hosts running VMS and increasingly their Ultrix version of UNIX.

Looking to the future one can see that while standalone PC DIP systems will continue to build on DOS and Windows, most workgroup type multi-user DIP systems will run on UNIX or OS/2 servers. The choice at the higher end will be between UNIX running on a range of processors from micros to mainframes and IBM's evolving SAA platform which incorporates numerous operating systems. IBM itself is hedging its bets by also supporting DIP vendors such as Dorotech and IBS who are developing DIP systems based on AIX on the IBM RS 6000 as a server and DOS and Windows or UNIX workstations.

3.2 Image Processing & System Level Software

Device driver and interface software provide the interface between DIP peripherals and the control computer. As we have seen above, most suppliers employ image processing boards for performance purposes or software for flexibility which caters for the image capture, processing and compression at the input stage and for image manipulation and decompression at the printing or display stage. The vendors of these image processing engines for DIP applications include Xionics, Kofax, Discorp and Laserdata and they design them with a system level perspective to meet the needs of hardware integrators and application programmers. Alongside their board and software products they have developed a range of software designed to accelerate the application development cycle and isolate application developers.

Traditionally the application developer had to carry out programming at two levels - the creation of system level routines that enabled the document processor to interact with the host operating system and all of the other system elements as well as performing imaging functions - and secondly, the creation of application level routines that enable an operator to execute workstation tasks according to his business requirements.

The former job is carried out in a similar way by all of their customers so vendors such as Kofax and Xionics have developed libraries of program modules that are said to free up developers to concentrate solely on creating end user solutions. The use of image libraries also protects the investment in software development since the application software is then compatible across a particular vendor's range of IP engines.

These vendors started working in a DOS environment and are now porting into UNIX and X-Windows and Windows 3.0 environments and gradually a whole set of shrink-wrapped imaging software packages will appear for use by developers. Standard DIP platforms for all the preferred Graphical User Interfaces are an attractive proposition.

3.3 Standard Application Software

If we look briefly at the standard application software that is required in most DIP systems then we can list the functions as follows.

Input Subsystem.

For the input subsystem we need software to control the scanning and input of documents including any image processing, compression and the provision of temporary storage. We also need, as part of the input subsystem, the provision of Quality Control processes and a facility to assign index data to each image captured. Finally, we need the facility to link the image and index data and to assign the index data to a data management system and the image data to an image management/storage system.

Depending on the size of system this software and the workflow involved in the input process may be more or less complex. A single user PC system used for scanning in 50 pages per day will support simultaneous scan, QC and index on a page by page or image by image basis and a range of indexing options - page by page or document by document etc. A large multi-user system with 6 batch scanners capturing 30,000 pages per day will need very complex workflow software to control all the queues needed for documents scanned and awaiting QC, documents that have passed QC and are awaiting indexing, documents that were rejected at QC and need rescanning etc.
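To give a flavour of what that workflow software has to track, here is a minimal sketch of the queues just described as a simple state machine; the state and event names are our own invention, and a production system would of course hold this information in its index database.

```python
# Each scanned batch sits in exactly one queue at a time.
AWAITING_QC, AWAITING_INDEX, AWAITING_RESCAN, COMMITTED = (
    "awaiting QC", "awaiting indexing", "awaiting rescan", "committed")

# Legal transitions: a batch passes or fails QC; failed batches are
# rescanned and re-queued for QC; indexing commits the batch.
TRANSITIONS = {
    (AWAITING_QC, "qc pass"): AWAITING_INDEX,
    (AWAITING_QC, "qc fail"): AWAITING_RESCAN,
    (AWAITING_RESCAN, "rescanned"): AWAITING_QC,
    (AWAITING_INDEX, "indexed"): COMMITTED,
}

def advance(state, event):
    """Move a batch to its next queue, rejecting illegal events."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"cannot apply {event!r} to a batch {state!r}")

state = AWAITING_QC
for event in ("qc fail", "rescanned", "qc pass", "indexed"):
    state = advance(state, event)
    print(f"{event:10s} -> {state}")
```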

Storage Subsystem.

Typically the storage subsystem will comprise two discrete elements. The first is the index manager which manages the index data records - the logical record assigned to each image or document. These are usually managed in a database management system. The index manager maintains an index and tracking information for the documents stored on the system and performs the required searching of the indexes to meet user requests for documents. It has to return the result of the search with the related document index information to the requesting station.

The image server, on the other hand, has to store the images on optical disk or, in a system managed storage system, on a predefined hierarchy of magnetic and optical disk storage. It assigns each image a unique identifier and then keeps track of where the corresponding image is located and how it can best be retrieved when required. DEC, in a publicity brochure, refer to this as the hat check process and it is a good analogy. The index data management system, in response to a view or print request, will hand the image server an identifying number or a hatcheck. It is then the job of the image server, like the hatroom manager, to go and find the relevant image he assigned that identifying number to and it is his job to manage it as effectively as possible.
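The hat check analogy maps directly onto code. In the minimal sketch below - all names invented for illustration - the index manager resolves a search to a document identifier, and only the image server knows where the image behind that identifier physically lives.

```python
# Index manager: searchable fields keyed by document identifier.
index_db = {
    "D-1991-0042": {"payee": "J Smith", "date": "1991-03-14"},
}

# Image server: identifier -> physical location (the "cloakroom").
# Requesting stations never see this table.
image_locations = {
    "D-1991-0042": ("platter_17", "side_A", 83211),
}

def find_documents(**criteria):
    """Index manager search: identifiers whose fields all match."""
    return [doc_id for doc_id, fields in index_db.items()
            if all(fields.get(k) == v for k, v in criteria.items())]

def fetch_image(doc_id):
    """Image server: redeem the hatcheck for the stored image."""
    platter, side, offset = image_locations[doc_id]
    return f"<image read from {platter}/{side} at offset {offset}>"

for doc_id in find_documents(payee="J Smith"):
    print(doc_id, "->", fetch_image(doc_id))
```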

Retrieval Subsystem

Here the software must accept user requests for document display and then handle the retrieval and display of the relevant document images on the relevant display screen. It must look after local image processing and be programmable so that the image appears in the right window or in the right format to meet the user's specified requirements and it must look after the local printing of screen images/ documents. Alternatively, if the user goes on to request a print or some other output option from a central output resource then the retrieval software must be able to handle this and submit the job to the output subsystem.

Output Subsystem

On mid to largescale systems the output subsystem will run on a dedicated output server. It will look after all output requests except local screen prints. It will output documents to printers or to plotters - whether paper, microfilm or vellum etc; it will also output documents to facsimile gateways if that is an option on the system and it can output documents to magnetic or optical media if that is a requirement as well.

3.4 System Management Software.

Finally, any DIP system or general digital document management system must be supplied with comprehensive system management software to keep track of all the operations taking place on the system at any time and to cater for, and provide a way out of, every eventuality - including every type of error that can happen, from paper jams in scanners and printers to complete power failures, head crashes etc. It must also provide for the statistical reports etc needed to keep control of the system and to fine tune it for performance.

Some of the key system management functions needed on any multi-user DIP system - whether networked client/server or cooperative processing based - include the following:

The system management subsystem must provide for centralised control of the system operation; it must provide error reporting and display facilities and preferably each module (input, storage, retrieval and output) should have its own error reporting system. It should provide for statistics collection and reporting and provide for security verification. Most importantly, the system management subsystem must provide for data backup and system recovery from all eventualities with the minimum of data loss, and the system must provide full network management capabilities. When linked in with a workflow package the subsystem must provide for the control of workflow on the system and it must provide facilities for migrating image data through the hierarchy of storage media supported on the system.


3.5 File & Index Data Management

The next area we need to look at is the data management software. The whole issue of file management and database management is a complex one and one which many books have been devoted to. We are not aiming to make any profound statements here for the cognoscenti but feel that it is important that records managers and office managers be aware of some of the fundamental data management issues facing the potential user of a DIP system.

File Management

In traditional data processing we handle files which contain a collection of records and each record contains a set number of fixed or variable length fields presented in a set sequence to facilitate access. The whole emphasis is on structuring data to facilitate high speed processing. With the advent of text processing and office systems used by everyday users to create text we started to have to deal with a different set of files which contained largely unstructured text - which effectively meant files made up of a collection of byte records. As we now move into the era of image processing, where users are scanning in images of document pages, we have to deal with image files. The simple question is how do we manage such files?

As with text files, so with image files: it is clear that to facilitate the identification, storage and retrieval of such unstructured files it is necessary to associate them with or integrate them into a structured record. At a simplistic level we could decide that every image we scan in to the system is stored as a discrete image data file and each image data file is linked to a structured data record containing a set number of descriptive fields. If we were managing cheques these could be fields for cheque number; date presented; payee name; drawer name etc. At input a link is established between the data file holding the image and the image header data (how large it is, what resolution it was scanned at etc) and the identifying information that describes it logically by, for example, storing the name/number assigned to the image file as a field in the database record that describes the image. Once the image data file is linked to the database record both the image file and the database information can then be stored, and there are of course a range of issues and options surrounding where best to store image data files in largescale document management systems.
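A minimal sketch of that cheque example follows, using Python's sqlite3 module purely for illustration: the field names are taken from the example above, but the file name is invented and a real DIP system would of course use its own index manager rather than SQLite.

    import sqlite3

    con = sqlite3.connect(":memory:")
    # The structured record holds the descriptive fields; the image itself
    # stays outside the database, pointed to by the image_file field.
    con.execute("""CREATE TABLE cheque_index (
                       cheque_number  TEXT PRIMARY KEY,
                       date_presented TEXT,
                       payee          TEXT,
                       drawer         TEXT,
                       image_file     TEXT)""")
    con.execute("INSERT INTO cheque_index VALUES (?,?,?,?,?)",
                ("004217", "1991-03-08", "J Smith", "ACME Ltd", "chq004217.tif"))

    # Retrieval: search the structured fields, then fetch the image file named.
    row = con.execute("SELECT image_file FROM cheque_index WHERE cheque_number=?",
                      ("004217",)).fetchone()
    print(row[0])   # chq004217.tif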

Over the years, of course, a wide variety of file management and database management systems have evolved in order to facilitate the way we store, manage, access and process data files containing records with a set of fixed or even variable length fields within them. More recently we have seen a lot of work done on improving the way we manage and process text files. Our techniques and options have changed as our digital storage options have altered too. Initially, when magnetic tape was the main storage format for a file of any size, we had to organise and process files in a serial or sequential fashion.


Then direct access magnetic disks became widely available so when we were dealing with fixed length fields and records we could adopt direct access methods and even when dealing with variable length fields and records we adopted a range of direct or indirect access methods including the widespread use of pointers; hashing and most significantly, the use of index files. With index files we built up a separate index to the files stored on the disk based on either primary keys or, where more flexible searching was required, secondary key indexing. The extreme case of such multikey retrieval systems is, of course, inverted indexes where every word or every significant word in a text file can be extracted and used as an index key in what is effectively then an inverted index to the master file. Inverted indexing was and still is the predominant technique used by many suppliers offering free or full text retrieval systems in applications such as research libraries; legal practices etc where there is a requirement to search for textual documents stored in a file by any significant word or any combination of significant words contained in the documents.
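The principle of an inverted index is easy to sketch. The few lines of Python below are our own illustration - they ignore stopwords, stemming and proximity operators, which the full text retrieval systems described above would add - but they show the essential inversion and a Boolean two-word query:

    documents = {1: "claim approved by head office",
                 2: "claim rejected pending review",
                 3: "office review completed"}

    # Invert: every word points back to the set of documents containing it.
    inverted = {}
    for doc_id, text in documents.items():
        for word in text.split():
            inverted.setdefault(word, set()).add(doc_id)

    # A Boolean AND query is then a set intersection on the index.
    print(inverted["claim"] & inverted["review"])   # {2}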

All of these file management options are available to users of DIP systems with the very significant proviso that if users want to have a free text index to a set of document images they have scanned into a system then they will either have to key in an abstract of each document and provide free text searching of the abstract, or they will have to apply OCR/ICR techniques to convert the text content of the document into ASCII coded characters which can be stored in a text file which can then be inverted.

Database Management

So far we have only been talking about file management systems, however. In order to provide greater flexibility and to provide greater independence between the programs developed by the user or third parties and the data (structured and unstructured) we have seen the widespread introduction of a range of database management systems of differing functionality. In mainstream data processing, database management systems (DBMS's) have been developed to enable multiple simultaneous users to extract information for a range of applications with high levels of system reliability and at a reasonable cost.

One of the key advantages of using a DBMS over an integrated file system is therefore the maintenance of data validity - the provisions made for data integrity, security and concurrent access. As DIP systems migrate from personal and workgroup filing systems to departmental and enterprise wide operational systems, such facilities have proved vital.

If we had to summarise the key benefits of DBMS they would be that they allow the same data to be viewed in a number of ways; they provide the vital map between application programs and the logical database; they provide physical data independence, which is critical in the evolving world of optical data storage; and they avoid data redundancy. To ensure a high level of logical and physical data independence, the ANSI (American National Standards Institute) study group on DBMS's distinguished three levels of data description. The top level is the Conceptual schema or the Logical Database

Description describing the syntax and semantics of the data types that exist in the enterprise.

The second level is the External schema and represents a user's or an application's specific view of the database, which might only be a partial view. This is referred to as the Functional Database Description. The third level is the Internal Schema or the Physical Database Description which deals with the storage structures employed and the access methods used.

Over the years we have seen three data models developed and widely supported, which are referred to as the Hierarchical Data Model, the Network Data Model and the Relational Data Model. The relational model is today the model of choice for most new systems due to the high degree of data independence it provides, the efficient way it deals with data semantics, consistency and redundancy issues and the power of the base model. Data is presented in two dimensional tables and relational algebra is used to manipulate the tables. Most importantly, the relational model is becoming an extendible model that can be used to describe and manipulate both simple and complex data. Of relevance to the DIP marketplace is the considerable work being carried out to introduce object orientation into relational databases.

DIP & Database Management

How have DIP and digital document management suppliers made use of all the work that has gone into file system and database management system design over the past three decades and what additional features do we or will we need in order to meet our current and future increasingly complex requirements to manage the unstructured data held in digital documents - whether image or text or graphics etc?

Well, the first lesson should be that in all cases, except perhaps for the smallest standalone personal filing systems of the type being developed by Canon with their Canofile 250, future DIP systems will need to employ database management systems to safeguard the user's investment in developing the application and in scanning and indexing all his documents.

One of the most pressing issues being discussed at the moment is whether the unstructured data contained in documents (image, text, graphics etc) is held in the same database management system that is used to manage the structured data (ie extend relational tables to manage Binary Large Objects), or whether it should be managed in a separate object oriented database management system dedicated to tracking the location of all the objects that go to make up documents and the relations between them, or whether, thirdly, the document images or text files etc are simply held outside the database as files and pointed to from database records.

The issues are many and varied and may represent an evolution of data management techniques rather than a series of right/wrong scenarios, but it is important for users considering large systems to at least be aware of the key data management decisions that have already been made for them by the supplier.


Extended Data Processing

The leading vendor today who has taken the approach of extending an existing relational database to manage large objects is Plexus who started with their eXtended Data Processing (XDP) model and oriented its architecture around a relational database and high level language paradigm.

Large structured and unstructured objects were added to the syntax of the database and associated programming language. The management of optical storage, removable media, events and data conversion are implemented as extensions to the semantics of the RDBMS and fourth generation language development environment. The Plexus DataServer is a set of software products which cooperate to provide complete data services for alphanumeric and compound data behind a standard SQL interface. This includes the option to store large objects on magnetic or optical media and integrated support for manual or automatic mounting of removable media.

To accommodate unstructured compound data of the type held in documents (image, text, vector graphics etc) Plexus have added two new data types to those normally supported by RDBMS and SQL. Columns of tables can be declared to be of type 'Bytes' or 'Text' and such columns are of variable width. No maximum length needs to be given during data definition and each instance of data in a Bytes or Text column can be up to 2 GBytes long. In addition a client application can, using SQL, cause the database to store Bytes or Text columns on optical media. The allocation of data on optical media can be designated as either sequential or clustered.

Both the database and SQL have been extended to support these new data types. The Bytes type was provided to allow applications to store any type of large object and the database manager itself has no knowledge of the contents of a Bytes column. This means that an application can store digitised image or audio data, programs, CAD/CAM vectors or anything that can be stored in a file. The SQL select, update and insert statements have been extended to permit the consistent storage and retrieval of such columns.

Additional database semantics have also been implemented for text columns and SQL has been extended to support them. Text columns can be defined as indexed via SQL and when the data is inserted or updated in them the database manager updates a full text index on that column in real time. Thus they have integrated a full text facility within the database which will be required in an office systems application.

SQL's syntax has also been extended to incorporate a Contains clause. Embedded within an otherwise normal select statement, the Contains clause identifies individual words, as well as Boolean combinations of words to match and proximity indicators. The SQL query optimiser incorporates the word occurrence clauses together with queries involving fielded data and full text data.

As a result of these conceptually simple extensions to a standard Informix Turbo RDBMS, Plexus claim to be able to offer a powerful and flexible SQL server as a basis for XDP applications. They have extended SQL to support arbitrarily long objects. In the 4GL programming environment additional semantics for images are known and the interface to any datatype from the 4GL is said to be very natural. There are no syntactic changes in the way SQL is used. The underlying program variable has associated with it a locator variable and contains fields which indicate the location and, if appropriate, the length of the data.

In the case of an insert the Bytes data may be sourced from a location in PC memory or a PC-DOS file or it may be streamed from a compression/decompression card as the result of a compression operation. Plexus claim that storing image data as a fundamental part of the database has significant advantages over storing it in a separate subsystem. Areas they point to include ease of use; transaction integrity; access control; concurrency control and the provision of backup/recovery procedures.
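We have not seen Plexus's actual syntax, so the fragment below is a hypothetical rendering - held in Python strings purely for presentation - of what the table definition and Contains query described above might look like; the table and column names are our own invention.

    # Hypothetical rendering of the extended SQL described above; the exact
    # Plexus/Informix syntax has not been verified and all names are invented.
    create_claim_file = """
        CREATE TABLE claim_file (
            claim_no  CHAR(10),
            photo     BYTES,    -- variable width unstructured object, up to 2 GBytes
            report    TEXT      -- full text index maintained in real time
        )"""

    find_claims = """
        SELECT claim_no FROM claim_file
        WHERE report CONTAINS ('subsidence' AND 'bay window')"""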

Document & Content Architecture

However, whereas with traditional database management systems we were storing away ASCII or EBCDIC codes which had been universally agreed, with image data and with text and vector graphics and with compound documents in particular we are storing away unstructured data which is mainly held in a proprietary format and which can only be accessed by proprietary application software today. What are the implications of this for users wishing to build up very large digital document databases which they will want to be able to access and use in years to come?

We appear to need standards for managing both the data types that make up the contents of documents plus standards for what we actually mean by a document i e a standard document model. Here we are back to where we came in - what do we mean by a document?

We have to bear in mind again the three concepts of a document's logical structure; its layout structure and its contents. Ideally we need standards for the contents and their interchange and we then need a standard model for defining a document's logical structure and its layout structure. This is not to say, of course, that there will be one standard logical structure or layout that can meet all needs. There is not. What it is saying is that there needs to be a standard model and set of standard procedures we can use so a range of application software and a range of systems can read files and understand what they contain - what the logic of a document is, what the layout structure is and what contents it comprises.

The simplest model, employed by most early suppliers of dedicated standalone DIP systems for the records management and transaction processing environment, is to adopt the stance that all documents will be made up of the content type raster image graphics and that one image will be laid out as a page and the document will be made up of one or more pages. With this approach they can adopt the facsimile compression algorithms to compress each page image and they can use either a proprietary image file format or a de facto standard such as TIFF in order to store the image or images that comprise the document. One then assigns one identifying data record to one TIFF file and you have a data record that points to a document that comprises

one or more page images stored in facsimile compressed format. The image file contains all the application specific data such as what size it was originally, what resolution it was scanned at, what scanner it was scanned at, what compression algorithm was used etc.

The database management system knows nothing about the file except where it starts and ends and upon request the image file is transferred either to the user's workstation where the header data is read by the workstation software so that it can be displayed and viewed etc or to a print server where it is formatted for printing on a laser printer etc.

At present there are still some issues to be settled here. Firstly, the fax standards were geared to transferring images from one fax machine to another for immediate printout. When they are adapted for use in digital document management systems where they may be stored in a system for many years and then displayed on a range of screen sizes we get into some byte ordering issues. A committee of the US Association for Information & Image Management has been set up to try and agree standards here.

Secondly, there are a wide range of versions of TIFF and some suppliers have registered clever tagged fields describing how many images there are in a document etc - all of which needs tidying up in specific applications. The US CALS initiative is designed to agree header formats for Group IV facsimile images. It is further complicated by the need to cater for tiled and untiled images.

Assuming these issues are resolved, such a simple document and content model may work well in a standalone PC based system and mean that indexing can be kept to a minimum. It may be reasonable when dealing with items such as cheques, vouchers etc where typically each document represents only 1 - 2 images (front and back). However, it is less well suited to systems where large documents of several tens of pages or more are being filed away. Here it often becomes necessary to subdivide the document into sections so that the image files do not become too long. The high storage overheads associated with image data mean that typically we do not want to be passing large multi-page files across networks to workstations when we have the alternative of only transferring the page or the section of a document that the user wants.

Moving away from records management and transaction processing to the office and publishing systems environment - where we may want to manage documents in image format alongside documents held in text format or compound documents comprising text, image and vector graphics, and where some users will want to store documents in a revisable format (i e so we can edit them if we are authorised) and others in formatted form (i e where we can only view or print them, whatever our authorisation) - we quickly find that such a limited document model does not go anywhere near meeting their requirements.

Here we need to employ a document model where the logic of the document, the layout and the contents required are far more varied. Here we may wish to store documents as page images if we have just scanned them in or received them via facsimile but we may also want to store documents

in text format if they were created on a word processor or in a compound format if they were created on a desktop publishing system and comprise text plus image plus vector graphics. Hence the document model must go down to the object level. Each document will be marked up with a more or less sophisticated logical structure that will start with a table of contents and include a concept of sections or chapters and then could even go down to the paragraph, sentence, word or character level.
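A document model that goes down to the object level can be pictured as a tree. The sketch below is an illustrative Python structure of our own (not any standard's actual data format): sections nest down towards paragraphs, and each leaf carries its content type.

    # Illustrative logical document tree; not an actual ODA/SGML data structure.
    document = {
        "title": "Survey report",
        "sections": [
            {"heading": "Introduction",
             "parts": [{"type": "text", "content": "para-1.txt"},
                       {"type": "text", "content": "para-2.txt"}]},
            {"heading": "Site plan",
             "parts": [{"type": "raster", "content": "plan.tif"},
                       {"type": "vector", "content": "overlay.cgm"}]},
        ],
    }

    # Walking the tree lets a system fetch one section's contents rather than
    # shipping the whole multi-page document across the network.
    for section in document["sections"]:
        print(section["heading"], [part["type"] for part in section["parts"]])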

Equally we will need a range of layout/output options which we can decide on when we want to output the document. Do we want to print it in page format as it was designed to be printed (if it was designed with hardcopy in mind), i e WYSIWYG (What You See Is What You Get), or do we just want to print out an image from the page or some text or what?

To handle these requirements the suppliers have come up with two sets of standards. The first, SGML (Standard Generalised Markup Language), is primarily geared to the needs of the publishing industry where there is no need to determine the layout of the document until there is a requirement to publish it, and hence the emphasis is on standards for marking up documents logically and less on defining standards for dealing with specific content types and their layout.

The second is a series of compound document architectures. The international standard is the Office/Open Document Architecture (ODA) and Office Document Interchange Format (ODIF) standard. The major computer companies also have their proprietary supersets of these standards which in the case of DEC is their CDA (Compound Document Architecture) and DDIF (Digital Document Interchange Format) and in the case of IBM is their MO:DCA (Mixed Object Document Content Architecture) and DIA. These are designed for machine to machine or system to system communication and operate at a level above image file formats such as TIFF which only deal with one specific content type. They offer users and system developers an envelope in which a specific document can be placed for transmission across local and wide area networks and for interchange between systems. They effectively standardise the header information that will be contained within that structured hierarchical envelope. The first header will state that it is a document, the standard it follows and details of its logic and layout. Subsequent files within the envelope will contain the contents with their own header data describing how they are to be processed and handled.
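The envelope idea can be sketched the same way (again a loose illustration of our own; real ODA/ODIF and CDA/MO:DCA encodings are binary and far richer than a dictionary): a top-level header declares that this is a document, which standard it follows and its logic and layout, and each content part carries its own processing header.

    # Loose illustration of a compound document 'envelope'; not a real encoding.
    envelope = {
        "header": {"kind": "document", "standard": "ODA/ODIF",
                   "form": "formatted"},                  # or "processable"
        "parts": [
            {"header": {"content": "text", "charset": "ASCII"},
             "body": "covering-letter.txt"},
            {"header": {"content": "raster", "compression": "Group IV"},
             "body": "page-1.g4"},
        ],
    }

    # A receiving system reads the outer header first; each part's header then
    # tells it how that content type is to be processed and handled.
    print(envelope["header"]["standard"], len(envelope["parts"]))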

There will also, of course, be facilities for converting from the ODA standard to proprietary standards such as DEC's and IBM's, so that a document created on a DEC system could in theory be translated into ODA/ODIF and then transmitted to an IBM site where it would be converted into the IBM format without any data loss.

In addition, provision has been made within these standards for transferring documents in formatted form where the end user simply needs to view them in the format in which they were created or to print them out in that format. Alternatively, provision is also made for transmitting them in processable form where a user with suitable editing software could

process the documents, i e adding text to a word processing document; editing and manipulating image data etc. Thirdly, there is provision for transmitting documents in a formatted/processable form which is really WYSIWYG (What You See Is What You Get) for applications where users want to see exactly the impact that their editing has on the layout of the document etc. Most of the suppliers are concentrating on implementing the formatted version of the standard first as clearly it is less complex and today it meets a large percentage of DIP application requirements.

Once these document models and content standards are firmer, a large number of vendors including ICL, Bull and others are committed to building the model into their Digital Document Management Systems. It is not difficult to conceive how the Plexus XDP model, which has been licensed by major Open System vendors such as NCR, DEC and others, could be developed to embody the ODA document model. At that time the database could be made increasingly more object oriented and, instead of not knowing anything about the Bytes it stores, it could be programmed to respond to requests for formatted or processable documents and to transform the data it holds and output it in the relevant data sequences.

Document Filing & Retrieval

Having decided on the data management system and the document model to be employed, the next major decision for suppliers and developers of Digital Document Management systems, including DIP systems, relates to the way in which the documents or, more accurately, their contents, are physically filed and retrieved, and how the data management system indicates the relations that exist between the documents and their contents as opposed to the structured data records used to identify them. In the majority of cases suppliers are faced with taking the existing traditional office document filing hierarchy (filing cabinet; drawer; folder; document) and linking it to the data model employed in the structured database management system (hierarchy; network; relational). Finding the right match between these two distinct but related requirements can be tricky at both the logical level and at the physical level. At the logical level, if we attach a second hierarchy to an existing hierarchy we will reduce our flexibility and search performance will be impacted too. Here we are looking at the increasing number of applications where DIP facilities need to be integrated with an existing DP line of business application.

If we have a policy database in an insurance company with records that contain fields for policy number; client name; type of policy etc, then we expect to search it in a fairly structured way and it will be optimised for that purpose. If we then add in a document database and group the documents in policy number files, we have to bear in mind the requirement to search for 'all correspondence from a certain client', which will represent documents stored in several files if the client has several policies etc.

At the physical level we also have to develop the right strategy for storing related documents given the relatively slow performance characteristics of optical disk storage devices and in particular jukeboxes. If we simply scanned in page images and stored them on the next free sector of the

optical disk, changing sides and disks as needed and not leaving any space, then in an active case handling application documents that belong to one case, and hence one folder, would be stored on numerous different disks and sides. Although the index database could provide us with a list of all the documents that belong to the folder, provided we indexed them down to that depth, if we wanted to view them all or print them all out then this would necessitate a lot of work on the system and hence relatively slow performance times.

To avoid this problem we could either index documents down to a level where in most cases users do not need to view the whole file - they will just browse the index and select the documents they need - or we could develop a clustering methodology where we allocate certain files to certain disks and then write all the documents that we would normally file into those files to those specific disks, leaving sufficient space for overflows etc.
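A simple form of the clustering idea can be sketched as follows (Python, our own illustration; real systems must also handle the overflow case that the sketch deliberately ignores): documents are written to the volume already holding their folder, rather than to the next free volume, so a whole case file can be read back from one disk.

    # Illustrative folder-to-volume clustering for optical storage.
    folder_to_volume = {}                       # folder id -> optical volume
    free_volumes = ["vol-01", "vol-02", "vol-03"]

    def store(folder, document):
        # Keep all of a folder's documents together on one volume, assigning
        # a fresh volume only when a new folder appears. (No overflow handling.)
        if folder not in folder_to_volume:
            folder_to_volume[folder] = free_volumes.pop(0)
        return folder_to_volume[folder], document

    print(store("case-118", "letter-1.tif"))    # ('vol-01', 'letter-1.tif')
    print(store("case-118", "letter-2.tif"))    # same volume, so one disk fetch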

This clustering issue can be very complex when starting to build document databases that may grow to hold many millions of documents. If we decide in future that the documents will not just be stored as page images but could be stored as huge text files or as a mix of image, text and vector graphic files plus layout and logical data, then the problem of how best to store the data and group it becomes beyond the wit of man to solve and has to be delegated to system managed storage programmes and algorithms which are, we are told, being written now by several vendors.

Considerable attention is being devoted to the document management and document filing issues at present and users used to data modelling exercises for structured databases will have to rethink some concepts in order to ensure their document management systems are optimised for flexibility and performance. Internationally, the same groups who have worked on the Open Document Architecture and Interchange Format are now addressing the Document Filing and Retrieval issue and looking at standards here to guide users and developers (ISO Draft DFR).

3.6 Workflow Automation

In the sections above we have shown the hardware and software we need to develop a DIP or digital document management system that can capture documents, store them, assign index data to them and allow multiple users to search the index database, identify the documents they want and then request and view or print them. In order to provide for full workflow automation, i e to programme the way documents are handled in an organisation, we need to build on top of the database management system using workflow software.

Workflow software is just one of a set of new categories of work automation that will come to dominate the office in the 1990s. Increasingly, now that networks are in place and documents can be handled digitally, people will work together in a cooperative fashion using the network to aid communications, rather than simply as a way of sharing access to printers or mass storage.

Other categories of work automation will include second generation electronic mail services; collaborative document composition systems and distributed project management.

Workflow automation is the cooperative processing of documents throughout their processing life. Examples where workflow automation is already employed and where the biggest benefits have been obtained include claims processing in an insurance environment; credit application processing; work order processing and engineering design and change control in a manufacturing environment where numerous people have to work together to suggest, agree, authorise, implement, approve and release changes to existing drawings and documents.

Workflow automation in a DIP environment has evolved over a period of 10 years and is still evolving.

In the early 1980's, primarily in the US, very large organisations with major paper handling problems paid systems integrators to develop proprietary workflow automation systems for them. These were very high cost, long term projects and nothing shrink-wrapped emerged from them, but a lot of know-how was gained about the requirements of workflow automation.

From the mid 1980's onwards we saw the emergence from specialist imaging companies such as Filenet of some general purpose but still proprietary workflow oriented languages. These were extensions of general programming languages designed to provide support for workflow commands for document processing. They simplified the job of specifying the workflow for the user but they were only designed with specific applications in mind and could only be used on specific platforms.

From 1988 - 1990 we saw the introduction of workflow application toolkits from PC based workgroup DIP suppliers such as Viewstar in the US, IMC and others. These were rich in functionality and easy to set up but often the platforms they were based on would not support large applications because of performance problems etc.

We are now moving into the era of integrated workflow products where workflow can be written in 4GLs on top of database management systems, and where the major suppliers are building standard platforms on to which workflow providers can then port as many or as few of their products as they wish to support. An example of that has been Computron, which has developed a sophisticated workflow automation package and is now porting it from the Wang proprietary VS platform to run on IBM ImagePlus and DEC's DECimage Express platforms. IMC is another example with ImageMere.

FCMC's STAFFWARE procedures processing system has been designed to be integrated with a range of other software products including databases, electronic mail, text processing and general office applications such as Document Image Processing. STAFFWARE has been adopted as part of their strategic UNIX office automation software by ICL (POWERFLOW); UNISYS (OFIS PROCEDURES); IBM and BULL. Most recently STAFFWARE has been integrated with Dorotech's Dorodoc document image processing system


to provide workflow automation. Dorodoc runs on AIX on RS 6000. The DIP software was originally developed by Racal Imaging to run on Sun platforms.

It was taken over by Dorotech and integrated with their optical storage management system. Dorodoc was then ported across to the RS 6000 and now STAFFWARE is being integrated with it. Such manoeuvres will increase as the image processing component of DIP systems becomes shrink-wrapped and the workflow packages, data management options and storage management options become more and more portable and flexible.

Workflow automation applications are developed by suppliers in three ways. The first way is to develop in 'C'. This gives flexibility but is bespoke, expensive and time consuming to do, and if users' requirements change considerably, and they will do, it can prove expensive to change the application. If that is the option offered to you, you will at least want to see whether the software you are paying to have developed becomes your intellectual property in time.

The second is by using a proprietary workflow language such as Filenet's Workflo. This is now highly developed and is the leading system used today but it has its limitations and ties users in to one supplier at present, although Filenet are gradually opening it out so Workflo applications can now be developed on PCs as well as on their proprietary workstation.

The third and most popular now is to develop in a standard 4GL on top of a relational database management system. This gives flexibility but complex workflow applications take many man years of development effort and the major vendors are still working on the best compromises - providing standard support for standard operations and the ability to string them together and customise specific elements for each new application that comes along.

The key enabling technologies that make workflow automation feasible include the widespread introduction of networking and communications to provide firstly workgroup connectivity and information sharing and then move out to wide area networking and enterprise-wide connectivity.

Distributed database management can also be critical in a workflow environment, with systems needing to support multiple data types (image, text and structured data); the need for cooperative processing or client/server architectures; the need to use scalable back-ends (ie processors to manage the index database and image database which are infinitely expandable) and the need to support workgroup level processing.

To date the majority of workflow automation software has been geared to Process Oriented Workflow. This is typified by the white collar factory (claims processing etc) where we have a highly structured environment and where clerical and supervisory workers are carrying out a large number of highly repetitive tasks. This calls for application controlled processing and the use of programmable process logic.

When looking at reference documents and knowledge workers, just as we need to start looking at sophisticated text retrieval software for indexing, so we need to look at much more task oriented workflow to support flexible, dynamic environments where the system has to be user driven, where there is a complex process logic and where we are automating knowledge workers.

The latter is far more complex to automate. Much of the interest in hypertext systems is in response to the challenge of trying to define the complex relations that may exist between documents and a group of knowledge workers working on them. Hypertext systems create links between documents: as two knowledge workers reference documents in a conversation the documents can be brought up and links made between them, and when a knowledge worker establishes links between data in the files he can create temporary links and save the relationships between them. When built on top of relational database management systems such as the Plexus XDP application these systems can be very powerful, but they require a considerable amount of planning and analysis and we would not recommend these sorts of applications for initial DIP applications.

Every workflow system is put together using five key elements - Information Units; Queues; Activities; Procedures and Administration Tools.

The information units are of two types. The first we have defined when looking at indexing the documents. We must define what we mean by a folder and a document and the indexing data we need to identify them etc. The second type is the vital control information we need to control the workflow. We have to define the status codes that a document can go through in its lifetime; we have to control the process logic and we have to agree the level of security we need in an application.

The Queues are the electronic "in-baskets" - they represent the input/output repositories for tasks. The exact role of the individual queues is application specific and must be flexible, but there are some standard queue operations that can apply to all queues. As an example, after documents are scanned in as a batch they may be assigned to a quality control queue. All documents in that queue are awaiting quality control.

They can only be accessed by the two quality control operators we have set up and the documents are at a particular status. The QC operators will either be able to access the queue and choose which document they take out of it next, or they will simply be assigned the next document off the bottom of the queue for QC.
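The queue behaviour just described is easy to sketch (Python, our own illustration; real workflow products add status codes, security and audit trails on top of this):

    from collections import deque

    qc_queue = deque()                    # the QC operators' electronic in-basket

    def scan_batch(batch):
        for doc in batch:
            qc_queue.append(doc)          # scanned documents await quality control

    def next_for_qc(operator):
        # Only the two authorised operators may take work off this queue.
        assert operator in ("qc-op-1", "qc-op-2"), "not authorised for QC queue"
        return qc_queue.popleft()         # next document off the bottom of the queue

    scan_batch(["doc-1", "doc-2"])
    print(next_for_qc("qc-op-1"))         # doc-1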

The Activities are what happens to documents between being taken out of one queue and put into another queue - quality control is one, indexing is another, extracting data from documents and putting it into an application is another, and editing documents would be one in specific cases - we look at the applications below.

Procedures are the scripts which the developer writes to define how the information moves between the queues and what activities need to occur at each point before the information moves to the next queue.
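In these terms a procedure is little more than a routing table. A sketch (again ours, not any product's scripting language) might chain the queues like this:

    # Each entry: queue -> (activity performed there, queue on success, queue on failure).
    procedure = {
        "scanned":   ("quality_control", "passed_qc", "rescan"),
        "passed_qc": ("indexing",        "committed", None),
    }

    def advance(queue, succeeded=True):
        activity, on_ok, on_fail = procedure[queue]
        return on_ok if succeeded else on_fail

    print(advance("scanned", succeeded=False))   # 'rescan'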

Finally, there are the all important administration tools, and really the quality and flexibility of those defines how well the system functions, how easy it is to develop the application, how flexible it is and hence how quickly and cheaply it can be changed, and what level of statistics can be gleaned from them and hence used to improve the performance of the system or to improve customer service and answer customer enquiries.

The Process Definition Tools are used to set up document groups (files etc); to verify workflow procedures; to analyse input/output queues and to control some of the activities users can have access to. Monitoring and Control Tools are needed so that supervisors can go in and check on the status of any document; so that they can go in and generate reports of various sorts to monitor system performance and productivity levels, and so that they can analyse the whole process and check for bottlenecks, balance etc.

Finally, the application development tools are used to define what is meant by a document in each application; to specify the queues, activities and processes; to create the whole configuration; to develop and change user interfaces and views into the system and to develop trial demonstrations prior to going live.

There is no doubt that it is workflow that is critical and which holds the key to productivity improvements and doing more work with the same number of staff and providing a better service to customers. However, we would like users to take away from this introduction the message that while the workflow functionality can be the key to productivity improvements, the strength of the basic digital document management hardware and software platform and the system management tools provided etc - is equally important to the successful implementation of a system.

3.7 Applications - What Do People Want To Do With The Documents?

This is another area that needs to be looked at carefully when assessing any application for DIP. We have looked so far at how we convert documents to digital format to overcome the problems associated with paper. We have indexed them so that we can identify the individual documents and the relationships between them, and we have also looked at how we can control and automate the workflow through the use of application development tools and the benefits that will bring in some applications. We must now look at what users need to do with documents once they have retrieved them - what "Activities" they need to perform on them.

With paper or microfilm systems all users can do is look at them and read the information or take a copy or printout of a document - they serve an evidentiary or reference function. With transaction and evidentiary documents this is normally all that is required. As part of the application we might use OCR or MICR to read data off the document but that is as far as we would want to go. The documents are held in image format and we would not want to change them other than to simply scale the image to fit on different size screens or pieces of paper at output.


However, if we are looking at integrating DIP into an electronic office system where documents are created in text form, or with a desktop publishing system where documents are created from text, image and graphics, then the situation would be quite different. Here users would not only want to be able to retrieve a document and view it, they would also want to be able to take a copy of the document and edit it - delete parts of it - use paragraphs of text in other documents etc.

Similarly, but in a more controlled way, users of engineering drawings, maps and plans may want to take a copy of a digital drawing in raster form and edit it using raster or vector editing tools and up-issue the drawing or create a new drawing from it.

These distinctions are important as they influence the document architecture that must be employed and the complexity of the data management system that needs to be employed.

In the first case where all users want to do is view documents, then all documents could be stored in a "formatted" or image format irrespective of whether they were paper documents scanned into a system or text files downloaded into the system as formatted page images or structured data records formatted up and downloaded in report format. Today most specialist DIP suppliers (Filenet etc) store all documents in this format. The system assumes all documents are stored in image format and handles them all in that way.

In the second case, where some users want to just view documents and others want to be able to take copies of them and reuse the information in the copy - edit it, delete part of it etc, then the system must be capable of holding documents and delivering them to users in either "formatted" form or in "revisable" form.

This is analogous with a word processing system today where text files are held on the system and brought to the user in revisable form so he can edit them but when he sends them to a printer they are sent to the printer in formatted form.

In this case, we would need documents of some legal significance to be held in the database in formatted form but with sufficient data to identify the originating application. Then, authorised users would be able to go in and check out a copy of the document from the database back into their "Work In Progress" file in revisable form with the application software needed to edit them.

A third, additional option is that users would want to access documents in formatted/processable format. This equates with desktop publishing WYSIWYG (What You See Is What You Get) where the document appears as it will be when it is formatted and sent to the printer but a user can still edit it and see the impact that editing will have on its layout.

At present, providing medium to largescale DIP systems with revisable capabilities is complex and very few suppliers provide them. However, it is important to select a supplier with a compound document architecture that will allow applications to be extended to support such requirements in future if they are ever likely to be needed. A compound document architecture is designed to do just that.

Another way many users need to work is to be able to annotate documents without actually affecting the original. Many suppliers of technical DIP systems offer this capability where one starts with a raster image "base layer" and then up to 256 vector layers can be superimposed upon it. These can be used in the office as electronic "post it notes" so users can scribble comments on documents and send them to a colleague for his comment. The second person's comments are placed on layer 2 and a user can retrieve just the base layer, or the base and 1, or the base and 2, or the base and 1 and 2 etc. At some point when the query is resolved the overlays can be deleted, or they can be burned into the raster original or stored as an attached raster file etc according to requirements.
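The layering scheme can be sketched as a base image plus numbered overlays (Python, our own illustration; the 256-layer limit is the one quoted above, everything else is invented):

    # Base raster plus up to 256 vector overlay layers, retrieved selectively.
    annotated = {"base": "drawing.tif", "layers": {}}

    def annotate(layer_no, note):
        assert 1 <= layer_no <= 256, "suppliers quoted above allow up to 256 layers"
        annotated["layers"][layer_no] = note

    def view(*layer_nos):
        # Retrieve the base alone, or the base plus any chosen overlays.
        return [annotated["base"]] + [annotated["layers"][n] for n in layer_nos]

    annotate(1, "Check this dimension - JB")
    annotate(2, "Agreed, revised - KT")
    print(view(1))        # base plus the first comment only
    print(view(1, 2))     # base plus both comments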

Another requirement that often has to be dealt with is hypermedia links. Often, to carry out conferencing, a user will want to take documents or extracts from documents from a number of "logical" folders, staple them together electronically and create a new temporary "Work In Progress" folder. He will then attach some annotations or comments to the folder and circulate it to people via electronic mail. The way this WIP folder is created is by placing pointers from existing folders to the new folder, effectively creating links between documents. Such "hyper links" can be built on top of existing relational database management systems and extended systems such as the Plexus software.
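The electronic staple can likewise be sketched with pointers rather than copies (our own Python illustration; the folder and document names are invented): the work-in-progress folder holds references into the permanent folders, so nothing is duplicated and the links can simply be discarded when the query is resolved.

    # Permanent folders hold the documents; the WIP folder holds only pointers.
    folders = {"policy-118": ["letter-1", "claim-form"],
               "policy-342": ["survey-report"]}

    wip_folder = [("policy-118", "claim-form"),      # (source folder, document)
                  ("policy-342", "survey-report")]

    # Circulating the WIP folder circulates the pointers plus any annotations;
    # deleting it afterwards leaves the source folders untouched.
    for source, doc in wip_folder:
        assert doc in folders[source]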

Again, however, the data management issues here are complex and as a guiding principle most users today want to select a platform that is capable of being extended in this way rather than selecting an application initially that depends on these advanced facilities to be feasible or to obtain maximum benefits.



4. ASSESSING YOUR REQUIREMENTS.

There is not time or space here to try and propose a methodology whereby users can assess their requirements for DIP and digital document management systems. As a general guide, however, we feel that users should look at DIP and digital document management systems from four perspectives.

4.1 Paper Replacement

The first one we would refer to as the paper replacement perspective. Here we are looking at DIP and digital document management as a way of avoiding some of the problems we face with document handling that can be attributed to the sheer physical nature of paper. We may well already have a structured data processing application that manages structured data but that then references the related unstructured data held in paper format. Examples include a case processing system that references policy documents etc held in paper form; a computerised index that references articles held in paper format etc. If we install a DIP system then the fact that the documents will be stored in digital format will bring us some benefits and may have some disadvantages too:

The benefits of such an approach include the following:

i We would avoid the need for two separate information management systems (structured information on computer and unstructured information in a paper based records management system). This normally translates itself into staff savings.

ii We would increase security as the files would always be available and could never be physically lost.

iii We would improve access because multiple users could view the same documents on screen at the same time without leaving their desks, and we would avoid the well known problem: a file is active, which is why you want to see it, but that means it is not in the store - it is already out on someone's desk and has to be tracked down.

iv We would improve access and copy facilities by distributing hardcopy printers around where users could take copies of pages etc.

v We would save space as in many cases the paper documents could be destroyed or at least moved off site.

vi We would now be able to make multiple copies of the data and store copies offsite, so avoiding the risk that a fire today would destroy the only copy of vital government records.

vii We would tend to reduce the volume of photocopying carried out by departments as documents can be viewed by numerous users at the same time and are always available; hence, as users became confident in the system, they would rely on one central file and not make copies of every document in circulation just to be sure that they could retrieve them if they needed them.


viii We would have the ability through facsimile and electronic mail/networking to transmit images of documents between departments or between departments and external private or commercial clients.

ix We would avoid the wear and tear caused to files and documents by the need to keep them in active offices and by constantly retrieving them and replacing them in poorly designed file cabinets.

x We would avoid what can only be described as a potential fire hazard caused by the vast volume of paper stored in some offices.

xi We would avoid what could shortly become, in some offices, a Health & Safety At Work issue - with filing cabinets, shelves and files stacked on floors restricting access and likely to cause staff accidents.

These are some of the benefits that can be obtained by the switch from paper to electronic document storage - just avoiding the bulk and inflexibility of paper. There are many examples where just one of the above benefits has proved more than adequate to justify a large-scale or small-volume application of DIP technology. Nuclear installations are required by regulatory powers to keep all documents accessible within a minimum timeframe or they can be closed down. Only a DIP system can provide them with that degree of certainty and security. Companies faced with having to relocate or build extensions to their offices have justified DIP on space savings alone.

Some of these benefits - space saving, security etc - can be obtained by using microfilm, but the need to photographically process microfilm and the limits to retrieval automation with microfilm mean that it cannot really be used for very active retrieval. On the other hand, microfilm is today a considerably cheaper option, so if a user only has a paper storage problem and not a paper access problem then microfilm is the prime candidate for automation. It should also be considered where some of the following disadvantages of DIP outweigh the benefits.

Some of the potential disadvantages or, more accurately, technical and administrative issues associated with DIP as opposed to paper filing - again just looking at the physical aspects of the system - can be listed as follows:

i Every incoming paper document may have to be physically scanned into the system - this is an overhead requiring staff, and usually sufficient staff so that all the morning's post can be captured quickly and then distributed to staff. In most cases the extra staff needed for input are more than counterbalanced by the staff saved in filing, retrieving and refiling documents, but one should be aware of this requirement. It leads one to consider only those systems where the documents are being actively retrieved, and to look less favourably at pure archival collections of documents.

ii Every incoming paper document has to be identified and associated with an index data record so that it can be retrieved subsequently.

iii If the system itself is unavailable for any reason then none of the documents held on it can be accessed for the duration of the failure. This argues for installing computer systems with even greater resilience and even higher Mean Times Between Failures (MTBF) than today, and this can be expensive. Alternatively, in the early implementations it may mean opting to introduce DIP in parallel with paper filing systems so that there is a backup in such circumstances.

v Legality. There are still some issues relating to the legality of documents stored on DIP systems. These are expected to be resolved but to date the question has not been tested in court by any party to a trial.

4.2 Improving Access To Documents/Document Contents.

A second major reason for installing a DIP system is to improve our ability to search for and retrieve documents and, in many cases, the information contained within them. The key here is not avoiding the physical problems of handling paper documents but improving our ability to search for and identify relevant documents out of a store of many millions. All the time the document is held in paper form, all we can really hope to do is identify it uniquely by indexing it. When we have the document content held in digital format we can also provide ourselves with the ability to search the contents of the document as well, where this is required.

Almost all DIP applications will require us to uniquely identify each document stored in the system and show its relationship with other documents, what category of document it is, how many pages it occupies etc - so that we can manage it efficiently and retrieve it in a reasonable time. However, there is an important subset of applications where that level of keyword indexing and file folder management is simply not adequate.

We can broadly categorise documents into three types: transaction, evidentiary and reference.

Keyword Indexing For Transaction & Evidentiary Documents.

These documents relate to a transaction and either initiate it or provide some of the information needed to initiate it or resolve a query etc. Data is usually entered from them into a computer in a prescribed sequence and when all the relevant data is entered the computer processes the transaction and money changes hands or goods are delivered etc. At that point the documents become evidentiary and we then only usually need to store them as evidence that a transaction took place or that we were instructed to carry it out or that the transaction that occurred was as per instructions in the document etc. Deeds, cheques, proof of delivery notes, invoices, accounts payable documents, mortgage, insurance policy and loan request forms etc all fall into this category.

There are normally three characteristics related to DIP systems that handle transaction documents and evidentiary documents.

Firstly, they always relate to a transaction processing application and are best integrated with it to cut down on indexing requirements and to integrate the workflow with the logic of the application.

Secondly, the documents need to be held in image format as a legal facsimile of the original document.

Thirdly, the indexing requirements are usually simple as it is part of a structured process and the ways in which people want to access an insurance policy file or view a cheque or an invoice etc are very structured. Hence keyword indexing is quite adequate and most efficient.
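As a rough illustration of how little such an index needs to hold, the sketch below files each document image under a handful of structured fields and retrieves it by exact match. The field names, sample values and storage locations are all invented for illustration; a real system would hold these records in its index database.

    from dataclasses import dataclass

    @dataclass
    class IndexRecord:
        document_id: str     # unique identifier for the stored image
        doc_type: str        # e.g. "invoice", "cheque", "proof of delivery"
        account_no: str      # the transaction or account the document evidences
        date_received: str
        pages: int
        image_location: str  # where the page images live on optical disk

    records = [
        IndexRecord("D0001", "invoice", "ACC-1042", "1991-03-05", 2, "disk07/0211"),
        IndexRecord("D0002", "cheque", "ACC-1042", "1991-03-12", 1, "disk07/0212"),
    ]

    def lookup(account_no, doc_type=None):
        """Structured retrieval: every document for an account, optionally by type."""
        return [r for r in records
                if r.account_no == account_no
                and (doc_type is None or r.doc_type == doc_type)]

    print(lookup("ACC-1042", "cheque"))  # finds D0002 and nothing else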

Reference Documents - Keyword Or Full Text & Hypertext/Hypermedia?

The third category of document - which accounts for the largest percentage of documents held in offices - comprises reference documents. These documents do not relate to any specific transaction. They comprise technical reports, standards, procedures, consultancy studies, product literature, press cuttings, journal articles etc.

They build up and are collected by individuals, workgroups, departments or whole organisations to create knowledge bases that are relevant to the work that individual, workgroup, department or organisation does. In some cases they come close to evidentiary documents (a consultancy project file is evidence of work done on a project in the event of a dispute). In other cases they are the paper equivalent of an online reference database or an expert system.


These files and documents tend to be used predominantly by knowledge workers, whereas transaction and evidentiary documents are predominantly handled by clerical workers and supervisors. By knowledge workers we mean lawyers, consultants, marketing professionals, project managers, architects, senior police investigators, journalists etc.

Trying to analyse the information requirements of such people is much more difficult than trying to analyse the information requirements of clerical and supervisory staff in a transaction processing application. Knowledge workers take on specific projects and, depending on the project, they will want to access their document database in different ways.

The paper filing systems that evolved to meet the needs of these people are always changing as the interests of the people change, and a lot of reindexing and refiling is needed.

In the late 1960s and early 1970s companies like IBM and Harwell developed their own indexing systems in order to manage scientific and technical information for these categories of people. They developed what they called free text or full text retrieval systems. They acknowledged that trying to keyword index technical documents was a thankless task and that the only way to be sure that everyone's interests were catered for was to index every "significant" word in a document. This was prohibitively expensive to do manually, but as word processing became widespread one could simply take in a text file of a document and pass it through a process called inversion, building up an electronic concordance of every word used in a document database together with a record of its physical location. Under this system, if we had 1,000 text documents stored on the computer and we wanted to find all of the documents that related to Document Management, we would just input the term Document and would get a list of every document that contained the word Document. The addition of Boolean searching techniques meant one could eliminate documents about other aspects of documents by combining search terms - i e Document + Management etc.
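The sketch below illustrates the inversion process and a Boolean AND search in miniature. The document texts and the stopword list are invented for illustration, and a full concordance would also record each word's physical position within a document, not just which documents contain it.

    from collections import defaultdict

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on"}  # too common to index

    documents = {
        1: "An introduction to document image processing",
        2: "Management of records on optical disk",
        3: "Document management in the modern office",
    }

    # Inversion: build a concordance mapping each significant word to the
    # set of documents in which it occurs.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOPWORDS:
                index[word].add(doc_id)

    def search(*terms):
        """Boolean AND search: documents containing every one of the terms."""
        sets = [index.get(term.lower(), set()) for term in terms]
        return set.intersection(*sets) if sets else set()

    print(search("document"))                # {1, 3}
    print(search("document", "management"))  # {3}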

These full text retrieval systems are now widely used in libraries and research centres and by lawyers and other categories of professionals who wish to make sophisticated and unstructured searches of their document databases.

They work efficiently for text documents created on word processing systems but what about mixed document databases where half the documents come in paper format? The answer there is to scan in the document image and either:

a) accept that we can only keyword those documents that we have captured in image form.

b) apply OCR (Optical Character Recognition) to the scanned image and try to recognise the characters contained on the document page that we have scanned. These systems are quite efficient with typewritten text and, with some editing, can provide us with a full text version that is then input into the text retrieval system (a sketch follows this list).

c) key in an abstract of the document which we can then carry out full text searches on.
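A minimal sketch of option b), assuming the third-party pytesseract package and the Tesseract OCR engine are available; add_to_index() stands in for whatever inversion routine feeds the text retrieval system and is not a real library call.

    from PIL import Image
    import pytesseract

    def capture_page(doc_id, image_path, add_to_index):
        """Recognise the characters on one scanned page and index the result."""
        text = pytesseract.image_to_string(Image.open(image_path))
        # OCR of typewritten pages is good but rarely perfect, so in practice
        # the raw text is reviewed and edited before it is indexed.
        add_to_index(doc_id, text)
        return text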


The cost of managing such systems is much higher than simple keyword indexed systems but in many applications they represent probably the only answer to intelligently accessing the unstructured information contained in large document databases.

4.3 Workflow Automation.

The third area users should look at is workflow automation.

When DIP systems were first introduced they were seen as electronic equivalents of paper filing cabinets or microfilm systems. Hence users tended to specify systems where completed files - dead files in many cases - were indexed and scanned and stored on WORM optical disks as an alternative to sending them down to the basement or filming them and storing them on rolls of film or microfiche. Clearly DIP systems have some advantages in these applications - space saving, security and multi-user access when compared to paper; no chemicals, higher quality and faster retrieval when compared to microfilm.

In many cases, however, these advantages were not high profile and significant enough to get top management support and in value terms were not sufficient to justify the cost of DIP systems given that the documents by their very nature were archival and hence unlikely to be accessed very often by the time they were sent to the records management section.

Companies like Filenet very quickly realised that DIP systems could more easily be justified by adding them on to existing structured transaction processing applications: capturing the paper as it came into a department, routing it along networks to the relevant users who needed to access it, creating electronic "in baskets" and "out baskets", setting up queues of documents and setting the status of documents, so that they could completely control and automate the flow of documents through an organisation.

In other words they captured and eliminated (held offsite) the paper at the input stage and from then on all documents were accessed in image form and processed in image form and the system software was used to control and automate and streamline the whole workflow procedure.
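As a rough sketch of that control layer (queue names, statuses and document identifiers all invented for illustration), the work queues can be modelled as simple first-in, first-out structures with a status record per document:

    from collections import deque

    queues = {"data_entry": deque(), "approval": deque(), "completed": deque()}
    status = {}  # document id -> the queue it currently sits in

    def route(document_id, queue_name):
        """Place a document image in a work queue and record its status."""
        queues[queue_name].append(document_id)
        status[document_id] = queue_name

    def take_next(queue_name):
        """Workers take documents strictly in sequence - nothing is overlooked."""
        return queues[queue_name].popleft() if queues[queue_name] else None

    def status_report():
        """Supervisors can see queue lengths and any document's whereabouts."""
        return {name: len(q) for name, q in queues.items()}

    route("D0001", "data_entry")
    route("D0002", "data_entry")
    doc = take_next("data_entry")  # "D0001"
    route(doc, "approval")         # move it along the workflow
    print(status_report())         # {'data_entry': 1, 'approval': 1, 'completed': 0}
    print(status["D0002"])         # 'data_entry'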

By doing this, companies exposed very inefficient paper based systems to the spotlight and built up very valuable statistics on how long it took them to process a document - how long it took from receipt of document to completion of the relevant transaction, what all the exceptions were, and where the bottlenecks were in their procedures.

With workflow automation, suppliers could begin to justify DIP systems not just on space savings and security but, most significantly, on reduced time from receipt of application to billing of customers; on fewer staff needed to process the same numbers of transactions; and, perhaps most importantly in a competitive environment, on improved customer service and hence competitive edge.


Improved customer service comes because supervisors can access the system at any time using report writing facilities and ascertain the status of any document and any transaction - when a document was received; what queue it was in - if it was pending what had caused the holdup - if the transaction was complete, when the payment or acknowledgement had been sent off etc.

The productivity benefits were achieved because staff no longer had to leave their desks to access the documents they wanted - all the information they needed to make a decision was brought to them by the system at the workstation and they then simply made the decision. Time previously spent retrieving information could now be spent processing more documents/applications. The application could be system driven or user driven and supervisors could compare the productivity of workers and rectify the situation if it was poor.

The reduced time came because there was no risk of lost files, no risk of files getting left in trays because they were awkward or just overlooked - every file was dealt with in sequence and supervisors would inspect every queue to ensure all documents were dealt with and to advise on any exceptions etc. The workflow analysis will itself have eliminated some redundant stages in the processing steps and the staff will have been assigned work and reassigned roles to avoid bottlenecks.

With paper based systems it was impossible to readjust staff jobs quickly to take account of changes in demand. With DIP systems this can easily be done provided the system is set up properly.

These are some of the benefits of workflow automation and the reasons for planning for an ambitious front office DIP implementation rather than a conservative back end DIP application. The choice of a front end or back end application is an important one for users to make early on.

4.4 Applications

The fourth area which users have to look at very carefully is that of applications or what people want to do with documents today or tomorrow. We have shown the options above.

Generally speaking, if all users need to do is view and print documents then a DIP system that handles all documents as page images is ideal. If users need to access and change the documents at regular intervals as part of an office application, a publishing application or a CAD application then we are looking at a much more complex digital document management application that will involve the use of a compound document management system, unless we have a way of passing documents to and fro and linking Work In Progress files with master formatted files that record where they stood at each major release point (a sketch of such a linkage follows).
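A bare-bones sketch of that master/Work In Progress linkage, with every name and structure invented for illustration: each master document records its formatted releases, and each working copy is checked out against a named release so the system knows where it stood when editing began.

    releases = {}          # document id -> list of (release tag, formatted file)
    work_in_progress = {}  # working copy -> (document id, release it was taken from)

    def register_release(doc_id, tag, formatted_file):
        """Record a new formatted master release of a document."""
        releases.setdefault(doc_id, []).append((tag, formatted_file))

    def check_out(doc_id, wip_file):
        """Create a Work In Progress copy linked to the latest release."""
        tag, _ = releases[doc_id][-1]
        work_in_progress[wip_file] = (doc_id, tag)

    register_release("MANUAL-7", "v1.0", "manual7_v10.fmt")
    check_out("MANUAL-7", "manual7_draft.wp")
    print(work_in_progress["manual7_draft.wp"])  # ('MANUAL-7', 'v1.0')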


5. CONCLUSION.

Hopefully this introduction has provided a guide to the key hardware and software components that together provide all the functionality of a fully fledged Document Image Processing System and pointed the way to the complex and changing relationship that exists between Document Image Processing and the much wider area of Digital Document Management which others refer to variously as Compound Document Management or Intelligent Document Management.

Clearly in any one application not all the facilities will be needed, and users seeking small-scale filing systems will expect their suppliers to have made most of the key technical decisions for them and to provide them with some choice of peripherals (scanners etc) and some basic customisation facilities at the index database level. On the other hand, potential users of large-scale DIP and digital document management systems will, we hope, appreciate the need for flexibility and take the points we make about the need for flexible hardware and software platforms.

For a more in-depth coverage of the technology of Document Image Processing, together with a detailed review of the key management issues and some sixteen in-depth case studies of existing DIP users, we would recommend the following three volume work published by the National Computing Centre.

Document Image Processing: Applications, Benefits, Technology. A three volume handbook comprising: A Technical Introduction by A.M. Hendley; Case Studies From Europe And USA by John Pritchard; and The Business Benefits & Justifications by Simon Perkins. Details and application forms from Sales Administration, The National Computing Centre Ltd, Oxford Road, Manchester, M1 7ED.

Case studies and news of new systems and technical developments at the hardware and software level are carried in Cimtech's journal "Information Media & Technology" which appears on a bi-monthly basis and is available free to members of Cimtech.

The industry meets to show its new products and to discuss topical issues and business benefits etc at CIS Document Imaging. Cimtech organises the conference and Meckler administers the exhibition and show.

Copyright Cimtech 1991

