IBM Content Collector Version 3.0 Administrator's Guide SH12-6980-00

Note: Before using this information and the product it supports, read the information in “Notices” on page 749.

This edition applies to version 3.0 of IBM Content Collector (product number 5724-V57) and to all subsequent releases and modifications until otherwise indicated in new editions.

This edition replaces SH12-6914-01.

© Copyright IBM Corporation 2008, 2012. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

ibm.com and related resources
  How to send your comments
  Contacting IBM

Part 1. Solution overview

Content Collector overview
  What's new in Content Collector Version 3.0?
  New email management features
  New source connector features
  New target connector features
  New indexing features in IBM Content Manager
  New indexing features in IBM FileNet P8
  Further enhancements
  Content Collector architecture overview
  Definition of the email storage data model
  IBM Content Manager data model
  IBM FileNet P8 data model

Document archiving scenarios
  Scenario: Document archiving for storage purposes
  Scenario: Archiving journal email
  Scenario: Document retention and disposition
  Scenario: Preparing the email repository for email analytics

Part 2. Installing

Installing Content Collector
  Prerequisites for the installation
  Hardware prerequisites
  Software prerequisites
  Additional prerequisites and restrictions
  Configuration worksheets
  Configuration worksheets for the Content Collector source systems
  Configuration worksheets for Content Collector repository systems
  Configuration worksheets for the Content Collector configuration database
  Configuration worksheets for the Content Collector connectors
  Configuration worksheets for the Content Collector general settings
  Upgrading to version 3.0 of IBM Content Collector
  Upgrading specific FileNet P8 task routes for email archiving
  Additional steps for upgrading IBM Content Collector for Microsoft SharePoint
  Installing Content Collector
  Installing Content Collector for use with one or more source systems and Content
  Installing Content Collector for use with one or more source systems and FileNet P8
  Installing Content Collector on several servers - scale out
  Installing individual components
  Installing or upgrading IBM Content Collector for Microsoft SharePoint
  Installing Content Collector Notes Client Extension
  Installing Content Collector Server
  Performing the initial configuration
  Verifying and adjusting the initial configuration settings
  Setting the Content Collector environment variables
  Installing Content Collector on several servers
  Configuring the web application server
  Replacing the Lotus Notes mail template in all mailboxes
  Installing Content Collector Outlook Extension
  Enabling offline repositories to allow access to archived content without network access
  Installing and configuring Content Collector Outlook Web App (formerly Outlook Web Access) support
  Removing Content Collector

Part 3. Migrating

Migrating to Content Collector
  Moving from CommonStore to Content Collector
  Restubbing documents archived using IBM CommonStore for Lotus Domino
  Restubbing documents archived using IBM CommonStore for Exchange Server
  Moving from FileNet Email Manager or FileNet Records Crawler to Content Collector

Part 4. Configuring

Configuring Content Collector
  The Configuration Manager
  Enabling security in the Configuration Manager
  Signaling changes to the configuration database
  Adding, changing, or deleting configuration objects in the Configuration Manager
  Keyboard commands for Content Collector
  Setting up a configuration database
  Adding or editing data store connections
  Deleting a data store connection
  Exporting or importing a configuration database
  Starting the Task Routing Engine
  Configuring the task route service
  Checking if Content Collector is running
  Configuring the settings for LDAP lookups during task route processing
  Content Collector services
  Content Collector processes
  Providing connections for collecting and archiving documents
  Configuring connectors
  Source connectors
  Target connectors
  Utility connectors
  Configuring general settings
  Configuring Content Collector for CommonStore for Exchange Server legacy support
  Modifying the Configuration Web Service settings
  Modifying the information center settings
  Modifying the settings for the Web Application
  Modifying client configuration settings
  Configuring the access to archived data
  Modifying the settings for Content Search Services Support
  Modifying the settings for the Metadata Web Application
  Selecting the metadata form template
  Configuring the metadata form definition
  Configuring metadata and lists
  Metadata and lists
  Adding, editing and sorting lists
  Adding and editing user-defined metadata
  System metadata
  Configuring task routes
  Task routes
  Sample task route templates
  Building task routes
  Task route traits and considerations
  Working with the Expression Editor
  Using extended processing functions
  Collecting documents for archiving or processing
  Configuring tasks
  Using the setup tools
  Configuring an IBM Content Manager repository
  Configuring the Domino environment for Content Collector
  Enabling a Domino template for Content Collector
  Enabling an IBM Content Manager repository for processing by the indexer for text search
  Configuring an IBM FileNet P8 repository
  Enabling the access to archived data
  About collections
  Enabling search for email documents
  Enabling searching for documents archived by IBM CommonStore for Lotus Domino
  Enabling searching for messages archived by IBM CommonStore for Exchange Server
  Customizing search and result fields
  Setting a default date range for the Email Search page
  Changing the preview mode for Outlook
  Enabling access to IBM Connections documents
  Enabling access to File System or Microsoft SharePoint documents
  Handling erroneous documents
  Blacklist
  Enabling Microsoft Outlook links
  Securing Content Collector communications
  Replacing certificates for the embedded web application server
  Client communication
  URL protection

Part 5. Tutorials

Content Collector file system tutorials
  Archiving file system documents to FileNet P8
  Moving documents off the network into IBM FileNet P8
  Detecting and processing duplicates, searching for archived and stubbed documents, and declaring documents as records
  Defining metadata to be used to process files for archiving

Part 6. Developing

Developing with the Content Collector APIs
  Creating requests for interactive archiving
  Document states
  Developing with the Content Collector Web Application services APIs
  RestoreAPI
  ViewingAPI
  Enabling security for the Web Application services APIs
  Developing with the Document Viewer
  The Document Viewer configuration files
  Document Viewer requests
  Configuring Workplace or Workplace XT for the use of the Document Viewer

Part 7. Monitoring

Monitoring Content Collector system performance
  Using the system dashboard
  Information monitored in the system dashboard
  Using performance reporting
  Performance reporting database tables
  Using performance counters
  Performance counters
  Tracking system log files
  What logs to track
  File format and naming conventions for system log messages in Content Collector
  Log levels
  Using audit logs
  Using event logs
  Interpreting event logs
  Deleting event logs
  Event IDs

Part 8. Troubleshooting and support

Troubleshooting Content Collector
  Collecting troubleshooting data on Windows
  Retrieving version information
  Troubleshooting installation
  Troubleshooting scale out mode
  The installation of the web applications failed
  The installation, upgrade, or removal of Content Collector for Microsoft SharePoint failed
  Creating the Content Collector configuration database on remote server fails
  The connection to the configuration database fails
  The connection to the Oracle database fails
  Memory issues when running the initial configuration or the set-up tools
  IBM FileNet P8 validation fails using HTTPS connection in Initial Configuration/Setup Tools
  The CommonStore server and the CSLD tasks fail to start
  Troubleshooting configuration
  Troubleshooting source systems
  Troubleshooting target repositories
  Troubleshooting components
  Troubleshooting task routes
  Identifying document processing errors
  Relevant event logs
  Checking event logs
  Checking if documents were collected
  SharePoint farm or web application collection fails for some site collections
  Checking whether the source system connector started
  Checking whether the source system collector started
  Checking whether documents were submitted to the IBM Content Collector Task Routing Engine service
  Checking whether documents were received by the IBM Content Collector Task Routing Engine service
  Checking whether the expected task route was assigned
  Checking the IBM Content Collector deployment
  Checking whether any tasks failed
  Identifying whether metadata is missing
  Checking whether the connector stopped
  Analyzing task connector logs

Part 9. Appendixes

Notices

Index

ibm.com and related resources

Product support and documentation are available from ibm.com®.

Support and assistance

Product support is available on the web. Simply click Support from the appropriate product website.

IBM® Content Collector

IBM Email Archive and eDiscovery Solution

IBM CommonStore for Exchange Server: http://www.ibm.com/software/data/commonstore/exchange/

IBM CommonStore for Lotus® Domino®: http://www.ibm.com/software/data/commonstore/lotus/

IBM FileNet® P8

IBM Enterprise Records

Information center

You can view the IBM Content Collector product documentation in an Eclipse-based information center. See the information center at

PDF publications

You can view a PDF version of the IBM Content Collector installation and configuration guide by using the Adobe Acrobat Reader for your operating system. The guide is available from the IBM Publications Center. If you do not have the Acrobat Reader installed, you can download it from the Adobe website at http://www.adobe.com.

How to send your comments

Your feedback is important in helping to provide the most accurate and highest quality information.

Send your comments by using the online reader comment form at

Contacting IBM

To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378).

To learn about available service options, call one of the following numbers:

- In the United States: 1-888-426-4343
- In Canada: 1-800-465-9600

For more information about how to contact IBM, see the Contact IBM website at http://www.ibm.com/contact/us/.

Part 1. Solution overview

Content Collector overview

IBM Content Collector archives email and other digitized content in an external, central repository. Additional functions enable users to reduce the size of their mailboxes, reclaim space on their hard drives and Microsoft SharePoint servers, search for email in the repository, and restore archived email to their original locations.

Archiving

You can archive content from various sources, including:

- Mailboxes on Lotus Domino or Microsoft Exchange servers
- Email that is received through the Simple Mail Transfer Protocol (SMTP)
- Microsoft Exchange public folders and PST files
- Lotus Domino applications and local NSF archives
- Microsoft SharePoint sites
- IBM Connections content
- Documents in NTFS, DFS, and Novell file systems

Archiving means that the content of these documents is processed and then stored in a central repository.

Terminology: IBM Content Collector uses documents as a generic term for email, messages, Microsoft SharePoint items, IBM Connections items, and file system documents.

The central repository provides a single access point for all business-relevant documents, which means that sensitive data can be better controlled. Various security features are in place for the protection of business documents.

Archiving methods include automatic and interactive archiving.

- Automatic archiving means that an administrator centrally sets up an archiving schedule and selects the sources from which to archive content, such as email servers, applications, user groups, Microsoft SharePoint sites, IBM Connections deployments, or storage systems.
- Interactive archiving on the client side enables Notes and Outlook client users to flag documents for archiving. Documents flagged by email client users are selected for archiving the next time the scheduled archiving process runs. Users can also specify additional archiving information before the documents are archived.

When archiving email documents, IBM Content Collector always archives the entire email content, including the attachments. You can configure which parts are removed from the original document after it is archived, and when this happens. You can select documents from all connected mail clients, or from just a subset, according to predefined criteria, such as the size of mail databases, the age of documents, and so on.

You can copy or move documents, including their attachments, from multiple Microsoft SharePoint sites, a single site, or selected libraries and lists. You can filter the archive collection based on content types or through additional task route filters, and map your custom site columns to corresponding metadata in your repository.

Content from multiple IBM Connections applications, from one or several deployments, can be copied to a repository. The collections of content can be filtered by users.

File system documents can be processed depending on metadata and stored in a specific repository folder structure to facilitate search and retrieval.

Accessing Content

The preview and restore functions allow your client users to view and restore archived documents from the central repository, especially in cases where the archived content has been removed from the original documents. Client users can access the archived material in different ways depending on their source system. Email documents can be previewed or restored through links and hot spots provided in stub documents or through a web-based search interface, while documents from the file system or Microsoft SharePoint can be previewed through direct links.

In IBM Content Collector, access to archived content is restricted. For email, access to a link is provided by the security of the user's mailbox, meaning the user will see only what the mailbox allows. For file system and Microsoft SharePoint, access to a link is determined by the user's access to the document's location within the file system or SharePoint list.

Access to document content is also possible when using a repository client, either custom-built or out-of-the-box, where the credentials of a repository user are applied against a document's security to determine access. In IBM Content Collector, file system or SharePoint links can also be defined as secure links. Clicking a secure link prompts the user for specific user permissions to view the document content.

Restriction: Archived SharePoint items cannot be restored from secure links.

To remove the content of restored email in IBM Content Collector, you can define a schedule. This process is referred to as restubbing.

Search (email)

Installing and configuring the search functionality adds a search interface to the connected Lotus Notes or Outlook clients. From this interface, users can start full-text queries to search for archived content. The content of archived attachments is included in the search.

For security reasons, the search capability is limited. Archive users can only search for content that was archived from their mailbox. They cannot search or restore content that is owned by other users. However, they can search content that was archived from mailboxes to which they have “delegate access”. For example, if an assistant has been given delegate access to a manager's mailbox, the assistant can search for content that was archived from this mailbox. Similarly, users can search the content that was archived from any Microsoft Exchange PST file or Notes Storage Facility (NSF) file that was assigned to them before it was archived.
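The access rule described above can be sketched in a few lines of Python. This is an illustration only, not Content Collector code: the `ArchivedItem` type, the shape of the `delegates` mapping, and the `can_search` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedItem:
    """Hypothetical record for one archived document."""
    item_id: str
    source_mailbox: str  # mailbox (or owner of the assigned PST/NSF file) the item came from


def can_search(user: str, item: ArchivedItem, delegates: dict[str, set[str]]) -> bool:
    """Return True if `user` may search `item`.

    `delegates` maps a mailbox owner to the users who hold delegate
    access to that mailbox (an assumed data shape for this sketch).
    """
    if item.source_mailbox == user:
        return True  # users can always search content archived from their own mailbox
    return user in delegates.get(item.source_mailbox, set())


# Example: an assistant with delegate access to a manager's mailbox.
delegates = {"manager": {"assistant"}}
item = ArchivedItem("mail-001", "manager")
visible_to_assistant = can_search("assistant", item, delegates)
visible_to_colleague = can_search("colleague", item, delegates)
```

In this sketch the assistant can find the manager's archived item, while a colleague without delegate access cannot.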

Users can also search for email metadata. This is information that resides in fields of the original email, such as the sender, recipient, or subject field. The information in these fields is extracted during the archiving operation and stored in corresponding fields in the repository. You can customize the list of email fields from which metadata is extracted. Note, however, that metadata searches require users to have a deeper understanding of the data in these fields.

There is also a preview function. If a document looks promising in the result list, a user can select it to display its content in a web browser window. The search text is highlighted. If the document shows the desired content, users can click a Restore button to copy the content to an email document in their mailboxes.

Search (Microsoft SharePoint, IBM Connections, and file systems)

Microsoft SharePoint and file system document stubs contain all of the metadata related to a document. Users can perform a metadata search to locate documents. For all documents that were archived using the File System Connector, users can view the content by clicking the stub links. To search by content for Microsoft SharePoint, IBM Connections, and file system documents, users can use their target repository clients. For Microsoft SharePoint, if the target repository is FileNet P8, they can use IBM FileNet Connector for Microsoft SharePoint Web Parts to search by content. For searching by metadata for documents in a file system repository, users can apply the standard search tools provided by Windows.

Document life cycle

IBM Content Collector enables you to implement a range of document retention strategies, from simple deletion after processing to a formal declaration of documents as records in IBM Enterprise Records.

You can remove parts of archived email documents or Notes application documents step-by-step from the original document until finally, the entire content is deleted. The removal of document content frees up space in the users' mailboxes or databases, and on the servers of your content management system.

In Microsoft SharePoint source systems, you can replace entire documents with links to the archived document in the target repository. You can later update outdated links and remove orphan links from target repositories.

To configure the email document life cycle, you can define a so-called stubbing life cycle. Stubbing means converting a document to a stub. A stub is a document from which parts of the content have been removed. For example, your stubbing life cycle might instruct IBM Content Collector to remove email attachments one week after the mail content has been archived. A second instruction in the schedule of the stubbing life cycle removes the main text or email body after four weeks so that just an empty shell of the original email remains. Finally, the stubbing schedule can be set up to delete the entire mail.
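The staged schedule in this example can be modeled as a list of age thresholds. The following Python sketch is illustrative only and does not reflect how Content Collector stores or evaluates its configuration: the `StubState` names, the `SCHEDULE` list, and the `stub_state` function are hypothetical. The one-week and four-week values come from the example above, both measured from the archiving date, which is an assumption. The 52-week deletion threshold is also an assumption.

```python
from datetime import date, timedelta
from enum import Enum


class StubState(Enum):
    FULL = "full document"
    ATTACHMENTS_REMOVED = "attachments removed"
    BODY_REMOVED = "body removed"
    DELETED = "deleted"


# Hypothetical schedule: (minimum age since archiving, resulting state),
# ordered from the oldest threshold to the newest.
SCHEDULE = [
    (timedelta(weeks=52), StubState.DELETED),       # assumed value
    (timedelta(weeks=4), StubState.BODY_REMOVED),
    (timedelta(weeks=1), StubState.ATTACHMENTS_REMOVED),
]


def stub_state(archived_on: date, today: date) -> StubState:
    """Return the state an email should be in, given when it was archived."""
    age = today - archived_on
    for min_age, state in SCHEDULE:
        if age >= min_age:
            return state
    return StubState.FULL
```

Ordering the thresholds from oldest to newest means the first match is always the most advanced stage that applies, so a document that skipped a scheduled run still ends up in the correct state.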

The stubbing function can insert links in these email stub documents after archiving, which enable users to view the archived content with a mouse click. In addition, IBM Content Collector can be configured to insert brief texts in the original email that inform users that a particular piece of content has been archived and removed. A separate task route can be configured to delete orphaned stubs, that is, stub documents for content that has been deleted from the archive.

Related concepts:

“Content Collector architecture overview” on page 15

“Scenario: Preparing the email repository for email analytics” on page 26

“Scenario: Document archiving for storage purposes” on page 23

“Scenario: Archiving journal email” on page 24

“Scenario: Document retention and disposition” on page 25

Related information:

IBM Content Collector website

What's new in Content Collector Version 3.0?

IBM Content Collector Version 3.0 provides the following new features.

For the most current software requirements, including versions, see the System Requirements technote on

New email management features

IBM Content Collector Version 3.0 provides the following new email management features.

Email management enhancements

Lotus Domino: Configure which icons to use for showing the document state

When you customize the Lotus Domino template for IBM Content Collector, you can choose to use IBM Content Collector icons to represent the document state. The Content Collector icons then overwrite the default icons displayed in the attachment icons column.

If you choose not to use the IBM Content Collector icons, the original icons are preserved.

Lotus Domino: Enable the template with basic IBM Content Collector functions

You can select to enable the Domino template with basic Content Collector functions only, which are required for archiving, but are invisible to the user. This means that the client menu does not contain any Content Collector elements that provide functions for searching and restoring documents or for collecting additional archiving information. Basic functions like automatic archiving or automatic retrieval of documents when they are opened are available, however.

Lotus Domino: Default setting includes message types in all mailbox management task route templates

You can now specify which message types to include, not only which message types to exclude. The default setting in all mailbox management task route templates is now to include all message types.

Microsoft Exchange: Show the archiving status in Outlook

You now have the option to have the archiving status of messages shown in Microsoft Outlook. You can select to add or remove an additional column that indicates the archiving status in the Outlook folder.

Microsoft Exchange: Ribbon Support in Outlook

Ribbon style of IBM Content Collector functions in Microsoft Outlook 2010 is supported.

only

Legacy restubbing

Documents that were archived using IBM CommonStore and are restored in IBM Content Collector can be restubbed.

SMTP Connector enhancement

The SMTP Connector now supports business process management and content classification scenarios. You can configure the SC Prepare Email for Archiving task to include the attachments of the document in the temporary files if the temporary files are used for business process management or as input for the IBM Content Classification task.

Private items

You can now explicitly exclude private items from being archived or, if private items are archived, you can limit delegate access to archived items that are not marked private.

Cleanup of orphaned stubs

New task route templates enable you to check mailboxes for orphaned document stubs. IBM Content Collector checks whether the document to which the document stub points still exists in the archive. If no associated document is found, the document stub is deleted.

Enhanced blacklist UI

You can now filter the blacklist to display only those entries that meet specified criteria.

Archiving local files in a scale-out environment

IBM Content Collector now supports archiving local files (NSF and PST) in a scale-out environment. PST or NSF files are processed by one dedicated node.

Enhancements to the EC Copy to Mailbox task

The EC Copy to Mailbox task has been renamed to EC File Email in Mailbox Folder and now supports additional use cases. In addition to copying Microsoft Exchange messages from a local archive to the associated mailbox, messages can now also be copied or moved to a configurable folder that can be built from metadata. For example, the folder name can be created from IBM Content Classification metadata. For Microsoft Exchange, messages are copied; for Lotus Domino, they are moved.

Enhanced stubbing options

The EC Create Email Stub task can now be configured to treat embedded attachments as part of the email body so that you can control whether embedded attachments are removed when attachments are removed or when the body of a message is truncated.

New source connector features

IBM Content Collector Version 3.0 provides the following new source connector features.

File System

File re-collection

You can collect new versions of files and have them added to IBM Content Manager or IBM FileNet P8 as new versions.

Cleanup of orphaned stubs

A new stub collector enables you to set up task routes for checking file systems for orphaned stubs. IBM Content Collector checks whether the document to which the stub points still exists in the archive. If no associated document is found, the stub is deleted.

Stubs that are created with IBM Content Collector V3.0 contain the ID of the FileNet P8 or IBM Content Manager repository into which the document was archived. This ensures that the correct repository is accessed, thus preventing unintentional deletion of stubs. For stubs that were created with earlier versions, you can set the repository ID manually in the respective tasks.
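The role of the repository ID in orphan detection can be sketched as follows. This Python example is a simplified model under stated assumptions, not the product's implementation: the `Stub` type, the `archives` mapping, and `find_orphaned_stubs` are hypothetical, and the choice to skip stubs whose repository is unknown is a design decision made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stub:
    path: str
    document_id: str
    repository_id: Optional[str]  # None models a stub created before V3.0


def find_orphaned_stubs(stubs, archives, default_repository_id=None):
    """Return stubs whose target document no longer exists in its repository.

    `archives` maps a repository ID to the set of document IDs stored there.
    Stubs whose repository cannot be determined are skipped rather than
    treated as orphans, so they are never selected for deletion by mistake.
    """
    orphans = []
    for stub in stubs:
        repo_id = stub.repository_id or default_repository_id
        if repo_id is None or repo_id not in archives:
            continue  # unknown repository: leave the stub alone
        if stub.document_id not in archives[repo_id]:
            orphans.append(stub)
    return orphans


archives = {"repo-A": {"doc-1"}, "repo-B": {"doc-2"}}
stubs = [
    Stub("a.txt", "doc-1", "repo-A"),  # target still archived
    Stub("b.txt", "doc-9", "repo-A"),  # target gone: orphan
    Stub("c.txt", "doc-2", None),      # older stub without a repository ID
]
```

With the manually supplied ID `repo-B`, the older stub `c.txt` resolves to an existing document. Supplying `repo-A` instead would flag it as an orphan, which is the kind of unintentional deletion that a correct repository ID prevents.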

Metadata file collector

To collect metadata files describing large numbers of content files, you can now configure task routes with a specific metadata file collector. The metadata file collector combines some of the functionality of the FSC Associate Metadata with the functionality of the file system collector. Working with a metadata file collector reduces memory requirements, makes better use of CPU, and ensures that the status of the metadata file is tied to the status of the documents.

XML-based mapping of properties for files

XML namespaces are now supported.

Microsoft SharePoint

The following new features have been added to Microsoft SharePoint support:

Collection levels and depth

You can now configure a collection source to begin collection at the site, web application, or farm level. In addition, you can specify how deep you want to delve into a level. These features eliminate the need to create multiple site connections to traverse multiple web applications, sites and subsites. You can simply begin the collection process at the farm or web application levels and collect to any depth that you choose.

Library or list type filtering

You can now filter the collection process by selecting the library and list types that you want to collect.

User filtering

You can filter the collection process to select only content touched by specific users.

Library and list types

All library and list types are now supported.

Column support enhancement

All column data types are now supported and mapping options have been added.

Re-collection enhancements

It is no longer necessary to add an additional SP Collector when configuring re-collection. In addition, re-collection is enabled automatically during the installation process.

Restore from link

You can now restore a document from the target repository using a check out operation in SharePoint.

Task route template enhancements

The FileNet P8 and Content Manager Version Series templates now include list attachments.

IBM Connections

This is a new source connector. IBM Connections support comprises the following feature:

Application support You can now capture and archive content from IBM Connections applications: profiles, activities, wikis, blogs, files, bookmarks, and forums.

New target connector features

IBM Content Collector Version 3.0 provides the following new target connector features.

IBM Content Manager

Hierarchical folders IBM Content Collector now supports the use of hierarchical folders in IBM Content Manager version 8.4.3 or later.

Dynamic ACL support In addition to selecting from the access control lists (ACLs) that are available on the Content Manager server, you can now select to create a new ACL based on Content Collector ACL metadata or to define an expression to dynamically select an ACL.

Support for additional document model parts The support for the IBM Content Manager document model has been enhanced to allow more flexibility in part selection.

IBM FileNet P8

Indexing with IBM Content Search Services New tasks are available that support archiving email into a FileNet P8 repository that is configured to use IBM Content Search Services as its indexer.

Indexing of additional document properties with IBM Legacy Content Search Engine

The IBM Legacy Content Search Engine (formerly known as Verity or Autonomy) style sets were updated so that additional document properties are indexed into a separate zone.

Mime type mappings You can now configure mime type mappings in Configuration Manager. Mappings that you configured in previous versions of IBM Content Collector are preserved.

Maintenance task The configuration for the XIT consolidation task is now viewable and editable in Configuration Manager. Additionally, you can now configure a schedule for this task.

New indexing features in IBM Content Manager

Indexing in IBM Content Manager using IBM Content Collector Text Search Support provides the following new features.

Indexing features added in IBM Content Collector V2.2.0.2

Additional processing in afuEnableItemType to support recognizing the TIEFLAG value in IBM CommonStore resource item types When you run the enable item type tool called afuEnableItemType on IBM CommonStore resource item types, the table of completed tasks is automatically filled with all the items that were already indexed. The IDXRC value that is given to these items correlates with the TIEFLAG value that defines which item parts are text-searchable.

New configuration option added to the indexer process that acknowledges the TIEFLAG value used in IBM CommonStore A new indexer configuration option has been added to the indexer process that fills the TIEFLAG column in the item type component table in IBM Content Manager.

Changes to the -reindexwarnings command-line argument used with afuIndexer The -reindexwarnings argument of the afuIndexer tool ignores items that were indexed by the fast indexer or the standard IBM Content Manager indexer with the IBM Text Search user exit and have an imported IDXRC value of between 10 and 19.

New command-line option for afuEnableItemType to change the UDF buffer size A new command-line option called -udftransferbuffersize was added to afuEnableItemType that you can use if you need to specify a different size of the buffer which the AFUFetchFile UDF uses to load the temporary XML files for access by Net Search Extender.

New indexer configuration option for handling items with severe errors By setting the configuration option IdxProcessSevereErrors to 1, items that might have caused the indexer worker process to stop unexpectedly are not moved to the table of completed tasks with an IDXRC of 200, but instead are processed again without the embedded attachments the next time afuIndexer runs.

Performing index validation and repair operations The indexer for text search index tool called afuIndexTool offers useful index operations that can be applied to an index to check for inconsistencies or can be used to update the index database tables to accommodate the IBM CommonStore TIEFLAG feature.

Reindexing archived Lotus Domino mail documents that were not indexed correctly To identify those documents that might be affected and might need to be reindexed, use the tool named afuRepairCSN. This tool must be run on all item types containing Lotus Domino mail documents that were archived using an IBM Content Collector Server version between 2.1.1.1 LA006 and 2.1.1.3 LA006.

Searching for encrypted email in the index A warning search notification message string called IcmFceWarning: IcmDocIsEncrypted is indexed when encrypted email (Exchange, Domino, and SMTP/MIME email) is processed by the indexer. The content of encrypted email cannot be indexed. Using the search message string, you can search for all encrypted email, decrypt the email, and reindex the email if you want to index the email content.

Indexing embedded MSG files The textual content of embedded MSG files, even recursively embedded MSG files, can now be indexed. This means that the notification string IcmFceWarning: IcmUnhandledEmbeddedMsg is no longer used and cannot be searched. After you have applied the fix pack, search for all items in the index that have the notification string IcmFceWarning: IcmUnhandledEmbeddedMsg and reindex these items.

Indexing IBM CommonStore Content Manager document item types You can index items in IBM CommonStore item types for the IBM Content Manager document model GENERIC_MULTIDOC and GENERIC_MULTIPART and the archiving type entire and attachment. Before you can use a CommonStore document item type in IBM Content Collector, the item type must be enabled for use in Content Collector.

Indexing features added in IBM Content Collector V3.0

New indexer command-line arguments for reindexing items that were indexed with search strings Reindexes only those documents that were indexed during a previous indexing run and where the specified string was indexed with the document.

New indexing mode that processes items with severe errors only You can run afuIndexer in a special mode in which only those items that were processed in an earlier indexing run and resulted in a severe error are reprocessed. This mode uses configuration settings that are optimized for handling error situations rather than tuned for performance and high throughput.

Additional support for item types containing IBM Connections documents The indexer for text search supports processing and indexing of IBM Connections documents.

Support for Microsoft SharePoint item types that are created using the data model with embedded attachments The indexer for text search supports processing and indexing of Microsoft SharePoint documents in item types created using a new data model that supports handling embedded attachments.

Index validation runs in parallel mode To increase performance, the index validation tool afuIndexTool runs index validation operations in parallel.

New indexing features in IBM FileNet P8

IBM Content Collector supports indexing in IBM FileNet P8 using both IBM Legacy Content Search Engine and IBM Content Search Services.

The following new features have been added:

Indexing using the IBM Content Search Services indexing engine

IBM Content Collector P8 Content Search Services Support IBM Content Collector P8 Content Search Services Support is an optional document constructor plug-in in IBM Content Search Services for custom preprocessing of all documents archived by using IBM Content Collector other than file system documents.

Further enhancements

IBM Content Collector Version 3.0 provides the following additional features.

Configuration Manager

Enhanced resilience When the connection to the configuration database becomes invalid, Configuration Manager automatically reconnects to the database after connectivity is restored.

Select more than one task route in the Explorer view You can now perform actions on more than one task route simultaneously.

Email clients

The IBM Content Collector email clients now provide access to Content Collector client help documentation. For Microsoft Outlook, Outlook Web App, Lotus Notes®, and Lotus iNotes, the help documentation is available online. For Microsoft Outlook and Lotus Notes, the help documentation is available offline as well.

Expiration Manager

Support for the IBM Content Search Services data model in IBM FileNet P8 The Expiration Manager now supports the FileNet P8 data model for IBM Content Search Services.

Improved performance The performance of the Expiration Manager has been improved. Additional configuration options provide flexible control and allow for multi-thread processing.

Metadata

Consolidate user-defined metadata and file system metadata The mechanism for specifying property mappings for files has been changed to make the configuration consistent with the configuration of user-defined metadata for email and for Microsoft SharePoint documents. You can now set up file system metadata as user-defined metadata properties. The properties are later mapped within a task route, in the FSC Associate Metadata task.

Lists You can now import and export list values.

Monitoring

Performance reporting The new performance reporting component gathers statistical data about the performance of your IBM Content Collector installation. You can use the report viewer to generate a performance report from this data and display it.

Additional performance counters IBM Content Collector now provides additional performance counters for system monitoring.

Search using IBM Content Collector

Sorting the result list You can now sort the search result list by any column.

Task route processing

Performance improvement The IBM Content Collector Task Routing Engine service is much more efficient than in previous versions. This was achieved by reimplementing the thread pool and work queuing mechanism.

Viewing documents in IBM FileNet Workplace or IBM FileNet Workplace XT

You can now configure IBM FileNet Workplace or IBM FileNet Workplace XT for viewing archived documents with the Document Viewer.

The following redirections for viewing archived documents in IBM FileNet Workplace or IBM FileNet Workplace XT are no longer supported:

# BRI file view redirect

application/csbundled=/postRedirect?{QUERY_STRING}&redirectUrl=https://<<ICCServerName>>:11443/AFUWeb/CsnViewer.do

# CSN file view redirect

application/icccsn=/postRedirect?{QUERY_STRING}&redirectUrl=https://<<ICCServerName>>:11443/AFUWeb/CsnViewer.do

Content Collector architecture overview

IBM Content Collector consists of several components, which interact with components of your Microsoft Exchange, Lotus Domino, NTFS, DFS, and Novell file systems, Microsoft SharePoint and IBM Connections environments, and repository servers. See the diagram.

[Figure: architecture diagram. Labels: Target Repositories (IBM Content Manager, IBM FileNet P8); IBM Information Integrator for Content; IBM FileNet P8 Content Engine Web Service; IBM Content Manager connector; IBM FileNet P8 connector; Metadata form connector; Web Application Server; Derby database; Task Routing Engine; Configuration Manager; Source connector; IBM Content Collector; Microsoft Outlook clients and Exchange; Lotus Notes clients and Domino; Microsoft SharePoint and SharePoint clients; IBM Connections; Files; SMTP source email; Search/View/Restore; Specify additional archiving information]

Figure 1. Interaction diagram including IBM Content Collector components, email clients, email servers, Microsoft SharePoint, IBM Connections, file systems, and repository servers

Source system

A system that contains documents that you want to collect with IBM Content Collector. This can be Microsoft Exchange, Lotus Domino, SMTP email, NTFS, DFS, and Novell file systems, or Microsoft SharePoint or IBM Connections environments.

Source connector

A source connector provides an interface to a third-party system that contains documents that you want to work with in IBM Content Collector. It is responsible for the communication between email servers, file servers, Microsoft SharePoint, or IBM Connections and IBM Content Collector. Documents that are routed to IBM Content Collector for archiving pass this layer before they are processed and stored in a repository.

Target connector

A target connector provides an interface to the third-party system that serves as the target repository for IBM Content Collector. It is responsible for the communication between an IBM Content Manager repository, an IBM FileNet P8 repository, or a File System repository, and IBM Content Collector. Documents that are routed from IBM Content Collector for archiving pass this layer before they are stored in a repository.

Task Routing Engine

A service that monitors most of the collector services that run in IBM Content Collector.

Configuration Manager

A graphical user interface for the administration of IBM Content Collector.

Web application server The IBM Content Collector web application server. This can be the embedded web application server or an external web application server.

Metadata Form Connector

A connector to a database where metadata is stored temporarily.

Text Extraction Connector An interface to the Oracle Outside In Technology filters, which are used to convert binary data, for example from email attachments, into a plain-text representation.

Utility Connector

A container for those tasks that provide the intrinsic functions of IBM Content Collector.

Derby database

A temporary storage for any additional archiving information that a user specified when manually archiving a document.

Related concepts :

“Definition of the email storage data model”

“Content Collector overview” on page 3

“Scenario: Preparing the email repository for email analytics” on page 26

“Scenario: Document archiving for storage purposes” on page 23

“Scenario: Archiving journal email” on page 24

“Scenario: Document retention and disposition” on page 25

Related reference :

“Additional prerequisites and restrictions” on page 34

Related information :

IBM Content Collector website

Definition of the email storage data model

IBM Content Collector uses a prescriptive email storage data model for compliance archiving, space management, and duplicate management. The benefits of such a data model are that it supports ingestion of high volumes of email, enables effective deduplication on email and email attachments across multiple email sources, and that it supports searches across the entire content of the email and electronic discovery by using IBM eDiscovery Manager.

The email data model describes how Content Collector stores email in the repository: the entire content (email body, all attachment text, and metadata), deduplicated instances, and searchable XML. However, in business process management scenarios or in cases where search across the entire email content is not required, you do not have to work with the Content Collector email data model.

IBM FileNet P8 now supports two content search engines: IBM Content Search Services and IBM Legacy Content Search Engine (formerly Autonomy K2 or Verity). Therefore, an additional email data model had to be introduced. Content Collector now offers these email data models for archiving into a FileNet P8 repository:

• FileNet P8 data model for IBM Legacy Content Search Engine (also referred to as IBM Legacy Content Search Engine data model)

• FileNet P8 data model for IBM Content Search Services (also referred to as IBM Content Search Services data model)

The IBM Legacy Content Search Engine data model was enhanced to allow for a different way to create and update the XML Instance Text (XIT) object, which contains the email content to be indexed for text search. These changes improve resilience in the processing of duplicate email documents and of email documents that failed to be processed completely in a previous archiving attempt.

The IBM Content Search Services data model is a simplified data model that not only supports the new FileNet P8 content search engine but also goes without an XIT object, thus saving database and file storage.

There is no formal data model for Microsoft SharePoint, IBM Connections, and File System documents. IBM Content Collector offers a sample repository configuration for each. You can choose not to use the samples at all or choose to use some of the properties from the samples, depending on your business case.

IBM Content Manager data model

In IBM Content Manager, all documents archived using IBM Content Collector are stored in item types. You must have at least one IBM Content Manager item type for each source system that you configure in IBM Content Collector. Deduplication on email, Microsoft SharePoint, and File System documents is only available within one and the same item type and not across item types.

Email is stored in an email item type. The email item type is an IBM Content Manager resource item type containing one or more distinct email instances (DEIs).

A DEI is the root item and is the common binary email object in one of these formats:

• Notes binary (CSN) format

• Multipurpose Internet Mail Extensions (MIME) format

• Microsoft Exchange mail document (MSG) format

The root holds all common email data and attributes that are shared across all instances of the email. It contains the hash that is used to ensure that the email is stored only once in the repository. The DEI is the item that is required by an application, for example, in a workflow process, for records management or for viewing purposes.

A DEI has two child components:

• The email instance (EI) child that tracks the references of all duplicates of the same email archived from different mailboxes or the journal. It contains the varying properties, that is, the properties of each email duplicate that are needed to restore each individual copy of the email. For journal archiving, the varying properties contain the additional journal attributes produced during the journal process.

• The attachment instance (AI) child that tracks the references to the email attachments that are archived separately. As an email can have multiple attachments, this reference child can have none, one, or many entries pointing to attachments. In addition to the references to the attachments, it stores the metadata that is required for viewing and restoring the email with its attachments, for example, the attachment file name and a correlation key that is used to restore the attachment to its original location in the email.

[Figure: Email item type, showing the DEI (email object) with hash, common properties, search result list properties, and text index; two EI children with varying properties (one from the journal); and an AI child]

When a DEI is removed, all associated objects will be removed as well. To prevent accidental deletion of the DEI, for example, by a client user, the expiration date is monitored and only if the current date is past the expiration date, removal is allowed.

Email attachments are stored in an attachment item type. The attachment item type is a resource item type and can contain attachments from different email source system item types. The attachment item type contains one or more distinct attachment instances (DAIs). A DAI represents the attachment object itself and is the master object that controls the deletion of the associated content and objects. A DAI is referenced by one or more AIs from an email instance (a DEI). A DAI can only be removed if no other instances are pointing to it. The only attribute required by a DAI is the hash used to calculate a unique deduplication hash key that ensures that only one copy of the attachment is kept in one item type, no matter how many times the same attachment was archived by different users.
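
The relationship between DEI, EI, AI, and DAI described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not how Content Collector or IBM Content Manager implement it: the class and method names (ItemTypeStore, archive, remove_email) are hypothetical, SHA-256 is an arbitrary choice because this guide does not name the hash function, and the hash is computed over the whole email body for simplicity.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DistinctAttachment:
    """DAI: the single stored copy of an attachment."""
    content: bytes
    referenced_by: set = field(default_factory=set)  # keys of DEIs pointing here


@dataclass
class DistinctEmail:
    """DEI: the root item, stored once per unique email."""
    content: bytes
    email_instances: list = field(default_factory=list)       # EI: varying properties per copy
    attachment_instances: list = field(default_factory=list)  # AI: (file name, DAI key)


class ItemTypeStore:
    """Hypothetical stand-in for one email item type plus one attachment item type."""

    def __init__(self):
        self.emails = {}
        self.attachments = {}

    @staticmethod
    def _key(data: bytes) -> str:
        # Assumption: SHA-256 stands in for the unspecified deduplication hash.
        return hashlib.sha256(data).hexdigest()

    def archive(self, email_body, attachments, varying_properties):
        email_key = self._key(email_body)
        dei = self.emails.get(email_key)
        if dei is None:
            # First copy of this email: create the DEI and its AI references.
            dei = DistinctEmail(content=email_body)
            self.emails[email_key] = dei
            for file_name, data in attachments:
                att_key = self._key(data)
                dai = self.attachments.setdefault(att_key, DistinctAttachment(data))
                dai.referenced_by.add(email_key)
                dei.attachment_instances.append((file_name, att_key))
        # Every copy, first or duplicate, gets its own EI entry.
        dei.email_instances.append(varying_properties)
        return email_key

    def remove_email(self, email_key):
        dei = self.emails.pop(email_key)
        for _, att_key in dei.attachment_instances:
            dai = self.attachments.get(att_key)
            if dai is None:
                continue
            dai.referenced_by.discard(email_key)
            if not dai.referenced_by:
                # A DAI is removed only when nothing points to it any more.
                del self.attachments[att_key]


store = ItemTypeStore()
report = b"quarterly report"
k1 = store.archive(b"mail A", [("report.pdf", report)], {"mailbox": "alice"})
k2 = store.archive(b"mail A", [("report.pdf", report)], {"mailbox": "bob"})
k3 = store.archive(b"mail B", [("copy.pdf", report)], {"mailbox": "carol"})
```

In this example, the two copies of "mail A" share one DEI with two EI entries, and both emails share a single DAI even though the attachment file names differ.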

[Figure: Attachment item type, showing an AI that references a DAI (attachment object) identified by its hash]

IBM FileNet P8 data model

In IBM FileNet P8, all documents archived using IBM Content Collector are stored as document objects in an object store. The object store must be dedicated to archiving with IBM Content Collector. The same object store can be used to store email, Microsoft SharePoint, and File System documents, and with object stores that are configured for use of IBM Content Search Services also IBM Connections documents.

FileNet P8 offers two Content Search Engine components, IBM Legacy Content Search Engine and IBM Content Search Services, which you can run in parallel for indexing and search. However, object stores that are used for email archiving with Content Collector must be configured to use either IBM Legacy Content Search Engine or IBM Content Search Services.

FileNet P8 data model for IBM Legacy Content Search Engine

This data model is used for storing objects into a FileNet P8 repository for which IBM Legacy Content Search Engine (formerly Autonomy K2 or Verity) provides the indexing and search capability.

To store an email in FileNet P8, the following objects are used:

• A distinct email instance (DEI) that is the root document object for the email consisting of one or more content elements:

– The first content element is the email document from different mailboxes or the journal.

– All subsequent content elements are the attachments.

The ID of the DEI document is based on a unique hash that is used to ensure that the email is stored only once in the repository. It also contains the properties that are common to all duplicates of the email.

• An XML Instance Text (XIT) that is an indexable XML file containing the data of the content elements of the DEI that needs to be indexed for text search. It also contains the search result list properties. These properties are intended for use in the search result list only and contain truncated values. Do not use them for other processing.

With IBM Legacy Content Search Engine, you cannot index the content of documents with more than one content element. The XIT document provides a workaround for this limitation. All data from the email that must be indexed, including the email body and the content of any attachments, is stored in the single content element of the XIT document. Search tools that are compatible with the Content Collector email data model can locate the additional parts of an email, that is, the DEI and the email instances, as soon as they have found the XIT document.

• An email instance (EI) that is a custom object. For each duplicate of an email (mailbox or journal instances of the DEI), one EI tracks the data that is unique to this copy.
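
The idea behind the XIT document described above, combining everything that must be indexed into one content element, can be sketched as follows. This is an illustration only: the function name build_xit, the element and attribute names, and the 64-character truncation limit are all assumptions, because this guide does not describe the actual XIT layout.

```python
import xml.etree.ElementTree as ET

# Assumption: the real truncation length is not stated in this guide.
RESULT_LIST_MAX_LENGTH = 64


def build_xit(dei_id, body_text, attachment_texts, result_list_properties):
    """Combine all indexable email text into a single XML document.

    All element and attribute names here are hypothetical.
    """
    root = ET.Element("xit", {"dei": dei_id})

    # Search result list properties hold truncated values only.
    props = ET.SubElement(root, "resultListProperties")
    for name, value in result_list_properties.items():
        prop = ET.SubElement(props, "property", {"name": name})
        prop.text = value[:RESULT_LIST_MAX_LENGTH]

    # The email body and every attachment text end up in the same document,
    # so an indexer that handles one content element can still see all of it.
    ET.SubElement(root, "body").text = body_text
    for file_name, text in attachment_texts:
        ET.SubElement(root, "attachment", {"fileName": file_name}).text = text

    return ET.tostring(root, encoding="unicode")


xml_text = build_xit(
    "abc123",
    "Hello team",
    [("notes.txt", "Meeting notes")],
    {"subject": "Status"},
)
parsed = ET.fromstring(xml_text)
```

The dei attribute stands in for the back reference that lets a search tool locate the DEI and its email instances once the XIT document is found.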

In addition, an annotation object is created for each duplicate email that is found. The content element of the annotation contains the information that is required to update the XIT object. Annotation objects are deleted as soon as the XIT is updated.

[Figure: Email document object for IBM Legacy Content Search Engine, showing the DEI (hash, common properties) with email object 1 and attachment objects 1 to n; EI objects with varying properties from journal and from mailbox; temporary annotations; and the XIT (text object) with search result list properties and text index]

Email deduplication is provided by Content Collector, whereas attachment deduplication is managed by FileNet P8 or on the storage device layer.

When a DEI is removed, all associated objects will be removed as well. An exception to this is if IBM eDiscovery Manager placed a legal hold on the XIT. In this case, the archived email cannot be deleted. Deletion constraints are put on the DEI and XIT so that an accidental deletion of the XIT is prevented. Attempts to delete the XIT result in an error. This ensures that the indexing for the email is not lost.

To prevent accidental deletion of the DEI, for example, by a Workplace user, an expiration date can be set on the DEI. When an expiration date is set, an event handler checks this property on deletion, and the DEI can be deleted only if the current date is past the expiration date.
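
The expiration check described above amounts to a simple date comparison. The following sketch shows the rule only; it is not the FileNet P8 event handler, and the function name deletion_allowed is hypothetical. It assumes that "past the expiration date" means strictly later than that date and that a DEI without an expiration date is not protected by this check.

```python
from datetime import date
from typing import Optional


def deletion_allowed(expiration_date: Optional[date], today: date) -> bool:
    """Return True if the expiration rule permits deleting the DEI."""
    if expiration_date is None:
        # Assumption: no expiration date set means the check does not apply.
        return True
    # Deletion is permitted only after the expiration date has passed.
    return today > expiration_date


expires = date(2012, 12, 31)
before = deletion_allowed(expires, date(2012, 6, 1))
on_day = deletion_allowed(expires, date(2012, 12, 31))
after = deletion_allowed(expires, date(2013, 1, 1))
unset = deletion_allowed(None, date(2012, 6, 1))
```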

FileNet P8 data model for IBM Content Search Services

IBM Content Search Services can be used with IBM FileNet P8 to index documents and enable search. It is a new approach to full-text indexing optimized for email and compliance solutions. To be able to write and read index information for email stored into FileNet P8 by using IBM Content Search Services, Content Collector requires an email storage data model that is different from the data model that is used with IBM Legacy Content Search Engine. The IBM Content Search Services data model does not require an XML Instance Text (XIT) object to contain the email content to be indexed for text search.

To store an email in FileNet P8, the following objects are used:

• A distinct email instance (DEI) that is the root document object for the email consisting of one or more content elements:

– The first content element is the email from different mailboxes or the journal.

– All subsequent content elements are the attachments.

The ID of the DEI document is based on a unique hash that is used to ensure that the email is stored only once in the repository. It also contains the properties that are common to all duplicates of the email and the search result list properties. The search result list properties are intended for use in the search result list only and contain truncated values. Therefore, do not use them for other processing.

The DEI object is enabled for content-based retrieval and is text indexed.

• An email instance (EI) that is a custom object. For each duplicate of an email (mailbox or journal instances of the DEI), one EI tracks the data that is unique to this copy.

Index information for the DEI and EI objects is created or updated when Content Collector creates, updates, or deletes such an object.

[Figure: Email document object for IBM Content Search Services, showing the DEI (hash, common properties, search result list properties, text index) with email object 1 and attachment objects 1 to n, and EI objects with varying properties from journal and from mailbox]

Email deduplication is provided by Content Collector, whereas attachment deduplication is managed by FileNet P8 or on the storage device layer.