19th Century Pamphlets Online

Digitisation Scoping Study


Credits
Grant Young, Information Services, University of Bristol
The University of Bristol
Consortium of Research Libraries in the British Isles (CURL)
JSTOR
University of Southampton

Commissioned by JISC and CURL


Contents
1. Introduction
1.1 Background to the 19th Century Pamphlets Online scoping study
1.2 Outline of 19th Century Pamphlets Online project
1.3 Summary of scoping study findings

2. Pamphlet collections
2.1 Collection surveys
2.2 Collection profiles
2.3 Recommended strategies

3. Digital datasets
3.1 Digital images
3.2 OCR text
3.3 Metadata
3.4 Production and QA workflow

4. Project workflow
4.1 Workflow
4.2 Work plan
4.3 Work packages

5. Conclusions and recommendations

Appendix A – Survey Sheet

Appendix B – Glossary

1. Introduction
1.1 Background to the 19th Century Pamphlets Online scoping study

A short scoping study was undertaken in July and August 2006 to clarify some
elements of a bid under the JISC Digitisation Programme (Phase 2). This study
received £6,239 in funding from the JISC Digitisation Programme. Several partners
to the bid also made contributions in kind.

1.1.1 Aims
The study had three main aims:

1. To inform the revision of the project proposal for Stage 2 submission, in
particular the methodology and budget

2. To provide information for a detailed Project Plan should the bid be successful

3. To provide information of wider use, e.g. for similar or related projects

1.1.2 Activities
The scoping activities took five main forms:

1. Surveys into the format and condition of the pamphlets and the extent of
duplication between collections

2. Visits to the seven contributing libraries to examine collections and discuss
workflow issues

3. Tests of scanning and OCRing 19th century pamphlets and analogous material

4. Discussions among partners and with external parties (e.g. over workflow,
technical specifications, transportation of original materials and data, and
potential for linking with related projects)

5. Desk research (e.g. into technical specifications)

1.1.3 Deliverables
The initial plan for this scoping study proposed the following deliverables:

■ Selection criteria – see section 2.3.1

■ Digital capture standards – see section 3.1

■ OCR benchmarks – see section 3.2

■ Metadata specifications – see section 3.3

■ De-duplication strategy – see section 2.3.3

■ Workflow diagram – see section 4.1

■ Gantt chart showing how collection consignments might be staged – see
section 2.3.4

■ Short narrative report covering any additional findings – this current document

Some of these deliverables have already been included in the revised project
proposal. They are all included here for completeness and for those who do not
have access to that document.

The University of Bristol and other contributors license all the deliverables of this
study to the JISC non-exclusively and in perpetuity.

1.1.4 Structure of this report


The results of the scoping study are presented in three main sections, covering:

■ Pamphlet collections (Section 2) – provides profiles of the collections along
with strategies for addressing selection, IPR issues, de-duplication, and the
scheduling of collections for scanning

■ Digital datasets (Section 3) – provides details of digital capture, OCR, metadata
and quality assurance

■ Project workflow (Section 4) – describes the overall workflow proposed for
the project, including a workflow diagram, work plan, and description of work
packages

Some brief conclusions and recommendations are made in section 5.

1.2 Outline of 19th Century Pamphlets Online project

This section provides a brief summary of the proposed project for those without
access to the bid document. It provides the context for this report and its
deliverables.

The proposal is for Phase 1 of a larger 19th Century Pamphlets Online project. That
larger project has the vision of providing researchers, learners and teachers with
online access to the most significant pamphlet collections held in UK research
libraries. Phase 1 aims to digitise a proportion of the pamphlets that are held
within CURL[1] libraries and searchable via Copac[2]. This first phase concentrates
on collections with a strong political, economic and social focus, and is subtitled:
Pamphlets as a Guide to the Parliamentary Debates of the 19th Century.

[1] See www.curl.ac.uk
[2] See http://copac.ac.uk/about

The project’s Primary Partners include: seven collection holders (University of
Bristol, Durham University, University of Liverpool, LSE, University of Manchester,
University of Newcastle, UCL); a digital production unit, BOPCRIS[3] (based at the
University of Southampton); a US not-for-profit archiving and delivery service,
JSTOR[4]; and a metadata and resource discovery provider, MIMAS[5]. The project is
led by the University of Southampton on behalf of CURL. In addition to the Primary
Partners, the project has nineteen Associate Partners: other research libraries
who would be willing to contribute collections to later phases of the project.

The project intends to:

■ Provide users with a wide selection of approximately 23,000 digitised 19th
century pamphlets (amounting to approximately 1 million pages) that focus on
political, social and economic issues

■ Implement a consortial scanning operation at BOPCRIS to create a dataset of
digital images, metadata and OCR text

■ Provide a financially and technologically sustainable delivery and preservation
operation through the partnership with JSTOR

■ Deliver the digitised pamphlets to users as a special JSTOR collection, which
will be free to UK FE, HE and secondary schools for at least 25 years and will
enable reuse and repurposing within local learning and research environments
and national learning repositories such as JORUM[6]

■ Provide a sophisticated distributed resource discovery service, enabling users
to access the collection from: (a) the JISC/JSTOR collection; (b) an item-level
search on Copac and on the OPACs of holding libraries; (c) a collection-based
search from the online Guide to 19th Century Pamphlets[7]; and (d) searches on
other resource discovery tools, in particular, Google[8], Google Scholar[9] and via
the CrossRef[10] service.

1.3 Summary of scoping study findings

This scoping study has confirmed that the proposed project is viable, can be
managed within the allotted timescale (2 years, from January 2007-December
2008), and would be extensible. Findings from this study have informed the revision
of the bid for stage 2, particularly its methodology (e.g. standards and work
packages) and its budget (which has risen). Much of the content of this report has
found a place within the revised bid.

[3] See www.bopcris.ac.uk
[4] See www.jstor.org
[5] See www.mimas.ac.uk
[6] See www.jorum.ac.uk
[7] See www.curl.ac.uk/rslpguide/guidehp.htm
[8] See www.google.co.uk
[9] See http://scholar.google.com
[10] See www.crossref.org

The deliverables of this scoping study have taken a very concrete form (criteria,
standards, strategies, diagrams, etc). All the deliverables promised in the scoping
study plan (see 1.1.3) are included in this report, along with several others. Here is
a full list, along with a brief summary of their findings or approach:

1. Results of a survey into the format and condition of the pamphlets (see 2.1.1).
The survey found that many pamphlets would pose challenges for scanning.
It also found a great deal of variation among the collections. An important
finding was that previously estimated page averages were too low, requiring
a reduction in the estimated number of pamphlets to be digitised within this
phase.

2. Results of a survey into the extent of duplication across the collections (see
2.1.2). The survey found significant duplication across some of the collections,
requiring some means of de-duplication, and resulting in a reduction in the
number of pamphlets expected from some collections.

3. Profiles of seven 19th century pamphlet collections (see 2.2). These identify
key issues likely to affect the scheduling and scan-time for the collections.

4. Selection strategy and criteria (see 2.3.1). As five whole collections have been
pre-selected for inclusion in the project, de-selection criteria are outlined, along
with selection criteria for the two remaining collections.

5. Copyright strategy and workflow (see 2.3.2). The proposed strategy requires
library partners to take primary responsibility for identifying and dealing with
any copyright concerns, but the project will provide support.

6. De-duplication strategy and workflow (see 2.3.3). This strategy proposes
de-duplicating at the libraries, before any pamphlets are sent. It proposes a
database to manage the de-duplication workflow.

7. Gantt chart showing scheduling of collections (see 2.3.4). This chart shows
one possible schedule for selecting and digitising the seven collections, based
on their characteristics and the capacity of BOPCRIS.

8. Image capture standards (see 3.1). A mix of bitonal, grey and colour capture
is proposed in order to balance the need to provide easy access to intellectual
content with that of representing the pamphlets as historical objects. The
chosen formats are standard and conform to JISC IE[11] and Minerva[12] technical
guidelines.

9. OCR benchmarks (see 3.2). Based on tests, the project expects to achieve an
average accuracy, per character, of 97-98%. Additional specialist software
would be used to maintain high accuracy levels across more difficult material.

10. Metadata standards (see 3.3). A suite of established and new XML-based
metadata standards is proposed for the archival and delivery datasets. The
chosen standards conform to IE and Minerva guidelines.

11. Description of production workflow and quality assurance (QA) (see 3.4).
Presents an overview of the production workflow, indicating how QA fits into the
workflow.

12. Diagram showing overall project workflow (see 4.1). Presents a graphical
representation of the flow of original pamphlets and digitised datasets,
indicating how these relate to other workflows and work packages.

13. Gantt chart showing work plan (see 4.2). Presents a graphical view of work
packages and associated activities.

14. Description of work packages (see 4.3). Presents a narrative explanation of
work packages.

[11] See www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/standards/
[12] See www.minervaeurope.org/publications/technicalguidelines/tablecontents.htm

Apart from these tangible deliverables, important outcomes of the scoping study
were the good working relationships established between project partners (e.g.
libraries and the project team, BOPCRIS and JSTOR). The partners worked very
successfully together in completing the work required for this study in a short
timeframe. Partners were actively involved in the definition and negotiation of the
approaches chosen for the project. The combination of a determined methodology
and established relationships would be a powerful enabler in any extension of this
work or development of similar joint projects.

While this whole report could be regarded as a set of recommendations for a
specific project, three more general recommendations are made in section 5:

1. That scoping studies such as this are undertaken for all large digitisation projects of
this nature, especially where multiple partners are involved.

2. That the 19th Century Pamphlets Online project, should it proceed, evaluate the
findings and approaches chosen in this study in light of the practical reality of the
project, and disseminate its findings.

3. That this report has a wider application than this current project and should be
made available to others undertaking similar work.

2. Pamphlet collections
This section provides profiles of the collections, including their condition,
duplication with other collections, and factors influencing their selection and
scheduling for scanning. It draws on survey work, visits and correspondence with
those responsible for the collections.

2.1 describes two collection-based surveys and their findings; 2.2 presents brief
profiles of the individual collections, drawing on the surveys, visits and additional
information; 2.3 recommends strategies for addressing selection, copyright,
de-duplication, and the scheduling of the collections, based on information in the
preceding sections. A later section of this report shows how these activities fit
within the overall project work plan and workflow (section 4).

2.1 Collection surveys


2.1.1 Survey of format and condition of pamphlets
The format and condition of the pamphlets will have an impact on: their selection
(e.g. item may be too fragile to send); handling considerations; time required to
scan (e.g. number of pages, tightness of binding); and the quality of the OCR output
achieved (e.g. use of unusual typefaces).

In preparation for the initial project proposal, contributing libraries were asked
to provide information about the size and condition of their collections. Libraries
supplied a total number of pamphlets for their collections. Total page numbers
were then estimated by multiplying the number of pamphlets by an average of 25
pages per pamphlet (for 3 collections) or 35 pages per pamphlet (4 collections, who
said their average would be higher than 25). Based on these calculations, the initial
project bid estimated that 1 million pages (the BOPCRIS capacity for this project)
would approximate 30,000 pamphlets.

As part of the scoping work for the revised project proposal, a survey was
undertaken to collect more accurate page statistics and determine other factors
likely to affect scan time and image or OCR quality.

Methodology
Libraries were asked to check a random selection of 100 items drawn from the
collections chosen for inclusion in this project. In five cases a computer program
was used to generate this selection from lists supplied by the libraries; in two
cases (Durham and UCL) the sampling was made on a systematic basis by library
staff, but without sight of the pamphlet or its title.
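
The two sampling approaches described above can be sketched as follows. This is an illustration only: the report does not say which program was used for the random selection or how the systematic samples at Durham and UCL were spaced, so the function names, the seed and the shelfmark list are all assumptions.

```python
import random


def random_sample(item_ids, k=100, seed=None):
    """Draw up to k distinct items uniformly at random from a library's list."""
    items = list(item_ids)
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(items, min(k, len(items)))


def systematic_sample(item_ids, k=100):
    """Pick items at evenly spaced positions so that about k are chosen.

    The selection depends only on position in the list, so it can be made
    without sight of the pamphlet or its title.
    """
    items = list(item_ids)
    if len(items) <= k:
        return items
    step = len(items) / k
    return [items[int(i * step)] for i in range(k)]


# Hypothetical list of shelfmarks supplied by a library (not real identifiers).
shelfmarks = [f"PAM-{n:05d}" for n in range(1, 1451)]

print(random_sample(shelfmarks, seed=2006)[:3])
print(systematic_sample(shelfmarks)[:3])
```

Either approach returns 100 distinct items from a list of this length. The random draw gives every item an equal chance of selection, while the systematic draw spreads the sample evenly through the shelf order.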

Each library was asked to check a range of characteristics, such as size, location,
condition of binding, and presence of greyscale or colour illustrations. These
criteria were drawn up by Julian Ball (BOPCRIS Manager) and the author. They
were based on experience and some initial testing in Southampton (see Section 3
below).

The following criteria were chosen for the format and condition survey:

1. Overall condition (i.e. would the library be happy to send for scanning)
2. Location (i.e. open shelves, on-site store, on-site archival, off-site store,
off-site archival)
3. Whether bound or separate
4. If bound, whether volume opens flat to 180°
5. If bound, whether volume has loose boards
6. If separate, whether there is margin stitching (i.e. additional stitching to hold
sections together)
7. Whether loose pages
8. Whether obviously missing pages
9. Number of pages
10. Number of pages with grey illustrations
11. Number of pages with colour
12. Presence of foldouts (i.e. intentional foldouts)
13. Presence of folded pages (i.e. folded to fit into a volume)
14. Presence of adverts
15. Presence of annotations
16. Gothic or other unusual typeface used as main body text
17. Existence of multiple copies
18. Closest size (to A5, A4 or A3)

A copy of a page from the survey form is included as Appendix A. Questions 1-18
on this form relate to this survey and follow the order of the criteria listed above.
As the survey began it became clear that some collections included significant
proportions of non-English language pamphlets, so libraries were additionally
asked to gather this information. This was not always possible where the survey
work was already underway.

This survey was undertaken during 4-11 August, with some libraries spending up
to 10 hours on it. This time was provided as a partner contribution to the scoping
study.

LSE pamphlets. Image with permission from the LSE.

Results
The tables below provide findings from the format and condition survey. General
discussion is provided in this section, with the significant characteristics of each
collection detailed in section 2.2 below.

Table 2.1.1:A
Location and page averages (based on answers to questions 2 and 9)

Collection | Location | Number of items checked (see Note 1) | Total pages found | Average pages per pamphlet
Bristol | Off-site store (83%) and on-site archive (17%) | 100 | 4548 | 45.5
Durham (see Note 2) | On-site archive | 94 | 6141 | 65.3
Liverpool | On-site archive | 99 | 4246 | 42.8
LSE | On-site store | 100 | 4564 | 45.6
Manchester | On-site store | 100 | 3466 | 34.7
Newcastle | On-site store | 100 | 2903 | 29.0
UCL | On-site archive | 99 | 4173 | 42.2

Note 1. In some cases fewer than 100 items were checked, due to missing items or time constraints.
Note 2. Durham’s sample included some 20th century items and other forms of publication, which may account for its higher page average.

Table 2.1.1:B
Grey and colour pages (based on answers to questions 9, 10 and 11)

Collection | Pamphlets containing greyscale | Total grey pages | % of total pages that are grey | Pamphlets containing colour | Total colour pages
Bristol | 10% | 108 | 2.4% | 2% | 2
Durham (see Note 1) | 16% | 256 | 4.2% | 3% | 3
Liverpool | 0% | 0 | 0% | 3% | 3
LSE | 6% | 55 | 1.2% | 1% | 6
Manchester | 11% | 38 | 1.1% | 10% | 10
Newcastle | 6% | 22 | 0.8% | 0% | 0
UCL | 7% | 12 | 0.3% | 1% | 2

Note 1. Durham’s sample included some 20th century items and other forms of publication, which may account for the higher number of grey
pages.

Table 2.1.1:C
Binding condition (based on answers to questions 3-6)

Collection | Pamphlets bound in volumes | % of pamphlets in bound vols. which cannot be opened to 180° | % of pamphlets in bound vols. with loose boards | Separate pamphlets | % of separate pamphlets with margin stitching
Bristol | 8% | 63% | 12% | 92% | 18%
Durham | 0% | NA | NA | 100% | 13%
Liverpool | 100% | 2% | 47% | 0% | NA
LSE | 100% | 0% | 5% | 0% | NA
Manchester | 71% | 28% | 7% | 29% | 45%
Newcastle | 100% | 3% | 15% | 0% | NA
UCL | 100% | 0% | 0% | 0% | NA

Table 2.1.1:D
Pamphlets with foldings, annotations and adverts (based on answers to questions 12-15)

Collection | Pamphlets containing fold-outs | Pamphlets with pages folded to fit volumes | Pamphlets containing annotations | Pamphlets containing adverts
Bristol | 5% | 0% | 4% | 12%
Durham | 5% | 0% | 7% | 6%
Liverpool | 5% | 0% | 0% | 3%
LSE | 3% | 2% | 4% | 5%
Manchester | 10% | 0% | 10% | 7%
Newcastle | 1% | 1% | 9% | 24%
UCL | 9% | 5% | 43% | 9%

Pamphlets folded to fit volumes (left) and fold-outs (right). Images with permission from UCL.

Table 2.1.1:E
Pamphlets with unusual typefaces, multiple copies and foreign languages (based on answers to questions 16-17 and an
additional question, see Note 1)

Collection | Unusual typeface for body text | Multiple copies | Non-English language pamphlets (see Note 1) | Non-English languages represented in collection (no. of pamphlets) (see Note 1)
Bristol | 2% | 0% | 1% | French (1)
Durham | 1% | 6% | Not checked | Not checked
Liverpool | 0% | 2% | 7% | French (7)
LSE | 3% | 0% | 23% | German (10); French (9); Dutch (1); Italian (1); Russian (1); Swedish (1)
Manchester | 0% | 5% | 23% | French (14); German (4); Italian (4); Arabic (1)
Newcastle | 4% | 3% | Not checked | Not checked
UCL | 0% | 3% | 1% | French (1)

Note 1. The foreign language check was introduced after the survey had begun when it appeared this would be significant for some collections.
Consequently, not all libraries recorded this information.

Table 2.1.1:F
Size, loose or missing pages and overall suitability for scanning (based on questions 1, 7-8, 18)

Collection | Closest to A5 in size | Closest to A4 | Closest to A3 | Pamphlets with loose pages | Pamphlets with missing pages | Would not send for scanning in current state
Bristol | 100% | 0% | 0% | 19% | 0% | 1%
Durham (see Note 1) | 54% | 44% | 2% | 0% | 0% | 0%
Liverpool | 98% | 1% | 1% | 1% | 0% | 2%
LSE | 91% | 8% | 1% | 0% | 0% | 7%
Manchester | 90% | 10% | 0% | 15% | 0% | 1%
Newcastle | 99% | 1% | 0% | 3% | 0% | 0%
UCL | 100% | 0% | 0% | 1% | 0% | 0%

Note 1. Durham’s sample included some 20th century items and other forms of publication, which may account for the larger sizes.

Discussion
Some care needs to be taken in drawing conclusions based on this survey, since
only 100 (or fewer) items were assessed from each collection. For the smaller
collections (Durham and Liverpool) this represents 6 or 7% of the collection;
but for the larger collections (LSE and Bristol) it is less than 1%. Nonetheless,
it provides better information than the previous estimates and enables some
profiling to be done. Should the project go ahead it would enable some of these
statistics to be checked and would provide an indication of how accurate and useful
this approach to characterising collections is.
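
The caution about sample size can be made concrete with a confidence interval. The report itself does not calculate intervals, so the sketch below is an added illustration: it applies the standard Wilson score interval to one survey figure (19 of the 100 Bristol items had loose pages, Table 2.1.1:F) and assumes the sample behaves like a simple random sample. The function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a sample proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# 19 of 100 sampled Bristol pamphlets had loose pages (Table 2.1.1:F).
low, high = wilson_interval(19, 100)
print(f"Plausible range for the whole collection: {low:.1%} to {high:.1%}")
```

With a sample of 100, the interval around a figure of 19% is well over ten percentage points wide, which is why the percentages in the tables above are best read as indications rather than precise measurements.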

Section 2.2 provides a brief profile of each collection and highlights characteristics
identified by the condition survey that are likely to impact upon scan time. These
and other factors noted in that section have influenced the scheduling of the
collections in 2.3.4 below (both their order and allocation of time).

One of the most important findings from the survey was that the page averages for
these collections were much higher than the 25-35 previously estimated. Because
the project had set the number of pages to capture at 1 million (the BOPCRIS
capacity for this project), this meant reducing the overall number of pamphlets
we would expect to capture. The revised bid recalculated each collection’s page
numbers based on the averages found in this survey (see Table 2.1.1:A above for
averages). When combined with a reduction for anticipated duplication (see next
section), this reduced the overall number of pamphlets from nearly 30,000 to just
over 23,000. Table 2.1.2:B, below, presents new estimates for each collection. Note
that because selections are to be made from the LSE and Bristol the numbers
of pamphlets from these two collections were adjusted until 1 million pages was
achieved.
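
The effect of a higher page average on the number of pamphlets that fit within a fixed page capacity can be checked with simple arithmetic. The sketch below is a rough illustration, not the calculation used in the revised bid: it pools the sample counts from Table 2.1.1:A into a single average, which weights each collection by its sample size and not by its share of the project, and it ignores the reduction for duplication. Function and variable names are assumptions.

```python
PAGE_CAPACITY = 1_000_000  # BOPCRIS capacity for this project, in pages

# (items checked, total pages found) per collection, from Table 2.1.1:A
SURVEY = {
    "Bristol": (100, 4548),
    "Durham": (94, 6141),
    "Liverpool": (99, 4246),
    "LSE": (100, 4564),
    "Manchester": (100, 3466),
    "Newcastle": (100, 2903),
    "UCL": (99, 4173),
}


def pamphlets_for_capacity(avg_pages, capacity=PAGE_CAPACITY):
    """Number of whole pamphlets that fit in the capacity at a given average."""
    return int(capacity // avg_pages)


def pooled_average(survey):
    """Average pages per pamphlet across all sampled items."""
    items = sum(n for n, _ in survey.values())
    pages = sum(p for _, p in survey.values())
    return pages / items


print(pamphlets_for_capacity(25), pamphlets_for_capacity(35))
pooled = pooled_average(SURVEY)
print(round(pooled, 1), pamphlets_for_capacity(pooled))
```

At the originally assumed 25 to 35 pages per pamphlet, 1 million pages corresponds to roughly 28,500 to 40,000 pamphlets. At the pooled survey average the figure drops to about 23,000, which is in the same region as the revised estimate even though the bid calculated its figures collection by collection.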

2.1.2 Survey of duplication across collections

Published pamphlets are not unique items and with seven contributing libraries
some level of duplication is to be expected. It is important for the project to gauge
the extent and pattern of that duplication in order to determine: (a) the best
place within the workflow to de-duplicate, (b) the best order in which to scan the
collections, and (c) the likely reduction in number of scanned pamphlets due to
de-duplication. Estimates of duplication from those familiar with this material
ranged from 20%-60% (across Copac as a whole), so a more accurate measure
was required.

Volumes from the Hume Tracts with a title page annotated by Hume. Images with permission from UCL.

Methodology
It was hoped that duplication could be gauged by an automated means using the
CURL[13] or Copac[14] databases. This did not prove possible due to the existence
of multiple records and the lack of time and resources to develop suitable tools
for comparison. A part of the work of MIMAS in Work Packages 3 and 8 of the
project (see 4.3 below) will be to develop such tools to enable all libraries holding
a pamphlet that has been digitised to be provided with links. In the absence of an
automated means, the same 100 item sample used for the condition survey (2.1.1
above) was checked by libraries against Copac for duplicates: (a) across the six
other libraries contributing to Phase 1 and (b) against all holdings on Copac. This
check was included as questions 19 and 20 on the survey form (see Appendix A).

[13] See www.curl.ac.uk/database
[14] See http://copac.ac.uk

The duplication survey took place alongside the format and condition survey,
during 4-11 August. In some libraries the same staff members completed both
surveys; in others, special collections staff undertook the format and condition
survey (questions 1-18) and cataloguing staff undertook the duplication survey
(questions 19-20). Checking 100 records on Copac took up to 6 hours for some
libraries. This time was provided as a partner contribution to the scoping study.

Results
The table below presents findings from the duplication survey. General discussion
is provided in this section, with comments relating to individual collections
discussed in section 2.2 below.

Table 2.1.2:A
Duplication survey results (based on questions 19 and 20 of survey form)

Collection | Unique on Copac | Duplicated within any partner library | Duplicated within individual partner libraries | Duplicated within any non-partner libraries
Bristol | 23% | 32% | LSE (19%); Manchester (10%); Liverpool (6%); Newcastle (3%); Durham (1%); UCL (0%) (see Note 1) | 70%
Durham | 41% | 40% | LSE (20%); Bristol (19%); Manchester (18%); Liverpool (4%); Newcastle (4%); UCL (1%) (see Note 1) | 53%
Liverpool | 24% | 45% | Bristol (28%); LSE (24%); Manchester (16%); Durham (4%); UCL (2%) (see Note 1); Newcastle (0%) | 61%
LSE | 37% | 40% | UCL (22%) (see Note 1); Bristol (13%); Manchester (9%); Durham (5%); Liverpool (3%); Newcastle (3%) | 50%
Manchester | 44% | 21% | Bristol (10%); LSE (8%); Liverpool (5%); Newcastle (2%); Durham (1%); UCL (1%) (see Note 1) | 55%
Newcastle | 25% | 44% | Bristol (28%); LSE (18%); Liverpool (8%); Manchester (8%); Durham (2%); UCL (0%) (see Note 1) | 61%
UCL (see Note 1) | 32% | 28% | LSE (12%); Manchester (10%); Bristol (4%); Durham (3%); Liverpool (3%); Newcastle (1%) | 57%

Note 1. There has been a delay in loading some of UCL’s records into the CURL/Copac databases, so the duplication with its collection is not
fully represented here. UCL’s records will be loaded before the project commences.

Discussion
As previously mentioned, care needs to be taken in drawing conclusions based
on this survey, since only 100 (or fewer) items were assessed from each collection.
Nonetheless, it provides better information than previous guesses and enables
some estimation to be done. Should the project go ahead it would enable the
accuracy and usefulness of this approach to be evaluated.

Between 56 and 77 percent of these collections were found to be duplicated within
at least one of the other 23 libraries represented on Copac. However, this still
leaves a significant amount of material that is unique to each of these collections
(23-44%).

Between 21 and 45 percent of the pamphlets in these collections were found to be
duplicated within at least one of the other 6 libraries taking part in Phase 1 of the
19th Century Pamphlets Online project. However, it is important to note here that
the collections being digitised are a subset of each library’s 19th century pamphlet
holdings. For example, Manchester’s Foreign and Commonwealth collection only
represents about 21% of its total estimated holdings of 19th century pamphlets
on Copac (17,000). So the 10% of UCL’s Hume Collection found duplicated with
Manchester’s holdings is unlikely to equate to a 10% duplication for this particular
project.

There is a fairly high level of duplication with the LSE and Bristol, which have large
19th century pamphlet collections (an estimated 15,989 and 22,150, respectively,
on Copac). However, the project intends to make a selection from these two
collections rather than capture them in their entirety, so the overlap with these
collections can be compensated for in the selection process.

It is important to note, however, that the records for UCL’s 19th century pamphlets
are not yet fully loaded onto the CURL and Copac databases. It is likely that there is
higher duplication with this collection than is apparent in the survey. UCL expect to
have their full records on Copac before the project start date (January 2007).

For the purposes of putting together the revised bid we have assumed a duplication
of half of the total amount found ‘duplicated within any partner collection’ for the
Durham, Liverpool, Manchester, Newcastle and UCL collections and reduced
the number of pamphlets we expect to capture accordingly. It may be that this
reduction is larger than is necessary and we would seek to refine these estimates
as the project proceeds. As selections are being made from Bristol and LSE,
their numbers were not reduced for duplication, but were adjusted upwards to
compensate for the reduction across the other collections, in order to make the
collection up to 1 million pages.
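
The adjustment described above can be worked through for two of the collections. The sketch below is illustrative only: the initial pamphlet numbers come from Table 2.1.2:B, the partner duplication percentages from Table 2.1.2:A, and the page average from Table 2.1.1:A. The function names are hypothetical, and the sketch covers only Durham and Liverpool. It does not attempt the other collections, whose revised figures may reflect further adjustments.

```python
def reduce_for_duplication(pamphlets, partner_dup_pct):
    """Reduce a pamphlet count by half of the partner duplication percentage.

    A 40% partner duplication therefore gives a 20% reduction. The arithmetic
    uses whole-number percentages to avoid floating point rounding surprises.
    """
    return round(pamphlets * (200 - partner_dup_pct) / 200)


def estimated_pages(pamphlets, avg_pages):
    """Estimated pages: pamphlet count times survey page average."""
    return round(pamphlets * avg_pages)


durham = reduce_for_duplication(1450, 40)     # initial bid 1,450; 40% duplicated
liverpool = reduce_for_duplication(1560, 45)  # initial bid 1,560; 45% duplicated

print(durham, liverpool)
print(estimated_pages(liverpool, 42.8))       # Liverpool survey average 42.8
```

For these inputs the sketch gives 1,160 pamphlets for Durham, and 1,209 pamphlets and 51,745 pages for Liverpool, which agree with the revised-bid figures for those collections in Table 2.1.2:B.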

The table below shows the combined effects of the higher page averages (see 2.1.1
above) and the reductions made to account for duplication. Although it is hoped
that this is closer than the previous estimates, it is likely that further adjustments
will need to be made as the project proceeds and the numbers of pamphlets may
rise or fall.

Table 2.1.2:B
Effects of higher page averages and reduction for duplication

Partner library | Collection | Estimated pamphlets (initial bid) | Estimated pamphlets (revised bid) | Estimated pages (initial bid) | Estimated pages (revised bid)
UCL | Hume Tracts | 4,200 | 3,528 | 147,000 | 148,881
Durham University | Earl Grey Pamphlets | 1,450 | 1,160 | 36,250 | 75,478
University of Liverpool | Knowsley Pamphlet Collection | 1,560 | 1,209 | 39,000 | 51,745
University of Newcastle | Cowen Tracts | 1,974 | 1,579 | 49,350 | 45,796
University of Manchester | Foreign & Commonwealth Pamphlets | 3,662 | 3,149 | 143,000 | 109,281
University of Bristol | National Liberal Club Pamphlets (selection) | 8,500 | 6,250 | 297,500 | 284,375
LSE | Pamphlet Collection (selection) | 8,500 | 6,250 | 297,500 | 285,000
Totals | | 29,846 | 23,125 | 1,009,600 | 1,000,556

2.2 Collection profiles

This section provides profiles for each of the seven collections, including overviews
of the collection content and a list of significant points that emerged from the
surveys, or from visits and discussions with collection contacts. All collections
were visited during the scoping study.

Preferences for transportation, insurance, storage, scheduling of collections,
and other issues, were discussed with collection managers during the visits or
by correspondence. In addition to specific issues included in the profiles below,
several points can be made here that apply to all or most of the collections:

■ Collections were presented with the option of doing their own scanning for this
project, but all are happy to use BOPCRIS as a consortial scanning service

■ Everyone was happy with the suggested transport arrangements (Harrow
Green[15], Momart[16] or equivalent) and most preferred the transport company to
pack the pamphlets

■ Everyone was happy with the level of insurance cover provided by the transport
firms we asked to quote (minimum of £100,000 per consignment) and by
BOPCRIS whilst on premises (minimum of £100,000)

■ Everyone was happy with the storage conditions being offered by BOPCRIS,
which is temperature controlled, has low ultraviolet levels and is only
accessible by authorised personnel

■ No one said they would require the return of original items from BOPCRIS if
requested by users whilst away: each would notify users of their collection’s
absence and be happy to make do with photocopies or digital images (if already
scanned) from BOPCRIS

■ Collections will have the option to receive digital datasets relating to their own
items (images, metadata and OCR), but few intend to take up this option within
the life of the project (Durham, Manchester and possibly UCL)

[15] See www.harrowgreen.com
[16] See www.momart.co.uk

University of Bristol
National Liberal Club Pamphlets (selection from this collection)
Overview of this collection
These pamphlets are from the libraries of, amongst others, Charles Bradlaugh, John Noble, the Liberation Society, the
Land Nationalisation Society and the Cobden Club. There are also many individual items given by W.E. Gladstone and
other prominent politicians. The collection is especially strong on 19th century commerce, economics, finance, politics,
religion and sociology. It includes publications by and about not only the Liberal Party, but also the Conservative and
Labour Parties.
For more information about the collection, please see:
www.curl.ac.uk/rslpguide/BristolNatLibClub.htm.
Key points emerging from surveys, visits and correspondence
■ Although this is a Liberal club collection, its inclusion of other parties' pamphlets will provide a good political
representation
■ Collection is largely held off-site (83%) which is likely to slow the selection
■ Collection contains a lot of separate pamphlets (92%) which is likely to slow scanning
■ There is a high proportion of in-margin stitching (18% of the separates) which will slow scanning and may affect the
quality of imaging (due to page bowing)
■ The high proportion of separates means the collection may be useful in replacing poor quality copies bound into
other collections (we note the high duplication of other collections with Bristol)
■ The proportion of bound volumes is small (8%), but there may be difficulties capturing these because of the
tightness of binding (63% of bound volumes cannot be opened flat)
■ Bristol have no particular timing issues, but the collection will need to be broken into consignments because of the
volume and the added time required for selection
■ The bulk of the Bristol material should be staged towards the end of the project to fill gaps and replace poor
duplicates


Durham University
Earl Grey Pamphlets (all 19th Century pamphlets from collection)
Overview of this collection
This is a family collection accumulated largely by the 2nd, 3rd and 4th Earls Grey. Charles was Foreign Secretary
in 1806-1807 and Prime Minister 1830-1834. Henry George was Under-Secretary for Home Affairs in 1830,
Under-Secretary for the Colonies in 1830-1834, Secretary at War in 1835-1839 and Secretary of State for the Colonies in
1846-1852. Albert Henry George was Administrator of Rhodesia in 1896-1897 and Governor-General of Canada in 1904-1911.
The Greys were strongly interested in parliamentary reform, colonial affairs and Catholic emancipation.
For more information about the collection, please see:
www.curl.ac.uk/rslpguide/searinst.htm#D
Key points emerging from surveys, visits and correspondence
■ The Earl Grey Pamphlets collection is not owned by Durham but on loan from the family. Lord Howick, the current
owner, has given permission for the collection to be included within the project
■ The condition survey found surprisingly high page averages (65), greyscale counts (16% of pamphlets, 4.2% of
pages), and larger sized items (44% were nearer to A4 than A5). Unfortunately the survey was skewed by the
presence of some 20th century material and some books, journal issues, and official publications within the sample
(these form part of the Earl Grey collection). With this material excluded, as this project would do, the number of
items will drop to about 1,250 and these statistics would be likely to change.
■ None of Durham’s pamphlets are bound: all are held separately in archive boxes, requiring special storage and
handling and slowing the scanning
■ There is a reasonable proportion of in-margin stitching (13% of items) which will slow scanning and may affect the
quality of imaging (due to page bowing)
■ Although not asked for in the survey criteria, the person completing the form for Durham noted that 7 items (out of
94) contained uncut pages. Southampton will need to cut these pages for scanning or seek duplicates (Durham have
approved cutting)


University of Liverpool
Knowsley Pamphlet Collection (all 19th Century pamphlets from collection)
Overview of this collection
This is a family collection, reflecting the careers of the Earls of Derby. Edward George was successively Irish Secretary
(1830-33), Colonial Secretary (1833-34 and 1841-44) and three times Prime Minister (1852, 1858-59 and 1866-68). His
career was summarised by Disraeli as follows: “He abolished slavery, he educated Ireland, he reformed parliament”.
His son, Edward Henry, 15th Earl of Derby (1826-1893) was Colonial Secretary and later Indian secretary in his father’s
administration of 1858-59.
For more information about the collection, please see:
www.curl.ac.uk/rslpguide/LiverpoolKnowsley.htm
Key points emerging from surveys, visits and correspondence
■ The Knowsley collection is 100% bound in volumes which are uniform (consistent binding, pages trimmed to size)
and open easily, factors which could speed the scanning process
■ However, a large proportion of these pamphlets (47%) are in volumes with loose boards, which will require special
care and handling and the use of a specialist scanner
■ Although there is high duplication, this is largely with Bristol and LSE and so can be avoided through the selection
and de-duplication processes
■ This collection would need to be scheduled in 2007, because the collection is due to be wrapped for building work in
2008
■ Liverpool would prefer that the collection goes after Easter, since the first part of the year is the busiest for this
material, but will be flexible

LSE
Pamphlet Collection (selection from this collection)
Overview of this collection
This collection includes a comprehensive set of material from political parties, including election manifestos and
political cartoons. Issues in British political history include the corn laws, land question, Church and state and home
rule for Ireland. There is a wealth of material on the co-operative movement, including the Cooperative Women’s Guild,
and from long-standing pressure groups such as the Fabian Society and organisations which have long disappeared
such as the Cobden Club, the Imperial Federation Defence Committee, the Poor Law Reform Association, the
Workhouse Visiting Society and the Liberty and Property Defence League.
For more information about the collection, please see:
www.lse.ac.uk/library/pamphlets/
Key points emerging from surveys, visits and correspondence
■ The LSE pamphlet collection is 100% bound, which is good for scanning but not so good for selection – we will try to
select by the volume
■ The condition survey suggested that a high proportion of pamphlets might not be available for scanning due to their
condition (7%), but this can be avoided through the selection process
■ There is a high proportion of foreign language pamphlets, but this is also likely to be avoided through the selection
process (which will focus on UK debates)
■ Work is planned for the store where the pamphlets are housed in either 2007 or early 2008 – this will fit well with
the project because LSE needs to be scheduled late in the project in order to pick up duplicates and fill gaps around
other collections
■ As this is a larger collection, it will need to be broken into consignments


University of Manchester –
Foreign & Commonwealth Office Pamphlet Collection (all 19th Century pamphlets from collection)
Overview of this collection
This collection is on deposit from The Foreign and Commonwealth Office (FCO). It comprises two earlier collections.
(1) The Foreign Office pamphlet collection, consisting largely of pamphlets acquired by British ambassadors overseas
and sent back to London as being of value for the formulation of policy. This collection is rich in material from South
America (where the British government was the formal arbitrator in boundary disputes), the Near East (both the
last century of the Turkish Empire and the growth of Zionism) and the various great European “Questions”, from the
Congress of Vienna through to the aftermath of the creation of the German Empire. (2) The Colonial Office pamphlet
collection, consisting chiefly of local imprints including, e.g., unique early Australiana.
For more information about the collection, please see:
www.curl.ac.uk/rslpguide/ManchesterForeign.htm
Key points emerging from surveys, visits and correspondence
■ The Foreign and Commonwealth Office pamphlet collection is not owned by the University of Manchester but is on
loan from the FCO. Permission has been obtained for the collection to be included within the project
■ The larger Foreign Office component in this collection is bound (71%) and often in tight binding (28%), which will
require special scanning
■ The smaller Colonial Office component is unbound, but was held in folders with pillars, which have occasionally
punched through the text – this might necessitate replacement with duplicates, if available
■ A very large proportion of the separate pamphlets also have in-margin stitching (45%), which will slow scanning and
may affect the quality of imaging (due to page bowing)
■ During the visit, many older font styles were found (e.g. long s’s, diphthongs, and ligatures) – these will require the
use of specialist OCR software
■ There is a high proportion of foreign language material (23%), including non-Latin scripts (1%). Specialist OCR
software will be required for foreign languages, but it cannot recognise the pamphlets in non-Latin scripts.
■ This collection contains a higher proportion of foldouts and (consequently) colour pages than other collections
– these will require specialist scanning
■ There is a high proportion of annotations (10%), which will require greyscale scanning
■ There is a higher proportion of unique items than in other collections, so there is less reduction for duplication,
but also less chance of finding replacements for poor copies


University of Newcastle –
Cowen Tracts (all 19th Century pamphlets from collection)
Overview of this collection
This is a personal collection of Newcastle-born Joseph Cowen (1829-1900), Member of Parliament (MP) and social
reformer. On his father’s death in 1873, he was elected in his place as MP for Newcastle and, though he came into
conflict with both the Parliamentary and local Liberal parties, he remained MP for Newcastle until he retired in 1886.
The Cowen Tracts date, in the main, from Cowen’s active years of the late 1840s to early 1880s, though there is some
earlier and later material. The topics covered largely reflect his main interests of social, educational and economic
issues.
For more information about the collection, please see:
www.curl.ac.uk/rslpguide/NewcastleCowenTracts.htm
Key points emerging from surveys, visits and correspondence
■ The Cowen collection is 100% bound (in three types of binding) and will generally open flat (3% will not)
■ There are some potential issues with loose boards (15%), which will require special care and handling and a
specialist scanner
■ There is a high proportion of annotations (9%), which will require greyscale scanning
■ There are some unusual typefaces (4%) which will require use of special OCR software
■ Newcastle is happy for the collection to be done in one consignment and there are no particular timing issues

UCL –
Hume Tracts (all 19th Century pamphlets from collection)
Overview of this collection
This is the personal collection of Joseph Hume (1777-1855), Radical Member of Parliament. Its broad subject-matter
reflects the major political, economic and social developments and reforms taking place in Britain in the early part
of the nineteenth century, and includes some of the causes championed by Hume during his parliamentary career,
such as universal suffrage, Catholic emancipation, a reduction in the power of the Anglican church and an end to
imprisonment for debt.
For more information about the collection, please see:
www.curl.ac.uk/rslpguide/UCLHume.htm
Key points emerging from surveys, visits and correspondence
■ The Hume collection is very well bound (100%), with no loose boards or issues with opening flat
■ A very high percentage of the pamphlets are annotated (43%), which will require grey or colour scanning, take longer
to scan, and may pose some OCR issues
■ There are some personal items (e.g. letters) bound between pamphlets – these will be excluded from the project
■ The collection contains a high proportion of fold-outs (9%), which will require specialist scanning
■ There is a low level of duplication with other partner libraries suggesting this is a fairly unique collection
■ The entire collection is pre-1855, so there are no copyright issues for this collection.

The key points identified for these collections have influenced the proposed
scheduling made in section 2.3.4 below.


2.3 Recommended strategies

During the scoping study we focused on four issues of particular importance in
ensuring a good selection and flow of pamphlets to BOPCRIS for scanning: (1)
selection of material; (2) addressing any copyright issues; (3) addressing the
duplication across the collections; and (4) the scheduling of the collections for
scanning. This section presents strategies for dealing with each of these issues.
These would be firmed up in the project plan and further amended as necessary
once the project was underway.

2.3.1 Selection strategy


As stated above (section 1.2), an aim of this project is to provide users with a wide
selection of pamphlets that focus on the political, social and economic issues
of the 19th century. Instead of just digitising 23,000 pamphlets relating to this
theme (which could have been achieved through a combination of two of the larger
collections), this project has chosen to focus on capturing as much as is possible
of a number of smaller collections associated with individuals or families (Durham,
Liverpool, Newcastle and UCL) or organisations (Manchester), and to supplement
these with pamphlets drawn from larger collections (Bristol and LSE). There are
several advantages of capturing whole collections:

■ In addition to understanding the motivation of pamphlet producers, users will be
able to appreciate the motivations of their collectors. As seen in the format and
condition survey above, some of these collections include significant levels of
annotation by their collectors.

■ Although these collections have a political, economic or social focus, they
include material relating to the wider interests of their collectors, providing a
good overview of the whole pamphlet literature.

■ The collections can be linked with the collection-level descriptions in the online
Guide to 19th Century Pamphlets;17 and found in collection-based searches.

■ By including a larger number of library collections within this project, the
workload is spread, the benefits are shared more widely (e.g. digital copies for
libraries), and more can be learned to support the extension of the project to
other collections in later phases.

De-selection criteria
For five collections, then, the appropriate strategy is not a selection strategy, but
a de-selection strategy. Although the goal is to capture as much as is possible of
these collections, the project anticipates de-selecting some pamphlets from digital
capture, for these reasons:

■ There are copyright concerns (see Copyright Strategy, 2.3.2 below)

■ The pamphlet was published outside the bounds of the 19th century (either
earlier or later). We note that extending into the 20th century would greatly
increase copyright issues.

■ The pamphlet has already been captured or is better captured from another
collection (see De-Duplication Strategy, 2.3.3 below)

■ The pamphlet is too fragile to send for scanning.

17 See at www.curl.ac.uk/rslpguide/guidehp.htm

Note that the project will not be deselecting on the basis of difficulties in scanning
or OCR. BOPCRIS has specialist scanners that can capture any material a library is
willing to send.

Selection criteria
For the remaining two collections (Bristol and LSE), a positive selection of
pamphlets will be made and will be based on the following criteria:

■ Their relevance to themes of the great 19th century debates (e.g. universal
suffrage, relationship of church and state, colonial policy), as identified by
the collection curators, and the academics and teachers involved in the
management and steering groups.

■ Their usefulness in addressing gaps in the digital collection (e.g. themes not
well covered, formats not represented, particular authors who should have
a voice). Again, these gaps will be identified by curators, researchers and
teachers.

■ Feedback and demand from collection users: the bulk of Bristol and LSE
material will be selected later in the project, by which stage there will be
material available online and the possibility of tracking usage and surveying
users.

■ Replacements for copies held in the smaller collections that are in too poor or
fragile a state to scan.

2.3.2 Copyright Strategy


Copyright is the primary IPR issue facing a 19th century collection of published
pamphlets. Although these collections are owned by their libraries (or have been
included in the project with owners’ permission), this ownership or permission
does not automatically give libraries a right to copy them. For most of the
pamphlets copyright will have expired and there will be no barrier to copying. But it
is anticipated that a small proportion of the late 19th century material within these
collections will still be within copyright and require either permission or exclusion.
Project partners have some experience in these issues, but the project will seek
advice during Work Packages 1 and 2 (see section 4.3 below).

We expect to identify three classes of material:

■ Class 1. Items clearly out of copyright (based on their age or anonymity)

■ Class 2. Items potentially within copyright due to age but of unknown status
(e.g. because the death dates of the author or identity of their inheritors are not
easily discoverable)

■ Class 3. Items known to be in copyright (based on age or the known dates of the
author)


The following approach may be adopted:


■ Libraries can send any Class 1 materials without concern
■ Libraries must make the decision whether to send any Class 2 materials
and bear the full responsibility for this decision: the project will expect them
to make efforts to determine the status of these materials and require them to provide
a suitable indemnification (this will form part of a Memorandum of Understanding)
■ Libraries must not send any Class 3 materials for scanning without obtaining
appropriate copyright clearance. The project will provide some assistance to
libraries, including a suitable licence.

The following diagram shows a possible workflow for determining a pamphlet’s
copyright status.

Figure 2.3.2
Suggested Copyright Workflow

1. Is author known or discoverable through reasonable enquiry?
   No: copyright expires 70 years after publication.
   Yes: go to 2.
2. Is their death date known?
   Yes: go to 3.
   No: go to 4.
3. Was death before 1937? (2007-70yrs)
   Yes: Class 1: Out of copyright – library can send.
   No: Class 3: In copyright – library must not send without clearing.
4. Was work published before 1857? (Assumes author published at 20 and lived to 100)*
   Yes: Class 1: Out of copyright – library can send.
   No: Class 2: Might be in copyright – library takes responsibility and liability.

*This is a conservative cut-off – the project will seek legal advice on this.

Note that this chart assumes a conservative cut-off publication date of 1857 (150yrs)
in case of an uncertain death date. We are aware of other projects that are using 120
years (i.e. 1887) as their cut-off and will seek advice on this during the initial stages of
project planning. It is anticipated that the vast majority of pamphlets in this collection
will belong to Class 1. Many of the pamphlet records checked on Copac include dates
for their authors, which will assist in determining the pamphlet’s copyright status.
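
The decision steps in Figure 2.3.2 can be written as a short function. The Python sketch below is illustrative only and is not legal advice: the names `CopyrightClass`, `classify` and `unknown_death_span` are hypothetical, 2007 is taken as the reference year as in the figure, and the handling of anonymous works is an assumption (the figure states only that copyright expires 70 years after publication).

```python
from enum import Enum
from typing import Optional

REFERENCE_YEAR = 2007   # year used for the calculations in Figure 2.3.2
TERM_YEARS = 70         # copyright term in years, as stated in the figure


class CopyrightClass(Enum):
    CLASS_1 = "Out of copyright: library can send"
    CLASS_2 = "Might be in copyright: library takes responsibility and liability"
    CLASS_3 = "In copyright: library must not send without clearing"


def classify(publication_year: int,
             author_known: bool,
             death_year: Optional[int] = None,
             unknown_death_span: int = 150) -> CopyrightClass:
    """Classify a pamphlet following the questions in Figure 2.3.2."""
    expiry_cutoff = REFERENCE_YEAR - TERM_YEARS  # 1937

    if not author_known:
        # Copyright expires 70 years after publication.
        # Assumption: expired means Class 1; anything later is treated
        # as Class 3 (this case is not shown in the figure).
        if publication_year < expiry_cutoff:
            return CopyrightClass.CLASS_1
        return CopyrightClass.CLASS_3

    if death_year is not None:
        # Author's death date is known: was death before 1937?
        if death_year < expiry_cutoff:
            return CopyrightClass.CLASS_1
        return CopyrightClass.CLASS_3

    # Death date unknown: fall back on a conservative publication cut-off
    # (150 years gives 1857; 120 years would give 1887).
    if publication_year < REFERENCE_YEAR - unknown_death_span:
        return CopyrightClass.CLASS_1
    return CopyrightClass.CLASS_2
```

Calling `classify(1870, True)` would return Class 2 under the 150-year cut-off, whereas `classify(1870, True, unknown_death_span=120)` would return Class 1, which shows how sensitive the Class 2 workload is to the cut-off chosen.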


If, despite this strategy, issues arise with any of the pamphlets, the project would
seek to remove them from the digital collection.

2.3.3 De-duplication Strategy


As the duplication survey above has indicated (see 2.1.2), there is likely to be some
duplication across these collections, although it is difficult to gauge the exact
extent.

Because the project is selecting from the two larger collections (Bristol and LSE)
there is an opportunity to de-duplicate these at the selection stage – provided there
is some means of catching duplication when the selection is being made.

There are several approaches that might be taken to de-duplication for the smaller
collections:

1. Do not de-duplicate. This may make sense given the focus on individual
collections and the level of annotations of some pamphlets. However, this must
also be balanced against the project’s aim to create as large and accessible a
selection of pamphlets as possible. Where there are a handful of personally
annotated copies in different collections, the occasional duplicate may be
permitted. But as a general policy, the approach of ignoring duplication would
lead to wastage and reduce the number of unique resources that could be
captured and delivered.

2. De-duplicate at library, before sending for scanning. Advantages of
de-duplicating at the library are that pamphlets are not handled, transported, and
removed from circulation unnecessarily. Disadvantages in de-duplicating at the
library are that a comparison between different copies will not be possible, and
the library going first in the queue will have more of their collection scanned
than those following. There needs to be some mechanism for the LSE and
Bristol to select around the other collections, so a database tool developed for
this purpose could also be used to manage de-duplication between the smaller
collections.

3. De-duplicate at BOPCRIS, before scanning. This approach has the advantage
of enabling a comparison between physical copies. However, this would require
that a pamphlet is kept until its duplicate arrives, which will take up space,
delay return (and possibly require additional transportation). It would also
require some means of identifying that the pamphlet has a duplicate in another
collection.

4. De-duplicate at BOPCRIS, post-scanning. This approach was suggested in
the initial bid, before the extent of duplication was explored. Although this
does not require the originals to be held at BOPCRIS, it does assume that the
appropriate datasets can be easily retrieved and compared. This may not be the
case if BOPCRIS is sending the data to JSTOR and then deleting it from its own
storage system (BOPCRIS does not have the resources to hold the entire digital
collection at once). This approach also leads to great wastage, since one of the
duplicate pair has been handled, transported and digitised unnecessarily.


There is no perfect solution to de-duplication. However, as a result of this scoping
study the recommended strategy is to de-duplicate at the library, prior to sending the
item to Southampton (Option 2). Library partners have agreed to this approach. It
will require some sort of system (e.g. database) to identify duplicates among the
smaller collections and enable the LSE and Bristol to make their selections with
the other collections in view.

System to manage de-duplication


The project proposes that a database be developed at the beginning of the project
(see WP3 in section 4.3) and maintained by a Project Officer located at BOPCRIS.
We would expect to populate this database with simple bibliographic records
exported from library OPACs or CURL/Copac at the beginning of the project. The
database records would include additional fields to, for example, log the status and
condition of the pamphlets.

As a library prepares its collection (or selection) it searches the database and
identifies any duplicate records in other collections. It then checks the status of
these items.

A pamphlet record can have one of several statuses:

1. Not checked – the default; for items that have not yet been checked or selected
from a collection

2. Missing – pamphlet could not be found

3. Not suitable – pamphlet is too fragile to send

4. Sent – pamphlet is waiting to be sent or in transit

5. Received – pamphlet is at BOPCRIS and awaiting scanning

6. Scanned – pamphlet has been scanned and is awaiting signoff and return

7. Returned – pamphlet has been returned

8. Duplicate – not sent

If the librarian finds a duplicate with a status of 1-3, then the item should be sent
(unless that library’s own copy is also missing, not suitable or in poor condition). If
another copy has a status of 4-8, then the item should be returned to its place on
the shelf (or ignored within the volume) and the appropriate status recorded in the
database.
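
A minimal Python sketch of how such a database record and duplicate check might look is given below. It is an illustration under stated assumptions, not the project's design: the field names, the `match_key` rule (normalised title plus year) and the function names are all hypothetical, and a copy already marked "Sent" is treated as handled, in line with Figure 2.3.3.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class Status(IntEnum):
    """The eight statuses listed above, numbered as in the text."""
    NOT_CHECKED = 1
    MISSING = 2
    NOT_SUITABLE = 3
    SENT = 4
    RECEIVED = 5
    SCANNED = 6
    RETURNED = 7
    DUPLICATE_NOT_SENT = 8


@dataclass
class PamphletRecord:
    library: str                           # holding library
    catalogue_id: str                      # that library's catalogue ID
    title: str
    year: int
    status: Status = Status.NOT_CHECKED    # additional field: status
    condition: str = ""                    # additional field: condition note


def match_key(record: PamphletRecord) -> Tuple[str, int]:
    """Hypothetical duplicate key: lower-cased, whitespace-collapsed title plus year."""
    return (" ".join(record.title.lower().split()), record.year)


def duplicates_elsewhere(record: PamphletRecord,
                         database: List[PamphletRecord]) -> List[PamphletRecord]:
    """Records held by other libraries that share this record's key."""
    key = match_key(record)
    return [r for r in database
            if r.library != record.library and match_key(r) == key]


def should_send(record: PamphletRecord, database: List[PamphletRecord]) -> bool:
    """False if another collection's copy is already sent, at BOPCRIS,
    scanned, returned, or itself recorded as a duplicate."""
    return not any(r.status >= Status.SENT
                   for r in duplicates_elsewhere(record, database))
```

In practice, matching on title and year alone would probably be too crude for pamphlets with variant titles or editions, so any real matching rule would need to be agreed with the cataloguers.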

Some libraries may still wish to check their duplicate for any significant
annotations and make a case to the project team for scanning it as well.

The following diagram shows a de-duplication workflow based on the suggestions
above. Note that it makes reference to a record slip. It is recommended that such
a slip accompany all pamphlets to be scanned, and include details of the library
name and catalogue ID number and other information. This will be particularly
useful in marking which parts of a bound volume are to be scanned. The project
would supply a record slip template and recommend that it be printed on an
acid-free or pH-neutral paper.

The diagram below shows how de-duplication might be achieved within the library
workflow using the proposed database.

Figure 2.3.3
Suggested De-duplication Workflow

1. Select next pamphlet from the list (or within volume).
2. Check project database.
3. Has pamphlet already been “sent”, “received”, “scanned”, or “returned”?
   Yes: set status to “duplicate – not sent” in database.
   No: go to 4.
4. Is pamphlet recorded as being “missing” or “not suitable”?
   Yes: tick as appropriate on a record slip, then go to 5.
   No: go to 5.
5. Check your copy of item. Is pamphlet found?
   Yes: go to 6.
   No: set status to “missing” in database.
6. Is pamphlet too fragile to send?
   Yes: set status to “not suitable” in database.
   No: set status to “sent” in database, tuck record slip within pamphlet, and put for sending.
7. Send pamphlet to BOPCRIS.
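
The library-side decisions in Figure 2.3.3 reduce to a small function that returns the status a library would record for its own copy. The Python sketch below is one possible reading of the figure, with a hypothetical function name and plain strings for the statuses. It does not model the record slip.

```python
from typing import Iterable

# Statuses of another collection's copy that mean the pamphlet is already handled.
ALREADY_HANDLED = {"sent", "received", "scanned", "returned"}

DUPLICATE = "duplicate – not sent"
MISSING = "missing"
NOT_SUITABLE = "not suitable"
SENT = "sent"


def status_for_own_copy(other_copy_statuses: Iterable[str],
                        found: bool,
                        too_fragile: bool) -> str:
    """Return the status to record for this library's copy of a pamphlet."""
    # Has the pamphlet already been sent, received, scanned or returned elsewhere?
    if ALREADY_HANDLED & set(other_copy_statuses):
        return DUPLICATE
    # Otherwise check the library's own copy.
    if not found:
        return MISSING
    if too_fragile:
        return NOT_SUITABLE
    # Copy is present and robust enough: it goes to BOPCRIS.
    return SENT
```

For example, a pamphlet whose only other copy is recorded as "missing" would still be sent if the library's own copy is found and is not too fragile.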


2.3.4 Strategy for scheduling collections


The Gantt chart below sets out a possible scheduling of collections for digital
production at BOPCRIS (WP5 in section 4). The time allocations take into account
the size of the collections (i.e. number of pages) with some adjustment for their
condition, based on the profiles in 2.2. The ordering of the collections is also
influenced by the key characteristics identified in 2.2. We have tried to avoid the
preparation of material overlapping or there being more than three libraries’
pamphlets at BOPCRIS at any one time.

We have indicated single consignments for the smaller collections and two
consignments each for the larger collections, although some of the smaller
collections may also be broken in half or sent as two van loads. The collections we
are doing in their entirety are scheduled earlier, and those from which selections
are made (LSE and Bristol) generally occur towards the end of the schedule. This
will enable the selections to be more informed and to minimise duplication. It will
also enable the project to adjust the numbers from LSE and Bristol to ensure that
the 1 million page target is met.

It must be stressed that this is one possible scheduling for the collections. Should
the project go ahead the timetable would be negotiated with library partners.

Figure 2.3.4
A Scanning Schedule

3. Digital datasets
This section describes the production of the digital datasets, focusing on the
technical standards being proposed for this project. It describes: (1) the capture of
the digital images; (2) generation of OCR text; (3) creation of metadata; and (4) the
overall production workflow and Quality Assurance (QA) activities.

In order to inform discussions with JSTOR and decisions about technical
standards, a wide selection of approximately 150 19th century pamphlets were
transported from the University of Bristol to BOPCRIS at Southampton. These were
examined by BOPCRIS, with several pamphlets scanned and others matched with
previous scans of analogous material. The Bristol pamphlets exhibited most of
the characteristics checked in the format and condition survey at 2.1.1 above (e.g.
greyscale information, tight binding, unusual typefaces) and they helped inform the
criteria. Further issues, not represented in the Bristol selection, were observed
during visits to the collections (e.g. early 19th century long “s”s). Some of these
were tested using similar materials drawn from Southampton.

3.1 Digital images

This project will use the full range of book scanners available within the digitisation
laboratory at BOPCRIS:

■ Minolta PS 7000 greyscale scanners – for bitonal or greyscale capture from
volumes or separate pamphlets that can be opened flat – pictured left

■ Digitising Line Suprascan colour scanner – for colour capture and for grey or
bitonal capture from volumes or individual pamphlets requiring special support
(e.g. loose boards or tight binding) – pictured centre

■ Digitising Line robotic colour scanner – for sturdy bound volumes where the
bulk of the volume is being scanned and the pamphlets are of similar paper
weight and have been trimmed to fit the volume – pictured right


Images with permission from BOPCRIS (centre image with permission from BOPCRIS and model)

All BOPCRIS staff are trained in handling materials appropriately and in operating
the scanners. Given the nature of the collections, only a small proportion of the
19th century pamphlets could be captured on the robotic scanner, with most
requiring a PS 7000 or Suprascan scanner. This has increased the cost of digital
capture in the revised bid.

In the course of this scoping study, video- and phone-conferences were held
with JSTOR to discuss the digital capture specifications. These discussions were
informed by sample images BOPCRIS had generated from the Bristol collection or
from similar materials.

JSTOR’s standard approach is to capture pages of text as 600dpi bitonal scans and
pages with grey or colour (e.g. illustrations or photographs) as 300dpi 8-bit grey or
24-bit colour scans. JSTOR’s delivery images are downsized from the TIFF masters
and delivered as GIF images (for text) or JPEG images (where grey or colour is
present). The latter images are often created by overlaying and compositing 8-bit
or 24-bit illustrations with bitonal text.

Project partners considered the possibility of capturing the entire pamphlet
collection in greyscale or colour, in recognition of the pamphlets' value as historic
objects as well as information carriers. However, the larger file sizes would vastly
increase the storage requirements and the download times for users. Although manageable
for 1 million images, it would make it very difficult to scale the collection beyond
this current project. It was decided to stay with the mixed bitonal and grey/colour
specifications proposed in the initial bid, but to capture annotations in grey or
colour where significant.
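
As a rough illustration of the storage argument, the Python sketch below estimates the uncompressed pixel data per page for 600dpi bitonal capture and for 300dpi grey or colour capture. The A5 page size (148 x 210 mm) is an assumption chosen for illustration, the function name is hypothetical, and file headers and any compression are ignored.

```python
import math

MM_PER_INCH = 25.4


def uncompressed_bytes(width_mm: float, height_mm: float,
                       dpi: int, bits_per_pixel: int) -> int:
    """Approximate size of the raw pixel data for one scanned page."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    # Each row of pixels is padded up to a whole number of bytes.
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


# Assumed A5 page, 148 x 210 mm
bitonal = uncompressed_bytes(148, 210, dpi=600, bits_per_pixel=1)
grey = uncompressed_bytes(148, 210, dpi=300, bits_per_pixel=8)
colour = uncompressed_bytes(148, 210, dpi=300, bits_per_pixel=24)

for label, size in [("bitonal", bitonal), ("grey", grey), ("colour", colour)]:
    print(f"{label}: {size / 1e6:.1f} MB ({size / bitonal:.1f}x bitonal)")
```

Under these assumptions a 600dpi bitonal page needs about 2.2 MB, against about 4.3 MB for 300dpi greyscale and about 13.0 MB for 300dpi colour, i.e. roughly two and six times as much per page.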


The table below indicates the digital image standards adopted for this project.

Table 3.1
Image specifications

Page type | Master images (archived by JSTOR; available to contributing libraries and JISC) | Delivery images (JSTOR standard delivery formats)
Pages of text | 100% capture at 600dpi bitonal (1-bit) images saved in uncompressed TIFF 6.0 format. | 760 pixel wide GIF images (with large PDF versions or full size TIFF images downloadable)
Pages with grey or colour information (including significant annotations) | 100% capture at 300dpi grey (8-bit) or colour (24-bit) images saved in uncompressed TIFF 6.0 format. | 760 pixel wide JPEG images (with large PDF versions or full size TIFF images downloadable)

3.2 OCR text

There was some discussion during the scoping study about where it was best
to undertake the OCR work: BOPCRIS or JSTOR. JSTOR generally requires its
vendors to use PrimeOCR [18], while BOPCRIS has Abbyy [19] software built into its
Agora production system [20]. Pages from some of the sample pamphlets were
OCRed and the data was shared with JSTOR. This data was in the .idx file format
that Agora generates using Abbyy, which includes word coordinates (not currently
used by JSTOR). In addition to the standard Abbyy 8.0 software, an ‘Old English’
add-on was trialled during the scoping study. This specialist software is designed
to capture older typefaces.

It was agreed that the OCR was best done by BOPCRIS as a part of its production
workflow, using Abbyy and, where necessary, Abbyy Old English (used selectively
because its high licence cost is based on the number of pages OCRed). JSTOR
would be supplied with both plain text (.txt) and coordinated text (.idx). JSTOR
would use the text files initially, but would be likely to use the .idx files if it moved
to a delivery system that highlights words in the text (it anticipates doing so in
the future).

Several tests were made to determine the level of accuracy achievable from the
sample pamphlets. The tests suggested that a high level was possible for much
of the material, with even difficult texts achieving up to 99.9% character accuracy.
Further tests would be done were the project to proceed, but these initial tests
suggested an average character accuracy of 97-98% for the 600dpi bitonal images
and 300dpi grey/colour images specified for this project. For this project and
material JSTOR would apply an average accuracy level rather than a minimum
acceptance level and were happy with the averages BOPCRIS were obtaining. While
rescanning or re-OCRing may occasionally be necessary, the project does not
envisage having to re-key any data.

[18] See www.primerecognition.com
[19] See www.abbyy.com
[20] See www.agora.de/eng/index.html
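
The difference between an average accuracy level and a minimum acceptance level can be illustrated with a short sketch. The 97% target is an assumption taken from the lower end of the average reported above, and the function name and sample batch are hypothetical; JSTOR’s actual acceptance procedure is not specified in this report.

```python
from statistics import mean


def batch_accepted(page_accuracies, target=0.97, use_average=True):
    """Return True if a batch meets the target character accuracy.

    use_average=True  -> the mean accuracy of the batch must reach the target
    use_average=False -> every single page must reach the target
    """
    if not page_accuracies:
        raise ValueError("batch contains no pages")
    if use_average:
        return mean(page_accuracies) >= target
    return min(page_accuracies) >= target


# Hypothetical batch: nine typical text pages and one difficult title page.
batch = [0.98] * 9 + [0.90]
print(batch_accepted(batch))                     # judged on the average
print(batch_accepted(batch, use_average=False))  # judged on the worst page
```

Under an average-based rule a single poorly recognised title page does not cause the whole batch to be rejected, whereas under a minimum-based rule it would.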


Currently any problems with the OCR are picked up visually and the accuracy levels
determined manually by comparing the OCR output with the original page or its
digital image. BOPCRIS hope to introduce accuracy monitoring software into their
automated workflow at an early stage of this project. This software would make it
easier to detect any issues and change the variables in order to optimise the OCR.
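
The manual comparison described above can be approximated in code. The sketch below is one common way of expressing character accuracy, based on the edit (Levenshtein) distance between a corrected reference text and the OCR output; it is an assumption that this matches the measure used in the scoping study tests, and it is not the accuracy monitoring software referred to above. The sample strings use an error (‘Aud’ for ‘And’) that appears in Table 3.2:B below.

```python
def edit_distance(reference, ocr_text):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions separating the two strings."""
    previous = list(range(len(ocr_text) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ocr_char in enumerate(ocr_text, start=1):
            cost = 0 if ref_char == ocr_char else 1
            current.append(min(previous[j] + 1,          # character missing from OCR
                               current[j - 1] + 1,       # extra character in OCR
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_text):
    """Share of reference characters that were recognised correctly."""
    if not reference:
        raise ValueError("reference text is empty")
    return max(0.0, 1 - edit_distance(reference, ocr_text) / len(reference))


reference = "So do Terrestrial objects bend And toward one common centre tend"
ocr_output = "So do Terrestrial objects bend Aud toward one common centre tend"
print(f"{character_accuracy(reference, ocr_output):.1%}")
```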

The following tables show examples of OCR drawn from an early 19th century
satirical pamphlet. The first two show the same text scanned at 600dpi and 300dpi
(the latter would usually be used where illustrations are present). Both have
OCRed well for this typical text. The third table (3.2:C) shows the title page of this
pamphlet and illustrates the impact of complex fonts. These usually occur on title
pages, where poor OCR will be compensated for by the presence of bibliographic
metadata.

Table 3.2:A
600dpi bitonal using Abbyy 8.0 – 97.6% character accuracy

THE TOTAL ECLIPSE,


Courteous Reader:
This eclipse, Tho’ Moore so deep in science clips, Was not foretold; but it
is said That Moore and Andrews both are dead. So sly withal, so deep and
ohary> Was that Royston Luminary, Never could I, broad or ‘t home, Once in
conjunction with him come. Had I his ultramundane store, I’d write i’ th’ style
of mystic lore, Of Capricorn, of knees, and hams*, Of Cancer, Leo, bulls, and
rams ; Of legs and shoulders, arms and hips, But to proceed with the eclipse.
As in the vast expanse we sec* The orbs all varying in degree, Revolving round
some fixed star, Now nearer us, now from us far; Sustained by Gravitation’s
laws, Enacted by the great First Cause:-*-* So do Terrestrial objects bend And
toward one common centre tend ; Nor let a thought be entertained, The simile
is overstrained. Terrestrial bodies have their sigHs^ One darkens and another
shines; Sometimes we see they change their places, Slide across each other’s
faces ; Sometimes these terrestrial ‘clipses Brown us all like herds of’gypsies:

Table 3.2:B
300dpi greyscale using Abbyy 8.0 – 98.7% character accuracy

THE TOTAL ECLIPSE,


Courteous Reader:
This eclipse* Tho’ Moore so deep in science clips, Was not foretold; but it
is said That Moore and Andrews both are dead. So sly withal, so deep and
ohary> Was that Royston Luminary, Never could I, broad or ‘t home, Once in
conjunction with him come. Had I his ultramundane store, I’d write i’ th’ style
or’mystic lore, Of Capricorn, of knees, and hams, Of Cancer, Leo, bulls, and
rams ; Of legs and shoulders, arms and hips, But to proceed with the, eclipse.
As in the vast expanse we sec* The orbs all varying in degree, Revolving round
some fixed star, Now nearer us, now from us far; Sustained by Gravitation’s
laws, Enacted by the great First Cause:-”* So do Terrestrial objects bend Aud
toward one common centre tend ; Nor let a thought be entertained, The simile
is overstrained. Terrestrial bodies have their sighs^ One darkens and another
shines; Sometimes we see they change their places, Slide across each other’s
faces ; Sometimes these terrestrial ‘clipses Brown us all like herds of gypsies:


Table 3.2:C
300dpi greyscale using Abbyy 8.0 – 75.8% character accuracy

THE TOTAL ECLIPSE:


A GRAND 3Pol(ttco=&gtronomical pjtttonttttott, WHICH OCCURRED IN THE
YEAR 1820;
WITH A SERIES OF IBKK&NB&WI&*B8»
TO DEMONSTRATE Cfte &mtftguration of tje fMattel^
TO WHICH IS ADDED, AN HIEROGLYPHIC, Adapted to these Wonderful Times J
, v\ \> m
<t
m
w
“ All tongues speak of him, and the bleared sights “ Are spectacled to see him.”
LONDON: PUBLISHED BY THOMAS DOLBY, ^STRAND, 30, HOLYWELL
STREET, AND 34, WARDOUR SrKELl.
ONE SHILLING.

3.3 Metadata

BOPCRIS has a German production system called Agora [21], which uses its own
proprietary XML (Extensible Markup Language) metadata format. This metadata
would be customised to ensure that all the necessary information is incorporated.
Once complete, it would be exported and transformed (via software routines)
into standards-compliant XML. This standard metadata would be delivered to
JSTOR for archiving and delivery to libraries or JISC (on request) and for further
transformation into JSTOR’s own metadata standard for delivery to end users.

The table below details the metadata standards adopted for this project.

[21] See www.agora.de/eng/index.html


Table 3.3
Metadata specifications

Metadata type | Metadata standard | Comments
Bibliographic Metadata | MODS (Metadata Object Description Schema) [22] or MARCXML [23] | BOPCRIS would take bibliographic records from the CURL or Copac databases: probably in the MODS format, but possibly as the more extensive MARCXML. MODS (currently version 3) is a good choice for this project because: (1) Copac is moving to MODS by the end of 2006; and (2) MODS was especially developed to hold a simplified set of MARC data for use within digital library collections.
Technical Metadata | MIX (NISO Metadata for Images in XML) [24] | BOPCRIS would probably use selective elements from the new MIX standard to record information about the digital images. MIX is an encoding of the very extensive NISO data dictionary (Z39.87) [25]. MIX is still in draft (currently version 0.2).
Preservation Metadata | PREMIS (Preservation Metadata Implementation Strategies Working Group) [26] | BOPCRIS would use elements of the PREMIS data dictionary, which is being widely adopted as a means of recording information to support the preservation of digital resources. PREMIS is currently still undergoing a period of trial use.
Structural Metadata | METS (Metadata Encoding & Transmission Standard) [27] | BOPCRIS would use METS, which is now an established standard for structuring complex digital resources (e.g. publications with multiple pages) and wrapping other sets of metadata. It is often used with MODS.
Delivery Metadata – JSTOR | NLM (National Library of Medicine) [28] | JSTOR are migrating to the NLM standard, which is widely used for journals but also has an adaptation for monographs, such as pamphlets. BOPCRIS would work with JSTOR to develop a suitable mapping (via a DTD or schema) to enable the rich archival metadata set (MODS, MIX and PREMIS in METS) to be transformed into JSTOR’s delivery metadata standard.
Delivery Metadata – Copac and OPACs | MODS, MARCXML or MARC21 | The project will enable access to the digitised pamphlets via Copac and the OPACs of holding libraries. This is achieved by adding a link to existing MODS, MARCXML or MARC21 records.

[22] See www.loc.gov/standards/mods
[23] See www.loc.gov/standards/marcxml
[24] See www.loc.gov/standards/mix
[25] See www.niso.org/standards/resources/Z39_87_trial_use.pdf
[26] See www.loc.gov/standards/premis
[27] See www.loc.gov/standards/mets
[28] See http://dtd.nlm.nih.gov/tag-library/2.1/index.html
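
As an illustration of how METS can wrap MODS and describe a multi-page item, the sketch below builds a minimal METS document using Python’s standard library. It is not the output of the Agora export routines, which were still to be developed; the function name, identifiers and file names are hypothetical, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("mods", MODS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def build_mets(title, page_files):
    """Wrap a MODS title and a list of page image files in a METS document."""
    mets = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata section: a MODS record embedded inside METS.
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="MODS")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS}}}title").text = title

    # The file section lists the images; the structural map orders the pages.
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="master")
    struct = ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="physical")
    item = ET.SubElement(struct, f"{{{METS}}}div", TYPE="pamphlet", DMDID="DMD1")
    for order, name in enumerate(page_files, start=1):
        file_id = f"IMG{order:04d}"
        file_el = ET.SubElement(file_grp, f"{{{METS}}}file",
                                ID=file_id, MIMETYPE="image/tiff")
        ET.SubElement(file_el, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": name})
        page = ET.SubElement(item, f"{{{METS}}}div", TYPE="page", ORDER=str(order))
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=file_id)
    return mets


doc = build_mets("The Total Eclipse", ["0001.tif", "0002.tif"])
print(ET.tostring(doc, encoding="unicode"))
```

In a fuller record the technical (MIX) and preservation (PREMIS) metadata would sit in a METS administrative metadata section (amdSec), which is omitted here for brevity.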


3.4 Production and QA workflow

Quality Assurance (QA) is a key part of any digitisation workflow (see TASI’s QA
documentation [29]). For this project, BOPCRIS would undertake QA at several stages
during the production phase, and then further QA would be done by JSTOR when
they receive the dataset.

The workflow follows this order:

1. Pamphlets are logged onto the Agora production system and passed on to
Scanning Operators.

2. Scanning Operators scan each page, checking as they go (the first QA, for
images). Images are rescanned as necessary. Once complete, the set of files
is passed on to Indexers.

3. Indexers check that all the pages are present and that the images are of good
quality (the second QA, for images). If there are any issues they request a
rescan from a Scanning Operator.

4. Indexers initiate the XML generation, which incorporates the data necessary for
later export and transformation into the standards described in 3.3.

5. Indexers identify any non-English language or old English fonts and flag these
in the production system so that the appropriate software settings are triggered
when the images are OCRed. The dataset then enters the automatic OCR
workflow.

6. The production system picks up the images and metadata, OCRs the images,
and generates associated .idx and .txt files.

7. Currently OCR is spot-checked, but it is anticipated that automated OCR-checking
would be introduced at this point in the workflow for this project (the
third QA, for OCR).

8. JSTOR do further QA on the images and OCR (the fourth QA, for images and
OCR), checking an average of 10% of images and OCR files. JSTOR would liaise
with BOPCRIS to address any issues. Because some rescanning might be
necessary, collections would be held at Southampton until signed off by JSTOR.

Figure 4.1 in the next section of this report includes a diagrammatic version of
this workflow.
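
The way an item moves through the QA stages listed above, and is sent back when a check fails, can be sketched as a simple state transition. This is a deliberate simplification and not a description of the Agora production system: the stage names are hypothetical, and every failed check is treated as requiring a rescan, whereas in practice a failed OCR check might only require re-OCRing.

```python
# Hypothetical stage names, in the order described in the workflow above.
STAGES = ["logged", "scanned", "indexed", "ocred", "approved by JSTOR"]


def next_stage(current, qa_passed):
    """Move an item forward one stage, or back to 'logged' for a rescan."""
    if not qa_passed:
        return "logged"  # simplification: every QA failure triggers a rescan
    position = STAGES.index(current)
    return STAGES[min(position + 1, len(STAGES) - 1)]


# One item whose first scan is rejected by the Indexer, then passes every check.
stage = "logged"
for qa_passed in [True, False, True, True, True, True]:
    stage = next_stage(stage, qa_passed)
print(stage)
```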

[29] See www.tasi.ac.uk/advice/creating/quality.html

4. Project workflow
This final section describes the overall workflow for the project, bringing together
several elements from previous sections of this report. It includes: a project
workflow diagram (4.1); a work plan, showing the project’s main activities and work
packages (WP) on a Gantt chart (4.2); and a detailed description of the work to be
undertaken within each work package (4.3).

4.1 Workflow
The following diagram illustrates the overall workflow proposed for this project,
from pamphlet selection to discovery by users. The major work packages (WP4-8)
are also indicated.
Figure 4.1
Project Workflow

Work Package 4 - Libraries
1. Pamphlet is selected and checked on the project database.
2. Is pamphlet de-selected due to copyright? (See copyright workflow in 2.3.2.)
   Yes: move to next item. No: continue.
3. Is pamphlet de-duplicated? (See de-duplication workflow in 2.3.3.)
   Yes: move to next item. No: continue.
4. Prepare pamphlet for transport to BOPCRIS.
5. Pamphlet is transported to Southampton.

Work Package 5 - BOPCRIS (see 3.4 for description of this workflow)
6. Pamphlet is received by BOPCRIS and logged onto production system.
7. Scanning: digital capture of images (QA).
8. Indexing: image QA, metadata generation, OCR initiation (QA).
9. OCR: automatic generation and QA of OCR text (QA).
10. Dataset is transferred to JSTOR via hard disk or FTP.
Return path: pamphlet is signed off, transported back to library and returned
to store.

Work Packages 6-7 - JSTOR
11. Dataset is received and checked by JSTOR (QA).
12. Dataset approved.
13. Master dataset is archived and available to libraries or JISC.
14. Delivery dataset is generated; DOI and URL are prepared.

Work Package 8 – MIMAS
15. Links passed on to MIMAS.
16. Links added to CURL database.
17. Library OPACs include linking catalogue records; Copac database includes
    linking catalogue records.
18. Other resource discovery providers (e.g. Google) include links.
19. User search: digital pamphlet available within JSTOR collection.

4.2 Work plan

The Gantt chart on the next page outlines the likely timing of the main activities
and work packages (WP) proposed for this project. This timetable would be
confirmed in the Project Plan to be prepared under WP2. Section 4.3 describes
each work package in detail.


Figure 4.2
Proposed Work Plan

4.3 Work packages


The Gantt chart in 4.2 above provides an overview of the main activities and work
packages (WP) for this project. This final section provides more detail on each
work package, including its timeframe, leader, objectives, main outputs, and a
discussion of particular issues.

WP1 Pre-Project activities

Timeframe: Oct-Dec 2006
Lead: Southampton
Objective: To enable the project to begin promptly in January 2007.
Main outputs: Staff in post and management groups constituted for a January
project start. Libraries have identified IPR issues and begun any clearance
activities.

Should the bid succeed, some work will be initiated immediately: (a) staffing
secondments and recruitment would be organised; (b) the project management
groups (described in WP2) would be constituted; and (c) library partners would
be asked to identify materials that may be in copyright and begin any clearance
required (see 2.3.2 above for a workflow to identify copyright issues).

WP2 Project Management activities

Timeframe: Jan 2007-Dec 2008
Lead: Southampton
Objective: To ensure that all the Work Packages in the project are managed
coherently and that all the project outputs are delivered within agreed
deadlines and budgets.
Main outputs: detailed Project Plan; MoUs between partners; liaison with
and between all partners; regular reports and other documentation;
attendance at JISC programme meetings; overall QA and risks monitoring;
editing of the project web site; dissemination programme.

This scoping study has provided much ground work for the project and would help
inform a more complete and detailed Project Plan. Other early documentation
would include Memoranda of Understanding (MoUs) between partners and a
project website.

The project would conform to the JISC’s Project Management Guidelines [30] and
closely follow the PRINCE2 methodology [31]. The core Project Team would
comprise: a Project Manager (0.5 – to be seconded from TASI at the University
of Bristol); a Technical Project Manager (0.5 – the current manager of BOPCRIS);
and a Project Officer (1.0 – also from BOPCRIS). The Project Manager would take
responsibility for the overall project monitoring, risk management, reporting,
liaison and dissemination. The Technical Project Manager would be responsible
for managing the production processes, quality assurance, and liaison with other
partners over technical standards. The Project Officer would receive the pamphlets
and track them through the BOPCRIS production system. They would also maintain
the database described in section 2.3.3 above, which is used to aid the selection
and de-duplication of pamphlets. A Software Developer (0.5) would also be
employed to undertake WP3 development and provide support for other packages.

This core team would be supported by two groups: (a) the Project Management
Group, which would include the Project Director (chair), project managers, and
representation from among the partners and JISC, and would meet regularly and
as required to oversee the project and manage any exceptional circumstances; and
(b) the Project Steering Group, which would offer a strategic oversight. The Steering
Group would be chaired externally and would comprise the Project Director and
managers, representation from the partners, JISC, research councils, and senior
members of the research, teaching, and information communities. The Steering
Group would meet at the beginning, middle, and near the end of the project. It
would provide advice and contribute to maintaining a high level of visibility for the
project within the UK and internationally.

[30] See www.jisc.ac.uk/proj_manguide.html
[31] See www.ogc.gov.uk/methods_prince_2.asp


WP3 Development

Timeframe: Jan 2007-Aug 2007
Lead: Various
Objective: To make adjustments to existing systems or develop new
systems to support the major work packages: i.e. WP4-8.
Main outputs: adjustments to the BOPCRIS production system, JSTOR
delivery system, and Copac database; project database for libraries.

A key aspect of this project is its use of existing infrastructure, including
production systems, preservation and delivery systems, and discovery services.
This avoids the considerable expense and delay in developing new systems. Some
development work, however, would be required to ensure an efficient workflow,
better conformity to the emerging standards, delivery of a new kind of collection
(JSTOR), and to enable the linkages that will provide effective resource discovery
(MIMAS). This work would take place at or near the beginning of the project, timed
to ensure that the systems were ready as the data moved to that stage of the
workflow. The only completely new system required would be the database used by
libraries to log pamphlets, note their condition, and check for duplicates. Data from
library OPACs would be used to help populate this database.
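
A minimal sketch of such a database is given below, using SQLite from Python’s standard library. The table layout, the condition values and the duplicate-matching rule (a normalised title plus year of publication) are assumptions made purely for illustration; the actual design would be settled under WP3 and would probably draw on fuller catalogue data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pamphlet (
        id        INTEGER PRIMARY KEY,
        library   TEXT NOT NULL,
        title     TEXT NOT NULL,
        year      INTEGER,
        condition TEXT,          -- e.g. 'good', 'fragile', 'damaged', 'missing'
        match_key TEXT NOT NULL  -- normalised title + year, used to spot duplicates
    )
""")


def match_key(title, year):
    """Crude duplicate key: lower-cased alphanumeric title plus year."""
    normalised = "".join(ch for ch in title.lower() if ch.isalnum())
    return f"{normalised}|{year}"


def log_pamphlet(library, title, year, condition="good"):
    """Log an item and return the libraries that already logged the same key."""
    key = match_key(title, year)
    rows = conn.execute(
        "SELECT library FROM pamphlet WHERE match_key = ? ORDER BY id", (key,)
    ).fetchall()
    conn.execute(
        "INSERT INTO pamphlet (library, title, year, condition, match_key) "
        "VALUES (?, ?, ?, ?, ?)",
        (library, title, year, condition, key),
    )
    return [row[0] for row in rows]


print(log_pamphlet("Bristol", "The Total Eclipse", 1820))
print(log_pamphlet("LSE", "The total eclipse:", 1820, condition="fragile"))
```

A non-empty result tells the second library that a copy has already been logged elsewhere, so its own copy is a candidate for de-selection.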

WP4 Selection and Preparation

Timeframe: Jan 2007-Sep 2008
Lead: Southampton with libraries
Objective: To efficiently select, prepare and transport pamphlets from and
back to libraries.
Main outputs: delivery of selected materials to BOPCRIS for digitisation

The selection and flow of materials from seven libraries to Southampton requires
careful management. This study has explored the issues by assessing the volume
and condition of the pamphlets (both will affect scanning time) and the extent
of duplication across the primary partner collections. It also discussed with
libraries any issues that would affect the scheduling of collections (e.g. whether
the collections should go at once or in batches). As a result of these investigations
we have presented a possible schedule in section 2.3.4 above. The final timetable
would be agreed with libraries at the beginning of the project.

As described in 2.3.3 a database would be created (WP3) and used to enable
libraries to check material already sent and log any missing, fragile or damaged
items. For five libraries, whole collections have been chosen (see 2.1.4). Selections
will be made from the other two (Bristol and the LSE) to complement these and
address feedback from users (see under WP6). Section 2.3.1 above has outlined the
selection and de-selection criteria for this project.


WP5 Production

Timeframe: Mar 2007-Nov 2008
Lead: BOPCRIS
Objective: To create high-quality digital images, metadata and OCR.
Main outputs: Datasets comprising standards-compliant images,
metadata and OCR text.

Section 3 above has described the methodology, technical specifications and
workflow for the project. These were discussed between BOPCRIS, JSTOR and
the author in the course of the scoping study, with several different approaches to
capture, metadata, OCR and Quality Assurance (QA) considered. This discussion
was informed by test images, metadata, and OCR text generated by BOPCRIS from
a sample collection supplied by the University of Bristol.

As described in section 3, BOPCRIS would take responsibility for the digital
capture; generation of technical, preservation and structural metadata; OCR; and
initial QA work. Texts would be captured bitonally, or in greyscale or colour where
relevant (e.g. pages with prints, maps, or annotations). Standards-compliant
XML-based metadata would be generated from the BOPCRIS production system
along with OCR text. Once ready, the data would be transferred to JSTOR in batches
using external hard drives or network transfer (FTP).
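
When batches are moved between institutions on hard drives or by FTP, one common safeguard is to send a checksum manifest with each batch so that the recipient can confirm that every file arrived intact. The report does not specify such a step; the sketch below is an assumption about how it might be done, with hypothetical function names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large TIFF masters need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(batch_dir):
    """Map each file's path (relative to the batch) to its SHA-256 checksum."""
    root = Path(batch_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(batch_dir, manifest):
    """Return the files that are missing or whose checksum has changed."""
    current = build_manifest(batch_dir)
    return sorted(name for name, checksum in manifest.items()
                  if current.get(name) != checksum)
```

The sender would run build_manifest before dispatch and the recipient would run verify_manifest on arrival; an empty list indicates that all listed files match.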

WP6 Delivery

Timeframe: Apr 2007-Feb 2009
Lead: BOPCRIS
Objective: To effectively deliver the collection to users.
Main outputs: Online collection of ca. 23,000 digital pamphlets.

An advantage of using an existing delivery platform is that the collection can be
quickly made available to users. JSTOR would release the collection in batches as
the content is ready. As a result, this project would hope to be delivering digital
content to users within its first year (see Gantt chart in 4.2 above). This would
provide an opportunity to gain information from users to help inform some of the
later selection of pamphlets being undertaken by the LSE and Bristol.

Once the BOPCRIS datasets are received JSTOR would apply their own quality
assurance processes on the image and OCR files, checking an average of 10% of
each. They would work closely with BOPCRIS to address any issues discovered.
Once approved, an archival dataset would be preserved (WP7) and the data
transformed (images resized and metadata mapped) to create a delivery dataset
for incorporation into JSTOR’s systems. JSTOR would also generate URLs and
Digital Object Identifiers (DOIs), passing these with the corresponding library
and record IDs to MIMAS for incorporation into Copac and distribution to libraries
(WP8). In addition, JSTOR would take responsibility for facilitating pathways into
the collection through its linking arrangements with organisations such as Google,
the History Cooperative [32] and RePEc [33], and its participation in CrossRef [34].
The full OCR text of the pamphlets would be exposed to Google’s indexing spider,
enabling pamphlets to be found via a standard Google Web search.
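
The selection of roughly 10% of a batch for checking could be made by simple random sampling, as in the sketch below. How JSTOR actually selects files for QA is not described in this report; the function name, the fixed seed and the file names are illustrative assumptions.

```python
import random


def qa_sample(items, fraction=0.10, seed=None):
    """Pick a random sample of roughly `fraction` of the items (at least one)."""
    if not items:
        return []
    size = max(1, round(len(items) * fraction))
    return sorted(random.Random(seed).sample(items, size))


batch = [f"{n:04d}.tif" for n in range(1, 201)]
to_check = qa_sample(batch, seed=2007)
print(len(to_check), "of", len(batch), "images selected for checking")
```

Fixing the seed makes the selection repeatable, which may be useful if BOPCRIS and JSTOR need to refer to the same sample when addressing any issues.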

WP7 Preservation

Timeframe: Apr 2007-
Lead: JSTOR
Objective: To ensure the long-term preservation (including future upgrades
and migrations as technology changes) and accessibility of the material.
Main outputs: Archive of image, metadata and OCR datasets.

JSTOR would receive from BOPCRIS a very rich dataset, including large archival
images and standards-compliant metadata, with accompanying full text. This
dataset would be archived by JSTOR and made available to contributing libraries or
the JISC upon request.

WP8 Linking

Timeframe: Apr 2007-
Lead: MIMAS
Objective: To achieve linking from Copac, from collection descriptions
on the RSLP Project website, and from the individual OPACs of libraries
holding the pamphlets.
Main outputs: Hyperlinked records.

MIMAS would take the URLs and DOIs supplied by JSTOR and use them to update
the CURL and Copac databases and make available links or duplicate records to
partner libraries so they can update their own catalogues. MIMAS would develop
software to generate these additional records and also (as far as is possible)
identify all records in the database describing the same item so that all relevant
libraries could be informed. Any library participating in CURL/Copac would be able
to download a record or link to items held in the pamphlet collection. Collection-
based searching would also be provided by MIMAS via the Guide to 19th Century
Pamphlets hosted on the CURL website [35]. This would enable users to limit their
search to a particular collection or, if they prefer, to just the digitised content of
that collection.
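
In MARC-based records a link to an online version is conventionally carried in field 856 (electronic location and access). The sketch below shows how a URL might be added to a MARCXML record; it is not the software MIMAS would develop, and the function name, sample record and URL are placeholders.

```python
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC)


def add_link(record, url, note="Digitised version"):
    """Append an 856 field pointing to the online version of the item."""
    # ind1="4": access by HTTP; ind2="1": link is to a version of the resource.
    field = ET.SubElement(record, f"{{{MARC}}}datafield",
                          {"tag": "856", "ind1": "4", "ind2": "1"})
    ET.SubElement(field, f"{{{MARC}}}subfield", {"code": "u"}).text = url
    ET.SubElement(field, f"{{{MARC}}}subfield", {"code": "z"}).text = note
    return record


# Hypothetical catalogue record containing only a title field.
record = ET.fromstring(
    f'<record xmlns="{MARC}">'
    '<datafield tag="245" ind1="0" ind2="4">'
    '<subfield code="a">The total eclipse</subfield>'
    '</datafield></record>'
)
add_link(record, "https://example.org/pamphlet/12345")
print(ET.tostring(record, encoding="unicode"))
```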

WP9 Marketing and dissemination

Timeframe: Jan 2007-Dec 2008
Lead: Southampton
Objective: To generate wide interest in the project and wide usage of its
collection.
Main outputs: Website, publicity materials, presentations, learning
resources, digitisation toolkit, launch and dissemination event.

[32] See www.historycooperative.org
[33] See http://repec.org
[34] See www.crossref.org
[35] See www.curl.ac.uk/rslpguide/guidehp.htm


The Project Manager would take responsibility for this work package, creating
or commissioning webpages and publicity materials, writing papers, making
presentations, and coordinating an event in the summer of 2008. This event
would be likely to take the form of a one-day seminar and to include a formal
launch of the collection (which by this stage would include a significant amount of
content). The Project Manager would also (a) prepare an online ‘toolkit’ or suite of
resources to assist with the future selection, digitisation, delivery and preservation
of pamphlet literature, and (b) commission a Research Officer to create a set of
e-learning resources to encourage the use of the resource within teaching. These
e-learning resources would include additional web content for the Guide to 19th
Century Pamphlets and a sample learning package for deposit within the JORUM
repository.

WP10 Evaluation

Timeframe: Oct-Dec 2008
Lead: Southampton
Objective: To commission an external evaluation study whose assessment
and recommendations will be incorporated in the Final Report.
Main outputs: Evaluation Report.

5. Conclusions and recommendations
This whole report might be regarded as a set of recommendations for the
proposed project (especially sections 2.3, 3 and 4), so this final section will not
reiterate everything previously presented. However, some general conclusions and
recommendations can usefully be made here:

■ The scoping study has confirmed that the proposed project is viable, could be
managed within the allotted timescale (2 years, from January 2007-December
2008), and would be extensible.

■ Although the study has not addressed every single aspect of the project, it has
covered much ground and provided a good base from which to build a full and
detailed Project Plan.

■ All of the deliverables promised in the scoping study plan (see 1.1.3) have
been achieved and are included within this report, along with much additional
information.

■ Some information gathered in the process of the study has not been included
in this report for reasons of confidentiality (e.g. costs and contractual
agreements). Much of this information has been provided in the revised bid
and it will be important in preparing the detailed Project Plan and developing
Memoranda of Understanding should the project go ahead.

■ In addition to providing much information, the scoping study has been important
in establishing relationships between partners (e.g. BOPCRIS and JSTOR),
which would stand the project in good stead if it proceeds.

■ The proposed project has many complexities and the scoping study activities
have proved valuable in addressing many of these (e.g. image capture and OCR
standards, approach to de-duplication). It is recommended that scoping
studies such as this are undertaken for all large digitisation projects of this
nature, especially where multiple partners are involved.

■ While the study has been able to provide a better base for certain assumptions
(e.g. page averages), some information can only be fully established in
practice. If the project proceeds, it would be able to test some of the findings
and assumptions made here, and to evaluate the chosen methodologies. It
is recommended that the 19th Century Pamphlets Online project, should it
proceed, evaluate the findings and approaches chosen in this study in light of
the practical reality of the project, and disseminate its findings.

■ Although the scoping study and this report have focused on a particular project,
some of its findings and approaches are expected to be of wider interest and
value. For example, we now have a clearer understanding of the condition of
19th century pamphlet collections – and of the challenges involved in digitising
them. It is recommended that this report should be made available to others
undertaking similar work.

Appendix A – Survey Sheet
The table below reproduces the printed survey sheet used in the collection
assessment surveys described in section 2.1 above. An Excel spreadsheet was also
provided and used in the submission and analysis of results.

Sample number 1 2 3 4
Unique identifier for item
1. Would you happily send Tick if would send Tick if would send Tick if would send Tick if would send
item as is for scanning?
2. Where is item located:
Open shelves Open shelves Open shelves Open shelves Open shelves

On-site store On-site store On-site store On-site store On-site store

On-site archival On-site archival On-site archival On-site archival On-site archival

Off-site store Off-site store Off-site store Off-site store Off-site store

Off-site archival Off-site archival Off-site archival Off-site archival Off-site archival

Other (specify)
3. Is it in bound volume or Tick if in volume Tick if in volume Tick if in volume Tick if in volume
separate?
4. If bound, does volume Tick if vol. opens Tick if vol. opens Tick if vol. opens Tick if vol. opens
open flat (180 degrees)? flat flat flat flat

5. If bound, are there loose Tick if loose boards Tick if loose boards Tick if loose boards Tick if loose boards
boards?
6. If separate, is there Tick if margin Tick if margin Tick if margin Tick if margin
stitching in margin? stitching stitching stitching stitching

7. Are there loose pages? Tick if loose pages Tick if loose pages Tick if loose pages Tick if loose pages

8. Are there any obviously Tick if missing Tick if missing Tick if missing Tick if missing
missing pages? pages pages pages pages

9. How many pages are


there?
10. How many pages with
grey illustrations?

PAGE 46
19th Century Pamphlets Online
Digitisation Scoping Study

11. How many pages with colour?

12. Are there (intentional) fold-outs? Tick if fold-outs Tick if fold-outs Tick if fold-outs Tick if fold-outs

13. Are pages folded to fit in volume? Tick if folded for vol. Tick if folded for vol. Tick if folded for vol. Tick if folded for vol.

14. Are there adverts? Tick if adverts Tick if adverts Tick if adverts Tick if adverts

15. Are there annotations? Tick if annotations Tick if annotations Tick if annotations Tick if annotations

16. Is a Gothic or unusual typeface the main body text? Tick if Gothic etc Tick if Gothic etc Tick if Gothic etc Tick if Gothic etc

17. Do you have multiple copies? Tick if multiples Tick if multiples Tick if multiples Tick if multiples

18. Which page size is closest? Circle: A5, A4, A3 Circle: A5, A4, A3 Circle: A5, A4, A3 Circle: A5, A4, A3

19. Does another library partner have a copy:
Bristol Bristol Bristol Bristol Bristol

Durham Durham Durham Durham Durham

UCL UCL UCL UCL UCL

Liverpool Liverpool Liverpool Liverpool Liverpool

LSE LSE LSE LSE LSE

Manchester Manchester Manchester Manchester Manchester

Newcastle Newcastle Newcastle Newcastle Newcastle

20. Does any other COPAC library have a copy? Others Others Others Others

Appendix B – Glossary
BOPCRIS British Official Publications Collaborative Reader Information Service
(www.bopcris.ac.uk)

Copac CURL Online Public Access Catalogue (www.copac.ac.uk)

CURL Consortium of Research Libraries in the British Isles (www.curl.ac.uk)

DOI Digital Object Identifier

DTD Document Type Definition

FCO Foreign & Commonwealth Office

FE Further Education

HE Higher Education

HEFCE Higher Education Funding Council for England (www.hefce.ac.uk)

HEI Higher Education Institution

IE JISC Information Environment

JISC Joint Information Systems Committee (www.jisc.ac.uk)

JSTOR Journal Storage – The Scholarly Journal Archive (www.jstor.org)

LSE London School of Economics and Political Science

MARCXML MARC 21 XML (www.loc.gov/standards/marcxml)

METS Metadata Encoding & Transmission Standard (www.loc.gov/standards/mets)

MIMAS Manchester Information & Associated Services (www.mimas.ac.uk)

MIX NISO Metadata for Images in XML (www.loc.gov/standards/mix)

MODS Metadata Object Description Schema (www.loc.gov/standards/mods)

MoU Memoranda/Memorandum of Understanding

OCR Optical Character Recognition

OPAC Online Public Access Catalogue

QA Quality Assurance

RFP Request For Proposals

RSLP Research Support Libraries Programme (www.rslp.ac.uk)

TASI Technical Advisory Service for Images (www.tasi.ac.uk)

UCL University College London

URL Uniform Resource Locator

WAI Web Accessibility Initiative (www.w3.org/WAI)

W3C World Wide Web Consortium (www.w3.org)

WP Work Package

Additional information sources and support services


JISC Collections provides a multitude of examples of good ways to deliver digital content. www.jisc-collections.ac.uk

AHDS provides services to aid the creation, use and preservation of digital collections in the arts and humanities and is partially funded by JISC.
www.ahds.ac.uk

BUFVC – The British Universities Film and Video Council promotes the use of moving image and audio resources and provides very useful information
regarding digitisation from these formats. Partially funded by JISC.
www.bufvc.ac.uk

Digital Curation Centre provides a national focus for research into curation issues and promotes expertise and good practice, both national and
international, for the management of all research outputs in digital format. Partially funded by JISC. www.dcc.ac.uk/index.html

Digital Preservation Coalition fosters joint action on preservation of digital resources in the UK to secure our global digital memory and knowledge
base. www.dpconline.org

HEDS Digitisation Services provides papers and advice on planning and costing digitisation projects along with a complete digitisation service.
www.heds-digital.com

The JISC Legal Information Service provides legal resources for further and higher education. Its website is an excellent starting point for
information on copyright and intellectual property. Funded by JISC.
www.jisclegal.ac.uk/ipr/IntellectualProperty.htm

TASI provides advice and guidance on the management of digitisation projects and creating, delivering and using digital images across all subject
areas and is funded by JISC. www.tasi.ac.uk

TechDis provides an advice and information resource via extensive web-based databases and an email helpdesk. These resources should be the first
port of call for anyone in education who has a question relating to disability and technology. Funded by JISC. www.techdis.ac.uk

UKOLN provides information and standards on how resources can interoperate and automated tools for testing website accessibility. Partially funded
by JISC. www.ukoln.ac.uk

For further information on the JISC Digitisation Programme, visit www.jisc.ac.uk/digitisation or contact Stuart Dempster: s.dempster@jisc.ac.uk

19th Century Pamphlets Online: Digitisation Scoping Study

Further information about JISC:
Web: www.jisc.ac.uk
Email: info@jisc.ac.uk
Tel: 0117 954 5083

Further information about CURL:
Web: www.curl.ac.uk
Email: robin.green@curl.ac.uk
Tel: 0121 415 8106

Version 1.1, November 2006
