
Subject analysis and indexing: from automated indexing to domain analysis


Hanne Albrechtsen

Discusses the nature of subject analysis, suggesting three different conceptions of this: simplistic, content-oriented, and requirement-oriented, considering the type of subject information and indexing method appropriate for each.

Introduction

Preparing an index to a book or assigning indexing terms to documents involves the task of subject analysis. Subject analysis can be defined as 'the intellectual or automated process by which the subjects of a document are analyzed for subsequent expression in the form of subject data'.1 Is automatic indexing a challenge, or even a threat to indexers? What do we mean when we talk about 'subjects' of books and other documents? Are there different conceptions of subjects, and hence of subject analysis; and if so, are such conceptions interconnected with methods applied for indexing?

These questions are important because discussions of indexing tend to focus on the performance of automated indexing2 versus the performance of human indexing,3 usually based on relevance judgments compiled from particular groups of IR-system users,4 or observations of interindexer consistency.5 Using performance as a criterion for finding a 'winning' approach to subject analysis and indexing implies a model which restricts itself to comparing the two techniques. Based on empirical observations of user and indexer behaviour, it represents a mechanistic evaluation method for subject assignment and retrieval. This article presents an alternative model for discussing subject analysis and indexing, where the choice between particular methods applied for this task is of minor importance. The intention is to attempt to place indexing in a wider social context beyond such mechanistic evaluation methods and to point towards new challenges for us as indexers.

Current methods for subject analysis and indexing

The process of creating subject data, i.e. subject entries for books or descriptors for documents in IR-systems, involves two major steps:

1. subject analysis of a document and expression of the perceived information in a concrete linguistic statement;
2. assigning the document with terms elicited in the linguistic statement, where these terms can be translated to conform with the terminology of a controlled vocabulary of indexing terms, for instance according to a thesaurus or a classification scheme.

This methodology is generally recommended as GIP (good indexing practice) to professional indexers. Examples include the International Standards Organization,6 Foskett,7 and Dutta & Sinha.8 As pointed out by Frohmann9 and by Blair,10 the majority of the literature on subject indexing concentrates on step two and fails to provide precise rules for realizing step one where the challenge presented is: finding the subject(s) of a document. It should be mentioned, however, that the International Standards Organization recommends that the linguistic statement of subject contents follows a faceted approach to elicit concepts from a document. This latter method can be fruitful indeed, provided that it involves facets of a less generic nature than 'agent', 'instrument', 'object of action' etc.; compare for example ISO's general facets with Mutrux and Anderson's article11 and Albrechtsen's facet scheme for software.12 I shall return to such alternative approaches later.

What does it mean that a document has a subject?

After some years of magic sleep in the research community of library and information science, the concept of 'subject' has been awakened again to constitute a central area of study. Admittedly, the concept sometimes sneaked onto the stage for brief performances disguised under the term 'aboutness', as for example in domain analysis.12 It has been argued1 that the concept

The Indexer Vol. 18 No. 4 October 1993 219


of 'aboutness', introduced by Fairthorne and others,14 was eventually introduced to avoid the difficulties of addressing the concept of 'subject' proper, and that the previous vagueness of the concept of 'subject' was transferred to the concept 'aboutness'. With a few exceptions, including Hutchins15 and Beghtol,13 both stressing the issue of intertextuality, the 'aboutness' approaches tended to handle documents as isolated sources of knowledge. At the same time, one may regard in particular Hutchins' work with discourse analysis, involving analysis of text organization, as offering an interpretative approach to automatic text analysis, thus intending perhaps to inject a softer, humanistic flavour to what could otherwise be criticized as 'hard' automatic indexing approaches, based on statistical and/or computational linguistic techniques, approaches which have dominated research in indexing during the 1970s and 1980s, for example the work of Hutchins15 and Salton.16

In contrast to the 'aboutness' approaches, at the same time challenging research in automatic indexing to reconsider their methodological weaknesses, Blair,10 Hjørland,17 Weinberg,18 and Soergel19 point towards new ways of looking at indexing. They reinstate the concept of 'subject' in the leading part for the practice and theory of indexing, stressing that the primary function that indexing should serve is the search for knowledge. They recommend that the indexer should not focus exclusively on the contents of documents, but attempt to anticipate the impact and value of a particular document for potential use. 'Indexing fails the researcher', says Weinberg,18 because the indexes provide the 'aboutness' rather than the 'aspect', that is the point-of-view or innovation of a document, which cannot readily be perceived by interpreting a document as an isolated source of knowledge. Soergel19 and Hjørland17 both advocate that the traditional implicit definition of subjects as direct abstractions of single documents, which is also applied by researchers in automated indexing, as for example in the work of Korycinski and Newell,2 is too narrow. Soergel19 proposes a method of request-oriented indexing. Hjørland17 offers a new definition of the concept of 'subject' as the totality of epistemological potentials of documents.

Conceptions of subject analysis

In order to have a clear reference frame for discussing subject analysis and indexing, I stipulate a model of conceptions of subject analysis and indexing. The model, shown in figure 1, covers three different conceptions or viewpoints of subject analysis and indexing.

Figure 1. Interconnection between conceptions of subject analysis, types of document information and indexing

The simplistic conception (i) regards subjects as absolute objective entities that can be derived as direct linguistic abstractions of documents or summed up like mathematical figures, using statistical indexing methods. According to this conception, indexing can be fully automated.

The content-oriented conception (ii) additionally involves an interpretation of the document contents that goes beyond the lexical and sometimes grammatical surface structure, which is the boundary within which the simplistic conception (i) operates. Subject



analysis of document contents involves identification of topics or subjects that are not explicitly stated in the textual surface structure of a document, but are readily perceived by a human indexer. Hence it involves a more indirect abstraction of the document itself.

The requirements-oriented conception (iii) regards subject data as instruments for transfer of knowledge, hence aiming at finding pragmatic information or knowledge. According to this conception, documents are created for communicating knowledge, and subject data should hence be tailored to function as instruments for mediating and rendering this knowledge visible to any possible interested persons.

I shall provide some critical comments on each approach, using examples from my work with classification and indexing of software and domain analysis of computing. From 1987 until 1991, I participated in the ESPRIT project PRACTITIONER, whose primary aim was to construct a thesaurus of software engineering terms, derived from automatic analysis of terminology applied in machine-readable software documentation.20 While this approach was fruitful for getting an insight into terminological problems of the domain of software, it failed to facilitate a communication of knowledge about software beyond a narrow technical community. In contrast, domain analysis12 regards software as involving both knowledge production and use in a broad societal and interdisciplinary context. Domain analysis entails investigations of scientific, sometimes technical, paradigms and viewpoints which represent different knowledge interests in one domain with the aim of building a classificatory structure to capture knowledge and serve its transfer and sharing. Even though discussions on possible drawbacks of approaches to automatic indexing and thesaurus construction already emerged during the work of the PRACTITIONER project,21 the present model of subject analysis and indexing places itself in the context of this latter domain analysis approach to classification and indexing.

The simplistic conception of subject analysis

This conception regards subjects as direct abstractions of documents. Following this conception, to analyse and index software is equivalent to extracting automatically all single words or phrases from full-text software documentation.

It is often argued that technical documents like software documentation are amenable to fully automatic indexing approaches, due to a so-called 'hard' and unambiguous terminology applied by the document producers. This is not always true. For instance, experiments with automated and human indexing of UNIX online documentation22 showed that the human indexers (software engineers) often perceived subjects in this documentation which were not readily stated in the text. One example: the UNIX utility comm, which can be used for comparing two text files, does not feature the term 'compare' or any of its natural language equivalents in the machine-readable documentation. For automatic indexing of this software, this implies that users searching for utilities that can compare two files will not retrieve this item.

This is not to say that automatic techniques should be rejected as an approach to subject analysis. Rather, the approach has practical limitations. More seriously, one may contend that it lacks a theoretical foundation for subject analysis other than computational linguistics and statistics.

The content-oriented conception of subject analysis

The content-oriented conception of subject analysis is based on explicit as well as implicit subject information in texts. By explicit subject information is meant information which is expressed in the terminology applied by the document producer. A document may also convey implicit information, which is not directly expressed by the author, but which is readily understood or interpreted by a (human) reader of the document.

The content-oriented conception of subject indexing is the most common approach to subject indexing, including subject indexing of software.23 However, the conception can be said to confine itself to representing or abstracting the document as an isolated entity. Consider for instance Frakes and Gandel's definition of subjects in software:

'A software representation is a mapping of predicates and terms describing individual objects and relationships among objects from the represented to the representing world. The representing world may be an elaboration of the represented world, containing new predicates and terms' (my italics).

And similarly, by Hall,24 in the terminology of Saussure:

'[The thesaurus terms] are signifiers and the [software] concept is the signified, with the preferred term acting as a surrogate for the concept [...]'

Such definitions of subjects in software reveal a content-oriented conception of subject indexing, where the descriptors (representations or signifiers) are predicates for the 'aboutness' or the intentional meaning of a document, hence aiming at achieving an abstraction or a condensed version of software documentation.

The content-oriented conception of subject indexing implies some significant shortcomings, identified by Soergel19 and Hjørland.17 According to Soergel,19 the content-oriented conception of subject analysis focuses on the document as an isolated source of knowledge, even though an indexer following this conception may consider the context of the document collection to which it belongs (intertextuality). As a result, the document is assigned to one or more document classes or categories. Hjørland17 argues that a pure content-oriented conception of subject indexing will often result in very trivial descriptors, which cannot be applied to search for more profound aspects like the theoretical reference frame applied, though often not stated in a document. This argument complies with Weinberg's critique of the 'aboutness' approaches to indexing.18

The requirements-oriented conception of subject analysis

The requirements-oriented conception of subject analysis is applied here as a common denominator for request-oriented approaches19 and sociological-epistemological frameworks17 for indexing. Subject analysis based on requirements entails a different focus from the content-oriented subject analysis approach. When analyzing a document, the indexer does not concentrate on representing or abstracting the explicit and implicit information in it. Rather, s/he asks: how should I make this document, or this particular part of it, visible to potential users? What terms should I use to convey its knowledge to those interested?

In requirements-oriented indexing it is the users' search for knowledge in IR-systems or indexes to books that determines the method for indexing. Hence, a document is analyzed with the purpose of predicting its potentials for serving particular groups of users. Soergel19 advocates that these requirements should also determine the choice of indexing terms, and that the terminology applied by the indexer should comply with the terminology of the users. Soergel's approach places itself within the context of current developments in end-user searching; instead of searching for knowledge in an IR-system, supported by an intermediary, for instance a librarian, the user goes directly to the documents via a user-dedicated search vocabulary. This does not imply the extinction of intermediaries for mediating knowledge. Rather, Soergel's approach represents a change in the division of labour, in particular for the context of IR-systems, where the intermediary functions, traditionally performed by a reference librarian, are handed over to the indexers. However, for indexers of books, this intermediary role has always been a primary responsibility.

In order to achieve an anticipation of user needs, Soergel19 proposes a bottom-up compilation of user terminology, for instance for the design of a user-dedicated in-house information system in the industry. This implies that the immediate requirements of a particular user group may be exaggerated at the expense of future users. What will happen when several in-house databases featuring quite different terminologies and subject indexing approaches are to be integrated in the context of a large company take-over? Apart from such practical, though presumably not unsolvable, situations, one may wonder whether such empirical investigations of user terminology do eventually suffer from the same methodological deficiencies as automatic document indexing. The fact that user-dedicated vocabularies for particular IR-systems can be compiled automatically from logs of end-user search statements points in a mechanistic direction. One should hesitate, on such observations alone, to dismiss empirical investigations of user terminology for building requirements-oriented indexes or IR-systems. For special domains like fiction, fine arts and music, empirical investigations of user concepts, using for example association tests, may be a fruitful approach indeed, since these domains involve affective aspects in cognition, which cannot otherwise be captured.

Contrary to Soergel, but following his general idea, Hjørland proposes an approach of methodological collectivism for capturing knowledge interests or user needs.17 He defines 'subject' as 'the totality of epistemological interests that one document may serve'.17 Subjects are viewed as interconnected with scientific and human cognition. For subject analysis and indexing, this view imposes a priority on the indexer to decide on these long-term qualities of a document.

This theory presents a challenge for indexing practice, which I have pursued under the framework of domain analysis of computing.12 Computing is a cross-disciplinary knowledge domain, which cannot be outlined readily on the basis of conventional criteria for the division of labour, or indeed of the sciences. Rather, it can be seen as a point of convergence or intersection between different interessees, including the producers as well as the users of computers. Based on methodological investigations of research and technical paradigms in connection with a critical appraisal of current methods for building classification systems and thesauri in this field, I proposed a faceted classification scheme with nine domain-specific facets and experimented with indexing different types of software applications. These facets can be used as a check-tag-list for subject analysis of software. Viewed as a structure, the check-tag-list is an indexing tool, complying with the faceted technique for subject analysis recommended by ISO.6 Viewed as a framework for capturing concepts and aspects of computing, it constitutes an instrument for transfer of knowledge. In my present work as a teacher of indexing, I am introducing the students to domain and facet analysis of different disciplines in order to train them in building indexing tools which can serve several target groups for indexes. I am also investigating classificatory structures as interaction mechanisms in scientific communication.

Discussion and conclusion

Domain-specific approaches to designing indexing tools are open to critique. For instance, Horner25 argues that at worst they suffer from the current trend towards fragmentation of knowledge rather than supporting a holistic view. However, my application of domain analysis for indexing aims to facilitate transfer of knowledge in computing between different interessees. The objective is to investigate how domain-specific facets can be generalized between individual knowledge domains. In this context, I will mention a related approach to building indexing tools: the CIFT-project, covering the domain of fine arts and literature and featuring aspect-oriented facets such as 'Scholarly approach' and 'Technique/method'. These facets are equivalent to the aspect-oriented facets 'Scientific Paradigm' and 'Technical approach' in my indexing framework for computing.

Domain analysis adheres to the requirements-oriented conception of subject analysis and indexing, which entails difficulties when indexing practice enters the stage. Before dismissing the approach on suspicions of impracticality and, even worse, because it may involve the risks of isolating knowledge to an elite of domain experts, as implied in Horner's critique,25 I will summarize the implications of the three different conceptions in the model presented in figure 1.

Subject analysis and indexing can be automated, but fully automated only according to a simplistic conception. The two other conceptions entail far more subjective and, indeed, difficult frameworks and methods for subject analysis. The difficulty lies in part in the fact that the concept of 'subject' still constitutes a pioneering area of study for research and practice in indexing.

All three conceptions have advantages and drawbacks. The advantage of following a simplistic conception is pragmatic: due to the current decrease in the pricing of computers and software, and the increasing cost of manpower, automatic indexing is hard to compete with on an immediate economical basis. The major drawback is that it does not aim at facilitating the transfer of knowledge to possible interessees in the documents it processes. This drawback may eventually render the technique expensive in the long run, if one considers the transfer and optimum use of knowledge as a key societal asset.

The content-oriented conception has the advantage of being an established technique for training and professional work in indexing, but together with the simplistic conception, it suffers from being one-sided, as it focuses on representing individual documents or document collections instead of considering their possible uses.

Finally, the requirements-oriented conception has the advantage that it supports the transfer and dissemination of knowledge. One may argue, however, that a major disadvantage for the practice of indexing is that its ultimate goal may only be realized in Utopia. How much 'scientific pre-thinking', following Soergel,19 can we offer as professional indexers, and how do we train students of indexing to follow such a philosophy? How can indexers distinguish subjects of a high or low priority in a document and ensure its possible visibility in indexes and IR-systems for the future? How much responsibility should be imposed on us for judging or mediating the qualities of a document to potential users?

Instead of letting the answers to such questions blow in the wind, I suggest that indexers reconsider their practice. Current practice in indexing can be said to confine itself to modest, value-free ethics for dissemination of knowledge. Requirements-oriented indexing involves a high degree of subjectivity and responsibility in choosing among the qualities of documents. Current discussions in other professions, such as teaching and medical practice, tend to question prudent ethics of objectivity in mediating their services to their target groups. Rather than refraining from picking up the challenges posed by the social and cultural reality within which we operate, we should face the music, too. New frameworks like requirements-oriented approaches have potentials for supporting a broad and open transfer of knowledge, which is a primary responsibility of our profession. To choose such frameworks means the end of such 'threats' to our profession as automatic indexing, but more important, they provide us with a challenge to gain a new consciousness of the impact of our profession for mediating knowledge.

References

1. Hjørland, B. Fundamental concepts in information science (In Danish: Informationsvidenskabelige grundbegreber). Copenhagen: The Royal School of Librarianship, 1992, 40-3. [English edition in progress, co-authored by Hanne Albrechtsen].
2. Korycinski, C. and Newell, A.F. Natural-language processing and automatic indexing. The Indexer 17 (1) April 1990, 21-9.
3. Jones, K.P. Natural-language processing and automatic indexing: a reply. The Indexer 17 (2) October 1990, 114-15.
4. Cleverdon, C.W. Optimizing convenient on-line access to bibliographic databases. Information Services and Use 4 1984, 37-47.
5. Cooper, W.S. Is interindexer consistency a hobgoblin? American Documentation 20 (3) 1969, 266-78.
6. ISO [International Standards Organization]. Methods for examining documents, determining their subjects and selecting indexing terms. 1985 (ISO 5963).
7. Foskett, A.C. The subject approach to information. London: Clive Bingley, 1982.
8. Dutta, S. and Sinha, P.K. Pragmatic approach to subject indexing: a new concept. Journal of the American Society for Information Science 35 1984, 325-31.
9. Frohmann, B. Rules of indexing: a critique of mentalism in information retrieval theory. Journal of Documentation 46 (2) 1990, 81-101.
10. Blair, D.C. Language and representation in information retrieval. Amsterdam: Elsevier Science Publishers, 1990.
11. Mutrux, R. and Anderson, J.D. Contextual indexing and faceted taxonomic access system. Drexel Library Quarterly 19 (3) Summer 1983, 91-111.
12. Albrechtsen, H. Domain analysis for classification of software. Copenhagen: The Royal School of Librarianship, 1992 (MSc Dissertation).
13. Beghtol, C. Bibliographic text theory and text linguistics: aboutness analysis, intertextuality and the cognitive act of classifying documents. Journal of Documentation 42 (2) 1986, 84-113.
14. Fairthorne, R.A. Content analysis, specification and control. Annual Review of Information Science and Technology 4 1969, 73-109.
15. Hutchins, J.W. On the problem of 'aboutness' in document analysis. Journal of Informatics 1 (1) 1977, 17-35.
16. Salton, G. and McGill, M. Introduction to modern information retrieval. New York: McGraw-Hill, 1983.
17. Hjørland, B. The concept of 'subject' in information science. Journal of Documentation 48 (2) 1992, 172-200.
18. Weinberg, B.H. Why indexing fails the researcher. The Indexer 16 (1) April 1988, 3-6.
19. Soergel, D. Organizing information: principles of database and retrieval systems. New York: Academic Press, 1985.
20. Albrechtsen, H. Software concepts: knowledge organization and the human interface. Advances in Knowledge Organization edited by R. Fugmann. Frankfurt: Indeks Verlag, 1990, 48-64.
21. Albrechtsen, H. In press: A thesaurus-based information system for software reuse. Classification research for knowledge representation and organization edited by N.J. Williamson and M. Hudon. Amsterdam: Elsevier Science Publishers, 1992, 137-44.
22. Sedwell, I. and Albrechtsen, H. The linguistic analysis of UNIX online documentation. Uxbridge: Brunel University, 1988 (ESPRIT PRACTITIONER project report, BrU-021).
23. Frakes, W.B. and Gandel, P.B. Representing reusable software. Information and Software Technology 32 (10) 1990, 653-64.
24. Hall, P. Domain modelling and concept storage. Uxbridge: Brunel University, 1990 (ESPRIT PRACTITIONER project report, BrU-IS-WP4.1-096).
25. Horner, D.S. Paradigms, discourses and language games: categorical frameworks and signs of the times. Presentation invited for expert panel on Domain Analysis in Information Science, ASIS Annual Meeting, Columbus (OH), 22-28 October 1993.

Hanne Albrechtsen is Lecturer in Classification and Indexing at the Royal School of Librarianship, Copenhagen, Denmark.

Justification for indexlessness?

Gillian Tindall prefaces her 20-page volume of literary criticism, Rosamond Lehmann: an appreciation (Chatto & Windus, 1985), with 'A note for the reader' concluding with this paragraph:

This book has no formal index. Readers and reviewers are not, however, invited to take this as a sign of idleness or lack of scholarly attention on the part of myself or my editors: it has been a deliberate decision. The number of books published by Rosamond Lehmann is quite small, each is mentioned over and over again in my text and to most of them whole chapters are effectively devoted. This is not a biography: my text is not concerned with places, people or events so much as with themes. Any index which confined itself, in the traditional way, to concrete items, proper names, etc., would be too short and uninformative to be of much use: conversely one which attempted, by some means, to recapitulate the true nature of the book's contents would not make much sense except to someone who had already read the book. I have therefore preferred to set out the contents to each chapter rather more fully than is customary in the Contents, in the hope that this will be of sufficient assistance to any reader wishing to look up a specific book or phase of the author's life.

We print below also the contents to three chapters as set out in the list, and invite our readers' comments.

I THE PLACE YOU FIRST THOUGHT OF page 7
The author's basic themes - the location of her childhood - family - upbringing - education - The Gipsy's Baby - The Swan in the Evening - dead children - A Note in Music - 'The Red-Haired Miss Daintreys' - the sources of creation

II THE HOUSE ON THE HILL page 24
The house next door - Dusty Answer - the effects of the First World War - Invitation to the Waltz - The Desborough family - love doomed from the start

The Echoing Grove: cycle of experience - perennial themes - mothers - sisters - betrayal - retrospective view - two different time-scales in this novel - attempts at an exact chronology - a cycle in the author's own life - triangular relationships - repetitions - this novel progressing thematically - the events of the story - the order in which they are recounted - cuff-links as symbolic objects - readers' reactions - theme of salvation - range of characters - sisters in opposition to one another - the outsider theme and its relationship to the writer herself

Our thanks to Gillian Tindall for permission to quote these passages.
