Académique Documents
Professionnel Documents
Culture Documents
MILENACOLLOTANDNANCYBELMORE
...
ACKNOWLEDGMENTS
*
This is a revised version of an article which originally appeared in ]993 under the same
title in English Language Corpora: Design and Exploitation. We would like to express
our appreciation to the editors, Jan Aarts, Pieter de Haan and Nell eke Oostdijk, for
pennission to include the article in this volume. Special thanks are due to G. N. Leech,
Professor of English and Director of the Unit for Computer Research on the Eng]ish
Language at the University of Lancaster, for pennission to use the CLAWS] tagging
programs. We are also grateful to Adrian Saldanha for his technical expertise, and to Anne
G. Barkman of Concordia's Computing Services Department, who modified the
CLAWS] programs so that they would run on Concordia's VAX2 computer. Thanks are
also due to Prof. Margery Fee of Queen's University and Prof. John Upshur of Concordia
for their helpful comments and suggestions as readers of Collot's master's thesis on
Electronic Language.
NOTES
I.
We learned about Baron's article and severa] other early articles on computer-mediated
communication in "The language of electronic mail", a paper which Natalie Maynor
contributed to the TESL-L electronic discussion forum for Teachers of English as a
Second Language in March ]993. In this paper, Maynor comments on some apparent
features of what she calls 'e-mail sty]e'. She notes that some of the features".. .seem to
be attempts to make writing more like speech". Readers who would like a copy can
obtain one by joining TESL-L (listserv@cunyvm.bitnet or ]istserv@cunyvm.cuny.edu).
1. Introduction
Unlike previous studies of computer-mediated communication (CMC) which
have concentrated on either psychological factors, or on the perceived attributes
of the medium, and which have often used small and specific data sets, this
chapter reports the findings of a large corpus-based comparison between
spoken, written and CMC discourse. The various factors across which
comparisons are made were drawn from a Hallidayan model of language use and
focus upon the textual, interpersonal, and ideational (Halliday 1978) aspects of
speech, writing and CMc.
The chapter initially describes the construction of a corpus of CMC
interactions. The textual aspects of CMC discourse are then considered through
such measures as type/token ratios and lexical density. The interpersonal in
CMC is explored through the examination of pronoun use. Last, the presentation
of the ideational within CMC discourse is considered through an exploration of
modal auxiliary use within the CMC corpus. All of these results are
comparatively analyzed against similar results from spoken and written corpora.
30
SIMEON J. YATES
The computer revolution has brought with it new forms of discourse which
also deserve systematicstudy. One of these is electronicmail... Electronic
mail reveals features of both speech and writing. Like other forms of
discourse, new as well as old, it deserves the attention of future corpus
workers. (307-308)
There are a considerable number of different sources that could be used to build
a corpus of CMC interactions and utterances. These possibilities include
electronic mail (e.g., Microsoft Mail, VAX Mail, etc.), bulletin boards,
computer conferencing (e.g., CoSy, VAX Notes, Confer, etc.), synchronous
messaging systems (e.g., VAX Phone, Macintosh Broadcast), and so forth.
Though collecting data from such sources has many advantages, especially as
the data come already transcribed, there are specific difficulties in relation to each
form of communication. The collection of data from direct synchronous forms
of communication is for most purposes a difficult task. Such systems tend not to
store the messages created and, akin to using the telephone, once they are
finished no record remains.
Electronic mail systems do, on the whole, store received messages,
allowing the recipient to decide upon which to keep or to delete. However, there
is one main problem with using such a source, namely privacy. Most electronic
mail is sent directly from individual to individual. To collect such a corpus
would require access to others' mail exchanges, or to those messages they
wished to release. This would also be the case for synchronous forms of CMC.
Electronic mail in the form of "listserve" interactions and bulletin boards
gets around many of these problems as it mostly involves open multi-party
communication. There are fewer data collection problems as anyone person can
be connected to a large number of different discussions. Computer
conferencing, like listserve discussions, also tends to consist of open, multiparty discussions. In many respects these two types of CMC are very similar
systems of communication, differing mainly in the form of their administration
and technical operation. Unlike listserve interactions, computer conferencing
messages are stored centrally in a database and accessed by users there.
Messages from earlier in the interaction are readily viewable and whole
conference discussions can be downloaded at will. This makes computer
conferencing a convenient data source for an initial CMC corpus.
There were three main computer conferencing systems available to the
research project. These were the CoSy system and the VAX Notes systemboth running on the VAX cluster at the Open University in the United
Kingdom-and the Confer system used at the University of Michigan in the
USA. Very little use is made of the VAX Notes system at the Open University,
and no data were therefore collected from that source. A large data set was
provided from the Michigan Confer system which is yet to be fully analyzed.
The main data source was therefore interactions taking place on the CoSy system
at the Open University in the United Kingdom.
'11
31
Given the above discussion of possible sources, the choice of CoSy might
seem a somewhat arbitrary decision. There are, however, important
methodological reasons for this choice. First, CoSy has a large user base spread
across the whole of the United Kingdom. The system is now used directly for
student support on several Open University courses, as well as indirectly by
those students with CoSy access on other courses. It also provides a forum for
much discussion and communication among both academic and non-academic
staff. There were over 2,000 registered users of the CoSy system at the time the
research was conducted. A second reason for using CoSy as a data source is the
possibility for the collection of background data on users. The main course
taught via CMC, through practicals and projects, provides a large database of
information on student use and attitudes, containing approximately 900
responses each year. These data have been used to conduct evaluations (see
Mason 1989) as well as studies of specific issues (see Yates 1993a).
2.2. Creating the CMC corpus
While one obvious method for the creation of a CMC corpus would be to
download all available material, this would produce a data set of excessively
large proportions that might prove too unwieldy. It also ignores the fact that the
corpus must be to some extent comparable to the written and spoken corpora
already in existence. It was decided to use only those conferences which had
been designated as "open". Though it was possible for the researcher to have
access to "closed" discussions, the use of such data did not seem appropriate for
ethical reasons. All open discussions are accessible to all CoSy users.
2.2.1. The 50 message selection
The first task in creating the corpus consisted of collecting a sample from each
open conference topic. Given that the size of text samples for the written and
spoken corpora was 2,000 and 5,000 words respectively, it was decided that the
CoSy samples should be comparable in size. An initial general examination
revealed that 50 CoSy messages contained an average of 3,500 words. The
sampling technique used was therefore to cut off interactions at 50 messages.
Such an arbitrary cut-off point might seem problematic and does raise
some methodological concerns. However, the written corpus to be examined,
the Lancaster-Oslo/Bergen (LOB) corpus, arbitrarily cut off text samples at
2,000 words. On the other hand the spoken corpus, the London-Lund corpus,
amalgamated spoken interactions of a certain type until they contained some
5000 words. The cutting off of interactions at 50 messages was therefore
comparable to the arbitrary sampling techniques used in the other target corpora.
There were 152 conference topics with 50 messages or more. From all of
these the first 50 messages were taken. This provided a total of 648,550 words
and an average of 4,267 words per sample text, somewhat higher than expected
32
SIMEON J. YATES
though still within the required range. (A full description of the corpus including
individual details for each interaction is given in Yates 1993b.) This corpus will
from now on be referred to as the CoSy:50 corpus.
33
Two large corpora of spoken and written material were used in the research
reported here. These were made available on a CD-ROM provided by the
Norwegian Computing Centre for the Humanities. The CD-ROM provided four
different corpora, these being the Brown Corpus of written American English,
the Lancaster-Oslo/Bergen corpus of written British English, the London-Lund
corpus of spoken British English, and the Kolhapur corpus of Indian English.
The texts came in various operating system formats and in a number of
differently tagged forms. It was decided to use the Lancaster-Oslo/Bergen
corpus to represent written texts and the London-Lund corpus to represent
spoken texts. These were used because they represented examples of British
rather than American or other forms of English. This was an obvious choice
given that the CoSy corpus was created by people living within the United
Kingdom.
34
SIMEON J. YATES
..
analysis concerns the lack of reference to the structure of most written English.
Given that clauses are generated through specific constructions of both
grammatical and lexical words, and that these clause constructions may, and do,
differ between speech and writing, such a crude measure of the type/token ratio
is problematic. A more rigorous measure of vocabulary use would be to
calculate the ratio of the number of different lexical items (lexical types) to the
total number of lexical items (lexical tokens).
35
0.65
0.6
0.55
0.5
0.45
0.4
Using specifically designed software tools, data were collected on the numbers
of lexical items in three of the corpora. The three corpora used were the
CoSy:Sample, the London-Lund and the LOB corpora. The reason for choosing
the CoSy:Sample corpus concerns the effects of length upon type/token ratios.
A small text of a few hundred words will not indicate much in terms of
vocabulary use in the type of quantitative analysis being conducted here. On the
other hand, though there are a greater number of words in total in a longer text,
the numbers of new words will be less. This makes a type/token measure length
sensitive. That is to say that for any set of texts of the same length, one would
expect lower type/token ratios for the spoken texts and higher ones for the
written texts, but the actual values of the ratios would be proportional to the
length of the text. In order to avoid this problem, the sampled set of CoSy
conferences was used. The average length of the sampled conferences was
2,262 words, compared to 2,000 for the LOB and 5,000 for the London-Lund
corpora. As Zipf (1935) (See below) indicated, this length sensitivity is a
function of the natural log of the text length, and therefore the differences in
average text length between these three corpora should not affect the result. The
general result is presented in Table 1.
Table 1. Mean type/token ratios for three corpora
Corpus:
CoSy:Sample
(CMC)
LOB
London-Lund
(Writing)
(Speech)
0.590
0.624
0.395
0.051
0.084
0.051
= 367.370,
0.35
CMC
Writing
Speech
"
This result would seem to indicate that CMC is more akin to writing than speech
in terms of range of vocabulary used. The most obvious conclusion is to follow
Chafe and Danielewicz and see this as a product of the medium itself, and the
opportunity it brings for longer gestation over the content of utterances. Careful
thought about this point however raises a complication. The texts contained
within the LOB corpus were mostly produced prior to the advent of the word
processor or the electronic text. The CMC corpus contains only word-processed
electronic texts. Electronic text brings an even greater set of opportunities to
correct, change, restructure and review utterances. Following the reasoning of
Chafe and Danielewicz, one would expect to see an even greater variety of
vocabulary used in CMc. However, this is not the case, no doubt because the
process of redrafting produces complexity and variety in a printed (especially a
published) text. Despite the fact that electronic text provides opportunities for
greater variety in the texts produced by individuals, it appears that these
opportunities are not taken up in most cases.
Moreover, it is writing that demonstrates the greatest variation around the
mean level of vocabulary use, as compared to speech and CMC which have
comparable variances. The implication of this result is that factors other than
simply the mechanical aspects of the medium are at work in detennining the
production of utterances.
Vocabulary use is dependent on a large number of sociolinguistic issues,
and despite having a clear relationship to the production and consumption of a
text, it is not the only measure of the textuality of an utterance.
difference between the three groups. Further comparisons using t-tests between
the three media also produced statistically significant results. Finally a TukeyHonestly-Significant-Difference test was conducted to check these results. This
test also found that all three groups were significantly different at the 0.05 level.
36
..
SIMEON J. YATES
If you invest in a rail facility, this implies that you are going to be
committed for a long time. [lexical density =0.35J
(2)
Halliday argues that the first of these sentences appears much more like the
record of spoken communication than written. The first of these sentences has 7
lexical items and 13 grammatical ones. The lexical items are: invest; rail;facility;
implies; committed; long; and term. In the second sentence there are 7 lexical
items and 3 grammatical ones. The lexical items are: investment; rail; facility;
implies; long; term and commitment.
There is therefore a quantitative difference between spoken and written
utterances. The ratio of lexical items to grammatical ones is lower for spoken
language than it is for written. In the above example the first sentence has a ratio
of 7 to 13 while the second sentence has a ratio of 7 to 3. These can be better
represented as a ratio or percentage of the number of lexical items to the number
of total items within an utterance. According to this measure, the first sentence
would score 7 out of 20 or 0.35 (35%). The second sentence would score 7 out
of 10 or 0.7 (70%). These types of score Halliday describes as lexical density
scores (Halliday 1985:61-63).
Ure (1971) conducted a study of the lexical density of texts representing
34 spoken and 30 written registers or genres. Ure used a similar method to that
described above, counting the total number of words for each text and then
counting the number of words with "lexical properties". From these values she
calculated the percentage of the total words with "lexical properties" (Ure
1971:445). Using this method, Ure noted a clear difference across two factors.
First, all written texts were found to have a higher lexical density; second, texts
containing "feedback" from the listener/reader had lower lexical densities (Ure
1971:445-452).
A similar test was conducted using the London-Lund and LancasterOslo/Bergen corpora, while for CMC the CoSy:Full corpus of interactions and
the CoSy:50 corpus of interactions were used, as this measure is not sensitive to
text length. A one-way analysis of variance produced the results given in Table 2
and the interaction plot in Figure 2. Once again the results of this analysis were
significant (F=238.439, df=810, p<O.OOOI).A post-hoc Tukey-HonestlySignificant-Difference test indicated these individual differences were all
significant at the 0.05 level.
37
CoSy:50 and
CoSy:Fuli
(CMC)
LOB
London-Lund
(Writing)
(Speech)
49.258
50.316
42.292
52
50
48
46
44
42
40
38
CMC
Writing
Speech
..
/'
38
SIMEON J. YATES
indeed a starting point and was used in earlier versions of this analysis (Yates
1993a), but raises several methodological problems. First, it is not clear at what
stage a word becomes a "high frequency item". Nor does Halliday make it clear
whether he is referring to high frequency in the text itself, in the genre, or in the
language. In fact all these factors may affect lexical density. In practical terms
such a method also requires pre-analysis of the data to determine word
frequencies in individual texts and then re-analysis of those texts with
specifically constructed lists of "high frequency items". What is clearly needed
therefore is some definable method which can take into account difficulties
associated with high frequency items.
A second problem in creating such a weighting concerns another form of
repetition, this being the occurrence of the same lexical item, or lexeme, in
several forms (e.g., differ and different). There is no full solution to this
problem although one general and rough measure of the phenomenon is
provided by the specifically designed software tools. By taking note of such
variants as plurals and other readily identifiable forms of a lexeme, a lexemes-tolexicals ratio can be calculated.
Following the work of Zipf (1935) on the linear relationship in a given text
between word frequency, number of occurrences, and word rank-the word's
relative position in a ranking of word frequencies-it is possible to provide a
mathematical formula based on rank through which we "weight" each item in an
analyzed text. To calculate the lexical density of a text, we need only sum the
weights for.each word and divide this by the total number of words in the text.
This calculation is given in the following equation:
rmax
Lw
Ln)
x=l
39
analysis, the set of 61 CoSy conference samples was used to offset text length
effects. A one-way analysis of variance was conducted to examine the
differences between the three media. The results of this analysis were again
significant (F=193.320, df=658, p<O.OOOl).Table 3 gives the mean values for
the different corpora; Figure 3 is an interaction plot of the same. A post-hoc
Tukey-Honestly-Significant-Difference test confirmed that the differences were
significant at the 0.05 level for the comparisons between CMC and speech and
speech and writing.
Table 3. Mean weighted lexical densitiesfor three corpora
Corpus:
Mean Weighted Lexical
Density [%]:
CoSy:Full
(CMC)
LOB
London-Lund
(Writing)
(Speech)
44.996
46.076
35.994
48
46
44
'8
42
40
38
In(rmax- rx)
T
Equation
""
36
34
CMC
Writing
Speech
40
SIMEON J. YATES
'...
41
4. Modality of CMC
120
..
.
100
One of the main differences between speech and writing which many researchers
focus upon is reference to self and other. Chafe (1982) and Chafe and
Danielewicz (1987) have considered differences in the relationship between
writer/reader and speakernistener as encoded within speech and writing. They
examine these issues linguistically through considering pronoun use in various
genres representing formal written, informal written, formal spoken, and
informal spoken varieties of American English. Chafe (1982:45) describes the
differences as being formed by levels of involvement and detachment. He argues
that the involvement of speakers with their audiences arises from the fact that
80
60
... to
1st person
20
0
CMC
Writing
Speech
Figure 4. 1st, 2nd and 3rd person pronoun use as a proportion of total pronoun
use (Actual values)
Where CMC differs markedly from speech and especially writing is in its
lower use of third person reference. In terms of percentage of first and second
person reference, it is closer to speech in overall usage, as indicated in Figure 5.
Fowler and Kress (1979) also consider these issues through examining the
usage of pronouns in general. In contrast with Chafe, they see the omission of
subjectivity from written texts in terms of conventional social practices rather
than effects of the medium. They claim that
100%
..
.
80%
2nd person
40
[i]t is typically the case that a speaker has face to face contact with the
person to whom he or she is speaking. That means, for one thing, that the
speaker and listener share a considerable amount of knowledge
concerning the environment of the conversation. It also means that the
speaker can monitor the effect of what he or she is saying on the listener,
and that the listener is able to signal the understanding and ask for
clarification
3rd person
60%
..
40%
3rdperson
2nd person
1st person
20%
0%
CMC
Writing
Speech
Figure 5. 1st, 2nd and 3rd person pronoun use as a proportion of total pronoun
use (Percentage of total)
42
..
SIMEON J. YATES
43
of modality in language by
--a
16
14
12
10
CMC
Writing
Speech
CoSy:Full
CoSy:50 (CMC)
LOB
(Writing)
18.2925
13.7042
London-Lund
(Speech)
14.4853
44
SIMEON J. YATES
CMC (C)
C,W
C,W
C
50%
40%
30%
W,S
W,S
S
C,S
S
W,S
W,S
100%
90%
80%
70%
60%
Writing (W)
45
Speech (S)
C
.
.
Hypothetical
IIIVolition
Possibility
IIIAbility
II Obligation
20%
10%
Our statistical results show that CMC differs from both speech and writing in all
cases except modals of epistemic possibility. All of the other results indicate
clear differences in the usage of modal auxiliaries within spoken, written and
CMC utterances, with CMC making the greatest use in all cases. Figures 7 and 8
collate these results and indicate the relative proportional use of modal auxiliaries
in all three media.
25
.
.
20
15
Hypothetical
Volition
Possibility
IIIAbility
II Obligation
10
0%
CMC
Writing
Speech
Though all of the other groups indicated a significant difference between CMC,
speech and writing, the modals of ability and possibility (abbreviated
'possibility' in Figures 7 and 8) indicate the greatest differences in use across all
three media.
Such results would seem to indicate that the CMC corpus contains a
considerable degree of discussion in which statements are modalized in one form
or another. Qualitative results obtained by Yates (1993b: 134-136) indicate that
the contextual use of modal auxiliaries within CMC is comparable to that of
speech.
5.
0
CMC
Writing
Speech
As was the case with pronoun use, the overall relative frequencies of modal
usage across the five semantic groups are most similar between speech and
CMc.
Conclusion
46
SIMEON J. YATES
singly-defined discursive object. In fact the text of the CMC interaction is the
field. Such lack of a defined semiotic field may be the explanation for the high
levels of modality within CMC discourse. Not only must the text carry the social
situation, it must also carry the participants' relationship to the situation, their
perception of the relationships between the knowledge and objects under
discussion.
Halliday considers next the semiotic tenor of the communication; this is
essentially defined by social roles, often made clear by the social situation in
which the participants are placed. The tenor of CMC interactions may not be as
limited as some researchers, especially those concerned with the issue of
network anonymity and psychological accounts of de-individuation processes,
indicate; it is limited, by the lack of a clear semiotic field, to those presentations
of self which take place within and through the CMC text. The necessity to
present oneself may be one factor behind the high levels of first and second
person pronoun use within CMC discourse.
Finally, the mode of CMC, as a communications medium, is neither
simply speech-like nor simply written-like. Though CMC bears similarities in its
textual aspects (e.g., type/token ratio and lexical density) to written discourse, it
differs greatly in others, namely pronoun and modal auxiliary use. Taken
together, these similarities and differences make clear the complexity of CMC as
a mode of communication. As with both written and spoken discourse, CMC is
affected by the numerous social structural and social situational factors which
surround and define the communication taking place. What is yet to be made
fully clear is the extent to which human beings in specific social and cultural
settings can develop and enhance their communication through the use of CMC.
NOTES
1.
2.
The DTIOOdata from 1989 are lost forever as they had been erased from the CoSy system
before they could be downloaded.
For a full discussion of the psychological aspects of information processing, see Just and
Carpenter 1980; Pollatsek and Rayner 1989; Rayner 1977, 1978; Rayner and Duffy 1986;
and for their effects on lexical density, see Yates 1993b.
1. Introduction
Increasingly, forms of computer-mediated communication are emerging that
enable "conversations" to take place in real time through the medium of written
language. Synchronous exchanges are becoming popular on bulletin boards,
server communities and across much of the Internet. In their simplest form, such
exchanges consist of two people in different rooms sitting in front of computer
screens that each display a window split into two parts. Everything that either
person types will be simultaneously displayed both on their own screen and on
the screen of the other person, in either the top or bottom half of the window. In
its more complex forms, synchronous computer-mediated communication can
involve people from all over the world communicating in sophisticated and
highly conventionalized ways within electronic zones that can be said to
constitute "virtual communities" (Rheingold 1993).
While different modes, technologies and social spaces inflect this form of
communication in significantly different ways, all have in common that they
involve the production of writing via computer such that synchronous textual
dialogue takes place between spatially distant interlocutors. Such
communication, which following Ferrara et al. (1991) I term "interactive written
discourse", I provides fertile ground for analysis since it makes possible
interesting forms of social and linguistic interaction, brings into playa unique set
of temporal, spatial and social dimensions, reconfigures many of the parameters
that determine important aspects of how communicative acts are structured, and
provides a clear instance of how forms of writing made possible by the
computer exhibit properties that converge with those typically associated with
spoken discourse.
This paper examines a particular form of interactive written discourse
known as Internet Relay Chat, or IRC. I begin by providing a brief sketch of