ARABIC
A CORPUS-BASED STUDY
2004
Abdel-Hamid Elewa
No portion of the work referred to in the thesis has been submitted in support of
an application for another degree or qualification of this or any other university,
or other institution of learning
Acknowledgements
First and foremost, I thank God Almighty, Who teaches man what he does not know.
Then, I would like to express my gratitude to my supervisor, Dr. Paul Bennett, who
throughout the years of my research showed unwavering perseverance, gave me so much
of his time and enriched my work with his invaluable comments.
During this work I have collaborated with many colleagues for whom I have great
regard, and I wish to extend my warmest thanks to all those who have helped me
with my work in the Department of Language Engineering, particularly Sattar Izwaini
and Amin Almuhanna; together, through our discussions and commentary on the
Arabic language, we managed to raise many interesting points.
I would also like to thank my examiners, Prof. Harold Somers, Dept. of Informatics,
University of Manchester, and Dr. James Dickens, Dept. of Middle Eastern Studies,
University of Durham, for their criticism and helpful comments, which gave my thesis its
academic form.
My thanks are also due to my wife, Iman Refaey, who helped me in assembling the
electronic corpus for use in this research.
Table of Contents
TABLE OF CONTENTS
ABSTRACT
NOTES ON TRANSLITERATION
4.2.3 Tagging Arabic Texts
4.2.4 Tools for Processing Arabic
4.3 DESCRIPTION OF THE CORPUS
4.3.1 The rationale behind this selection
4.3.2 Why these texts?
4.4 Conclusion
CHAPTER FIVE: LEXICAL COLLOCATION
5.1 INTRODUCTION
5.2 DEFINITION OF COLLOCATION
5.3 COLLOCATION AND COLLIGATION
5.4 TYPES OF COLLOCATION
5.5 SPANS
5.6 SEMANTIC PROSODY
5.7 EXTRACTION OF COLLOCATION
5.7.1 Using statistics in collocation extraction
5.7.1.1 Lemmatisation
5.7.1.2 Concordances
5.7.1.3 Frequency
5.7.1.4 T-test: a measure of difference
CHAPTER SIX: SYNONYMY: AN OVERVIEW
6.1 INTRODUCTION
6.2 DEFINITION
6.2.1 Synonymy - Four Approaches
6.2.2 Degrees of Synonymy
6.2.2.1 Absolute synonymy
6.2.2.2 Propositional synonymy
6.2.2.3 Near-synonymy
6.3 SYNONYMY IN ARABIC
6.4 THE REPETITION OF SYNONYMS IN ARABIC
6.5 CONCLUSION
CHAPTER SEVEN: COLLOCATIONAL TREATMENT OF SYNONYMY IN ARABIC
7.1 INTRODUCTION
7.2 DATA CHOICE
7.3 DATA ANALYSIS
7.4 A CASE STUDY: THE WORD PAIR JAA’A AND ATA ‘COME’
7.4.1 Summary
7.5 A CASE STUDY: THE WORD PAIR ITHM AND DHANB ‘SIN’
7.5.1 A Few Remarks
7.5.2 Summary
7.6 A CASE STUDY: THE WORD PAIR ḤASIBA AND ẒANNA ‘THINK’
7.6.1 Summary
7.7 A CASE STUDY: THE WORD PAIR ḤBB AND WDD ‘LOVE’
7.8 CONCLUSION
CHAPTER EIGHT: CONCLUSION
APPENDICES
APPENDIX 1: COPYRIGHTS
APPENDIX 2: THE CONTENTS OF THE CAC ARE SUMMARISED IN THE FOLLOWING CHARTS (1) & (2)
APPENDIX 3: GENRES AND TEXTS INCLUDED IN CAC
APPENDIX 4
APPENDIX 5
BIBLIOGRAPHY
Abstract
I am concerned in this study with applying the corpus linguistics methodology that
concentrates on investigating language use, with particular reference to Classical Arabic. I do
not wish to undermine what has been done on the basis of intuition, but the time is now
opportune to use modern tools to discover new facets of linguistic behaviour in relation to
Classical Arabic and to demonstrate the potential impact of computational methods on Arabic
linguistic studies.
One of our main aims will be to demonstrate the usefulness of the corpus methodology in
describing Classical Arabic by examining lexical collocations. To do this, I have assembled a
Classical Arabic corpus which covers the early period of Islam, because the available Arabic
corpora are limited to the Classical Arabic of today, known as Modern Standard Arabic.
This study is also an attempt at explaining some issues in semantic relations, particularly
synonymy, which can be accounted for in terms of collocations by using a computerized
concordancer that enables large quantities of text to be searched for all occurrences of a
particular lexical item. Through lexical collocational analysis I can compare and contrast the
characteristic uses of semantically related words such as synonyms. According to Cruse
(1986) two lexical units would be absolute synonyms (i.e. would have identical meanings) if
and only if all their contextual relations were identical. Through corpus analysis we can show
whether two items are indeed absolute synonyms or not by checking their relations in all
available contexts.
By this technique, it is possible to compare seemingly synonymous words and find out
whether they are real synonyms or not. I will argue that absolute synonyms do not exist in
terms of their collocational patterns. Through collocation we can distinguish one sense of a
word from another and know whether a seemingly synonymous pair are real synonyms or
not. Collocation is, therefore, a device with which a particular sense of a word is activated.
In order to prove that subtle differences can be brought out by collocation, the collocates
for a list of synonymous pairs are analysed. This will be explored through the analysis of
these seemingly synonymous Arabic words, aiming to show that many synonyms are partial
or incomplete, and none can be called true (absolute) synonyms.
Notes on Transliteration
There are two common ways to represent the Arabic script in the Roman script: transliteration
and transcription. The former is based on graphemic mapping and the latter is phonemic.
There are some Arabic consonants and vowels which have equivalent letters in the Roman
alphabet. These are easy to transliterate or transcribe; it depends on what purpose one has for
rendering them in either way.
For the letters which have no Roman equivalent, linguists or Arabic users sometimes adopt a
set of symbols which are mainly transcriptions. Such a process yields a mixed system of
transliteration and transcription. ‘This leaves plenty of scope for scholarly debate, with the
result that there are now many supposedly international standards’ (Whitaker, 2002).
Among the most common systems are the one adopted by the International Convention of
Orientalist Scholars in 1936, the British Standard, BS 4280, the US Library of Congress and
the American Library Association. The latter have issued “Romanisation tables” for more
than 150 non-Roman written languages and dialects including Arabic (ibid).
One of the reasons given by Whitaker (ibid) for the inefficiency of these Romanisation
systems is that they are not easy to key in, because of the special marks they use, such as
dots and lines.
For practical reasons, I have tried to use a transliteration system which makes the utmost use
of the English alphabet. It depends to a great extent on the one adopted by the US
Library of Congress, with some modifications as shown below:
Arabic Transliteration Chart
Name of letter    Arabic letter shape    Symbol in transliteration
hamza             ء                      ‘
ba:               ب                      b
ta:               ت                      t
θa:               ث                      th
ji:m              ج                      j
ha:               ح                      ḥ
xa:               خ                      kh
da:               د                      d
dha:l             ذ                      dh
ra:               ر                      r
za:               ز                      z
si:n              س                      s
shi:n             ش                      sh
sa:d              ص                      ṣ
da:d              ض                      ḍ
ta:               ط                      ṭ
ẓa:               ظ                      ẓ
cayn              ع                      c
ġayn              غ                      gh
fa:               ف                      f
qa:f              ق                      q
ka:f              ك                      k
la:m              ل                      l
mi:m              م                      m
nu:n              ن                      n
ha:               ه                      h
wa:w              و                      w
ya:               ي                      y
Such a chart is easy to use because it is familiar to both Arabic and English speakers. For
Arabic consonants that do not have equivalents in English we used the most common system.
This applies to two types of sounds: emphatic and pharyngeal. For the former we put a dot
under the symbol to show emphasis and for the latter we used two symbols (c and ‘). Because
some consonants are written with two letters, it is difficult to represent the doubling of a
consonant, as in dhdh or khkh. We would rather ignore doubling with such consonants; this is
much easier than struggling with new symbols.
The Arabic definite article al ‘the’, which sometimes takes another form when assimilated
to the following sound, is always written al, without showing any assimilation. Long vowels
are marked by doubling the short vowel, to avoid adding more marks to the symbols, except
in proper nouns whose spelling is commonly used among Arabs and Arabists.
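The chart can be read as a simple lookup table. The following Python sketch is only an
illustration and not the procedure used in this thesis: it maps each Arabic consonant to its
symbol, writing the emphatics with a dot under the letter as described above. The names
CHART and transliterate are hypothetical. Because short vowels are normally unwritten in
Arabic script, and و and ي are treated here only as consonants, the output is a consonantal
skeleton rather than a full transliteration.

```python
# Consonant symbols taken from the transliteration chart
# (emphatics are written with a dot below; c stands for cayn, ‘ for hamza).
CHART = {
    "ء": "‘", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "ḥ", "خ": "kh",
    "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s", "ش": "sh", "ص": "ṣ",
    "ض": "ḍ", "ط": "ṭ", "ظ": "ẓ", "ع": "c", "غ": "gh", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
}


def transliterate(text: str) -> str:
    """Replace each Arabic letter with its chart symbol; leave anything else unchanged."""
    return "".join(CHART.get(ch, ch) for ch in text)


print(transliterate("كتب"))  # ktb
```

A full system would also need rules for vowels, the definite article and doubled
consonants, which a letter-by-letter table cannot express.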
Chapter One: Introduction
1.1 The Rationale Behind the Study
A general motivation for many recent linguistic studies has been the desire to automate some
descriptive processes and to employ scientific observation in the study of language.
Linguistic studies in Arabic were first introduced and established by Al-Khalil, who was the
first lexicographer to impose a lexical order on the entries of his dictionary (cf. Haywood
1965), and his outstanding pupil, Sibawayh, in the late 8th century. What Al-Khalil and
Sibawayh did was to investigate language use in order to formulate rules and describe
linguistic devices.
Although Arab lexicographers were the first to integrate corpus analysis into the
dictionary-making process, with Al-Khalil’s manual corpus discussed below in Chapter Two, a
corpus-based approach is certainly not used in contemporary lexicography in the Arab world.
Mainstream lexicography is undoubtedly intuition-based.
1.2 Goals
The current study will provide the resources for accurate descriptions of the way words
co-occur in Classical Arabic. For that purpose, the major activity of the study has been the
assembly and analysis of a corpus comprising samples of different types of written Arabic:
biography, religion, poetry, etc.
With this in mind, I decided to work toward the compilation of a comprehensive corpus of
written Classical Arabic in order to facilitate research in a range of disciplines concerned with
Arabic and with the general methodology of Corpus Linguistics. I would like to emphasise
that the Classical Arabic Corpus will be available to any potential user.
1.3 Corpus-driven or Corpus-based
Two approaches can be at play when working with corpora: corpus-based and corpus-driven
(Tognini Bonelli, 2000). When a linguist describing a language with this methodology
observes a phenomenon without prior knowledge of the validity of a particular theory, i.e.
when he or she finds something unexpected, the approach is called corpus-driven. For
instance, the subtle differences that occur between synonymous pairs, and the semantic
features extracted for every word that distinguish it from another (as shown in Chapter
Seven), are neither obvious from casual observation nor available in the literature I have
examined. On the other hand, when we use corpus linguistics methodology to support or
invalidate an existing hypothesis or theory, the approach is called corpus-based. For example,
in Chapter Five we test a collocation assumed to be fixed and find that it is not a collocation
at all.
Collocation was recognised early by Arab linguists, but the phenomenon was only referred to
in passing and was not studied extensively. Al-Sakkaki, for example, in Miftah al-Ulum
defined it as ‘likull kalimah maca ṣaaḥibatiha maqaam’ (every word has with its
companion a position [lit. trans.]). This roughly means that every word has a different sense
with a different adjacent word. Emery (1988: 51) regards this quotation as equivalent to
Firth’s (1957: 179) definition of collocation, which is the company that a word keeps. He
also considers the classification of Thacalibi’s lexicon, Fiqh Al-Lughah2, as showing its
author’s awareness of how significant collocational relations are.
1 Corpus-based methodology has been widely used for other linguistic fields (Biber, 1998; Meyer, 2002).
2 This lexicon, which was written ten centuries ago, classifies the types of actions with their specific doers and
the types of words with their specific predicates. It can therefore be compared to Benson’s (1997) work on
collocation, The BBI Dictionary of English Word Combinations.
lexical level. This is what is traditionally called collocation3. In this sense, ‘collocation is
restricted to idiosyncratic relationships between words’ (Wouden, 1997: 24).
1.5 Synonymy
One of the main goals of this study is to check the synonymy or non-synonymy of a given
pair of items. We will use corpus-based analysis and computer technology, which make it
easy to identify the relative frequency of words, whether throughout the whole corpus
or in a particular genre. Subsequently, we can explore the collocates of words and further
isolate the various meanings, or senses, a word has. This is especially interesting for words
which are considered synonyms, since an investigation may reveal differences in syntactic
and/or stylistic distribution. Such research might show that near-synonymous words or
structures are used in different ways.
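The two operations mentioned here, finding the relative frequency of a word and listing its
collocates, can be sketched in a few lines of Python. This is a minimal illustration under
stated assumptions, not the software used in this study: the corpus is assumed to be already
split into word tokens, the function names are hypothetical, and the default span of four
words on each side of the node word is an arbitrary choice made for the example.

```python
from collections import Counter


def relative_frequency(tokens, word):
    """Share of all tokens in the list that are exactly `word`."""
    return tokens.count(word) / len(tokens) if tokens else 0.0


def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Toy token list, used only to show the calls.
tokens = ["a", "b", "c", "a", "d"]
print(relative_frequency(tokens, "a"))                # 0.4
print(collocates(tokens, "a", span=1).most_common())  # [('b', 1), ('c', 1), ('d', 1)]
```

Running the same functions on the tokens of a single genre, rather than the whole corpus,
gives the genre-specific figures referred to above.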
Synonymy is understood as a gradual cline along which we may locate different degrees of
synonymy: near, cognitive and absolute. However, there is a widely held opinion among
semanticists that strict or absolute synonymy is rare in human languages (see Cruse, 1986). A
further step is taken here in this study to demonstrate that absolute synonymy does not exist
in Arabic. The study will argue that Arabic never has two words that mean nearly the same
thing and are used in the same range of grammatical and lexical patterns.
Chapter Two discusses the scope of Arabic linguistics and pinpoints some technical problems in
digitising Arabic. Chapter Three gives a brief account of the methodology of corpus
linguistics and surveys its historical background. Chapter Four describes the corpus compiled
especially for this study and gives an account of the tools used for analysis. Chapter Five
discusses lexical collocations with a particular emphasis on Arabic. Chapter Six addresses the
concept of synonymy in English and Arabic. Chapter Seven tries to find differences between
seemingly synonymous word pairs by studying their collocations and suggests that applying
corpus linguistics methodology to Arabic can help us become aware of lexical matters.
Chapter Eight is dedicated to findings and conclusions.
Chapter Two: Some Aspects of the Arabic Language
2.1 Introduction
The Arabic language originated in Arabia in pre-Islamic times, and spread rapidly across the
Middle East. Today it is spoken as an official language by almost 200 million people,
Muslims and Christians, in more than twenty-two countries, from Morocco in Africa to Iraq
in Asia, and as far south as Somalia and Sudan. As the language of the Qur’an, the Holy Book
of Islam, it is to some extent familiar throughout the Muslim world, rather as Latin was in the
lands of the Roman church. It is taught as a first language in all Arab countries and as a
second language in non-Arab Muslim states. It is the liturgical language of about one billion
Muslims. In addition, Modern Standard Arabic is the lingua franca used and respected by
educated Muslims throughout the entire world.
Although Arabic is widely used throughout the Arab world in different vernaculars for everyday
communication and entertainment, Modern Standard Arabic is still adopted as the formal
language of the press, writing and speeches. Because the Qur’an was revealed in
Arabic, most Arabs think that this language must be perpetuated and kept alive (Haeri, 2003).
They always emphasise that Classical Arabic, as a living language, should be used in formal
written and spoken language. Bakalla (1983) argued that a ‘living’ language is by definition the
language acquired by children at an early age, and this is not the case with Classical Arabic.
However, the general desire among educated Arabs is to write and read literary works,
Islamic and general books in an elegant language and nothing can be more beautiful than
Classical Arabic. ‘In that sense Classical Arabic is [a] ‘living’ language, but it is not a ‘living’
in the sense of colloquial’ (Bakalla 1983: xvii).
4 The term ‘Arabic’ is applied to a number of speech-forms which, in spite of many and sometimes substantial
differences, are reckoned as dialectal varieties of a single language. The term Classical Arabic is sometimes used
as a synonym of Standard Arabic. However, I will use the former to refer to the early Classical Arabic which
extends over the first four centuries of Islam, i.e. until the early eleventh century, whereas the latter is used to
refer to the modern Classical Arabic. These two varieties are sometimes interchangeable; they can be used in
formal situations such as schools, universities, textbooks, lectures (whether religious or academic), mass-media
and personal writing as in letters and autobiography.
This is an important question in linguistic study because if we believed that Arabic is God-
given, we would stick to the Qur’anic language and the expressions used by the ancient Arabs
and the early Muslims. Ibn Faris (Ṣaḥibi, p. 17) said, ‘We are not entitled to-day to
innovate, to use expressions which they did not use, or to develop analogies which they did
not know; for this would mean corrupting the language and annihilating its essence.’
Unlike in English and other languages, there was no detailed discussion in Arabic literature
concerning the origin of speech. Arab linguists did not concern themselves with this question
because, owing to the aforementioned Qur’anic verses, they thought that Arabic was revealed
by Allah. The question was considered theological rather than linguistic. Even those who
thought that Arabic was not revealed by Allah gave up investigating this question, since there
was no conclusive evidence for either position. Most grammarians, however, regarded Arabic
as a God-given language. Therefore, Arabs had to keep to the usage of their predecessors, to
whom the Qur’an was revealed. All they could do was to describe this usage for Arab and
non-Arab people so that they could keep to genuine Arabic, the language of the Qur’an.
As a point of departure, we can see how Islam influenced the study of language. Before the
advent of Islam, Arabic itself was used by only a limited number of people.
The introduction of Arabic grammar was motivated by Islamic incentives to protect the
language from being corrupted by converts.
Arabic is of supreme importance for all Muslims and for those who are interested in
the study of the Orient; for the former it is their religious language, which contains the Qur’an,
the Prophetic traditions and the early Muslim works, and for the latter it is the medium of
Arabic culture.
Ibn Faris (Sahibi, p. 17) noted that Arabic is the most eloquent language. If we attempt to
translate the word sayf (sword), for example, into Persian, we find only one equivalent
word. In Arabic, we can have many words for ‘sayf’, each with a specific connotation.
To most Arabs, Arabic has a magical effect on their souls. Hitti (1958: 90) said,
There is no consensus among Arab or foreign linguists with regard to who is the founder of
Arabic grammar. Some argued that Ali (the fourth Caliph) is the true founder of Arabic
grammar as a science. He gave the first glimpse by dividing words into three classes: ‘noun’,
‘verb’ and ‘particle’. Others said that Abu Al-Aswad Ad-Du’ali was the first to write a
treatise on Arabic grammar, on the basis of what Ali or Ziyad Ibn Abihi, who was the
governor of Iraq at the time, supposedly told him.
Although people differ as to who introduced Arabic grammar, they are unanimous in
asserting that it was introduced to preserve the language of the Qur’an. Al-Anbari (Nuzhat:
11) concluded that the first founder of grammar was Ali ibn Abi Talib, because all stories
referred to him and Abu al-Aswad referred to Ali ibn Abi Talib. Abu al-Aswad himself
admitted that he learned grammar from Ali ibn Abi Talib.
The first written treatises in Arabic grammar appeared at the end of the eighth century when
Al-Khalil ibn Ahmad and his outstanding pupil Sibawayh wrote their influential and
pioneering books describing the Arabic language. The former wrote his dictionary of Arabic
Al-cAyn and the latter wrote his grammatical description of Arabic.
The science introduced by Abu al-Aswad dealt with all branches of modern linguistics as a
whole. There was no separation among the different fields of linguistics, as there is in modern
times. Many of the early Arab scholars were able to write in all branches of linguistics.
For example, Sibawayh’s Kitab dealt with phonetics, syntax, morphology and phonology.
Moreover, Al-Zamakhshari produced outstanding works in the fields of syntax and
lexicography, in addition to his pioneering work in the exegesis of the Qur’an.
The golden age of Arabic linguistics was between the eighth and the eleventh century. Chejne
(1969: 170) notes that “in the 12th and 13th centuries Arabic was looked upon with
admiration by the West, in the same manner the Arab of today looks at the more developed
Western languages.”
Owens (1998, ch. 9) argued that Arabic linguistics reached its methodological peak and its
most sophisticated level with Jurjani (d. 1078). Later linguists made many contributions
until the end of the eleventh century, but they were mainly interested in reworking
what had been done by their predecessors.
Little has been contributed in the past millennium. Linguists throughout this period
merely remodelled, or made relatively slight changes to, what had been done in the early
ages of Islam. This small contribution, based on the same corpus used by their
predecessors, remained within the general framework introduced by the early linguists, as
‘...the major preoccupation of grammarians… (after 1077)… was to find ever new ways of
saying the same thing’ (Carter 1985a: 270, quoted by Owens, 1988: 8). In other words,
‘Sibawayh had, in fact, laid down the basic rules and methods of grammar, while the later
grammarians’ contribution consisted only in expounding his theory in a more explicit and
systematic form, or in finding new applications for it’ (Bohas, Guillaume and Kouloughli,
1990: 5). They were mainly concerned with codifying and preserving the literature of their
predecessors.
in verifying and editing the grammatical manuscripts left by the Arab grammarians. On the
other hand, it tries to explain and interpret such work in modern linguistic terms.
During the last four decades the study of the Arabic language has expanded dramatically. The
current tendency has been to enrich Arabic with modern theories of linguistics through
comparative or applied linguistic studies. Two main features characterise
modern Arabic linguistics of the last decades. The first is the tendency towards the application
of linguistic theories and methodologies, especially to the teaching of Arabic as a first language.
The second is the use of modern techniques in linguistic research, as in computational
linguistics and corpus linguistics.
Much of the work in this field was done in thesis or dissertation form, both in the universities
of the Arab world and abroad. Very few of these studies have been published. Straley (1989)
listed the dissertations done at American universities in the field of Arabic linguistics
from 1967 to 1987 in an annotated bibliography. Straley noted that these dissertations, in
general, cover a wide variety of topics: phonology, grammar, comparative linguistics,
language planning, sociolinguistics and pedagogy. Bakalla (1983: xxxvii) pointed out that
much of the work on Arabic linguistics ‘has been influenced by developments within
linguistic theory and that many studies have been formed in, and reflect, contemporaneous
theory’.
The same interest in engaging with developments in linguistic theory, which is a dominant
paradigm in all branches of science, is reflected in the establishment of Arabic teaching
centres in the Arab world and abroad and in the appearance of periodicals and journals
devoted to Arabic linguistics, such as the Journal
of Arabic and Islamic Studies (JAIS), Journal of Arabic Linguistics (in Germany), Arabica
and Al-cArabiyya (Arabic). Moreover, a number of major universities all over the world are
now engaged in organising conferences, workshops and seminars devoted to Arabic
linguistics for many purposes: scientific, commercial, or others.
With the introduction of computational techniques into the field of linguistics in the USA and
Europe, a corresponding interest in the use of computers to investigate the Arabic language
grew, as had been the case for theoretical linguistics. Academic centres, companies and
conferences specialising in Natural Language Processing flourished in the Arab countries and
abroad5. Research in this domain is currently under development.
5 The Institute for the Languages & Cultures of the Middle East, University of Nijmegen, focuses nowadays on
Arabic Natural Language Processing. It recently produced an Arabic/Dutch dictionary based on a large
Arabic corpus. Also, some companies like Sakhr (based in Egypt) are involved in developing computational
solutions for Arabic, and there are conferences worldwide which specialise in Arabic.
5. Arabic words are formed from roots, based on fixed morphological patterns, where vowels,
suffixes, prefixes, or infixes can be added to form new words. Once we know these patterns,
it is easy to form any possible word without making mistakes. More interestingly, we can add
to the base form other linguistic units marking person, tense, mood, participles, case and
verbal noun. English words, on the other hand, are generated from stems. Therefore, the key
for searching the traditional lexicon in Arabic is the root6, whereas in English it is the
stem (the basic word form).
6. As Arabic is a synthetic language, it allows pronouns to combine with words forming one
single word. Such personal pronouns can be suffixed to nouns, verbs or particles. We may
form an Arabic word representing a whole sentence. Consider the following word in (1)
below.
(1) ضربوك ḍarabuuka (they hit you).
This property raises another problem for analysing Arabic computationally. When searching
for a word in an electronic text, we have to search for every possible form of this word. This
is because, if we look only for the stem of the word, as in English, we will find a huge number
of results which are not needed. In Arabic, adding more characters to a string can yield words
from different roots. For example, a search for cam (year) also matches camer (populated),
nacam (ostrich) and camel (worker), which are derived from different roots. A simple
word search program which is not trained on Arabic idiosyncrasies returns all these
occurrences, so it cannot give a good result without laborious hand-editing.
7. Word order in Arabic is more flexible than in English. There are two types of word order in
Arabic: VSO and SVO.
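The search problem raised in point 6 can be illustrated with a minimal sketch. It uses the transliterated forms given above (c standing for the letter ʿayn); the toy text and both function names are assumptions made for illustration, not part of any tool used in this study.

```python
# Toy transliterated text containing the four forms discussed in point 6.
text = "cam camer nacam camel cam"

def substring_hits(query, text):
    """Return every whitespace-delimited token that contains the query string."""
    return [tok for tok in text.split() if query in tok]

def whole_word_hits(query, text):
    """Return only the tokens that are identical to the query string."""
    return [tok for tok in text.split() if tok == query]

noisy = substring_hits("cam", text)   # also picks up camer, nacam and camel
exact = whole_word_hits("cam", text)  # only the two occurrences of cam
print(noisy)
print(exact)
```

Whole-word matching removes the unrelated roots, but it would in turn miss forms of the same word that carry attached pronouns or particles, which is why every possible form has to be searched for separately.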
6 By the word ‘root’ I mean the three or four nuclear consonantal letters from which we can generate all
possible word forms in Arabic by adding suffixes, prefixes or infixes.
Chapter Three: Corpus Linguistics
3.1 Introduction
Corpus is a Latin word which means ‘body’, hence any collection of texts, linguistic or non-
linguistic, can be called a corpus, such as the Corpus Juris Civilis which was a collection of
early Roman laws and legal principles in the sixth century and the corpus Manuscript of
Chaucer (1400) which included Chaucer’s works. In 1731 Alexander Cruden used the Bible
(King James Version) as a corpus to show that the Bible is consistent (Kennedy 1998: 14). In
modern linguistic terms, a corpus is a designed collection of written, spoken or a mixture of
written and spoken data which can be used for linguistic investigation. In this sense, not any
collection of texts can be called a corpus since there is a big difference between a corpus and
a text database; the former has to be ‘a systematic, planned, and structured compilation of
text’ (ibid: 4).
Linguists throughout the history of linguistic research used to rely on textual resources as a
source of evidence, at least, to prove the correctness of their theories about language. ‘It is
obvious that if someone sets about writing a grammar of English, he must have a suitable
body of material from which he is to elicit his rules, whether they be purely descriptive, or, as
is more common, prescriptive or even pedagogical. These bodies of material may be
considered corpora, with some extension of the term’ (Francis 1992: 28).
The study of language in general, whether in the context of modern linguistics or in the
context of earlier linguistic studies has also been largely based on empirical research. This
empirical approach to language is basically dominated by the observation of naturally
occurring data, as linguists tended to gather evidence for the grammaticality of a given word
or a sentence. This is partly what corpus linguistics deals with. However, corpus linguistics
goes beyond the use of corpora as a source of evidence in linguistic description. ‘Corpus
linguistics, like all linguistics, is concerned primarily with the description and explanation of
the nature, structure and use of language and languages and with particular matters such as
language acquisition, variation and change’ (Kennedy 1998: 8).
Nowadays, two main objectives can be met via corpus collection: linguistic investigation and
language processing. As Souter and Atwell (1993: i-ii) explained,
With the advent of Chomskyan theories in the 1950s, less emphasis was placed on empirical
observations. With the authority of his works, Chomsky has directed linguistics away from
empiricism and the study of language use towards rationalism for many years. Following de
Saussure, he made a distinction between two approaches to looking at language: a theory of
language system and a theory of language use. These two approaches are drawn (1965) as
competence and performance.7 Chomsky, rejecting the corpus linguistics approach, argued
that:
Any natural corpus will be skewed. Some sentences won’t occur
7 Competence can be defined as ‘the speaker-hearer’s knowledge of his language’ whereas performance is ‘the
actual use of language in concrete situations’ (Chomsky: 1965: 4). Competence both explains and characterises
one’s internalised knowledge of a language. The only way to investigate competence is through introspection.
because they are obvious, others because they are false, still others
because they are implicit. The corpus, if natural, will be so wildly
skewed that the description [of language] would be no more than a
mere list.
(Chomsky, 1962, quoted in Leech 1991: 8)
In the course of invalidating corpus-based studies, he gave a lecture at the Linguistic
Society of America Summer Institute in 1964, in which he rejected any kind of quantitative
(statistical) data. To prove his argument, he gave the following examples in (1a & 1b) below:
1a. I live in New York.
1b. I live in Dayton, Ohio.
Sentence (1a) above is likely to occur more frequently, just for demographic reasons!
Following Chomsky, Horrocks (1987: 13-14) argues that although performance is the only
available evidence to the linguist, it is not a transparent reflection of competence. He (ibid:
16) expounded that an observationally adequate grammar cannot simply list all the well-
formed sentences of a given language. This is because our mind has a finite storage capacity
and the choices of language we produce are infinite. Only by positing competence can we
account for a finite system with the capacity to define the membership of an infinite set.
Therefore, Chomsky suggested that ‘the corpus could never be a useful tool for the linguist,
as the linguist must seek to model language competence rather than performance’ (McEnery
and Wilson, 1996: 5).
Horrocks (1987: 16-17) further argued that relying on a corpus to derive grammatical rules
will lead to some sort of rules which have a predictive power which can generate strings not
available in the corpus itself. However, we can only test the validity of such strings through
referring to the intuition of a native speaker.
In fact, the approaches based on Chomsky’s theories, which were considered mainstream in
linguistics, do not cope with vast areas in language study, most notably register variation
where probability plays a major role in selecting certain combinations of meaning with
certain frequencies. However, the bitter criticism of corpus data arising from the tradition
which Chomsky established has led corpus linguists to remedy the drawbacks of corpus data
such as balance and representativeness. To pursue the premise, I would suggest, following
Francis (1992), that if someone sets about writing a grammar of a given language, he must have a
corpus from which he is to derive his rules. Hence, the grammatical rules are derived by
analysis and generalisation of a corpus.
Makkai (1987) considers the total reliance on intuition a serious disease affecting modern
linguistics, which he called textphobia and which needs radical surgery. A useful cure for this
disease, he proposes, is reading Malinowski, Firth and Halliday.
It is worth stressing that eliminating observation from the study of language was fervently
criticised by linguists even before Chomsky. Criticising de Saussure’s approach, Malinowski
in 1936 suggested overlooking the question of langue and parole and paying more attention to
the living speech in a context of situation, which is the main object of linguistic study
(Roulet, 1975: 78).
Firth (1957) also discredited the introspection of the native speaker as a reliable source of
data. He observed that the language we produce is governed to a large extent by particular
conventions (social, situational, etc.).
Sinclair (1991) also criticised the reliance on intuitive data, especially in the field of word
meaning, lexis. He argued that ‘we may see formal patterns being used overtly as criteria for
analysing meaning, which is a more secure and less eccentric position for a discipline which
aspires to scientific seriousness’ (Sinclair, 1991: 6-7).
In conclusion, studying corpora of naturally occurring data is a very useful way to test a
theoretical model put forward through intuition or to investigate a language with an emphasis
on what is typical in this language or what is called norms of use.
The early Arab linguists relied mainly on three sources of linguistic data to describe their
language: the Holy Qur’an, poetry and nomad proverbs. This is obvious in their use of
quotations from these sources as linguistic evidence. Such quotations were certainly taken
from a corpus they designed for their inquiry about language. They have postulated certain
selection criteria for designing such a corpus. Versteegh explained, ‘on the one hand, the
corpus used by the grammarians was closed, being limited to the text of the Qur’an and the
pre-Islamic poetry, but on the other hand, the grammarians upheld the fiction of native
speakers whose judgement could be trusted’ (1997: 42).
They made it as representative as possible. Ditters (1990: 130) described this corpus as
consisting of specific media, registers, genres, styles and varied topics including poetry and
prose. He (ibid: 133) pointed out the way early Arab grammarians employed the corpus they
assembled:
Originally corpus-information constituted the basis for a grammar of
the Arabic language, but instead of the grammar being tested out again
and again on corpus-data in a cyclic process as is the case in modern
corpus linguistics, this grammar became the norm for language use.
As for English language corpora, Francis (1992) gives a full description of English pre-
computer corpora. He divided corpora into three types: lexicographical, dialectological and
grammatical. But he pinpointed two drawbacks in these collections. (1) The editors of
lexicographical collections, the Oxford English Dictionary and Webster’s Dictionary in
particular, encountered a big problem, as they did not have enough citations for function words
and simple words such as prepositions, articles and pronouns. (2) The major difficulty with
collections assembled for grammatical investigation is that ‘they are inevitably skewed in the
direction of the unusual and interesting constructions that the readers encounter, at the
expense of the normal core of the language’ (Francis 1992: 28). Commenting on this,
Johansson (1995) suggested, ‘the natural solution to this problem is to collect texts in a
systematic manner and subject them to the principle of “total accountability”’ (Johansson,
1995: 244).
Quirk, in an attempt to avoid the shortcomings of the other corpora, collected a more
representative corpus (spoken and written), taken from a wide range of genres, as a basis for
describing English grammar. Therefore, his Survey of English Usage is considered a
landmark in corpus-based grammatical description in the 20th century. It is important to note
that ‘the spoken part of SEU corpus was, however, later computerised yielding the London-
Lund Corpus’ (Svartvik, 1990 quoted in Kenny 1999: 32). Therefore, Kennedy (1990: 17)
pointed out that the SEU corpus, which was initially manually assembled is considered a
transitional point between a non-computerised corpus and modern corpus linguistics.
Undoubtedly, working on such large corpora was tedious and exhausting. This is because
processing corpora without the assistance of computer techniques is time-consuming, banal,
error-prone, boring and very expensive (McEnery and Wilson, 1996: 10). It now takes a
matter of minutes to process such corpora by computer accurately.
As a point of departure we can conclude that the methodology of corpus linguistics, however
unrepresentative of the actual use of language, was widespread in linguistics for a long time.
Corpora remained as a source of data for linguistic research in spite of the difficulties raised
above until the 1950s, when the corpus for linguistic research underwent a severe blow at the
hands of Chomsky, who invalidated it as a reliable methodology (see 3.2).
Below I am going to give a brief account of two major English corpora: Brown Corpus as the
first computerised corpus and Birmingham Collection as the first major computerised corpus
used for dictionary-making based on a thorough study of the language use.
Brown Corpus
This was, undoubtedly, a pioneering corpus not only because it was the first computerised
corpus of English, but also because it went against the mainstream, which was intuition-oriented.
The corpus consisted of about one million words of written English printed in the
US in 1961, comprising 500 text samples of about 2000 words each. The samples were taken
from a variety of genres excluding verse and drama. The project started in 1961 and only
after three years (in 1964) was the corpus ready for distribution on magnetic tape.
Birmingham Collection
The starting point of this corpus goes back to the 1960s in the form of research carried out at
Birmingham University where Sinclair (1969) issued his early computational British corpus:
OSTI project (135000 running words of informal conversation transcribed and
computerised). The collection undertaken at Birmingham University is made up of written
texts and transcribed speech. It was intended to provide raw language data for a variety of
purposes, relevant to the needs of the learners and teachers, lexicographic in particular
(Renouf, 1984: 4-5). Since 1980 Cobuild, which is a joint venture between Collins and the
School of English at Birmingham University, has been collecting a corpus for dictionary
compilation and language study, making use of the Birmingham collection.
In October 2000 the latest release of the corpus amounted to 415 million words and it
continues to grow with the constant addition of new material. Research at COBUILD over
the last fifteen years has shown that very large samples of text are necessary for good
linguistic study, since the vocabulary of English is so large (well over half a million different
words) and there is such variety in current usage. In order to draw statistically valid
conclusions from computerised analysis of a corpus, researchers need to have adequate data
samples at their disposal (http://titania.cobuild.collins.co.uk/).
In addition to the corpora mentioned above, there are ‘a number of initiatives that have aimed
at collecting and disseminating textual material amongst the international research
community’ (Kenny 1999: 34). Below are examples of these initiatives: The ACL/DCI (the
Association for Computational Linguistics’ Data Collection Initiative) which produced a CD-
ROM containing just plain orthographic text. It consists of the Collins English Dictionary;
selections from the Wall Street Journal; the Penn Treebank of skeleton-parsed data compiled
by Mitch Marcus and his team at the University of Pennsylvania; and a database of scientific
abstracts. There are also some other initiatives like ECI (European Corpora Initiative), LDC
(The Linguistic Data Consortium), ELRA (The European Language Resources Association).
One of the first considerations in constructing a corpus is to specify for whom and for what
the corpus is designed: for personal research, or to serve as a general resource. Kennedy
(1998: 70) argued, ‘the optimal design of a corpus is highly dependent on the purpose for
which it is intended to be used.’ Anyhow, Atkins et al (1992) and Meyer (2002) drew up the
principal features of corpus design for whatever purpose. They discussed the practical stages
in building a corpus: selection of sources, text annotation, copyright permission, in addition
to some extra-linguistic variables.
3.4.2 Text Sampling
The next step after deciding the type, purpose and content of a corpus is to select and sample
the actual texts which will make up the corpus. Biber (1993: 243) pointed out that any
selection of texts is considered a sample, irrespective of being representative or not, but he
noted that ‘a corpus must be ‘representative’ in order to be appropriately used as the basis for
generalisations concerning a language as a whole.’ However, we have to bear in mind, in the
first place, that there may be a corpus that is designed to represent not the language as a
whole but one particular genre or the whole works of an author for example. Secondly, it is
feasible to get a grip of the complete Old English corpus or the complete Early Middle
English corpus, but a complete 20th c. British or American English corpus is not feasible.
This is because it is too difficult to access all the publications in a given language, let alone
speech.
Moreover, ‘the value of a corpus as a research tool cannot be measured in terms of brute size.
The diversity of the corpus, in terms of the variety of registers or text types it represents, can
be an equally important (or even more important) criterion’ (Garside, Leech and McEnery,
1997: 2).
With this in mind, Garside, Leech and Sampson (1987: 6) noted that Sinclair (1982) defined
the problem of corpus compilation as a problem of selecting the right sample from the
existing massive quantities of machine-readable texts. The main challenge in sampling the
population8 of a given language lies in representing all the relevant genres, topics or registers
while keeping the corpus at a manageable size. Therefore, sampling has to be conducted
according to statistical measures so that the sample will be qualitatively and quantitatively
representative of the entire publication and population.
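One way of sampling according to statistical measures is proportional stratified sampling, in which each genre contributes to the sample in proportion to its share of the population. The following is a minimal sketch only; the function name, the genre labels and the population figures are all hypothetical and do not describe how the corpus used in this study was sampled.

```python
import random
from collections import defaultdict

def stratified_sample(texts, total, seed=0):
    """Draw about `total` texts so that each genre keeps its population share.

    `texts` is a list of (text_id, genre) pairs. Quotas are rounded, so the
    sample size can differ slightly from `total` for awkward proportions.
    """
    rng = random.Random(seed)          # fixed seed keeps the draw repeatable
    by_genre = defaultdict(list)
    for text_id, genre in texts:
        by_genre[genre].append(text_id)
    sample = []
    for genre in sorted(by_genre):
        ids = by_genre[genre]
        quota = round(total * len(ids) / len(texts))
        chosen = rng.sample(ids, min(quota, len(ids)))
        sample.extend((text_id, genre) for text_id in chosen)
    return sample

# Hypothetical population: 60 news texts, 30 fiction texts, 10 science texts.
population = ([(f"news{i}", "news") for i in range(60)]
              + [(f"fic{i}", "fiction") for i in range(30)]
              + [(f"sci{i}", "science") for i in range(10)])
sample = stratified_sample(population, total=10)
print(sample)
```

With this population a sample of ten texts contains six news texts, three fiction texts and one science text, so the proportions of the population are preserved at a manageable size.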
Sinclair (1995: 27-28) made a distinction between a ‘whole text’ corpus and a ‘sample
corpus’. He noted that ‘samples are small, in relation to texts such as newspapers, books,
radio programmes, and of a constant size, hence not qualifying as texts.’ Unlike many corpus
linguists like Francis and Kucera in their pioneering corpus (Brown Corpus) in 1961, he
thinks that ‘whole text corpus’ should be a default value for anyone building a corpus. To
him, ‘the use of small samples is just a remnant of the early restraints on corpus building’
(ibid). Stubbs (1993: 11) also argues in favour of whole texts being the unit of study. He also
quoted Sinclair saying that ‘few linguistic features of a text are distributed evenly
8 To statisticians, this word does not necessarily refer to human beings as commonly used. We may have a
population of anything to be counted such as people, animals, trees, companies, books, cars, etc. (Stuart, 1968:
10).
throughout’, which could be overlooked with use of sample texts.
In sampling written texts, the designer of the corpus has to take into account some important
information about both the author and the reader who differ in regard to certain author-related
and work-related criteria. Such considerations, in addition to contextual criteria, are also
required when sampling spoken data. These criteria are by definition non-linguistic.
Atkins et al. (1992) have given a full systematic account of non-linguistic characteristics in
corpus design. Work-related criteria include, among other things, mode (written, spoken,
written to be read, written to be spoken), text origin, preparedness, participants, genre, style,
setting, factuality, topic, date of publication. Author-related criteria are those
associated with authors. These criteria are mainly demographic: geographical, ethnic,
socioeconomic, and social (age, education, sex, profession, nationality, age and size of
intended audience or readership, etc.). Contextual criteria refer to situationally-defined
varieties such as conversation (face-to-face vs. telephone (informal)), monologue vs. dialogue,
personal vs. impersonal.
Before starting the process of creating a corpus, the designer may have to get permission
from the publishers of his selected works, national or international, to use the text in an
electronic form for language research. Having got permission, he needs to capture the data.
Written corpora are easy to capture by keyboarding, scanning or downloading from the
Internet. However, proofreading is still needed to make sure of the reliability of the data.
Spoken material, on the other hand, is difficult to capture. Spoken materials need to be
recorded and then transcribed before processing. To have a reliable transcribed text is,
undoubtedly, time-consuming, expensive and error-prone. This is because people’s perception
of speech may differ in respect of prosodic features, situations, homophonous words, etc.
Once a text, written or spoken, is captured electronically, some information can be added,
electronically, to indicate some text features such as titles, chapters, paragraphs, sentence
boundaries, headings, various types of hyphenation, etc. This process is called marking-up.
There is also some other information which can be added to the text to show the parts of
speech of the words in each sentence (as in tagged corpora), or the sentence structure and the
function of each word in the sentence (as in parsed corpora).
Barnbrook (1996), Meyer (2002) and Kenny (2001) gave an overview of how to process such
a corpus. The first thing the computer techniques can do with texts is to provide word
frequency lists for the whole contents of the texts.
Frequency Lists
These lists can be made by identifying every word form in the text, counting identical forms
and classifying them according to a particular order: alphabetical, or according to their
frequency. This can be done in descending or ascending order. Listing words according to
their frequencies can show how often every single word form occurs in the text. Therefore,
‘by examining a list, one can get an idea of what further information would be worth
acquiring: or one can make guesses about the structure of the text, and so focus on
investigation’ (Sinclair, 1991: 31).
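The procedure just described can be sketched in a few lines. This is an illustrative sketch only: tokenising on white space and lower-casing are simplifying assumptions, and the function name and the sample sentence are hypothetical.

```python
from collections import Counter

def frequency_list(text):
    """Count identical word forms and return them in two orders."""
    tokens = text.lower().split()      # naive tokenisation on white space
    counts = Counter(tokens)
    # Descending frequency, with ties broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Plain alphabetical order.
    alphabetical = sorted(counts.items())
    return by_frequency, alphabetical

by_frequency, alphabetical = frequency_list("the cat sat on the mat the end")
for word, count in by_frequency:
    print(f"{count:>3}  {word}")
```

Reversing the sort key gives the ascending order mentioned above.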
Concordances
A concordance can be defined as a listing of all occurrences of a search-word in the text with a
short section of the context that precedes and follows each occurrence. Unlike word frequency
lists, the search-word is presented within its contextual environment; this can give more
information about the nature and behaviour of words. This format is also called KWIC (key
word in context). The search-word can be highlighted by putting it in the centre of each line,
with a space on each side. The lines for each key word can be sorted alphabetically according to
the left-hand or the right-hand context. Barnbrook (1996) describes the main features of
concordance programs in detail.
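A KWIC display of this kind can be sketched as follows. The sketch is not the algorithm of any particular concordancer; the function names, the context width and the sample sentence are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def sort_by_right_context(lines):
    """Arrange concordance lines alphabetically by what follows the keyword."""
    return sorted(lines, key=lambda line: line[2])

tokens = "the boy saw the dog and the cat saw the boy".split()
lines = sort_by_right_context(kwic(tokens, "the"))
for left, key, right in lines:
    # Right-align the left context so that the keyword sits in one column.
    print(f"{left:>12}  {key}  {right}")
```

Sorting on `line[0]` instead would arrange the lines by the left-hand context.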
Collocation
In addition to KWIC and word frequency lists, most programs also offer the possibility of
searching for word combinations within a specified range of words. Furthermore, if the
program is a bit more sophisticated, it might also provide its user with lists of collocates
based on some statistical tests. Collocation is discussed in detail in Chapter Five.
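As a rough illustration of both steps, the sketch below counts the words found within a given span of a node word and then scores a pair with a simple mutual information measure, log2(f(node, collocate) × N / (f(node) × f(collocate))). This measure is only one of several statistical tests in use and is chosen here as an assumption; the function names and the sample text are hypothetical.

```python
import math
from collections import Counter

def collocates(tokens, node, span=1):
    """Count the words occurring within `span` words of each occurrence of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def mi_score(tokens, node, collocate, span=1):
    """Simple mutual information score; higher means a stronger association."""
    freq = Counter(tokens)
    joint = collocates(tokens, node, span)[collocate]
    if joint == 0:
        return float("-inf")           # the pair never co-occurs in the span
    return math.log2(joint * len(tokens) / (freq[node] * freq[collocate]))

tokens = "strong tea and strong coffee but weak tea".split()
pairs = collocates(tokens, "tea")
print(pairs)
print(mi_score(tokens, "tea", "weak"), mi_score(tokens, "tea", "strong"))
```

On so small a text the scores are not meaningful in themselves; the point is only that the raw co-occurrence count is weighed against the individual frequencies of the two words.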
3.7 Summary
This chapter has given a brief account of the methodology of corpus linguistics and has
surveyed its historical background. We have investigated some aspects of corpus linguistics
to make it easy for the reader to be aware of the state of the art. Such aspects include the
methodology for creating a corpus, such as representativeness, size, sampling, etc., the types
of corpora as well as the technical requirements needed for utilising corpora.
Chapter Four: Description of the Corpus and Tools of Analysis
4.1 Introduction
Based on the information given in the previous chapter we embarked on building a
computerised Arabic corpus to use in our linguistic study on lexical collocations and
synonymy in Arabic, taking into consideration the state of the art of Arabic which we will
discuss below. We attempted to meet all the design criteria for corpora compilation in order
that we can conduct a methodical study based on it and to make it available as a resource for
other researchers to use in the future.
be mentioned’, a noun وَرْدٌ ward ‘flower’, a noun وِرْدٌ wird ‘watering place’. For more details
about the difficulties of analysing Arabic computationally see Goweder and Roeck (2001),
Khoja, Garside and Knowles (2001), Van Mol (2002).
9 A few written Arabic texts contain vowels; the most famous one is the Qur’an, with a fully-detailed vowel
system. Then we can find some old Arabic poems and some primary schoolbooks with only those vowels that
mark the cases of words.
to launch an Arabic OCR that can handle Arabic efficiently, even the problem of diacritics.
Using the latest techniques to handle Arabic through OCR, a lot of attention has been given to
rendering Arabic texts, especially religious material, into machine-readable form; many such
texts are now available on the web. However, these texts cannot be considered corpora
because they lack systematicity, representativeness and proper planning. Nevertheless, there
was some work predating the widespread use of personal computers capable of handling
Arabic script at European universities, although this work used transliterated versions of the
Arabic texts. One of the pioneering projects, done on a mainframe computer, was the corpus
of early Arabic poetry assembled by Alan Jones at Oxford University. This corpus was
considered one of the major computerised sources of Arabic literary material before the
personal computer could handle Arabic.
Al-Jabouri and Knowles (1988) compiled a transcribed corpus of Arabic to investigate the
quantitative properties of cohesion in Arabic. This corpus is also transliterated. They noted
some difficulties that they encountered in the process of digitising Arabic that had not
previously been tackled. For instance, identifying orthographic words in Arabic is more
complicated than in English, because many Arabic words can be attached to the following
string of characters, like wa ‘and’ and fa ‘then’ which are always attached in writing to the
following word.
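The difficulty can be made concrete with a small sketch in transliteration. Stripping every initial wa or fa splits real words such as walad ‘boy’, so some kind of lexicon check is needed. The sketch is not the procedure used by Al-Jabouri and Knowles; the three-word lexicon and both function names are hypothetical.

```python
PROCLITICS = ("wa", "fa")                 # 'and', 'then'
LEXICON = {"kataba", "walad", "qaala"}    # toy lexicon: 'he wrote', 'boy', 'he said'

def naive_split(token):
    """Strip any initial wa/fa, whether or not it is really a proclitic."""
    for clitic in PROCLITICS:
        if token.startswith(clitic) and len(token) > len(clitic):
            return [clitic, token[len(clitic):]]
    return [token]

def lexicon_split(token, lexicon=LEXICON):
    """Split off wa/fa only when the remainder is a known word."""
    if token in lexicon:
        return [token]
    for clitic in PROCLITICS:
        rest = token[len(clitic):]
        if token.startswith(clitic) and rest in lexicon:
            return [clitic, rest]
    return [token]

print(naive_split("wakataba"))    # a correct split
print(naive_split("walad"))       # a wrong split of the noun walad
print(lexicon_split("walad"))     # left intact
print(lexicon_split("fawalad"))   # proclitic separated from the noun
```

A real system would need a full lexicon or a morphological analyser in place of the toy set used here.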
Izwaini (2000) attempted to use corpus-based analysis with respect to Arabic but he ended up
using a manual Arabic corpus; it was not manually keyed, but he used the corpus he selected
in a hard copy form. He studied the impact of translation on collocations in Arabic using two
corpora: English and Arabic. The English corpus, which is part of TEC (Translational English
Corpus)10 is electronic, consisting of translated English text from Arabic. The Arabic corpus
which is used for analysis with the naked eye11 consists of translated Arabic text from
Swedish. Later on, after the remarkable development in the field of computational Arabic,
10 This corpus consists of translated works into English; it was first suggested by M. Baker (1995), CTIS,
Manchester University.
11 He analysed the English corpus electronically using the Wordsmith tool, but for the Arabic corpus he could
not find at the time an efficient tool (OCR) to convert the text into an electronic form nor a tool to process it
(a concordancer). So, he read through the novels he had selected with the naked eye to find interesting patterns.
Izwaini in his Ph.D. thesis (in progress) used another corpus covering these three languages
(Arabic, English and Swedish) electronically.
ELRA (European Language Resources Association) provide two Arabic corpora: An-Nahar
newspaper corpus, containing around 140 million words and Al-Hayat newspaper corpus,
containing 18 million words. The former is just a raw corpus whereas the latter has only mark-
up notation, i.e. with more information relating to the original layout of the texts, including
sentence and paragraph boundaries, headings, deletions, and typographic features.
LDC (Linguistic Data Consortium) have also two Arabic corpora: a corpus of Arabic
newswire text, containing 76 million words and a corpus of Egyptian Arabic speech,
consisting of 60 unscripted telephone conversations, lasting between 5 and 30 minutes.
The Sakhr software company in Egypt is a pioneering company using the latest techniques
to fulfil the needs of the Arabic market and the Arabic speaking population in the field of
Arabic processing. It provides a large number of text collections and databases, which have
recently become available on its web site: www.sakhr.com.
Although tagged corpora are now available for many Roman languages, this sort of corpora is
lagging behind in connection with Arabic. There is an Arabic tagged corpus in Lancaster
assembled by Shereen Khoja based on an Arabic morphosyntactic tagset along with an Arabic
part-of-speech tagger in the Computing Department, University of Lancaster (Khoja, et al
2001). But it is manually tagged and is very small; it only consists of 1700 words with the
following tags (Arabic POS (N, V and Particle) plus some syntactic information (sing., masc.,
and definite common noun)). She also has a tagged corpus of 50,000 words of Arabic
newspaper text with the basic tags (N, V, Particle). Not until (2003) was Khoja able to
produce a tagger for Arabic in the fulfilment of her Ph.D. thesis (Khoja 2002). Indeed, the
Arabic language is relatively difficult to tag due to most of the problems raised in section 2.4.
The Institute of Modern Languages of the Catholic University of Leuven started with the
manual annotating of a 4-million-word Arabic corpus. They are still working hard to
elaborate this corpus which will be used in the future as a basis for a semi-automatic tagging
of raw Arabic corpora (Van Mol, 2002).
Apart from Khoja’s corpus which is very small and manually tagged in addition to the
Leuven one, which is still in progress, we do not know, at the time of writing, of any other
tagged corpus except for LDC’s and Sakhr’s. The most recent of these is the one produced by
the Linguistic Data Consortium (LDC) in 2003. They produced an Arabic Treebank: Part 1 v
2.0 consisting of 140,265 words (168,123 tokens after clitic segmentation). This is published
as part one of a 1m. words Modern Standard Arabic corpus. As for Sakhr’s, Sakhr Company
in Egypt often claims that it owns a tagged corpus, but the company said it is for their own
purposes; they did not want to share it even for academic research.
Although these years witnessed a vast stride in development of machine-readable tools that
can handle Arabic, barely can we find a public domain tagged corpus12 or a POS tagger that
can work on Arabic to disambiguate unvoweled written Arabic texts which is a very daunting
task. Almuhanna (2003), for example, had to romanise the Arabic alphabet (transliteration)
following Buckwalter13 in an attempt to tag his Arabic corpus. He followed this process: (1)
compiling a raw corpus, (2) transliteration, (3) segmentation, (4) tagging, (5) re-transliteration
into Arabic. He used the language-independent Brill tagger to automatically tag his
transliterated and segmented text after training it on a training corpus of 100,000 words,
which had already been tagged manually using Freeman’s tagset (2001).
12 The tagged LDC corpus was not personally assessed; in addition we could not find in the literature a proper
description of how much computation was involved in tagging that corpus other than what is mentioned above
(http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T06).
13 http://www.xrce.xerox.com/competencies/content-analysis/arabic/info/translit-chart.html
With Brill’s tagger Almuhanna achieved 93% accuracy on his corpus
of one million words.
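The transliteration and re-transliteration steps of such a pipeline can be sketched as a simple character mapping. The table below is a subset of the Buckwalter scheme covering the bare letters only (no diacritics and no hamza-carrying forms); the function names are hypothetical, the segmentation and tagging steps are left out, and the sketch is not Almuhanna’s own implementation.

```python
# Subset of the Buckwalter transliteration table: Arabic letter -> ASCII symbol.
BUCKWALTER = {
    "ا": "A", "ب": "b", "ت": "t", "ث": "v", "ج": "j", "ح": "H", "خ": "x",
    "د": "d", "ذ": "*", "ر": "r", "ز": "z", "س": "s", "ش": "$", "ص": "S",
    "ض": "D", "ط": "T", "ظ": "Z", "ع": "E", "غ": "g", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
    "ة": "p", "ى": "Y", "ء": "'",
}
# Reverse table for converting the tagged output back into Arabic script.
TO_ARABIC = {latin: arabic for arabic, latin in BUCKWALTER.items()}

def transliterate(text):
    """Arabic script -> ASCII; characters outside the table pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

def retransliterate(text):
    """ASCII -> Arabic script; characters outside the table pass through."""
    return "".join(TO_ARABIC.get(ch, ch) for ch in text)

original = "الولد"                    # al-walad 'the boy'
romanised = transliterate(original)
restored = retransliterate(romanised)
print(romanised, restored == original)
```

Because the mapping is one-to-one, the round trip is lossless for text made up of the letters in the table, which is what allows a tagger that expects Roman characters to be used on Arabic text.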
Khoja’s APT (Arabic Part-of-Speech Tagger)14 skips the above steps needed to tag Arabic
texts and works directly on Arabic script. She used a corpus of 50,000 words to train the
tagger. Her rule-based tagger arrives at word-roots by removing all affixes which are then
used to determine the grammatical position of the attached word. Some words were so
ambiguous that they did not receive any tags. So, she used a probability-based tagger with
which she managed to achieve 90% accuracy after disambiguating ambiguous words.
Nonetheless, there must be a manual tagging for all the lexical items in the training phase
(www.comp.leeds.ac.uk/bshawar/papers/group_paper.doc).
XConcord, developed by Malek Baualem, Mark Leisher and Bill Ogden (1996)
MonoConc, designed by Michael Barlow (1999)
Concordance by R. J. C. Watt (2001).
aConCorde by Andy Robert (early release, 2004)
MonoConc is a concordance program. This Windows tool is very easy to use; it can initiate
concordance searches for words and phrases immediately. MonoConc offers functionality and
flexibility through a variety of configurable options. This program works well for Arabic text
analysis in Arabic windows, but with one major drawback: the concordance output is
presented on the screen backwards. In other words, in the middle we find the KWIC (key-
word-in-context) as normal, then the context that is supposed to come after the key word
appears before it and that which precedes the key word follows it. However, you can save the
concordance output to a text-only file, and when you open it in a text editor (e.g. MS Arabic
Word), the text appears in the right order. Although this is a serious interface problem, the
program generally gives a good result. A sample of the search screen of Monoconc program
and the search results after saving it to a text-only file are given in Appendix 4 and 5.
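To make the display problem concrete, a KWIC line can be thought of as three fields stored in reading order: the context before the key word, the key word itself, and the context after it. The sketch below is a minimal illustration, not MonoConc's actual implementation; the function name `kwic` and the `width` parameter are assumptions made for the example, and placeholder tokens are used instead of Arabic words.

```python
def kwic(tokens, keyword, width=3):
    """Return (preceding context, key word, following context) triples.

    The triples are kept in logical (reading) order. Which side of the
    screen each context appears on is a separate display decision: for a
    right-to-left script the preceding context belongs to the right of
    the key word.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            before = tokens[max(0, i - width):i]
            after = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(before), tok, " ".join(after)))
    return lines


# Placeholder tokens w1..w6 stand in for the words of a text.
tokens = ["w1", "w2", "w3", "NODE", "w4", "w5", "w6"]
for before, node, after in kwic(tokens, "NODE", width=2):
    print(before, "|", node, "|", after)
```

A tool that always places the first field on the left, as is natural for the Roman alphabet, will show an Arabic concordance with the two contexts swapped, which matches the behaviour described above.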
This program was not designed to deal with Arabic orthography; Arabic is not among
the languages it claims to handle. When I used it with Arabic texts, it turned out that
it can deal with them, but with some discrepancies due to the idiosyncrasies of Arabic
mentioned in section 2.4. The fact that Arabic is written with or without vowels15 means
that all the possible forms of a given word have to be searched for separately, which is
laborious. For example, if you search with MonoConc for a vowelled word, it will give you
all the exact occurrences of that word in the corpus but disregard its other possible forms
and even character variations such as alif with/without hamza and dotted/undotted yaa’. This
makes it especially hard to arrive at all the possible forms of the words under examination
or to work out the various conventions of Arabic writing. A big problem that needs to be solved is
15 Vowels are diacritics to be put above or under the consonants. In Modern writing it is left to the intuition of
the reader to guess about them.
lemmatization. For example, الولد al-walad ‘the boy’ and ولد walad ‘boy’ get counted as
separate items unless the user lists all possible forms of this word. Many Arabic connectives
(conjunctions) are attached to the constituents that follow them, so we need to strip off any
affixes and look for the base form, but this program offers no way to define prefixes to ignore
when sorting. Nevertheless, I found this program useful in dealing with Arabic despite the
interface problem, which is not a major impediment. Without that program I would not have
finished my work.
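The kind of pre-processing that would reduce these problems can be sketched as follows. This is a rough illustration rather than a lemmatizer: it removes diacritics, unifies the alif/hamza and yaa’ variants, and strips the definite article with an optional attached conjunction or preposition. The prefix list (`ARTICLE_PREFIXES`) and the minimum stem length are assumptions chosen for the example, and single-letter prefixes are deliberately left alone because they are too easily confused with root letters.

```python
import re
from collections import Counter

# Tanwin, short vowels, shadda and sukun (U+064B to U+0652),
# plus dagger alif (U+0670) and tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0649": "\u064A",  # undotted yaa' -> dotted yaa'
})

# wa+al, fa+al, bi+al and bare al; an illustrative list, not exhaustive.
ARTICLE_PREFIXES = (
    "\u0648\u0627\u0644",
    "\u0641\u0627\u0644",
    "\u0628\u0627\u0644",
    "\u0627\u0644",
)


def normalise(word):
    """Remove diacritics and collapse common character variants."""
    return DIACRITICS.sub("", word).translate(VARIANTS)


def strip_prefix(word, min_stem=3):
    """Strip one listed prefix if at least min_stem letters remain."""
    for prefix in ARTICLE_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            return word[len(prefix):]
    return word


def count_base_forms(tokens):
    """Count tokens after normalisation and prefix stripping."""
    return Counter(strip_prefix(normalise(t)) for t in tokens)


WALAD = "\u0648\u0644\u062F"  # walad 'boy'
tokens = [
    "\u0627\u0644\u0648\u0644\u062F",                    # al-walad
    WALAD,                                               # walad
    "\u0648\u0627\u0644\u0648\u064E\u0644\u064E\u062F",  # wa-al-walad, vowelled
]
print(count_base_forms(tokens)[WALAD])  # 3
```

With such a step in front of the concordancer, the three surface forms in the example are counted as one item. A heuristic of this kind will still mis-handle words whose root happens to begin with the same letters as a prefix, so its output would need checking.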
Concordance is a program for Windows NT 4.0, Windows 2000, Windows 95/98, and
Windows ME which makes wordlists, concordances, and Web Concordances from electronic
texts. It is available online with a 30-day free trial; for further use the user
needs to buy a registration. Watt’s Concordance is not designed to handle Arabic. Text lines
need to be short (no more than 15 words per line), otherwise it is awkward to trace the
full line of the ‘headword’ in the ‘view’ window. To me, the major drawback is that it creates
a large number of temporary files and one huge ‘Concordance data file’. For example, for a
text document of 120 KB, a 1.9 MB concordance data file is created, roughly sixteen times
the size of the source. It is, therefore, essential to have a large storage capacity on the
machine for text files containing several million words.
aConCorde was originally developed for native Arabic concordancing and supports right-to-left
languages. This program, which can be downloaded free
(http://www.comp.leeds.ac.uk/andyr/software/index.html), is written in Java and will run on
any platform that has the Java Runtime Environment installed. However, it was
released early, with shortcomings noted by the designer, such as its inability to cope with
markup notation, its ignoring of punctuation and its limited search options: it accepts only
one item as a search term.
As mentioned above, MonoConc and Concordance are not designed to handle Arabic texts,
but they happen to do so. The major problem facing the user is that these two programs
align Arabic text as if it were written in the Roman alphabet, whereas Arabic is a
right-aligned language, i.e. written right-to-left. It is worth mentioning
that Watt is currently working to add Arabic to the language list of his program
(Concordance).
The reason why I can use MonoConc is that the only text that has diacritics is the Holy
Qur’an, which constitutes 1.8% of the corpus, as discussed in 4.3 below. In addition, it is
compatible with Windows 95, 98 and 2000 and thus assures a reasonable degree of
user-friendliness with its graphical user interface. For quick concordances and word frequency
counts MonoConc is a very useful tool, and it is particularly useful for anyone involved in
Arabic lexicography and basic corpus-based linguistics.
The works I downloaded are mainly books. I also gathered some short poems written by one
poet into a collection and treated them as a single text. The time span of these writings runs
from the advent of Islam up to the end of the eleventh century. As for the question of
copyright, all of these materials, apart from the Holy Qur’an, go back ten centuries or more,
so they are not subject to authors’ copyright. However, I obtained permission for
academic use from the web site designers, in view of the effort they have made in making
these books available on the Internet (see appendix 1).
To investigate lexical collocation in Arabic it is important to create a big corpus to inform our
research. Experience has shown that grammatical patterning can be identified and described
on the basis of a relatively small corpus. Lexical patterning on the other hand requires the use
of very large corpora. Sinclair (1991:100) argues, ‘fairly small corpora, of one million
words or even fewer, are adequate for grammatical purposes, since the frequency of
occurrences of so-called grammatical or function words is quite high.’ It is easy to pinpoint
16 Some of the Classical Arabic texts are originally spoken texts such as the Holy Qur’an which is Allah’s Book
and the Prophet traditions; they were transmitted orally for long time before the early Muslims put them in a
written form. However, I will count them as written texts because they reached us in such a form.
some generalities concerning function words by checking how common a word is. For
example, as Sinclair (ibid) pointed out, the most frequent word in the LOB corpus is the,
with 68,315 occurrences. Although the corpus contains just one million words, we are still
able to investigate function words and other grammatical items,
which are expected to occur frequently in any corpus.
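The point about function words can be checked with a simple frequency list. The sketch below is illustrative only: the helper name `relative_frequencies` is an assumption, and the calculation treats the LOB corpus as containing exactly one million running words, which is its nominal size.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return each word's share of all running words in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Figure quoted in the text: 'the' occurs 68,315 times in LOB.
# Assuming a nominal corpus size of 1,000,000 words:
the_share = 68_315 / 1_000_000
print(f"{the_share:.1%}")  # 6.8%
```

A word that accounts for nearly 7% of all running words will be well attested even in a small corpus, whereas most lexical words occur far less often, which is why lexical patterning calls for much larger corpora.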
With regard to our corpus, a 5-million-word corpus is not large compared to the available
non-Arabic corpora. Nevertheless, it is the biggest Classical Arabic corpus assembled so far,
and I intend to keep maintaining it so that it becomes bigger and much more diverse.
In addition, it is worth noting that the Cobuild dictionary was informed at the very
beginning by observations derived from a 7.3-million-word corpus.
Since the corpus is limited to the early period of Islam, it would be possible to include every
text that exists. Such a corpus would definitely be representative (Biber, 1993), but
compiling it may take a long time. Moreover, it is enough for the purpose of my corpus to
conduct a principled selection rather than a mere accumulation of texts.
Generally speaking, the first possible dichotomy is into fiction and non-fiction. The
proportion of fiction, which is a part of literature, is 11%, drastically less than that of
the non-fictional texts. This is because of the considerable lack of fictional materials in the
period under investigation. It is a well-known fact that the novel and drama have only
recently been introduced to Arabic literature. However, there are a variety of stories and
popular legends written to have a moralistic impact on Muslims. Although the majority of
these narratives concern the leading personalities in Islam, many of them are fictional
(Somekh, 1991: 21).
Therefore a further dichotomy for my corpus is needed which can give a close picture of the
major interests of the early Muslim writers. The corpus can be divided into four genres: belief
and thought, literature, linguistics and science as represented in table 4.1. The genres can
further be divided into subgenres as shown in table 4.2. These two tables are represented in
charts as shown in Appendix 2. Under belief and thought we have five subgenres: the Holy
Qur’an, the Prophetic Tradition, theology, biography and philosophy. Literature also has two
subgenres: poetry and fiction. Linguistics is represented in this corpus as having two
subgenres: proverbs and lexicons. Finally, under science we have geography, mathematics,
physics and medicine. For more detail on the actual texts included in the corpus see
appendix (3).
17 The overall total of CAC is exactly 5 million words; we arrived at that number after deleting a part of the
theology genre, namely part of Tabari’s book on Tafseer (exegesis of the Qur’an), since it is too long to include
in full in the CAC. After that deletion Tabari’s book constitutes about one-sixth of the corpus.
To select parts of the population as an object of research there has to be consensus among
linguists on the authenticity of the selected parts or the selection has to be based on principled
choices. This can ensure some sort of representativeness and this is what I adopted in this
thesis.
As mentioned earlier, there are two ways of sampling: ‘whole text’ or ‘word text fragment’.
There are many corpora nowadays based on the approach of whole texts, like the Cobuild
corpus and the Bank of English. Likewise, I prefer whole texts to be the unit of study, as it is
more convenient for investigating Arabic, where we may come across sentences that extend
over a number of lines.
The corpus in hand includes among other things texts from the main branches of knowledge
introduced by the advent of Islam. By doing this, I tried not to skew the corpus too much in
any direction as ‘the stylistic idiosyncrasies of a particular author can be reduced in
significance if texts by many different authors are included’ (Atkins et al, 1992: 2).
The early Arabs privileged language; they held public fairs for poetry in Mecca, especially at
‘Ukaz, where they used to present valuable prizes for the best poet. Within this specifically
Arab context, the Prophet Muhammad was sent as a Messenger and his major evidence is the
Qur’an.
The Qur’an is inimitable; it is unique in style and unexcelled in beauty. God challenged the
Arabs to produce even a verse (a line) like the Qur’an but they could not. This point is
repeatedly emphasised in the Holy Book itself. Thus the Qur’an says:
If the whole of mankind and the jinn were to gather together to produce the like of this
Qur’an, they could not produce the like thereof, even if they backed each other up. (17:88)
2. Next to the Qur’an, poetry has been regarded as a main and authentic source of pure
language. The selection criteria for Cobuild excluded poetry from the Cobuild collection
because, to them, poetry is unrepresentative of mainstream linguistic behaviour (Renouf,
1984). To me, poetry cannot be ignored when looking into Arabic linguistics, especially
Classical Arabic. Poetry was highly valued in Arabic cultures of the Middle Ages. The
importance of poetry as a source of data for linguistic investigation can be shown in
Sibawayh’s reliance on it as the primary type of textual evidence. In his Kitab he referred to
poetry as evidence 1050 times, to the Qur’an 447 times, to Hadith six times, and to prose 350 times.
Ibn Abbas in his commentary on the Qur’an relied on poetry to explain the meaning of
unclear lexical items in the text of the Qur’an. He said, ‘When you want to learn the meaning
of any weird word in the Qur’an, look for it in poetry.’ Also, we have to bear in mind that the
older the poetry, the more authority it possessed.
3. Hadith (Prophetic Tradition): It is a main source of authentic data. Hadith by definition can
be subsumed under spoken material as it includes all the recorded sayings and actions of the
Prophet Muhammad. It was transmitted orally as was the Holy Qur’an.
To me, the use of Hadith for grammatical investigation in classical Arabic is of great
importance since the Prophet Muhammad is considered one of the most eloquent speakers of
his community because of his early upbringing among Bedouins who were renowned for
their eloquence. Hadith literature also retains a lot of ancient usage. Some scholars compiled
and classified these hadiths in systematic collections; the most authentic of them are Al-
Bukhari and Muslim. These two collections, which I included in my corpus, are usually
referred to by scholars as Al-S{ah{ih{aan, i.e. the two authentic collections.
4. Proverbs and Bedouin sayings: As already mentioned the third authentic source of data
which early Arab grammarians depended on is the Bedouin proverbs. The language of the
Bedouin has changed less than other varieties because they live away from urban
communities where different people of different dialects and languages live together.
The first person to collect the Arabic proverbs was Al-Mufad}d}al ibn Salim (d. 784 AD).
Based on what Al-Mufad}d}al did, Abu Hilal Al-Askary (d. 1004) and Al-Maydani (d.
1124) compiled their collections of proverbs in a more comprehensive way. Al-Maydani’s
Majmac Al-Amthaal (the Collection of Proverbs) contains explanatory notes on poetry.
5. Theology: This type of text flourished very early, as the Muslims, encouraged by caliphs
and motivated by their interest in studying their religion, introduced some sciences related to
the Holy Qur’an and Hadith such as Qur’an exegesis, jurisprudence (Fiqh), dogmatics,
etc. Qur’an exegesis deals with the meaning of the verses, the reasons behind their
revelation, i.e. the historical references, and comments on the syntactic and semantic structure.
One of the most famous works on Tafseer is Al-Tabari’s (d. 922), which is considered ‘the
richest repository in this branch of study containing from verse to verse everything he could
gather from earlier literature’ (Goldziher, 1966: 46). Jurisprudence was also introduced to
explain the Islamic rulings that concern all Muslims in worshipping, daily transactions,
political system and relations with other people. These rulings were derived from the Qur’an
and Hadith.
Another branch of theology was the Foundations of Creed, dogmatics. This science of
studying the basics of belief was introduced as a result of defending the Islamic belief against
heretics and other sects. This gave rise to the rational approach of presenting Islam to non-
Muslims. One of the most important works in this field is Al-Ashcari’s (d. 935) Al-Ibaanah fi
‘Us}uul al-Diyaanah (The Explanation of the Roots of Creed). Al-Ashcari was the first to
formulate the orthodox thinking of creed. His book Al-Ibaanah has influenced most writings
on theology even today.
6. Biography: The first coherent biography of the Prophet was written by Ibn Ishaq (d. 768)
whose Siirat Rasuul Allaah (The life of the Apostle of Allah) was revised and reworked by
Ibn Hisham (d. 833) to make the oldest and most classical work in this field. Ibn Ishaq was
first entrusted by the Caliph Al-Mansur with the task of writing a book for his son Al-Mahdi
on history from the first man on earth up to their own time. Ibn Ishaq’s work was more
comprehensive than Ibn Hisham’s, as the latter covered only the biography of the Prophet.
History then became an independent genre with works like Al-Akhbaar Al-T}iwaal (Long
Narratives) by Abu Hanifa al-Dinawri (d. 895) and Taariikh al-rusul wal-umam wa al-muluuk
(The History of Prophets, Nations and Kings) by Al-Tabari (d. 922), which reflected various
historical and cultural aspects of Islamic life.
7. Philosophy: Arabs started to try speculative methods in order to defend or spread Islam.
Philosophy was just another aspect of religious studies. Al-Farabi, for example, held the
belief that philosophy and Islam are in harmony. One of his most important contributions is
Aara’ Ahl Al-Madiinah Al-Faad}ilah (The Utopia). It is a significant contribution to
sociology and political science. The shift undertaken by Al-Kindi (d. 872) from writing on
philosophy as a religious tool to writing on pure philosophy is considered the beginning of
the separation of philosophy from dogmatics. Therefore he was the first independent writer
on philosophy. Then came Ibn Sina, who is known in the West as Avicenna, to inform this
genre with many works, including a book in logic. He combined Greek philosophy and
Muslim theology.
In addition to lexicography, philology was also mastered by the early Arab linguists.
Fiqh Al-Lughah (The Code of Language) by Al-Thacalibi (d. 1037) was truly a marvellous
compendium of philology.
It is worth mentioning that there are arguments for not including lexicons and
linguistic works in a corpus. In the first place, they may contain citations from other
works which have their own grammar and style. Secondly, including such works in a
corpus could be misleading in the sense that they may use other languages to prove a
universal phenomenon or to investigate these languages themselves (Paul Bennett, personal
communication). This is not the case with the works I included in CAC, since the works of
linguistics and the lexicons I included are written entirely in Arabic without quoting a single
foreign word. In addition, citations from other works are restricted to certain texts:
Qur’an, pre-Islamic poetry and nomad proverbs (cf. 3.2.2).
9. Science: The early Arab scientists paved the way for modern scientific observation in
mathematics, medicine, physics and so on. In medicine, Ibn Sina’s book Al-Qaanuun fi
Al-T}ibb (Canon of Medicine) was considered the first comprehensive encyclopaedia of
medicine. When it was translated into Latin,
it became the textbook for medical education in Europe in the 12th century.
Another field of science in which the West was indebted to the Arabs was mathematics. Arabs
are the inventors of the symbol 0 (zero) and this laid the foundation of positional arithmetic.
The first to write a treatise on arithmetic was Al-Khawarizmi (d. 849).
Physics was also studied by Arab scholars. Al-Biruni’s (d. 1048) contributions in physics
were pervasive during the first part of the last millennium. He was a pioneer in the study of
metals and precious stones. His book Kitaab al-Jamaahir discusses the properties of various
precious stones. In geography Al-Maqdisi (d. 977) studied most of the Islamic world and
wrote his marvellous book: Ah}san al-taqaasiim fi macrifat al-aqaliim (The best Division in
the knowledge of Climes) that made him a pioneering geographer of his time.
10. Fiction: As mentioned earlier, there is a considerable lack of Arabic fictional works;
however, some early Arabic works can be subsumed under narrative prose, such as
Al-Bukhalaa’ (The Misers) by Al-Jahiz and the Arabian Nights. The Thousand and One Nights
(Arabian Nights) (850) was originally written in Persian. It was translated and reworked
completely to leave no Persian traces so as not to contradict Islamic thought during the
Abbasid period. Al-Jahiz (d. 868) contributed to a variety of genres, among which are
philology and artistic prose. His book The Misers is a collection of anecdotes that criticises
the social conditions of his time in a comic way.
4.4 Conclusion
To sum up, the 5-million-word Classical Arabic corpus (CAC) is considered a pioneering
corpus for the following reasons:
1) It is an electronic corpus; this makes investigating Arabic a more accurate and faster
process.
2) It is balanced; it covers a wide scope of written Arabic texts to be used for more than one
purpose.
3) It is a monitor corpus; we will keep on maintaining it by adding more texts and genres.
4) More importantly, this corpus is synchronic: it deals with only one variety of Arabic
within a particular period of time, i.e. early Classical Arabic. This can make studies
based on it more consistent and more methodical.
Chapter Five: Lexical Collocation
5.1 Introduction
As the subject matter of this thesis is to look at synonymy in Arabic contextually through
lexical collocation, it would be sensible to give a brief account of the relationship that holds
between synonymy and collocation. Synonyms in their propositional sense can be substituted
for one another, as will be discussed in detail in chapter six, and collocation can only be
observed through repeated usage (Smadja, McKeown, and Hatzivassiloglou, 1996: 5). Both
involve two different kinds of relations18: synonymy is a paradigmatic relation and
collocation is syntagmatic. We take the position that both types of sense relations,
paradigmatic and syntagmatic, are complementary to each other because words acquire
meaning from both axes. Through collocation we can distinguish one sense of a word from
another and know whether the seemingly synonymous words (for example) are real
synonyms or not. Collocation is, therefore, a device with which a particular sense of a word is
activated.
The relationship that a linguistic element has with other elements inside the sentence is called
syntagmatic. This is mainly a syntactic relation. Let us consider the following example:
The word work in (1) above is syntagmatically related to the definite article the, the
copulative verb is to the adjective interesting, and the noun work to the adjective
interesting. Generally speaking, what is the first word that comes into your mind when you
come across a word like work? There are many possible answers such as is, does, place, etc.
This is called a syntagmatic reply because it provides the phrase or the sentence with a
required syntactic form; it is the next word in the phrase or the sentence. Or the answer could
be words like job or career. This is called a paradigmatic reply because it chooses another
word from a set of semantically related words, not mentioned in the sentence. Finally, if the
18 Any two words can have a relation, but there might be words which are more significant than others. Sense
relations are divided into three classes: paradigmatic, syntagmatic, derivational. The significance of a relation is
discussed by Cruse (2000:145-47).
answer uses the same word but in a different form, this is called a derivational reply.
Collocation offers a clear-cut, practical way of looking at word meaning, rather than one
based on conceptual analysis. Firth (1957) emphasised that the meaning of a word is
determined by its co-occurrence with other words. He called this phenomenon collocation, as
will be illustrated below. Likewise, Sinclair states, ‘meaning can be associated with a distinct
formal patterning’ (1991: 6). Such a trend could be an interpretation of Wittgenstein’s
statement that ‘the meaning of a word is its use in the language’ (1953: 20). With the
possibility of carrying out linguistic contextual analysis of large quantities of electronic texts,
we become more or less able to account for the interaction between meaning and syntactic
structure in an empirical way. Hanks (2000:1) argues that ‘corpus analysis shows that
differences in meaning (metaphorical and literal alike) are associated with different
phraseological and syntactic contexts. A list of phraseological norms derived from corpus
analysis corresponds to a cognitive profile of the word’s meaning.’ Stubbs also noted that the
word meaning could be defined not only by individual words or grammatical structures but
also by collocations (Stubbs 1996: 89). As Fraas (in press, quoted in Stubbs (2001a)) noticed,
collocates provide observable evidence of word meaning.
Collocations can be defined as the co-occurrence of words, as can idioms, compounds and
clichés. Idioms are expressions in which the meaning of the whole cannot be understood from
the meanings of its parts, for instance kick the bucket, break a leg, etc. A cliché is defined as a
‘trite, stereotyped expression; a sentence or phrase, usually expressing a popular or common
thought or idea, that has lost originality, ingenuity, and impact by long overuse’ (the Random
House Dictionary of the English Language, 1967). For example, the early bird catches the
worm, life sucks, and then you die, salt of the earth, etc. Compounds are built up of two or
more free morphemes in a single lexical unit. On the other hand, a collocation is a group of
words that occur together more often than by chance.
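One common way of making ‘more often than by chance’ concrete, offered here only as an illustration and not as the measure adopted in this thesis, is to compare the observed frequency of a word pair with the frequency expected if the two words were distributed independently. The function name `observed_vs_expected` and the toy English text are assumptions made for the example.

```python
from collections import Counter


def observed_vs_expected(tokens, w1, w2):
    """Compare how often w1 is immediately followed by w2 with a chance baseline."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(w1, w2)]
    # Under independence, w2 would follow w1 about
    # count(w1) * count(w2) / N times.
    expected = unigrams[w1] * unigrams[w2] / n
    return observed, expected


tokens = "strong tea and strong coffee but powerful computer and strong tea".split()
observed, expected = observed_vs_expected(tokens, "strong", "tea")
print(observed, round(expected, 2))  # 2 0.55
```

When the observed count is clearly higher than the expected one, the pair is a candidate collocation. On a text this small the figures mean little; the comparison only becomes informative on a large corpus.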
Mitchell uses the term ‘composite element’, under which collocation, idioms and compounds
can be subsumed (1971: 57). Following Mitchell, Cowie (1981: 224) refers to them as
composite units. He then makes a distinction between collocation and idioms in terms of
substitutability of items. He points out that the former permits the substitutability of at least
one item of its constituent elements. The latter, on the other hand, cannot undergo any type of
transformational processes of substitution, transposition, expansion, etc. (ibid: 224). So, the
main distinction between collocation and idiom is that, unlike idioms, the meanings of
collocations can be predicted or deduced from the meanings of their parts. The degree of
fixedness varies from idiom to idiom, as some idioms are more frozen than others. For
example, kick the bucket is more fixed in terms of transformational or substitutional
processes than idioms like spill the beans, which can undergo passivisation as follows: the
beans have been spilt. Specifically speaking, ‘idioms contain frozen parts that do not allow any sort of
substitutions’ (Gross, 1990: 16). Let us consider the following examples:
The above sentences are idiomatic in the sense that we cannot understand the meaning of the
whole by understanding the meaning of its parts. In (2) above we can only change one part of
the sentence (i.e. the subject) for anything equivalent without losing the idiomatic
sense. In (3), both the subject and the object can be swapped or changed. In (4) and (5), only
John and Bill can be changed. In (6), only the tense is free.
Nelson (2000) calls such a phenomenon of word packaging Multi-Word Items. He gives an
interesting brief account of the definitions and types of such multi-word items since 1864.
Below we present some definitions and examples of collocations, as well as
methods for their extraction and classification. Let us first have a look at some of the
definitions of collocation.
2. ‘sequences of lexical items which habitually co-occur, but which are nonetheless fully
transparent in the sense that each lexical constituent is also a semantic constituent’ (Cruse,
1986: 40).
3. ‘Collocation is the occurrence of two or more words within a short space of each other’
(Sinclair, 1991: 170).
5. ‘A sequence of words that occurs more than once in identical form.... and which is
grammatically well structured’ (Kjellmer, 1987:133).
than five words’ (Smadja, 1993:151).
10. ‘a sequence of two or more consecutive words, that has characteristics of a syntactic and
semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived
directly from the meaning or connotation of its components’ (Choueka, 1988 quoted in
Manning 1999: 172).
These definitions seem to have three main characteristics: the co-occurrence of at least two
words, the frequency of this co-occurrence and the fact that the whole chunk should occur
within a given span of words. However, these definitions raise three problems. First, they
do not mention how frequent a given combination must be or whether a single occurrence in
a corpus should be eliminated or not (see 5.8.4 for more detail). Secondly, Choueka’s
definition deals only with collocation of adjacent words, which, as far as I know, is contrary
to all linguistic definitions of collocation. Thirdly, apart from Kjellmer, Cowie, Choueka,
and Smadja, there is no syntactic condition given. For Kjellmer and Cowie, the grammatical
structure must be considered; the other two, Choueka and Smadja, put it more specifically
within the boundaries of the sentence. To me, collocation in its general sense, without any
sort of syntactic restrictions, is more in conformity with the commonly asserted claim that
all words and expressions, regardless of
their syntactic position, are restricted in their distribution (van der Wouden 1997: 45). This
also enables us to investigate interrupted phrases of interesting distribution which we might
not be able to account for with a restricted definition of collocation. To pursue the premise I
shall, after addressing the types of collocations, discuss in more detail the questions of spans
and frequency to arrive at a definition which could be closer to the purpose of the present
study.
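The three characteristics identified above (co-occurrence, frequency and span) can be sketched as a simple procedure that counts every word occurring within a fixed window around a node word. The values used below for the span and the frequency threshold (`span`, `min_freq`) are placeholders chosen for the illustration, not the values argued for in this chapter, and no syntactic restriction is applied.

```python
from collections import Counter


def collocates(tokens, node, span=4, min_freq=2):
    """Count words occurring within `span` words either side of `node`.

    Only collocates seen at least `min_freq` times are returned, so
    min_freq=2 eliminates single occurrences.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return {word: c for word, c in counts.items() if c >= min_freq}


tokens = "the boy saw the dog and the boy fed the dog".split()
print(collocates(tokens, "boy", span=2, min_freq=2))  # {'the': 4}
```

Because the window ignores adjacency and syntax, interrupted combinations are captured as well as contiguous ones. The cost is that very frequent function words such as the dominate the counts unless they are filtered or weighted.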
The statement of meaning at the grammatical level is in terms of word
and sentence classes or of similar categories and of the inter-relation
of those categories in colligation. Grammatical relations should not be
regarded as relations between words as such – between ‘watched’ and
‘him’ in ‘I watched him’ – but between a personal pronoun, first
person singular nominative, the past tense of a transitive verb and the
third person singular in the oblique or objective form.
Firth (1957: 13)
Halliday makes a distinction between collocational and grammatical levels or lexis and
grammar but he noted that they are still interrelated. He used the term lexical as a substitute
for collocational (1966: 152). He argues that ‘collocation is outside grammar: it has no
connection with the classes of the words. It is the lexical item, without reference to grammar,
that enters into collocation’ (1966: 20). For example, open in open the window, an open
window, or the opening of the window collocates with window in the same way irrespective of
its grammatical position.
To differentiate between grammatical and lexical levels, Halliday et al (1964: 32-33) note that
where there is a choice between different classes of language items at a place in structure we
have the grammatical level. For example, when we choose this, which is singular and
which is not that, or choose among items like who, whose, what and which, we can account
for the differences between such items grammatically. Other language items, on the other hand,
cannot be described in this way, as grammar cannot fully distinguish between items like table
and chair. Such items do not belong to grammar but to lexis.
Following Halliday, both McIntosh and Sinclair view grammar and lexis as separate.
McIntosh considers the distinction between grammar and lexis necessary ‘if the patternings
are to be economically stated or defined’ (1966: 183). He also states,
We can only preserve the simplicity of our grammatical description if
we are prepared from the start to let it be understood that there are
lexical factors, factors of collocational eligibility, which tend to rule
out of actual use a large number of ‘sentences’ (and smaller units)
even though they seem to conform to all the rules of grammatical
pattern.
(McIntosh and Halliday, 1966: 183-84)
Sinclair (1987b: 322) emphasises that lexical collocations and grammatical collocations are
just tendencies and choices. More recently he has defined colligation as ‘the co-occurrence of
words with grammatical choices’ (2000: 10).
Mitchell (1971: 53) argues in favour of Firth’s approach but with additional implications. To
Mitchell, collocation is different from colligation in the sense that the former uses words and
the latter uses word-classes. Therefore, colligation can be defined as a class of collocations,
for example, (‘motive’ verb + ‘directional’ particle). So the relationship that holds between
collocation and colligation is just a matter of generality.
More recently, Hoey (1998: 8) has reintroduced the term colligation. He rejects the Halliday
and Sinclair approach and views colligation as necessary to account for the assemblages that
prefer to appear in a certain structure. He notes that lexical items tend to co-occur with other
lexical items in a certain grammatical position. For example, the different senses of the word
reason (‘cause’, ‘rational faculty’ or ‘logic’) can be accounted for grammatically.
Counting the frequency of occurrence of each sense shows that reason in its sense of
‘cause’ occurs more frequently with the demonstrative deictics (like this, that, which(ever),
what(ever)) than with the possessive ones (like my, your, John’s, whose) (ibid). He
therefore defines colligation as ‘the grammatical company a word keeps’ (ibid), or more
specifically, the grammatical item or class that a word tends to co-occur with.
example, the search term back is less frequent than words like at, from, on, he, him, get. On
the other hand, downward collocation is when the collocates around the search term (node)
are less frequently used than the node itself, like back with arrive, bring, climbed, come.
Some authors, such as Emery (1988), Smadja (1993) and Lewis (1993), make another
distinction, whereby several types of collocation can be identified according to the degree of
collocational strength and currency.
Smadja (1993) identifies three types of collocations: 1) rigid noun phrases, 2) predicative
relations, and 3) phrasal templates. The first is the most fixed type of collocation; it is a
sequence of words that cannot be broken or interrupted without losing the meaning of the
phrase, such as stock market and foreign exchange. The second, which is the most flexible
yet the hardest to identify, is made of ‘two words repeatedly used together as a similar
syntactic relation’ (ibid.: 148), such as to make a decision, which can be realised in several
ways: made an important decision, decisions to be made, etc. The third type is characterised
as long and domain-driven collocations. ‘Phrasal templates consist of idiomatic phrases
containing one, several, or no empty slots’ (ibid.: 149), such as the often repeated sentence in
weather reports: temperatures indicate previous day’s high and overnight low to 8 a.m.
Emery (1988), following Cowie (1981), sees collocations as a scale at the end of which lie
idioms. He observes three types of collocation: open (free word combination), restricted,
and bound (rigid). The last is considered ‘a bridge category between
collocations and idioms’ (Cowie 1983: 228). Open collocations contain elements which can
be used with different words without a big difference, such as (bada’at/intahat
alh{arb/almacraka ‘the war/the battle began/ended’) (Emery 1988). Emery does not count
such combinations as collocations because they are unrestricted by usage (ibid.: 27).
Combinations of words that select each other not only in terms of semantics (as in open
collocation) but also by usage are called restricted collocations (Aisenstadt, 1979 quoted in
Emery, 1988). Cowie (1983: xiii quoted in Emery: ibid) sets a condition for such restricted
collocations: one item of the combination should have a figurative sense. Compromising
between Cowie’s and Aisenstadt’s views, Emery says, ‘in a restricted collocation, one (but
not more) of the elements may be either literal or figurative’ (ibid: 27). For example, the verb
in explode a myth/a belief has a figurative sense whereas that in clench one’s teeth is literal.
Finally, in bound collocation ‘one of the elements is uniquely selective of the other’ (ibid.:
29), such as foot the bill.
In this study I will take the view that the phenomenon of collocation should be understood as
a cline along which we may locate different degrees of collocation, from fixed
(rigid/bound) through semi-fixed (restricted) to free (flexible).
In fixed collocations the replacement of individual words is not allowed, whereas
collocations in which individual words may be replaced by certain other words are called
free collocations. For example, in the free collocation أمر القاضي amara alqaad}i ‘the judge
commanded’,19 أمر ‘amara’ can be replaced by certain other verbs such as حكم h}akama
‘sentenced’ and قضى qad}a ‘made a judgement’. All three possibilities are collocations and
have the same meaning. But in the fixed collocation تربت يداك taribat yadaak ‘may your
hands become dusty’20 there is no alternative to the noun يداك yadaak ‘hands’. The sounds of
animals and birds, in Arabic or in English, can be subsumed under fixed collocations. For
example, the words for the sounds made by dogs and donkeys, نباح nubaah} ‘barking’ and
نهيق nahiiq ‘braying’, are strongly bound to such contexts; there is no other word to describe
the dog’s or the donkey’s sound in normal language use, i.e. in non-metaphorical expressions.
Free collocations, on the other hand, are words that are most likely to co-occur in
infinitely creative ways (Lewis 1993). The third type is restricted collocation, which
constitutes the majority of Arabic collocations and falls halfway between fixed and free
collocation. It is a combination of two or more words which attract one another syntactically,
semantically and by usage, for example sakaraatu al-mawt ‘death throes’, al-dunya wa
al-‘aakhirah ‘this world and the hereafter’, al-ghiiybah wa al-namiimah ‘backbiting’, etc.
This discussion of collocation is evidently semantically based, which gives intuition a free
hand in identifying collocations. In section 5.8 we will embark on a more methodical way of
identifying collocation using corpus-based methodology.
Unlike in Arabic, strong (fixed) collocations in English are relatively few (Lewis & Hill 1998,
quoted in Nelson 2001). This is because Classical Arabic has a very rich and varied
vocabulary with highly specific meanings. It is also remarkable for its abundance of near
synonyms. Where some languages have a single word to describe one thing, Arabic may have
hundreds. For example, there are over 500 words for ‘lion’ and 200 for ‘snake’, each with a
specific connotation (Ibn Faris, al-s}aahibi, p. 21). In his investigation of Arabic
collocation, Hoogland (1993) concentrated on restricted collocation because, as he argued, it
constitutes a large and unpredictable category.
Such abundance in vocabulary is a treasure trove that lets words select particular words
without repetition. Any researcher on Arabic collocations can therefore expect to be
swamped by a huge number of collocations, ranging from free to fixed.
5.5 Spans
Jones and Sinclair (1974: 21) use the term ‘span’ to refer to the number of lexical items on
each side of the word under investigation (the node). They prefer a span of
four words to the right of the node and four to the left. Later, Sinclair (1991) proposes a short
span of no more than five on each side of the search term. Others, like Martin et al (1983
quoted in Kenny 1999: 70), think that five words to the left and five to the right are enough.
More practically, Berry-Rogghe (1970) carried out experiments to arrive at an optimal span
on a corpus consisting of three works: A Christmas Carol by Charles Dickens, Each in
his own wilderness by Doris Lessing and Everything in the Garden by Giles Cooper.
With a span of three words, the collocates of the word house included words
like: sold, decorate, this, empty, buying, painting, opposite, loves, outside, full, my. When
the span was increased to six words, irrelevant words such as Bernard, God, etc. found their
way in as standard collocates. The conclusion, reached intuitively, was that four words is the
optimal span, as it is long enough to produce an optimal number of relevant counts.
Limiting the span to a number of orthographic words, as proposed by Sinclair, does not
always work in Arabic. Because Arabic differs from English in its grammatical structure,
the question of span needs careful treatment. As mentioned in Chapter Three, the range of
the Arabic sentence can be somewhat greater than in English. For instance, a verb may be
separated from its subject or complement by a distance extending over a number of lines,
provided the verb contains a referential pronoun, irrespective of how many words intervene
between them.
We can fix the span at two or more according to the mobility of our linguistic items. For
example, prepositions in Arabic tend to precede their objects without any interruption,
so a span of two words on each side of the search term would be enough.
Moreover, some items may be modified at a distance in the text, which might extend beyond
the concordance line. Even so, there may still be plenty of occasions where two such items
appear close enough to each other. However, I would like to utilise the corpus results as
fully as I can, because my corpus is not very big and I do not want to miss the co-occurrences
of particular words because of the distance between them. At the same time, I cannot examine
all the occurrences that extend beyond a concordance line. Therefore, a flexible span, ranging
from two to seven according to the particular item under investigation, would be more realistic.
We can decide the size of the span according to the grammatical position of the category to be
examined. For example, to examine idiomatic verbs, we can simply search for their immediate
constituents to study what particles can follow such verbs, so a span of five will be
sufficient for such a study. In some cases we might extend the span to include as many items
as we can from the concordance line, as in studying nouns, transitive verbs, etc., which are
more likely to have relations across the text. It all depends on a first reading of the
concordance lines before embarking on any analysis.
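The effect of span size discussed in this section can be illustrated with a small sketch. It is
only an illustration under stated assumptions: the function name collocates, its left and right
parameters and the toy English sentence are hypothetical and are not taken from the tools or
the corpus used in this study. It simply counts every orthographic word that falls within a
chosen number of positions on either side of the node.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count the words found within `left`/`right` positions of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)  # do not run past the start of the text
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# Hypothetical toy text; a real study would pass in the tokenised corpus.
tokens = "the old house was sold and the empty house was full of dust".split()
narrow = collocates(tokens, "house", left=2, right=2)
wide = collocates(tokens, "house", left=4, right=4)
```

Comparing narrow and wide shows how a larger span admits more distant words (here of and
dust) alongside the nearer ones. A flexible span of the kind proposed above would amount to
choosing left and right separately for each item under investigation.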
5.6 Semantic Prosody
Just as words collocate with a particular grammatical class (colligation), they also
collocate with a semantic class of words; this is called semantic
prosody. Louw (1993: 157) defines semantic prosody as a ‘consistent aura of meaning with
which a form is imbued by its collocates.’ It is Louw who gave this phenomenon its name,
although the idea of semantic prosody had been known for a long time before he coined the
term. Sinclair
(1987b: 322) noted that ‘many uses of words and phrases show a tendency to occur in a
certain semantic environment.’ He further argued that words seem to co-occur in a certain
semantic profile, either with positive or negative connotations. For example, the verb happen
collocates with unpleasant things such as accidents etc. (Sinclair 1991:112) and the phrasal
verb set in occurs primarily with words which refer to unpleasant states of affairs, such as rot,
decay, malaise, despair, ill-will and decadence (1991: 70ff). Hence, we can conclude that the
study of semantic prosody is more or less a useful way for employing pragmatic information
in the collocational analysis. McIntosh (1966), as mentioned above, proposes that items like
chair, seat, and sofa are all likely to occur, or collocate with, the items sit and comfortable
and so they are all members of the same class which share the same probability of
occurrence, i.e. which have the same range of collocations.
The collocational range is defined as the whole set of collocates of a single node in a
particular text or corpus, i.e. all the collocates that a given search-term has across that
text. On the one hand, one or more of the collocates in this range can be used as
a semantic category label for the others (comfortable, for example). On the other hand,
semantic prosody is the phenomenon for which a common semantic feature among the
collocates provides evidence.
The notion of semantic prosody, later termed discourse prosody (Stubbs, 2001a), was further
developed by Stubbs (1995a), who highlighted a similar tendency towards negative or
positive semantic prosody of collocates. He also noted that collocation can be simply defined
as the semantic feature which stretches over several units (2001b), describing the
phenomenon as the connotations that words have when they occur together (1996: 172).
In Arabic, I studied the lemmas سنه sanah and عام caam by looking at their occurrences in
CAC. It turned out that سنه sanah ‘year’ and عام caam ‘a year’, which are widely regarded as
synonyms, are used in different contexts. The corpus provides many unpleasant examples for
sanah and only one pleasant example, as in (a) below:
In (a) above the examples show the most frequently recurring left collocates for the word
sanah. These collocations can be summarised as follows:
• To refer to a bad experience that happened during this year, like a drought, a plague
or common crisis.
On the other hand there is a considerable shortage of negative examples for caam. For the
positive collocates the corpus shows the following examples as in (b) below:
The corpus-based analysis shows how each word has its own preferred collocates and a
relatively different distribution. There are some neutral collocates, which seem negative21,
shared by sanah and caam. Such collocates are considered neutral as they refer to a certain
historical incident, such as عام الطاعون caam al-t}acuun ‘the year of Plague’ and عام الحزن
caam al-h{uzn ‘the year of sadness’. Such incidents became milestones in Muslim history, so
they designate a period of time and do not carry any positive or negative sense. The only
really negative word that collocates with caam is drought, which is also shared by sanah.
One clear piece of evidence does come from the Qur’anic verse that states,
(7) وَلَقَدْ أَرْسَلْنَا نُوحًا إِلَىٰ قَوْمِهِ فَلَبِثَ فِيهِمْ أَلْفَ سَنَةٍ إِلَّا خَمْسِينَ عَامًا فَأَخَذَهُمُ الطُّوفَانُ وَهُمْ ظَالِمُونَ
And indeed We sent Nuh (Noah) to his people, and he stayed among
them a thousand (sanah) years less fifty caam (years) [inviting them to
believe in the Oneness of Allah (Monotheism), and discard the false
gods and other deities], and the Deluge overtook them while they
were Z{aalimuun (wrong-doers, polytheists, disbelievers, etc.).
(Qur’an: 29: 14)
where sanah and caam are used together to refer to different stages of the life of the Prophet
Noah22, who suffered a lot in calling his people to belief in God until God destroyed them by
flood. Hence, the word sanah is used with reference to the first stage of his life, which was
full of hardships, and caam is used for the rest of his life. In Modern Standard Arabic the
frequent use of caam on happy occasions is quite evident. Egyptians are very likely to say
when congratulating one another on a new year: ‘caam saciid’ (happy new year), but much
less likely to say: ‘sanah saciidah’. More interestingly, sanah can encapsulate the meaning
of its collocate, as we can drop that collocate and use sanah alone to give the same meaning.
Consider the following example in (8) below:
In (8) sanah is used to describe a hardship that befell people, which could be
infertility of the soil, drought or famine. This use found its way into the Arabic lexicons as
an equivalent to infertility of the soil, drought or famine (cf. Lisaan Al-Arab and Al-Muheet).
22 According to the Muslim literature, the Prophet Noah lived among his people for 950 years working hard to
guide them to Allah. Unfortunately, they did not believe, so Allah destroyed them by flood and saved Noah and
the believers. So Noah lived a tiring life, for 950 years, before the flood. Then, Noah with the believers
repopulated the earth in peace and serenity.
5.7 Extraction of Collocation
Collocations can be identified intuitively, semantically, lexically or quantitatively. McIntosh
(1966: 194) says that our experience of the meanings that a given word has in a certain
context sheds light on what words it collocates with and what range of collocations they have.
For example, the lexical items: chair, seat, and sofa are all likely to occur, or collocate with,
the items sit and comfortable and so they are all members of the same class which have the
same range of collocations. This is due to our experience with such items in a variety of
contexts. Firth views such a phenomenon as a relation of mutual expectancy and as an
inseparable part of the native speaker’s knowledge of his own language, i.e. competence
(Emery 1986). However, such an approach cannot establish what is more frequent or typical
in language use, whereas with corpora we can discover interesting aspects of our language
which could not be arrived at by introspection. In addition, by using advanced technology in
the field of corpus linguistics we can assess the problem more accurately and quickly. I gave
a detailed account of the credibility of intuition vs. empiricism in Chapter Three.
To assess a given collocation, we can resort to semantics. Cruse (1986) makes a distinction
between two types of semantic co-occurrence restrictions: (1) selectional restrictions which
can be defined as ‘semantic co-occurrence restrictions which are logically necessary’ (p.
278), (2) collocational restriction, which is defined as ‘co-occurrence restrictions that are
irrelevant to truth conditions’ (p. 279). For example, the verb die in John died, the tree
leaves died and *the book died needs to be preceded by a (+animate) grammatical subject;
this is called selectional restriction. Further semantic requirements are needed in sentences
like John kicked the bucket, *the cow kicked the bucket and *the tree kicked the bucket. The
lexical item kick the bucket requires in addition to the (+animate) feature another restriction,
which is (+human). Restrictions of this type are called collocational restrictions. In short, the
semantic approach tries to define collocations by the actual meanings they have and by the
usefulness of combinations of words in different contexts.
The lexical approach23 concentrates on the language as a complete unit; it does not make a
distinction between grammar and vocabulary. This approach differs from the semantic one in
that the latter tends to account for all the relations that hold among lexical occurrences ‘in a
semantically motivated way’ (as in Cruse’s collocational restriction) (Emery, 1988: ch.1.2.3).
The lexical approach, on the other hand, looks at collocation as a matter of a
combinatorial process, without giving any explanation. It does not explain why a given lexical
item collocates with another lexical item (Lehrer, 1974: 176). Therefore, in this approach we
can easily make use of computer analysis of large corpora to focus on high frequency
language and to highlight typical patterns of language use.
23 The lexical approach not only deals with individual words, as might be understood, but also
with larger units, i.e. the word combinations that we store in our minds.
However, Lehrer (1974: 173) criticised both approaches: the lexical approach does not give
an explanation for the co-occurrence of lexical items whereas the semantic approach cannot
account for the combinations that are arbitrarily restricted. Therefore, she argued for an
eclectic view that combines aspects from both approaches.
It is not our goal to discuss in detail the various methods of extracting collocations, i.e.
intuitively, semantically or lexically. I am more concerned with applying the most
commonly used methodology, statistics, in extracting collocations from Arabic corpora. This
approach tries to define collocations by the frequency of certain word combinations in a text.
In Chapter Three we talked about the concordance as a means of processing corpora. The
available concordancing programs can perform many tasks: frequency lists, word
associations, etc. (cf. Barnbrook 1996 for more details). However, human intervention is
needed to run, edit and analyse such concordances. Concordances can only help us find the
words under examination in their environments, as shown in figure (5.2) below.
5.7.1.1 Lemmatisation
When examining a word, it is often useful to consider the different forms of the word
together. In doing so, I faced some problems when searching for words as base-forms
(lemmas). In English we can, to a great extent, search for a word irrespective of
grammatical changes such as tense or plurality by means of the wild card search, which can
provide all possible forms of a given word. For example, if you search for the lemma ‘play’
using a wild card, the output will include words like ‘plays’, ‘played’, ‘playing’ and so on.
Using wild cards with Arabic to get all related word classes (verbs, nouns, adverbs, etc.), on
the other hand, yields output that needs exhaustive hand editing before any assessment can
proceed. It would be difficult to search for lemmas without some sort of human
intervention, such as editing our automatic counts. This is because Arabic is an inflected
(synthetic) language in which affixes have a different function from those of non-synthetic
languages like English. In Arabic a lemma is actually the stem of a set of forms (hundreds or
thousands of forms in each set) that share the same morphological, syntactic or semantic
features (Dichy, 2001 and Kamir, 2002). For example, if we search for the word سنة sanah
‘a year’ we will have many forms such as سنوات sanawaat and سنين siniin (fem. & masc. pl.
‘years’), سني saniy (pl. ‘years’ in genitive case), السنة al-sanah ‘the year’, سنته sanatahu
‘his year’, سنتها sanataha ‘her year’, سنينه siniinahu ‘his years’, سنينها siniinaha ‘her years’,
سنينهم siniinahum (masc. ‘their years’), سنينهن siniinahun (fem. ‘their years’), سنتان sanataan
(dual in nominative case ‘two years’), سنتين sanatayn (dual in accusative and genitive case
‘two years’). Although this seems a very simple search, at the time of writing this thesis we
could not find a program which combines the features of a concordancer with those of an
Arabic stemmer.
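The contrast between a wild card search and a search on the forms of a lemma can be
sketched as follows. This is a minimal illustration, not the procedure used in this study: the
function names wildcard_hits and lemma_hits are hypothetical, the form list is simply the set
of forms of sanah listed above, and the two extra tokens (حسن and أسنان) are assumed
examples of unrelated words that happen to contain the same letter sequence.

```python
import re

# Forms of the lemma sanah as listed in the text (hand-compiled, not generated).
SANAH_FORMS = {
    "سنة", "سنوات", "سنين", "سني", "السنة", "سنته", "سنتها",
    "سنينه", "سنينها", "سنينهم", "سنينهن", "سنتان", "سنتين",
}


def wildcard_hits(tokens, letters):
    """Emulate a wild card search: any token containing the letter sequence."""
    pattern = re.compile(r"\w*" + re.escape(letters) + r"\w*")
    return [t for t in tokens if pattern.fullmatch(t)]


def lemma_hits(tokens, forms):
    """Keep only tokens that are known forms of the lemma."""
    return [t for t in tokens if t in forms]


# Hypothetical token sample: three forms of sanah and two unrelated words.
tokens = ["السنة", "سنوات", "حسن", "سنتين", "أسنان"]
by_wildcard = wildcard_hits(tokens, "سن")
by_lemma = lemma_hits(tokens, SANAH_FORMS)
```

Here the wild card search returns all five tokens, whereas the form list returns only the three
that belong to sanah. The price of the second approach is that the form list itself has to be
compiled by hand or supplied by a stemmer.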
for the company’s research purposes in 1997. Only in 2002 were
they able to produce an improved commercial version for
teaching purposes and as a component in larger natural language processing
analysis: (1) roots and patterns; (2) affixes, enclitics and function words which are normally
attached to words as prefixes. The program uses a very limited Arabic dictionary of 4930
roots. It can deal with all words with or without diacritics. However, it analyses words
separately from their contexts, which might produce some ambiguous forms. For example, if
you search for a word like ktb, it shows all the meanings of the root ktb without a specific
reference to the word in context
(http://www.xrce.xerox.com/competencies/content-analysis/arabic/).
The wild card search is sometimes useful with Arabic, namely when the search-term is not
polysemous or the base word has limited potential for word building. For example, using a
wild card search with z{anna is not problematic, first because this word is not polysemous,
and secondly because there are only a few irrelevant instances that contain the same letters
but do not belong to the search-term, such as lafaz{ani, hafaz{ani, haz{una, ayqaz{ani.
In these examples the n is not a root letter; it is rather a suffix added for a morphological
reason. Otherwise, most Arabic words need extensive hand editing, because the absence of
vowels in Arabic writing makes such forms look identical. This makes the process of singling
out the search-term quite complicated and sometimes ambiguous.
Biber et al. (1998: 91) propose a statistical way of editing such data26. Their procedure is
meant to remedy tagged corpora, where we may meet irrelevant grammatical categories
(i.e., ones which are not under examination). However, I think that, with slight refinement,
this method is also useful for work on raw corpora, to exclude counts which are
morphologically inaccurate27. The steps, which involve a lot of hand-editing based on
intuition, are as follows:
1. select a random sample of the counts of the word under investigation,
2. edit it by hand,
3. compute the proportion of irrelevant counts in the sample,
4. multiply the total number of your counts in the corpus by the proportion computed in
step 3.
To guarantee the accuracy of this procedure Biber et al. (ibid: 91) suggest ‘more than one
random sample should be taken from each category in order to make sure that the proportions
are similar across samples’.
Let us consider the following example. In CAC there are 10447 instances of the base-form
ktb. Let us select a random sample28 of 2402 hits, including relevant and irrelevant ones.
Having edited the sample, I found that 290 hits are irrelevant; some of these are derived from
the same root while others do not belong to the root (see table 5.1 below). The proportion of
irrelevant examples can be represented as follows.
on the same root
Irrelevant Counts from Other Roots    Yaktasib    212    9%
Table (5.1): This table shows how misleading it can be to search on an Arabic raw corpus
without hand-editing.
We can notice that the irrelevant counts can be calculated proportionally. Using the above
statistical methodology in editing our data will save time and effort in hand editing. In other
words, instead of going through the whole corpus to eliminate the irrelevant forms of a given
search-term, we can rather select a sample, edit it manually and run a proportional calculation
as shown above.
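The proportional calculation can be sketched with the figures quoted above for ktb (10447
hits in the corpus, a sample of 2402, of which 290 were judged irrelevant). The function name
estimate_irrelevant is hypothetical and the sketch is only an arithmetic illustration of the
four steps, not a reproduction of any program used in this study.

```python
def estimate_irrelevant(total_hits, sample_size, irrelevant_in_sample):
    """Return (proportion of irrelevant hits in the sample, estimated irrelevant hits overall)."""
    proportion = irrelevant_in_sample / sample_size  # step 3
    return proportion, proportion * total_hits       # step 4


# Figures for the base-form ktb as reported in the text.
proportion, irrelevant = estimate_irrelevant(10447, 2402, 290)
relevant = 10447 - irrelevant
```

With these figures the proportion comes to roughly 12 per cent, which corresponds to about
1261 irrelevant hits in the whole corpus and leaves about 9186 relevant ones. Repeating the
calculation on a second random sample, as Biber et al. recommend, would show whether the
proportion is stable.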
5.7.1.2 Concordances
With KWIC (key word in context) we can search the whole corpus in a way that saves
time and effort, instead of looking up each occurrence of the word under investigation
individually. Figure 5.2 below lists occurrences of the word ktb retrieved using a wild card
search.
د بن داود الدينوري رحمه الله وجدت فيما [[كتب]] أهل العلم بالخبار الولى أن آدم.....1
.… عليه
... .2من عاد كما قد قصه الله تبارك وتعالى في [[كتابه]] وهو أصدق الحديث .قال :ونشأ
في ذلك الدهر ...
... .3لم يرعووا فأهلكهم الله عز وجل كما نص في [[كتابه]] وهو أصدق الحديث .ويقال :إنه
كان بين مه ...
... .4إياه بتكليمه ورسالته ما قد قصه علينا في [[كتابه]] وانصرف إلى شعيب ورد أهله إليه
ومضى حتى ...
... .5عيب إلى قومه فكان منهم ما حكاه الله في [[كتابه]] .أبرهة قالوا :ثم ملك أرض
اليمن أبرهة ب ...
... .6هي بلقيس ما قد قصه الله تبارك وتعالى في [[كتابه]] إلى أن تزوجها وبنى بأرض
اليمن ثلثة حص ...
... .7خزائن من خزائنه وإن عبد الملك بن مروان [[كتب]] إلى عامله في بلد المغرب
موسى بن نصير ... -
... .8بن دارا تجبر واستكبر وطغى .وكانت نسخة [[كتبه]] إلى عماله :من دارا بن دارا
المضيء لهل ...
... .9وغربها ليعامل الناس على قدر فلما انتهى [[كتابه]] إلى دارا بن دارا غضب من ذلك
غضباً شديدا ...
... .10يزل يؤديها إلينا أيان حياته فإذا أتاك [[كتابي]] هذا فل أعلمن ما بطأت بها فأذيقك
وبال أ ...
... .11عذرك والسلم .دارا والسكندر فلما ورد [[كتابه]] على السكندر جمع إليه جنوده
وخرج متوجه ...
... .12د فعلت .ثم أمر بهما فرجما حتى ماتا .ثم [[كتب]] إلى أم دارا وامرأته بالتعزية وهما
بمدي ...
... .13لك سرت فكتبت إليه :إن الذي حملك على ما [[كتبت]] به فرط بغيك وعجبك بنفسك
فإذا شئت أن تسي ...
... .14ما ذقت من غيري والسلم .فلما رجع جواب [[كتابه]] أرسل إليها بملك مصر وكان
في طاعته ليدعو ...
... .15قصته وبنائه الردم ما قد أخبر الله به في [[كتابه]] فسألهم عن أجناس تلك المم
فقالوا :نحن ...
... .16ى بئر الملك فكان من قصته ما هو مشهور قد [[كتبناه]] في غير هذا الموضع .قالوا:
ولما ابتعث ا ...
... .17إليه أردشير بالدخول في طاعته فلما أتاه [[كتابه]] امتل غيظاً وقال لرسله :لقد
ارتقى ابن س ...
... .18ابي وفيرك الذي تدعى مرتبته مهران وجودرز [[كاتب]] الجند وجشنساذربيش كاتب
الخراج وفناخسرو
... .19بته مهران وجودرز كاتب الجند وجشنساذربيش [[كاتب]] الخراج وفناخسرو صاحب
صدقات الملكة
... .20نصارى الهواز يقال له يزدفنا .وأن قيصر [[كتب]] إلى كسرى يسأله الصلح ورد ما
احتوى عليه ...
... .21الموادعة فأجابه قيصر إلى ذلك فانصرف ثم [[كتب]] إلى عماله بأرمينية وأذربيجان
فاجتمعوا و ...
... .22أذن لعظماء أصحابه فدخلوا عليه ثم أقرأهم [[كتاب]] الملك إليه فلما سمع أصحابه
ذلك يئسوا م ...
... .23ر بمدينة همذان ارتاب بابن عمه ذلك وكتب [[كتاباً ]] إلى الملك يعلمه :أنه قد رده
إليه ليأمر ...
... .24لى محبسه فإنه فاجر فتاك وقال له :إني قد [[كتبت]] إلى الملك كتاباً في بعض
المور فأغذ ال ...
... .25اجر فتاك وقال له :إني قد كتبت إلى الملك [[كتاباً ]] في بعض المور فأغذ السير به
حتى تدفعه ...
... .26منه الساعة حين أخبرت بإدمانه النظر في [[كتاب]] كليلة ودمنة لن كتاب كليلة
ودمنة يفتح ...
... .27ت بإدمانه النظر في كتاب كليلة ودمنة لن [[كتاب]] كليلة ودمنة يفتح للمرء رأياً
أفضل من ر ...
... .28ابزين والنخارجان وسابور بن أبركان ويزدك [[كاتب]] الجند وباد بن فيروز وشروين
بن كامجار و ...
... .29ر هرمزد جرابزين حتى دخل على خاقان ومعه [[كتاب]] كسرى وأوصل إليه هدايا
كسرى وألطافه فقبل
... .30م في بلدهم فأجابوهم إليه وكتبوا بينهم [[كتاباً ]] :أل يتأذى أحد بأحد فأقاموا آمنين
واتخ ...
Figure (5.2) a sample of the concordance of the base-form (lemma) ktb in CAC.
Such a facility is useful enough when there are only a few lines to look into. But with
thousands of lines, the human mind can be overwhelmed by so much data. Statistical
techniques can help us go deeper and reveal what we might not have observed with the naked
eye.
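How a KWIC display of the kind shown in figure 5.2 is assembled can be sketched in a few
lines. The function kwic, its width parameter and the toy English sentence are hypothetical
and serve only as an illustration; the wild card search is emulated here with a simple prefix
test, which is an assumption rather than the behaviour of any particular concordancer.

```python
def kwic(tokens, is_hit, width=5):
    """Return one concordance line for every token for which is_hit(token) is true."""
    lines = []
    for i, tok in enumerate(tokens):
        if not is_hit(tok):
            continue
        left = " ".join(tokens[max(0, i - width):i])     # context before the node
        right = " ".join(tokens[i + 1:i + 1 + width])    # context after the node
        lines.append(f"... {left} [[{tok}]] {right} ...")
    return lines


# Hypothetical toy text; the prefix test stands in for a wild card such as letter*.
text = "he wrote a letter to the king and the king read the letter aloud"
lines = kwic(text.split(), lambda t: t.startswith("letter"), width=3)
```

Each line keeps the node in double brackets, as in figure 5.2, so that the collocates to its left
and right can be read off or passed on to a statistical routine.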
Today, some software for analysing concordance lines statistically is available. One of the
early attempts that used statistics to analyse corpora automatically was Choueka et al (1983).
They proposed an algorithm to retrieve collocations automatically from texts. However, their
work can only deal with a particular type of collocation: uninterrupted bigrams.
Church and Hanks (1990) proposed a measure to estimate collocations directly from
electronic corpora. This measure, which is called the association ratio, is based mainly on
the Mutual Information statistic. It is able to retrieve interrupted word pairs but is
limited to collocations that contain no more than two words.
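The Mutual Information statistic on which the association ratio is based compares the
observed probability of two words occurring together with the probability expected if they
were independent. The sketch below is a simplified illustration under stated assumptions: the
function name and the frequencies are hypothetical, and the calculation ignores the window
size and word order that the association ratio itself takes into account.

```python
import math


def mutual_information(f_pair, f_x, f_y, corpus_size):
    """log2 of the observed joint probability over the product of the single probabilities."""
    p_pair = f_pair / corpus_size
    p_x = f_x / corpus_size
    p_y = f_y / corpus_size
    return math.log2(p_pair / (p_x * p_y))


# Hypothetical counts: the pair occurs 50 times in a corpus of one million words.
mi_strong = mutual_information(50, 1000, 500, 1_000_000)
# A pair that co-occurs exactly as often as chance predicts scores about zero.
mi_chance = mutual_information(1, 1000, 1000, 1_000_000)
```

In the first case the pair occurs a hundred times more often than chance would predict, so
the score is log2 of 100, a little under 7. Since the score depends only on the ratio, pairs of
very rare words can obtain high values from a handful of occurrences, which is one reason
why the measure is usually read together with raw frequency.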
To remedy such drawbacks, Smadja (1991) designed a program called Xtract that can make
statistical observations in collocation extraction. He used statistical methods such as z-score
to identify relevant pairs of words. ‘Xtract retrieves interrupted as well as uninterrupted
sequences of words and deals with collocations of arbitrary length’ (Smadja, 1993: 150).
Statistical programs such as Collocate and Typical (Sinclair et al: 1998) can also analyse the
lexical context of words under examination. Collocate is designed to assess the significance
of collocations in a concordance file as it calculates the actual frequency of a given
collocation and normalises it with its expected frequency (ibid: 229-230). Typical is designed
to find the most typical citations for a given word in a line by assessing the significance of
co-occurring words in a line and then evaluating the whole line (ibid: 232).
In addition, there are some concordance programs consisting of a number of tools in one
package, such as wordlist, concordance and key words. Such programs also make
extensive use of statistics. For example, Wordsmith, which is designed by Scott (1996), uses
the chi-square measure, while CobuildDirect uses mutual information and t-score (Oakes,
1998: 193).
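The idea of setting the actual frequency of a collocation against its expected frequency, and
the t-score mentioned above, can be sketched as follows. This uses a common simplified
formulation of the t-score, (observed - expected) / sqrt(observed), and is an assumption for
illustration only: it does not claim to reproduce what Collocate, CobuildDirect or any other
program actually computes, and it ignores the size of the span.

```python
import math


def expected_frequency(f_node, f_collocate, corpus_size):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_collocate / corpus_size


def t_score(f_pair, f_node, f_collocate, corpus_size):
    """Simplified t-score: difference from expectation scaled by sqrt of the observed count."""
    expected = expected_frequency(f_node, f_collocate, corpus_size)
    return (f_pair - expected) / math.sqrt(f_pair)


# Hypothetical counts, the same as in the mutual information sketch.
expected = expected_frequency(1000, 500, 1_000_000)
t = t_score(50, 1000, 500, 1_000_000)
```

With these counts half an occurrence is expected by chance and the t-score is about 7.
Unlike mutual information, the t-score grows with the observed frequency of the pair, so it
tends to favour frequent collocations over rare ones.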
These programs are designed, in the first place, to work on languages written in the Roman
alphabet. Secondly, problems of polysemy (words with more than one meaning) cannot be
sorted out automatically except in opportunistic (specialised) corpora (Smadja 1991);
otherwise they need some sort of hand-editing. For instance, a word like bank, which means
either a financial body or one side of a river, could be disambiguated if we had, for example,
an economic corpus.
Homonyms are relatively uncommon in Arabic. However, Arabic is rather full of homographs
which are distinguished in pronunciation. Some learners of Arabic think that most Arabic
words are homonymous, which is not the case. This impression is due to the absence of vowels
in modern orthography; the vowels are rather predicted.29 Arabic is a language in which vowels
are represented in diacritic form. A change from one vowel to another makes a different
base-form, and ignoring these vowels produces such homographs. Such a phenomenon can cause
problems both for human learners of Arabic as a foreign language and for electronic processing.
This can be easily sorted out by inserting the diacritics when keying the corpus, though this is
obviously tedious and time consuming. Alternatively, one can use a tool to diacritise Arabic
text, providing the case endings according to their position in the sentence. This tool, which is
called the Diacritiser and is produced by Sakhr Company, helps disambiguate the seemingly
homonymous words. In theory, diacritising Arabic texts automatically cannot work all the time,
since the program is expected to make a choice from a long list of probable words. For example,
in ورد الرجل wrd ala alrajul, ورد wrd can be diacritised in the following diverse ways:
وَرَدّ wa radda ‘and replied’ and وَرُدّ wa rudda ‘and was replied’
All of the above choices can fit in the text on the syntactic level. To solve such a dilemma we
need to disambiguate these senses semantically. Moreover, we may change the positions of
the words inside the sentence for rhetorical reasons without breaching its meaning. Let us
consider example (9):
29 The vowels in Arabic are predicted according to personal intuition. In other words, an Arabic reader would
predict a certain vowel to occur in a certain position according to his own mental lexicon.
In (9) the nominative رَبُّهُ rabbuhu ‘his Lord’ which occurs next to the verb ابتلَى ibtala ‘tested’
name, because he is the Most High; this is a rhetorical device. In such sentences, a normal
morphological parser will confuse the accusative with the nominative because of the absence
of the diacritics which can distinguish between the two. Therefore, proofreading and
hand editing are necessary to eliminate such discrepancies before doing any sort of statistics
automatically. This is evidently tedious and time-consuming as well. I have not personally
assessed the tool (see www.sakhr.com).
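The way in which undiacritised orthography produces homographs can be illustrated with a short sketch. It is only an illustration in Python, written on the assumption that the diacritics are encoded as Unicode combining marks; the function name strip_diacritics is hypothetical and is not part of the Sakhr Diacritiser or of any tool used in this study.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (fatha, damma, shadda, etc.) from a string."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

# Two diacritised readings of the same consonantal skeleton w-r-d,
# written with explicit code points so that the marks are unambiguous.
WA_RADDA = "\u0648\u064e\u0631\u064e\u062f\u0651"  # wa radda 'and replied'
WA_RUDDA = "\u0648\u064e\u0631\u064f\u062f\u0651"  # wa rudda 'and was replied'
BARE = "\u0648\u0631\u062f"                        # undiacritised wrd

# Both readings collapse to the same written form once the marks are dropped.
print(strip_diacritics(WA_RADDA) == BARE)
print(strip_diacritics(WA_RUDDA) == BARE)
```

Going in the opposite direction, from the bare form back to one diacritised reading, is the difficult step, since nothing in the string itself indicates which reading is intended.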
5.7.1.3 Frequency
When we have large masses of electronic data to analyse, we have to find a way to sort it out
and simplify it in such a way that it would be easy to examine and manipulate. Statistics is
considered a good way of simplifying and telling us what things we would like to highlight,
as, for instance, some combinations of words will tend to occur relatively often, while others
are rare or impossible.
Statistics has been a useful technique in all branches of language studies (cf. Miller, 1963 &
Fasold, 1984). For corpus linguistics, it is particularly important (Allen, 1995; Charniak,
1993; Krenn and Samuelsson, 1997; and Oaks, 1998). Corpus-based statistical study, which is
an extension of traditional descriptive linguistics, can shed light on some aspects in language
which we might not be able to discern otherwise.
The starting point for analysing our corpus quantitatively to find collocations is counting. The
more frequently the word under examination (the node) occurs with another word (or words), the
surer we are that this combination forms a significant pattern. Analysing our corpus in such a
way does not work all the time, since much of the output we get may not be very interesting, as
shown in figure (5.3) below. The figure shows the frequency of the top 10 trigrams with ktb
(write, book), i.e. the most frequently occurring three-word phrases, in CAC.
Gloss                 Pattern            Frequency
in … Allah            الله ... في         110
from … Allah          الله ... من         53
what … Allah          الله ... ما         35
and in … Allah        الله ... وفي        32
from … in             من ... في          32
on … like             على ... كما        28
to … Allah            الله ... إلى        25
from … this           هذا ... من         24
harm … and not        ولا ... يضار       24
what … to them        لهن ... ما         23
Figure (5.3) the top 10 co-occurring trigrams of the base-form (lemma) ktb in CAC.
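Raw counts of this kind are simple to obtain. The following sketch, in Python, is an illustration only and is not the procedure used to build figure (5.3): it assumes that each pattern is a frame consisting of the words immediately to the left and right of the node, and that the caller supplies a hypothetical predicate is_node to decide which tokens count as forms of the node.

```python
from collections import Counter

def frame_counts(tokens, is_node):
    """Count (left word, right word) frames around every occurrence of the node."""
    counts = Counter()
    # Skip the first and last token, which lack one of the two neighbours.
    for i in range(1, len(tokens) - 1):
        if is_node(tokens[i]):
            counts[(tokens[i - 1], tokens[i + 1])] += 1
    return counts

# Toy English data standing in for a tokenised corpus.
tokens = "in the book of God and in the book of man".split()
counts = frame_counts(tokens, lambda t: t == "book")

# List frames from most to least frequent.
for frame, freq in counts.most_common():
    print(frame, freq)
```

As the discussion of the figure suggests, the most frequent frames returned by such a count tend to be dominated by function words, which is why a further statistical filter is needed.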
In figure 5.3 above most of the patterns have no special justification for occurring together.
We can notice that five of the ten trigrams with ktb co-occur with God’s name (Allah),
referring to the Qur’an, whereas four are flanked by function words. Frequency alone does not
tell us very much and may be misleading, because ‘frequency-based
search works well for fixed phrases. But many collocations consist of two words that stand in
a more flexible relationship to one another’ (Manning & Schütze, 1999:147).
By using statistical tests we are more likely to get reliable results and test how likely two
words are to occur near each other. There are some interesting and useful statistics that one
can use to assess and enhance such counts. The most prominent ones are z-score (Berry-
Rogghe: 1970), mutual information (Church & Hanks, 1990) and t-score (Church, Hanks and
Hindle, 1991).
probability with chance, i.e. to count the number of the occurrences of the combination with
the number of the occurrences of each word independently. Words with large mutual
information scores are likely to be more interesting (Church et al, 1991).
The formula as introduced by Church et al. for two given words x and y reads:
\[ I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \]
The Mutual Information compares probabilities of x and y together with probabilities of (x)
and (y) independently. Church and Hanks (1990) argue,
If p(x, y) is bigger than p(x) p(y), then it is evidence that there is more
likely a genuine association.
If p(x, y) equals or is less than p(x) p(y), then we can predict no
interesting association.
Paul Johnston30 has designed a program, available on his web site, that can do the calculation
automatically, on condition that one has the value of each variable. To use the formula to find
collocations in CAC, let us take the word الدنيا al-dunya (the world) as our search term and
then carry out the calculations for the word as shown in table (5.4). The word under
investigation, al-dunya, is given the value (x), whereas the corpus size is represented as (n).
30 http://lismore.ccl.umist.ac.uk/paulj/develop/mutual.html
f(x) = 1350, n = 5000000
Gloss                          (x, y)            f(x,y)   f(y)     MI
the world perishable           الدنيا الفانية     6        11       10.98
the world and its adornment    الدنيا وزينتها     7        17       10.57
the world and the hereafter    الدنيا والآخرة     79       571      9.00
the world good deed            الدنيا حسنة        14       398      7.2
the world and torture          الدنيا وعذاب       5        338      5.77
the world little               الدنيا قليل        6        673      5.04
the world house                الدنيا دار         6        551      5.03
the world and certifies        الدنيا ويشهد       9        1864     4.16
the world what                 الدنيا ما          24       6939     3.67
the world without              الدنيا دون         6        2150     3.36
the world means                الدنيا يعني        8        4462     2.73
the world except               الدنيا إلا         18       11157    2.57
the world mentioning           الدنيا ذكر         5        3853     2.26
the world from                 الدنيا من          45       34830    2.25
the world and for              الدنيا وأما        7        6987     1.89
the world in                   الدنيا في          19       24009    1.55
the world until                الدنيا حتى         10       13165    1.49
the world to                   الدنيا إلى         11       11214    1.34
the world namely               الدنيا أي          8        13356    1.14
the world then                 الدنيا ثم          11       22956    0.82
the world said                 الدنيا قال         15       32835    0.75
the world verily               الدنيا قد          7        24534    0.07
the world on                   الدنيا على         12       47664    -0.10
the world and not              الدنيا ولا         6        32000    -0.52
the world the statement        الدنيا القول       5        32835    -0.82
Table (5.4) the left collocates of the word al-dunya with maximum frequency of 100
and minimum frequency of 5.
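The scores in table (5.4) can be checked with a few lines of code. The sketch below is an illustration in Python and is not the program referred to above; it assumes that probabilities are estimated as relative frequencies (f/n) and that the logarithm is taken to base 2, assumptions which agree with the first rows of the table. The function name mutual_information is hypothetical.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI score from raw counts, estimating each probability as count / n."""
    # P(x, y) / (P(x) * P(y)) simplifies to (f_xy * n) / (f_x * f_y).
    return math.log2((f_xy * n) / (f_x * f_y))

F_X = 1350      # frequency of the search term al-dunya in CAC
N = 5_000_000   # corpus size as given above table (5.4)

# First row of the table: f(x,y) = 6, f(y) = 11.
print(round(mutual_information(6, F_X, 11, N), 2))    # 10.98
# Third row of the table: f(x,y) = 79, f(y) = 571.
print(round(mutual_information(79, F_X, 571, N), 2))  # 9.0
```

A score above zero means that the pair occurs more often than chance would predict, and a negative score, as in the last three rows of the table, means that it occurs less often.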
It is the nature of Arabic orthography to attach some particles, personal pronouns in either
the genitive or accusative case, and the definite article to the following or preceding string
of characters. For instance, conjunctions like و wa ‘and’ and فـ fa ‘and, consequently, after’,
and the definite article al (the), are considered, in writing, as parts of the words that follow.
To decompose such combinatory units automatically may lead to a serious problem of
identifying what a word is. This is because such units can be kernel parts of base-forms in
Arabic. For example, و wa ‘and’ could be a conjunction and could function as the initial
letter of hundreds of Arabic words like وجد wajada ‘he found’, واحد waah{id ‘one’, وليد
waliid ‘newborn’, etc.
To me, it would be more realistic to stipulate from the very beginning what a word is. The
word has been defined as 1) a sequence of characters between spaces; 2) a minimal permutable
unit; 3) a maximally uninterruptible unit (Cruse 2000). I will consider the word to be what
occurs between spaces, as this is easier and more practical. In addition, in English, the
mainstream practice is to count words like within, insofar and themselves as three words,
irrespective of how many units each contains. To be more practical, I will treat the particle
as a part of the word, like an affix.
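The working definition adopted here, the word as whatever stands between spaces, is also the simplest one to implement. The following Python sketch illustrates it; the function name orthographic_words is hypothetical, and punctuation handling is deliberately left out.

```python
def orthographic_words(text):
    """Split a text into orthographic words, i.e. whatever stands between spaces."""
    # str.split() with no argument splits on any run of whitespace.
    return text.split()

# The conjunction wa- and the definite article al- stay attached to their hosts,
# so the phrase below is counted as two words, not five units.
phrase = "والدنيا والآخرة"
print(len(orthographic_words(phrase)))
```

No attempt is made to detach the initial و, which avoids wrongly splitting words such as وجد in which the same letter belongs to the base-form.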
In table 5.4 the search term الدنيا al-dunyaa ‘world’ occurs 1350 times in CAC. The most
significant left collocates, i.e. pairs with the highest MI scores, are الآخرة al-‘aakhirah
‘hereafter’ with an MI score of 9.00, زينتها ziinatahaa ‘adornment’ at 10.57 and الفانية
al-faaniyah ‘perishable’ at 10.98. These collocations, reiterated by Muslims in religious
contexts, describe the reality of this world according to the Muslim perspective. So, Muslims
view the world as an adornment which will inevitably perish, whereas the genuine life will be
in the Hereafter.
The pair الدنيا حسنة al-dunyaa h{asanatan ‘the world a good deed’ appears as a strong
collocation. It is part of an often-quoted prayer (supplication), given in (10) below:
(10) اللهم آتنا في الدنيا حسنة وفي الآخرة حسنة وقنا عذاب النار
The collocate h{asanatan does not modify al-dunyaa in the first place; it is rather a direct
object of the verb aatinaa (give us). Moreover, the repeated citation of this prayer is not an
independent occurrence of this collocation. Accordingly, we need a mechanism to single out
real collocations from the apparent ones. The t-score, which is a general statistical measure
that can compare two probabilities, is a useful statistic to assess the relative strength of
collocation. This will be discussed in detail in the next chapter.
Table (5.5) shows that the search term رسول rasuul ‘Messenger’ collocates with الله ‘Allah’,
مرسل mursal ‘sent’ and مصدق mos}addaq ‘truthful’, given their high MI scores. Of the three
collocations, the first has a strong bond with our node, with MI at 7.99. This draws our
attention to the strong bond between the pair رسول الله rasuul Allah ‘Allah’s Messenger’, in
addition to the main traits of this Messenger, which are ‘truthful’ and ‘sent by Allah’. Because
of the very strong bond between رسول rasuul ‘Messenger’ and الله Allah, the word الرسول
al-rasuul ‘the Messenger’ with the definite article can replace the whole pair.
The major problem with MI is that it does not work well with sparse data, a problem that it
shares with all statistical tests. We cannot calculate the probability of a given pair if one of
the variables has the value zero, and with very low
occurrences, the measure does not work very well either. Manning and Schütze (1999: 169)
calculated the MI scores of ten bigrams that occurred once to prove the invalidity of MI with
sparse data. They found out that ‘a large proportion of bigrams are not well characterised by
corpus data (even for larger corpora) and that mutual information is particularly sensitive to
estimates that are inaccurate due to sparseness’.
Obviously such a problem can be superficially avoided by using only words with a frequency of
at least three or four. Sinclair (1996) proposes a primitive test of the significance of a
given pattern: looking only at patterns with a minimum frequency of two. He noted that, despite
the insufficiency of such a condition, it could guarantee that such a pattern is not accidental.
This is in conformity with corpus linguistics methodology, i.e. investigating how typical
a given pattern is.
Allen (1995: 194-5) proposes a more practical solution: adding a small amount to each
count to guarantee that there will be no zero probabilities. This technique is called the
expected likelihood estimator (ELE). For example, if one of the terms in the formula happens
not to occur in our corpus, i.e. its count equals zero, the Mutual Information statistic will be
inapplicable, giving no result. ‘The ELE, however, gives an equally likely probability to each
possible word class’ (ibid: 195).
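The idea of adding a small amount to each count can be sketched as follows. This is a minimal illustration in Python rather than Allen's own formulation: the increment of 0.5 is an assumption (it is the value usually associated with the ELE), and the function name ele_probability and the toy counts are hypothetical.

```python
def ele_probability(count, total, n_outcomes, delta=0.5):
    """Smoothed probability: add delta to every count before normalising."""
    # The denominator grows by delta for each possible outcome,
    # so the smoothed probabilities still sum to one.
    return (count + delta) / (total + delta * n_outcomes)

# Toy counts for four possible outcomes, two of which were never observed.
counts = [0, 3, 7, 0]
total = sum(counts)
probs = [ele_probability(c, total, len(counts)) for c in counts]

print(probs)       # unseen outcomes now receive a small non-zero probability
print(sum(probs))  # the distribution still sums to one
```

Because no probability is zero after smoothing, a ratio such as the one inside the Mutual Information formula can always be computed, although the scores for unseen or very rare pairs remain unreliable.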
Let us now consider the following example to see how useful MI is in extracting collocations.
MI can reveal what is not expected or is often missed among the obvious typical patterns. When
discussing the collocations of body parts in Classical Arabic, Emery (1988) argued that the
adjectives in حرب ضروس h{arb d}aruus ‘fierce war’ and جيش جرار jayshun jarraar ‘huge
army’ uniquely collocate with their preceding nouns. Having analysed CAC, I found that
neither of them co-occurs with these nouns even once. The adjective ضروس d{aruus ‘fierce’
rather co-occurs with مطر mat}arun ‘rain’, while the adjective جرار jarraar ‘huge’
collocates with عسكر ‘askarun ‘soldiers’. On the other hand, he was successful in ascertaining
that the verb أطرق at{raqa ‘bowed’ uniquely collocates with a particular body part: رأس ra’s
‘head’. However, Table (5.6) below shows that there are more categories, other than body
parts, which أطرق can collocate with, such as أطرق حياء at}raqa h{ayaa’an ‘bowed out of
shyness’ and أطرق كرا at}raqa kara ‘Kara bowed’.
Nevertheless, أطرق at}raqa ‘bowed’ in table 5.6 strongly collocates with رأسه ra’sahu ‘his
head’, as its high MI score shows. The second combination in the table appears
as a strong collocation despite its low frequency, because the word it collocates with is rare.
All three occurrences of this word occur only with أطرق at}raqa. This gives an indication
that such a combination is more likely to be a cliché, an idiom or some other stereotyped phrase.
In fact, all of them belong to a certain context or domain: proverbs; this leaves us no doubt
that the combination under investigation is part of a proverb.31
MI is useful only for testing similarities, which is good for finding collocations as it can
calculate the probability whether two words occur together very often in a text. But we
cannot use it to test the differences between words, which is necessary for assessing
seemingly synonymous or collocated words. It can give evidence for the closely related
words, if you find x, you are more or less likely to find y. For testing differences, I need a
31 Using corpus-based analysis to assess Emery’s results is useful in supporting or invalidating his hypothesis.
different statistic, which is t-score.
In Chapter Five we introduced the Mutual Information statistic, which is useful for detecting
similarity between items. We will now use another statistical technique, the t-test, which is useful
in assessing the significant differences between two groups of patterns, typically pairs of near
synonyms. The main aim behind it, as suggested by Church et al (1991), is to see the more
significant words that are more likely to appear with each item of the synonymous pair. An
example of strong and powerful, as given by Church et al, can show the importance of this
test. Contrary to Mutual Information which can only make positive statements32 or what is
more likely to occur after a given item, the t-test can work the other way around, i.e. highlight
what is less likely to occur after that item. In other words, the difference between powerful
and strong in powerful support and strong support can be brought out by comparing the most
significant right collocates of both of them. By analysing the significant collocates, Church et
al (1991) managed to abstract an attribute that can differentiate between both words, namely
intrinsic vs. extrinsic.
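The t-test simply calculates the difference between two probabilities. The formula as given in
Church et al (1991) for the pair of words ‘strong’ and ‘powerful’ is represented as:
\[ t = \frac{P(w \mid \text{strong}) - P(w \mid \text{powerful})}{\sqrt{\sigma^2(P(w \mid \text{strong})) + \sigma^2(P(w \mid \text{powerful}))}} \]
where w stands for the collocate and σ for the standard deviation.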
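In practice the t-test for a pair of near synonyms is often approximated directly from co-occurrence counts (cf. Manning & Schütze, 1999). The Python sketch below illustrates that approximation and is not the exact calculation used in this study; the function name t_score_difference and the counts are hypothetical.

```python
import math

def t_score_difference(f_first, f_second):
    """Approximate t-score for the difference between two co-occurrence counts.

    f_first  : how often the collocate w follows the first word (e.g. strong)
    f_second : how often the collocate w follows the second word (e.g. powerful)
    """
    if f_first + f_second == 0:
        raise ValueError("the collocate occurs with neither word")
    # The difference of the counts is scaled by an estimate of its standard deviation.
    return (f_first - f_second) / math.sqrt(f_first + f_second)

# Hypothetical counts: w occurs 10 times after the first word and never after the second.
print(t_score_difference(10, 0))  # positive: w is characteristic of the first word
print(t_score_difference(0, 10))  # negative: w is characteristic of the second word
```

The sign of the score shows which member of the pair the collocate prefers, which is what allows the test to make the negative statements that Mutual Information cannot.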
32 By positive statement I mean that in MI we can find the words which are more likely to co-occur after X but
we cannot account for the items which are more significant with Y or did not occur at all with either. T-test can
make a negative statement by looking at items which are less likely to co-occur with either X or Y altogether.
Finally, we have to bear in mind that the statistical calculation is not an end in itself in
linguistic analysis. As Sinclair (1996: 80-81) puts it.
Chapter Six: Synonymy: An overview
6.1 Introduction
As discussed earlier, synonymy is a paradigmatic relation that holds between words on the
vertical axis (cf. Ch. 5). This type of sense relation simply means the sameness or similarity
of meaning as defined in dictionaries. When asked about the meaning of a given word, people
will intuitively offer one or more alternative words. In this respect, many dictionaries have
been compiled to fulfil that purpose, like Roget’s Thesaurus, Webster’s Synonym Dictionary and
Crabb’s English Synonyms. Such dictionaries provide for every entry a list of words that are
close in meaning, or a descriptive account of the concept. But is that closeness of meaning to
be considered synonymy?
On the other hand, some linguists like Bloomfield (1935: 145) deny the existence of
synonyms in natural languages. Therefore we need from the very beginning to explain what
exactly synonymy is and give a systematic survey of the phenomenon to put forward a more
convincing explanatory hypothesis that will be statistically applicable.
As the subject matter of this thesis is to look into synonymy in a different way and to
examine readily empirical issues that have interesting theoretical results, I will try in this
chapter to review the phenomenon from the corpus linguistic perspective. Through corpus
analysis we can show whether two items are indeed absolute synonyms or not by checking
their relations in all the available contexts.
6.2 Definition
Synonymy is defined as a relation between two or more expressions which are different in form
but not in meaning (Harris, 1973: 6). Having two different phonological words with the same
meaning raises the question of how much sameness they must share. Should it be complete
similarity, strong similarity or even a thin shade of similarity? Below in section (6.2.1) we
explain how similar in meaning one word must be to another to be called a synonym. Let us
first consider the expressions involved.
It is important from the very beginning to distinguish two notions of semantic similarity: a)
similarity between single words and b) paraphrase. For instance commence and start are
synonymous verbs, whereas human male is a paraphrase of man. We are concerned with the
first type of similarity. Synonymy in this sense is defined as a relation of similarity in
meaning between lexical items.
Synonymy can also be defined as sameness of intension or extension34 (Jones, 1986: 66). The
extension of a word (its denotation) is the set of all things referred to by the word. For
example, the extension of the word dog is the class of dogs. The intension, on the other hand,
is the set of properties of the word (or, to put it in Lyons’ words, ‘the set of attributes which
characterise any entity to which the term is correctly applied’ (Lyons 1968: 454)). For example,
dog entails animal.
Palmer (1981: 88) defines synonymy as ‘symmetric hyponymy’. For instance, if we take car
and automobile as synonyms, then they have to be hyponyms of each other, i.e. all
cars are automobiles and all automobiles are cars. In this respect, synonymy is considered a
type of hyponymy: ‘if X is a hyponym of Y and if Y is also a hyponym of X, then X and Y are
synonymous’ (Hurford & Heasley, 1983: 107).
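The definition of synonymy as symmetric hyponymy can be made concrete by treating the extension of a word as a set. The following Python sketch is purely illustrative; the function names and the toy extensions are hypothetical and are not drawn from the sources cited.

```python
def is_hyponym(ext_x, ext_y):
    """X is a hyponym of Y if everything in the extension of X is also in that of Y."""
    return ext_x <= ext_y

def are_synonyms(ext_x, ext_y):
    """Symmetric hyponymy: each word is a hyponym of the other."""
    return is_hyponym(ext_x, ext_y) and is_hyponym(ext_y, ext_x)

# Toy extensions: sets of identifiers standing for things in the world.
car = {"v1", "v2", "v3"}
automobile = {"v1", "v2", "v3"}
vehicle = {"v1", "v2", "v3", "b1"}  # also includes a bicycle

print(are_synonyms(car, automobile))  # hyponymy holds in both directions
print(is_hyponym(car, vehicle))       # hyponymy holds in one direction only
print(are_synonyms(car, vehicle))     # so car and vehicle are not synonyms
```

On this view the test for synonymy reduces to set equality, which is also why it says nothing about words such as unicorn whose extension is empty.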
A more restrictive definition of synonymy was put forward by Quine who views synonymy
as ‘two forms are synonymous if their interchange leaves their contexts synonymous’ (Jones,
1986: 70). This definition requires that the two forms under investigation are interchangeable
in every possible context. The synonymy relation between a given pair of words can be ruled
out if we spot any change in the context. Tests have been introduced to check the credibility
of any seemingly synonymous pairs as will be discussed in section 6.2.2 below.
34 The extensional approach is the only way to give all sorts of information. However, the intensional approach
to meaning is more general than the extensional approach since there are some words which do not have an
extension like unicorn. Unicorn entails animal; however it does not refer to anything extensionally. Therefore, it
is relatively easier to study words intensionally than extensionally. Moreover, hyponymy will not come up if we
do not take that approach (e.g. dog is a hyponym of animal).
similarity to be considered synonymy.
synonymy in five foot wide. He did not go into more detail. He also stated that ‘one can also
distinguish between synonyms by finding their opposites (antonyms).’ For instance decline
and reject are not synonymous if opposed to rise and accept.
Haas (quoted in Cruse, 2000) comes up with a different test, i.e. the normality profile test.
The meaning of a word is its normality profile across its grammatical occurrences. He argued
that there is a normality profile for all possible words and sentences in a language. One single
occurrence where we find one item of a pair more or less normal than the other can
undermine the synonymy relation between them. ‘Every difference of meaning between two
expressions will show up as a difference of normality in some context’ (Cruse, 2000: 12). For
example: illness and disease are not synonymous as in (1a&b).
We can distinguish between words through the grammatical aspects of meanings, as every
word can be more or less normal than the other. Haas used the notion of normality as a
primitive intuition.
Fourth Approach: halfway between the two extremes
The definitions given by Lyons (1969 and 1981) and Cruse (1986 and 2000) represent this
attitude. They do not see interchangeability in all texts as a requirement for synonymy
recognition. They rather made a distinction between different categories of synonymy.
Later on, Lyons (1981: 50-51) drew a distinction between three types of synonymy:
1. synonyms are fully synonymous if, and only if, all their meanings are identical;
2. synonyms are totally synonymous if, and only if, they are synonymous in all contexts;
3. synonyms are completely synonymous if, and only if, they are identical on all (relevant)
dimensions of meaning.
Absolute synonymy combines all these three categories. Lyons (ibid: 51) states, ‘absolute
synonyms are expressions that are fully, totally and completely synonymous’. If one of the
above criteria were missed, synonymy would be partial (ibid). Lyons made a further
distinction between partial synonymy and near-synonymy. The latter is defined as
‘expressions that are more or less similar, but not identical, in meaning’ (ibid: 50). Cruse
(1986: 292) viewed Lyons’ distinction of partial and near-synonymy as one. ‘By his (Lyons’)
definition near-synonyms qualify as incomplete synonyms, and therefore as partial synonyms
(though, of course, they represent only one variety)’. But later on he regarded them as
different degrees of similarity as shown below.
three types of synonyms according to the degree of the similarity that holds between items:
absolute, propositional, and near-synonymy. These types can be located on a scale at the end
of which falls absolute synonymy.
In the above examples the (a) sentences are more normal than their (b) counterparts. It should
be noted that when doing the test we have to stick to one meaning of the word under
investigation especially with words of subtle differences. For example, sad and happy, in
4 and 5 below, can modify either animate or inanimate objects.
The Haasian test is semantically based and does not work otherwise. Cruse (1986: 281) gave
some examples with no semantic explanation. For instance, one’s record can be spotless,
unblemished or impeccable, but not flawless, whereas one’s credentials cannot be but the last.
This is called idiosyncratic collocational restrictions. This is an important point that gives our
premise more credence when we talk about the treatment of synonymy through collocations.
For instance, fiddle and violin are propositional synonyms, because the two sentences: He
plays the violin very well and He plays the fiddle very well, i) have the same syntactic
structure, ii) have the same truth-conditional properties as they entail one another.
Accordingly, Lyons’ definition of synonymy discussed in section 6.1.1 above comes out as
propositional.
The key point in defining propositional synonymy is substitutability with the truth-condition
preserved, as Lyons put it ‘substitutability salva veritate’, so it is less strict than absolute
synonymy which requires, in addition to keeping the truth-condition of the substituted words,
the same contextual environment. The way to prove that two words are propositional
synonyms is to find a situation where one is more or less typical than the other, while
preserving the truth-value. Such a type of synonymy is more common than absolute
synonymy, for instance, begin: commence, car: automobile, die: pass away and brave:
courageous.
a. Dialectal variations, e.g. autumn and fall (the latter is used in American English).
b. Stylistic variations, e.g. begin and commence (the latter is more formal).
c. Emotive variations, e.g. politician and statesman (which differ in showing approval or
disapproval).
d. Collocational variations, e.g. rancid bacon or butter and addled eggs or brains.
More precisely, Cruse (1986 & 2000) discussed these differences as follows:
• Differences in expressive meaning: a sentence can be expressively neutral, positive or
negative as shown in the examples below:
The sentence (7) above is neutral, whereas (8) has an additional meaning of respect and (9)
has a sense of disrespect.
• Differences of evoked meaning
Dialect: different lexical items that are used in different dialects in the same range of
references. Geographical: autumn: fall, corn: wheat: oats, etc.; temporal: wireless: radio,
swimming baths: swimming pool; social: sofa: settee, lavatory: toilet.
Register: the change of situation, the audience or the speaker’s intention may bring up
different lexical items with same range of reference. Field: marriage: matrimony, dead:
deceased; mode: re: concerning: about; style: money: bread: dough: dosh: filthy lucre.
10.a) x died.
i)x is an organism.
ii) not-(x became not-alive).
Therefore, (i) is a logical presupposition of (6.a).
In negation, one or two of the meaning components in (11.a) are left intact.
In (11) above, pass away is not a special way of dying; it just means ‘die’ when speaking
respectfully of humans. So two synonyms can differ in respect of what is highlighted, i.e. one
item can highlight one aspect and the other another. Therefore (i) and (ii) are
presuppositions (the latter is arbitrary because it depends on usage and collocation; we do not
use pass away with animals).
The above variations give rise to the significance of using synonyms in sensitive areas, like
taboo areas, such as when talking about sex, urination, defecation, etc. In other emotionally
sensitive areas like death and money one can also make use of these variations to choose
what is regarded as euphemistic (Cruse, 2000: 158).
6.2.2.3 Near-synonymy
Near-synonymy (near-synonyms are called plesionyms in Cruse, 1986) is the type commonly
adopted by dictionary-makers, so it can be called dictionary synonymy. The difference between
it and propositional synonymy is that near-synonyms are not propositionally equivalent. Near
synonyms must share central aspects of meaning but are allowed to differ in peripheral
aspects. In more detail, when analysing sentences componentially, we can divide word
meanings into components or atoms. Thus, by central aspects we mean the capital
components, i.e. the heads, whereas peripheral means subordinate ones or the modifiers. For
example, pretty can be analysed as [GOOD LOOKING] [FEMALE], the former is considered
the head component and the latter is the subordinate one. In other words, the head is the first
sense that comes to one’s mind about a given word. So, pretty, handsome and beautiful are
considered near synonyms because they share the same capital component.
Near-synonymy can be easily tested by expressions like or rather, and more exactly (Cruse
1986: 287) with which we can signal the minor differences between near synonyms. Let us
consider the following example,
In (12.b) above lake and tree are not near-synonyms because of the great differences between
them.
In conclusion, in this study I will take the view that the phenomenon of synonymy should be
understood as a gradual cline along which we may locate different degrees of synonymy:
absolute synonymy, propositional synonymy and near synonymy. This view is consistent with
the widely held opinion among semanticists that strict or absolute synonymy is rare in human
languages (see Cruse: 1986). A further step is taken here in this study to demonstrate that
absolute synonymy does not exist in Arabic. The study will argue that Arabic never has two
words that mean nearly the same thing and are used in the same range of grammatical and
lexical patterns. To test the credibility of such a hypothesis, we will apply corpus-based
analysis methodology to a list of selected Arabic word pairs which are presumed by some
Arabic linguists to be absolute synonyms, to see how credible their presumption is.
constellations, time, clothes, food, weapons, etc. However, the best known of these classical
thesauri is Thacalibi. This was an Arabic dictionary based on a concept classification.
Haywood (1965: 113) described it as follows:
Such attempts were unsystematic by modern standards and cannot be regarded as equivalent to
modern thesauri, since they were not arranged alphabetically and lacked comprehensiveness.
Generally speaking, synonymy was frequently discussed, from a theoretical point of view, by
early Arab linguists. Some linguists like Sibawayhi, Al-Mubarrad and Al-Siyuti stressed that
synonymy is widespread in Arabic. On the other hand, Ibn Faris denied the existence of
synonyms because this would contradict the wisdom of Arabs, who always used words for a
reason. He argued that every word should have a specific meaning. Furthermore, Thaclab
argued that there is a difference of meaning between any given pairs of synonyms. For
example, investigating the contexts of qacada and jalasa ‘sit’ which are commonly taken as
synonyms will show that they differ in meaning (Versteegh et al, 1983,
p.174). Perhaps the idea of denying the existence of synonymy was introduced by Ibn Al-
Arabi (d. 802) whose apprentice Thaclab reported him saying, ‘any two forms used
synonymously by Arabs, everyone of them has a specific meaning which is missing in its
counterpart’ (Al-Anbari, Al-Addad: 7). Thaclab investigated the differences between lemmas
like qacada and jalasa manually and we can surely offer a more accurate analysis if we
investigate the phenomena computationally. However, we will not investigate this very pair,
because it has already been discussed by Thaclab. So, we will pay more attention to pairs
which are still considered as absolute synonyms.
In the above examples it is obvious to Arabic speakers that the two different verbs in every
sentence can be rendered by only one verb in English. Dickins, Hervey and Higgins (2002:
59) noted that all major parts of speech (N, V, Adj and Adv) can undergo such a phenomenon.
They also stated that the repetition of synonyms can be ‘syndetic’, when a connective is used,
particularly with the use of adjectives or ‘asyndetic’ without using connectives. This
conjunction between seemingly synonymous words is not only acceptable in Modern
Standard Arabic but is used frequently in the everyday language as well.
We may also find this phenomenon often used in Late Classical Arabic. Let us consider the
following examples from Al-Hamadhani’s Maqamat quoted by Tamas Ivanyi (1993: 52-53):
Ivanyi offered an explanation for how such pairs of conjoined, seemingly synonymous words
exist in Arabic. He argued that the synonymy of such pairs could be discredited by
virtue of semantic attributes like static and dynamic. In this way, each item of the pairs in
examples (16-19) can be either static or dynamic. However, this is not an inclusive condition,
since we may have pairs in which one item can be regarded as more general than the other.
In addition, in some cases both terms of the pair could be dynamic or both static, so a
refinement of Ivanyi’s proposition is needed. To me, that proposition can be restated as follows:
the meaning of one of the two items of the pair may be more general than that of the other.
Accordingly, in (16-19) above, one term of the pair tends to have more action than the other
in the sense that one expects the addressee to understand the repetitive synonymous term, i.e.
one of the two synonymous words as emphasis. This process is used merely for subtle
discourse as Ivanyi (1993: 53) put it:
(and not stylistic) roots of the phenomenon we called here semantic
conjunction.
6.5 Conclusion
This chapter has discussed the various approaches to and types of synonymy; this is essential
to our research orientation, since it grounds the analysis of our data in the following chapter
in a detailed theoretical stance. With respect to absolute synonymy, the notion of
substitutability in all contexts can readily be tested against corpus evidence by comparing the
concordances of supposedly synonymous items in order to identify all possible contextual
overlaps or disparities.
Chapter Seven: Collocational Treatment of Synonymy in Arabic
7.1 Introduction
This chapter will discuss the semantic relation of synonymy and how synonyms behave across
contexts, in order to highlight the subtle differences that might occur between them, to extract
semantic features which can distinguish them, and to explore the possibility of
detecting such differences using statistical analysis of corpora.
I will argue that collocation is very useful for describing word meaning and is a mechanism by
which we can account for seemingly synonymous pairs. According to Lyons (1995), the
collocational range of an expression can reveal the differences between apparent synonyms.
Collocational range is thus one of the conditions he gives for considering a pair of words
absolute synonyms.
Following Lyons, I propose that employing collocation in the analysis of synonyms can help
distinguish their meanings and reveal the similarities and/or dissimilarities that hold between
them. By this technique, it is possible to compare seemingly synonymous words to find out
whether they are real synonyms or not. As mentioned in the previous chapter, absolute
synonymy can be ruled out if we come across a single context in which one member of the
pair carries more meaning, has a different distribution or is used in a different register. I will
argue that absolute synonyms do not exist in terms of their collocational patterns. Through
collocation we can distinguish one sense of a word from another and determine whether a
seemingly synonymous pair are real synonyms or not. Collocation is, therefore, a device with
which words with multiple senses can be accounted for precisely. In order to show that these
subtle differences can be brought out by collocation, I will analyse the collocates of a list of
synonymous pairs.
is Ghali (1998). The items in this list are also used in Al-Askari (undated), Al-Hamadhani
(1991), Al-Yaziji (1970), and Leceibi (1980). The meanings of these items were first examined
in four Arabic dictionaries in order to arrive at the most seemingly synonymous pairs,
which are presented in Table (7.1) below. These dictionaries are: Al-Fayruzabadi’s Qaamuus
al-Muh}iit} ‘Al-Muh}iit} Lexicon’, Al-Bustani’s Muh}iit} Al-Muh}iit}, Majmac
al-lughah al-carabiyyah’s Al-Wasiit}, and Ibn Manz}ur’s Lisaan Al-cArab.
The list was restricted to general words rather than genre-specific words, whose usage and
meanings may differ from one domain to another. For example, the word ‘elements’ in
physics could mean ‘the four natural elements’, whereas in literary texts it could mean
‘factors’ or ‘principles’.
Table (7.1): The sets of randomly selected synonyms for our analysis.
Following Barnbrook (1996: 90), I will consider words that occur at least three times within
the span to be relevant for collocational analysis. This is because words that occur just once
or twice can give spuriously high significance scores.
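The frequency threshold can be illustrated with a small sketch. The Python below is an illustration only, not the procedure actually used in this study: it counts the word forms that fall within a given span of a node word and keeps those that reach a minimum frequency. The function name, the toy transliterated tokens and the span value are assumptions made for the example.

```python
from collections import Counter


def collocates_in_span(tokens, node_forms, span=4, min_freq=3):
    """Count word forms within `span` tokens of any node form.

    Only collocates that reach `min_freq` are returned. This is a
    hypothetical helper, not a tool used in the thesis.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token not in node_forms:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node itself
                counts[tokens[j]] += 1
    return {word: n for word, n in counts.items() if n >= min_freq}


# Toy input (invented for illustration, not taken from the corpus).
toy = "jaa'a al-rajul thumma jaa'a al-rajul wa jaa'a al-rajul ila al-bayt".split()
print(collocates_in_span(toy, {"jaa'a"}, span=1, min_freq=3))
```

With the threshold lowered to 1, the words that occur only once next to the node would be kept as well, which is exactly the situation the threshold is meant to avoid.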
For practical reasons I suggest using the distribution of the word under investigation, as
represented by its collocates, rather than the whole concordance line. This helps us
decide from the very beginning what to look for in the concordances. The Mutual
Information (MI) statistic can help us observe which patterns are most distinctive. This is a
good first step in deciding what we are going to analyse, as it summarises all the concordance
lines and enables us to compare and contrast them, bringing out the subtle differences between
seemingly synonymous items by examining their collocations.
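As a rough illustration of the statistic, the sketch below computes MI in its usual corpus-linguistic form, log2(f(x, y) · n / (f(x) · f(y))), where n is the corpus size, f(x) the frequency of the node, f(y) the frequency of the collocate and f(x, y) their co-occurrence frequency. Whether the scores in this study include a further correction for span size is not stated here, so the formula and all counts below are assumptions for illustration.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """MI = log2( f(x, y) * n / (f(x) * f(y)) ).

    All arguments are raw frequencies; the uncorrected form of the
    statistic is assumed here.
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((f_xy * n) / (f_x * f_y))


# Invented counts: a collocate seen 16 times in a 1024-word corpus,
# 8 of them near a node that itself occurs 64 times.
print(mutual_information(f_xy=8, f_x=64, f_y=16, n=1024))

# A word that occurs once in the whole corpus, next to the same node,
# scores higher still, which is why low-frequency pairs are discarded.
print(mutual_information(f_xy=1, f_x=64, f_y=1, n=1024))
```

The second call shows how a single chance co-occurrence can outrank a recurrent pattern when no frequency threshold is applied.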
Collocation as defined in Chapter Five does not necessarily hold between adjacent words; we may
have collocation between non-adjacent words. Collocation can also include items that habitually
co-occur with other items from a definable semantic set, i.e. semantic prosody.
A semantic feature is then identified, based on the collocational distribution of the two items,
to show the difference between them. The semantic feature that distinguishes the meaning of a
given synonym can be discovered by sorting the collocations of each item into a separate list
according to their frequency. Then the word senses of both items are probed through their
collocations to find the semantic attribute that makes one item different from the other, in
terms of collocation.
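The comparison of the two collocate lists can be pictured as a simple set operation. The following sketch is only an illustration under that assumption; the function name is hypothetical, and the sample items are English glosses of a few collocates listed in tables (7.3) and (7.4).

```python
def compare_collocates(first, second):
    """Split two collocate lists into shared and exclusive items."""
    a, b = set(first), set(second)
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


# A handful of glosses from tables (7.3) and (7.4), chosen for illustration.
jaaa_collocates = ["victory", "clear proofs", "the night", "man"]
ata_collocates = ["calamity", "Syria", "the night", "man"]

result = compare_collocates(jaaa_collocates, ata_collocates)
for label, items in result.items():
    print(label, items)
```

The exclusive lists are where a distinguishing semantic feature would be sought; the shared list marks the area of overlap between the two items.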
If the difference between a given pair of words is not brought out by a simple scrutiny of the
MI results, we will use the t-test statistic. An independent t-test compares the means of two
samples that are selected independently of each other (the words in the two groups are not the
same).
For further confirmation, we will apply the substitution test of one word for the other (Ullmann,
1962: 143) to see, on the basis of intuition, whether any change occurs in the meaning of the
sentence. If we can exchange one word for the other in all contexts without changing the
meaning of the sentence to any extent, the two words are eligible to be called absolute
synonyms.
The remaining part of this chapter will examine the four case studies presented in table (7.1)
above.
7.4 A case study: The word pair jaa’a and ata ‘come’
To test the credibility of our methodology, let us take the first synonymous pair, jaa’a and
ata, which are widely regarded as absolute synonyms, and look at their
contextual distribution. Before that, we give the definitions of jaa’a and ata as provided in
the most authoritative Arabic dictionaries.
Table (7.2) Definitions of jaa’a and ata in Arabic dictionaries
The dictionaries above distinguish three main meanings for jaa’a: (1) ‘come’, (2) ‘arrive’, (3)
‘do’. A further meaning, (4) ‘bring’, arises from the preposition bi ‘with’; however, it
is closely related to meaning (1). As for ata, it has meanings (1), (2), (3) and
(4), in addition to (5) ‘have sex’, which is euphemistically related to meaning (3). The
remaining meanings arise from the following prepositions: bi ‘with’ and cala
‘on’. Al-Wasiit} gives one more sense for ata, ‘approach’, which is also related to the
previous meanings. In table (7.2) the words are defined in terms of each other.
In order to analyse significant collocations, we took a number of preliminary decisions. First,
we discarded all combinations with a frequency lower than three, as indicated in 7.3.
Secondly, we established how frequent each item of the pair is in the corpus as a whole before
doing further analysis, so that we can compare that to the frequency of the words used with
it. Here n stands for the total size of our corpus, x for our search term and y for the
collocate. Thirdly, because wild-card search is inapplicable to Arabic texts, as
mentioned in section 5.4.2, hits were calculated so as to include all possible syntactic forms of
the pair under investigation. The significant collocations for ata are shown in table 7.3 below,
whereas the collocations of jaa’a are presented in table 7.4.
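Counting hits without wild-cards amounts to searching for an explicit list of surface forms and summing the results. The sketch below illustrates that idea only; the function name and the transliterated forms are hypothetical and do not reproduce the list of forms actually searched in CAC.

```python
from collections import Counter


def count_forms(tokens, surface_forms):
    """Return per-form counts and the total for an explicit list of forms."""
    wanted = set(surface_forms)  # a set avoids counting a repeated form twice
    freq = Counter(token for token in tokens if token in wanted)
    return dict(freq), sum(freq.values())


# Invented token stream and an invented, incomplete list of forms of ata.
toy_tokens = ["ata", "al-rajul", "ya'ti", "ata", "ataw", "ila"]
forms_of_ata = ["ata", "ya'ti", "ataw", "atat"]

per_form, total = count_forms(toy_tokens, forms_of_ata)
print(per_form, total)
```

The total over all forms plays the role of f(x) for the search term, while the per-form counts make it possible to check individual forms by hand.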
the prophet النبي 169 6777 5.81
no ما 6 6939 5.07
man رجل 9 8646 5.06
Jibreel جبريل 5 386 4.86
Syria الشام 4 520 4.11
calamity بأس 3 404 4.06
the good الخير 15 2269 3.89
the mosque المسجد 5 924 3.60
heaven السماء 4 814 3.46
to إلى 51 11214 3.35
messenger رسول 46 11805 3.13
command أمر 10 3002 2.90
Makkah مكة 3 914 2.88
the night الليل 4 1352 2.73
Moses موسى 5 1985 2.50
owner ذا 5 2023 2.47
the truth الحق 3 1171 2.31
on على 74 36416 2.19
women النساء 4 2035 2.14
his family أهله 8 537 1.83
king ملك 8 5046 1.78
Umar عمر 4 3180 1.50
with it به 35 28912 1.44
his wife امرأته 6 401 1.22
people بن 7 4564 0.96
from من 24 34830 0.63
with him ومعه 5 8738 0.36
day يوم 3 5243 0.36
his tribe-men قومه 4 7901 -0.01
sin معصية 7 19246 -0.50
son ابن 5 16734 -0.57
in في 6 24009 -0.82
father أبا 4 17618 -0.96
he هو 3 15055 -1.15
that ذلك 3 17455 -1.36
with fertility بالخصب 3 3 10.92
with good deed بالحسنة 8 11 9.97
empty فارغا 5 11 9.79
visiting زائرا 3 7 9.70
clear proofs البينات 24 137 8.41
with lies بالكذب 3 29 7.65
nomad أعراب 10 244 6.31
second ثانيا 4 111 6.13
dragging يجر 3 87 6.07
victory نصر 10 422 5.52
time وقت 27 1726 4.92
the truth الحق 14 1171 4.54
knowledge العلم 15 1334 4.45
Ramadan رمضان 4 385 4.33
the night الليل 13 1352 4.22
man رجل 81 8646 4.19
to إلى 96 11214 4.06
Islam الإسلام 12 1493 3.96
one أحد 22 2775 3.94
Jibreel جبريل 3 386 3.92
owner صاحب 7 1011 3.75
somebody فلان 6 1005 3.54
The Qur’an القرآن 6 1117 3.38
in في 117 24009 3.24
the day-time النهار 3 706 3.04
wants يريد 4 961 3.01
information الخبر 6 1486 2.97
the boy الولد 6 828 2.81
the Prophet النبي 18 6130 2.51
wealth مال 4 1479 2.39
from من 67 34830 2.08
Moses موسى 4 1985 1.97
Umar عمر 6 3181 1.87
explanation تأويل 6 3347 1.80
the country-men القوم 4 2537 1.61
women النساء 3 2035 1.52
other آخر 6 4132 1.50
Messenger رسول 17 11805 1.48
Abraham ابراهيم 3 2154 1.44
on على 49 36416 1.39
command أمر 4 3002 1.37
day يوم 7 5243 1.37
hitting يضرب 3 3007 0.95
after بعد 7 7383 0.88
before قبل 3 3825 0.61
Allah الله 15 19246 0.60
people ناس 4 5882 0.40
already وقد 13 24534 0.04
with مع 4 7901 -0.01
father أبو 8 17618 -0.17
about عن 9 21153 -0.27
except إلا 5 12261 -0.33
said فقال 12 32835 -0.48
that ذلك 4 17455 -1.16
until حتى 3 13165 -1.17
this هذا 3 15501 -1.40
Analysing the concordances of ata and jaa’a shows that there is a wide range of overlap
between them; as we can see in the examples below, there are several instances where both
appear with words denoting place, person, time or abstract object. But jaa’a tends to be used
more frequently with time and, unlike ata, is always followed by the preposition ila ‘to’
before places, as can be seen in the tables. For the sake of brevity, we have given
translation and transliteration only for the material that is relevant to our discussion.
(1.a.)
فسار يمشي ويتتبع آثار الطريق حتى جاء إلى باب المدينة
fasaara yamshii wa yatatabbac aathaara al-t}ariiq h}atta jaa’a ila baab al-madiinah.
He kept going, following the road signs until he arrived at the entrance of the city.
(1.b.)
فسار حتى أتى الشام فقتل أهلها
(2.a.)
ثم جاء النبي صلى الله عليه وسلم يمشي في الصفوف.
(2.b.)
أتى النبي صلى الله عليه وسلم بيت فاطمة فلم يدخل.
(3.a.)
فلما جاء الليل نام
(3.b.)
ولما أتى الليل طلبته أمه فلم تجده
(4.a.)
وقل جاء الحق وزهق الباطل
(4.b.)
وهل يأتي الخير بالشر؟
Let us now study the statistics given in tables (7.3) and (7.4) above to see how similar or
dissimilar the collocations of the word pair under examination are. The first obvious point
is that in table (7.4) the most statistically significant collocate of jaa’a, i.e. the
collocate with the highest MI score, is bi-l-khis}b ‘with fertility’, with an MI score of 10.92. As
for ata, table (7.3) shows that al-faah}ishah ‘mischievous deed’ is the strongest collocate,
with an MI score of 9.60. Can we say then that the semantic feature which distinguishes between
jaa’a and ata is positivity vs. negativity?
Actually, we cannot arrive at an exclusive distinction between jaa’a and ata from
such a simple analysis, because words can have
multiple senses, and different syntactic forms can entail different senses. So we need to
make a more precise analysis before coming to a conclusion.
It is important to mention that jaa’a and ata followed by the preposition bi ‘with’ are
frequently used in CAC with the meaning ‘to bring’, but our pre-theoretical notion of what
a word is does not count prepositions or conjunctions that are attached to the root word. ata,
in particular, has several meanings in different contexts. For example, ata followed by the
preposition cala means ‘to finish off or destroy something’: ata cala al-t}acaam ‘he has
finished all the food’; ata cala al-‘akhd}ar wa-l-yaabis ‘he destroyed everything’ (literally: he
destroyed the cultivated and non-cultivated land). It can also be used metaphorically to refer
to having sex, for example ata imra’tahu/ahlahu ‘to have sex with his wife’. Using ata in
this sense is a euphemism, a device which is widely used in the Qur’an. However, we will
restrict ourselves to analysing only one sense of ata, namely ‘come’, to set it off against jaa’a,
which is mainly used in this sense. For example, we manually eliminated instances where ata means
‘commit’, which constitute about 3% of all occurrences of ata. To isolate this sense,
i.e. ‘come’, we have to manually proofread our counts and exclude all the instances
which have other meanings. This particular use of ata and jaa’a is interesting to analyse
because their meanings are so similar that native speakers of Arabic tend to use them
interchangeably. This gives another dimension to the use of both verbs, in addition to the
differences already brought out between them.
A closer look at the words reveals that the two words are not synonymous all the time. We
cannot always use the two words interchangeably. I examined all the concordances of ata and
jaa’a throughout CAC, which enabled me to identify the following three major
distinctions between them. I then made a further analysis of the concordances of jaa’a and ata
with a minimum frequency of three. The result of this further analysis will be tested later on
by the t-test statistic, as shown in table (7.5) below. Let us now look at the following uses
of both of them:
i. When ata is followed by a place it means that place is not a destination point.
(5.a.)
حتى إذا أتوا على وادي النمل قالت نملة يا أيها النمل ادخلوا مساكنكم لا يحطمنكم سليمان وجنوده وهم لا يشعرون.
h}atta idhaa ataw cala waadi al-namli qaalat namlatun ya ayyuha al-namlu
udkhulu masaakinakum…
When they came to a valley of ants, one of the ants said: ‘O you ants, get into your
habitations’ (Qur’an, Al-Naml: 18).
(5.b.)
فانطلقا حتى إذا أتيا أهل قرية استطعما أهلها فأبوا أن يضيفوهما
A full translation35 of the example in (5.a) can make the meaning clearer.
35 The translations of Qur’anic verses are taken from Al-Hilali and Khan’s The Noble Qur’an, slightly
amended to omit information which is irrelevant to the main discussion, as well as the exegetical glosses
included in the translation and marked by inverted commas or brackets. We focused only on the phrases which
contain the words under investigation, so we deleted the transliterated glosses and all extra explanatory
comments supplied by the translators for elucidation.
When they [Solomon’s army] came to a valley of ants, one of the ants said: “O you ants, get
into your habitations, lest Solomon and his troops crush you without knowing it.” (An-Naml:
18)
The ants’ colony was not meant to be the destination point for Sulayman and his army, nor
did they stay there for a long time. The whole army was only passing by the colony when
Sulayman heard the ant warning the rest of the colony of an imminent destruction by
Sulayman and his army.
Example (5.b) is part of the story of Moses and Al-Khidr, in which Moses set out on a journey
in search of that knowledgeable person. After Moses had found him, Al-Khidr started teaching
him a series of practical lessons. They then passed by a town, which was not their final
destination, where they grew hungry, so they asked its people for food, but the people of that
town refused to host them.
Conversely, the place that follows jaa’a is meant to be a destination point where one can stay
for a longer time or forever, so it gives a sense of stability.
(5.c.)
صلى النبي صلى الله عليه وسلم العشاء، ثم جاء إلى منزله، فصلى أربع ركعات، ثم نام.
(5.d.)
وسيق الذين اتقوا ربهم إلى الجنة زمرا حتى إذا جاؤوها وفتحت أبوابها
In (5.c) the Prophet returned to his house after performing his prayer, in order to sleep. His house is
therefore an end point, as he did not intend to go on to any other place. In (5.d)
paradise is the final abode of the pious, so when they come to it they will live therein
forever.
ii. When jaa’a is followed by an event, it means that the event has been awaited or expected.
(6.a.)
إذا جاء نصر الله والفتح
(6.b.)
فإذا جاء وعد الآخرة ليسوءوا وجوهكم وليدخلوا المسجد
In (6.a) the conquest of Makkah and the victory over the disbelievers of Makkah were
things which the Prophet and all Muslims were longing for. They had been expelled from
their own hometown, Makkah, without just cause and had left everything behind. In addition,
since the advent of Islam, they were prevented from performing their pilgrimage to the Holy
House to fulfil the duty which Allah had imposed upon them. Likewise, the example in (6.b)
is mentioned in the context of the conflict between Muslims and the Jews where Allah
promises the Muslims to return to their mosque and defeat the Jews in the end. Actually,
freeing Jerusalem and the Al-Aqsa mosque is the dream of all Muslims; they are all waiting
for Allah’s promise to come.
On the other hand, ata is associated with things that happen unexpectedly. For example,
(6.c.)
قل أرأيتكم إن أتاكم عذاب الله أو أتتكم الساعة أغير الله تدعون
(6.d.)
حتى إذا أخذت الأرض زخرفها وازينت وظن أهلها أنهم قادرون عليها أتاها أمرنا ليلا أو نهارا فجعلناها حصيدا كأن لم تغن بالأمس.
In (6.c & d) the events are not expected, because Allah keeps such things hidden so that every
person is rewarded for what he does, and people are not aware of what is hidden from them.
iii. jaa’a means ‘arrive’, as shown in (5.c & d) above, whereas ata has a sense of
approaching a place or a time.
(7.a.)
أتى أمر الله فلا تستعجلوه
Allah’s command is the Last Day (the Day of Judgement). The past-tense verb apparently
contradicts the situation, since that Day has not yet come, but ata here rather means ‘approached’.
حتى إذا أتوا على وادي النمل قالت نملة يا أيها النمل ادخلوا مساكنكم لا يحطمنكم سليمان وجنوده وهم لا يشعرون.
h}atta idha ataw cala waadi al-namli qaalat namlatun ya ayyuha al-namlu udkhulu
masaakinakum…
‘When they came to a valley of ants, one of the ants said: “O you ants, get into your
habitations”’ (Qur’an, Al-Naml: 18).
The English translation by Dr. Muhsin Khan and Dr. Muhammad Al-Hilali given below
renders ata as ‘at length … came’, which is a close interpretation of the meaning of ‘come’
in this context. Indeed, Sulayman and his army had not yet reached the ants’ colony; they
were still at its outskirts, because one of the ants asked the rest of the ants to go inside their
colony.
More interestingly, the slight change in the contextual use between ata and jaa’a in the
following three verses can bring out the subtle difference between them. Transliteration is
provided for the underlined Arabic words. We also marked the similar parts throughout the
following three examples with square brackets.
(8.a.)
قال لأهله امكثوا إني آنست نارا لعلي آتيكم منها بخبر أو جذوة من النار لعلكم تصطلون. فلما أتاها
qaala li-ahlihi imkuthuu innii aanastu naaran lacallii aatiikum minhaa bi-
khabarin… falammaa atahaa nuudiya…
‘[(i) He said to his family: “Tarry you;] [(ii) I perceive a fire;] [(iii) perhaps I can
bring you from there some information, or a burning firebrand, that you may warm
yourselves.”] [(iv) But when he came to the (fire)], [(v) he was called] from the right
bank of the valley, from a tree in hallowed ground: “O Moses! Verily I am Allah”’
(Qur’an, Al-Qasas: 29-30).
(8.b.)
فقال لأهله امكثوا إني آنست نارا لعلي آتيكم منها بقبس أو أجد على النار هدى فلما أتاها نودي يا موسى إني أنا ربك فاخلع نعليك إنك بالواد المقدس طوى
faqaala liahlihi imkuthu innii aanastu naaran lacallii aatiikum minha bi-qabasin…
falamma ‘aataaha nuudiya…
‘So [(i) he said to his family, “Tarry you;] [(ii) I perceive a fire;] [(iii) perhaps I can
bring you some burning brand therefrom, or find some guidance at the fire.”] [(iv)
But when he came to the fire], [(v) he was called:] “O Moses! Verily I am thy Lord!
therefore put off thy shoes: thou art in the sacred valley
Tuwaa”’ (Qur’an, Taha: 10-11).
(8.c.)
قال موسى لأهله إني آنست نارا سآتيكم منها بخبر أو آتيكم بشهاب قبس لعلكم تصطلون فلما جاءها نودي أن بورك من في النار ومن حولها وسبحان الله رب العالمين
As shown above, we have three verses from different surahs (chapters) relating the story of
Moses when he saw the fire where Allah talked to him. The story is worded differently
in these three surahs, because every verse tells one aspect of the story. We can notice that
there are similar parts in each verse (as marked in i-v). The remaining parts make the
meanings of the three verses different from one another. For example, in (8.a) and (8.b)
Moses asked his family to wait until he had gone to see the fire. The verb ata ‘bring’, marked
iii in both verses, is used in the subjunctive form to express a wish whose fulfilment is
uncertain. In (8.c) Moses does not ask his family to wait, and the verb ata ‘bring’ is used in
the near future form, which expresses certainty. The fireplace was so remote that he had to
promise his family not to give up. He intends to do his best to get some information from the
people around the fire or to get a burning brand from it so that they can warm themselves.
This is to reassure his family even if the fire is far away or he takes a long time. Therefore
he used the verb without a modal of probability.
Most importantly, in (8.a) and (8.b) ata is used to indicate that Moses is still far from the
actual fireplace, because the call following ata in (8.a) comes from the bank of the valley,
and (8.b) mentions that he is in the sacred valley and has not arrived at the fireplace. But in
(8.c) the call implies that Moses has arrived at the fireplace, because Allah says ‘blessed are
those in the fire and those around’; ‘those in the fire’ refers to Moses and ‘those around’ to
the angels, as al-Razi said. In Arabic one can use the preposition fi ‘in’ to mean absolute closeness.
So, jaa’a ‘come’, marked [iv] in (8.c), is used to relate the final part of the story, after Moses’
arrival at the fireplace, where he talked to Allah. It is interesting to note that the two verses
(8.a & 8.b) employ the verb ata ‘come’, marked [iv] in both, to refer to a degree of nearness to
the fireplace, whereas their equivalent in (8.c) uses jaa’a to describe a state of absolute
closeness.
One more piece of evidence that supports the above argument is that the word Allah occurred
in object position with ata 7 times and did not occur at all with jaa’a. Let us consider the
following example,
يوم لا ينفع مال ولا بنون إلا من أتى الله بقلب سليم
yawma la yanfacu maalun wa la banuun illa man ata allaaha biqalbin saliim.
The Day whereon neither wealth nor sons will avail, except him who came to Allah
with a clean heart.
ata is used in the above example because Allah is not limited to a place, nor can vision grasp
Him, so no one can come to a point of closeness to Allah’s entity as one can to a physical object.
Let us now use the t-test to show what sort of differences hold between jaa’a and ata, as
shown in table (7.5) below. To do the test it is better to keep to one sense of the words
under investigation.36 We will analyse the most significant left collocates, i.e. items with the
highest MI scores. As mentioned earlier, we will restrict ourselves to analysing only one sense of
the pair, namely ‘come’.
36 Customarily, the test can be done generally without restricting it to one sense of the words under
investigation. In my view, it is easier to do the calculation inside a closed set, as a shortcut to
quick results, i.e. I will search for the items whose MI scores are significant and which co-occurred with
jaa’a and ata in a particular sense.
Gloss Collocate Freq jaa’a ata t-score p
second ثانيا 111 4 1 1.34 P<0.20
the Prophet النبي 6777 18 169 11.04 P<0.0001
torment عذاب 401 1 29 5.11 P<0.0001
Allah الله 168 0 7 2.64 P<0.01
disbelief الكفر 406 0 4 2.00 P<0.05
Syria الشام 520 0 4 2.00 P<0.05
calamity بأس 404 0 3 1.73 P<0.10
soothsayer كاهن 64 0 3 1.73 P<0.10
command أمر 537 4 10 1.60 P<0.20
falsehood باطل 94 1 4 1.34 P<0.20
Gabriel جبريل 386 5 5 0 Not sig.
Table (7.5): The ten most significant left collocates37 of jaa’a
(the top ten words) and ata (the last ten words).
The t-scores in table (7.5) show the differences between jaa’a and ata; the former has a
strong tendency to occur in positive contexts, whereas the latter has a negative tendency. The
bigger the t-score, the more different the pair under examination. jaa’a gets the highest
scores with the following positive items: al-cilm ‘knowledge’, al-h}aq ‘the truth’, and
al-bayyinat ‘clear proofs’. On the other hand, ata frequently co-occurs with negative items:
cadhab ‘torment’, al-kufr ‘disbelief’, ba’s ‘calamity’, kaahin ‘soothsayer’38, ‘amr ‘command’
(meaning difficulty or torment), and baat}il ‘falsehood’. The highest scores of jaa’a and ata
in the table show that the items with these scores are the most likely to differ from each other.39
Therefore, ata and jaa’a, as shown in table (7.5) above, are not synonymous, because they are
used in different ranges of contexts.
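The scores in table (7.5) can be reproduced from the two co-occurrence frequencies alone with the approximation t = (f1 − f2) / √(f1 + f2), where f1 and f2 are the frequencies of a collocate with jaa’a and with ata. The formula is not stated in this section, so treating it as the computation behind the table is an assumption; the sketch below simply applies it to four rows copied from the table.

```python
import math


def t_score(freq_a, freq_b):
    """Approximate t-score for the difference between two co-occurrence counts.

    Assumed formula: (f_a - f_b) / sqrt(f_a + f_b).
    """
    total = freq_a + freq_b
    if total == 0:
        return 0.0  # no evidence either way
    return (freq_a - freq_b) / math.sqrt(total)


# (gloss, frequency with jaa'a, frequency with ata), copied from table (7.5).
rows = [
    ("the Prophet", 18, 169),
    ("torment", 1, 29),
    ("command", 4, 10),
    ("Gabriel", 5, 5),
]

for gloss, f_jaaa, f_ata in rows:
    # The table reports magnitudes only, so the sign is dropped here.
    print(gloss, round(abs(t_score(f_jaaa, f_ata)), 2))
```

Under this assumption the four rows give 11.04, 5.11, 1.6 and 0.0, matching the tabulated scores; a collocate shared equally by both verbs, such as Gabriel, contributes nothing to the contrast.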
Two points might seem to contradict the above conclusion. In the first place, the positive
use of ata in table (7.3), as in ya’ti ‘comes’ followed by khayr ‘good’ or h}aqq ‘truth’, is not
considered strong evidence, because these collocates are only used with ata in its present tense
form. We think there might be a morphological reason why ata in its present tense form is used
in both negative and positive senses: jaa’a in its present tense form, i.e. yajii’, is not as easy to
pronounce as ata. jaa’a in its present form occurs 181 times whereas the corresponding form of ata
occurs 980 times, so jaa’a in its present form is about five times less common in CAC than ata in
its present form. Secondly, the high t-scores in table (7.5) with al-nabiyy ‘the prophet’ (11.04)
and waqt ‘time’ (4.91) are not significant for this distinction because both collocates are
neutral, so they fall in the area of overlap between ata and jaa’a, as indicated in (7.6) below.
Analysing the concordances of jaa’a and ata with a minimum frequency of 1 can show their
tendency to occur in negative or positive contexts. Further examples from CAC show that
ata is overwhelmingly used in unpleasant contexts. The main collocates concern committing
sins, trouble, and falsehood. Figure (7.6) below shows the contextual preferences of both of
them.
Figure (7.6): The collocational differences of jaa’a and ata with minimum frequency 1.
Native speakers of Arabic are themselves unaware of these collocational differences
between jaa’a and ata. The only difference brought out by Al-Askari, who belongs to the
Classical period, in Al-Furuuq is that ata requires a complement. For example,
came the-man self-him
The man arrived himself.
Otherwise they can replace each other without any loss of meaning. This is not consistent
with Al-Askari’s proposition that a difference in form must produce a difference in meaning, but
that the difference was abandoned as time passed (Al-Askari, Al-Furuuq: p. 9).
In my view, the use of jaa’a in (9.a) above is consistent with our observation that jaa’a is always
followed by the preposition ila ‘to’ before places. Therefore, the missing preposition in (9.a)
eliminates the possibility of a following category that refers to a place. So, the use of jaa’a in
(9.a) involves some sort of directional motion which implies an action not toward a place but
rather toward the speaker. On the other hand, the multiplicity of the senses40 of ata makes
leaving the complement position empty, as in (9.b) above, ambiguous.
7.4.1 Summary
To sum up, the analysis of the seemingly synonymous pair jaa’a and ata was carried out in
three stages in order to highlight the subtle differences between them. The first
stage consisted of a lexical search for all occurrences in CAC of the tokens jaa’a and ata. The
second stage involved categorising the tokens syntactically and according to their
frequency; this included manual elimination of all irrelevant hits. In the third stage, we used
MI to highlight the collocations of both. We then highlighted some distinctions
between the two items by analysing their contexts. Finally, we used the t-test to capture the
subtle differences between the pair by extracting a semantic feature which can differentiate
between them, i.e. negativity vs. positivity.
40 The senses of jaa’a are all related to directional motion, whereas the senses of ata are diverse, and some
of them are metaphorical or euphemistic.
7.5 A case study: the word pair ithm and dhanb ‘sin’
ithm and dhanb ‘sin’ are commonly treated as synonymous, as shown in the dictionary
definitions in (7.7) below. A cursory look at the two words reveals that the two Arabic
nouns ithm and dhanb, which have a similar semantic and syntactic form and also a broadly
similar frequency (645 vs. 917 word forms), have been used in CAC to mean ‘committing a
bad deed’ in general.
In the last section we used a range of two words on either side of the node to get an
understanding of the contextual distribution of a given pair. In Arabic, as mentioned in
chapter five, some differences might be overlooked within that short range because of the
syntactic structure of Arabic. Indeed, a small window does not seem effective for languages with
many non-adjacent complements, which result in non-adjacent collocations, as shown in table
(7.8) below. We searched for both ithm and dhanb in a span of 3:3 and the result was
inconclusive for both. So, let us consider the top ten collocates of the pair under
investigation to see how insufficient a span of 3:3 is, as indicated in the following table.
In the table above we could not find any statistically significant collocation for ithm except
for one item: al-maysir ‘gambling’, which will be analysed in table (7.9) below. The span of
3:3 failed to capture the non-adjacent complements. To take an example, the underlined words in
the table above are part of a Qur’anic verse about performing pilgrimage, which reads,
فمن تعجل في يومين فلا إثم عليه ومن تأخر فلا إثم عليه
faman tacajjala fi yawmayni fala ithm calayhi wa man ta’akhar fala ithm calayh.
so-who haste-becomes in days-two so-no sin on-him and who late-becomes so-no
sin on-him.
But whosoever hastens to leave in two days, there is no sin on him and whosoever
stays on, there is no sin on him. (Qur’an, al-Baqarah: 203)
The verb tacajjala ‘hastens’ does not appear in the table as a collocate of ithm because it
comes fourth to the right of the node, so the span of 3:3 failed to capture it.
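The effect of the span size on this verse can be checked mechanically. The sketch below, a hypothetical helper written for illustration, takes the transliteration of the verse given above and lists the words that fall within a span of 3 and a span of 7 around the first occurrence of ithm. In left-to-right transliteration tacajjala stands four positions before the node, which corresponds to the right of the node in Arabic script.

```python
def within_span(tokens, node_index, span):
    """Return the tokens lying within `span` positions of the node."""
    lo = max(0, node_index - span)
    hi = min(len(tokens), node_index + span + 1)
    return [tok for k, tok in enumerate(tokens[lo:hi], start=lo) if k != node_index]


verse = ("faman tacajjala fi yawmayni fala ithm calayhi "
         "wa man ta'akhar fala ithm calayh").split()
node = verse.index("ithm")  # first occurrence of the node word

print("tacajjala" in within_span(verse, node, 3))  # span 3:3
print("tacajjala" in within_span(verse, node, 7))  # span 7:7
```

The first check fails and the second succeeds, which is the behaviour described above: the verb is only captured once the window is widened beyond three words.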
As mentioned in section 5.2, we chose to work with flexible spans, since there might be some
expressions in Arabic that stretch beyond the average span of 4:4. In addition, it is hard to
capture in that span the semantic features that stretch over several units not included in our span41.
Further analysis of the pair under investigation using the MI statistic and a bigger span
(7:7) shows more interesting collocations which are statistically significant. Therefore the
span size for this study is set to 7:7, i.e. seven word forms to the left and to the right.42 For
example, some verbs co-occur more often with one item than the other, as in table (7.9).
41 In fact, taking a span of 3 or 4 words can reveal some interesting differences between the collocates of ithm
and dhanb. However, in the first place, if we stick to such a span in this case, we will overlook many interesting
collocations. Secondly, we would like to make use of all possible collocations in our relatively small corpus.
42 The maximum span I can handle automatically using the Monoconc tool is 3:3. So, I use Microsoft Word
to capture the items that lie beyond that span. I run the concordance first and save the result into a text-only
file, then I use Microsoft Word to count the hits which I consider relevant to my search term, such as an adjective
modifying it which is located far apart in the line.
(y) f(x, y) F(y) MI
increase يزداد 12 200 8.86
earn يكتسب 5 179 7.75
carry احتمل 15 542 7.74
record يكتب 5 928 5.83
incur يوجب 4 615 5.65
Table (7.9) The top ten verb collocates of ithm & dhanb in a span of 7 words.
The most significant V+N collocates of dhanb, i.e. hits with high MI scores as shown in the
table above, are astaghfiru ‘I ask for forgiveness’, yudhnib ‘sin’, taqaddama ‘precede’,
ictarafa ‘confess’ and yatuub ‘repent’. As for ithm, tabuu’ ‘bear’, yatacammad ‘intend’,
yaftari ‘allege’, yaksib ‘gain’, and yajtanib ‘avoid’ are the strongest collocates.
According to our approach, there must be a difference in meaning between these two items
because of their contextual differences. The first question which can be raised, then, is why
such collocates appear more frequently with either item. For example, ‘ask for forgiveness’
and ‘repent’ often co-occur with dhanb. In the Islamic creed sins can be forgiven, whether
venial or deadly, except blasphemy against Allah. But how can one attain forgiveness?
Sins can be forgiven by doing good deeds and/or repentance. There are sins which can only
be forgiven through repentance when a fault is done against people. Then there must be an
extra action: compensating those who have been wronged or obtaining their forgiveness.
Therefore, we can say that every repented sin can be forgiven: venial sins by the act of inner
repentance alone (by asking for forgiveness or, in practice, by doing good deeds and refraining
from bad deeds), and mortal sins by repentance expressed through compensation of, or
reconciliation with, those whom one has wronged. On the other hand, collocations of ithm do not
reveal how sins are expiated. Rather, they refer to a state of accumulation of such sins.
Let us now examine all other types of collocations as shown in (7.10) below to support or
eliminate that distinguishing feature.
n = 5,000,000; x (ithm/w) = 645; x (dhanb/w) = 917
(y) (ithm/w) f(x, y) F(y) MI
changing تبديل 15 21 12.43
staying on التأخر 99 178 12.07
Table (7.10) The most significant noun collocates with ithm (the top ten words) and dhanb
(the last nine words) with minimum frequency 3.
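The MI values in the table can be checked with a short calculation. The sketch below assumes the usual form of the statistic, MI = log2(f(x, y) · N / (f(x) · F(y))), where N is the corpus size; the function name mutual_information is ours, and the assumption that this is exactly the formula applied in this study rests on the fact that it reproduces the score given for ‘staying on’ with ithm.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """MI score: log2 of observed co-occurrence frequency over the
    frequency expected if node and collocate were independent."""
    return math.log2(f_xy * n / (f_x * f_y))


N = 5_000_000   # corpus size given with the table
F_ITHM = 645    # frequency of the node ithm

# 'staying on' (al-ta'akhkhur): 99 co-occurrences, 178 occurrences overall
mi = mutual_information(99, F_ITHM, 178, N)
print(f"{mi:.2f}")  # 12.07
```

A rare collocate that occurs almost exclusively near the node therefore receives a very high score, which is why a minimum frequency is imposed alongside MI.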
The underlined words in the table above are parts of multi-word religious concepts which are
frequently used in CAC. They read as follows: تبديل الوصية ‘changing the will’, التأخر أو التعجل في الحج
‘staying on or hastening to leave in pilgrimage’, ‘… position of someone’, الكذب على الله ‘lying to
Allah’, منع الزكاة ‘not paying charity’, نقض الميثاق ‘breaking treaties’.
The table reveals that ithm is mainly used for sins that are personal or do not entail a
punishment in this world, like missing some obligatory acts of worship or doing a bad deed
that rebounds on oneself, like drinking, gambling, lying to Allah, etc. On the other hand, dhanb
is used for sins that entail punishment in this world or the next, e.g. killing, theft, adultery,
etc.
dhanb
1) Doing an act that causes harm to others.
2) Doing an act which is considered illegal.
3) Committing a major sin.
4) Doing an act that might entail punishment in this world.
The items of the pair are not absolute synonyms, though they share the same range of
application (both refer to committing a bad deed in general). However, a subtle difference
will emerge by using the t-score statistic.
T-score can tell us how much difference exists between ithm and dhanb by comparing the
frequency with which each word of the pair co-occurs with a given collocate.
This will help us find out each word’s preferential usage. We will then be able to abstract,
from the differences that emerge, the main attributes that distinguish the two words.
(y) f(ithm/w) f(dhanb/w) T p
changing تبديل 15 0 3.87 P < 0.01
staying on التأخر 99 0 9.94 P < 0.0001
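The t-scores in the rows above are consistent with a common form of the statistic for contrasting two near-synonyms, t = (f1 - f2) / sqrt(f1 + f2), where f1 and f2 are the frequencies with which a collocate occurs with each word of the pair. The sketch below assumes this form; the function name contrast_t_score is ours. For a collocate found 99 times with one word and never with the other it yields 9.95, against the 9.94 printed in the table, which presumably reflects truncation.

```python
import math


def contrast_t_score(f_a, f_b):
    """t-score contrasting how often one collocate occurs with word A
    (f_a) and with word B (f_b). Positive values favour A."""
    if f_a + f_b == 0:
        raise ValueError("the collocate must occur with at least one word")
    return (f_a - f_b) / math.sqrt(f_a + f_b)


# A collocate found 15 times with one word and never with the other.
print(round(contrast_t_score(15, 0), 2))  # 3.87
```

The sign of the score shows which member of the pair the collocate prefers, and its size shows how reliable that preference is given the number of observations.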
As we can see in table (7.11), the high t-scores go with ithm when describing one’s own
actions that bring harm to oneself, such as gambling, drinking wine, eating dead or
unslaughtered animals or birds, or missing an obligatory act of worship, like staying on or
hastening to leave in pilgrimage. In addition, it describes actions whose results will only
affect one’s abode in the hereafter, such as missing prayers, not paying charity, etc. On the
other hand, dhanb gets the highest t-scores45 with actions that bring harm to other people, like
murder, theft, etc. So, we can say that ithm is intrinsic whereas dhanb is extrinsic.
He said: My Lord! I have killed a man among them, and I fear that they will kill me.
Al-Qasas: 33
2) or with committing suicide, such as ‘the ithm ‘sin’ that recurs when some people starved
themselves until they died’.
Al-khamr ‘Drinking wine’ is considered a major sin in Islamic belief. However, it collocates
45 Although the scores are significant for only a few examples of dhanb, we can still draw a conclusion. This is
the most we can get out of our corpus. Hopefully, as the corpus is enlarged in the future, more examples will
appear.
with ithm not dhanb. An interpretation given by Al-Asfhani in his Mucjam (Lexicon) explains
why it is considered an ithm. He said that drinking might prevent the drinker from performing
the obligatory acts required of him in order to enter Paradise; so he is harming himself when
missing such acts, which are accordingly described as ithms.
Zinaa ‘Fornication/adultery’ co-occurs more often with dhanb than with ithm. This indicates
that it is not, as some people might think, a personal action that does not affect others.
Indeed, the consequences of fornication are grievous and can harm the whole society through
unwanted pregnancy or abstention from marriage, the right soil for procreation.
Kufr (unbelief) is described sometimes as ithm and sometimes as dhanb. It is, in the first place,
something between a man and his God; it is something that rests in one’s heart. But if this
behaviour is publicised, i.e. if it contradicts the mainstream of the society, it will spoil the unity
of this society; it will then be called riddah ‘apostasy’. Therefore it becomes a public menace
and should be controlled.
7.5.2 Summary
The MI tests conducted in this section for ithm and dhanb show that these two words
significantly collocate with negative actions. They both describe one’s bad deeds in religious
terms. The most significant collocations of ithm refer to sins which involve harming oneself,
such as drinking intoxicants, missing some obligatory prayers, etc. On the other hand, dhanb
significantly collocates with sins which involve harming others, such as unjustly taking
people’s property, killing, etc.
The T-test highlighted an interesting difference between the two words by comparing all
occurrences of both words with high MI scores. The T-test tables affirmed that the semantic
feature that distinguishes ithm from dhanb is intrinsic vs. extrinsic.
7.6 A case study: The word pair h}asiba and z}anna ‘think’
The seemingly synonymous pair, h}asiba and z}anna, which mean ‘think’, will be
examined below to extract other semantic features. The dictionary meaning is given in Table
(7.12) before doing our corpus-based analysis.
Table (7.12) definitions of h}asiba and z}anna in four Arabic dictionaries
In the first place, the dictionary meanings of the pair as shown above give the denotation of
the words under investigation, which simply refers to (1) uncertainty or probability, (2)
suspicion and (3) certainty. Meanings (1) and (3) seem contradictory because it would be
confusing to have a word meaning something and its opposite. In the second place, the
dictionaries presume that the pair is synonymous by not giving a definition of h}asiba but
rather defining it solely by means of a synonym. The near-synonymous pair h}asiba and
z}anna are used to define each other in many dictionaries. In addition to the similarities in
meaning, these two verbs are seemingly syntactically parallel: both are ditransitive verbs, i.e.
they can have two direct objects (like give), and they may have an intransitive usage as well.
They can also be nominal modifiers, undergo nominalisation and occur with
subordinate clauses. Let us consider the following examples in (10a-12b) below.
(10.a.)
ظنّ عمرو بكرا خالدا
(10.b.)
ومن يغترب يحسب عدوا صديقه
(11.a)
وأكبر ظني أنه قال: “شهادة الزور”....
(11.b)
... ما أعد من عقابه لأهل معصيته بحسبانهم أنهم فيما أتوا من معاصي الله مصلحون...
(12.a.)
...ظن قوم أن سم الصلة خاصة بارد..
(12.b.)
يَحْسَبُ أَنَّ مَالَهُ أَخْلَدَهُ
In Arabic grammar books, h}asiba and z}anna are called ‘afcaal al-quluub’ (heart verbs)
and they function as nawaasikh46. They are both used to mean certainty and probability but
probability is more likely to be the dominant case (Mubarak, 1982: 180).
As shown above, it seems that z}anna and h}asiba can be used interchangeably. Since corpus
data revealed some important and subtle differences between the previous pairs of synonyms
that are hard to recognise solely by intuition, we need to apply the same methodology to
46 nawaasikh are verbs that assign case endings for the first two nouns that follow. Some verbs like kaana
assign a nominative case for the first noun, the subject, and an accusative case for the second, the complement.
z}anna assigns an accusative case for two objects that follow, apart from the subject which is always in
nominative case.
examine whether the pair under investigation are synonyms or not. This can help us extract
the contrasts or subtle differences that pertain to their collocational distribution.
The statistics47 drawn from the corpus show that for z}anna the most frequent word
form is the nominal form z}ann. There are 1254 instances of that particular form, which is
53.59% (1254/2340) of the total. The second most frequent is the verb z}anna, with 1086
hits, which is 46.41% (1086/2340). This near-even split suggests that z}ann as a noun and
z}anna as a verb are both central items to learn. But h}asiba is almost exclusively used as a
verb, with only one occurrence in nominal form.
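The shares of the two forms follow directly from the counts quoted above. A minimal sketch of the arithmetic, rounded to two decimals (the dictionary keys are labels of our own choosing):

```python
# Occurrences of the two forms of the search term in CAC, as quoted above.
counts = {"z}ann (noun)": 1254, "z}anna (verb)": 1086}

total = sum(counts.values())
shares = {form: round(100 * count / total, 2) for form, count in counts.items()}

print(total)   # 2340
print(shares)  # {'z}ann (noun)': 53.59, 'z}anna (verb)': 46.41}
```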
We will examine the first left and right collocates of z}ann as nominal modifiers below,
particularly those collocates that function as adjectives or genitives. In order to analyse the
significant collocations of this word, we first retrieved all collocations with
minimum frequency 3. Secondly, we discarded all insignificant collocations, i.e.
combinations with MI scores lower than 1. Thirdly, we manually eliminated collocations
other than adjectives and genitives. Table 7.13 below shows all adjective and genitive
collocates of z}ann with minimum frequency 3.
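The three filtering steps just described can be sketched as follows. This is only an illustration: in this study the third step was carried out manually, so the part-of-speech labels below are our own annotation, and the three entries whose names begin with example_ are invented to show each filter at work. The names Candidate and significant_modifiers are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    freq: int   # co-occurrence frequency with the node
    mi: float   # MI score
    pos: str    # part of speech (assigned by hand in this study)


def significant_modifiers(candidates, min_freq=3, min_mi=1.0,
                          keep_pos=("adjective", "genitive")):
    """Apply the three steps: frequency threshold, MI threshold,
    then keep only adjectives and genitives."""
    return [c for c in candidates
            if c.freq >= min_freq and c.mi >= min_mi and c.pos in keep_pos]


candidates = [
    Candidate("kaadhib", 10, 8.25, "adjective"),    # figures as in table (7.13)
    Candidate("faasid", 7, 8.56, "adjective"),      # figures as in table (7.13)
    Candidate("example_rare", 2, 9.00, "adjective"),  # invented: too infrequent
    Candidate("example_weak", 5, 0.40, "adjective"),  # invented: MI below 1
    Candidate("example_verb", 6, 3.00, "verb"),       # invented: wrong category
]

kept = [c.word for c in significant_modifiers(candidates)]
print(kept)  # ['kaadhib', 'faasid']
```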
z}ann (left collocates)
(x) F(x,y) MI
true ًصادقا 10 7.75
false كاذب 10 8.25
invalid فاسد 7 8.56
the era of ignorance الجاهلية 6 5.03
suspicious سيئ 6 5.75
47 We compiled the statistics after singling out all possible forms of the search term. This can easily be done
automatically with a tagged corpus. However, tagging the corpus would have taken a lot of time before
such lengthy research could begin, so we searched the corpus separately for the forms we know to be verbs and nouns.
bad سوء 40 7.92
good حسن 31 6.51
more likely أغلب 18 10.85
much كثير 6 3.05
certain مؤكد 4 10.25
suspicious سيئ 3 7.79
Table (7.13) The first left collocates of nominal z}anna (the top five words) and the right
collocates (the last six words): adjectives and genitives with minimum frequency 3.
Examining the most significant collocates of z}ann (in nominal form) as represented in
table (7.13), we find that z}ann occurred more frequently with words of negative sense.
For the negative collocates the table shows the following examples, which occurred altogether
72 times in CAC: invalid ‘faasid’, false ‘kaadhib’, the era of ignorance ‘aljaahiliyyah’,
suspicious ‘sayyi’’, bad ‘suu’’. For the positive collocates, which occurred 41 times, the table
shows the following: good ‘h}asan’, true ‘s}aadiqa’. A minority of examples are neutral,
constituting only 28 occurrences: more likely ‘aghlab’, much ‘kathiir’, certain
‘mu’akkad’.
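The three totals can be recovered by summing the frequencies in table (7.13) by polarity. In the sketch below the frequencies are those of the table, while the polarity label attached to each collocate follows the grouping given in the paragraph above.

```python
from collections import Counter

# (collocate, polarity, frequency) for the eleven rows of table (7.13);
# sayyi' appears twice, once as a left and once as a right collocate.
rows = [
    ("faasid", "negative", 7),
    ("kaadhib", "negative", 10),
    ("aljaahiliyyah", "negative", 6),
    ("sayyi'", "negative", 6),
    ("sayyi'", "negative", 3),
    ("suu'", "negative", 40),
    ("h}asan", "positive", 31),
    ("s}aadiqa", "positive", 10),
    ("aghlab", "neutral", 18),
    ("kathiir", "neutral", 6),
    ("mu'akkad", "neutral", 4),
]

totals = Counter()
for _, polarity, freq in rows:
    totals[polarity] += freq

print(dict(totals))  # {'negative': 72, 'positive': 41, 'neutral': 28}
```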
However, we cannot draw any conclusive description of z}anna before studying the other
form (verb). Let us now examine the left collocates48 of z}anna (verb) with the same
procedure taken above in table (7.13) as shown in table (7.14) below.
(x) F(x,y) MI
48 In Classical Arabic, the canonical structure of a sentence is VSO. The alternative basic order which is SVO is
also possible provided that we have a good reason like emphasis. Also, we may have a fronted object as in
iyyaaka nacbudu ‘You-(alone) we-worship’ (surah Al-Fatihah: 5). According to the Arabic grammar you can
say, nacbuduka ‘worship-You’ but a pronoun referring to Allah preceded the verb to exclude any other one from
the act of worshipping. So we examined the left collocates of z}anna (V) because (1) analysing the right
collocates could mislead us by counting items relating to other verbs and (2) most of the items which modify the
Arabic verb fall on the left-hand side.
suspicions الظنون 18 10.22
untrue غير الحق49 14 10.10
false كاذب 5 6.61
bad سوء 3 2.70
death الموت 3 2.02
As shown above, z}anna is mainly used negatively. As for h}asiba (which is mainly used
as a verb in CAC), we have collocates like the following:
(x) F(x,y) MI
eternity for the martyrs قتلوا في سبيل الله أموات 18 12.96
entering Paradise تدخلوا الجنة 9 7.92
safety from torment مفازة من العذاب 6 13.43
good خير 4 4.22
truth الحق 3 3.23
(13.a.)
وَيَطُوفُ عَلَيْهِمْ وِلْدَانٌ مُّخَلَّدُونَ إِذَا رَأَيْتَهُمْ حَسِبْتَهُمْ لُؤْلُؤًا مَّنثُورًا
(13.b.)
وهم يحسبون أنهم بفعلهم ذلك مصلحون
We can further support the hypothesis of negativity and positivity of z}anna and h}asiba
by also searching their occurrences in CAC with a frequency lower than 3. For the sake of
brevity we will not show all these occurrences; instead, we add their total number to the
total number of the negative and positive occurrences of z}anna and h}asiba with
minimum frequency 3. Motivated by the collocational analysis above, we are now able to
look at all the occurrences of z}anna and h}asiba (without designating a particular
threshold) to draw out some differences between them in terms of their semantic features.
The result is given in table (7.16).
We can notice that the neutral sense50 of z}anna and h}asiba is dominant. However, we
can still say that z}anna is used more negatively than positively whereas h}asiba shows a
tendency to be more positive. Most of the negative occurrences of h}asiba, unlike
z}anna, are negated51. For example:
50 Examples 10a and 10b mentioned earlier are good examples of the neutral sense of z}anna & h}asiba.
51 The negative forms are underlined in the Arabic text along with the transliteration.
(14.a.)
ولا تحسبن الله غافلاً عما يعمل الظالمون
(14.b.)
فلا تحسبن الله مخلف وعده رسله
One more item of evidence is that one of the dictionary meanings of z}anna, as mentioned
earlier, is ‘suspicion’, and this sense is mainly negative in Arabic. On the other hand,
h}asiba is used in the context of praise, as in the following hadith:
(15.)
إذا كان أحدكم مادحا أحدا فليقل أحسب فلانا هكذا.
We can also wonder why the difference in frequency between z}anna and h}asiba is so
great. Two explanations can be given here. The first explanation is that the non-occurrence of
h}asiba in the nominal form could be for morphological or phonological reasons, since
z}anna can be used as a verb and as a noun alike. To change h}asiba into a noun, one
morphological process (step 1 below) and two phonological processes (steps 2 and 3)
have to take place: 1) adding a suffix, which is (aan) in this case, 2) changing the first vowel
into u and 3) deleting the second vowel. So the output form is h}usbaan. Secondly, we can
say that the less frequent word (h}asiba) is used in a more restricted sense, perhaps mainly
in particular contexts only. To see whether this is the case, we can start by looking at the
distribution of the pair across the Qur’an subcorpus and comparing the two.
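The three steps that derive h}usbaan can be traced with a toy sketch that operates on the transliteration only. It assumes that the suffix is attached to the stem h}asib (the verb without its final vowel) and that a, i and u are the only vowel symbols; the function name nominalise is ours, and the sketch makes no claim about Arabic morphology beyond this one form.

```python
VOWELS = "aiu"


def nominalise(stem, suffix="aan"):
    """Apply the three steps to a transliterated stem:
    1) add the suffix, 2) change the first vowel to u,
    3) delete the second vowel."""
    positions = [i for i, ch in enumerate(stem) if ch in VOWELS]
    if len(positions) < 2:
        raise ValueError("the stem needs at least two vowels")
    first, second = positions[0], positions[1]
    form = stem + suffix                            # step 1: h}asibaan
    form = form[:first] + "u" + form[first + 1:]    # step 2: h}usibaan
    form = form[:second] + form[second + 1:]        # step 3: h}usbaan
    return form


print(nominalise("h}asib"))  # h}usbaan
```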
z}anna is mentioned in the Qur’an 55 times, of which 6 are in nominal form, whereas
h}asiba occurs 43 times, entirely as a verb. This search in the Qur’an subcorpus shows us how
similar z}anna and h}asiba are in terms of frequency, with 49 occurrences of z}anna as a
verb and 43 of h}asiba. Such similarity in frequency in the Qur’an subcorpus
provides comparable data for analysis.
Sometimes z}anna is used in the Qur’an to mean ‘certainty’ and at other times ‘doubt’. So it
refers to two contradictory senses: the thing and its opposite. All commentators on the Qur’an
give two contradictory meanings to z}ann. They treat z}ann as a polyseme that has two
different meanings, where ‘different’ here means opposite. One commentator, Al-Tabari,
gives more examples from Arabic to strengthen his point of view; he mentions al-sudfa, which
means both darkness and light, and al-s}ariikh, which means both the rescued and the rescuer.
He is not actually the only one who is in favour of this approach. Ibn Al-Anbari compiled a book
called Al-Ad}daad ‘The Opposites’ in which he collected all homophones with opposite
meanings, the foremost of which was z}ann.
On the other hand, some linguists denied the existence of this phenomenon in Arabic52, like
Ibn Durustwayh, who compiled a book called Ibt}aal Al-Ad}daad (Refuting the book of
Opposites) in which he rejected that approach because it contradicts the wisdom of the Arabs
(Al-Suyuti, Al-Muzhir: 400).
52 Some other linguists give two explanations for the existence of such a phenomenon:
1. Broadening of meaning, such as al-s}areem, which literally means ‘the separated’ and is used to mean the
night because it is separated from the day; the same applies to the day, which is separated from the night.
Al-sudfa, which means both light and darkness, can be explained in the same way: al-sudfa originally means
‘to hide’, so when darkness comes it hides the light of the day and when light comes it hides the darkness of
the night.
2. Dialectal variation: for instance, al-jawn means black in Tamim’s dialect and white in Qays’.
One more reason, which is not mentioned in Al-Muzhir, can be added to the above explanations: narrowing. For
example, al-ma’tam, which originally meant a gathering of men and women for a sad or a merry occasion, was
later limited to the sad occasion. Therefore oppositeness can no longer hold between homophonous words.
Now let us come back to the subject matter of this chapter by looking at z}ann, which is
often regarded as a polyseme that has two opposite meanings, ‘doubt’ and ‘certainty’. Some
commentators are inconsistent: Mujahid, for instance, says that whenever z}ann is mentioned
in the Qur’an it means certainty, yet he interprets z}ann in some verses as meaning doubt.
The selection of meaning, they presume, depends entirely on context. For example, z}ann in
the following two verses means certainty in (16.a) and doubt in (16.b).
(16.a)
واستعينوا بالصبر والصلاة وإنها لكبيرة إلا على الخاشعين. الذين يظنون أنهم ملاقو ربهم وأنهم إليه راجعون.
Allah described the true believers as those who have z}ann that they will meet Allah and
that they will return to Him. This is a matter of belief. If they were doubtful, they would not
be called believers.
(16.b.)
ومنهم أميون لا يعلمون الكتاب إلا أماني وإن هم إلا يظنون.
upon false desires and they but guess. (The Noble Qur’an, 2: 78)
z}anna is translated above as guess. The verse in (2) talks about some Jews who are
illiterate and do not know the reality of their book; however, they follow their scholars
blindly and believe them. This is a different category from those who know the truth and
falsify it mentioned in the verse preceding it (2: 77). So if we interpret z}anna as doubt or
guess as commentators say, we presume that that second category of Jews who do not know
the reality of their book do not believe in it. But this is not the case since this category is
blindly following their scholars and this is a type of belief. We would rather say there are
some Jews who only know the false version of the Bible and they are certain about what they
believe even if it is false.
The inconsistency of the interpreters of the Qur’an, and later of the translators, created
considerable confusion when assessing the following verse.
(17)
وذا النون إذ ذهب مغاضبا فظن أن لن نقدر عليه فنادى في الظلمات أن لا إله إلا أنت سبحانك
The first dictionary meaning of naqdira is ‘be able’. Ibn Katheer and Al-Qurtubi interpreted
naqdira as ‘to narrow’ or ‘constrict’ as in (18):
has given him (Qur’an, 65:7).
The meaning of the verse as presented by Ibn Katheer is ‘So Jonah (Dhul-Nuun) thought that
Allah might not constrict him in the belly of the fish’.
Commentators on the Qur’an eliminated the possibility that the Prophet Jonah doubted
Allah’s ability to reach him by explaining the meaning of qadara as ‘to constrict’. But the
question still arises how the Prophet Jonah, who is infallible according to the Islamic creed,
could think that Allah might not constrict him in the belly of the fish when he had gone off in
anger, fleeing from his people without permission from Allah. If we interpret z}anna here as
‘be certain’, the difficulty is resolved. The meaning is then that Jonah was certain that if he
prayed to Allah he would be saved. The use of fa with the following verb naada clarifies
this point, as fa introduces a result. So we can say that Jonah was certain he would not be
constricted in the belly of the fish if he prayed to Allah.
In short, there is a subtle difference between z}anna and h}asiba because of the
contextual variation that occurs with them. Still, we need to establish which semantic features
make them different in a more methodical way by means of corpus-based analysis.
Let us try to study the whole environment of z}anna and h}asiba, particularly the first and
second left collocates, to see what preferential distribution they appear in.
(x) f(x,y) F(x) MI
suspicions الظنون 22 25 10.87
most غالب 18 22 10.77
meeting Allah54 ملاقو الله 11 14 10.71
certainly كل الظن 4 6 10.47
certainly مؤكد 4 7 10.25
very ظنا 17 64 9.14
that أن 579 15537 6.31
in بـ 252 34845 3.94
much كثير 6 1547 3.05
in في 48 24009 2.09
Table (7.17) The 1st & 2nd left collocates with z}anna with minimum frequency 3.
Table (7.18) The 1st & 2nd left collocates with h}asiba with minimum frequency 3.
The first thing to notice from the above tables is the high frequency of an ‘that’ (an introducer
of a subordinate clause), which occurred 579 times with z}anna, whereas it occurred just
102 times with h}asiba. However, an ‘that’ accounts for almost the same proportion of
occurrences with both items: h}asiba (102/388, about 26%) and z}anna (579/2340, about
25%). So they both have the same proportion of occurrences with subordination.
54 We searched this item plus the following one because they constitute one concept which is resurrection.
We can also see that z}anna collocates with the full range of intensifiers such as ‘certainly,
much, most, very’, whereas none of these intensifiers occurs with h}asiba, even after a
further assessment of all possible occurrences of both items. Therefore, we can say that
z}anna is something that can increase or become more certain. It can increase to reach a
level of conviction, as mentioned above in example (16.a) (Qur’an, 2:46).
We then see that ‘z}anna mulaaquu Allaah’ (they believe they will meet Allah) has a high
MI score of 10.71. We can say then that z}anna collocates with a phrase denoting belief in
resurrection, and this involves certainty. In fact, the dominant sense of z}anna so far, on
the basis of the evidence given throughout, is to denote belief. However, there are some
occurrences of z}anna which are assumed to denote probability or doubt, as mentioned
earlier. For practical reasons, we can fit all these senses onto an epistemic scale.
‘Factuality’ in the above scale represents the highest degree of certainty, whereas ‘possibility’
and ‘doubt’ are the lowest. So, we can easily accommodate all senses of z}anna (probability,
belief and certainty), and so reconcile the lexicographers’ accounts, by keeping to one sense which
resides halfway between ‘doubt’ and ‘certainty’ or between ‘doubt’ and ‘certainty not’, i.e. a
state of strong or weak possibility, as represented in the following scale.
In fact, the use of z}anna to mean ‘believe’ reflects a faith-related commitment. Therefore,
this sense rules out its use in relation to the Prophet in the following verses:
(19.a)
ولا تحسبن الله غافلاً عما يعمل الظالمون
And not think Allah unaware of-what do-they the-wrongdoers.
Consider not that Allah is unaware of that which wrongdoers do. (14: 42)
(19.b)
فلا تحسبن الله مخلف وعده رسله
Two explanations are given in Tabari’s Tafseer (Commentary on the Qur’an) for h}asiba in
this particular context:
(1) To highlight the Prophet’s belief that he does not consider Allah unaware of what the
wrongdoers do. Similar phrasing can be found in the Qur’an in more than one place. For
instance, Allah says, ‘O ye who believe, believe’ (4: 136).
(2) To draw attention to the fact that Allah is aware of the wrongdoers’ actions and that He will
punish them accordingly.
If the addressee is not the Prophet, the literal meaning will not be infringed.
In Al-Qurtubi’s explanation, he said, ‘this is to relieve the Prophet (Muhammad) after relating
to him this sad story about the people of Abraham and how impudent they are in discrediting
his religion.’
Knowing, in the first place, that the addressee in the following verse (14: 44) is the Prophet
shows that the addressee in the previous verse has to be him as well.
However, the addressee in the above verses can include all categories of participants in
the speech act: the speakers, the listener/reader and the audience, because the verses are put
in an admonishing style. This is in conformity with the basic idea of prophethood and
revelation, which is for the good of all people, not just the Prophet. Accordingly,
h}asiba should have another meaning that suits all potential addressees, different from
z}anna, which means something in between certainty and doubt. We would do better to define
it as a verb that refers to the inclination of one’s heart to think. Secondly, it is obvious that the
two verses (19a & b) are imperative and negative at the same time. This use is typical only
of h}asiba. The negated imperative occurs 37 times with h}asiba (i.e. 9.56%) and only
three times with z}anna (0.12%). In this context I examined z}anna and h}asiba in their
verbal forms, and it turned out that all their occurrences in the negative imperative are followed
by clausal complements (subordinate clauses), and these clauses can function as subject,
object or complement. This is quite significant in drawing out the differences between
z}anna and h}asiba.
We have seen that h}asiba occurs in the negative imperative more often than z}anna. Let us
now look at some possible explanations as to why this is so.
Basically, the pronouns used with h}asiba in the imperative mood must be second person,
singular or plural, feminine or masculine. The personal pronoun you is used in ‘a direct
address language’ (Leech, 1966: 34). The language of direct address is an appropriate vehicle
for effective communication, where the addresser seems to be holding a conversation and
talking to the addressee directly. In this case, the person receiving the message, the addressee,
is the passive part of the speech act. Is that not proof that h}asiba is a passive word? No,
we cannot make that claim before we assess the other part of the description, namely the
imperative mood.
First of all, the literal function of the imperative mood is direct instruction and admonition.
Imperatives can be positive, meaning direct exhortations, or negative, connoting
prohibitive warnings (ibid.: 110-111). All occurrences of h}asiba and z}anna in the imperative
mood are accompanied by the negative form. Therefore, they are used as prohibitive
warnings. This sense, prohibitive warning, coupled with the language of direct address, is
significant in religious discourse, where the speaker tries to remedy the defects of the
listeners/hearers without any sort of sophisticated locution. The speaker only aims to touch
the souls of his/her audience in a simple and direct way.
Secondly, the use of h}asiba in this way implies that the message to be delivered is enough
to treat a superficial problem that has not found its way to the heart. Thus, it can reduce the
discourse complexity by expressing in just one or two sentences (as in examples (19 a & b)
above) what would otherwise have been expressed in a lengthy address with z}anna. As for
z}anna, as in (Qur’an 2:154-171), Allah gives an account, using z}anna, of the behaviour
of some Muslims on the battlefield and the remedy for it. He therefore gave a lengthy
treatment of such a problem, which is cowardice or fear of death, after it had found its way to
their hearts. Therefore the main distinction between z}anna and h}asiba is that the former
is used for deeply held belief or conviction whereas the latter is for superficial belief (i.e.
belief about relatively unimportant issues). For example, z}anna is used throughout the
Qur’an subcorpus to mean a state of belief or disbelief that leads either to heaven or Hell-fire.
Let us consider the following examples of z}anna:
(20.a.)
واستعينوا بالصبر والصلاة وإنها لكبيرة إلا على الخاشعين. الذين يظنون أنهم ملاقو ربهم وأنهم إليه راجعون.
(20.b.)
وأما من أوتى كتابه بيمينه فيقول هاؤم اقرؤوا كتابيه إني ظننت أني ملاق حسابيه.
wa ammaa man utiya kitaabahu bi-yamiinihi fa-yaquulu … inni z}anantu anni mulaaqin
h}isaabiyah.
And as-for who given-(him) book-his in-right-hand-(his) so-says … verily-I thought
that-I meeting reckoning-me.
Then as for him who will be given his Record in his right hand will say… Surely, I
did believe that I shall meet my Account. (Qur’an, 69: 19-22)
(20.c.)
وأما من أوتى كتابه وراء ظهره فسوف يدعو ثبورا ويصلى سعيرا إنه كان في أهله مسرورا إنه ظن أن لن يحور.
wa amma man uutiya kitaabahu waraa’a z}ahrihi fa-sawfa yadcu… innahu z}anna
an lan yahuura.
And as for who given-(him) book-his behind back-his so-will invoke-he
destruction… Verily-he thought that not return.
But whosoever is given his Record behind his back, He will invoke (his) destruction.
… Verily, he thought that he would never come back (to Us)! (Qur’an, 84:10-14)
(21.a.)
قيل لها ادخلي الصرح فلما رأته حسبته لجة وكشفت عن ساقيها
(21.b.)
وَيَطُوفُ عَلَيْهِمْ وِلْدَانٌ مُّخَلَّدُونَ إِذَا رَأَيْتَهُمْ حَسِبْتَهُمْ لُؤْلُؤًا مَّنثُورًا
(21.c.)
يحسبهم الجاهل أغنياء من التعفف
The one who knows them not, thinks that they are rich because of their modesty.
(Qur’an, 2: 273)
We can notice in the above examples that h}asiba is used to describe one’s own impression
of a particular situation. In (21a), when Queen Belqees visited Solomon, the latter wanted to
impress her in a way that would make her believe in Allah. He asked her to enter a glass palace
built on water. She had never seen such an edifice before, so she thought nothing was there and
tucked up her clothes. She came to her decision by mere sight. So the use of
h}asiba here refers to a state of roughly-held perspectives based on a non-methodical
conception conveyed to one’s mind or heart through mere sight, as in (21a-b), or through
hearing or prediction, as in (21c).
We can finally say that h}asiba and z}anna are verbs whose meanings imply a
personal element, which Badawi (2000) describes as an introducer for the relationship
that holds between subject and predicate on the basis of one’s own point of view. But h}asiba
describes a personal state attained via feelings or mere senses rather than facts and
knowledge. z}anna, as discussed above, is based on personal perspectives residing in one’s
own mind, on the strength of which one can believe in the validity or invalidity of a given
concept. These perspectives can be true for one person and false for another, according to how
accurate each person’s perception is. So the semantic feature which can be
deduced from these differences between h}asiba and z}anna is that the former is an
immediate reaction (based on one’s feelings or mere senses) whereas the latter is a considered
reaction (based on ideas reached after long contemplation).
In conclusion, we have probed two semantic features that distinguish
h}asiba from z}anna: positive vs. negative and immediate vs. considered reaction.
The two features, although apparently unrelated, complement each other. The discourse
characterised by h}asiba tends to be an immediate reaction which is mainly positive in the
sense that it represents only what is the case, without deep thinking. z}anna, by
contrast, gives the impression of a considered reaction which is mainly a negative report of
the events, i.e. it expresses one’s personal evaluation of the situation or state of affairs
referred to.
Therefore, it turns out that the synonymy relation can no longer hold between h}asiba and
z}anna. One more piece of evidence is that if we assume that h}asiba and z}anna are
synonymous, we should be able to exchange one word for the other without changing the
meaning of the sentence to any great extent. If we try that with the examples above, we get,
(20.a.)
ولا تحسبن الله غافلاً عما يعمل الظالمون
(20.b.)
أحسبك رجلاً عاقلاً
(22a.)
*ولا تظنن الله غافلاً عما يعمل الظالمون
The replacement seems to work for the second sentence but not for the first.
This is because the addressee in (22a) is the Prophet, who basically believes in Allah’s
ultimate power and has no doubt that Allah is aware of everything. So z}anna, which
means something based on facts, does not fit here, simply because he is a prophet.
Only h}asiba can fit, with its meaning ‘Do not let the phenomenal situation of Allah’s
wisdom (in postponing the punishment of the tyrants and the wrongdoers and giving them the
upper hand) enter your mind or heart through mere observation of the situation’.
7.6.1 Summary
In this section we carried out some preliminary analysis prior to the statistical tests:
1) singling out the most central forms of the pair; 2) discussing the
grammatical and semantic position of both words; 3) refuting the polysemous nature of
z}anna as having two opposite senses. We then identified interesting differences
between the pair of words by probing the semantic features of both: negative vs. positive and
considered vs. immediate reaction.
Statistically, we used only MI to highlight how significant the collocations of both
words are. We found that the t-test is not useful with this pair of words, because the lists of
collocates of the two words are different. So the difference between them can be brought out by
MI.
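The MI score relied on above can be illustrated as follows. This is a minimal sketch that
assumes the usual pointwise formulation, MI = log2(f(x,y) * N / (f(x) * f(y))); the formula is
not restated in this section, so that choice is an assumption. The function name and the
counts in the example are hypothetical; only the corpus size of 5,000,000 words is taken from
the CAC.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information (log base 2) of a node word and a collocate.

    pair_freq      -- how often the two words co-occur in the chosen window
    node_freq      -- corpus frequency of the node word
    collocate_freq -- corpus frequency of the collocate
    corpus_size    -- number of tokens in the corpus
    """
    if min(pair_freq, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    # Co-occurrences expected if the two words were independent of each other.
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts, for illustration only (not figures from the thesis).
example_score = mutual_information(
    pair_freq=20, node_freq=500, collocate_freq=2000, corpus_size=5_000_000
)
```

A score of zero means the pair co-occurs exactly as often as chance would predict; higher
scores indicate a stronger association, although MI is known to favour rare collocates.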
7.7 A case study: The word pair h}bb and wdd ‘love’
The synonymous pair, h}bb and wdd , which are commonly taken to mean ‘love’, will be
examined below to see if they are absolute synonyms. Let us have a look first at the
dictionary meaning in table (7.19) below.
Table (7.19) definitions of h}bb and wdd in four dictionaries
Al-Muhiit
Having searched for the exact match, we found that the output is dramatically smaller
than the one reached by using a wild card, although the results include all word classes of the
above lexeme. Searching every category separately takes a lot of time but it is more accurate.
So, we will search for all the word-forms that occur in CAC. This could leave some word-forms
without analysis because they did not occur in our corpus. The total number of
occurrences of the base-word h}bb in CAC is 1972 and the search result can be represented
in the following table.
27 حبيبتي my lover (fem.) N 9
28 محب the lovers (acc.) V 9
29 يحبونهم they loved them N 9
30 التحاب love N 8
31 حبيبا a beloved (acc.) N 7
32 الحبيبة the beloved person (fem.) V 7
33 محبتكم your love (pl.) N 6
34 تحابا they loved each other (dual) V 5
35 تحبوا you (pl., jussive/ acc.) love N 5
36 محبتكم (your love) N 4
37 محبتهم their love N 4
38 الحباء the lovers pl. N 4
39 أحبهم he loved them V 4
40 المحبون the lovers (nom) N 4
41 تحبها you love her N 4
42 تحابوا love one another (pl.) V 4
43 المتحابان the (dual) lovers N 4
44 المتحابين the lovers N 3
45 يحبهم he loves them V 2
46 يتحابون they love one another V 2
47 يحبونها they love him V 2
48 يحبان they (dual) love N 2
49 حبيبه his beloved (masc.) N 2
50 تحبان you (dual) love V 1
51 يحبوا they (pl., jussive/ acc.) love V 1
52 يتحابوا they love one another (pl.) V 1
53 تحاببتم you loved one another (pl.) V 1
54 حبيبته his beloved person (nom., fem.) N 1
55 حبيبتها her beloved person (acc., fem.) N 1
56 محبتين two lovers (acc., fem.) N 1
57 محبتهن their love (acc., fem.) N 1
Total: 1972
Table (7.20): Lexical Frequency of h}bb in CAC
The lexical items in the table above are all derived from the same root, h}bb. However,
discussing them all would be tedious and time-consuming. Instead, based on corpus
linguistics techniques, we can choose the most frequent items from the above list to analyse
and see if we can get a significant understanding of the whole scope.
Table (7.21) above shows that these top ten word forms comprise most of the overall
occurrences of the base-word h}bb. They altogether form 72% of the total frequency, which
is enough to work on for a realistic result. We can also notice that there is no big difference in
frequency between verbal and nominal forms.
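The 72% figure can be checked with simple arithmetic. The sketch below uses the total of 1424
hits for the top ten forms reported in table (7.22) and the overall total of 1972; the function
name is hypothetical.

```python
def coverage(selected_hits, total_hits):
    """Percentage of all occurrences accounted for by the selected word forms."""
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    return 100.0 * selected_hits / total_hits


# 1424 hits for the top ten forms out of 1972 hits for the base-word h}bb.
top_ten_share = coverage(1_424, 1_972)
```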
Initial observation of the base-word h}bb, unlike wdd, shows that it is more likely to be
frequent in love stories and fiction. To be consistent, we need either to avoid analysing this
item or treat it as exceptional since we are working from the very beginning on just general
words as mentioned in 7.2. So let us see how often this word occurs in different kinds of
texts. This can help us find out whether it is a general word or register-specific, related to
love stories and fiction. The table below shows the distribution of the base-word
h}bb (of the above top ten items) in the different subcorpora in the CAC.
Subcorpus          Text size     V     N   Total   Perc.55
The Holy Qur’an       88,622    43     7      50   .056
Biography            393,933   105    28     133   .033
Fiction              579,223   107   301     408   .070
Hadith               683,970   196    82     278   .040
Lexicons             404,080     9     8      17   .004
Philosophy           478,141    23    49      72   .015
Poetry                69,385    10    11      21   .030
Proverbs             362,054    36    70     106   .028
Science              903,205    26    18      44   .004
Total              5,000,000   786   640    1424   .028
Table (7.22): Subcorpus frequencies of the top ten forms of the base-word h}bb in CAC.
The list above shows that the base-word h}bb is more frequently used in Fiction, the Holy
Qur’an and Hadith than in any other text type. The occurrences of the words under
examination in these sub-corpora exceed the overall occurrence of such words in the whole
corpus, which is .028%. They are least frequent in the texts that are considered technical like
55 By percentage I mean the ratio of the item under examination, h}bb, per subcorpus.
linguistics, science and philosophy. In the first place, this could be an indication that the more
general the text is, the more likely the word love is to occur. Secondly, if a word is frequently
used in a specific text, it is probably important in that text, but if it is frequently used in all
texts, it is not important in any of them. Therefore, h}bb is a general word because it occurred
in all texts and is used frequently in general texts.
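The comparison of subcorpora rests on normalising raw hits by subcorpus size. The following
is a minimal sketch of that calculation, using the text sizes and totals given in table (7.22)
for the nine subcorpora listed there; the variable and function names are hypothetical, and the
results may differ in the last digit from the printed percentages because of rounding.

```python
# (text size in words, hits for the top ten h}bb forms), from table (7.22).
SUBCORPORA = {
    "The Holy Qur'an": (88_622, 50),
    "Biography": (393_933, 133),
    "Fiction": (579_223, 408),
    "Hadith": (683_970, 278),
    "Lexicons": (404_080, 17),
    "Philosophy": (478_141, 72),
    "Poetry": (69_385, 21),
    "Proverbs": (362_054, 106),
    "Science": (903_205, 44),
}
CORPUS_SIZE = 5_000_000  # size of the whole CAC
CORPUS_HITS = 1_424      # hits for the top ten forms in the whole CAC


def percentage(hits, size):
    """Hits as a percentage of the number of words in a (sub)corpus."""
    return 100.0 * hits / size


overall = percentage(CORPUS_HITS, CORPUS_SIZE)

# Subcorpora ordered from the highest to the lowest relative frequency.
ranked = sorted(
    ((name, percentage(hits, size)) for name, (size, hits) in SUBCORPORA.items()),
    key=lambda item: item[1],
    reverse=True,
)

# Subcorpora in which h}bb is more frequent than in the corpus as a whole.
above_overall = [name for name, pct in ranked if pct > overall]
```

Ranking by percentage rather than by raw hits matters here because the subcorpora differ
greatly in size: Hadith has more hits than the Holy Qur’an, but a lower relative frequency.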
Let us now search our corpus for the words in question to see which dictionary meaning
mentioned in table (7.19) is the most common and what semantic feature/s are associated
with it. This can be done by concordances, which are able to detect patterns of usage in
different contexts. This can enable us to examine their collocation easily and discover what
words they group with.
Analysing the co-occurrences of h}bb shows that this word occurs in a pattern. In table
(7.23) below is a list of the immediate left and right collocates of the forms yuh}ibbu,
[al]h}ubb and [al]mah}abbah in a window of 2 items on either side of the search-term
with a minimum frequency of 3.
3 the world الدنيا 6 one of you أحدكم
3 faith الإيمان 5 disbelievers الكافرين
3 Zabeedah زبيدة 4 his action عمله
3 dog الكلب 4 the poor الفقراء
4 who الذين
4 man المرء
4 the soul النفس
4 woman امرأة
3 praise المدح
3 optimism التيمن
3 fun اللهو
3 sleep النوم
3 traitors الخائنين
3 food الطعام
3 people الناس
3 life الحياة
Table (7.23): The base-word h}bb in a window of two items on either side.
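The kind of count behind such a table can be sketched as follows. This is an illustration
only, not the procedure of the concordancer used in the study: it assumes a corpus that has
already been split into tokens, and the function name, parameter names and sample tokens are
all hypothetical. Collocates are grouped by whether they precede or follow the search-term in
reading order; on which side of the screen each group appears depends on how the concordancer
displays right-to-left text.

```python
from collections import Counter


def window_collocates(tokens, node_forms, span=2, min_freq=3):
    """Count collocates within `span` tokens before and after each node form.

    Returns two dicts (preceding, following) that keep only the collocates
    occurring at least `min_freq` times.
    """
    preceding, following = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token not in node_forms:
            continue
        # Tokens that come before the node in reading order.
        preceding.update(tokens[max(0, i - span):i])
        # Tokens that come after the node in reading order.
        following.update(tokens[i + 1:i + 1 + span])

    def keep_frequent(counts):
        return {word: n for word, n in counts.items() if n >= min_freq}

    return keep_frequent(preceding), keep_frequent(following)


# Hypothetical transliterated tokens, for illustration only.
sample = "man yuhibbu allaha wa man yuhibbu al-naasa".split()
before, after = window_collocates(sample, {"yuhibbu"}, span=2, min_freq=2)
```

The grammatical filtering described below (keeping only subjects, objects and complements)
is a manual step that such a count does not perform.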
Not all hits are represented in this table or discussed below, simply because we filtered the
results by removing adjunct56 examples. Those examples, although they contained the desired
lexemes (i.e. left or right collocates), did not constitute subjects or objects for the verbs or
complement of the noun phrase.
Studying the right and left collocates of h}bb (verbal and nominal) can reveal potential
subjects. It emerged that all of the right collocates which can stand for subjects are animate.
The concordances show that the most frequent subject in the list is the relative pronoun ‘who’
(105 times57), followed by the word Allah (76 times), people (9), man (6), son (6), father (5),
soul (5) and dog (3). So, the base-word h}bb reflects one’s inner feeling of liking something.
We can also examine the objects and then ask, “What does X love?” Most of the objects
listed in the table above are either good or bad qualities; however, the most frequent left
collocate in the list is the word Allah. We also have objects like Messenger, man, woman,
food, fun and sleep. We can then conclude that the base-word h}bb can describe someone’s
strong feeling of liking towards something. That thing which is loved can either be animate
such as people, man or inanimate such as food, fun, sleep. The collocates of h}bb can be
summarised in terms of frequency in the following domains: (1) religious experience, (2)
friendship, (3) sexuality, (4) family, and (5) non-human objects.
Let us now have a look at the other item of the pair, wdd, which is more problematic
because there is no consistency in explaining its meaning in the Arabic Qur’anic exegeses
or in translating it afterwards. It is sometimes translated as affection, kindness or friendship,
as will be discussed below.
24 مودتنا our love N 1
25 وداد love N 1
26 ودادي my love N 1
27 مودتهم their love (mas.) N 1
28 ودادكم your love (mas.) N 1
29 ودادتي my love N 1
30 أوده love him V 1
31 التواد the love N 1
32 يودوا they love V 1
33 يوادونهم they love them V 1
34 ودها her love V 1
35 يودك he loves you V 1
36 يوده he loves him V 1
Total: 280
Table (7.24): Lexical Frequency of wdd in CAC
We can notice that the overall frequency of wdd in the whole corpus is far lower than that of
h}bb; wdd constitutes only .005% (280 occurrences) of the whole corpus whereas
h}bb constitutes .039% (1972 occurrences).
Freq. Right1 Freq. Left1
15 who من 12 one of them58 أحدهم
11 who الذين
10 many of them كثير
4 somebody فلان
4 the friend الصديق
3 family أهل
Table (7.25): The base-word wdd in a window of two items on either side.
Examining the first node on the left and right hand side of wdd, as represented in table (7.25)
above, does not give as much significant information about collocation as it did for h}bb. The
only semantic feature that can be extracted from these examples is that wdd co-occurs with
+human lexemes whereas h}bb is more general as it can co-occur with +animate lexemes.
This can be represented in the following table.
58 In Arabic, the personal pronoun in plural masculine position, attached or detached, is used to refer to humans
only.
h}bb (L) h}bb (R) wdd (L) wdd (R)
Allah √ √ X X
Messenger √ √ X X
Man √ √ √ √
Woman √ √ √ √
Food √ X X X
Sleep √ X X X
Dog X √ X X
Sexual intercourse √ X X X
Fun √ X X X
Table (7.26) Left and right collocates of h}bb and wdd
Further analysis of the same collocates without applying a filter may reveal other semantic
features invisible to us within a span of two. For example, law (if) followed wdd (verbal) 27
times, whereas law did not co-occur at all with h}bb. We can say then that wdd
behaves like verbs of imagination such as hope and wish, since as a verb it takes a following
if-clause just as they do. Let us have a look at the following examples.
(23)
... يود أحدهم لو يعمر ألف سنة
يوم تجد كل نفس ما عملت من خير محضرا وما عملت من سوء تود لو أن بينها وبينه أمدا بعيدا..
The use of wdd followed by an if-clause in the above examples sheds light on the possibility
of using this word to mean either love or wish; this is mentioned in the dictionary meanings.
That extra sense (wish) is obviously a good distinction between h}bb and wdd. However, to
find subtle differences between h}bb and wdd we need to keep to only
one side of the meaning: affection. Therefore we will exclude examples containing if-clauses.
This applies only to wdd in verbal forms, because none of the if-clauses occurred after
wdd in nominal form in CAC. wdd in verbal form occurred 112 times in CAC, 55 of which
are followed by law (if). So we will exclude these 55 examples to get a reliable comparison
between h}bb and wdd meaning affection. This leaves 57 examples, and after examining
them we do not get any interesting collocation either, as shown in table (7.27) below.
As looking into the concordances of wdd in a small span does not show any significant
collocation we need to increase the span to see whether we can get any particular distribution
of that word. In a span of five on each side of the search-term we found out that that word
tends to occur in a certain semantic profile different from h}bb. Below we will examine the
word wdd meaning affection in nominal forms.
Having searched the concordances of wdd and h}bb in that bigger span, we found the
following results:
1) None of the intensifiers or adverbs of degree, such as shadiid ‘very’, kathrah ‘much’ and
zaa’idah ‘exceedingly’, occurred with wdd, whereas h}bb occurred with intensifiers like
shadiid or shiddah ‘very or strong’ (17 times), kathrah ‘much’ (4), and adverbs like fart}u
‘exceedingly’ (6), zaa’idah ‘excessively’ (5).
2) Some verbs occur more often with wdd than with h}bb. For instance, wdd occurs more
frequently with verbs that mainly describe a concrete or observable action such as ta’ti
‘come’, tadnu ‘come closer’, tanqatic ‘cut off’, yussir ‘does discreetly’, abana ‘show’, yas’al
‘request’, yabdhul ‘give’, yaquum cala ‘maintain’, ad}aaca ‘waste’, yunaasih} ‘does
sincerely’, incaqad ‘interlink’, jacala ‘make’ and tarjuu ‘wish’. The last verb is the only
example which describes an abstract action.
3) The verbs that co-occur with h}bb mainly describe an abstract or unobservable action:
tahakkama fi ‘control’, sakanat fi ‘rest in’, yaddaci ‘claim’, yarzuq ‘bless’, tashtadd
‘strengthen’, ra’a ‘see’, yuz}hir ‘disclose’, yufrit} ‘exaggerate’, yu’thir ‘prefer’, mazaja
‘establish’, yud}mir ‘hide’, yuksib ‘cause to gain’, zaada fi ‘increase’, yucadhib ‘torture’,
yajlub ‘bring’, waqaca fi ‘fall in’, ‘alqa fi ‘put in’. The last four verbs tend to be concrete.
4) The preposition fi ‘IN’ occurs 37 times with the verbs that precede h}bb, whereas it
occurs twice only with wdd.
We then need to compare the two sets of verbs and determine how likely it is that the difference
between the two sets arose by chance; this can be done by the t-test59. We selected only
the verbs with a minimum frequency of 3 for the test below.
59 I used to only search the items whose MI scores are significant. We found this test useful in summarising the
whole data which we can use for further analysis. In case of a limited list like the one we have in table 7.29 we
prefer to run the t-test statistic only.
V f(h}bb /w) f(wdd/w) Gram. Function T Significance
In the table above the higher the t-score the more different the pair under examination. We
can notice that h}bb gets the higher t-score in the context of verbs like waqaca ‘fall’, zaada
‘increase’, taghalghala ‘establish’, yajlub ‘bring’ and yaddaci ‘claim’; it co-occurs with verbs
that describe an abstract action. On the other hand wdd gets higher t-score when co-occurring
with verbs that refer to a concrete action, such as ‘come, cut, keep, request and give’. In other
words, wdd is used with verbs that express a practical action which affects somebody else,
such as cutting a relation with him, maintaining a relation with him, asking him, giving him
etc. As for h}bb, it expresses an abstract action like X falls in love, love increases, love is
established in his heart, s.th. brings love, X claims love. By abstract action, I mean a private
action which does not necessarily affect the recipient.
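A t-score of this kind can be sketched as follows. The formula is not restated in this
section, so the sketch assumes the two-word comparison commonly used in corpus lexicography
(Church et al.), t = (f1 - f2) / sqrt(f1 + f2), where f1 and f2 are the frequencies with which
a context word co-occurs with each member of the pair. The function name is hypothetical, and
the two calls use counts reported in this section (the preposition fi and the word qalb)
purely as illustration; they are not the values of table (7.29).

```python
import math


def t_score(freq_with_first, freq_with_second):
    """Compare how often one context word co-occurs with each word of a pair.

    Positive values mean the context word favours the first word,
    negative values mean it favours the second.
    """
    total = freq_with_first + freq_with_second
    if total <= 0:
        raise ValueError("the context word must co-occur with at least one word")
    return (freq_with_first - freq_with_second) / math.sqrt(total)


# fi 'in': 37 co-occurrences with h}bb, 2 with wdd.
t_fi = t_score(37, 2)
# qalb 'heart': 79 co-occurrences with h}bb, 1 with wdd.
t_qalb = t_score(79, 1)
```

Note that raw counts are used here; since h}bb is far more frequent than wdd in the CAC
overall, a fuller treatment might normalise the counts before comparing them.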
On the basis of the above results (1-4) and table (7.29), we can conclude that wdd (as in result
1) is more emphatic than h}bb, because intensifiers are superfluous items used to amplify
actions, so their absence often indicates more emphasis. Secondly, the frequent
use of motion verbs with wdd (as in results 2 and 3 and table (7.29)) shows that wdd is more
concrete than h}bb. Finally, the preposition IN, which means containment or inclusion, i.e.
locating or limiting the activities of the contained entity, occurs more frequently with h}bb
(as in result 4), which might be an indication that h}bb tends to be contained or to lie in a
particular place. So we can conclude that wdd is +emphatic and +concrete.
Secondly, a further look at the concordances of the pair, in a span of five on both sides,
reveals that qalb ‘heart’ co-occurs 79 times with h}bb and only once with wdd. Because the
word heart co-occurs more frequently with h}bb, this indicates that there is a strong bond
between them and that the heart is traditionally and psychologically connected to feelings like
h}bb. This is further evidence that h}bb is an abstract feeling.
Thirdly, the following Qur’anic verse can be a piece of evidence in favour of the above
conclusion, as shown in (26):
(26) O ye who believe! Take not my enemies and yours as friends (or protectors), offering
them (your) mawaddah (love), even though they have rejected the Truth that has come to
you, and have (on the contrary) driven out the Messenger and yourselves (from your homes),
(simply) because ye believe in Allah your Lord! (Qur’an, Al-Mumtahinah: 1)
This verse was revealed about a man (Hatib ibn Abi Baltacah) who was in the Muslim army
heading towards Makkah to liberate it from the pagans. He sent a message to the pagans of
Quraysh requesting protection for his children and relatives left behind in Makkah in return
for information about the Muslims’ strategy and the weaponry being prepared to conquer
Makkah. When the man was caught he declared that he hated the people to whom he had sent
the message, and he was truthful about his feeling. He only intended to do the people of Makkah
a favour by virtue of which his family and property in Makkah might be protected. The
Prophet said he was truthful. This story is recorded in the Qur’an, where Allah described the
favour he did to the people of Makkah as wudd.
Fourthly, one of Allah’s names is al-waduud (the Loving). This is because, in the first place,
h}bb (love) is commonly understood as a bond between two entities and some kind of need.
Secondly, it is a state of lack of control. These apparently do not fit with Allah’s perfection.
Moreover, wdd is more general than h}bb, i.e. h}bb is devoted to particular persons, who,
in Allah’s sight, could be the pious who are real true believers. This is because all occurrences
of h}bb in verbal forms with Allah show that Allah loves particular people who are righteous
and does not like the wrongdoers. So if He named Himself Al-h}abiib this would be a static
attribute that eliminates some people forever. In other words, if Allah named Himself
Al-h}abiib, this would exclude some people from His bounties and blessings, which are available
to all people.
Therefore, based on the above remarks, the semantic feature that can be extracted to
differentiate between h}bb and wdd is abstract vs. concrete.
7.8 Conclusion
First, the semantic features that distinguish the four widely claimed synonymous pairs
discussed above can be summarised as follows:
• intrinsic vs. extrinsic (as between ithm & dhanb)
• closer vs. further (as between ata & jaa’a)
• negative vs. positive (as between ata & jaa’a, h}asiba & z}anna)
• considered vs. immediate reaction (as between z}anna & h}asiba)
• abstract vs. concrete action (as between h}bb & wdd)
Secondly, we used the following methodology to test the synonymy between the two items of a
given pair:
• Using a corpus to get hold of all the occurrences of the pair under investigation
quickly and accurately.
• Identifying the word class of a given item. This is important in looking for collocation
because it enables us to know which word is more significant. For example, the
collocates of a word which is a verb are more likely to be found on the right hand side
in Arabic. This is done manually.
• Determining the syntactic function of the term under investigation. This is also important
because sometimes we need to look at the complement of an item.
• Analysing collocation.
• Analysing the context to understand how and when the variants are used (semantic
prosody).
• Applying statistics to find anything interesting about their distribution.
• Identifying a semantic feature of each search term according to its
contextual use.
• Substituting one word for the other to see if any change happens in the meaning of the
sentence.
Chapter Eight: Conclusion
Arabic corpus linguistics is a very active area; I had to rework what I had done several times
because of the incessant contributions in this field, especially when discussing the available
corpora and tools that work on the Arabic language.
One of the main contributions of this study is a computational corpus of early
classical Arabic. This corpus will be available for research purposes, to be
exploited in NLP applications for Arabic and for more accurate analysis of Arabic linguistic
phenomena. With regard to size, Arabic corpora should be big enough to be reliable for
generalisation, due to the richness of Arabic vocabulary. For example, in Arabic a given word
is expected to appear less often than in an English text of the same length (Goweder and de
Roeck (2001)). This is because of the inflectional nature of Arabic and the abundance of its
vocabulary (cf. p. 99).
The corpus-based analysis can be used as a successful methodology for testing what has been
introduced by early linguists on all linguistic levels (morphology, syntax, semantics, etc.).
More than that, it can give new insights and introduce rules and models which have not been
previously discussed. There seems to be no corpus-based research directly analysing
synonymous words in Arabic, classical in particular. I do not claim that my analysis is correct
or privileged, but rather that it is more methodical and systematic than one based on intuition.
Final findings suggest that applying corpus linguistics methodology to Arabic can help us
improve lexical awareness and choice as most Arabic linguists are unaware of the
collocational differences between synonymous pairs, let alone ordinary native speakers of
Arabic.
Corpus-based analysis of items which are often regarded as roughly synonymous in Arabic
can highlight subtle differences in meaning among such items. This can be done by
abstracting semantic features through comparing differences observed in their contextual
idiosyncrasies and examining practical examples of the usage of such items. In this way,
absolute synonyms can be ruled out if we come across one context in which one of the
synonymous pair carries more meaning, has a different distribution or is used in a different
register. Also, with the aid of statistical techniques we can have an accurate account of
whether there are systematic differences in the use of certain types of seemingly synonymous
words by summarising their distribution in the corpus.
The results given throughout my work imply a need for a fresh look at Arabic studies. The
new and unexpected shades of meanings will raise lots of questions about the credibility of
most old and modern Arab contributions in the following fields:
1) Lexicons
2) Interpretation of the Holy Qur’an
3) Translation of the Qur’an
4) Jurisprudence
5) Prophetic Traditions (Hadith)
6) Poetry
7) Linguistics
In lexicography, for example, had the dictionary-makers been aware of the subtle differences
and uses of seemingly synonymous words they would have made more accurate definitions.
Suppose we use the corpus-based methodology to build up an Arabic lexicon. As mentioned
elsewhere, the macro structure of an Arabic-Dutch dictionary contains 24,000 words. Since
the prevailing view is that the Arabic vocabulary is very extensive, we might ask ourselves if
a dictionary containing 24,000 words will serve the user sufficiently when reading or
listening to Arabic. Although Nijmegen University recently has managed to create that kind
of corpus-based lexicon, it is only restricted to texts written in Modern Standard Arabic.
In the field of Qur’an exegesis a great deal of work has been done, but it is based on the old,
non-corpus-based perspectives. The output was huge, yielding various contributions.
Nonetheless, some verses are left either vague or misinterpreted because of the vagueness of
some lexemes, as in verse 2:78:
wa minhum ummiyuuna la yaclamuun al-kitaaba illa amaaniyya wa in hum illaa
yaz}unnuun.
And from-them unlettered not know-they the-book except wishes and but they think-
they.
And there are among them unlettered people, who know not the Book, but they trust
upon false desires and they but guess. (Qur’an, 2: 78)
The verse above talks about some Jews who are illiterate and do not know the reality of their
book; however, they follow their scholars blindly and believe them. This is a different
category from those who know the truth and falsify it mentioned in the verse preceding it (2:
75). So if we interpret z}anna as doubt, as commentators like Al-Tabari and ibn Katheer
do, we presume that the second category of Jews, who do not know the reality of their book,
do not believe in it. But this is not the case, since this category is blindly following its
scholars and this is a type of belief. We would rather say there are some Jews who only know
the false version of the Bible and they are certain about what they know even if it is false.
This meaning cannot be attained by simple study of the word; it rather requires
accurate probing of all the senses of the word based on the corpus methodology.
As for the translation of Qur’an, it is basically based on its exegesis. It depends on the same
methodological approach of the author of the exegesis.
In Jurisprudence, many of the arguments between Muslim scholars and schools of thought
arise from their own understanding of the language of the Qur’an and Hadith. One of the
main reasons for such differences is their linguistic disagreement, on the syntactic or semantic
level, over some texts of the Holy Qur’an and Prophetic traditions. This sometimes
leads to differences in understanding and formulating the laws derived from such texts. For
example, s}uurah as used in hadiths is interpreted as ‘picture’. Such an interpretation could
lead to forbidding all types of painted pictures or photographs. This is the opinion of a big
group of Muslims nowadays called Salafis, who understand s}uurah as a picture. Another
group of Muslim scholars interpret s}uurah as statue because this is the meaning which
was current in the Prophet’s lifetime. They further argue that this ruling is only applicable to
statues which are made to be respected and worshipped.
Corpus-based analysis can distinguish between the different senses of a given word
synchronically or diachronically. With this methodology, a particular sense of a word is
clarified.
Appendices
Appendix 1: Copyrights
Muhaddith website
Conditions for copying books from our site (taken from www.muhaddith.org)
You may copy books from our site to another, according to the following
conditions:
• The usage of the book on your site must be for non-commercial purposes.
• Next to each book you copy, mentioning that your source is “Al Muhaddith Project”,
and adding a link to our site.
• Give proper notice concerning books that are not permitted to use for commercial
purposes. As an example, refer to our note concerning Ibn Katheer’s summary by Sabooni.
• Upon completion, informing us and sending us a link to manager@muhaddith.org
From: “Moutasem Zakkar” <moutasem@cosmos-software.com>
To: <Abdel-Hamid.Elewa@student.umist.ac.uk>
Subject: Re: I need permission for downloading
Date sent: Sat, 2 Mar 2002 10:30:37 +0400
Alwaraq website
Dear Sir :
Regards
Moutasem Zakkar
Technical manager
----- Original Message -----
From: “Abdel-Hamid Elewa” <Abdel-Hamid.Elewa@student.umist.ac.uk>
To: <moutasem@cosmos-software.com>
Sent: Tuesday, February 12, 2002 5:37 PM
Subject: I need permission for downloading
Appendix 2:
The contents of the CAC are summarised in the following charts (1) & (2):
[Chart (1) and Chart (2). Segment labels: science, belief & thought, linguistics, literature;
Qur’an, Hadith, theology, philosophy, biography, lexicons, proverbs, fiction, poetry,
mathematics, medicine, physics, geography]
Appendix 3:
Genres and texts included in CAC.
Genre: Thought and Belief
Subgenre                        Texts                  Text Size   Perc.
The Holy Qur’an                 The Holy Qur’an           88,622     1.8
Prophetic Tradition (Hadith)    1. Sahih Al-Bukhari      683,970    13.6
                                2. Sahih Muslim
Lexicons 1. Al-’Ayn (Al-Khalil 404,080 8.0
2. Fiqh al-Lughah (Al-Tha’alibi
Science
Geography Ahsan al-Taqaseem fi ma’rifat al-Aqaleem by Al-Maqdisi 82,499 1.46
Appendix 4:
A sample of concordances as appearing on the Monoconc window.
Appendix 5:
A picture of the concordance lines run by Monoconc and then saved to an only-text file.
Bibliography
Aijmer, K. & Altenberg B. (1991). English Corpus Linguistics. Longman, London and New
York.
Al-Anbari, Abu Barakat (d. 1207). nuzhat al-alibbaa’ (The Fun of the Men of Wit). (ed.) Abu
al-Fadl Ibrahim, Daar Nahd}at Misr li-l-T}abc wa-l-Nashr, Cairo.
Al-Ashcari, A. (1994). al-ibaanah can us}uul al-diyaanah (Explanation About the Basics of
Belief). (ed.) Abbas Sabbagh, Daar Al-Nafaa’is, Beirut.
Al-Askary, A. (1931). al-furuuq fi al-lughah (Differences in Language). (ed.) Imad al-Barudi,
Daar Al-Kitaab Al-cArabi, Damascus.
Al-Bustani, B. (1819-1883). muh}iit al-muh}iit: ay qaamuus mut}awwal lil-lughah
al-cArabiyyah (The Comprehensive Ocean: i.e. A Detailed Dictionary of the Arabic
Language). Beirut.
Al-Fayruzabadi, M. (1952). al-qamuus al-muhiit (The Comprehensive Lexicon).
Mat}bacat Mustafa Al-Babi Al-Halabi, Cairo.
Al-Hamadhani, A. (1991). al-alfaaz} al-kitaabiyyah (The Literary Words). Dar al-Kutub al-
cIlmiyyah, Beirut.
Al-Jabouri, A. J. R. & Knowles, F. E. (1988). “A computer-assisted study of cohesion based
on English and Arabic corpora: An interim report”. Proc. of 13th International
Conference, University of E. Anglia (Norwich), 1-4 April 1986, Computers in
Literary and Linguistic Research. Paris; Geneva, Champion; Slatkine, pp. 59-77.
Allen, J. (1995). Natural Language Understanding. The Benjamin/Cummings Publishing
Company, Inc. Redwood City, CA.
Almuhanna, A. (2003). Scientific and Technological Term Transfer into Arabic: A Corpus-
Based Study of Arabic Noun + Noun and Noun + Adjective Compounds.
Unpublished Ph.D. thesis, UMIST, Manchester.
Al-Qurtubi, M. (1998). al-jaamic li-ahkaam al-qur’aan (The Compendium of
Qur’anic Rulings). Daar Al-Fikr, Lebanon.
Al-Sakkaki, Y. (b. 1066). Miftaah} al-culuum (The Key to Sciences) (1st ed.).
Mat}bacat Mustafa Al-Babi Al-Halabi, Cairo.
Al-Tabari, I. (d. 922). jamic al-bayaan fi ah}kaam al-qur’aan (The
Comprehensive Book in the Rulings of the Qur’an). Daar Al-Fikr, Beirut,
Lebanon.
Al-Yaziji, I. (1970). nujcat al-raa’id wa shurcat al-waarid fi al-mutaraadif wa-l-mutaawarid
(The Spring of the Seeker in Synonyms and Associations) (2nd ed.). Maktabat
Lubnaan, Beirut.
Atkins, S., Clear, J. & Ostler, N. (1992). “Corpus Design Criteria”. Literary and Linguistic
Computing, vol. 7, 1: 1-16.
Badawi, E. (2000). “An opinion on the meanings of icrab in Classical Arabic: The state of the
nominal sentence”. In Diversity in Language: Contrastive Studies in English and
Arabic Theoretical and Applied Linguistics. (eds.) Ibrahim, Z., Kassabgy, N.,
Aydelott, S. The American University in Cairo Press, Cairo.
Bakalla, M. H. (1983). Arabic Linguistics: An Introduction and Bibliography. Mansell
Publishing Ltd, London.
Barlow, M. (1999). Monoconc Program, Version 1.0. Athelstan, Houston, USA.
Barnbrook, G. (1996). Language and Computers. Edinburgh University Press, Edinburgh.
Benson, M. (1990). “Collocations and general-purpose dictionaries”. International Journal of
Lexicography, vol. 3, 1: 23-35.
Benson, M., Benson, E. & Ilson, R. (1997). The BBI Dictionary of English Word
Combinations. John Benjamins, Amsterdam/Philadelphia.
Berry-Rogghe, G. L. M. (1970). Collocations: Their Computation and Semantic
Significance. Unpublished Ph.D. thesis, University of Manchester.
Biber, D. (1993). “Representativeness in corpus design”. Literary and Linguistic Computing,
vol. 8, 4: 243-257.
Biber, D., Conrad, S. & Reppen, R. (1994). “Corpus-based approaches to issues in applied
linguistics”. Applied Linguistics, vol. 15, 2: 169-189.
Biber, D., Conrad, S. & Reppen, R. (1998). Corpus Linguistics: Investigating Language
Structure and Language Use. Cambridge University Press, Cambridge.
Bloomfield, L. (1935). Language. Allen & Unwin, London.
Bohas, G., Guillaume, J.-P. & Kouloughli, D. (1990). The Arabic Linguistic Tradition.
Routledge, London.
Brinton, L. and Akimoto, M. (eds.) (1999). Collocational and Idiomatic Aspects of Composite
Predicates in the History of English. John Benjamins Publishing.
Carter, R. (1987). Vocabulary: Applied Linguistic Perspectives. Allen & Unwin, London.
Charniak, E. (1993). Statistical Language Learning. The MIT Press, Cambridge,
Massachusetts.
Chejne, A. (1969). The Arabic Language: Its Role in History. University of Minnesota Press,
Minneapolis.
Chomsky, N. (1965). Aspects of the Theory of Syntax. The MIT Press, Cambridge,
Massachusetts.
Chomsky, N. (1971). Chomsky: Selected Readings. (eds.) J.P. Allen & Paul Van Buren.
Oxford University Press, London.
Choueka, Y., Klein, T. and Neuwitz, E. (1983). “Automatic retrieval of frequent idiomatic
and collocational expressions in a large corpus”. Journal for Literary and Linguistic
Computing, vol. 4: 34-38.
Christopher D. M. and Hinrich S. (1999). Foundations of Statistical Natural Language
Processing. The MIT Press, Cambridge, Massachusetts.
Church, K. and Hanks, P. (1990). “Word association norms, mutual information and
lexicography”. Computational Linguistics, vol. 16: 22-29.
Church, K., Gale, W., Hanks, P. and Hindle, D. (1991). “Using statistics in lexical analysis”. In
Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, (ed.) Zernik,
U. Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 115-164.
Clear, J. (1993). “From Firth principles, computational tools for the study of collocation”. In
Text and Technology, in Honour of John Sinclair, (eds.) Baker, M., Francis, G. &
Tognini-Bonelli, E. John Benjamins, Philadelphia/Amsterdam, pp. 271-292.
Cowie, A. P. (1978). The Place of Illustrative Material and Collocations in the Design of a
Learner’s Dictionary, in Honour of A.S. Hornby. Oxford University Press, Oxford.
Cowie, A.P. (1981). “The treatment of collocations and idioms in learners’ dictionaries”.
Applied Linguistics, vol. 3: 223-235.
Cruse, D.A. (2000). Meaning in Language: An Introduction to Semantics and Pragmatics.
Oxford University Press, Oxford.
Cruse, D.A. (1986). Lexical Semantics. Cambridge University Press, Cambridge.
Crystal, D. (1987). The Cambridge Encyclopaedia of Language. Cambridge University Press,
Cambridge.
Dichy, J. (2001). “On lemmatization in Arabic, a formal definition of the Arabic entries
of multilingual lexical databases”. 39th Annual Meeting of the ACL, Workshop on
Arabic Language Processing: Status and Prospects, Toulouse.
Dickins, J., Hervey, S. and Higgins, I. (2002). Thinking Arabic Translation, A Course in
Translation Method: Arabic to English. Routledge, London.
Ditters, E. (1990). “Arabic corpus linguistics in past and present”. In Studies in the History of
Arabic Grammar II, (eds.) Versteegh, K. & Carter, G. John Benjamins Publishing
Company, Amsterdam, pp. 129-141.
Emery, P. (1988). Body-Part Collocations and Idioms in Arabic and English. Unpublished
Ph.D. thesis, University of Manchester.
Fasold, R. (1984). The Sociolinguistics of Society: Introduction to Sociolinguistics vol. 1.
Basil Blackwell Ltd. England.
Fillmore, C. (1992). “‘Corpus linguistics’ or ‘Computer-aided armchair linguistics’”. In
Directions in Corpus Linguistics, (ed.) Svartvik, J. Mouton de Gruyter, Berlin, New
York.
Firth, J.R. (1935). Papers in Linguistics, Oxford University Press, London.
Firth, J.R. (1957). “Modes of meaning”. In Papers in Linguistics, Oxford University Press,
London.
Francis, N. (1992). “Language corpora B.C.”. In Directions in Corpus Linguistics, (ed.)
Svartvik, J. Mouton de Gruyter, Berlin, New York.
Freeman, A. (2001). “Brill’s POS tagger and a morphology parser for Arabic”. 39th Annual
Meeting of the ACL, Workshop on Arabic Language Processing: Status and Prospects,
Toulouse.
Garside, R., Leech, G. and Sampson, G. (1987). The Computational Analysis of English, a
Corpus-Based Approach. Longman, London and New York.
Garside, R., Leech, G. and McEnery, T. (1997). Corpus Annotation. Longman, London & New
York.
Ghali, M. (1997). Synonyms of the Glorious Qur’an. Daar al-nashr lil-Jaamicaat, Cairo.
Ghazala, H. (2001). “Cross-cultural link in translation (English-Arabic)”. Majallat Al-Lisaan
Al-cArabi (The Magazine of the Arabic Language), vol. 50, Al-Ribat, Morocco.
Goldziher, I. (1966). A Short History of Classical Arabic Literature. (trans.) J. DeSomogyi.
Georg Olms, Hildesheim.
Goweder, A. and De Roeck, A. (2001). “Assessment of a significant Arabic corpus”. 39th Annual
Meeting of the ACL, Workshop on Arabic Language Processing: Status and Prospects,
Toulouse.
Granger, S. (1999). “Use of tenses by advanced EFL learners: Evidence from an error-tagged
computer corpus”. In Out of Corpora, (eds.) Hasselgard, H. & Oksefjell, S. Rodopi,
Amsterdam, pp. 191-202.
Granger, S. (ed.) (1998). Learner English on Computer. Longman, London and New York.
Gross, M. (1990). Constructing Lexicon-Grammar. University of Paris, Paris.
Guillaume, A. (1931). The Legacy of Islam. Oxford University Press, Oxford.
Haeri, N. (2003). Sacred Language, Ordinary People: Dilemmas of Culture and Politics in
Egypt. Palgrave Macmillan, New York.
Halliday, M.A.K. (1991). “Corpus studies and probabilistic grammar”. In English Corpus
Linguistics, (eds.) Aijmer, K. & Altenberg, B. Longman, London and New York.
Halliday, M.A.K., McIntosh, A. and Strevens, P. (1964). The Linguistic Sciences and
Language Teaching. Longman, London.
Hanks, P. (2000). “Literal and metaphorical word meaning”. Tuscan Word Centre document.
Harris, R. (1973). Synonymy and Linguistic Analysis. University of Toronto Press,
Toronto.
Haywood, J. (1965). Arabic Lexicography, (2nd ed.). Brill, Leiden.
Hitti, P. K. (1958). History of the Arabs. Macmillan, New York.
Hoey, M. (1997). “From concordance to text structure: New uses for computer corpora”. Talk
given at the 1997 Practical Applications of Language Corpora (PALC) conference,
University of Lodz, April 12-14. Later published in Melia, J. & Lewandowska, B.
(eds) Proceedings of PALC 97. Lodz University Press, Lodz.
Hoogland, J. (1993). “Collocation in Arabic (MSA) and the treatment of collocations in
Arabic dictionaries”. The Arabist, Proceedings of the Colloquium on Arabic
Lexicology and Lexicography, Budapest, 1-7 Sept. 1993, (eds.) Devenyi, K., Ivanyi,
T. and Shivtiel, A. Csoma de Koros Soc, Budapest, Hungary.
Horrocks, G. (1987). Generative Grammar. Longman, London & New York.
Hurford, J. & Heasley, B. (1983). Semantics: A Coursebook. Cambridge University
Press, Cambridge.
Ibn Al-Anbari, (1904). al-ad}daad (Antonyms). (ed.) Abu al-Fadl Ibrahim, Al-
Maktabah Al-cAs}riyyah, Lebanon.
Ibn Faris, A. (d. 1105). al-s}aahibi. (ed.) Al-Sayed Sakr. Mat}bacat Isa Al-Babi Al-Halabi
wa-shurakaah, Cairo.
Ibn Katheer, I. (1996). Tafseer al-qur’aan al-caziim (Explanation of the Great Qur’an). Daar
al-macrifah, Lebanon.
Ibn Jinni, A. (d. 1102). al-khasaa’is (The Properties). Mat}bacat Al-Hilal, Cairo.
Ibn Manzur, M. (1232-1311 or 1312). lisaan al-carab (Arabs’ Language). Daar Bayruut lil-
T}ibacah wa-al-Nashr, Beirut.
Ivanyi, T. (1993). “Dynamic vs. static: a type of lexical parallelism in the maqamat of al-
Hamadhani”, The Arabist, Proceedings of the Colloquium on Arabic Lexicology and
Lexicography, Budapest, 1-7 Sept. 1993, (eds.) Devenyi, K., Ivanyi, T. and Shivtiel,
A. Csoma de Koros Soc, Budapest.
Izwaini, S. (2000). Translating Collocations: Arabic/English/Swedish. Unpublished MA
dissertation, CTIS, UMIST, Manchester.
Izwaini, S. (in progress). Translation and The Language of Information Technology: A
Corpus-Based Study of the Vocabulary of Information Technology and Translation
from English into Arabic and Swedish. Unpublished Ph.D. thesis, UMIST,
Manchester.
Jackson, H. (1988). Words and Their Meaning. Longman, London and New York.
Johansson, S. (1995). “ICAME-Quo Vadis? Reflections on the use of computer corpora in
linguistics”. Computers and the Humanities, vol. 28: 243-252.
Jones, S. (1986). Synonymy and Semantic Classification. Edinburgh University
Press, Edinburgh.
Jones, S. and Sinclair, J.M. (1974). “English lexical collocations”. Cahiers de Lexicologie,
vol. 24: 15-61.
Kamir, D., Soreq, N. and Neeman, Y. (2002). “A Comparative NLP system for Modern Standard
Arabic and Modern Hebrew”. In Rosner, M. & Wintner, S. (eds.), Proceedings of the
Workshop on Computational Approaches to Semitic Languages. University of
Pennsylvania.
Kennedy, G. (1998). An Introduction to Corpus Linguistics. Longman, London.
Kenny, D. (1999). Norms and Creativity: Lexis in Translated Text. Unpublished Ph.D. thesis,
UMIST, Manchester.
Kenny, D. (2001). Lexis and Creativity in Translation: a Corpus-Based Study. St. Jerome
Publishing, Manchester.
Khalid, J. AlDaimi and Maha A. Abdel-Amir (1994). “The syntactic analysis of Arabic by
machine”. Computers and the Humanities, vol. 18: 29-37.
Khoja, S., Garside, R. and Knowles, G. (2001). “A tagset for the morphosyntactic tagging of
Arabic”. Proc. of the Corpus Linguistics 2001 Conference, Lancaster University, 29
March-2 April 2001.
Khoja, S. (2003). An Automatic Arabic Part-of-Speech Tagger. Unpublished Ph.D. thesis,
University of Lancaster.
Kjellmer, G. (1987). “Aspects of English collocations”. In Corpus Linguistics and Beyond,
(ed.) Meijs, W. Rodopi, Amsterdam, pp. 133-140.
Knowles, G. (1996). “Corpora, databases and the organisation of linguistic data”. In Using
Corpora for Language Research, (eds.) Thomas, J. & Short, M. Longman, London
and New York.
Koenraad, d., Hazel, G., Espen, O., Tito, O., Harold, S., Jacques, S. and William, V. (eds.)
(1999). Computing in Humanities Education: A European Perspective. The
University of Bergen, Bergen.
Krenn, B. and Samuelsson, C. (1997). “The Linguist’s Guide to Statistics”,
http://citeseer.nj.nec.com/krenn97linguists.html
Langendoen, T. (1968). The London School of Linguistics. MIT Press, Cambridge,
Massachusetts.
Leceibi, H. (1980). al-taraaduf fi al-lughah (Synonymy in Language). Dar al-Rashiid,
Baghdad.
Leech, G. (1991). “The state of the art in corpus linguistics”. In English Corpus Linguistics,
(eds.) Aijmer, K. & Altenberg, B. Longman, London, pp. 8-29.
Lehrer, A. (1974). Semantic Fields and Lexical Structure. North-Holland, London.
Lewis, M. (1993). The Lexical Approach. Language Teaching Publications, Hove, England.
Louw, B. (1993). “Irony in the text or insincerity in the writer? The diagnostic
potential of semantic prosodies”. In Text and Technology: In
Honour of John Sinclair, (eds.) Baker, M., Francis, G. and Tognini-Bonelli, E.
John Benjamins, Amsterdam, pp. 157-176.
Lyons, J. (1963). Structural Semantics. Basil Blackwell, Oxford.
Lyons, J. (1969). Introduction to Theoretical Linguistics. Cambridge University
Press, Cambridge.
Lyons, J. (1977). Semantics. Cambridge University Press, Cambridge.
Lyons, J. (1981a). Language, Meaning and Context. Fontana Paperbacks, GB.
Lyons, J. (1981b). Language and Linguistics. Cambridge University Press,
Cambridge.
Lyons, J. (1995). Linguistic Semantics: An Introduction. Cambridge University
Press, Cambridge.
Majmac al-lughah al-carabiyyah (1977). al-wasiit} (The Intermediate). Daar al-macaarif, Cairo.
Makkai, A. (1987). “Major diseases of linguistics”. In Language Topics, Essays in Honour of
M. Halliday, (eds.) Ross, S. & Terry, T. John Benjamins Publishing Co., Amsterdam/
Philadelphia. pp. 269-280.
Manning C. and Schütze, H. (1997). Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, Massachusetts.
Matthews, P. H. (1993). Grammatical Theory in the United States from Bloomfield to
Chomsky. Cambridge University Press, Cambridge.
McEnery, T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh University Press,
Edinburgh.
McIntosh, A. and Halliday, M.A.K. (1966). Patterns of Language: Papers in General,
Descriptive and Applied Linguistics. Longmans, London.
Meyer, Ch. (2002). English Corpus Linguistics. Cambridge University Press, Cambridge.
Miller, George A. (1963). Language and Communication. McGraw-Hill Book Company, Inc.
New York, Toronto, London.
Mitchell, T.F. (1971). “Linguistic ‘goings on’: Collocations and other lexical matters arising
on the syntagmatic/linguistic record”. Archivum Linguisticum, 2 (new series):
35-69.
Mubarak, M. (1982). The Grammar of Arabic. Daar Al-Kitaab Al-Lubnaani, Beirut.
Mujahid, I. (1989). Tafsiir Mujahid (Explanation of the Qur’an by Mujahid), version 1.
verified by M. Abdel-Salam. Daar Al-Fikr al-Islaamiy al-h}adiithah, Cairo.
Nelson, M. (2000). A Corpus-Based Study of Business English and Business English
Teaching Materials. Unpublished Ph.D. thesis, University of
Manchester.
Owens, J. (1988). The Foundation of Arabic Grammar. John Benjamins Publishing Company,
Amsterdam.
Palmer, F. R. (1981). Semantics (2nd ed.). Cambridge University Press, Cambridge.
Rene, A. (2000). “Language, concepts and culture: old wine in new
bottles”. Bilingualism: Language & Cognition, vol. 1, issue 1.
Cambridge University Press.
Renouf, A. J. (1984). “Corpus development at Birmingham University”. In Corpus
Linguistics: Recent Developments in the Use of Computer Corpora in English
Language Research, (eds.) Jan Aarts & Willem Meijs. Rodopi, Amsterdam, pp. 3-39.
Robert, A. (2004). aConCorde Program, version 0.4. University of Leeds
(http://www.comp.leeds.ac.uk/andyr/software/index.html).
Roulet, E. (1975). Théories grammaticales, Descriptions et Enseignement des Langues
(Applied linguistics and language study). (trans.) Christopher N. Candlin, Longman:
London.
Sinclair, J. (1987a). Looking Up. Collins, London and Glasgow.
Sinclair, J. (1987b). “Collocation: a progress report”. In Language Topics, Essays in Honour
of M. Halliday, (eds.) Ross, S. & Terry, T. John Benjamins Publishing Co., Amsterdam/
Philadelphia, pp. 319-332.
Sinclair, J. (1991). Corpus, Concordance and Collocation. Oxford University Press, Oxford.
Sinclair, J. (ed.) (1995). Collins COBUILD English Dictionary (2nd ed.). HarperCollins, London.
Sinclair, J. (1996). “The search for units of meaning”. Reprinted with permission from Textus
IX, pp. 75-106.
Sinclair, J., Mason, O., Ball, J. and Barnbrook, G. (1998). “Language independent statistical
software for corpus exploration”. Computers and the Humanities, 31: 229-255.
Smadja, F. (1991). “Macrocoding the Lexicon with Co-occurrence Knowledge”. In Lexical
Acquisition, (ed.) Zernik, U. Lawrence Erlbaum Associates, NJ, pp. 165-189.
Smadja, F. (1994). “Retrieving collocations from text: Xtract”. Computational Linguistics,
MIT Press, vol. 19, 1: 143-177.
Svartvik, J. (1992). “Corpus Linguistics comes of age”. In Directions in Corpus Linguistics,
(ed.) Svartvik J. Mouton de Gruyter, Berlin, New York, pp. 7-13.
Thacalibi, Abd al-Malik ibn Muhammad (961 or 962-1037 or 1038). fiqh al-lughah wa-sirr al-
carabiyyah (The Philology of the Arabic Language and Its Secrets), vol. 1.
Maktabat Al-Khanjiy, Cairo.
The Nijmegen Dutch-Arabic/Arabic-Dutch Dictionaries (2003). Bulaaq, Amsterdam.
http://www.let.kun.nl/WBA/Content1/PractInfo.htm
Tognini Bonelli, E. (2000). “Lexis in contrast”. In Studies in Corpus Linguistics, (eds.)
Sylviane Granger and Bengt Altenberg. Benjamins, Amsterdam and Philadelphia.
Ullmann, S. (1962). Semantics: An Introduction to the Science of Meaning. Basil
Blackwell, Oxford.
Van der Wouden, T. (1997). Negative Contexts: Collocation, Polarity and Multiple Negation.
Routledge, London and New York.
Van Mol, M. (2002). “The semi-automatic tagging of Arabic corpora”. In Arabic Language
Resources and Evaluation-Status and Prospects. A workshop held in LREC, 3rd
international conference on language resources and evaluation. Las Palmas, Spain.
Versteegh, K. (1997). Landmarks in Linguistic Thought, the Arabic Linguistic Tradition.
Routledge, London.
Watt, R. (2001). Concordance Program, Version 3.0, personal product.
http://www.rjcw.freeserve.co.uk/.
Wehr, H. (1980). A Dictionary of Modern Written Arabic. Macdonald and Evans Ltd,
London.
Whitaker, B. (2002). “Lost in translation”. The Guardian (UK), Monday June 10, 2002.
Wittgenstein, L. (1953). Philosophical Investigations. Blackwell, Oxford.