
Copyright 2014

DHA Suffa University

Karachi - Pakistan

Conference Committees
Organizing Committee:
Athar Mahboob, DHA Suffa University, Karachi (General Chair)
Sarmad Hussain, University of Engineering & Technology, Lahore (General Co-Chair)
Miriam Butt, Universität Konstanz, Germany (Technical Committee Chair)
Nadir Durrani, University of Edinburgh, UK (Publication Committee Chair)
Tafseer Ahmed, DHA Suffa University, Karachi (Programme Committee Chair)
Technical Committee:
Afaq Husain, King Faisal University, Saudi Arabia
Amir Kamran, Charles University, Czech Republic
Asim Wagan, DHA Suffa University, Karachi
Annette Hautli, Universität Konstanz, Germany
Awais Athar, National University of Computer and Emerging Sciences, Lahore
Bal Krishna Bal, Kathmandu University, Nepal
Bushra Jawaid, Charles University, Czech Republic
Dipti Sharma, Indian Institute of Technology (IIT) Hyderabad, India
Ghulam Raza, Pakistan Institute of Engineering & Applied Sciences (PIEAS), Islamabad
Faisal Shafiat, University of Western Australia, Australia
Haroon Babri, University of Engineering and Technology (UET), Lahore
Hassan Sajjad, Qatar Computing Research Institute (QCRI), Qatar
Khaver Zia, Beaconhouse National University, Lahore
Khurram Junejo, PAF-Karachi Institute of Economics & Technology (KIET), Karachi
Miriam Butt, Universität Konstanz, Germany (Technical Committee Chair)
M. Abid Khan, University of Peshawar, Peshawar
Naila Ata, The Resource Group (TRG), Karachi
Nadir Durrani, University of Edinburgh, UK
Philip Williams, University of Edinburgh, UK
Riyaz Bhatt, Indian Institute of Technology (IIT) Hyderabad, India
Roni Rosenfeld, Carnegie Mellon University, USA
Sadaf Abdul Rauf, Fatima Jinnah Women University, Rawalpindi
Samar Husain, Universität Potsdam, Germany
Sarmad Hussain, University of Engineering & Technology, Lahore
Sebastian Sulger, Universität Konstanz, Germany
Seemab Latif, National University of Science and Technology, Rawalpindi
Sohail Abdul Sattar, NED University of Engineering & Technology, Karachi
Tafseer Ahmed, DHA Suffa University, Karachi
Tania Habib, University of Engineering and Technology (UET), Lahore
Tina Bögel, Universität Konstanz, Germany
Tracy Holloway King, eBay, USA
Qurat-ul-Ain Akram, University of Engineering and Technology (UET), Lahore
Zeeshan Ahmed, North Side Inc., Montreal, Canada


Publication Committee:
Nadir Durrani, University of Edinburgh, UK (Chair)
Shoaib Siddiqui, DHA Suffa University, Karachi (Co-chair)
Programme Committee:
Asim Wagan, DHA Suffa University, Karachi
Ayaz Ahmed, DHA Suffa University, Karachi
Badar Sami, University of Karachi, Karachi
Hammad Ahmed, DHA Suffa University, Karachi
Humayun Qureshi, PAF-Karachi Institute of Economics & Technology (KIET), Karachi
Imran Jami, DHA Suffa University, Karachi (Co-chair)
Farah Chughtai, DHA Suffa University, Karachi
Khubaib Ahmed, DHA Suffa University, Karachi
Khurram Junejo, PAF-Karachi Institute of Economics & Technology (KIET), Karachi
Mansoor Samoo, DHA Suffa University, Karachi
Mazher Iqbal, DHA Suffa University, Karachi
Mobeen Movania, DHA Suffa University, Karachi
Muhammad Imad uddin, DHA Suffa University, Karachi
Muhammad Rafi, National University of Computer and Emerging Sciences, Karachi
Munazza Kanwal, DHA Suffa University, Karachi
Mutee u Rehman, Isra University, Hyderabad
Naila Ata, The Resource Group (TRG), Karachi
Nasir Abbas, Lasbela University, Uthal
Rauf Malick, DHA Suffa University, Karachi
Shaukat Wasi, DHA Suffa University, Karachi
Shoaib Siddiqui, DHA Suffa University, Karachi
Sohail Abdul Sattar, NED University of Engineering & Technology, Karachi
Tafseer Ahmed, DHA Suffa University, Karachi (Chair)
Wajiha Kanwal, DHA Suffa University, Karachi


It is my distinct pleasure to write a foreword to the proceedings of the 5th Conference on Language and
Technology 2014 (CLT14) held at DHA Suffa University (DSU) 13-15 November 2014. These
proceedings include 11 research papers which were selected as full papers, based on a rigorous peer-review process, out of a total submission of 39 papers. Another 7 papers have been accepted for poster
presentation. CLT14 thus maintained the high standards and selectivity associated with the earlier CLT
conferences. We would like to thank all the members of the Program Committee for performing timely
reviews of submitted papers.
The papers selected for presentation at the CLT14 provide a panoramic view of the state of the art
developments in the field of application of computing technology to processing of natural languages,
particularly those of Pakistan. The developments presented in these papers show the trends in further
integration of computing into our daily lives and bridging the digital divide between the anglicized and
vernacular aspects of communication in our society.
From the organizational perspective, the CLT14 is a culmination of more than a year of strenuous efforts
of the team led by Dr. Tafseer Ahmed Khan which included faculty members of the Computer Science
Department at DSU. Others who have played a significant supporting role include Dr. Imran Jami, Head of
Computer Science Department at DSU. The overall support and encouragement provided by Prof. Dr.
Sarfraz Hussain, Vice Chancellor, DSU has also been instrumental in making the CLT14 possible.
The conference program has been arranged to include technical sessions, poster session, a panel
discussion, technology demonstrations, a banquet and an excursion tour. We hope that the overall
experience of CLT14 fulfills the expectations of participants and look forward to feedback.

Prof. Dr. Athar Mahboob

Conference Chair


Table of Contents

Prosodic Phrasing and the Parsing of Modifier Attachment Ambiguity in Deep and Shallow Orthography
Hala Abdelghany

What's in a name? Automatic extraction of lexical and functional units of Pakistani names
Tafseer Ahmed and Naila Ata

Extracting Arguments and Collocations for Urdu Complex Predicates
Tafseer Ahmed

Framework of Urdu Nastalique Optical Character Recognition System
Qurat-Ul-Ain Akram, Sarmad Hussain, Farah Adeeba, Shafiq-Ur-Rehman and Mehreen Saeed

Design of Speech Corpus for Open Domain Urdu Text to Speech System Using Greedy Algorithm
Wajiha Habib, Rida Hijab Basit, Sarmad Hussain and Farah Adeeba

An optimized Pashto Keyboard Layout based on Character Frequency
Muhammad Junaid, Kamran Ghani and Iftikhar Ahmed Khan

Multitier Annotation of Urdu Speech Corpus
Benazir Mumtaz, Amen Hussain, Sarmad Hussain, Afia Mahmood, Rashida Bhatti, Mahwish Farooq, Sahar Rauf

Hidden Markov Model (HMM) based Speech Synthesis for Urdu Language
Omer Nawaz and Tania Habib

Alphabet Signs Recognition using Pixels-based Analysis
Mohammad Raees and Sehat Ullah

Sense Tagged CLE Urdu Digest Corpus
Saba Urooj, Sana Shams, Sarmad Hussain, Farah Adeeba

Structural Analysis of Linking Urdu WordNet to PWN 2.1
Ayesha Zafar, Afia Mahmood, Sana Shams and Sarmad Hussain

CLE Urdu Books N-grams
Farah Adeeba, Qurat-Ul-Ain Akram, Hina Khalid and Sarmad Hussain

Accent Classification among Punjabi, Urdu, Pashto, Saraiki and Sindhi Accents of Urdu Language
Afsheen Rafaqat Ali, Saad Irtza, Mahwish Farooq and Sarmad Hussain

Text Processing For Urdu TTS System
Rida Hijab Basit and Sarmad Hussain

Urdu Keyword Spotting System using HMM
Saad Irtza, Khawer Rehman and Sarmad Hussain

HPSG Analysis of Arabic Verb Form Derivation
Md. Sadiqul Islam

Spoken Dialog System: Direction Guide for Lahore City
Aneef Izhar, Aneek Anwar, Aitzaz Ahmed, Tania Habib, Sarmad Hussain and Shafiq Rahman

Computer-aided Error Analysis of L2 Spoken English: A Data Mining Approach
Yuichiro Kobayashi


Proceedings of the Conference on Language & Technology 2014

Prosodic Phrasing and the Parsing of Modifier Attachment Ambiguity in Deep and Shallow Orthography
Hala Abdelghany
The Graduate Center, City University of New York, New York, USA


Abstract

This paper presents research that investigates the effect of prosodic phrasing on syntactic parsing. The focus is on the ambiguity of a modifier (relative clause or adjective phrase) in relation to the two nouns in a complex noun phrase in Arabic. Ambiguity resolution tendencies for this construction differ across languages. These effects have been shown to occur even in silent reading, so the suggestion is that the parser projects onto a text a default prosodic phrasing which then influences the final syntactic parse and semantic analysis of the sentence. The structure of Arabic permits use of a method for tapping into implicit prosodic boundaries. Liaison, a phonological process occurring across word boundaries, is sensitive to patterns of prosodic chunking in Arabic. These liaison phenomena make the phonological phrasing of Arabic sentences easy to detect in listening. They are also indicated by diacritics in the vowelized version of Arabic orthography, which simulates the overt prosody. Implications for the use of this method in text/speech syntactic annotation, tree banking, and semantic analysis are discussed, as well as implications for building an interface ontology that characterizes the principled interaction of a prosodic and syntactic derivation in sentence parsing.

1. Prosody and Sentence Processing

In recent years, researchers have documented the important role of prosody in sentence processing. Various studies have shown that prosody influences parsing decisions [1, 2, 3, 4]. Keating [5] defines prosody as the organization of speech into a hierarchy of units and domains of definite size and structure, some of which are more prominent than others. Keating [5] points out that prosody serves a grouping function and a prominence marking (contrasting) function in speech. Grouping includes grouping smaller domains into larger ones, such as segments into syllables, syllables into words, or words into phrases. Contrasting includes: stress, which refers to prominence of syllables at the word level; accent, which refers to prominence at the phrase level; and the distribution of tones (pitch events) such as lexical tones associated with syllables, or intonational tones associated with phrases or accents. These various aspects of prosody taken in conjunction are seen as characterizing the prosodic structure of an utterance.

Prosody is used in spoken language to signal how words are grouped into phrases and how each word and phrase contributes information to the discourse. In text these functions are signaled through punctuation and text enhancement (e.g., boldface), but in spoken language, pitch, timing, loudness, voice quality, and other properties of speech convey such information. Linguists propose that underlying the prosody of all languages is a universal structure, the Headed Constituent (HC), and a prosodic hierarchy of prosodic categories, which group syllables and words together, with one prominent element per constituent designated the Head element within a hierarchy. These constituents determine how speech sounds are coordinated (at a physical level), and how speech is mapped onto syntactic and semantic structures that determine utterance meaning.

In NLP, phrase break prediction is a classification task within Text-to-Speech synthesis that attempts to simulate human chunking strategies by assigning prosodic-syntactic boundaries to input text. A boundary-annotated and part-of-speech (PoS) tagged corpus is a key language resource for training such classifiers.
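To make the idea of phrase break prediction as a classification task concrete, here is a minimal sketch, not a method taken from this paper. It assumes a tiny boundary-annotated, PoS-tagged corpus and learns, for each pair of adjacent tags, whether a break usually follows. The function names, the tag set, the `END` marker and the toy sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

END = "</s>"  # hypothetical end-of-sentence marker


def train_break_model(corpus):
    """Learn the majority break label for each (tag, next_tag) context.

    corpus: list of sentences, each a list of (word, pos_tag, break_after) triples.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        for i, (_, tag, break_after) in enumerate(sentence):
            next_tag = sentence[i + 1][1] if i + 1 < len(sentence) else END
            counts[(tag, next_tag)][break_after] += 1
    return {context: labels.most_common(1)[0][0] for context, labels in counts.items()}


def predict_breaks(model, tagged_sentence):
    """Return one boolean per word: True if a break is predicted after it."""
    predictions = []
    for i, (_, tag) in enumerate(tagged_sentence):
        next_tag = tagged_sentence[i + 1][1] if i + 1 < len(tagged_sentence) else END
        # Contexts never seen in training default to "no break".
        predictions.append(model.get((tag, next_tag), False))
    return predictions


# Toy training data (invented for this example).
toy_corpus = [
    [("the", "DET", False), ("mayor", "NOUN", False), ("arrived", "VERB", True),
     ("and", "CONJ", False), ("spoke", "VERB", True)],
    [("a", "DET", False), ("guest", "NOUN", False), ("left", "VERB", True),
     ("but", "CONJ", False), ("returned", "VERB", True)],
]

model = train_break_model(toy_corpus)
predicted = predict_breaks(
    model,
    [("the", "DET"), ("teacher", "NOUN"), ("sang", "VERB"), ("and", "CONJ"), ("danced", "VERB")],
)
print(predicted)
```

A real system would use richer features and a trained statistical model; the lookup table here only shows the shape of the task, with annotated boundaries going in and per-word break decisions coming out.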

2. The Phrasing Function of Prosody

Prosodic phrasing, or the chunking of speech into prosodic groups, has proven to be constrained by a number of different factors such as constituent length, metrical balance requirements, speech rate and syntactic branching, among others. However, the effects these factors have on phrasing are not the same across languages. There are two main ideas within the sentence processing literature on prosodic phrasing. There is the idea that prosodic boundaries cause a break between constituents during the parsing process. There are also theories that claim that the primary role of intonational phrasing in comprehension is to group relevant constituents together [6, 7, 8]. The norm of speech is continuity, or what has been termed a "speech flow", a "stream" or a "word string". Therefore, the hypothesis is that cohesion is the baseline or default in speech, while inserting a break or boundary or demarcation is a process that is employed by users of a language to facilitate comprehension (as well as production). As pointed out by Selkirk [9], as far as the syntax-phonology interface is concerned, languages must opt for a dominant cohesion strategy or a dominant demarcative structure, as represented in the relative optimal rankings of constraints at the interface such as Wrap-XP and Align-XP, R, which are among a number of proposed universal interface constraints on prosodic phrasing [10]. Moreover, empirical research has demonstrated that prosodic phrasing influences syntactic attachment decisions, focus interpretation, and the availability of contextual information in the resolution of lexical and syntactic ambiguity [11, 12, 13, 14].

Considerable evidence shows that speakers organize their speech with reference to an internal notion of cohesion between segments of an utterance. Previous research [16, 17] has shown that hearers interpret a prosodic break before a modifier as a marker of a syntactic boundary which blocks local attachment and prompts high attachment, whereas the absence of a break signals a prosodic grouping effect of cohesion and promotes local attachment. Boundary demarcation is generally measured in current research by acoustic cues such as pitch movements (boundary tones; post-boundary F0 reset) and pre-boundary lengthening of the final stressed syllable. Research has revealed that different languages favor different prosodic phrasings for similar syntactic constructions (Prieto [18] for Catalan; Nibert [19] for Castilian Spanish; Frota [20] for European Portuguese; and D'Imperio [21] for the Neapolitan variety of Italian). Languages also differ with respect to the effects that various constraints have on the possible or preferred phrasing of a given syntactic construction. Whether this is due to parametric variation in rules, or to differences in ordering of constraints, it is something to take note of in performance models of the role of prosody in syntactic processing.

3. Modifier Attachment Ambiguity

A syntactic ambiguity which has been a focus of recent research is the ambiguity of a relative clause (RC) modifier in relation to a complex noun phrase (NP). An example from Standard Arabic (SA) is shown in (1):

(1) zaara -l-muhaafith maktabat-a l-madrasat-i llatii gudiddat
    visited the-mayor library-ACC the-school-GEN that was renovated
    'The mayor visited the library of the school that was renovated.'
    library renovated (high RC-attachment)?
    school renovated (low RC-attachment)?

This construction is important because experiments have shown that speakers of different languages exhibit different ambiguity resolution tendencies (high vs. low modifier attachment), contrary to otherwise universal parsing tendencies. One explanation proposed is a prosodic one. The cross-linguistic effects occur even in silent reading, so the suggestion is that readers mentally project onto the text a default prosodic phrasing (which may differ between languages) that then influences their syntactic ambiguity resolution (the Implicit Prosody Hypothesis, Fodor [15]).

4. The Garden-Path Model

Research in sentence perception and comprehension has generated working models about the nature and design of the sentence processing mechanism. An especially relevant area of investigation concerns the nature of the parser, the machinery that deals with the building of syntactic structure from linguistic input. Researchers have hypothesized that the parser's online tendencies reflect the architecture of the processing mechanism.

According to the influential Garden-Path model of sentence processing [22, 23], principles of parsing are universal and this is what gives the theory its simplicity and explanatory power. The theory argues that, when processing linguistic input, the parser utilizes a set of principles which are part of the inherent architecture of the human language processor. Languages differ in their individual grammars but the principles that navigate and deploy the grammar are the same. Variation is attributed to the individual grammar of the particular language, not to the principles.
Kimball [24] proposed specific parsing principles
based on characteristics that he suggested were
defining of the nature and functional capacity of
human cognition. These principles serve to minimize
storage requirements and bypass difficulties related
to the limitations of human short-term memory.
Frazier & Fodor [23] refined Kimball's principles in
their "Sausage Machine" Model. Among these
principles are Minimal Attachment [23] and Late
Closure [22]. These principles have been argued to
optimize the speed, efficiency and ease of processing
by opting for the analysis which requires the least
effort and exerts the least demands on memory load
and retention on the part of the structure-building
device. These principles are therefore seen to be
inherently driven by the general conditions of the
human cognitive system. These principles reflect
cognitive economy by keeping mental processes as
simple and efficient as possible which in turn serves
to reduce the burden on working memory during processing.
A theory of this kind which attributes all parsing
strategies to the design characteristics entails that
these strategies should apply universally in all natural
languages and invariantly across all constructions.
The Garden-Path theory is based on such a universal
account. A serious challenge to the claim of
universality would be the discovery of cross-linguistic variation in parsing strategies which cannot
be associated with differences in their grammars.
Thus, a major challenge facing the universalist
approach is the apparent failure of the Late Closure
Principle to apply cross-linguistically.
The Late Closure Principle is among the principles
of the Garden-Path model which was first formulated
for English [22] and assumed to hold across all
languages. Late Closure is similar to Kimball's [24]
principle of Right Association:

Kimball: Terminal symbols optimally associate to the lowest non-terminal node.
Frazier: When possible, attach incoming material into the clause or phrase currently being processed.

The implication for parsing is that when the parser has a choice between a non-local (distant or high) and a local (recent or low) attachment, it will choose to attach to the local attachment site. Closure ambiguity involves a problem connected with determining constituent boundaries in two different ways. A familiar example of this principle is the following from Kimball [24]:
(2) Tom said that Bill left yesterday.
In (2) the parser is faced with a structural ambiguity. The adverb "yesterday" can be interpreted as modifying either the first verb "said" or the second verb "left". Late Closure correctly predicts the preferred reading: the ambiguously attached constituent "yesterday" is typically interpreted as attaching low.

Research has shown that Late Closure applies in a variety of constructions in English and in other languages as well (see [22, 24] for a number of English constructions). However, the Late Closure principle has since been challenged, initially by researchers working on Spanish [25] and subsequently by a number of researchers studying other languages. A certain type of ambiguity has been the focus of this research question: the complex noun phrase construction. For a complex NP with two nouns, such as (3), a relative clause modifier may be associated with the first noun (N1) "the daughter" or the second (N2) "the colonel".
(3) Someone shot the daughter of the colonel who was on the balcony.
The Late Closure Principle predicts that the relative clause attaches to the NP currently being processed, and thus is interpreted as modifying N2 (low attachment). This prediction holds for English, not dramatically but fairly reliably [25]. N2 preference (low attachment) holds also in other languages including Swedish, Norwegian and Romanian [26]. But N1 preference (high RC attachment) has been observed in other languages, e.g., Spanish [25, 27, 28], German [29], Dutch [30] and French [31]. In a study using written sentences in an ambiguity resolution questionnaire, Cuetos & Mitchell [25] tested the interpretation preference for sentences like (3) in English and Spanish. Their results revealed that when asked to make a choice between two alternative interpretations, English-speaking subjects preferred the second noun or the lower site interpretation, while Spanish-speaking subjects preferred to attach the RC to the higher site.

5. The Prosodic Account

The argument proposed by the prosodic account with regard to cross-linguistic variation is that different attachment preferences in sentences containing the relative clause ambiguity are to be explained by differences in the prosodic component of the grammar. The assumption is that it is differences in the characteristic prosodic packaging patterns of languages that give rise to cross-linguistic variation in RC-attachment. This account has its roots in the observation made within the "Sausage Machine" framework [23] that packaging, a function of the lengths of constituents, affects parsing decisions. In 1978, the need for packaging was ascribed to short-term memory limitations, but in 1998 the similarity to prosodic packaging became apparent.
In reading, according to Fodor [32], prosodic
packaging is assigned prior to a full syntactic analysis
of the word string; hence it cannot show much
sensitivity to syntactic alignment principles except
where the relevant syntax is quite local. In reading,
therefore, prosodic packaging is determined largely
by other factors, such as constituent weight. Prosodic
heaviness may be counted in terms of length
(number of syllables or pitch accents) but also
possibly in terms of syntactic branching and/or
syntactic category. The central idea is that large
constituents tend to attach high, and small ones low
(Anti-gravity [23]). In Fodor [32], Anti-gravity is
expressed in the Same-Size Sister Principle, which
is a close relative of other principles favoring
geometrical tree balance such as the Uniformity
Principle and Balance Principle.

Same-Size Sister Principle:
Attach a constituent to a sister of its own size.

Fodor [32] advances the idea that in silent reading a prosodic contour is mentally imposed on the input and that the syntactic parser is sensitive to that prosodic phrasing. In the case of a syntactically ambiguous sentence, the hypothesis is that in silent reading the prosodic parser assigns a preferred (default) prosodic contour and then the syntactic parser favors the analysis most compatible with that contour. According to the Implicit Prosody Hypothesis, a syntactic ambiguity may be resolved according to whichever of the alternative structures is most compatible with the default intonational patterns and prosodic groupings of words in that language. Since languages have different prosodic patterns, this can give rise to cross-linguistic differences in the preferred attachment of a relative clause. Language-specific prosodic constraints may influence the packaging of an RC in relation to its preceding noun, which will tend to influence the syntactic tree structure that is assigned to the ambiguous word string.

6. Arabic Phonology and Orthography

Since implicit (silent) prosody cannot be directly observed, previous research has had to infer it by analogy with overt prosody. Specifically, the unique properties of the genitive construction in Arabic with its phonological properties of prosodic cohesion, as well as specific features of the orthographic system of Standard Arabic (SA), offer an interesting testing ground for theories that deal with the role of prosody in parsing. This paper presents work showing that the phonology and orthography of SA permit use of novel methods for tapping into the silent prosody of readers. Liaison phenomena sensitive to prosodic boundaries make phonological phrasing in SA very easy to detect. Also, liaison is indicated by diacritics in the vowelized version of SA orthography. This research shows that clear data on prosodic phrasing patterns in SA complex nominals can be related to their preferred syntactic/semantic interpretations.

7. Annotation of Phrase Boundaries

The standard model for prosodic annotation of machine-readable text is ToBI, the Tones and Break Indices annotation scheme [33]. It identifies five theoretical levels of juncture between words, {0, 1, 2, 3, 4}, and focuses on two types of events in the speech contour, namely pitch accents and prosodic phrase boundaries, via a discriminating set of labels, as in the following example (Table 1).

Tone Tier            L* H                 L* H-H%
Orthographic Tier    Will you have marmalade, or jam?
Break Index Tier     1                    4

Table 1: Example ToBI transcription from Guidelines for ToBI Labeling.
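As a rough illustration of how a Break Index tier can drive chunking, the sketch below closes a prosodic phrase wherever the index after a word reaches a chosen threshold, here 3 or above, i.e. a minor or major boundary in ToBI terms. The function name, the threshold default and the per-word indices in the usage example are assumptions made for illustration. Table 1 preserves only two index values, so the list below is hypothetical and is not the transcription from the ToBI guidelines.

```python
from typing import List, Sequence


def chunk_by_break_index(
    words: Sequence[str], breaks: Sequence[int], threshold: int = 3
) -> List[List[str]]:
    """Group words into phrases, closing a phrase after any word whose
    following break index is at or above `threshold`."""
    if len(words) != len(breaks):
        raise ValueError("need exactly one break index per word")
    phrases: List[List[str]] = []
    current: List[str] = []
    for word, index in zip(words, breaks):
        if not 0 <= index <= 4:
            raise ValueError(f"break index out of range 0-4: {index}")
        current.append(word)
        if index >= threshold:
            phrases.append(current)
            current = []
    if current:  # trailing words that were never closed by a strong break
        phrases.append(current)
    return phrases


# Hypothetical break indices after each word (assumed for this example).
words = ["Will", "you", "have", "marmalade", "or", "jam"]
breaks = [1, 1, 1, 3, 1, 4]
phrases = chunk_by_break_index(words, breaks)
print(phrases)
```

Raising the threshold to 4 would keep only major boundaries, so the whole utterance would come out as a single phrase under these assumed indices.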

The Break Index (BI) tier recognizes five degrees of juncture between words in an utterance: {0, 1, 2, 3, 4}. Break index {0} denotes no separation or cliticization, while index {1} applies to most phrase-medial junctures between words. Index {2} is a special (and somewhat ambiguous) case, denoting either a hesitation that does not affect the tonal contour, or a disjuncture that is less strong than expected. Indices {3} and {4} correspond to minor and major boundaries. Pitch accents are transcribed in the Tone tier; in the above example the word "marmalade" exhibits a low accent (L*) on the first syllable rising to a high phrase accent (H) at the

boundary site (%). This paper presents a break index scheme for Arabic which is cued mainly by post-lexical segmental phonological processes at word edges. The following are proposed break indices for degrees of boundary strength in SA, correlated with the respective phonological process(es):

Break Indices (BI) for Boundary Strength in Arabic
BI 0 = Highest degree of close cohesion between two words: liaison with enchainment &
BI 1 = Liaison without enchainment
BI 2 = Short vowel at word ending without
BI 3 = Tanwiin (cannot undergo enchainment)
BI 4 = Short vowel deletion, Sukuun, deletion of Tanwiin, ah- form

These break indices were used to analyze the corpus and experimental data reported in this paper. Examples of these correlates of boundary strength are given in the following.

(4) a- xa:dimat almumathilah allati
       servant the actress who
    b- xa:dimat almumathilatillati
       servant the actress who
       (liaison with enchainment)

In 4 (a), N2 (the second noun in the complex noun phrase) almumathilah "the actress" has the pausal form "ah", not the contextual form "at", and the definite article, which begins the following relative marker, retains its glottal stop consonant. The use of the pausal form indicates a break between N2 and the relative marker. In 4 (b), N2 has the contextual form "at", which forms a liaison with the relative marker (no break), causing a deletion of the glottal stop and an assimilation of the definite article of the relative marker.

(5) Liaison without enchainment.
    a- servant-NOM-INDEF. actress-GEN-INDEF. famous
       'a servant of a famous actress'
    b- khaadimat-u mumathilat-i-n
       servant-NOM actress-GEN-INDEF famous
       'a servant of a famous actress'
    c- khaadimat-u-n
       'a poor servant'

In 5 (b) the n-tanwiin marker is deleted and only the t-liaison + vowel case marker is retained. Note, however, that there is no re-syllabification of the t-liaison and vowel with the initial consonant of N2. This kind of liaison without enchainment has been observed in French and here in SA in examples like (5).

8. Corpus Analysis
Three sources of data from three genres of corpora were analyzed:
- Quranic Arabic Corpus Data.
- Television News Broadcasting from Al-Jazeera.
- Radio Broadcasting from Al-Bernaamag Al-aam.

A search was made for the occurrence of the relative marker in the whole of the Quranic text (individual searches were done for each type of relative marker: masculine singular, feminine singular, masculine plural, feminine plural, and dual). The corpus was from a Dual Dependency Constituency Treebank corpus [34]. 1464 tokens of the relative marker were extracted from the search. For each token that was found, a documented file description was created of the type of noun phrase that the relative clause modified. This description included the length of the relative clause (in prosodic words, PWd), the structure of the sentence, the type and number of nouns in the NP, as well as the attachment height if applicable. The description also included marking the occurrence of a boundary on perceiving a pause in the recitation (the software included an audio recitation) as well as on the basis of explicit diacritic marking of a break in the orthography.

Data were also collected from Al-Jazeera TV, an Arabic language news channel. The total duration of the recordings was 420 minutes of news time. The target construction N1 N2 RC was then extracted, and a sound file was created for each extracted sentence using Speech Analyzer software. There were 58 such sentences in total. A file card was created for each sentence which documented the sound file number, recorder tracking number, newsreader codename, newsreader gender, speech type, sentence structure, sentence position, attachment height, RC length, and a written version of the sentence in Arabic and an English gloss. Each extracted sentence which included the target construction was then segmented and the perceived pauses were noted. The intended meanings of the sentences were disambiguated by syntactic feature agreement (gender or number) where possible, and elsewhere by compatibility with the semantic context.
The Egyptian Radio News Corpus was from Skogseth [35]. All recordings were made between September and November 1998 in Central Cairo. The data collected amount to 6 hours and 52 minutes of recording time, with 161 pages of transcribed data. In the transcription, Skogseth marked perceived pauses with forward slashes: a short pause with a single slash, a longer pause with two, and a very long break with three. A manual search of the romanized phonetic transcription of the Arabic news broadcasts was conducted to extract all instances of the target construction of the present study. The search extracted 258 occurrences of the relative marker in the corpus. The sentences were disambiguated either by gender or number markings, or by the semantic and pragmatic context. For each sentence, a note was made of the attachment height of the RC as well as the presence or absence of a break before the RC, based on Skogseth's index of perceived pauses, verifying these perceived pauses and correlating them with the transcribed segmental data of word endings (vowels) as a cue of a boundary.

9. Corpus Results

When comparing the findings from all three sets of data as shown in Figure 1, there is an overall tendency to attach low rather than high, as well as a tendency to place a break before the RC when attaching the RC modifier high in both news corpora. The Quran data show few breaks when attaching high; this, however, may be attributed to the lack of sufficient data found in the Quranic text to warrant such a comparison. Furthermore, there was also a tendency not to break when attaching the RC modifier low across all corpus data.

Perception Series (Attachment Preference)

The findings obtained from the data presented in

this study point to a close correlation of modifier
attachment height with certain tendencies of prosodic
phrasings. The following patterns of phrasings in
Standard Arabic emerge: a prosodic phrasing in
which there is no boundary before the RC correlates
more with low attachment of the modifier whereas a
prosodic phrasing in which there is a boundary before
the modifier correlates more with high attachment.

Experiment 1

Attachment preferences for silently read

sentences presented with an unvowelized
(Standard Arabic) orthography.

Experiment 2

Attachment preferences for overtly read

sentences presented with an orthography
vowelized to block a phrasing break
internal to the target construction, i.e.,
[N1-N2-Modifier] phrasing.

Experiment 3

Attachment preferences for overtly read

sentences presented with an orthography
vowelized to imply a phrasing break
between complex noun phrase and
modifier, i.e., [N1-N2][Modifier] phrasing

Two series of psycholinguistics experiments are

reported here. The first examined Arabic speakers
preference for the attachment of a modifier (relative
clause, RC or adjective phrase, AP) in sentences
containing a complex NP with two nouns. The
second series examined Arabic speakers prosodic
phrasing preferences for the same sentences. Six
(Experiments 1, 2 and 3), and three production
experiments (Experiments 4, 5, and 6) were
conducted (Table 2). The aim of these experiments is
to shed light on the hypothesis that there is a causal
relationship between prosodic phrasing and
attachment decisions in sentence processing.


Radio News



Radio News




10. Sentence Processing Experiments

Production Series (Prosodic Phrasing)

Experiment 4


Figure 1: Mean percentage of low and high attachment

across all three kinds of corpus data. Graph also shows
proportion of breaks and absence of breaks for both low
and high attachment

Prosodic phrasing patterns implied by

vowelizations that participants inserted
(tashkeel procedure) into silently read
sentences presented with an unvowelized

Proceedings of the Conference on Language & Technology 2014

Experiment 5

Overt prosodic phrasing patterns in

utterances elicited with a protocol
disambiguating for low-attached modifier

Experiment 6

Overt prosodic phrasing patterns in

utterances elicited with a protocol
disambiguating for high-attached modifier

prosodic constraints include a constraint that

prohibits a prosodic break between the two nouns of
the genitive construction. There is also a constraint
that prohibits short phonological phrases.
Fodor's (2002) Structural Interpretation of
Prosody Principle (SIPP) proposes an intricate
interface relationship between the production of
prosodic structure and the perception of syntactic
structure. The idea is that prosodic structure is
assigned by the mediation of factors such as phrase
length and syntactic category and then this projected
prosodic pattern is interpreted as signaling a syntactic
configuration. Both the tendencies found in the
different sources of corpus data analyzed as well as
the tendencies found in the data in the series of
experimental studies show that prosodic phrasing
patterns in SA complex nominals are strongly related
to their preferred syntactic/semantic interpretations.
The data also supports the view that in silent reading,
mentally projected prosody significantly influences
the final syntactic representation of the sentence.

Table 2: Overview of two experimental series

In Standard Arabic the feminine noun suffix ends in

/ah/ preceding a prosodic boundary, but shows a
liaison consonant /t/ when no boundary follows. The
/t/ carries the case marker vowel (ACC /a/ and
GEN /i/), and together they syllabify with the onset of
the following word. This is illustrated in (2b):


a. maktabat-a l-madrasah allatii gudiddat
library-ACC the school that was renovated
(No N2-RC liaison)
Prosodically: (N1 N2) (RC)

b. maktaba-a l-madrasatilatii gudiddat
library the school-GEN was renovated
(N2-RC liaison)
Prosodically: (N1 N2 RC)

This study has pointed to the need for a detailed

functional interface model that characterizes the
principled interaction of prosodic constraints and
syntactic structure for sentence processing tasks. It
also highlights the importance of prosody-integrated
parsers for prosodic annotation in syntactic
dependency parsing, as well as for phrase break prediction
within Text-to-Speech synthesis, which is
needed to optimize the chunking of text for
semantic analysis. The methodology presented in this
paper can also be extended and used to test prosodic
phrasing in other languages with similar phonology
and orthography such as Urdu.

Participants in Experiment 1 silently read sentences

like (6) above presented without vowel diacritics, and
added diacritics as they thought appropriate. This is a
standard task, called tashkeel, in Arabic. The
diacritics that listeners inserted gave evidence of their
implicit prosodic phrasing of the sentence:
with/without a break before RC. Results showed a
strong bias for the liaison prosody (6b).
The orthography was then put to a different use in
assessing modifier interpretation under varying
prosodic conditions. In Experiments 2 & 3,
vowelized text was presented, establishing one
prosodic pattern or the other. Participants read aloud
and then indicated their interpretation of the sentence.
Attachment preferences were found to differ
significantly depending on the prosodic phrasing
imposed by the orthography. This provides two
standards against which to compare attachment
preferences in silent reading of unvowelized texts
(lacking prosodic disambiguation) in Experiment 4.
In Experiment 4, with prosodically ambiguous
materials, the observed attachment preference was
low, more similar to that for the N2-RC liaison prosody
(6b) than to that for the non-liaison prosody (6a).
Data showed that, in Standard Arabic, a
phonological phrase can contain more than 4-5
prosodic words. This tolerance for long phrases is
argued to be facilitated by a liaison phenomenon that
holds successive words together. Also relevant

[1] Pynte, J., Prieur, B. 1996. Prosody Breaks and Attachment
Decisions in Sentence Parsing. Language and Cognitive
Processes, 11, 165-191.

Schafer, A., Speer, S.R., Warren, P., White, S.D. 2000.

Intonational disambiguation in sentence production and
comprehension. Journal of Psycholinguistic Research


Watson, D., & Gibson, E. 2001. Linguistic structure and

intonational phrasing. Paper presented at the 14th Annual
CUNY, Conference on Human Sentence Processing,
Philadelphia, PA.

[4] Clifton, C., Carlson, K., & Frazier, L. 2002. Informative

Prosodic Boundaries. Language and Speech, 45(2), 87-114.


[5] Keating, P. 2003. Phonetic encoding of prosodic structure.

In S. Palethorpe and M. Tabain (Eds.), Proceedings of the 6th
International Seminar on Speech Production, Macquarie


[6] Speer, S., Kjelgaard, D., Dobroth, K. 1996. The influence of

prosodic structure on the resolution of temporary syntactic
closure ambiguities. Journal of Psycholinguistic Research,
25: 247-268.
[7] Kjelgaard, D., Speer, S. 1999. Prosodic facilitation and
interference in the resolution of temporary syntactic closure
ambiguity. Journal of Memory and Language, 40, 153-194.


Frazier, L., & Fodor, J. D. 1978. The sausage machine: A

new two-stage parsing model. Cognition, 6, 291-325.


Kimball, J. 1973. Seven principles of surface structure

parsing in natural language. Cognition, 2, 15-47.

[25] Cuetos, F., Mitchell, D. C., & Corley, M. 1996. Parsing in

Different Languages. In Manuel Carreiras, Jose E. Garcia-Albea & N. Sebastian-Galles (Eds.), Language Processing
In Spanish. Mahwah, NJ: Lawrence Erlbaum Associates.

Frazier, L., & Clifton, C. Jr. 1998. Sentence Reanalysis, and

Visibility. In J. D. Fodor & F. Ferreira (Eds.), Reanalysis in
Sentence Processing (pp.143-176). Amsterdam: Kluwer.

[9] Selkirk, E. 2000. The interaction of constraints on prosodic
phrasing. In M. Horne (Ed.), Prosody: Theory and
Experiment (pp. 231-261). Dordrecht: Kluwer.

Selkirk, E. 2002. Contrastive FOCUS vs. presentational

focus: Prosodic evidence from right node raising in English.
In Speech Prosody 2002: Proceedings of the 1st International
Conference on Speech Prosody, Aix-en-Provence, 643-646.


Ferreira, F. 1993. Creation of prosody during sentence

production. Psychological Review 100:233-253.


Jun, S.-A. 2003. Prosodic Phrasing and Attachment

Preferences. Journal of Psycholinguistic Research, 32(2), 219-249.


Shattuck-Hufnagel, S. and Turk, A. 1996 A prosody tutorial

for investigators of auditory sentence processing. Journal of
Psycholinguistic Research 25(2). 193-247.


Watson, D., Wagner, M., Gibson, E. (Eds.). 2010.

Experimental and Theoretical Advances in Prosody: A
Special Issue of Psychology Press, Nov 12, 2010.


Fodor, J.D. 2002. Psycholinguistics cannot escape prosody.

Paper presented at Speech Prosody 2002, April 11-13, Aix-en-Provence, France.


Lovrić, N. 2003. Implicit prosody in silent reading: Relative

clause attachment in Croatian. Unpublished Doctoral
dissertation, City University of New York, New York,


Prieto, P. 2003. Syntactic and eurhythmic constraints on

phrasing decisions in Catalan. Paper presented at the GLOW
Workshop III, Boundaries in Intonational Phonology.


Nibert, H. 2000. Phonetic and Phonological Evidence for

Intermediate Phrasing in Spanish Intonation. Ph.D thesis,
University of Illinois, Illinois.


Frota, S. (2000). Prosody and Focus in European

Portuguese. New York: Garland.


D' Imperio, M., Elordieta, G., Frota, S., Prieto, P., Vigario,
M. 2003. Intonational phrasing and constituent length in
Romance. 1st PaPI- Phonetics and Phonology in Iberia.

Ehrlich, K., Fernandez, E.M., Fodor, J.D., Stenshoel, E., &

Vinereanu, M. 1999. Low attachment of relative clauses;
New data from Swedish, Norwegian and Romanian. Poster
presented at the 12th Annual CUNY Conference on Human
Sentence Processing. New York, NY, March 18-20.

[27] Carreiras, M., & Clifton, C. 1999. Relative clause

interpretation preferences in Spanish and English.
Language and Speech, 36.

Gilboy, E., Sopena, J.M., Clifton, C. Jr., & Frazier, L.

1995. Argument structure and association preferences in
Spanish and English complex NPs. Cognition, 54, 131-167.


Hemforth, B., Konieczny, L., & Scheepers, C. 1999.

Syntactic attachment and anaphor resolution: Two sides of
relative clause attachment. In Matthew W. Crocker, Martin
Pickering & J. Charles Clifton (Eds.), Architectures and
Mechanisms for Language Processing. Cambridge:
Cambridge University Press.


Brysbaert, M., & Mitchell, D. C. (1996). Modifier

attachment in sentence parsing: Evidence from Dutch.
Quarterly Journal of Experimental Psychology, 49A (3),

[31] Zagar, D.,Pynte, J., & Rativeau, S. (1997). Evidence for

early-closure attachment on first-pass reading times in
French. Quarterly Journal of Experimental Psychology,

Maynell, L. 1999. Effect of pitch accent placement on

resolving relative clause ambiguity in English. Poster
presented at the 12th Annual CUNY Conference on Human
Sentence Processing. Scotland.


Lisbon, Portugal.
Frazier, L. 1987. Sentence Processing: A tutorial overview.
In M. Coltheart (Ed.), Attention and Performance XII: The
Psychology of Reading. (pp. 559-586.). Hillsdale, NJ.:


Fodor, J. D. 1998. Learning to Parse. Journal of

Psycholinguistic Research, 27(2), 285-319.

[33] Beckman, M. E., & Ayers, G. M. 1993. Guidelines for

ToBI Labeling. Version 3. Unpublished manuscript. The
Ohio State University Research Foundation. Department of
Linguistics, Ohio State University.


Dukes, K. 2011. The Quranic Arabic Corpus. Online.

Accessed: August 2011. http://corpus.quran.com.


Skogseth, J. G. 2000. Egyptian Radio Arabic: A

Phonological Analysis of the Language in EBA News
Broadcasts. Master's Thesis in Arabic, Institute of East European and Oriental Studies, Faculty of History and
Philosophy, University of Oslo, Norway.


What's in a name? Automatic extraction of lexical and functional units of

Pakistani names
Tafseer Ahmed*, Naila Ata
*DHA Suffa University, Karachi
tafseer@dsu.edu.pk, naila.ata@gmail.com
This paper focuses on the analysis and automatic extraction of components of Pakistani names. Many Pakistani names are different from the standard given,
middle and family name patterns. More components
may occur in Pakistani names and their order is also not
In many cases, the first component of a Pakistani
name is the given/first name of individual. However,
many names start with family name or religious middle
name. Sometimes titles, e.g. Hafiz, are written in official
documents. For this reason, there is a requirement for a
system that accepts Pakistani names and gives the
standard components as the output. Additionally, it
should also extract the other features and/or details of
the components.
We studied different naming patterns and terminology used around the world. Using this background, we
developed a two-level tagset for Pakistani names.
The rest of the paper is organized as follows.
Section 2 reviews previous work on the analysis of international names, including Pakistani and South Asian
names. Section 3 presents the Part of Speech tagset designed to model Pakistani names. Section 4
introduces different methods of automatic POS tagging. Section 5 describes our tagging experiment and its
results. Section 6 gives the conclusion.

The paper describes a two-pass word tagging system
for the extraction of the first name and surname from a
Pakistani (full) name string. The full name in Pakistan
does not follow a single fixed pattern. The order of its
components is flexible, and the simple pattern of first-name middle-name last-name is not applicable. There
are many peculiarities, e.g. in the absence of a family
name, the middle name serves as the surname. To extract the first name and surname, two sets of tags are designed. The first tagset consists of personal-name, family-name, religious-middle-name, particle and title. The
second tagset consists of first-name, surname, title and
middle-name. The output of the first POS tagging subsystem is fed to the second subsystem. The evaluation
gives 90+% accuracy using a POS tagger.

1. Introduction
For many official processing tasks, the world uses a common name pattern that has its roots in Western (or some Western) traditions. A person's
full name is not a single entity. It consists of more than
one component. Internationally, the most commonly
used pattern is: Given-Name, Middle-Name and Family-Name. An example is:
(1) Stephen
Given-Name Middle-Name

2. Previous work on name pattern


As name patterns differ across nations, several studies have sought to discover and present the
variety of patterns and the common elements among
them. For example, studies have been made of Chinese
names [3] and Korean names [4]. The United Kingdom's
General Secretariat published a report about the name patterns of different nations [2]. In this paper, we refer to this
report as the UK report. The study was carried out to understand
the names of immigrants and visitors. It uses the following labels:
a. personal name (of the individual)
b. middle name
c. family name
d. father's personal name: as a component of a
child's full name

In this paper, we refer to the above pattern as the standard pattern. However, the whole world does not follow this pattern. There are many name patterns in
which the order of these components is not the same.
Some traditions call for more or fewer components, e.g. Spanish names may include the names of both the
paternal and maternal families [1]. In some areas of
Southern India and Sri Lanka, the pattern is: place-name
father-name personal-name [2].
Despite this variation, we need the components
first/given, middle and family in many international
documents, including passports and visas. Hence, there is
a requirement for discovering the alignment between different systems and patterns.



e. patronymic: name derived from the personal name
of the father
f. grandfather's personal name: as a component of
a grandchild's full name
g. (honorific) title: a name or title given to, or used
by, a person as a mark of status or respect
h. religious name: a name or title given to, or used
by, a person according to the tradition of the person's religion.
Most of these labels are self-explanatory. The term religious name needs some explanation and examples. Among
South Asian Muslims, Sikhs and Hindus, certain names
are used as part of the person's religious tradition. Among
Sikhs, the name Singh is used for men, and Kaur is
used for women. The Wikipedia page about the Indian
hockey player Tarsem Singh mentions that Singh is not
his family name, and that he should be referred to by his given
name, Tarsem.
The UK report gives the following typical components of a name: personal name + family name. In
South Asia, the name pattern of Hindus of Northern India is:
personal name + [middle name] + family name
The common middle names are Kumar, Parkash, Chander, Lal, Nath, Datt (for males) and Kumari, Rani, Lakshmi, Devi (for females). According to this report, the pattern for Sikhs is:
personal name + religious name [+ family name]
The report describes Pakistani names as:
Male: religious name(s) + personal name(s) [+ family name] (in any order/combination)
Female: personal name(s) + female honorific title(s)
[+ family name] (in any order/combination)

Another work on international name patterns was conducted by the Regional Organized Crime Information Center, USA [5]. For Pakistani names, it quotes the UK report.
A detailed study of the components and structure of
names was conducted by [1]. It not only introduces
different tags for the components of a name, but also discusses
the phrase structure of the name. The discussion is in
depth for Western names, but the issues of
Pakistani/South Asian names are not described.
Schone and Davey [1] presented a fine-grained
tagset for modelling names. For example, it has four
types of Family name tags.
FN = Family Name, with the following subsets
FNF = Family Name of the Father
FNM = Family Name of the Mother
FNS = Family Name of the Spouse
FND = Name derived from a family name
They also introduced phrases for different components of a name and hence represented the name as a tree
structure. The following is an example of the bracketed
parse tree of the name John F. Kennedy.
(2) [NAME [GNP [GN John] [ABBRI F.]] [SNP
[FNF Kennedy]]]
In the above example, GNP stands for Given Name
Phrase and SNP stands for Surnominal Phrase.
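
The bracketed notation can be read mechanically. The following is a minimal sketch, assuming that every bracket opens with a label and that words contain no spaces or brackets; the function names parse_bracketed and tagged_words are hypothetical and are not taken from Schone and Davey's treebank tools.

```python
def parse_bracketed(text):
    """Parse a bracketed name tree into nested (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in ("[", "]"):
            raise ValueError("expected a label after '['")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())   # nested phrase or tag
            else:
                children.append(tokens[pos])    # a word of the name
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ']'")
        pos += 1                                # consume the closing bracket
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unbalanced brackets")
    return tree


def tagged_words(tree):
    """Return (tag, word) pairs for the words of the tree, left to right."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
        else:
            pairs.append((label, child))
    return pairs


tree = parse_bracketed("[NAME [GNP [GN John] [ABBRI F.]] [SNP [FNF Kennedy]]]")
print(tagged_words(tree))
```

Flattening the tree to (tag, word) pairs gives the kind of token-level labelling that a sequence tagger works with, while the nested tuples keep the phrase structure.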

3. Modelling Pakistani name

The discussion in section 2 introduced
different labels/tags to model the components of a full
name. It also presented the previous work on the
structure of Pakistani names. In this section, we present
our study of the different patterns and components
of Pakistani names in section 3.1. Then, a tagset for
Pakistani names is proposed in section 3.2.

The report mentions that titles are also part of the
names. The description of Religious Name (for Pakistani names) in this report needs some discussion. Religious Name is presented as a big miscellaneous entry.
It includes suffixes e.g. Allah or ullah, family names
e.g. Syed, and the names of historical religious personalities e.g. Muhammad. We disagree with this definition
(or these examples) of religious names. In our opinion,
the suffixes are part of the personal name. The Religious Name (for Muslims) should be confined to
the names of historical religious personalities.
Another claim of the UK report needs revision. It
says that different name components can occur in
any order. This claim is not true. It is true that
the order of components is flexible, but some patterns
do not exist or are very rare, e.g. we do not find any
example of the following pattern in Pakistani names:
father-name family-name given-name.

3.1. Analysis
The following are important components and features of Pakistani names.
a. Many Pakistani names follow the standard
pattern: given-name family-name. For example:
(3) Younus

The family name usually appears at the end, but

some family names can occur at the beginning. For
(4) Syed
given-name middle-name

Some of the first occurring names can appear at the end.

For example,
(5) Amina

Given name is the first name of the person.



Some names have multiple family names e.g.

(6) Usman
family-name family-name

On the second layer, e.g. in (7), it is the family name (or surname). We term this layer the functional layer because it
models the function of a particular word in the name.
The following are the tags for the lexical layer.

The reason for multiple family names is the hierarchy

of these names. Shinwari is a sub-family of Khan.

Personal-name (PNAME): It includes (i) the first
name of the person and (ii) the name of the father, husband or other relative that has become part of the person's full
name. A full name can have more than one personal
name. Examples are Shahid, Zaheer, Amina and Kiran.

There is a set of personal names that are used as
given names. Moreover, the personal name of the father
or husband can be used as the surname in the absence of a family name. For example,
(7) Umar

Religious-middle-name (RNAME): These are the
middle names that are not the name of one's father, husband etc. These may occur at the beginning or end of
the full name. Examples are Muhammad, Hussain
etc. Some of these are also used as personal names.
Family-name (FNAME): Family names are not
used as personal names. Examples are Syed,
Khan, Rizvi, Bugti and Malik.


The names of some historical religious personalities are used as middle names.
In this subsection, we use the term religious middle
name for these names. They may occur at the beginning of the name or after the given name.
(8) Muhammad


Title (TITLE): Examples of titles are Mr., Dr.,
Hafiz and Moulana. Sometimes a title can occur at the end of the
name, e.g. advocate.

As most of these religious middle names, e.g. Hussain and
Ahmed, are also used as personal/given names, it is difficult even for human taggers to disambiguate them or to guess their correct category.


Particle (PRT): The tokens that connect different parts
of a name. Examples are bin and bint, which connect the child's
name with the parent's name.

Titles are sometimes used in official documents,
e.g. registration forms, identity cards etc. During
the data collection (described in section 5), we found
the title Hafiz with the names of many students.
Similarly, the titles Haji, Maulana, Dr. and General
(retd.) are present with the names of parliamentarians.

As described earlier, the second-layer tagset deals
with the functional aspect of the components, i.e. it defines
whether a certain personal name is used as the given name,
surname or middle name in a particular full name. We propose the following tags for this layer.

The following example has most of the name components that are discussed above.
(9) Shahibzada

Given-name (GNAME): One of the personal names is
tagged as the given-name. In Umar Akmal, both Umar
and Akmal are personal names (on the lexical layer). However, Umar is the given name on the functional layer.
Surname (SNAME): The family-name is tagged as the
surname. If there are multiple family-names, then all of
them are tagged as surname. In the absence of a family-name, the last personal-name or religious-middle-name acts as the surname.
In Umar Akmal, the component Akmal (personal name
on the lexical layer) is tagged as surname. In Moin Khan,
the component Khan (family name on the lexical layer) is
tagged as surname.
Almost all full names have at least one surname.
However, there are names that consist of a single word.
The word is usually the personal name, and we classify
it as personal name.

3.2. Tagset Design

On the basis of the analysis presented in 3.1, we designed a two-layered tagset for Pakistani names.
The reason for two layers is that the components of a
name have two different types of function or categorization scheme. A personal name, e.g. Akmal, can be
used as a given or first name. But the same name can
serve as the surname, in the absence of a family name, in
his child's or wife's name, as in (7).
Hence, we define two layers of analysis. On the first
layer (termed the lexical layer), Akmal is a personal name.


Middle-name (MNAME): After the assignment of
given-name and surname, the remaining untagged personal-names and religious-middle-names are tagged as
middle name.

Word Cluster Path: [9] reports that tagging accuracy
can be improved by using word cluster features from an
untagged dataset. We also created word clusters of
names using the Brown word clustering algorithm [10] and
used the cluster path as a feature.

Title (TITLE): The title of lexical layer remains title


5. Experiment and Results

For supervised learning, 1516 names were tagged as
training data, and 505 names were tagged as test data.
The names come from a list of students and the list of members of the National Assembly of 2008-2012.
Pakistani names are written using the modified Arabic
script of Pakistani languages, e.g. Urdu, Sindhi and
Pashto. However, English is the de facto official language of Pakistan, so most name lists in government and private organizations are written in roman
script. In this experiment, we used names written in
roman script.
We used the TweetNLP tagger [11] for this task, which is an
MEMM tagger and makes it easy to use various distributional features.
For first-level tagging, we found that the sequence of previous tokens and affixes are the most important features,
which give 96.5% accuracy; accuracy improved slightly with the
word cluster feature. However, the improvement is not
very significant. Table 5.1 shows the accuracy of the
tagger with different feature sets. We analyzed the
misclassified tags and found that the causes of errors
are: unknown family names, unknown titles, unknown
abbreviations of names or titles, and names that are assigned multiple tags in the training data. For example,
Mohammad is used as a religious name as well as a first
name. Table 5.2 shows the overall accuracy and the accuracy on unknown words of our final tagger.
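
The two figures compared in Table 5.2 can be computed as in the sketch below. It assumes that an unknown word is a test token that never occurs in the training data (compared case-insensitively); the function name accuracy_report and the toy data are hypothetical and are not the paper's evaluation script or results.

```python
def accuracy_report(gold, predicted, training_vocab):
    """Overall accuracy and accuracy on words unseen in training.

    gold: list of (word, tag) pairs from the test data
    predicted: list of tags produced by the tagger, aligned with gold
    training_vocab: set of lower-cased words seen in the training data
    """
    total = correct = unk_total = unk_correct = 0
    for (word, gold_tag), pred_tag in zip(gold, predicted):
        hit = gold_tag == pred_tag
        total += 1
        correct += hit
        if word.lower() not in training_vocab:   # assumption: unknown = unseen
            unk_total += 1
            unk_correct += hit
    return {
        "overall": correct / total if total else None,
        "unknown": unk_correct / unk_total if unk_total else None,
    }


# Toy illustration only; these are not the tagged names used in the paper.
gold = [("Umar", "PNAME"), ("Akmal", "PNAME"), ("Bugti", "FNAME"), ("Hafiz", "TITLE")]
predicted = ["PNAME", "PNAME", "PNAME", "TITLE"]
report = accuracy_report(gold, predicted, {"umar", "akmal", "hafiz"})
print(report)
```

Reporting the unknown-word figure separately makes it visible how much of the remaining error comes from names, titles and abbreviations that the tagger has never seen.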

Residual (RES): Any other untagged component is

tagged as residual. The particle tag becomes residual
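
The relation between the two layers can be illustrated with a small rule-based sketch. The system described in this paper uses a second trained tagger for the functional layer, not hand-written rules, so the code below is only an illustration of the tag definitions. The function name functional_tags is hypothetical, and choosing the first free personal name as the given name is an assumption (the definition only says that one of the personal names is tagged as given-name).

```python
def functional_tags(lexical_tags):
    """Map lexical-layer tags to functional-layer tags with simple rules."""
    out = [None] * len(lexical_tags)
    family = [i for i, t in enumerate(lexical_tags) if t == "FNAME"]
    name_like = [i for i, t in enumerate(lexical_tags) if t in ("PNAME", "RNAME")]

    # Titles remain titles on the functional layer.
    for i, tag in enumerate(lexical_tags):
        if tag == "TITLE":
            out[i] = "TITLE"

    # All family names are surnames.
    for i in family:
        out[i] = "SNAME"

    # Without a family name, the last personal or religious middle name
    # acts as the surname; a single-word name gets no surname.
    if not family and len(name_like) > 1:
        out[name_like[-1]] = "SNAME"

    # Assumption: the first personal name that is still free is the given name.
    for i, tag in enumerate(lexical_tags):
        if tag == "PNAME" and out[i] is None:
            out[i] = "GNAME"
            break

    # Remaining personal and religious middle names become middle names.
    for i in name_like:
        if out[i] is None:
            out[i] = "MNAME"

    # Everything else (e.g. particles) is residual.
    return [tag if tag is not None else "RES" for tag in out]


print(functional_tags(["PNAME", "PNAME"]))   # Umar Akmal
print(functional_tags(["PNAME", "FNAME"]))   # Moin Khan
```

Rules of this kind cannot resolve words that are ambiguous on the lexical layer, which is one reason to learn both layers from tagged data.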
The following example has most of the tags of the
proposed two-layered tagset.







4. POS Tagging
Part of speech tagging is one of the basic steps in the linguistic processing pipeline. It is a well-studied problem,
and there are many supervised learning approaches
that give around 96%-97% accuracy [6][7][8]. As
part-of-name tagging is also a sequence tagging problem, we used POS taggers for this task.
In this work, we used a maximum entropy Markov model
based tagger [6]. We preferred it for the following reasons:
a) It only considers the imposed constraints while
maximizing the entropy. With limited tagged
data, it helps to overcome the problem of overfitting
to the training data.
b) Feature incorporation is easier.
Feature Set
Position+affix+prev words
Affixes + prev words
Position+prev words
affix+prev word + clusters
Table 5.1: Accuracy for layer 1 tagging

4.1. Features
Compared to POS tagging of textual data, the
tag set and complexity of part-of-name tagging are
smaller; we mostly used word features and affix n-grams
for this task. We also did experiments with word cluster
features [9].
The details of the features used are as follows:
Position Feature: Position of the word in the name
Previous Words: Preceding two words of the current word
N-Gram Affixes: Suffix and prefix n-grams of the
current word; we found that accuracy is better with trigrams
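
A feature extractor for one token of a name might look like the sketch below. It is not the TweetNLP feature code; the function name token_features, the feature keys, the lower-casing and the "<s>" padding symbol are all assumptions made for illustration.

```python
def token_features(tokens, i, max_n=3):
    """Features for the i-th word of a name: position, previous words, affixes."""
    word = tokens[i].lower()
    feats = {
        "position": i,                                         # position in the name
        "prev1": tokens[i - 1].lower() if i >= 1 else "<s>",   # previous word
        "prev2": tokens[i - 2].lower() if i >= 2 else "<s>",   # word before that
    }
    # Prefix and suffix character n-grams up to trigrams.
    for n in range(1, max_n + 1):
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    return feats


print(token_features(["Muhammad", "Umar", "Akmal"], 2))
```

Each dictionary can be passed to any maximum entropy style classifier; affix features let the model generalize to names it has not seen, for example names sharing a common ending.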

The tagging for layer 1 was performed using different
sets of features; we achieved 97% accuracy using affixes, token sequence and cluster features.
Overall Accuracy

Unknown Word
Table 5.2 Comparison of overall accuracy with accuracy on unknown words


In second-level tagging, token sequences are the
main feature; they identify tags with an accuracy of 94.5%, which improved slightly when prefix
features were added. Table 5.3 shows the accuracies with various
feature sets.
We found that misclassifications at this level are
mainly due to confusion between middle name and
surname, and between given name and surname.

Society for Information Science and Technology

64.1, 2013: 86-95.
[5] Law Enforcement Guide to International Names,
2012, ROCIC Publications. Available: https://info.publicintelligence.net/ROCICInternationalNames.pdf
[6] Ratnaparkhi, Adwait. "A maximum entropy model
for part-of-speech tagging." Proceedings of the conference on empirical methods in natural language
processing. Vol. 1. 1996.

Feature Set
Position+affix+prev word + clus- 94.8
Prev words
Prev word + prefix
Table 5.3: Accuracy of the tagger for level 2

[7] Brill, Eric. Some Advances in Transformation

Based Part of Speech Tagging, Proceedings of
AAAI, Vol. 1, 1994, pp. 722-727
[8] Kristina Toutanova and Christopher D. Manning.
Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger, Proceedings
of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large
Corpora (EMNLP/VLC-2000), Hong Kong, 2000,
pp. 63-70.

These results show that we can extract the parts of a name
from a name string with over 90% accuracy, using a POS tagger trained by supervised
learning on limited training data. As the main source of misclassifications is unknown words, accuracy can be improved by adding
more training data.
We also found that word cluster features do not contribute much to tagging in a closed domain, and that
token/character sequence based features are the most
important ones in such a case.

[9] Olutobi Owoputi, Brendan O'Connor, Chris Dyer,

Kevin Gimpel, Nathan Schneider and Noah A.
Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In
Proceedings of NAACL, 2013.

6. Conclusion

[10] Peter F. Brown; Peter V. deSouza; Robert L. Mercer; Vincent J. Della Pietra; Jenifer C. Lai (1992).
"Class-based n-gram models of natural language".
Computational Linguistics 18

We presented an extraction scheme for the parts of Pakistani names using a two-layer tagging scheme. We
showed that the components of Pakistani names can be
tagged using an MEMM tagger with a small training dataset. In future work, we intend to apply this model to
other South Asian name patterns and to identify the differences between the positioning of various semantic units
of names.

[11] http://www.ark.cs.cmu.edu/TweetNLP/

7. References
[1] Patrick Schone and Stuart Davey. "A Multilingual
Personal Name Treebank to Assist Genealogical
Name Processing." FHTW at RootsTech, 2012.
[2] A Guide to Names and Naming Practices. Available:
[3] Tsung O. Cheng. "The continuing confusion in figuring out the surname of a Chinese author: A proposed solution." Chinese journal of integrative
medicine 18.4, 2012: 243-244.
[4] Sungwon Kim, and Seongyun Cho. "Characteristics
of Korean personal names." Journal of the American


Extracting Arguments and Collocations for Urdu Complex Predicates

Tafseer Ahmed
DHA Suffa University, Karachi
Advanced dictionaries contain the argument structure,
modifiers and/or collocations related to a word.
The main reason for the absence of advanced
dictionaries for Urdu is the absence of the required
lexical and linguistic resources, especially the absence
of an annotated corpus. Most of the advanced features of
dictionaries are extracted from treebanks that have
phrase structures. However, we do not find freely
available and sizeable treebanks for Urdu. Hence
researchers have two choices. One is to develop a treebank
first and then use it to extract the required features.
However, development of a treebank requires time and
money. The required time is not limited to the
development time of the treebank, but also includes
the time to train human resources who can perform the
The alternative method of using existing resources is
more practical and less time-consuming. Some work has
already been done in this regard, e.g. an
experiment to extract the argument structures of simple
verbs from a raw corpus [4]. Our goal is to
focus on complex predicates, as most verbal
predication in Urdu is done using noun+verb (N+V) or
adjective+verb (A+V) complex predicates (cp) [8],
e.g. mulaqAt 'meeting' (noun) + kar 'do' (verb).
(The details are presented in section 2.)
By analysing and summarizing earlier theoretical
work on complex predicates, we defined different types
of arguments (or relations2) for the argument structure
of the cps. Then we developed a system that uses a
POS tagged corpus and generates arguments/relations
for each sentence. As our method does not
involve processing of a manually tagged treebank (and
manually POS tagged corpus), we use the term pseudo-relations for the relations guessed by the heuristics of
our algorithm. The pseudo-relations guessed for all the

The paper presents the automated extraction of
arguments and collocations for Noun+Verb (N+V), e.g.
safAI kar- 'cleaning.noun do', and Adjective+Verb
(A+V), e.g. sAf kar- 'clean.adj do', complex predicates
(cp) of Urdu. An automatically POS tagged corpus of
97 million words is processed, and the pseudo-relations of nouns and complex predicates are
extracted by a devised algorithm (without using deep
parsing or chunking). The words of the pseudo-relations
are processed to suggest the collocations for each
complex predicate. For a given cp, the commonly used
words in the subject, object, genitive modifier of N+V,
non-canonical second argument (NCSA) and V+V
light verbs are extracted, if the argument exists for (or
is relevant to) that cp. In the absence of a big and freely
available Urdu treebank, the paper describes an
alternative method to get the argument structures and
collocations of complex predicates. The pseudo-relation
extractor can also be used in information
extraction tasks.

1. Introduction
Urdu is an Indo-Aryan language spoken mainly in
Pakistan and India [5]. Urdu is an under-resourced
language in terms of natural language processing
applications and computational linguistic resources.
Urdu has many paper and online dictionaries e.g.
Farhang-e-Asafya, Feroz-ul-Lughat, Urdu Dictionary
Board's dictionary of 22 volumes and Online Urdu
Dictionary1. However, the entries in these dictionaries
consist of word, part of speech, etymology,
meaning/explanation and example sentences. We do
not find any dictionary that presents argument structure
and/or collocation of the words. When we compare
Urdu dictionaries with the dictionaries of English and
other developed languages e.g. Oxford Collocation
Dictionary, we find a huge difference in the contents.

We try to keep this work neutral with respect to different theoretical models.
Hence, we use the terms argument and relation (of dependency
structure) interchangeably (which might seem odd from some points of
view). Similarly, we consider the Adjective+Verb sequence to be a
complex predicate. An alternative approach is discussed in [16].
However, the problem of the collocation dictionary remains unchanged
in the other case, because adjective+verb sequences are found in
Urdu dictionaries.




form or
it has an ergative
marker/postposition. In example (1), the subject has the
ergative postposition nE. The subject appears without
any postposition (in nominative form) in the following
(4) meN
kartA hUN
do.Impf be.Pres
'I clean the table.'

sentences are processed and summarized to give

argument structure/relation for complex predicates.
The structure of rest of the paper is as following.
Section 2 describes different types and arguments of
complex predicates in Urdu. Section 3 describes the
status of computational processing for argument
structure extraction and collocations of Urdu verbs and
cps. Section 4 describes our method to extract the data.
Section 5 presents the results and discussion about the
collocated verbs.

In Urdu, the subject of a verb can have the dative
marker. This phenomenon is called the dative-subject
construction and is discussed in detail in [10].
The following is an example of the dative-subject construction.
(5) mujh=kO
apple like.F.Sgbe.Pres
'I like apple.'

2. Complex Predicates in Urdu

Urdu has fewer than 1000 simple verbs [3][7]. Verbal
predication in Urdu is largely expressed by using
Noun+Verb (N+V) and Adjective+Verb (A+V)
complex predicates [8][14][15]. An example of an N+V
complex predicate is:
(1) meN=nE
sabaq yAd
lesson memory.Sg
'I learnt the lesson.'

In this example, the kO-marked pronoun mujH is

the subject and sEb is the object. It is important to note
that the postposition kO is also used to mark object in
other constructions. For example,
(6) meN=nE
is sEb=kO
pasand kiyA
this apple=Acc like.F.Sg do.Perf
'I liked this apple.'

In this example, yAd+kar (N+V) is a complex

predicate. Similarly, the following is an example of
A+V complex predicate.
(2) meN=nE
'I cleaned the table.'

(A system for argument structure and collocation
extraction must disambiguate the different
relations of kO in examples 5 and 6, on the basis of
the construction and other factors.)

In the above examples (1 & 2), the pronoun meN is

the subject. The object of (1) is sabaq and the object of
(2) is mEz. There can be other (mandatory) arguments
and (optional) adjuncts of a complex predicate.
Consider the following example.
(3) meN=nE
1.Sg=Erg room=in
'I cleaned the table in the room with the cloth.'

The object is the second mandatory argument of a
verbal predication. It is in nominative form or marked
by the accusative marker kO. In Urdu, the second
mandatory argument is not always realized as an
object. Urdu has simple verbs and complex predicates
whose second argument is marked by the locative
postpositions sE, par and mEN. These arguments are
called Non-Canonical Second Arguments (NCSA) [13].
The following is an example of the verb Dar 'fear' that has
a sE-marked argument.
(7) us=nE
meeting.F.Sg do.Perf
'He/She met me.'

In terms of collocation, we can say that the nouns frequently
appearing as the sE-marked instrument with sAf kar
would be cloth, water, duster etc. The nouns spoon
and fork are less likely candidates to appear in a sE-marked phrase of sAf+kar.
(Hence, our goal has two parts. (i) We have to find
that sE-marked phrases appear frequently with
sAf+kar. (ii) The nouns pAnI 'water' and kaprA 'cloth'
are frequently used in sE-marked postpositional
Subject and object are the mandatory arguments of
the complex predicate (as well as of simple verbs). In
Urdu, the subject of a transitive verb can have

The mandatory argument can also appear as the

genitive argument of noun in N+V complex predicate.
In the following example, the argument lOgON 'people'
is expressed as the genitive modifier/specifier of the
noun madad 'help'.
(8) tum=nE
'You helped the people.'




and A+V complex predicates. The input to the system
is a POS tagged corpus. After the processing of this
corpus, a database of pseudo-relations and statistics
about the words in these relations is obtained. After this,
the user can input a complex predicate and the system
will give the arguments/relations for that cp and the
frequently used words in each relation/argument.

Besides arguments, Urdu has collocations of verbs +
light verbs. In Urdu, the light verbs follow the main
verbs to show completion, suddenness and similar
properties [1][12]. In the following examples, the light
verb paR 'fall' shows suddenness and gayI 'go' shows
(9) gARI
move fall.Perf
'Car starts moving.'
(10) gARI
'Car stopped.'

4.1. POS-tagged Corpus

Ideally, one should process a manually POS tagged
corpus. However, we could not find a huge freely
available POS tagged corpus for Urdu. Hence, we used
the automatically tagged corpus released recently by
Charles University, Prague [2]. The corpus consists of
95 million words extracted from different news sites,
blogs and online book sites. The use of an automatically
tagged corpus implies that some of the tags are not correct.
However, we assume that the huge number of examples
provided by this corpus will outweigh the accuracy
problem (in the absence of a manually tagged corpus of
similar size).


A collocation extraction system for Urdu complex
predicates should deal with all of the above peculiarities.

3. Extracting arguments and collocations

A lot of work has been done on extracting subcategorization frames from raw corpora. [4] describes
different methods for argument extraction. The
availability of huge treebanks and annotated corpora
has made more complex processing and more granular
extraction possible. Sketch Engine [6] is one of the
state of the art tools available for argument and
collocation extraction from an annotated corpus. The
following figure shows an online dictionary that uses
Sketch Engine to find collocations. The dictionary is
available at http://forbetterenglish.com/.

4.2. Sentence Boundary Detection

The tagged corpus claims to have 5.4 million
sentences. However, inspection of the corpus shows
that this number does not correspond to the number of
grammatical sentences in the corpus. We do not know
the method of sentence segmentation used in [2], but it
seems that the following algorithm is used (or at least it
is the major part of breaking the text into sentences).
(a) A newline character starts a new sentence.
(b) A sentence separator (tag SM) starts a new sentence.
We found that there are many examples in this
corpus in which the writer does not insert a sentence
marker between sentences. Hence a single claimed
sentence can consist of two or more sentences. For
example, the following sentence has a single sentence
marker; however, it consists of two sentences.

However, we cannot generate an Urdu dictionary in
this way, as there is no sizeable treebank available. For
this reason, Raza [4] uses a raw corpus to infer
sub-categorization frames for simple verbs of Urdu. He
uses postpositions and the subordinate conjunction kah to
infer subcategorization frames. (However, he did not
work on collocations.)

lagta|VB he|TA yah|PD qataa|NN yahAN|AP kE|P
logON|NN nE|P banAyA|VB he|TA -|SM
The first sentence ends at the first tense auxiliary
TA (italicized by us). However, there is no sentence
marker SM or related symbol there, hence the two
sentences are considered as single sentence in the
We designed a rule based system for sentence
boundary detection. We used the fact that the verb
phrase (or verbal group) occurs at the end in most of

4. Method
Following is the description of the whole process of
arguments/relations and collocation extraction for N+V


format for dependency structures5, with less number of

lagtA6 lag

the sentences3. We mark the start of the verb group when
any of the verb tags (main-verb VB, auxiliary AA or
tense-auxiliary TA) is observed. The verb group ends
when a tense-auxiliary or a non-verb word is observed.
We insert a sentence boundary as soon as the verb group
has ended.
There are some exceptions to the above algorithm.
If an intensifier e.g. hI or bHI, or a negative adverb e.g.
nahIN, occurs, we do not break the verb group. We use
other tags e.g. noun (NN) to find that the verb group
(and hence the sentence) has ended. Moreover, the tag KER,
used for embedded clauses, ends the verb group but
does not end the sentence.
The tagset used in the corpus does not distinguish
between different forms of a verb. However, we need to
find the infinitival forms of verbs, as they can occur in
other contexts e.g. in a noun phrase. Hence, we checked
the last two characters of each VB tagged word. If they are
nA, nE or nI (the infinitival inflection endings), then we
did not mark the start of a verb group4. We changed the
tags of these infinitival forms to VBI or AAI
(corresponding to AA) for ease of further processing.
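The boundary rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the token representation as (word, tag) pairs, the choice to keep an explicit SM marker with the sentence it closes, and the use of romanized endings (the actual processing is done on Arabic script) are all assumptions made for illustration.

```python
# Illustrative sketch only; names and data layout are assumptions.
VERB_TAGS = {"VB", "AA", "TA"}            # main verb, auxiliary, tense auxiliary
TRANSPARENT = {"hI", "bHI", "nahIN"}      # intensifiers / negation keep the group open
INFINITIVE_ENDINGS = ("nA", "nE", "nI")   # romanized stand-ins for the inflection endings


def retag_infinitives(tokens):
    """Retag infinitival VB/AA words as VBI/AAI so they do not open a verb group."""
    out = []
    for word, tag in tokens:
        if tag in ("VB", "AA") and word.endswith(INFINITIVE_ENDINGS):
            tag += "I"
        out.append((word, tag))
    return out


def split_sentences(tokens):
    """Split a list of (word, tag) pairs at the end of each verb group."""
    sentences, current = [], []
    in_group = False   # inside an open verb group
    pending = False    # a tense auxiliary just closed the group
    for word, tag in retag_infinitives(tokens):
        if pending:
            pending = False
            if tag == "SM":                 # keep the marker with its sentence
                current.append((word, tag))
                sentences.append(current)
                current = []
                continue
            sentences.append(current)
            current = []
        if tag in VERB_TAGS:
            current.append((word, tag))
            in_group = tag != "TA"
            pending = tag == "TA"           # a tense auxiliary ends the group
        elif in_group and word in TRANSPARENT:
            current.append((word, tag))     # does not break the verb group
        elif in_group and tag not in ("KER", "SM"):
            in_group = False                # a non-verb word ends group and sentence
            sentences.append(current)
            current = [(word, tag)]
        else:
            in_group = False                # KER ends the group, not the sentence
            current.append((word, tag))
            if tag == "SM":
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences
```

Applied to the two-clause example of section 4.2 (lagta|VB he|TA ... banAyA|VB he|TA -|SM), this sketch closes the first sentence after the first tense auxiliary and the second after the sentence marker.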
The subordinate clause marker kah connects two
clauses. However, we make it part of the first clause to
register it as a candidate argument in pseudo-relation



4.4. Pseudo-relation extraction

We developed an algorithm to process the tagged
and pre-processed corpus obtained in 4.3. The
algorithm gives the words (usually nouns or pronouns)
that have a relation with the complex predicate of the
Ideally, we need a parse tree or at least chunks (e.g.
noun group chunks) for this type of processing. In the
absence of such resources, we developed an algorithm
whose important features are described in the
(a) The sentence is processed from the last word to the first
word. The motivation for this backward processing is
that Urdu is a head-final language for many structures.
For example, in backward processing of a postpositional
(or case) phrase i.e. NP pp, the postposition is
processed first, and the first noun/pronoun occurring
before it can easily be extracted as a candidate argument.
(b) The stem of the word preceding a postposition is
extracted as a candidate for a pseudo-relation. Any
intensifier e.g. hI or bHI is excluded, and we choose the
stem preceding the intensifier as the candidate stem.
In the following example, the word laRkI will be
extracted as the candidate having the (pseudo-)relation kO
with the verb.
laRkI kO
(c) The stem forms of the pronouns mEN
(1.Sg.Nom), ham (1.Pl.Nom), tumhEN (2.Sg.Acc)
etc. are changed to hpro. These pronouns are always
used for humans.
(d) The stem forms of other pronouns e.g. us
(3.Sg.Obl) are changed to pro.
(e) In backward processing, the noun or adjective
immediately preceding the verb is extracted as the noun or
adjective part (respectively) of the complex predicate.
These two extracted relations are mutually exclusive
i.e. only one of these occurs in the sentence, and is
4.3. Pre-Processing
The corpus with newly assigned sentence boundaries
is pre-processed before applying pseudo-relation
extraction. The pre-processing involves the following
(a) Adding an additional feature, stem, to each
word. Currently, only one category i.e. verbs is
stemmed, using a list of verbs and inflections. For other
tags, the word is copied as the stem. (Stemming the words
of other categories is a part of future work.)
(b) The tagger assigned the tag P to the genitive
markers kA, kI and kE. We re-assigned the newly
created tag G to these markers. (The reason is ease of
processing for this and some further-work tasks.)
Sample sentences after pre-processing are shown
in the following figure. The format is similar to the CoNLL
Urdu is a free word order language i.e. Subject (S), Object (O) and Verb
(V) can occur in any order. For example, OVS is a valid word order
in Urdu. However, the canonical word order is SOV. The results
discussed in section 5 attest to this observation/claim.
As sentence boundary detection is the only reason for verb group
identification, counter examples e.g. verb-nA chahIyE are
irrelevant. In this case chahIyE will be marked as VB, hence the start of
the verb group and the end of the sentence will be detected by

All the processing is done on Urdu written in Arabic script. In the
paper, we present examples in roman script to make it readable for a
wider audience.



We ignore any intensifier or negative adverb
occurring between the verb and the noun/adjective.
(h) If a noun is extracted in the above step, then we
try to extract its genitive modifier, which can be an
argument for some nouns. If the genitive tag G
immediately precedes the noun, then the stem of the
word preceding the genitive marker is extracted as a
candidate for the genitive marked argument.
(i) The noun preceding the adjective/noun of step
(e) is also extracted as a candidate for unmarked object
(j) If the marker of embedded clause KER occurs in
the sentence, then we exclude the postposition marked
relations, e.g. kO-marked or sE-marked relations,
occurring before it, as these can be arguments of the
verb of the embedded clause. However, we do not ignore
the nE marked argument, as it is always part of the
main clause.
(k) If any other verbs occur in the sentence (except
for the KER condition), then we change the stem of the main-verb to double-verb (i.e. we mark that our algorithm
could not process this sentence properly).
(l) The occurrence of kah at the end of the sentence is
also noted.
(m) If the word preceding kO is the name of a day or
month, we replace the stem with temp. This is done to
distinguish temporal arguments from the object or
subject marked with kO.
(n) The auxiliaries e.g. rah (progression), sak
(ability), chuk (completion) etc. are removed. The
AUX field is supposed to have V+V light verbs.
The list of possible relations is: nE, kO, sE, par,
meN (different postpositions), NN (noun part of cp),
ADJ (adjective part of cp), NNG (genitive argument of
N+V cp), NN2 (object of cp or the subject of
intransitive verb), kah (the subordinate clause marker
kah) and AUX (auxiliary and/or V+V light verbs).
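A simplified Python sketch of the backward scan may make the steps concrete. It covers only steps (a) to (i) for a single clause and is not the authors' code: the function names, the (stem, tag) token layout, the pronoun tags PP and PD used to detect pronouns, the VERB key, and the restriction of hpro to the three pronouns listed in step (c) are assumptions for illustration. The KER, double-verb, kah, temp and auxiliary handling of steps (j) to (n) is omitted.

```python
# Illustrative sketch only; names, tags and data layout are assumptions.
POSTPOSITIONS = {"nE", "kO", "sE", "par", "meN"}
INTENSIFIERS = {"hI", "bHI"}
SKIP_BEFORE_VERB = INTENSIFIERS | {"nahIN"}
HUMAN_PRONOUNS = {"meN", "ham", "tumhEN"}   # romanized; step (c) says "etc."


def normalise(stem, tag):
    """Map pronoun stems to hpro / pro (steps c and d); keep other stems."""
    if tag == "PP" and stem in HUMAN_PRONOUNS:
        return "hpro"
    if tag in ("PP", "PD"):
        return "pro"
    return stem


def extract_pseudo_relations(sentence):
    """sentence: list of (stem, tag) pairs for one clause, verb group last."""
    relations = {}
    i = len(sentence) - 1
    # (a) scan backwards; skip the sentence marker and trailing auxiliaries
    while i >= 0 and sentence[i][1] in {"SM", "TA", "AA"}:
        i -= 1
    if i < 0 or sentence[i][1] != "VB":
        return relations
    relations["VERB"] = sentence[i][0]
    i -= 1
    while i >= 0 and sentence[i][0] in SKIP_BEFORE_VERB:
        i -= 1
    # (e) noun or adjective immediately preceding the verb: cp part
    if i >= 0 and sentence[i][1] in {"NN", "ADJ"}:
        part_tag = sentence[i][1]
        relations[part_tag] = sentence[i][0]
        i -= 1
        if part_tag == "NN" and i >= 1 and sentence[i][1] == "G":
            # (h) genitive modifier of the noun part
            relations["NNG"] = normalise(*sentence[i - 1])
            i -= 2
        elif i >= 0 and sentence[i][1] == "NN":
            # (i) unmarked noun before the cp part
            relations["NN2"] = sentence[i][0]
            i -= 1
    # (b) stem preceding each postposition, skipping intensifiers
    while i >= 1:
        stem, tag = sentence[i]
        if tag == "P" and stem in POSTPOSITIONS:
            j = i - 1
            while j >= 0 and sentence[j][0] in INTENSIFIERS:
                j -= 1
            if j >= 0:
                relations.setdefault(stem, normalise(*sentence[j]))
            i = j - 1
        else:
            i -= 1
    return relations
```

For a clause such as meN nE mEz sAf kar (stemmed), the sketch yields the adjective part sAf, the unmarked noun mEz as NN2, and hpro for the nE relation.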

Otherwise, we try to find whether it is a dative-subject
verb. If there are more than 5 examples of kO marked
arguments, we consider it a dative-subject verb. We do not
impose a condition of 100% or 90+% in the above two
tests, as there might be many examples of pro-drop or
wrong sentence boundary detection that result in
the drop (omission) of the nE or kO marked argument
in these sentences.
If the complex predicate is not classified as simple
transitive or dative-subject construction in the above steps,
then we consider it intransitive (having a nominative
subject). We use the words of NN2 (the unmarked
nouns) as candidates for subject-cp collocation. In this
case, we hide the object-cp relation (and collocations) for
this cp. The pseudo-relation NN2 also gives the
candidate words for the object of dative-subject
As we introduced the stem temp for commonly
occurring temporal words i.e. day, part of the day and
month names, we exclude the stem temp in the calculation
for dative subject or accusative object of simple
transitive. The identification of the genitive argument of
N+V and the locative relation (using meN and par) is
The non-canonical second arguments (NCSA) are
identified by using the heuristic that human arguments
usually appear with locative postpositions in this
construction. As we have the stem hpro for first and second
person pronouns, the stem helps us to identify NCSA:
a cp is considered to have an NCSA if there exist more than 5 instances of hpro with a
locative postposition (sE, par or meN).
The procedure described above is summarized in the
following table.

4.5. Predicting Argument Structure

The pseudo-relations extracted in the previous steps
are analyzed to find whether an argument type exists
for a particular complex predicate. If that argument
exists, then what are the words commonly used in that
The common processing for both N+V and A+V
predicates is as follows.
The first step is to classify the type of cp as simple
transitive, dative-subject construction or intransitive
verb. If there are more than 5 examples of the nE relation,
then we consider the cp as simple transitive i.e. it has a
nominative/ergative subject. The nE marked words are
the candidates for subject-cp collocation for this
complex predicate.
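The threshold tests of this section (more than 5 nE examples, otherwise more than 5 kO examples, otherwise intransitive, plus the hpro test for NCSA) can be sketched in Python as follows. This is an illustrative reading, not the authors' code: the function names, the dictionary-of-counters layout and the label strings are assumptions, and the counts in the usage example are invented.

```python
from collections import Counter

# Illustrative sketch only; names and data layout are assumptions.
MIN_EXAMPLES = 5                      # "more than 5 examples" threshold from the text
LOCATIVES = ("sE", "par", "meN")


def classify_cp(relation_stems):
    """relation_stems maps a relation name (e.g. 'nE') to a Counter of stems
    observed in that relation with one complex predicate."""
    def count(relation):
        stems = relation_stems.get(relation, {})
        # temporal stems are excluded from the calculation
        return sum(n for stem, n in stems.items() if stem != "temp")

    if count("nE") > MIN_EXAMPLES:
        return "simple transitive"
    if count("kO") > MIN_EXAMPLES:
        return "dative subject"
    return "intransitive"


def ncsa_postpositions(relation_stems):
    """Locative postpositions seen with more than 5 hpro stems (NCSA heuristic)."""
    return [p for p in LOCATIVES
            if relation_stems.get(p, {}).get("hpro", 0) > MIN_EXAMPLES]


# Hypothetical counts for one complex predicate
observed = {"nE": Counter({"hpro": 4, "hukomat": 3}),
            "sE": Counter({"hpro": 6, "kaprA": 2})}
label = classify_cp(observed)
ncsa = ncsa_postpositions(observed)
```

Using absolute counts rather than proportions mirrors the stated reason for not requiring 90+% agreement: pro-drop and wrong sentence boundaries remove many nE and kO marked arguments from the extracted relations.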














Gen. Arg. NNG


in hpro
in hpro
sE/par/mEN sE/par/mENh sE/par/mEN





Light V -aux/modal -aux/modal
Table 4.1: Heuristics for Argument Extraction



4.6. Predicting the Collocations

Relation Collocation words

After guessing the arguments and verb type in the
above step, we focus on guessing the frequently used
words in each collocation. In most cases, the
system calculates TF*IDF7 (Term Frequency*Inverse
Document Frequency) for each word in each relation-with-cp. IDF is calculated as:
IDFwr = log (count of sentences in corpus /
count of word w appearing in the relation r)
The top n words are selected as the candidate
collocations for that relation.
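The ranking step can be sketched in Python using the IDF definition given above. The function name, the argument layout, the tie-breaking by alphabetical order and all counts in the usage example are assumptions for illustration; term frequency is taken here to be the number of times a word occurs in the relation with the given cp.

```python
import math

# Illustrative sketch only; names and counts are assumptions.
def top_collocations(tf, relation_counts, n_sentences, n=5):
    """Rank candidate words for one relation of one complex predicate.

    tf              -- word -> occurrences in this relation with this cp
    relation_counts -- word -> occurrences in this relation over the whole corpus
    n_sentences     -- number of sentences in the corpus
    """
    scores = {
        word: freq * math.log(n_sentences / relation_counts[word])
        for word, freq in tf.items()
    }
    # highest score first; ties broken alphabetically for reproducibility
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Hypothetical counts for the sE relation of sAf+kar
tf = {"kaprA": 10, "pAnI": 8, "jis": 12}
relation_counts = {"kaprA": 100, "pAnI": 200, "jis": 500000}
best = top_collocations(tf, relation_counts, n_sentences=8_000_000, n=2)
```

In this invented example the pronoun jis has the highest raw frequency but is down-weighted because it appears in the relation throughout the corpus, so the content nouns rank first.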



hukomat 1 pronoun
'governement', jis 'who',
pakistan, committee


sAtH 'along',
'players', rabt
code , feature


Not present

khilariyon sAtH8

Light V dE, lE
Table 5.2: Collocations for shAmil+kar 'include'

5. Results
The corpus contains 5.465 million sentences. The
sentence boundary detector of section 4.2 generates
11.156 million sentences. Hence the number of
(guessed) sentences roughly doubles. 8.006 million of
these sentences contain a verb tag, and we use these 8
million sentences in pseudo-relation extraction.
For evaluation, we selected five N+V and five A+V
complex predicates. Three of each type were randomly
selected from those cps that have high frequencies. The
other two were chosen from the medium frequency
region. The following are the collocations and arguments
extracted for these complex predicates.

Relation Collocation words

Relation Collocation words

Light V dE, lE
Table 5.3: Collocations for band+kar 'close'





pakistan, jis 'who', ajmal, 2 pronouns

bharat 'india', jinhon 'who


'government', 1 pronoun
logon 'people', unhoN 'they',


AnkHEN 'eyes', darvAzA nazar-band

'door', nazar, qalam, rAstON and qalam'paths'


Not present

Relation Collocation words

WikTeN 'wickets', taleem

'education', aizaz 'award', ilm
'knowledge', maqAsid 'goals'




dorA 'visit', license, AyAt

parvazEN 'flights'


Not present

Not present

Light V lE
Table 5.1: Collocations for hAsil+kar 'achieve'




Light V dE
Table 5.4: Collocations for mansookh+kar 'cancel'

sAtH is a noun (NN) in the POS tagset. Actually, it is a nominal

post-position or nominal adverb.

In other cases, we use Term Frequency.



Relation Collocation words


YadEN 'memory', umIdEN sAtH

'hopes', tawwaquAt 'hopes',


[2] Bushra Jawaid, Amir Kamran, Ondřej Bojar. A
Tagged Corpus and a Tagger for Urdu. In
Proceedings of the 9th Conference on Language
Resources and Evaluation (LREC 2014), Iceland,
[3] Ghulam Raza. Subcategorization acquisition
and classes of predication in Urdu. Universität
Konstanz, 2011.
[4] Ghulam Raza. "Inferring Subcat Frames of
Verbs in Urdu". In Proceedings of the 7th Conference
on Language Resources and Evaluation (LREC 2010),
[5] Joseph Grimes and Barbara Grimes (eds.).
Ethnologue, Dallas: SIL, 2000.
[6] Kilgarriff et al. The Sketch Engine, Proc.
Euralex, Lorient, France, July, 2004, 105-116.
[7] M. Humayoun. Urdu morphology, orthography
and lexicon extraction. MSc Thesis, Department of
Computing Science, Chalmers University of
Technology, 2006.
[8] Miriam Butt. The Structure of Complex
Predicates in Urdu. Stanford: CSLI Publications, 1995.
[9] Miriam Butt et al. "Identifying Urdu Complex
Predication via Bigram Extraction", Proceedings of
24th International Conference on Computational
Linguistics (COLING), Mumbai, 2012.
[10] Miriam Butt, Scott Grimm, and Tafseer
Ahmed. "Dative subjects." NWO/DFG Workshop on
Optimal Sentence Processing, Nijmegen, 2006.
[11] Rafiya Begum et al. "Identification of conjunct
verbs in hindi and its effect on parsing accuracy",
Computational Linguistics and Intelligent Text
Processing. Springer Berlin Heidelberg, 2011, 29-40.
[12] Peter Edwin Hook. The Compound Verbs in
Hindi. University of Michigan, Ann Arbor, 1974.
[13] Tafseer Ahmed Khan. Spatial expressions and
case in South Asian languages. Diss. Bibliothek der
Universität Konstanz, 2009.
[14] Tafseer Ahmed and Miriam Butt. "Discovering
semantic classes for Urdu NV complex predicates",
Proceedings of the Ninth International Conference on
Computational Linguistics, 2011.
[15] Tara Mohanan. Argument Structure in Hindi.
Stanford: CSLI Publications, 1994.
[16] Tafseer Ahmed, Miriam Butt, Annette Hautli
and Sebastian Sulger. "A Reference Dependency Bank
for Analyzing Complex Predicates", In Proceedings of
the Conference on Language Resources and
Evaluation 2012 (LREC 2012), Istanbul, 2012.


sE: shobE 'departments=', jis 1 pronoun

'who', idArE 'department',

Light V
Table 5.5: Collocations for vabastA+hE '(be)
The analysis of the above data shows three
major problems in the guessed words.
(a) Some pronouns are not classified as pronouns.
As pronouns form a small class, we can make a list of words
that should not be added to pseudo-relations. Currently,
we are doing this type of filtering for the personal
pronoun tag PP and some temporal words. It can be
extended to all pronoun stems.
(b) Multiwords are not identified, hence the guessed
word is incomprehensible. For example, Islamabad is
written in Urdu script as islAm AbAd. There is a
space between the two components of the word. Our
system will extract only the last part i.e. AbAd. Using a
list of multiwords will reduce this problem.
(c) Some words tagged as nouns (NN) e.g. sAtH 'along', liyE
'for', mutAbiq 'according' are actually nominal
postpositions. One can re-assign a different tag to this
class of words in preprocessing.

6. Conclusion and Future Work

We automatically extracted collocations and
arguments from an automatically POS tagged corpus. It
is an effort to use limited resources (annotated corpus)
to achieve the bigger goal of linguistic data extraction.
For future work, some suggestions are presented in
section 5. Moreover, the use of better word
segmentation and character normalization utilities will
resolve many problems. As the structure of Indo-Aryan
languages is similar, the method can be applied to other
closely related languages e.g. Sindhi and Punjabi.

7. References
[1] Abul Lais Siddiqui. Jamaul Qawaid (Comprehensive Grammar), 1971, Karachi.


Framework of Urdu Nastalique Optical Character Recognition System

*Qurat-ul-Ain Akram
Sarmad Hussain
Farah Adeeba
Mehreen Saeed
Al Khawarizmi Institute of Computer Science
Center for Language Engineering, Al-Khawarizmi
University of Engineering and Technology
Lahore, Pakistan
ainie.akram@kics.edu.pk, firstname.lastname@kics.edu.pk

to write Arabic text, has four unique shapes for a

character. Unlike Naskh, Nastalique has contextual
character shaping [5] as can be seen in Figure 2. Some
cases of contextual shaping are highlighted with red

The development of Urdu Nastalique Optical
Character Recognition (OCR) is a challenging task
due to the cursive nature of Urdu, complexities of
Nastalique writing style and layouts of Urdu document
images. In this paper, the framework of Urdu
Nastalique OCR is presented. The presented system
supports the recognition of Urdu Nastalique document
images having font sizes from 14 to 44. The system
has 86.15% ligature recognition accuracy tested on
224 document images.

Figure 2. Contextual character shaping of a character, highlighted with red color: (a) character shaping in Naskh writing style; (b) contextual character shaping in Nastalique writing style

1. Introduction

In Nastalique, the rules for the placement of Nuqtas
and diacritics are complex and are based on the
contextual existence of characters in a ligature. This
characteristic adds further complexity to text
image segmentation, especially during the association of
marks with the respective characters/ligatures. In
Nastalique, some of the characters and diacritics have
the same shapes but are different in size. Examples of
diacritic and main body confusions are listed in Table 1.

Urdu is written in the Arabic script, which is cursive in
nature. Urdu has an extended character set, shown in
Figure 1, and additional aerab which are normally used
for pronunciation [1]. One or more characters of Urdu
are joined together to form a ligature [2]. A ligature has a
base stroke called RASM or main body and secondary
strokes called IJAM or diacritics. Based on the shape
similarity of RASM, Urdu ligatures are divided into
different classes. Different ligatures are joined together
to form Urdu words. In Urdu, spaces are not properly
used to define Urdu word boundaries [3, 4].

Table 1. Diacritics and main bodies confusion





Diacritic of

Diacritic of

Figure 1. Urdu character set

Nastalique writing style is normally used to write
Urdu books, magazines and newspapers. Nastalique is
written diagonally, which results in vertical
overlapping of characters and ligatures. This
characteristic adds complexity to Urdu document
image segmentation. Naskh writing style which is used

Diacritic of

A ligature has variation of thick-thin strokes which
introduces complexity in the pre-processing module



especially during binarization of Urdu document

images. The examples of ligatures indicating thick-thin
transitions are given in Figure 3.

Bukhari et al. [10] present layout analysis of Arabic

and Urdu document images having multiple layouts.
The existing systems for other languages are tweaked
for segmentation of Arabic and Urdu document
images. The system is tested on 25 Arabic and 20 Urdu
document images. The reported text and non-text
segmentation accuracy of the system is 99% for
Arabic test data. The text line extraction accuracies are
96% and above 92% for Arabic and Urdu document
images respectively.
Classification and Recognition module has two
phases. The first phase called training phase deals with
the classification of character/ligature shapes into
different classes based on the shape similarity. The
features of each class are extracted and used as input to
a classifier for training. In the recognition phase, the
features of input shape are computed and recognized
using the trained classifier. For the recognition of
cursive script, the classification and recognition
module is developed using two approaches; (1)
Ligature-based classification and recognition, and (2)
Segmentation-based classification and recognition.
In ligature-based classification and recognition, the
ligature as a whole is used for the classification and
recognition. The ligature-based recognition of
Nastalique main bodies is done using structural
features which are classified using neural network [11].
Sabbour and Shafait [12] use shape context features of
contours of the main body and diacritics for the
recognition of Urdu and Arabic text. The reported
accuracy of the system for Urdu is 91% tested on
synthesized data. Javed et al. [13] divide ligature
stroke into smaller windows and extract features for
recognition using HMMs as classifier. The system is
tested on synthesized data at 36 font size and has 92%
recognition accuracy. Lehal and Rana [14] perform
different experiments on different feature sets and
classifiers for the recognition of Nastalique ligatures.
The test data contains 4,380 images of 2,190 main
body classes and 1,700 images of 17 diacritics classes.
The DCT feature set with an SVM classifier performs best
among the tested features and classifiers. The system has
98.01% main body recognition accuracy and 99.91%
diacritics recognition accuracy. The ligature-based
classification and recognition of Nastalique main
bodies using Tesseract is also reported [6]. The
Tesseract engine is modified to improve the main body
recognition accuracy. The reported accuracy of the
system is 97.87% for 14 font size and 97.71% for 16
font size, tested on separate test data of 22,125
instances of 1,475 main body classes for 14 and 16
font sizes.
recognition system deals with the segmentation of
main body stroke into smaller parts which can be

Figure 3. Examples of thick-thin stroke

variation across characters in a ligature

2. Literature Review
Due to the complexities of Nastalique, limited
effort has been devoted to the development of a
complete Optical Character Recognition (OCR) system
for the Nastalique writing style. In this section, the current state
of the art in the development of Urdu OCR is
discussed. Normally, an OCR system has three modules: (1)
preprocessing, (2) classification and recognition, and
(3) post-processing.
The preprocessing module processes
an input image to improve its quality and to segment
the image into different areas. The relevant information
extracted from these areas is used in the
classification and recognition, and post-processing
modules. A binarization system for Urdu document
images is developed by modifying an existing
binarization algorithm to address the Nastalique
complexities [7]. An evaluation of the binarization
algorithm for Urdu document images is also devised in
that study. Shafait et al. [8] apply some of the existing
pre-processing techniques on Urdu Nastalique
document images to segment the page into columns
and text lines. The system is tested on 25 images
scanned from different magazines, poetry books, text
books, digests and newspapers. The reported accuracies
of the system are 91.45%, 92.31%, 80.63%, 90.07%
and 72.16% for text books, poetry books, digests,
magazines and newspapers respectively.
The projection profile method is also used to segment the
Urdu document image into text lines [9]. In addition,
some heuristics are also used to improve the line
segmentation results. The main body and diacritics of
ligatures are extracted using bounding box information,
and association of diacritics with the respective main
body is done using overlapping information. The
reported line segmentation accuracy is 100% tested on
20 images scanned from three poetry books. The
diacritics association accuracy is 94% tested on
synthesized data of 3,655 ligatures at 36 font size.


Proceedings of the Conference on Language & Technology 2014

characters. The segments are then used for
classification and recognition. A segmentation-based
technique for the recognition of handwritten Nastalique
text is presented by Safabakhsh and Abidi [15]. The
Fourier descriptor, structural and discrete features are
extracted from the segmented primitives and are
classified using continuous-density variable-duration
HMMs. The reported accuracy of the system is 96.8%.
Javed and Hussain [16] segment the ligature into
smaller primitives using branch point information of
the thin main body stroke. DCT features of the
segments are computed and classified using HMMs.
The sequence of recognized primitives is used to
recognize the ligature. The system is tested on
synthesized data at 36 font size, which contains 1,692
high frequency ligatures of six character classes. The
reported main body recognition accuracy of the system
is 92.73%. This approach is extended by Muaz [17]
for the recognition of ligatures of all 21 character
classes. The recognition accuracy of the system is
92.19%, tested on 2,494 ligatures synthesized at 36 font
size. Bidirectional LSTM networks are used for the
recognition of printed Nastalique text [18]. A
synthesized dataset of text lines is used in this
approach. The pixel level information extracted from
a sliding window is used as a feature. This system
has 94.85% character recognition accuracy, tested on
2,003 text line images. Rashid et al. [19] develop a
system for the recognition of multi-script documents
using Convolutional Neural Networks (CNNs). The
reported script recognition accuracy of the system is
above 95%, tested on Greek-Latin, Arabic-Latin and
Antiqua-Fraktur document images. Naz et al. [20]
evaluate state of the art techniques of preprocessing,
feature extraction, and classification and recognition for
Urdu, Pashto and Sindhi document images having
Nastalique and Naskh writing styles.
The post-processing phase deals with the formation
of words and sentences from the recognized character or
ligature sequences. This module covers sub-modules such as
word segmentation, spell checking and POS tagging, and outputs the correct sequence of words to
form sentences. In Urdu, development of a word
segmentation system is challenging due to the
inconsistent use of space. Durrani and Hussain [21]
develop a rule-based word segmentation system which
automatically defines the word boundaries of the input
Urdu text. A statistical word segmentation system for
Urdu text corpus is also developed [4]. Ligature N-grams and word N-grams are computed from the
corpus to statistically compute the best sequence of
words from the sequence of ligatures. The system has
96% word formation and 67% sentence formation

3. Methodology
In this paper, the framework of an Urdu Nastalique
OCR system is presented. A survey of books showed
that the font size of text books ranges
from 14 to 44, where 14 and 16 font sizes are used for
normal text and the remaining font sizes are used in
headings of Urdu text books. The presented OCR
supports the recognition of Urdu document images
scanned from different books and magazines having
font sizes between 14 and 44. The architecture of Urdu
Nastalique OCR is illustrated in Figure 4. Urdu
Nastalique OCR has three main modules: (1)
Preprocessing, (2) Classification and Recognition, and
(3) Post-processing. To cover the range of 14
to 44 font sizes, four different recognizers are
developed at 14, 16, 22 and 36 font sizes. To achieve
reasonable accuracy on normal text, the
recognizers at 14 and 16 font sizes are developed
separately. The remaining font sizes normally appear
in headings; therefore, to cover this range, one
recognizer is developed at 22 font size and the other
recognizer is developed at 36 font size. The details of
each module are discussed in subsequent sections.

Figure 4. Architecture diagram of Urdu Nastalique OCR



developed. The text areas having font sizes between 18
and 20 are scaled up to 22 font size and text areas of 24
to 28 font sizes are scaled down to 22 font size. In the
same way, the text areas of 30 to 34 font sizes are
scaled up to 36 font size and 38 to 44 font sizes are
scaled down to 36 font size.
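
The resizing rule described here can be sketched as a small range lookup. This is a minimal sketch and not the authors' implementation: the names `RECOGNIZER_RANGES`, `nearest_recognizer_size` and `scale_factor` are hypothetical, and the handling of sizes the text does not list explicitly (for example 15, 21 or 29) is an assumption.

```python
# Font-size ranges from the text, each mapped to the recognizer size it is scaled to.
# Sizes 14, 16, 22 and 36 have their own recognizers and need no scaling.
RECOGNIZER_RANGES = [
    ((14, 14), 14),
    ((16, 16), 16),
    ((18, 28), 22),  # 18-20 scaled up, 24-28 scaled down (21 and 23 assumed to follow)
    ((30, 44), 36),  # 30-34 scaled up, 38-44 scaled down (35 and 37 assumed to follow)
]


def nearest_recognizer_size(font_size: int) -> int:
    """Return the recognizer font size that a text area of `font_size` is resized to."""
    for (low, high), target in RECOGNIZER_RANGES:
        if low <= font_size <= high:
            return target
    # Assumption: sizes outside the listed ranges are rejected rather than guessed.
    raise ValueError(f"unsupported font size: {font_size}")


def scale_factor(font_size: int) -> float:
    """Return the resize factor to apply to the text area (above 1 means scaling up)."""
    return nearest_recognizer_size(font_size) / font_size
```

For example, a text area detected at font size 18 would be scaled by 22/18 before recognition, while one detected at 44 would be scaled by 36/44.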
The connected components of the resized text areas
are disambiguated as diacritics or main bodies. The
main bodies are used to form text lines. Some of the
diacritics are confusable in shape with some of the main
bodies but differ in size (see Table 1).
Therefore, separate recognizers for diacritics and main
bodies are developed. For better association of diacritics
with the respective main body, diacritics
recognition is performed in the pre-processing module.
Correct diacritics association eventually affects the
ligature recognition accuracy. After diacritics
recognition, the association of diacritics with the
respective main body is done. In addition, the
positional information of the diacritics with respect to
main body such as above, below or middle of the main
body is also computed. This information is used in
ligature string creation sub-module of classification
and recognition. The diacritics association output is
shown in Figure 7. The blue and brown colors are
used to show the alternating ligatures in one line and
red and green colors are used to show the alternating
ligatures in next text line. The dark color indicates the
main body and light color indicates the associated
diacritics of the respective main body. The Latin script
is not processed to form the text lines. However, the
positional information in the respective text line is also
maintained so that its recognized text can be output in
proper location in the text line. The Latin text in Figure
7 is highlighted with gray color.

3.1. Preprocessing
In the Preprocessing module, the binarization of Urdu
document images is performed using [7]. Layout
extraction is then performed, which segments the Urdu
document images into figure and text areas. The
extracted text areas are then sequenced into column(s)
according to the reading order. The output of the layout
extraction is given in Figure 5; the figure areas are
marked with red rectangles and text areas are marked
with green rectangles.

Figure 5. Output of image segmentation into figures and text areas
Each text area of the document image is processed
to mark the Latin and Nastalique text. The script detection
sub-module processes each connected component and
marks its script as either Latin or Nastalique. Figure
6 shows sample output of the script detection system;
the Nastalique script is marked in blue and
Latin script is marked in red.

Figure 7. Output of diacritics association

Figure 6. Output of script detection

3.2. Classification and Recognition

The Nastalique text is further processed to compute
the font size of the text area. Based on the computed
font size, the text area is resized to the nearest font size
on which pivot line segmentation and recognizer are

The open source Tesseract engine [22] for Latin is used
for the recognition of the marked Latin script. The
recognized output of Tesseract is passed to the
post-processing module.
The classification and recognition module mainly
deals with recognition of Urdu main bodies. In the
training phase, unique main bodies are grouped into
classes using the RASM class information. Two
different classifiers are developed: (1) Ligature-based
classification and recognition using Tesseract and (2)
Segmentation-based classification and recognition
using HMMs.
In the ligature-based classification and recognition
module, the main body as a whole is used for
recognition. The modified Tesseract engine for the
recognition of Nastalique main bodies is used for this
purpose. The details can be found in [6]. The
segmentation-based classification and recognition
module deals with the segmentation of the main body into
constituent characters. Here DCT coefficients are used as
features and HMMs as the classifier for
recognition. Both classifiers output a ranked list of
recognized main body identifiers for a single
main body image.
After recognition, the recognized main body and
diacritics of respective ligature are used to recognize
the ligature string. For this purpose, a lookup table is
used which contains the information pertaining to a
recognized main body and its corresponding associated
diacritics. In addition, the diacritics positional
information is also maintained in the lookup table.
The recognized diacritics and positional information
are computed in pre-processing module. Therefore
recognized diacritics, positional information of
diacritics and recognized main body are used to
recognize the respective ligature string using a lookup
table. Against each recognized option in the ranked
list of a main body, the respective ligature string is
recognized from the lookup table. This module
generates recognized ranked ligatures list against a
ligature image.
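
The lookup step described here can be illustrated with a small dictionary keyed on the main body identifier and its positioned diacritics. This is a minimal sketch under stated assumptions: the identifiers (`REH_RASM`, `one_dot`, and so on), the key layout and the function name `ligature_candidates` are hypothetical and are not taken from the described system; only the letter shapes themselves are standard Urdu.

```python
from typing import Dict, List, Sequence, Tuple

Diacritic = Tuple[str, str]              # (diacritic id, position relative to main body)
Key = Tuple[str, Tuple[Diacritic, ...]]  # (main body id, sorted diacritics)

# Hypothetical lookup table: main body + positioned diacritics -> ligature string.
LOOKUP: Dict[Key, str] = {
    ("REH_RASM", (("one_dot", "above"),)): "ز",
    ("DAL_RASM", (("one_dot", "above"),)): "ذ",
    ("BEH_RASM", (("one_dot", "below"),)): "ب",
    ("BEH_RASM", (("two_dots", "above"),)): "ت",
    ("BEH_RASM", (("three_dots", "below"),)): "پ",
}


def ligature_candidates(
    ranked_main_bodies: Sequence[str], diacritics: Sequence[Diacritic]
) -> List[str]:
    """Map each ranked main body option to a ligature string, keeping the ranking."""
    key_diacritics = tuple(sorted(diacritics))
    ranked_ligatures = []
    for main_body in ranked_main_bodies:
        ligature = LOOKUP.get((main_body, key_diacritics))
        if ligature is not None:  # options with no table entry are skipped
            ranked_ligatures.append(ligature)
    return ranked_ligatures
```

With the diacritic `("one_dot", "above")` and the ranked main bodies `["REH_RASM", "DAL_RASM"]`, the sketch returns one ligature per ranked option, in rank order.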

words according to the position computed in preprocessing module. The output of word segmentation
system on sample input is given in Figure 8.

Figure 8. Output of word formation system

4. Testing and Results

For the testing of the Urdu Nastalique OCR system,
test data with different single-column page layouts is
prepared. The layouts have header, footer, mixture of
font sizes, figure in text, etc. Some examples of layouts
of test data are given in Figure 9.

Figure 9. Examples of test data layouts

In addition, pages having text in different font sizes
from 14 to 44 have also been added to the test
data. Synthesized document images are prepared for missing font
sizes (those that do not appear in document images of books)
to test the recognition accuracy of the
OCR. The test data contains a total of 224 document
images. The document images having normal font
sizes contain approximately 700 ligatures per page.
The number of ligatures per document image goes
down for larger font sizes. The test data of 224
document images contains 371 ligatures per
page on average. The desired ligature string for each ligature
of the document images is tagged. In addition, broken
main bodies and special ligatures (written in a different
font style) are also marked. The font wise number of

3.3. Post-processing
The output of the classification and recognition
module is the recognized sequence of ligatures. Each
ligature has a ranked list. The word segmentation
system converts the sequence of ligatures into the best
sequence of words using a modified version of the statistical model of
[4]. All combinations of the ligature ranked lists are
formed and given to the word segmentation system
to generate the best sequence of words of a sentence.
The statistical word segmentation system is developed
using the language model of ligature N-grams and word N-grams reported in [23]. In post-processing, the
recognized Latin text is also inserted between Urdu



Table 3. Font wise ligature recognition


ligatures along with the count of broken and special
ligatures in the test data are listed in Table 2.
The accuracy of each font size is computed at three
levels: (1) classification and recognition (C&R)
module, (2) post-processing (PP) module and (3) End-to-End system. For the classification and recognition
module accuracy, a ligature is marked as correct if
the desired ligature string is found in the ligature's
recognized ranked list. For post-processing (PP)
module accuracy, the word segmentation accuracy is
computed irrespective of C&R module errors.
Therefore the PP module accuracy is computed as the
total number of correct ligatures found in ranked lists
versus total number of ligatures ranked at the top by
the word segmentation system.
Table 2. Font wise ligature information of test

5. Discussion
In this paper, the framework of Urdu Nastalique
OCR to support the recognition of Urdu document
images for 14 to 44 font sizes is reported. The system
is tested on 224 document images and ligature
recognition accuracies are reported for each desired
font size. The complete process from pre-processing,
classification and recognition, to post-processing is
performed on each document image. The ligature
recognition accuracy is computed using the tagged
ligature string information. The module-wise
accuracies are reported for each font size in Table 3.
The recognition accuracy of normal font sizes is stable.
The main reason is the availability of sufficient data of
normal text. As the larger font sizes appear only in headings,
the training and testing data is not sufficient
for most of these font sizes. The drop in the recognition
accuracy, especially for 36 and 40, is mainly due to the
unavailability of a large number of example images. In
addition, Urdu OCR currently does not handle broken
and special ligatures, which also affects the ligature
recognition accuracy. The results show that the developed
Urdu OCR can be used to port published Urdu content
online with minimal editing effort.

In addition, End-to-End system accuracy is also
computed, which measures the accuracy in terms of the
total number of ligatures of the input document image
versus the total number of correct ligatures ranked at the top
by the word segmentation system. The font wise
ligature recognition accuracies are given in Table 3.
Urdu OCR gives 90.10%, 95.78% and 86.15% per
page ligature recognition accuracy for the C&R module,
PP module and End-to-End system respectively.
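
The three accuracy levels can be made concrete with a short sketch. It is one reading of the definitions in the text, not the authors' evaluation code: the record layout and the name `module_accuracies` are hypothetical, and the PP accuracy is taken to be the share of ligatures whose desired string is in the ranked list that the word segmentation system also ranks at the top.

```python
from typing import List, Sequence, Tuple

# (desired ligature string, ranked list from C&R, top choice after word segmentation)
Record = Tuple[str, Sequence[str], str]


def module_accuracies(records: List[Record]) -> Tuple[float, float, float]:
    """Return (C&R, PP, End-to-End) accuracies for a list of tagged ligatures."""
    total = len(records)
    if total == 0:
        raise ValueError("no ligatures to evaluate")
    # C&R: the desired string appears anywhere in the ranked list.
    in_ranked_list = sum(1 for truth, ranked, _ in records if truth in ranked)
    # Correct at the top after post-processing.
    top_correct = sum(
        1 for truth, ranked, top in records if truth in ranked and top == truth
    )
    cr_accuracy = in_ranked_list / total
    # PP: judged only on ligatures the C&R module got into the ranked list.
    pp_accuracy = top_correct / in_ranked_list if in_ranked_list else 0.0
    end_to_end = top_correct / total
    return cr_accuracy, pp_accuracy, end_to_end
```

Under this reading, accuracies computed from pooled counts satisfy End-to-End = C&R × PP. The reported figures give 90.10% × 95.78% ≈ 86.30%, close to the reported 86.15%; the small gap may come from averaging per page rather than pooling counts.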


6. Conclusion
In this paper, a framework of Urdu Nastalique OCR
is discussed. Initially, the layout of the Urdu document
images and the complexities of the Nastalique writing style are
analyzed to finalize the framework. Each sub-module
is tested and matured separately on data extracted from
document images at each font size. After each sub-module is
finalized separately, the sub-modules are integrated into the
OCR framework. A testing and maturation pass of
the integrated OCR system is carried out to further
improve the document recognition accuracy. The
system has per page ligature recognition accuracy of
90.10% for the C&R module, 95.78% for the PP module and
86.15% for the End-to-End system. The broken and
special ligatures cause misrecognition, which will be
resolved in the future. The presented Urdu Nastalique
OCR is available online at: www.UrduOCR.net

7. Acknowledgements
This work has been conducted through the project
Urdu Nastalique OCR, supported through a research
grant from ICTRnD Fund, Pakistan.

8. References
[1] S. Hussain, "Letter to Sound Rules for Urdu Text to Speech System," in Workshop on Computational Approaches to Arabic Script-based Languages, COLING 2004, Geneva, Switzerland, 2004.
[2] M. Davis, "Unicode Text Segmentation," Addison-Wesley Professional, 2013.
[3] S. Hussain, "www.LICT4D.asia/Fonts/Nafees_Nastalique," in 12th AMIC Annual Conference on E-Worlds: Governments, Business and Civil Society, Asian Media Information Center, Singapore, 2003.
[4] M. Akram and S. Hussain, "Word Segmentation for Urdu OCR System," in 8th Workshop on Asian Language Resources, COLING 2010, Beijing, China, 2010.
[5] A. Wali and S. Hussain, "Context Sensitive Shape-Substitution in Nastaliq Writing System: Analysis and Formulation," in International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE), 2006.
[6] Q. Akram, S. Hussain, A. Niazi, U. Anjum and F. Irfan, "Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique," in 11th IAPR Workshop on Document Analysis Systems (DAS 14), Tours, France, 2014.
[7] M. Naz, Q. Akram and S. Hussain, "Binarization and its Evaluation for Urdu Nastalique Document Images," in The 16th International Multi Topic Conference (INMIC), Lahore,
[8] F. Shafait, D. Keysers and T. M. Breuel, "Layout Analysis of Urdu Document Images," in INMIC'06. IEEE,
[9] S. T. Javed and S. Hussain, "Improving Nastalique Specific Pre-Recognition Process for Urdu OCR," in 13th IEEE International Multitopic Conference 2009 (INMIC 2009), Islamabad, Pakistan, 2009.
[10] S. S. Bukhari, F. Shafait and T. M. Breuel, "High Performance Layout Analysis of Arabic and Urdu Document Images," in International Conference on Document Analysis and Recognition, 2011.
[11] Z. Shah and F. Saleem, "Ligature Based Optical Character Recognition of Urdu, Nastaleeq Font," in International Multi Topic Conference, Karachi, Pakistan,
[12] N. Sabbour and F. Shafait, "A Segmentation Free Approach to Arabic and Urdu OCR," in SPIE, Volume 8658,
[13] S. T. Javed, S. Hussain, A. Maqbool, S. Asloob, S. Jamil and H. Mohsin, "Segmentation Free Nastalique Urdu OCR," World Academy of Science, 2010.
[14] G. S. Lehal and A. Rana, "Recognition of Nastalique Urdu Ligatures," in 4th International Workshop on Multilingual OCR (MOCR '13), New York, NY, USA, 2013.
[15] R. Safabakhsh and P. Abidi, "Nastaaligh Handwritten Word Recognition Using a Continuous-Density Variable-Duration HMM," The Arabian Journal for Science and Engineering, 2005.
[16] S. T. Javed and S. Hussain, "Segmentation Based Urdu Nastalique OCR," in 18th Iberoamerican Congress on Pattern Recognition (CIARP 2013), Havana, Cuba, 2013.
[17] A. Muaz, "Urdu Optical Character Recognition System," Unpublished, MS Thesis Report, National University of Computer and Emerging Sciences, Lahore,
[18] A. Hasan, S. B. Ahmed, S. F. Rashid, F. Shafait and T. M. Breuel, "Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks," in International Conference on Document Analysis and Recognition, 2013.
[19] S. F. Rashid, F. Shafait and T. M. Breuel, "Discriminative learning for script recognition," in 17th IEEE International Conference on Image Processing, Hong Kong,
[20] S. Naz, K. Hayat, M. I. Razzak, M. W. Anwar, S. A. Madani and S. U. Khan, "The optical character recognition of Urdu-like cursive scripts," Pattern Recognition, pp. 1229-1248, 2014.
[21] N. Durrani and S. Hussain, "Urdu Word Segmentation," in 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles, US, 2010.
[22] R. Smith, D. Antonova and D.-S. Lee, "Adapting the Tesseract open source OCR engine for multilingual OCR," in International Workshop on Multilingual OCR, Barcelona, Spain, 2009.
[23] F. Adeeba, Q. Akram, H. Khalid and S. Hussain, "CLE Urdu books N-Grams," in Conference on Language and Technology 2014 (CLT14), Karachi, 2014.


Design of Speech Corpus for Open Domain Urdu Text to Speech System
Using Greedy Algorithm
Wajiha Habib
Center for

Rida Hijab Basit

Center for

Sarmad Hussain
Center for

Farah Adeeba
Center for

A unit selection text to speech system requires a
large database of recorded and annotated speech,
which contains both phonetic and prosodic variations.
At run time, appropriate units are selected from the
database and concatenated to produce the
desired utterance. The memory required for a unit
selection system is very large. In addition, multilayer
annotation of recorded speech is needed, which is a
tedious and time consuming task. Hence, there is a
need to optimize the speech corpus in such a way that
maximum coverage of target units is achieved
with minimum corpus size. A greedy algorithm serves
this purpose and has been used for intelligently
reducing the corpus.
This paper proposes a greedy algorithm for
designing an optimal speech corpus for a unit selection
text to speech system. The rest of the paper is
organized as follows: Section 2 reviews the literature
on greedy algorithm techniques designed for
different languages, Section 3 describes the proposed
methodology and Section 4 describes
the data gathered for extraction of the speech corpus.
Section 5 describes the implementation and evaluation of
the proposed algorithm to select an optimal speech
corpus, Section 6 analyzes the resulting corpus and
Section 7 concludes.

Unit selection speech synthesis is one of the most
widely used techniques for high quality text to speech
(TTS) systems. A unit selection text to speech system
requires a large database of recorded and annotated
speech, which contains both phonetic and prosodic
variations. Designing phonetically rich and balanced
speech corpora with minimum number of utterances
is an intricate task. Several optimization methods are
used for this purpose and "Greedy algorithm" is one
of them. This paper introduces a greedy algorithm,
which maximizes the coverage of high frequency
unigrams, bigrams and trigrams while selecting
minimal number of sentences from input corpus. The
algorithm has been applied on different corpora
collected from different domains and a speech corpus
for an Urdu TTS system is designed. A significant
coverage of tri-phones has also been achieved.

1. Introduction
Unit selection technique for speech synthesis is a
data-driven, concatenative approach. It dynamically
selects the longest sequence of phonetic segments
from the speech database, matching the
characteristics of the target to be synthesized. The
elegance of this approach lies in the lesser amount of
signal processing required on the final utterance
because the prosodic information is already a part of
the corpus stored in the inventory. Furthermore,
fewer concatenations result in a more natural speech
output. However, the quality of data-driven text to
speech system depends on the quality of its database.

2. Literature Review
Different techniques have been used to design a
corpus for speech applications. The greedy algorithm is
one of the methods employed to extract an optimal
reduced speech corpus from a large corpus. It is an
iterative approach that aims to maximize the
coverage of target units while selecting the minimum



[10]. Sentences with maximum score are selected in
[8,9,10]. Zhang et al. have selected sentences that
provide maximum syllable level information [16].
Sentences are scored according to the information
they provide regarding the syllables to be covered
and in the end, best score sentence is selected.
A cluster tree from general speech database is
built in [11] by clustering similar units (according to
some features) together. Based on these features
(phonetic, metrical and prosodic context), clusters are
split unless a small acoustic distance is obtained.
Greedy algorithm is then applied to find best
coverage utterances. For scoring a sentence, cluster
tree is traversed and sentences are scored
accordingly. The best score sentences are selected in
the end.
The Least to Most (LTM) greedy algorithm has also
been used for corpus reduction [14]. Tri-phones to be
covered are sorted in increasing order of their
frequency. The least frequent tri-phone is selected along with others that have the same frequency,
and a separate list is maintained. Another list contains
all the sentences that cover these tri-phones. These
sentences are scored and the best scoring sentence is
selected. This process is repeated for all the tri-phones. In the end, redundant sentences are removed
from the reduced corpus manually.
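
The Least to Most procedure described above can be sketched as follows. This is a rough illustration under stated assumptions, not the method of [14]: the function name `ltm_select` is hypothetical, frequency is counted as the number of sentences containing a unit, a sentence is scored by how many still-uncovered units it contains, and the final manual removal of redundant sentences is not modeled.

```python
from collections import Counter
from typing import Dict, List, Set


def ltm_select(sentence_units: Dict[str, Set[str]]) -> List[str]:
    """Pick sentences so that the least frequent units are covered first."""
    # Assumption: a unit's frequency is the number of sentences that contain it.
    freq = Counter(unit for units in sentence_units.values() for unit in units)
    uncovered = set(freq)
    selected: List[str] = []
    while uncovered:
        # Units sharing the lowest frequency among those still uncovered.
        lowest = min(freq[unit] for unit in uncovered)
        rare = {unit for unit in uncovered if freq[unit] == lowest}
        # Sentences that cover at least one of these rare units.
        candidates = [s for s, units in sentence_units.items() if units & rare]
        # Assumed score: number of still-uncovered units in the sentence.
        best = max(candidates, key=lambda s: len(sentence_units[s] & uncovered))
        selected.append(best)
        uncovered -= sentence_units[best]
    return selected
```

Each pass covers at least one rare unit, so the loop terminates once every unit in the input has been covered.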
Minimum match score sentence is considered as
the best sentence in [7]. For selecting sentences, the
context of diphones is checked in the selected
sentences and it is compared with the candidate
sentence. It then calculates the match score for the
two entities. Low match score indicates that the
diphone in the candidate sentence has a different
context as compared to the one in the selected
sentence, so in this case low cost sentences are
selected for the maximum coverage of diphones. The
greedy algorithm stops selecting sentences when a
certain number of sentences have been selected.
Greedy algorithm can be used to extract a list of
words from corpus that provides maximum coverage
of basic unit [12,17]. This reduced word list is then
used to construct sentences manually.
A semi-automatic algorithm that
generates sentences using Finite State Transducers
(FST) has also been used [13]. States represent vocalic sandwiches
whereas arcs represent valid transitions between
vocalic sandwiches (bigram sandwiches). This
process involves human intervention. The algorithm
generates sentences that give maximum coverage, but
as they are generated by an FST they can be
completely incorrect or senseless. For this reason, a
person with a linguistics background must
operate it. The operator can accept, reject or
ask to build another sentence based on the

number of sentences from the input corpus. The
target unit varies from phone to phrase level, i.e.
phone, diphone, tri-phone, syllable, unigram, bigram
and trigram. Selection of the target unit for corpus design
is based on the domain and needs of the application
field. Coverage of larger units results in a larger
database, which in turn produces high quality
speech, whereas a smaller target unit results
in a smaller database with compromised speech quality.
The phoneme can be used as a target unit for speech
corpus design. Phone level coverage results in a
limited corpus, but phoneme sized chunks fail to capture
the co-articulatory effects between adjoining
phonemes [19]. The acoustic behavior of a phone is
dependent on its previous and next phone. So, the speech
corpus should contain all the phones in all contexts.
Therefore, tri-phone coverage is taken into account
[12,14,15]. However, full coverage of tri-phones is
impractical due to their very large number. The diphone is
used as the basic unit for corpus selection in [7,9,10]
as it is affordable to build a corpus with high
coverage of diphones. A diphone is an acoustic chunk
from the middle of one phoneme to the middle of the
next phoneme. The diphone is a desirable unit in
concatenative synthesis because it gives complete
language coverage, consumes less memory and, as
co-articulation effects are minimal at the center of a
phoneme, it captures the co-articulatory effects.
For a tonal syllabic language like Chinese,
syllable coverage is the basic requirement for corpus
design [16]. A new linguistic unit "vocalic sandwich"
is defined in [13]. It is a sequence of phonemes, like
vowels and semi-vowels surrounded by two
phonemes (consonants). Greedy algorithm searches
for those sentences that maximize the vocalic
sandwich coverage rate.
In [8,15], multiple occurrences of target unit are
acquired to capture all acoustic variations. Target
phonetic distribution is focused in [17,18]. The aim is
to extract a small database having the same
probability distribution of phonetic features as the
distribution in total database. The phonetic
distribution includes phonemes, diphone patterns etc.
Sentence selection through greedy algorithm is
carried out on the basis of calculated score and
different criteria are employed for scoring the
sentence. Francois et al. use five different techniques
for scoring the sentences [8]. These include: high
number of units in the sentence, sentence length,
multiple occurrences of the unit and coverage of rare
units. Kelly et al. score sentences according to the
unique diphones they cover [9]. Different weights are
assigned to the diphones in the corpus and Okapi
formula is used to calculate scores of the sentences



phone list is generated from the phones present in

Urdu language. This list is further reduced by
collapsing those phonemes that have similar acoustic
effect [12]. The remaining three target lists
(unigrams, bigrams and trigrams) are generated from
the corpus itself. Urdu corpus and these lists are used
to run the proposed greedy algorithm. Lists are
updated throughout the algorithm whereas selected
sentences are removed from the original corpus.
The algorithm assigns scores to all the sentences
in a corpus according to the number of uncovered
units in the sentence. A flow diagram of selection
process is shown in Figure 1.

requirements. This algorithm takes three minutes on
average to build a credible sentence.
Speech corpus for Urdu language has also been
designed [12]. The resulting corpus consists of
sentences which are manually fabricated from the
phonetically rich wordlist. Greedy algorithm has been
used to extract those words from the corpus which
give maximum coverage of high frequency triphones.
Manual construction of sentences is a tedious and
troublesome task. A better approach should be used
to avoid this time consuming and laborious effort.
Therefore, an algorithm is devised to select optimal
sentences directly from the corpus instead of
constructing the sentences manually through the use
of the wordlist. The proposed algorithm automates
the process of speech corpus selection and produces a
sentence based optimal speech corpus for Urdu.

3. Methodology
Diphones and tri-phones are used as basic units
for speech synthesis. The greedy algorithm
techniques reviewed above look for maximum coverage of
these two units while selecting the minimum number of
sentences from the corpus. This strategy is good
enough for diphone and tri-phone concatenative
synthesis, but in unit selection the size of the
unit can vary. As we are designing a speech
corpus for a unit selection TTS system, we
propose an algorithm that takes four units of
different sizes that need to be covered while
constructing a reduced speech corpus. These four
units are: tri-phones, word unigrams, word bigrams
and word trigrams. 80% of the speech corpus has been
extracted using a top down approach in which coverage
of longer high frequency units (word unigrams, word
bigrams and word trigrams) has been maximized.
Tri-phone coverage has been given less attention
because phonetic transcription is required to report
the tri-phone coverage, but the existing transcription
lexicon is insufficient to give 80% of Urdu language
coverage. A significant coverage of tri-phones has
also been achieved in the process. The rest of the tri-phones will be covered using the bottom up approach
in the remaining 20% of the speech corpus.

3.1. Proposed Algorithm

The proposed greedy algorithm takes an Urdu corpus
and target lists as input. Target lists are the lists of
those units that need to be covered in the reduced
corpus. The units consist of tri-phones, word
unigrams, word bigrams and word trigrams. Unique tri-

Figure 1. Flow diagram for proposed greedy

A criterion has been devised for scoring a sentence.
Based on this criterion, a sentence is considered
optimal if it has the maximum number of distinct
units and a small length. All this has been
represented using a formula which is as follows:

[formula not recoverable from the source]

Here, N refers to the number of uncovered units
and w refers to the weight of the respective units,
which has been decided in the testing phase. The
output has been analyzed with different weighting
schemes during the testing phase, and the scheme that
provided the best coverage is selected.
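Since the formula itself is not legible here, the following is only a sketch of one scoring form consistent with the description (weighted counts of uncovered units, with shorter sentences favored). It is an assumption, not the authors' exact formula: the subscripts, the function L(S) and the division by sentence length are introduced here for illustration.

```latex
\[
\mathrm{score}(S) =
  \frac{w_{tp}\,N_{tp}(S) + w_{u}\,N_{u}(S) + w_{b}\,N_{b}(S) + w_{t}\,N_{t}(S)}{L(S)}
\]
% N_{tp}, N_{u}, N_{b}, N_{t}: number of still-uncovered tri-phones, word unigrams,
%   word bigrams and word trigrams found in sentence S
% w_{tp}, w_{u}, w_{b}, w_{t}: the corresponding weights
% L(S): length of S in words (assumed normalization)
```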
At each iteration, the algorithm picks the most
useful sentence (the sentence with the maximum
score), adds it to the selected sentence list,
removes it from the corpus and updates the target
lists (tri-phones, unigrams, bigrams, trigrams).
These steps are repeated until the lists have been
completely covered, or the score of the selected
sentence is less than some threshold value, or the
number of words in the reduced corpus reaches some
specified value. The devised algorithm generates
the following outputs:
• Reduced Urdu corpus giving maximum
  coverage of tri-phones, high frequency
  unigrams, high frequency bigrams and high
  frequency trigrams that occurred in the
  larger Urdu corpus
• Phone coverage report
• Tri-phone coverage report
• Unigram coverage report
• Bigram coverage report
• Trigram coverage report
• Reduced corpus size
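
The selection loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ngrams`, `sentence_score` and `greedy_select` are hypothetical, the score divides the weighted count of uncovered units by sentence length (an assumed form), and tri-phones are left out because they need a phonetic lexicon.

```python
from typing import Dict, List, Set, Tuple

Unit = Tuple[str, ...]


def ngrams(words: List[str], n: int) -> Set[Unit]:
    """Distinct word n-grams of one sentence."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def sentence_score(words: List[str],
                   remaining: Dict[int, Set[Unit]],
                   weights: Dict[int, float]) -> float:
    """Weighted count of still-uncovered units, divided by length (assumed form)."""
    if not words:
        return 0.0
    gain = sum(weights[n] * len(ngrams(words, n) & units)
               for n, units in remaining.items())
    return gain / len(words)


def greedy_select(corpus: List[str],
                  targets: Dict[int, Set[Unit]],
                  weights: Dict[int, float],
                  min_score: float = 0.0,
                  max_words: int = 70000) -> List[str]:
    """Pick sentences until targets are covered, the best score is at or
    below min_score, or the word budget is reached."""
    remaining = {n: set(units) for n, units in targets.items()}
    pool = [s.split() for s in corpus]
    selected: List[str] = []
    total_words = 0
    while pool and any(remaining.values()) and total_words < max_words:
        best = max(pool, key=lambda w: sentence_score(w, remaining, weights))
        if sentence_score(best, remaining, weights) <= min_score:
            break
        pool.remove(best)                    # remove the sentence from the corpus
        selected.append(" ".join(best))
        total_words += len(best)
        for n in remaining:                  # update the target lists
            remaining[n] -= ngrams(best, n)
    return selected
```

The keys of `targets` and `weights` are n-gram orders (1 for unigrams, 2 for bigrams, 3 for trigrams). Rescoring the whole pool at every iteration is simple but slow on a corpus of millions of words; a production version would need an index from units to sentences.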

4. Corpus Description
The corpus for a generic TTS system should be
gathered from a broad range of domains to ensure
diversity. Therefore, we have used three different
corpora for extraction of the reduced speech corpus.
One of the Urdu corpora selected for TTS is a typed
text corpus that has been taken from Urdu books [21].
It consists of 35 million words. The corpus contains
861 books from different domains, i.e. religion,
science, biography, poetry, travel, short stories and
literature. These books not only cover Urdu
characters but also contain English characters,
Arabic, digits, URLs and special symbols.
Another corpus being used is the "CLE Urdu digest
corpus 1M", which has been collected from Urdu
digest [20]. An Urdu news corpus of 2.6 million words
is the third corpus used for speech corpus selection.
The news corpus has been collected from different
Urdu news websites, e.g. BBC and Jang. It is from the
year 2005 and covers different sections of the news:
business, editorials, news and sports.
The proposed greedy algorithm has been run on the
three corpora described above. The first corpus of
typed Urdu books has been used for testing the
proposed greedy algorithm. The weights that produce
the best coverage result have been selected and used
for obtaining the reduced corpus from the other two
corpora. Details of the testing and evaluation of the
greedy algorithm are documented in the following
sections.

5. Evaluation and Testing

During the implementation of the greedy algorithm,
different target lists, comprising those units that
need to be covered in the reduced corpus, have been
used. Different techniques are used to generate these
lists, which are explained in the following
section. Weight assignment is also described in
detail.

5.1. Target Lists Generation

The 35 million word corpus has been used for
generating lists of unique word unigrams, word
bigrams and word trigrams along with their
frequencies. These lists are sorted by frequency and
the resulting lists are plotted to find the threshold
for target list generation. In Figure 2, unigrams are
plotted against their frequencies. After the frequency
value 495, the graph shows constant behavior. A
sub-list is formed consisting of only those unigrams
having a frequency greater than or equal to 495.

Figure 2. Unigram's frequency plot

The same method is followed for bigrams and
trigrams. The threshold value for the bigram list is
465 and for the trigram list is 125, as shown in
Figures 3 and 4 respectively. Based on these threshold
values, sub-lists for bigrams and trigrams have been
generated. These sub-lists are given as target lists
to the greedy algorithm, and the coverage of these
lists is the focus while obtaining the reduced Urdu
corpus.
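
The thresholding step can be illustrated with a short sketch. The function names `ngram_counts` and `target_list` are hypothetical, and whitespace tokenization is an assumption; the thresholds themselves (495, 465 and 125) were read off the frequency plots by the authors and are simply passed in as parameters.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Unit = Tuple[str, ...]


def ngram_counts(sentences: Iterable[str], n: int) -> Counter:
    """Count every word n-gram occurrence across the corpus."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sentence.split()  # assumed tokenization: split on whitespace
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


def target_list(counts: Counter, threshold: int) -> Set[Unit]:
    """Keep only the units whose frequency is at least the threshold."""
    return {unit for unit, freq in counts.items() if freq >= threshold}


# Thresholds reported in the text for the 35 million word corpus.
THRESHOLDS = {1: 495, 2: 465, 3: 125}
```

For a corpus held in `sentences`, the unigram target list would be `target_list(ngram_counts(sentences, 1), THRESHOLDS[1])`, and likewise for bigrams and trigrams.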



Figure 3. Bigram's frequency plot

Figure 4. Trigram's frequency plot

5.2. Weight Assignment

An appropriate weighting scheme is required to
prioritize the selection of target units. A unit with
a higher contribution must be given a larger weight.
The devised weighting scheme gives weight x to word
unigrams and x/7 to tri-phones, assuming that a single
word contains 7 tri-phones (5 phones) on average.
Word bigrams have been given weight 2x as a bigram
consists of two words. Experiments have been
performed with three different weights for word
trigrams: 3x, 4x and 5x. The weight 3x reflects that a
word trigram covers three words, 4x that it covers
two bigrams, and 5x that it covers three words and
two bigrams. Results have been gathered by testing
these weights on a smaller corpus. The best coverage
has been achieved by assigning weight x/7 to
tri-phones, x to words, 2x to bigrams and 5x to
trigrams.
Afterwards, this weighting scheme was applied to
the 35 million word corpus, but the results were not
promising. For better coverage of unigrams, bigrams
and trigrams, the weight of tri-phones has been kept
constant whereas the weights of the other three have
been tweaked to obtain a reduced corpus with above
90% coverage of unigrams, bigrams and trigrams. The
weight of tri-phones has been kept minimal, and
tri-phone coverage has not been taken into account,
because phonetic transcription is not available for
all the words in the corpus. Words with no available
phonetic transcription are transcribed as silence.
The stopping criterion is 70,000 words in the reduced
corpus. Results are summarized in Table 1, and the
weighting scheme that provided the best average
coverage percentage has been selected.

Table 1. Coverage results with different
weighting schemes

Figure 5 shows the average coverage against
different weighting schemes. The weights at which
the best coverage has been achieved are 0.017 for
tri-phones, 0.2 for unigrams, 0.3 for bigrams and
0.483 for trigrams. These weights are tested with
different stopping criteria based on sentence score
and the resulting number of words in the reduced
corpus.

Figure 5. Coverage result for different
weighting schemes
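
The relative scheme (x/7, x, 2x and a trigram multiple of x) can be turned into concrete numbers once x is fixed. The sketch below assumes that the weights are normalized to sum to 1; the text does not say so, although the final reported weights (0.017, 0.2, 0.3, 0.483) do sum to 1. The function name `normalized_weights` is hypothetical.

```python
from fractions import Fraction
from typing import Dict

# Final weights reported in the text for the 35 million word corpus.
FINAL_WEIGHTS = {"triphone": 0.017, "unigram": 0.2, "bigram": 0.3, "trigram": 0.483}


def normalized_weights(trigram_multiplier: int) -> Dict[str, float]:
    """Relative weights x/7, x, 2x, k*x scaled so that they sum to 1 (assumption)."""
    relative = {
        "triphone": Fraction(1, 7),
        "unigram": Fraction(1),
        "bigram": Fraction(2),
        "trigram": Fraction(trigram_multiplier),
    }
    total = sum(relative.values())
    return {name: float(value / total) for name, value in relative.items()}


if __name__ == "__main__":
    for k in (3, 4, 5):  # the three trigram weights tried in the experiments
        print(k, normalized_weights(k))
```

Under this normalization the initial 5x scheme puts far more weight on trigrams than the final scheme does, which fits the statement that the unigram, bigram and trigram weights were subsequently tweaked.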



6. Finalization of Corpus
More than 90% coverage of the unigram, bigram
and trigram target lists has been achieved in 80% of
the speech corpus. Tri-phone coverage could not be
reported at the time of corpus extraction due to the
incomplete lexicon. The speech corpus extracted
through the greedy algorithm has been transcribed for
tri-phone analysis, and 34053 unique tri-phones are
found in the reduced corpus.

The total speech required for the TTS system corpus
is 10 hours. The top-down approach has been used for
extraction of 80% of the speech corpus (8 hours of
recorded speech). Approximately 6.5 hours of speech
corpus (70,000 words) has been obtained from the 35
million word corpus, whereas 1.5 hours of speech
corpus has been obtained from the 1M Urdu Digest and
news corpora. Target list generation for the second
and third iterations of the greedy algorithm has been
done using a different method. For the generation of
unigram, bigram and trigram sub-lists from the 1M
corpus, the unigrams, bigrams and trigrams covered in
the reduced corpus obtained from the 35 million word
corpus are removed from the respective lists of the
1M corpus. The resulting lists are then used to find
the target lists by plotting graphs for each of them.
After running the greedy algorithm, the reduced
corpus from the 1M corpus has been merged with the
reduced corpus from the 35 million word corpus. For
the news corpus, the same process is repeated as with
the 1M corpus, but this time with the merged corpus
(35 million word corpus and 1M corpus). The unigrams,
bigrams and trigrams covered in the merged corpus are
removed from the lists of the news corpus, and target
lists are generated by plotting graphs for each of the
lists (unigrams, bigrams and trigrams).
At the end of this process, 8 hours of speech
corpus has been gathered using the greedy algorithm,
which is then used for recording. Table 2
shows the results of the greedy algorithm for the
three different corpora.
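
The list update for the second and third iterations amounts to a set difference followed by the same thresholding as before. A minimal sketch, with the hypothetical helper name `remaining_counts` and units represented as word tuples (an assumption):

```python
from typing import Dict, Set, Tuple

Unit = Tuple[str, ...]


def remaining_counts(counts: Dict[Unit, int], covered: Set[Unit]) -> Dict[Unit, int]:
    """Drop units already covered by the reduced corpus selected so far."""
    return {unit: freq for unit, freq in counts.items() if unit not in covered}
```

Here `counts` would hold the unit frequencies of the next corpus (the 1M corpus, then the news corpus) and `covered` the units present in the reduced corpus merged so far. The thresholds for these later iterations are not given in the text.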

7. Conclusion and Future Work

In this paper, a greedy algorithm has been
proposed to extract a corpus of minimum size from
a reference corpus while maximizing the
coverage of target units for text-to-speech systems.
The target units covered include tri-phones, word
unigrams, word bigrams and word trigrams. The
proposed algorithm is used to create a speech corpus
for an open domain unit selection Urdu text-to-speech
system. The corpus obtained in the process is 80% of
the whole speech corpus required for the TTS system.
Minimal attention has been paid to tri-phone
coverage. In the development of the remaining 20% of
the speech corpus, the focus will be on tri-phone
coverage. The selected speech corpus will be used for
recording. The recorded speech files will be
annotated, and the tagged speech will be used as the
database of the unit selection Urdu TTS.

8. Acknowledgement
This work has been conducted through the
project, Enabling Information Access for Mobile
based Urdu Dialogue Systems and Screen Readers
supported through a research grant from ICTRnD
Fund, Pakistan.

Table 2. Results of greedy algorithm for
different corpora
[Table body not recoverable; surviving column header: "Number of Words in"]

9. References
[1] Taylor, Paul. Text-to-speech synthesis. Cambridge
University Press, 2009.
[2] Peterson, Gordon E., William S-Y. Wang, and Eva
Sivertsen. "Segmentation techniques in speech synthesis",
Journal of the Acoustical Society of America 30.8 (1958),
739-742.
[3] N. Campbell and A. W. Black, "Prosody and the
Selection of Source Units for Concatenative Synthesis". In
J. van Santen, R. Sproat, J. Olive, and J. Hirschberg,
"Progress in Speech Synthesis". Springer Verlag, 1995.
[4] Tokuda, K., Masuko, T., Yamada, T.: "An Algorithm
for Speech Parameter Generation from Continuous mixture
HMMs with Dynamic Features". In: Proc. of Eurospeech
[5] Suendermann, David, Harald Höge, and Alan Black.
"Challenges in speech synthesis." Speech Technology.
Springer US, 2010. 19-32.
[6] Zen, H., Toda, T.: "An overview of Nitech HMM-based
speech synthesis system for Blizzard Challenge 2005",
in proc. of Interspeech 2005, Lisbon, pp. 93-96
[7] B. Bozkurt, O. Ozturk, and T. Dutoit, "Text design for
TTS speech corpus building using a modified greedy
selection", in proc. Eurospeech, 2003.
[8] François, H. and Boëffard, O., "The Greedy Algorithm
and its Application to the Construction of a Continuous
Speech Database", in proc. LREC, Las Palmas de Gran
Canaria, Spain, 2002.
[9] Kelly, A., A. Ní Chasaide, H. Berthelsen, C. Campbell,
and C. Gobl. "Corpus Design Techniques for Irish Speech
Synthesis", in proc. China-Ireland International Conference
on Information and Communications Technologies, NUI
Maynooth, Ireland. 2009.
[10] Wei, Zhang, Liu Yayu, Deng Ye, and Pang Minhui.
"Automatic Construction for a TTS Corpus with Limited
Text" In Measuring Technology and Mechatronics
Automation (ICMTMA), 2010 International Conference on,
vol. 1, pp. 707-710. IEEE, 2010.
[11] Black, Alan W., and Kevin A. Lenzo. "Optimal data
selection for unit selection synthesis" In 4th ISCA Tutorial
and Research Workshop (ITRW) on Speech Synthesis.
[12] Raza, A., Sarmad Hussain, Huda Sarfraz, Inam Ullah,
and Zahid Sarfraz. "Design and development of
phonetically rich Urdu speech corpus", in proc.
[13] Cadic, Didier, Cédric Boidin, and Christophe
d'Alessandro. "Towards Optimal TTS Corpora." in proc.
LREC. 2010.
[14] Suyanto, "Modified Least-to-Most Greedy Algorithm
to Search a Minimum Sentence Set" in proc. TENCON,
Hong Kong, 2006
[15] François, H. and Boëffard, O., "Design of an Optimal
Continuous Speech Database for Text-To-Speech Synthesis
Considered as a Set Covering Problem", in proc.
Eurospeech, Aalborg, Denmark, 2001.
[16] Zhang, Jianhua Tao Fangzhou Liu Meng, and Huibin
Jia. "Design of Speech Corpus for Mandarin Text to
Speech", The Blizzard Challenge 2008 workshop, Oct. 2008
[17] Montero, Juan Manuel, Ricardo de Córdoba, José A.
Vallejo, Juana M. Gutiérrez-Arriola, Emilia Enríquez, and
José Manuel Pardo. "Restricted-domain female-voice
synthesis in Spanish: from database design to ANN
prosodic modeling," in proc. INTERSPEECH, pp. 621-624.
2000.
[18] Vorapatratorn, Surapol, Atiwong Suchato, and
Proadpran Punyabukkana. "Automatic online text selection
for constructing text corpus with custom phonetic
distribution" in proc. Computer Science and Software
Engineering (JCSSE), 2012 International Joint Conference
on, pp. 6-11. IEEE, 2012.
[19] Harris, Cyril M. "A study of the building blocks in
speech." The Journal of the Acoustical Society of
America 25.5 (1953): 962-969.
[20] Urooj, S., Hussain, S., Adeeba, F., Jabeen, F. and
Parveen, R. "CLE Urdu Digest Corpus", in proc. of
Conference on Language and Technology 2012 (CLT12),
Lahore, Pakistan.
[21] Adeeba F., Akram Q., Khalid H., Hussain S., "Urdu
Books N-gram Corpus", in proc. of Conference on
Language and Technology 2014 (CLT14), Karachi,


Proceedings of the Conference on Language & Technology 2014

An optimized Pashto Keyboard Layout based on Character Frequency

Kamran Ghani
Department of Computer Science & I.T,
University of Engineering and Technology,
Peshawar, Pakistan
E-mail: kamranghani@nwfpuet.edu.pk

Muhammad Junaid
Department of Computer Science & I.T,
University of Engineering and Technology,
Peshawar, Pakistan
E-mail: junaid370@yahoo.com
Iftikhar Ahmed Khan
Department of Computer Science
COMSATS, Abbottabad
Email: iftikharahmed@ciit.net.pk

Abstract

Pashto is spoken by an estimated 40 to 50 million
people in the world and has a very rich culture and
literary history. However, on the computing side, it
is not on par with other popular regional languages.
Many areas need to be explored. In this paper, we
propose an optimized Pashto keyboard layout.
Though many Pashto keyboard layouts are
already in common use, no standard keyboard layout
is available. Furthermore, none of these has been
developed on a sound scientific basis. The proposed
work is for Afghani Pashto and is based on similar
work done for other languages. To evaluate the
proposed layout, evaluation criteria are also
suggested. Analysis shows that the proposed layout is
superior to the existing layouts in many respects.

1. Introduction

Pashto has a rich literary history. However, to
survive as an independent language in the modern era,
a language also needs to be technologically advanced.
Unfortunately, this is not the case with Pashto, for
many reasons including cultural, political and even
geo-strategic factors. Recently, however, researchers
have shown interest in its development on this
front [5,6,8] and [23]. One aspect of this development
is localization, which is the adaptation of a
product to a specific region. In this paper we
consider this aspect for the Pashto keyboard
layout, for efficient input of Pashto text.

The keyboard is the most widely used input device.
The efficiency of a keyboard is measured by the typing
speed achieved with it, which itself depends on
many factors. To achieve maximum efficiency, keys
need to be placed at appropriate positions in order to
minimize finger travel and awkward keystrokes.
For the Pashto language, Persian and Arabic
layouts were initially used for text entry. Later on,
localized keyboard layouts were developed [1,2], of
which the most common ones are Liwal, Khpala Pashto
and TolAfghan. Despite our best efforts, we could not
find any work related to the development of these
layouts, and none of these layouts has its keys laid
out according to the established principles of keyboard
layout design. Therefore, we propose an optimized
local keyboard layout for Pashto, which has been
developed on a sound scientific basis as described in
section 3.2. A brief introduction to the Pashto language
is provided in the following sub-section.

1.1. Pashto Language

Pashto belongs to the family of Indo-Iranian
languages and is spoken by an estimated 40-50 million
people [25]. The Rig-Veda, a sacred collection of Hindu
texts written in 1400 B.C., contains references to
Pashto and the Pashtoon people [3]. From different
artifacts, archives and written material, researchers
have established that the Pashto language is 2500-3500
years old [4].
Pashto has 44 letters [17]. Its script has
been partly derived from that of Persian, which in
turn is derived from the Arabic script. It also has
four diacritics, though these are not in frequent use.
Writing style can be divided into two broad categories
along the Afghan-Pakistan border, i.e. Pashto written
by the Pushtoon population in Pakistan, which is called
Yousafzai, and the other one, which is called Qandahari
or Afghani Pashto [8].

2. Literature Review
Any keyboard layout can best be judged on two
broad lines, i.e. the physical and psychological aspects
of typing associated with it [15]. The two aspects have
conflicting requirements. In order to achieve better
typing speed, the most challenging task is to narrow
the gap between the two without compromising on the
principal design issues.
The most widely used keyboard layout is
QWERTY, named after the six letters on the top row
from left to right. This layout is also referred to as
the Standard or Universal keyboard layout; it was
borrowed from the first commercial typewriter, invented
by Christopher Latham Sholes in 1866 [15]. It was
designed to avoid jamming of arms to improve typing
efficiency. The main focus of the design was to place
the most commonly used two-letter English sequences,
also called digraphs, on opposite sides of the keyboard
to avoid jamming. This affected typing speed, but speed
was not the main goal. Later it was realized that
alternative layouts were needed with clear layout
objectives in terms of various factors such as typing
speed, learnability, ergonomic features, equal use of
both hands, and avoiding carpal tunnel syndrome and
Repetitive Stress Injury (RSI).
In 1920, the industrial engineer Frank Gilbreth
conducted time-and-motion studies
on many worker activities [15]. His research on typing
raised many questions about the QWERTY
layout. He claimed that improved typing speed could
be achieved with an alternate layout. He emphasized
the major design issues, best described by [10] as
allocating letters among keyboard rows, among
fingers, and between the left and right hands. During
the 1930s, August Dvorak along with his colleagues
at the University of Washington started working on an
optimized layout based on the research findings of
Frank Gilbreth [15]. They developed the DVORAK
layout, which is also called the Dvorak Simplified
Keyboard (DSK). This layout was developed on a sound
scientific basis [7, 15] and was also based on the study
of letters, digraph frequency and the physiology of the
hand. It also placed vowels on the left hand side to
achieve better hand alternation during typing, which is
considered one of the major factors in typing speed.
According to [18], speed optimization has three
major factors: equal loads on both hands, maximum
usage of the home row, and maximum usage of alternating
hands plus minimal usage of the same finger. The DSK
layout is best suited for the first two constraints but
lags well behind QWERTY on the third. It was claimed
through different research studies and experiments
[19] and [20] that DSK has far superior performance
to QWERTY. Many researchers have
doubted this claim. Liebowitz and Margolis [21] have
questioned the experimental setup and statistical
analysis used in the experiments. A US state-funded
research project conducted by Earle Strong, a
professor at Pennsylvania State University, claimed
that DSK has no gain in typing speed over its
counterpart QWERTY [22].
J.V. Parkinson [14] and G.B. Harbaugh [16]
suggested alphabetical keyboard layouts, which place
letters in alphabetical order in a more natural way,
grouped from left to right just like reading
the pages of a book, keeping in view user-friendliness
and efficiency in terms of typing speed, which were not
considered in QWERTY and DSK. The alphabetical layout
has the obvious advantage of being easier to learn than
random layouts such as QWERTY and DSK. However,
[15] has concluded that such ordering gives no
significant improvement over random layouts in terms
of ease of learning and efficiency. This approach could
also be extended to other languages to see its
effectiveness in terms of typing speed with respect to
the available layouts.
Inspired by the development and evolution of
different keyboard layouts for English, the research has
been extended to many other languages. The research
done by [7] and [13] has also focused on the DSK
approach for general guidelines to achieve better
layouts. Another research study adapted the DSK layout
to regional languages in
Europe such as German and Croatian [9]. The main
aim of the research was to show the significance of
the DSK layout and its usability for other regional
languages, and to see whether it has performance gains
for them similar to those for
English. The research findings of [9] suggest
that it has a performance gain over QWERTY for the
aforementioned languages. The research study was
restricted to single letter occurrences only.
For the subcontinent, such work can be found for
the Bengali, Hindi and Nepali languages [11, 12, 13] and
[27], which aimed at achieving optimized layouts
through optimization techniques and other methods
such as data mining techniques to achieve the same
layout objectives discussed earlier.


For the Pashto language, the only work we could find
is by H. Najiullah and H. Sherani [1]. However, after
analyzing it we found many shortcomings. For
example, the placement of vowels is not proper. Also,
many letters are inappropriately positioned, which we
believe affects typing efficiency, such as  which is not
in the set of Pashto alphabets [17]. It is very rarely
used in Pashto but [1] has placed it at the home row
without shift key and placed instead the right  with
shift status on the similar position. Similar is  also a
vowel placed on right hand side.

3. Methodology
3.1. Data Collection
To develop an optimized layout, a large body of
text is required. Such a collection can be found in text
corpora. Unfortunately, no such corpus exists for the
Pashto language. Some work has been done by the
Evaluation and Language resources Distribution
Agency (ELDA) to develop a Pashto corpus [8];
however, the data is still not available. Therefore, we
resorted to alternate sources for data collection.
Fortunately, the internet has many publicly available
Pashto web sites, from which the bulk of the data was
obtained. The data comprises 93 million words
containing about 441 million characters, with one
million unique words. These figures are summarized in
Table 1. Frequencies of these characters are shown in
Fig. 1 and 2.
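
The character frequencies behind Fig. 1 and 2 can be computed with a simple count. The sketch below is an illustration only: the function name `char_frequencies` is hypothetical, and skipping whitespace is an assumption, since the text does not say how spaces or punctuation were treated.

```python
from collections import Counter
from typing import List, Tuple


def char_frequencies(text: str) -> List[Tuple[str, int, float]]:
    """Return (character, count, percent) sorted from most to least frequent."""
    counts = Counter(ch for ch in text if not ch.isspace())  # assumed: ignore whitespace
    total = sum(counts.values())
    return [(ch, n, 100.0 * n / total) for ch, n in counts.most_common()]
```

The position of a character in the returned list is its frequency rank, which is the ordering used later when assigning characters to keys.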
Fig. 2 shows non-Pashto characters, which are either
used as alternatives for some Pashto characters or used
for Arabic text written inside Pashto text. These
characters are usually placed with the shift key. The
data we gathered from the available resources are for
Afghani Pashto. Therefore, we considered only this
dialect of Pashto in this work. The data will be made
available on [www.kghani.net], once the research is
Figure 1: Afghani Pashto Character Frequency

Table 1: Summary of Statistics about Collected Data

Stats Factor          Calculated
Total Characters      about 441 million
Total words           93 million
Total Unique Words    one million

Figure 2: Non-Pashto Character Frequency


3.2. Proposed Layout

L.J. West [7], T. Nakić-Alfirević and M. Đurek [9],
P.S. Deshwal and D. Kalyanmoy [11], Md.A. Sattar et al
[12] and A. K.M. Masum et al [13], in their respective
research studies, emphasize that a layout should be
based on the following principles and guidelines.

• Same-finger usage must be avoided to the
  fullest extent to avoid unnecessary load on
  one specific finger, as two independent fingers
  can type faster.
• Hand alternation is required to increase
  typing efficiency, as two independent hands
  can type faster.
• Minimum keystrokes should be required to type
  accurately and efficiently.
• Strong fingers such as the index and middle
  fingers must share the maximum load.
• Pinky/little finger usage should be minimal
  to avoid Repetitive Stress Injuries (RSI) or
  carpal tunnel syndrome. These fingers are prone
  to injury and fatigue with excessive usage.
• Home row usage must be at the maximum, as this
  is where both hands rest and it is the easiest
  row to type. Therefore the highest-frequency
  characters are placed on this row.
• Bottom row usage must be at the minimum.
• Both hands should be equally utilized under
  all circumstances, to minimize the difference
  between the usage of the two hands.
• The distance travelled by both hands must be
  as short as possible.
• According to [9], just as in the Dvorak layout,
  all vowels should be placed on the left hand
  side. The Latin vowels a, e, i, o and u are all
  kept on the left hand side in the Dvorak layout
  so that digraphs are typed with both hands,
  supporting hand alternation.
• The layout must be easy to learn and work with.
  Learnability is desirable but not at the cost of
  the aforementioned points.

Keeping in view the above-mentioned points, the
following process is undertaken to place characters at
the best possible positions to achieve an optimal
layout. In order to better understand the process of
assigning keys, we used the template of the Standard
34-keys keyboard layout, having just three alphabetical
rows to be used for letter/character assignments, as
shown in Fig. 3. The proposed layout, shown in Fig. 4
and 5, has been developed using Microsoft Keyboard
Layout Creator version 1.4 [26].

Figure 3: Standard 34-keys Keyboard
Template Showing Three Empty Alphabetical

Figure 3 shows the finger usage at row 3 from top to
bottom. 0 and 7 represent the pinky/little finger
of the left and right hand respectively; similarly 1
and 6 represent the ring fingers, 2 and 5 the middle
fingers, and 3 and 4 the index fingers of the left and
right hand respectively. x represents cells/slots not
to be assigned to any character. Empty cells represent
slots available to be assigned to characters.
The main idea is to place characters in order of
their frequency, assigning slots equally to the left
and right hand sides, with the additional leverage of
placing vowels on the left hand side to provide
efficient hand alternation with digraphs.
The home row is the first row to be
assigned characters, followed by the top row and then
the bottom row. The characters having the highest
frequency are placed in the home row;  and  are the
most frequently used characters, as can be seen in
Figure 1, and have been placed on the left hand side
due to their vowel status. Consequently, left hand
usage is slightly higher than right hand usage, i.e. by
less than 1 percent, as can be observed in Table 2.
Some characters have similar shapes with a bit of
variation such as has another variation though
totally different character but similar shape just has
curve underneath . Similar is the case with .
Similar approach has also been applied for and
though having different shapes but belongs to same
class of phonology. Though is at position no. thirty
two in the character frequency graph as shown in Fig. 1
but to increase home row usage along with ease of
learning it has also been placed with shift key on
similar position of . These types of characters are
placed on similar positions with their counterpart
shapes with shift key to make the layout easy to use and

Figure 4: Proposed Normal Afghani Pashto

Figure 5: Proposed Afghani Pashto Layout
with Shift Key

3.3. Evaluation Criteria
The proposed layout is compared to the most
commonly used Pashto layouts based on the
guidelines given in L.J. West [7], T. Nakić-Alfirević
and M. Đurek [9], P.S. Deshwal and D. Kalyanmoy [11],
Md.A. Sattar et al [12], A. K.M. Masum et al [13] and
Martin Krzywinski [24]. Considering these guidelines,
almost all important factors affecting overall typing
efficiency are described in the following sections, in
the order of importance given in the aforementioned
papers.
3.3.1. Distance Travelled by both hands. The distance
travelled by both hands during typing is the most
important efficiency factor. We devised a formula for
calculating it. The formula is based on the rationale
that if we keep all the physical features (key size,
distance between keys, etc.) of a keyboard constant,
then the overall total distance D travelled by the
fingers can be calculated to judge performance in terms
of distance travelled. The lower the value of D, the
better the layout.
Let d be the center-to-center distance between two
keys, horizontally or vertically, which is taken
as 1.8 cm after measuring the most common PC
keyboards available to us for our evaluation. The home
row is taken as the reference row, i.e., the keys on
the home row have a distance of 0.
If the row is top or bottom and the column is
one of 5, 6 and 11, the distance between the keys
will be d*2.82. Similarly, if the row is the top row
and the column is 13, the distance will be d*6.32. All
the remaining keys will have a distance of 2*d, as the
finger has to travel a distance d and then come back to
its respective position on the home row horizontally or
vertically. Any key pressed with the shift key will
have an added distance of d*2.82. These constants
multiplied by d have been calculated through the
Pythagorean theorem. Therefore, D can be written

Wherein = 2*d, = 2.82*d and = 6.32*d.

3.3.2. Keystrokes. This is another major factor
contributing to overall typing efficiency. It
is a function of the modifier key pressed along with
the character key in a particular key
combination. Since Pashto has a large set of
characters, the most frequently used characters must be
placed in the normal keyboard state to avoid extra
keystrokes. For example, entering a character without
pressing shift or any other control key requires only
one keystroke, whereas entering a character
with the shift key or any other
control key requires two keystrokes. Therefore,
the total number of keystrokes will exceed the total
number of characters entered, by an amount
depending on the text entered. The value shown in
Table 2 row 2, representing the keystroke efficiency
factor, has been calculated through the following
formula:

Wherein TKE = total keystrokes in excess of total
characters entered, represented in percent, TK = total
keystrokes and TC = total characters entered.
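
Both factors can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' formulas, which are not legible here: D is taken as the sum of per-key distances over all typed characters, and TKE as (TK - TC) / TC * 100, which is one reading of the definitions above. The layout representation (character mapped to row, column and shift state) and the function names are hypothetical. The constants 2.82 and 6.32 are those given in the text; they appear to be truncations of 2*sqrt(2) and 2*sqrt(10).

```python
from typing import Dict, Tuple

D_CM = 1.8          # center-to-center key distance d reported in the text
STRAIGHT = 2.0      # up or down one row and back: 2*d
DIAGONAL = 2.82     # diagonal reach and back, as given in the text
FAR_KEY = 6.32      # top row, column 13, as given in the text

# Hypothetical layout format: character -> (row, column, needs_shift)
Layout = Dict[str, Tuple[str, int, bool]]


def key_distance(row: str, col: int, shifted: bool = False) -> float:
    """Distance in cm charged for one key press, following the rules in the text."""
    if row == "home":
        dist = 0.0
    elif row == "top" and col == 13:
        dist = FAR_KEY * D_CM
    elif col in (5, 6, 11):
        dist = DIAGONAL * D_CM
    else:
        dist = STRAIGHT * D_CM
    if shifted:
        dist += DIAGONAL * D_CM   # added distance for reaching the shift key
    return dist


def evaluate(text: str, layout: Layout) -> Tuple[float, float]:
    """Return (total distance D in cm, TKE in percent) for the given text."""
    total_distance = 0.0
    keystrokes = 0
    chars = 0
    for ch in text:
        if ch not in layout:      # assumed: characters outside the layout are skipped
            continue
        row, col, shifted = layout[ch]
        total_distance += key_distance(row, col, shifted)
        keystrokes += 2 if shifted else 1
        chars += 1
    tke = 100.0 * (keystrokes - chars) / chars if chars else 0.0
    return total_distance, tke
```

With a toy layout in which "c" sits on the bottom row in column 5 and needs shift, typing "abca" with "a" on the home row and "b" on the top row yields a TKE of 25 percent, since one of the four characters costs an extra keystroke.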


Proceedings of the Conference on Language & Technology 2014

TABLE 2: All Competing Layouts Results

This value will be of representative of how much

more keystrokes are required in percent to enter the
given text as shown in Table 2.


3.3.3. Rest of the factors. These include: same finger usage, same hand usage, home row usage, bottom row usage, pinky/little finger usage, strong finger (middle and index) usage, left hand usage and right hand usage. Each factor is quantified in Table 2 as its usage count divided by the total characters entered, multiplied by one hundred. These factors are already explained in section 3.2.
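As an illustration of the "usage divided by total characters, times one hundred" computation, the following minimal sketch derives a few of these percentages from a text and a key map. The `Key` record, the four-key layout and the factor names are hypothetical stand-ins; the actual Pashto key assignments and the exact factor definitions of section 3.2 are not reproduced here.

```python
from collections import Counter
from typing import Dict, NamedTuple


class Key(NamedTuple):
    hand: str    # "left" or "right"
    finger: str  # "index", "middle", "ring" or "little"
    row: str     # "top", "home" or "bottom"


def usage_percentages(text: str, layout: Dict[str, Key]) -> Dict[str, float]:
    """Return each usage factor as count / total characters * 100."""
    typed = [layout[ch] for ch in text if ch in layout]
    total = len(typed)
    if total == 0:
        raise ValueError("no character of the text is on the layout")
    counts: Counter = Counter()
    for key in typed:
        counts[f"{key.hand}_hand"] += 1
        counts[f"{key.row}_row"] += 1
        if key.finger == "little":
            counts["little_finger"] += 1
        if key.finger in ("index", "middle"):
            counts["strong_fingers"] += 1
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical four-key layout, using Latin letters as placeholders.
demo_layout = {
    "a": Key("left", "little", "home"),
    "f": Key("left", "index", "home"),
    "j": Key("right", "index", "home"),
    "q": Key("left", "little", "top"),
}
demo = usage_percentages("afjq", demo_layout)
```

The gap between `left_hand` and `right_hand` in the result gives the hand balance that the Results section says should be as small as possible.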























[Table 2: values not recoverable in this copy; surviving row labels include Same Hand, Home Row, Left Hand and Right Hand.]








4. Results
A sample of 10 text files from various publicly available web sites is used for the analysis. It comprises a total of 3.3 million characters. All the values in Table 2 are given in percent, except distance travelled, which is shown as a rounded figure. For distance travelled, keystrokes, same finger usage, same hand usage and little finger usage, the lowest value in Table 2 is considered best, whereas for home row usage and strong finger usage the highest value is considered best. Left hand and right hand usage should be as equal as possible, so the smaller the gap between them, the better the layout.
Table 2 shows the results of the analysis performed on the sample. The proposed layout outperforms the others in almost all factors. The exceptions are that KP has the best percent keystrokes of all layouts, that strong finger usage is better in KP, LI and TA (but not RL) than in the proposed layout, and that little/pinky finger usage is better in all other layouts than in the proposed one. This reflects the fact that the keys in those layouts are placed in locations that require the hands to travel further and produce awkward keystrokes, which in turn add to the distance travelled.





Note: TA = TolAfghan, LI = Liwal, RL = Report Layout, KP = Khpala Pashto, PD = Proposed Layout

5. Conclusion and Future work

As is evident from the results in Table 2, the proposed layout is far superior to the remaining layouts in many respects. The next best layout is that of [1], which is unsurprising since its authors followed a methodology in their layout design. Similar work is in progress for Yousafzai Pashto, for which we are currently collecting data. We also aim to design an efficient phonetic layout, based on the hypothesis that users familiar with the QWERTY layout will be more proficient with such layouts, and to develop a single optimized keyboard layout for both the Afghani and Yousafzai Pashto dialects. Furthermore, to obtain more accurate experimental evaluation results, studies on real subjects will also be performed. The work can also be extended to other regional languages.

6. References
[1] H. Najiullah, H. Sherani, Pashto Keyboard Layout, [Online], Retrieved (06,20,2014)
[2] UNDP, Computer Locale Requirements Afghanistan, [Online], 2003, Retrieved (06,20,2014)
[3] M. Y. Khan, Da Pakhto Tarikh, University Book Agency, Peshawar, 1964.
[4] A. H. Habibi, Pata Khazana, Oxford Books Publishers, Afghanistan, 2001.
[5] A.W. Abbas, N. Ahmad and H. Ali, "Pashto Spoken Digits database for the automatic speech recognition research", 18th International Conference on Automation and Computing (ICAC), 2012, IEEE, Loughborough, 7-8 Sept. 2012, pp. 1-5.
[6] I. Ahmed, H. Ali, N. Ahmed and G. Ahmed, "The development of isolated words corpus of Pashto for the automatic speech recognition research", International Conference on Robotics and Artificial Intelligence (ICRAI), 2012, IEEE, Rawalpindi, 22-23 Oct. 2012, pp. 139-143.
[7] L.J. West, The Standard and Dvorak Keyboards Revisited: Direct Measures of Speed, Santa Fe Institute,
[8] D. Mostefa, K. Choukri, S. Brunessaux and K. Boudahmane, "New language resources for the Pashto language", in Proc. 8th Workshop on Asian Language Resources, COLING 2010, Beijing (China).
[9] T. Nakić-Alfirević and M. Đurek, "The Dvorak keyboard layout and possibilities of its regional adaptation", submitted to 26th International Conference on Information Technology Interfaces, Cavtat, 7-10 June 2004.
[10] J. Diamond, "The curse of QWERTY." Discover Magazine, 04.18.1997.
[11] P.S. Deshwal and D. Kalyanmoy, "Design of an optimal hindi keyboard for convenient and efficient use." KanGAL Report 2003004 (2003), [Online], Retrieved (7,15,2014)
[12] Md.A. Sattar, Al-M.K. Pathan, and M.A. Ali, "Development of an Optimal Bangla Keyboard Layout Based on Character and Fingering Frequency", submitted to National Conference on Computer Processing of Bangla (NCCPB), Independent University, Bangladesh, 2004.
[13] A.K.M. Masum, M.M. Hassan and S.M. Kamruzzaman, "The Most Advantageous Bangla Keyboard Layout Using Data Mining Technique", Journal of Computer Science, IBAIS University, Dhaka, Bangladesh, Vol. 1, No. 2, Dec. 2007.
[14] John V. Parkinson, User-Friendly and efficient
[15] P. Buzing, "Comparing different keyboard layouts: aspects of qwerty, dvorak and alphabetical keyboards.", Delft
[16] G. B. Harbaugh, Computer keyboard Layout, United States Patent 5584588. Available: www.lens.org/lens/patent/US_5584588_A
[17] International Seminar on Pashto alphabets and orthography, Pashto Academy, University of Peshawar, Bara Gali, NWFP, 1987, 1988.
[18] D. A. Norman and D. E. Rumelhart, Studies of Typing from the LNR Research Group, Cognitive aspects of skilled typewriting. Springer New York, 1983, pp. 45-65.
[19] Navy department Division of Shore Establishments and Department of Services U.S, Training Section Civilian Personnel. A practical experiment in simplified keyboard retraining, July and October 1944.
[20] A. Dvorak, N.L. Merrick, W.L. Dealey and G.C. Ford, Typewriting Behavior, American Book Co., New York, USA, 1936.
[21] S.J. Liebowitz and S.E. Margolis, "The fable of the keys", Journal of Law and Economics, April 1990, pp. 1-25.
[22] E.P. Strong, A comparative experiment in simplified keyboard retraining and standard keyboard supplementary training, Technical report, US General Services Administration, Washington DC, USA, 1956.
[23] R. Ali, M.A. Khan and I. Rabbi, "Strong Personal Anaphora Resolution in Pashto Discourse", International Conference on Emerging Technologies, 2007. ICET 2007, IEEE, Islamabad, 12-13 Nov. 2007, pp. 148-153.
[24] Martin Krzywinski, Carpalx-design keyboard, [Online], Retrieved (07,19,2014) Available: http://mkweb.bcgsc.ca/carpalx/
[25] World Population Review, World Population Review, [Online], Retrieved (07,19,2014) Available: http://www.worldpopulationreview.com/
[26] Microsoft, Microsoft keyboard Layout Creator 1.4, [Online], Retrieved (07,19,2014)
[27] C. Prajapati et al, Nepali Unicode Keyboard Layout Standardization based on Genetic Algorithm, [Online], Retrieved (07,21,2014)


Multitier Annotation of Urdu Speech Corpus

Benazir Mumtaz, Amen Hussain, Sarmad Hussain, Afia Mahmood, Rashida Bhatti, Mahwish Farooq, Sahar Rauf
Centre for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology,

Abstract

This paper describes the multi-level annotation process of an Urdu speech corpus and its quality assessment using PRAAT. The speech corpus has been annotated at phoneme, word, syllable and break index levels. Phoneme, word and break index level annotation has been done manually by trained linguists, whereas syllable-tier annotation has been done automatically using a template matching algorithm. The mean accuracy achieved at phoneme and break index label and boundary identification is 79.07% and 89.67% respectively. The quality assessment of the word and syllable tiers is still under investigation.

Introduction

An annotated or tagged speech corpus is an electronic corpus which contains information about the language at phoneme, syllable, stress, word, phrase/break index and intonation levels. An annotated speech corpus is very significant from a computational linguistics perspective, as it gives researchers an opportunity to observe, optimize, evaluate and re-evaluate linguistic hypotheses [15]. Moreover, it plays a significant role in the development of a text to speech (TTS) synthesizer.
A TTS system needs linguistic input to produce language in a way similar to humans. A human child acquires this linguistic information from the environment, stores it in memory and gradually starts using it. A TTS system similarly takes its linguistic input from an annotated speech corpus. Thus, for the development of a TTS system, it is crucial that the speech corpus is annotated very precisely at multiple levels. This paper describes the development, annotation and quality assessment process for thirty minutes of Urdu speech corpus at phoneme, word, syllable and phrase levels.
The paper is organized as follows. Previous research on annotated speech corpus development is presented in Section 2. The methodology of Urdu speech corpus annotation at phoneme, word, syllable and break-index levels is detailed in Section 3. Quality assessment for each level of annotation is presented in Section 4. The current status of the speech corpus annotation is given in Section 5, while future work and conclusions are discussed in Section 6.

Literature Review

A speech corpus annotated at phoneme, syllable, word and phrase levels is a prerequisite for the development of a robust TTS system [19].
Phoneme level segmentation is a two step process: in the first step the individual phonemes are identified, and in the second step their corresponding boundary marks are adjusted. Several methods for automatic phoneme level annotation have been proposed that try to mimic this two step process. Toledano et al. [4] used HMM-based models for phoneme identification and proposed fuzzy logic based post-correction rules for accurate boundary marking. Kuo and Wang [11] proposed a minimum boundary error framework that attempts to minimize the boundary error using manually annotated data. Wang et al. [5] proposed an HMM and SVM based method for automatic phoneme level annotation.



Besides automatic annotation of speech corpora, manual annotation has also been used. Sunitha et al. [12] used a manual annotation process for the development of a Telugu TTS system. Similarly, Chu et al. conducted both manual and automatic phoneme level annotation of a speech corpus. Their results show that manual speech annotation produces good results in the development of a text to speech synthesis system [13]. Although the automatic annotation process is less time consuming, it fails to produce accurate phoneme level annotation.
For word-tier annotation, both manual and automatic annotation processes have been used. Matoušek and Romportl [9] proposed a two phase manual annotation process. In the first phase a skilled annotator annotates the speech at word level, and in the second phase the initial annotation is revised and corrected by another skilled annotator. Arvaniti [1] also manually annotated a Greek speech corpus at word level, using a romanized form of Greek to annotate the word tier. In contrast to Arvaniti [1], Goldman [10] introduced the "EasyAlign" tool, which automatically aligns continuous speech in three stages: macro segmentation at utterance level, grapheme-to-phoneme conversion and phone segmentation. At the phone segmentation stage, the phone and word segmentation is computed using the Viterbi-based HVite tool within HTK. Utterances are also verified against their phonetic sequences at this stage. The EasyAlign tool uses HMM based models for word identification; therefore, it also cannot identify accurate word boundary points.
Syllabification of speech can also be done in two ways: manually or automatically. Sunitha et al. [12] took the syllable as the basic unit and used manual syllabification to attain accuracy, but manual syllabification is a time consuming process. Therefore, automatic syllabification has been used in different languages. For a Telugu TTS system, Sunitha et al. [12] and Tsubaki [7] generated the syllable tier automatically. Hussain [18] has also proposed an algorithm for automatic syllabification of Urdu words, using both nucleus projection and template matching techniques.
For break index/phrase level marking, the ToBI system has been used. For Japanese, the J_ToBI tool has been used to annotate the break index tier manually as well as automatically [8]. J_ToBI is a prosodic labeling tool. Along with the break indices (BI) 0, 1, 2 and 3, it also assigns (-) to show uncertainty, (p) to show disfluent disjuncture, and (m) to show a mismatch between disjuncture and tone. C-ToBI is used to mark prosodic events in Chinese [20]. The software package SFS (Speech File System, from UCL) is used for C-ToBI transcription, but sometimes boundaries need to be modified manually. The software assigns 2 for the normal break level, 1 for a reduced boundary, 3 for a boundary more prolonged than normal, 0 for an extremely reduced boundary and 4 for an extremely prolonged boundary.
The quality of a TTS system depends substantially on the accuracy of speech segment identification. Therefore, estimating the quality of the annotated speech is essential before the annotated data is provided for training the speech synthesizer. Several methods have been used to ensure the quality of annotated speech. Matoušek and Romportl [9] used a two step process to produce better quality speech annotation. They computed the word error rate and the sentence error rate by comparing the raw text with the first and second annotation passes.
Pollák and Černocký [14] proposed a three step process for the assessment of annotated speech. In the first step the annotated speech goes through a syntax test, which checks the usage of allowed characters and special marks and ensures that all annotated fields are non-empty. In the second step the pronunciation of each annotated word is compared with a standard pronunciation, and the annotated pronunciation is marked erroneous after confirmation by a specialized annotator. The final test involves listening to a random utterance; if the utterance heard matches the transcription, the annotated speech passes this test. The labeled data is accepted only if all of these tests are passed.
The evaluation method described in [16] uses the labels of the annotated units, as well as the timing boundaries of units having identical labels, to estimate the quality of annotation.
While a lot of research has been conducted on developing annotated speech corpora for various languages, only limited work has been conducted on Urdu speech corpus development [6]. Thus the current research aims to build on the previous research efforts and develop a speech corpus annotated
at the defined four levels. The following section
presents the methodology followed for its


To build the speech corpus, thirty minutes of speech were recorded by a single speaker in an anechoic chamber. The speech was recorded in mono at a sampling rate of 8 kHz. PRAAT software has been used for the recording, annotation and quality assessment of the speech corpus.
The recorded speech corpus is segmented at multiple tiers using Case Insensitive Speech Assessment Method Phonetic (CISAMPA) symbols. See appendix 1 for a detailed description of the CISAMPA symbols. The methodology for multitier annotation is discussed in the following sections.


Annotation of Speech Corpus at Phoneme Level

The phoneme tier is annotated manually in this work. At phoneme level, each consonant and vowel is distinctly marked in the TextGrid file after careful analysis of its properties in the spectrum and the time waveform. The following guidelines have been used for phoneme level annotation:
• Silence is marked at the start and at the end.
• Each segment boundary is marked at the zero crossing point where the sound wave amplitude is going from negative to positive.
• While splitting a vowel and a consonant, the boundary of the consonant is marked where the personality of the vowel disappears.
• If a few periods of the waveform create ambiguity in determining the personality of the vowel, the periods having mixed properties (of both the consonant and the vowel) are included in the vowel.
• While splitting a vowel-vowel junction, the periods with mixed properties of both vowels are divided into equal halves.
• In case of consonant clusters within or across words, the periods with mixed properties of both consonants are divided into equal halves and marked as two distinct sounds.
• In case of gemination across or within words, phonemes are divided into equal halves and marked as two distinct sounds; in case of geminated stops and affricates, the closure period is divided into equal halves.
• If a sentence or phrase starts with a voiceless stop or affricate, the closure duration taken for the onset voiceless stop is 100 milliseconds for a stressed syllable and 87 milliseconds for an unstressed syllable.
• If a sentence or phrase ends with a voiceless stop (there should be silence after the word) and the burst of the stop is not visible, the closure duration taken for the coda voiceless stop is 77 milliseconds for a stressed syllable and 73 milliseconds for an unstressed syllable [17].
• A vowel is labeled as a nasal vowel only if it is contrastively nasalized; if a vowel is contextually nasalized, it is labeled as an oral vowel.

Once the corpus is annotated at phoneme level, it undergoes the phoneme level quality assessment explained in Section 4.1. The annotated data is passed to the word level annotation phase if it is accepted by the phoneme level quality assessment process.
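The zero-crossing guideline in the phoneme-level annotation rules (boundaries sit where the amplitude goes from negative to positive) can be illustrated with a short sketch. This is not the authors' PRAAT script; the function names and the toy sample values are hypothetical, and the signal is assumed to be a plain sequence of amplitude samples.

```python
from typing import List, Optional, Sequence


def rising_zero_crossings(samples: Sequence[float]) -> List[int]:
    """Indices i where the signal is negative at i-1 and non-negative at i."""
    return [i for i in range(1, len(samples)) if samples[i - 1] < 0 <= samples[i]]


def snap_to_rising_zero_crossing(samples: Sequence[float],
                                 boundary_index: int) -> Optional[int]:
    """Move a hand-placed boundary to the nearest negative-to-positive crossing.

    Returns None when the signal has no such crossing.
    """
    crossings = rising_zero_crossings(samples)
    if not crossings:
        return None
    return min(crossings, key=lambda i: abs(i - boundary_index))


# Toy waveform (hypothetical values).
wave = [0.5, -0.2, -0.4, 0.1, 0.6, -0.3, -0.1, 0.2]
snapped = snap_to_rising_zero_crossing(wave, 4)
```

A sample index converts to a time by dividing by the sampling rate, which is 8 kHz for the corpus described here.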


Annotation of Speech Corpus at Word Level

Annotation at word level is done in two stages. First, the annotator listens to the wave file and observes its spectrogram very carefully to check that all the words in the file are pronounced properly. In case of mispronunciation/misreading, insertion of an extra phoneme in a word or deletion of a required phoneme from a word, the wave file is rejected and sent back for rerecording. In the next stage, the word boundaries of correctly pronounced words are marked manually. These boundaries are completely aligned with the boundaries of the segments. The annotator does not write the word labels between the word boundaries; symbols are automatically extracted from the phone tier to fill the word intervals.
Since word boundaries in Urdu cannot always be identified on the basis of spaces, it is difficult to determine where a word boundary mark should be placed, especially in the case of compound words. For example, it is challenging to decide whether the word (kl\good looking) should be marked as one word or two. Therefore, the following principles have been used to mark the boundaries between compound words:
• A compound word consisting of two words that are both meaningful is marked as two different words, as in the case of (bt t i:\candle).
• A compound word consisting of a meaningless prefix and a meaningful word is marked as one word, as in the case of the compound word (bhmni:\as a meaning of).
• A compound word consisting of a meaningful word and a meaningless suffix is marked as one word.
• A compound word consisting of two meaningless words is marked as one word.
• A compound word consisting of a meaningful prefix as well as a meaningful second word is marked as one word, as in the case of (xu:bsu:rt\beautiful).
• A compound word consisting of a meaningful word and a meaningful suffix is marked as one word, as in the case of the compound word (rg sa:z\dyer).
• A compound word consisting of two meaningful words combined with the conjunction vao is marked as three different words, as in the case of the compound word (:r o:
• A compound word combined with < > is marked as one word, as in the case of the compound word (drja:e:ra:vi:\Ravi River).
• A compound word combined with < > zair is marked as two different words. The zair phoneme should be part of the first word when marking the word boundary, as in the case of (mxlu:qe: xda:\creature of God).

Once the corpus is annotated at word level, it undergoes the word level quality assessment explained in Section 4.2. If the work package is within the acceptance quality threshold, it is then passed to the next level of the annotation process.

Annotation of Speech Corpus at Syllable Level

The syllable tier is automatically generated for the Urdu speech corpus using the syllabification algorithm presented by Hussain [18]. The algorithm is as follows:
i. Convert the input phoneme string to a consonant and vowel string.
ii. Start from the end of the word (i.e., right to left).
iii. Traverse backwards to find the next vowel.
iv. If there is a consonant before the vowel, then mark a syllable boundary before the consonant.
v. Else mark the syllable boundary before this vowel.
vi. Repeat from step (iii) until the phonemic string is consumed completely.

Annotation of Speech Corpus at Break Index/Phrase Level

Annotation at break index level is done manually. Four ToBI levels have been used to annotate the Urdu speech corpus at the break index tier: 4, 3, 1 and 0. Level 2 has not been used, to avoid confusion between Level 1 and Level 3.
Break indices are assigned from left to right. Level 4 is assigned at a full intonational phrase boundary. It is assigned at a pause of around 100 ms or more.
Level 3 is assigned at an intermediate intonational phrase boundary. Three important cues are used in determining level 3: weak disjuncture, lengthening of the vowel of the last syllable, and glottalisation. Level 3 has a weak disjuncture that is usually visible in the pitch track. This weak disjuncture should be shorter than 100 ms. The closure period of voiceless stops and affricates should be carefully separated from the weak disjuncture while assigning level 3. To detect lengthening, the vowel of the last syllable is compared with the same or a similar shortest vowel in the file; the lengthened vowel should be 50% longer than the shortest. Glottalisation is also a cue for assigning level 3 between two intermediate intonational phrases.
Level 1 is assigned at a typical word boundary where there is no vowel lengthening, glottalisation or pause. Level 0 is assigned when the boundary between two words is completely removed, as in the case of clitics.

A sample annotated speech wave file showing all the layers is given below:

Figure 1: Annotated speech file
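The right-to-left syllabification steps listed for the syllable tier can be sketched as follows. This is a minimal illustration of those steps only, not Hussain's full implementation [18]. The phoneme symbols and the vowel set are hypothetical placeholders rather than CISAMPA, and attaching any leftover leading consonants to the first syllable is an assumption the steps do not state.

```python
from typing import List, Sequence, Set


def syllabify(phonemes: Sequence[str], vowels: Set[str]) -> List[List[str]]:
    """Split a phoneme sequence into syllables, scanning from right to left."""
    # Step i: reduce the phoneme string to a consonant/vowel skeleton.
    cv = ["V" if p in vowels else "C" for p in phonemes]
    starts: List[int] = []
    i = len(cv) - 1  # step ii: start from the end of the word
    while i >= 0:
        # Step iii: traverse backwards to the next vowel.
        while i >= 0 and cv[i] != "V":
            i -= 1
        if i < 0:
            break
        # Steps iv and v: boundary before the preceding consonant if there
        # is one, otherwise before the vowel itself.
        start = i - 1 if i > 0 and cv[i - 1] == "C" else i
        starts.append(start)
        i = start - 1  # step vi: repeat on the remaining prefix
    if not starts:
        return [list(phonemes)] if phonemes else []
    starts.reverse()
    starts[0] = 0  # assumption: leftover leading consonants join syllable one
    ends = starts[1:] + [len(phonemes)]
    return [list(phonemes[s:e]) for s, e in zip(starts, ends)]


# Hypothetical five-phoneme word with the skeleton C V C V C.
syllables = syllabify(["k", "i", "t", "a", "b"], vowels={"a", "i"})
```

Consonants that follow the last vowel stay in the final syllable, because the scan only places boundaries to the left of each vowel it finds.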


Speech Annotation Quality Assessment

In this section the quality assessment procedure for the annotated layers is discussed. All manually labeled files go through different tests at each layer of annotation before they are accepted. Scripts written in PRAAT [2] perform the quality estimation tests and produce analysis files. These are explained in the respective annotation layer quality assessment sections below.
The general strategy for quality assessment is that a certain percentage of speech files are manually annotated by an experienced annotator; these are known as the reference files. They are then compared automatically with the corresponding speech files annotated by the speech corpus annotation team, called the source files. The mismatches are manually verified by the quality assurance personnel to identify possible errors in the source files. If the error rate is more than 5%, the source file is rejected and the work package is re-annotated.

Phoneme Level Assessment

The phoneme level annotation is graded using a two step process. In the first step the phoneme labels are checked to confirm that they come from the defined phone set (given in appendix 1), and in the second step it is checked that the starting boundary of each segment is marked at a zero crossing point, with amplitude going from negative to positive. The source file is rejected if even a single marked label is not listed in the phone set.
In the second phase the correctness of the phoneme label text and boundary is assured. The source files are compared with their respective reference files on the basis of phoneme label and phoneme boundary.
For the phoneme label text comparison, a maximum string alignment algorithm is used [14]. The alignment algorithm aligns the source and reference phoneme based strings. The output of this alignment algorithm is shown in Figure 2.

Figure 2: Label comparison through maximum string alignment algorithm

In Figure 2 the upper phoneme based string is extracted from a source file and the second one is fetched from the corresponding reference annotated file. After the string alignment, all the reported errors are reviewed manually. This manual review takes into account the error margin involved in the generation of the reference files. After this review, only those files pass this test whose phoneme labels are 100% accurate.
For checking the phoneme boundary, the phoneme boundary marked in the source file is compared with the corresponding boundary mark in the reference file. A time period (T1) around every boundary point (B1) is calculated, and if within the interval (B1 ± (T1 × 1.2)) there is no boundary point in the counterpart annotated file, a boundary misalignment is reported. This process is followed for the reference and the source annotated files separately. If the accumulated mismatch with respect to source and reference files is more than 5%, then before rejecting the file all the reported mismatches are checked manually to confirm the rejection of the source file.

Word Level Assessment

An annotated word goes through four types of tests before it is accepted, as explained below.
In the first test it is ensured that a word label does not contain any non-speech phoneme label (SIL, PAU), as given in the phone set defined in appendix 1. In the second test it is checked that the number of words in the text is equal to the number of annotated words in the source file. The third test checks that all the labeled words can be syllabified according to the Urdu syllabification rules [18]. The words that cannot be syllabified are reported, and these rejected words are reviewed by an expert linguist to confirm their incorrectness.
In the final test, the pronunciation of each labeled word is compared with the standard Urdu pronunciation available in the pronunciation lexicon, and all erroneous pronunciations are reported after manual confirmation. In the pronunciation comparison two scenarios can occur: a word is not found in the lexicon, or the annotated pronunciation is not found in the lexicon. If a word doesn't exist in the pronunciation lexicon, an Urdu linguist is given the following options:
1. Add the annotated pronunciation to the lexicon
2. Report the annotated pronunciation as an erroneous pronunciation and add the correct pronunciation to the lexicon
If a word exists in the lexicon but the lexicon pronunciation doesn't match the annotated pronunciation, the Urdu linguist is prompted with the following options:
1. Replace the annotated pronunciation with the lexicon's pronunciation
2. Report the annotated pronunciation as an erroneous pronunciation
3. Add the annotated pronunciation as an alternative pronunciation
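The word-level checks (no SIL or PAU inside a word label, matching word counts, and the two lexicon lookup scenarios) can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' PRAAT scripts: the function names, the status strings and the sample lexicon are hypothetical, and counting text words by whitespace is a simplification, since the paper notes that Urdu word boundaries do not always coincide with spaces.

```python
from typing import Dict, List, Sequence

NON_SPEECH = {"SIL", "PAU"}  # non-speech labels named in the text


def contains_non_speech(word_phonemes: Sequence[str]) -> bool:
    """First test: a word label must not contain a non-speech label."""
    return any(p in NON_SPEECH for p in word_phonemes)


def word_count_matches(text: str, annotated_words: Sequence[Sequence[str]]) -> bool:
    """Second test: as many annotated words as words in the text.

    Assumption: text words are separated by whitespace.
    """
    return len(text.split()) == len(annotated_words)


def lexicon_status(word: str, pronunciation: Sequence[str],
                   lexicon: Dict[str, List[List[str]]]) -> str:
    """Final test: classify the lookup result for review by a linguist."""
    if word not in lexicon:
        return "word_not_in_lexicon"
    if list(pronunciation) not in lexicon[word]:
        return "pronunciation_not_in_lexicon"
    return "ok"


# Hypothetical lexicon with one entry and one accepted pronunciation.
demo_lexicon = {"kitab": [["k", "i", "t", "a", "b"]]}
status = lexicon_status("kitab", ["k", "i", "t", "a", "p"], demo_lexicon)
```

The two non-"ok" statuses correspond to the two scenarios in which the linguist is offered the option lists above.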

The percentage of accuracy achieved after applying

the phrase level quality evaluation tests is presented in
Table 2 below.
Table 2: Phrase level annotation quality assessment

The pronunciation lookup test will fail if the

pronunciation is reported as erroneous by the expert
The source file will be rejected if even a single
word fails any of the above mentioned tests.


Phrase Level Assessment

Phrase level annotation assessment is a two step

process. In the first step, the time of break index in the
source file is compared with a reference file. In the
second step the level of break index mark are
compared. Both these comparisons are done by using
the algorithms discussed in section 4.1. In the phoneme
level comparison, string alignment algorithm [3] is
used where the levels (0-4) are used as a basic unit
contrary to the phoneme label. After the analysis the
reported errors are reviewed manually. Files that
contained even a single error after the manual
verification are rejected. The methodology for
assessing the syllable tier is under process. Therefore,
it has not been discussed in this paper.

Table 1: Phoneme level annotation quality assessment

Number of

of Accuracy

Phoneme Label








Number of

of Accuracy

Break Index




Break Index
Time Mark





It is very important for the quality of TTS system

that the annotated speech corpus does not contain any
errors. Therefore, after the quality assessment results,
manual review both at phoneme and break index levels
has been carried out by the trained linguists to correct
all the errors.
Although this paper present sufficient details about
the process of annotating data at phoneme, word and
break index levels, there are still issues that need to be
resolved. In Urdu language, the existence of
diphthongs is still indeterminate. It cannot be precisely
stated that how many diphthongs exist in Urdu
language. Therefore, at phoneme tier, while
segmenting words such as "( "k:\Why), " "
(b:i:\ Brother), "( ":e:\ Came), " ( " ka:\
What) it is difficult to decide whether the vowels be
marked as diphthongs or the boundary should be
marked between them to make them two individual
Besides diphthongs, co-articulation factor has also
created problem in the identification of phonemes. For
example, due to co-articulation affect, voiced
consonants become voiceless, aspirated consonants
become unaspirated, and oral vowels become nasal
vowels when they are preceded and followed by the
nasal consonants.
Similarly, Level 0 is not used in marking break
indices as this level is reserved for clitics. This
phenomenon that Urdu language has clitics is still
under investigation and needs further research.
Building on this research, development of ten hours of
annotated speech corpus is underway. Currently the
phoneme, word, syllable and break indices tiers are

Reference annotated files were generated for the

complete thirty minutes of speech for the quality
assessment of annotated corpus. Results of segment
level assessment have been reported in Table 1 to
present the overall accuracy of annotation at this level.

of Phones

of Break

Current Status of the Urdu Speech

Corpus Annotation




Proceedings of the Conference on Language & Technology 2014

[9] J. Matouek, J. Romportl, "Recording and annotation of

speech corpus for Czech unit selection speech synthesis".
In Text, Speech and Dialogue. 2007. (pp. 326-333). Springer
Berlin Heidelberg.
[10] J. P. Goldman, "EasyAlign: An Automatic Phonetic
Alignment Tool Under Praat." INTERSPEECH. 2011.
[11] J.W. Kuo, H. M. Wang, "A minimum boundary error
framework for automatic phonetic segmentation". ISCSLP.
2006. (pp. 399-409). Springer-Verlag Berlin Heidelberg.
[12] K.V.N. Sunitha and P. Sunitha Devi, Bhaashika: Telugu
TTS System, International Journal of Engineering Science
and Technology Vol.2 (11), 2010 , Hyderabad, India ,
[13] M. Chu, Y. Chen, Y. Zhao, Y. Li, and F. Soong, "A
Study on How Human Annotations Benefit the TTS
Voice."Microsoft Research Asia,.2006, Beijing,China,pg4.
[14] P. Pollák & J. Černocký, "Orthographic and Phonetic
Annotation of Very Large Czech", in proc. Language
Resources and Evaluation Conference, LREC, (pp. 595-598).
Lisbon, 2004.
[15] R. Mitkov, C.Orasan, and R Evans. "The importance of
annotated corpora for NLP: the cases of anaphora resolution
and clause splitting." Proceedings of Corpora and NLP:
Reflecting on Methodology Workshop. 1999.
[16] S. Cox, R. Brady, & P. Jackson, "Techniques for
Accurate Automatic Annotation of Speech Waveforms", in
proc. International Conference on Spoken Language
Processing, (pp. 1947-1950). Sydney, 1998.
[17] S. Hussain, "Phonetic Correlates of Lexical Stress in
Urdu", PhD, Northwestern University, Illinois, 1997.
[18] S. Hussain, "Phonological Processing for Urdu Text to
Speech System", Lahore: Center for Research in Urdu
Language Processing, National University of Computer and
Emerging Sciences, B Block, Faisal Town, Lahore, Pakistan.
[19] S. Kiruthiga, and K. Krishnamoorthy. "Annotating
Speech Corpus for Prosody Modeling in Indian Language
Text to Speech Systems." International Journal of Computer
Science Issues (IJCSI) 9.1 (2012).
[20] Z. Weibin, S. Liqin, and N. Xiaochuan, "Duration
Modeling For Chinese Synthesis from C-ToBI Labeled
Corpus." ICSLP, 2000.

annotated, but in future the intonation tier will also be
addressed. Automatic annotation methods will also be
further investigated in future.


In this paper, the annotation and testing of 30 minutes
of Urdu speech corpus at the phoneme, word, syllable, and
break index levels have been described. The annotation
is done using both manual and automatic methods. On
average, 79.07% accuracy is achieved at the phoneme tier
and 89.67% accuracy is achieved at the break index tier.
After the quality assessment, a manual review is also
conducted to correct all errors at the phoneme and break
index levels. This work is in progress, and the
knowledge generated through this process will be used
to develop a ten-hour annotated speech corpus.


This work has been conducted through the project
"Enabling Information Access for Mobile based Urdu
Dialogue Systems and Screen Readers", supported
through a research grant from the ICTRnD Fund, Pakistan.
We would also like to thank Wajiha Habib, who was
the speaker of the 30-minute speech corpus.


[1] A. Arvaniti, & M. Baltazani, "GREEK ToBI: A System

for the Annotation of Greek Speech Corpora", LREC. 2000,
[2] B. Paul, W. David, Retrieved September 10, 2013,
Available: http://www.fon.hum.uva.nl/praat/
[3] D. Jurafsky, Minimum Edit Distance. 2013. Retrieved
Available: http://www.stanford.edu/class/cs124/lec/med.pdf
[4] D. T. Toledano, M. A. Crespo, J. G Sardina, "Trying to
Mimic Human Segmentation of Speech Using HMM". Third
ESCA/COCOSDA International Workshop on Speech
Synthesis, (pp. 1263-1266). Australia. 1998.
[5] H.M. Wang, J.-W. Kuo, H.-Y. Lo, "Towards A Phoneme
Labeled Mandarin Chinese Speech Corpus", International
Conference on Speech Databases and Assessments (ICSDA).
Istanbul. 2008.
[6] H. Sarfraz, S. Hussain, R. Bokhari, A. A. Raza, I. Ullah,
Z. Sarfraz, S. Pervez, A. Mustafa, I. Javed, R. Parveen.
"Speech Corpus Development for a Speaker Independent
Spontaneous Urdu Speech Recognition System." proceeding
of OCOCOSDA (2010).
[7] H. Tsubaki, M. Kondo, "Analysis of L2 English Speech
Corpus by Automatic Phoneme Alignment", Proceedings
from SLaTE 2011, Jan 8, 2012.
[8] J. J. Venditti, "The J_ToBI model of Japanese
intonation." Prosodic typology: The phonology of intonation
and phrasing, 2005, 172-200. Available at:



Appendix 1:










The grey highlighted sounds are used rarely.











Proceedings of the Conference on Language & Technology 2014

Hidden Markov Model (HMM) based Speech Synthesis for Urdu Language
Omer Nawaz
Centre for Language Engineering, Al-Khawarizmi Institute of Computer Science,
UET, Lahore, Pakistan.

Dr. Tania Habib

Computer Science and Engineering
Department UET, Lahore, Pakistan.

effect, the footprint¹ is very small (approximately 2 MB²) compared to the unit selection approach.
The HMM-based speech synthesis framework has been
applied to a number of languages, including English
[7], Chinese [8], Arabic [9], Punjabi [10], Croatian [11]
and Urdu [12]. In this work, we present the
development and evaluation of a speech synthesizer for
the Urdu language. The main contributions of the paper are
the inclusion of prosodic information in the training process
and the development of a question set that considers the
linguistic features relevant to Urdu.
Figure 1 depicts the outline of parametric speech
synthesis with HMMs. The training part consists of
extracting feature vectors from the training corpus, as
mel-cepstral coefficients [13] and excitation parameters,
followed by model training. The synthesis part is
the reverse of speech recognition. First, the text
is converted to a context-dependent sequence of phones,
obtained as part of natural language processing
(NLP) [14]. Then the excitation and spectral parameters
are obtained from the set of trained HMMs using the
parameter generation algorithm [15]. Finally, the
waveform is generated by passing the obtained spectral and
excitation features to the mel log
spectrum approximation (MLSA) filter [16].

This paper describes the development of an HMM-based
speech synthesizer for the Urdu language using the HTS
toolkit. It describes the modifications needed to port the original
HTS demo scripts, which are currently available for English,
Japanese and Portuguese, to Urdu.
These include the generation of full-context
style labels and the creation of the question file
for the Urdu phone set. The development and
structure of the utilities written for this purpose are discussed. In addition, a list of 200
high-frequency Urdu words is selected using a greedy
search algorithm. Finally, these synthesized words are
evaluated using naturalness and
intelligibility scores.
Keywords: Speech Synthesis, Hidden Markov
Models (HMMs), Urdu Language, Perceptual Testing

1. Introduction
A text-to-speech (TTS) synthesis system for a
particular language is a framework that converts any given
text into its equivalent spoken waveform.
Currently the most frequently employed TTS approach is unit
selection synthesis [1-3]. Although it is the best approach
to date, it has some limitations. For instance, the synthesized
speech resembles the prosody and style of the recordings in
the training database. To synthesize speech
with various voice characteristics, the training data must
be enlarged to cover all those variations,
and recording that much data is not feasible [4].
With improvements in Hidden Markov Model
(HMM) techniques, HMM-based speech
synthesizers are becoming popular [5]. In these systems,
statistical models based on the source-filter
model are trained from the training corpus. The main advantage of
the parametric approach [6] is that the original waveforms
need not be stored for synthesis. As an

[Figure 1. Overview of parametric speech synthesis with HMMs ([7], p. 227). Recoverable block labels: Speech signal; Training HMMs; Training part; Text analysis; Context-dependent HMMs & state duration models; Parameter generation from HMMs.]
¹ Footprint refers to the amount of disk space required by an application.
² This refers to the voice size produced by the HTS English demo scripts.



female speaker and stored in mono WAV format at a sampling rate of 8 kHz.

1.1. Development
The development consists of two steps, training
and synthesis. In the training step the recorded data,
along with segmental and prosodic labels, are used to
train the HMMs. The HMMs are trained on
speech features that include MFCCs, F0 and durations.
In the synthesis stage, the text to be
synthesized is first converted into a sequence of
context-dependent labels. This label structure contains the
segmental and prosodic information that helps in
selecting the appropriate models for synthesis.
Finally, the selected speech parameters are
passed to the synthesis filter to produce the waveform.
In this paper we present the development of an HMM-based
speech synthesizer for the Urdu language and its
evaluation. In Section 2 we describe the requirements for
training the HMMs, which include data collection,
configuring the influential features and generating the
question file to handle data sparsity. Section 3
presents the evaluation process and the results on the test
data. Section 4 contains the analysis and discussion of
the results. Finally, Section 5 gives concluding
remarks and our plans for future work.

2.1.1. Importance of Segmental and Prosodic labels.

Segmental (phoneme) boundaries are required in
continuous speech to identify the different phones
present in the training data. The segmental labels were
marked carefully by a highly trained team of linguists
at CLE using the Praat [19] software and saved in its native
TextGrid format. The other marked layer is the word
layer, which specifies the word boundaries. With
the word layer marked explicitly, we can apply stress
and syllabification rules to it and generate
two additional layers.
The advantage of the extra layers is
that we have more information and can represent a
single phoneme in a number of different contexts, which
is important because the characteristics of a
phoneme are greatly influenced by its context.
To add the extra layers (stress/syllable), a utility
was written in Python [20] to mark the layers in
TextGrid [19] format. The functionality of the program
is as follows:
Input: Take TextGrid file with segmental and word
1. Extract different layers (currently possible
Segment, Word, Stress, Syllable and
2. Apply Stress/Syllabification rules [21].
3. Align stress and syllable identities with the
segment layer.
Output: Generate a new TextGrid file
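Step 3 of the utility (aligning stress and syllable identities with the segment layer) can be sketched as follows. This is only an illustrative sketch, not the authors' utility: the interval representation, the function name align_to_segments and the toy timings are assumptions made here, and the stress/syllabification rules of [21] are not reproduced.

```python
from typing import List, Tuple

# (start_sec, end_sec, label); a simplified stand-in for one TextGrid tier.
Interval = Tuple[float, float, str]


def align_to_segments(segments: List[Interval],
                      layer: List[Interval],
                      default: str = "x") -> List[str]:
    """For each segment, return the label of the layer interval that
    contains the segment midpoint (or `default` if none does)."""
    labels = []
    for seg_start, seg_end, _ in segments:
        midpoint = (seg_start + seg_end) / 2.0
        label = default
        for start, end, name in layer:
            if start <= midpoint < end:
                label = name
                break
        labels.append(label)
    return labels


# Toy data: phones and timings are made up for illustration only.
segments = [(0.00, 0.08, "P"), (0.08, 0.20, "A_A"),
            (0.20, 0.27, "K"), (0.27, 0.40, "I")]
syllables = [(0.00, 0.20, "syl1"), (0.20, 0.40, "syl2")]
stress = [(0.00, 0.20, "1"), (0.20, 0.40, "0")]

syllable_of_segment = align_to_segments(segments, syllables)
stress_of_segment = align_to_segments(segments, stress)
```

Using the segment midpoint rather than exact boundary equality makes the alignment tolerant of small boundary mismatches between tiers.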

2. Requirements for building a Speech Synthesizer with the HMM-based Speech Synthesis Toolkit (HTS)
HTS is a toolkit [17] for building statistical
speech synthesizers. It is created by the HTS working
group as a patch to HTK [18]. The purpose of this
toolkit is to provide a research and development
environment for the progress of speech synthesis using
statistical models.
The requirements for setting up the synthesizer are:
1. Annotated training data.
2. Definition of the speech features (MFCC, F0 and
duration) for model training.
3. A list of the unique context-dependent as well
as context-independent phonemes (from the
training data) for model training.
4. A unified question file for spectrum, F0 and
duration for context clustering.

The block diagram of the utility is shown in Figure 2.

[Figure 2 block labels: TextGrid file; Extract Segment and Word Layer; ... and Stress rules; Make a New TextGrid file]

2.1. Annotated Training data

For the system development, 30 minutes of speech
data were selected. The recorded utterances consisted of
paragraphs taken from the Urdu Qaida of grades 2 and 4.
The recordings were carried out in an
anechoic room, ensuring minimal noise, using a
high-quality microphone. The data was recorded by a native

Figure 2. Block diagram of Stress/Syllable marking




Next, the generated TextGrid file is converted to HTK format for further processing, because HTS is
implemented as a modified version of HTK and the HTK
modules require labels in their native format.


TextGrid File
Take the TextGrid file for one complete utterance.
Usually one complete utterance consists of a single
Extract Layers
The different layers (segment, stress, syllable, word)
are extracted; they will be used to calculate the contextual
factors for each phoneme.
Generate Monolabels
The HTK-style labels are generated at this stage.
Full-context layout
A general layout is defined for the HTS format that
will be used to incorporate contextual factors [23].
(The details of this layout can be seen in the
lab_format.pdf file bundled in the HTS-Demo
2.1.2. Conversion to context-dependent label format.

HTS requires labels both in the basic HTK format and in its
extended, context-dependent version, which captures the prosodic
variation of each phoneme.
In the basic HTK format each phoneme is represented
by a string identity, and time values are expressed in
units of 100 ns, as shown in the example below:
HTK-Label Format
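As a minimal sketch of the basic HTK label layout (start time, end time and phone identity on each line, with times in 100 ns units), the conversion from boundary times in seconds can be written as below. The phone sequence and timings are made up for illustration, and the helper names are hypothetical.

```python
def seconds_to_htk(t: float) -> int:
    """Convert a time in seconds to HTK time units (1 unit = 100 ns)."""
    return int(round(t * 10_000_000))


def htk_label_lines(segments):
    """Format (start_sec, end_sec, phone) tuples as 'start end phone' lines."""
    return [f"{seconds_to_htk(s)} {seconds_to_htk(e)} {p}"
            for s, e, p in segments]


# Illustrative segment boundaries in seconds (not taken from the corpus).
segments = [(0.00, 0.08, "P"), (0.08, 0.20, "A_A"), (0.20, 0.27, "K")]
lines = htk_label_lines(segments)
print("\n".join(lines))
# 0 800000 P
# 800000 2000000 A_A
# 2000000 2700000 K
```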


In the context-dependent format the phoneme identity
is embedded in its segmental and supra-segmental
contexts, as shown below:



Initialize the factors related to stress, syllable, phoneme
and word with default values (0 or x). The layout is
kept general, covering all possibilities, so that little
modification is required when additional layers are
added at a later stage.
Process 1st Segment
Selects the first segment and calculates all the
possible contexts available to it.
Make structure
With the calculated contextual factors in the
previous step a structural representation is created using
the defined layout.
Generate Full-Context
A full context label file is generated that consists of
full-contextual representation of every phoneme in the
entire utterance.
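The per-segment loop described above can be illustrated for the segmental part of the label only. The sketch below assumes the quinphone delimiters (ll^l-c+r=rr) of the English HTS demo label format [23]; the layout actually used for Urdu may differ, and the stress, syllable and word fields are omitted. The function name is hypothetical.

```python
def quinphone_labels(phones, pad="x"):
    """Build the 'll^l-c+r=rr' phoneme-context field for every phone.

    Positions outside the utterance are filled with `pad`, mirroring the
    default value 'x' mentioned in the text.
    """
    padded = [pad, pad] + list(phones) + [pad, pad]
    labels = []
    for i in range(2, len(padded) - 2):
        ll, l, c, r, rr = padded[i - 2:i + 3]
        labels.append(f"{ll}^{l}-{c}+{r}={rr}")
    return labels


labels = quinphone_labels(["P", "A_A", "K"])
print(labels)
# ['x^x-P+A_A=K', 'x^P-A_A+K=x', 'P^A_A-K+x=x']
```

Padding the sequence once up front avoids special-casing the first and last two segments inside the loop.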



2.1.3. Phone Set Used. The CISAMPA phone set [22]
is employed in our system; it uses an ASCII-based
character set to represent the different phonemes. It was
chosen because these characters are easily accessible
during the data tagging process.
For the conversion of TextGrid files to the context-dependent format a utility was developed. The flow of
this utility is illustrated in Figure 3:
[Figure 3. Flow diagram of the HTS-format conversion. Block labels: Take TextGrid file for one complete utterance; Extract Layers: Segment, Stress, Syllable, Word; Generate Monolabels; Define the full-context label layout; Initialize the contextual variables with default values; Process the first segment; Make full-context label structure for each segment; Loop for all segments; Generate full-context labels]

2.2. Feature Specification

In HTS a set of salient speech features is
selected that is important for capturing speech
variation properly; through their proper
manipulation, high-quality speech can be synthesized.
The most common set of features is spectrum, F0 and
duration.

2.2.1. Spectrum. For the spectrum, mel-cepstral
coefficients (MFCCs) of order 35 were calculated along
with the Δ and Δ² features, so the final length is 105. For




the calculation of these features the SPTK toolkit [24] was used.
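The length of 105 follows from stacking 35 static coefficients with their 35 first-order and 35 second-order dynamic features. A minimal sketch of that stacking is given below; it does not use SPTK, and the regression windows (-0.5, 0, 0.5) and (1, -2, 1) as well as the edge handling are assumptions chosen for illustration, since the paper does not state them.

```python
def apply_window(frames, window):
    """Apply a 3-tap regression window along time to each coefficient.

    frames: list of equal-length lists (one list of coefficients per frame).
    Edge frames are handled by repeating the first/last frame.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        cur_f = frames[t]
        next_f = frames[min(t + 1, n - 1)]
        out.append([window[0] * p + window[1] * c + window[2] * q
                    for p, c, q in zip(prev_f, cur_f, next_f)])
    return out


def add_dynamic_features(static):
    """Concatenate static, delta and delta-delta features per frame."""
    delta = apply_window(static, (-0.5, 0.0, 0.5))    # assumed delta window
    delta2 = apply_window(static, (1.0, -2.0, 1.0))   # assumed delta-delta window
    return [s + d + d2 for s, d, d2 in zip(static, delta, delta2)]


# Three dummy frames of 35 static coefficients each.
static = [[float(t)] * 35 for t in range(3)]
observations = add_dynamic_features(static)
print(len(observations[0]))  # 105
```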

clustering process for the spectrum, F0 and duration models
[26]. It provides a basis for grouping a number of data
points and hence handles data sparsity, which is
common in speech synthesis because the number of unique
models is enormous.
In clustering, all the data points are first placed into
a single cluster and a list of questions is made. Then an
objective function is defined. The cluster is split
on each question in turn, and the question for which the
objective function is minimized is selected as the
successful candidate and removed from the list.
The remaining questions are then asked on the resulting
clusters until a stopping criterion is met [27].
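One splitting step of this procedure can be sketched as follows. The sketch is deliberately simplified: it uses the within-cluster sum of squared deviations of a single scalar feature as a stand-in objective, whereas HTS uses a likelihood-based criterion with an MDL stopping rule [27]. The data values, question names and phone sets are made up for illustration.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_question(points, questions):
    """Pick the question whose yes/no split minimizes the objective.

    points:    list of (left_phone, feature_value) pairs
    questions: dict mapping question name -> set of phones answering YES
    """
    best_name, best_cost = None, float("inf")
    for name, yes_set in questions.items():
        yes = [v for p, v in points if p in yes_set]
        no = [v for p, v in points if p not in yes_set]
        if not yes or not no:
            continue  # a question that does not split the cluster is skipped
        cost = sum_sq_dev(yes) + sum_sq_dev(no)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost


# Toy cluster: (left-context phone, scalar acoustic feature).
points = [("P", 1.0), ("B", 1.2), ("M", 1.1),
          ("K", 3.0), ("G", 3.2), ("T", 2.9)]
questions = {
    "L=Bilabial?": {"P", "B", "M"},
    "L=Voiced?": {"B", "M", "G"},
}
name, cost = best_question(points, questions)
print(name)  # L=Bilabial?
```

Repeating this step on each resulting cluster, with the chosen question removed, yields the tree described in the text.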
Another advantage is that these trees can also be
used in the synthesis stage, as they are built on the
acoustically similar properties of speech. In synthesis,
the utterance to be synthesized is usually
unseen, meaning that it did not occur in
exactly the same form in the training data. By using
the trees, which encode the acoustic similarities, we can
trace the context and select the closest alternative.

2.2.2. Fundamental frequency (F0). The range defined
for the voiced regions was 80-400 Hz. The fundamental
frequency was calculated using the autocorrelation
method of the Snack library, available in
ActiveTcl [25].
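The idea behind autocorrelation-based F0 estimation within the 80-400 Hz range can be sketched as below. This is a bare illustrative version, not the Snack implementation: it omits windowing, normalization and voicing decisions, and the function name and test signal are assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=80.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from the peak of its autocorrelation,
    searching only lags that correspond to f0_min..f0_max."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        value = sum(frame[n] * frame[n + lag]
                    for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # No positive peak in the search range: treat the frame as unvoiced.
    return sample_rate / best_lag if best_lag else 0.0


sample_rate = 8000  # sampling rate of the corpus as stated in the paper
# Synthetic 200 Hz sine used as a test signal.
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
print(f0)  # 200.0
```

At 8 kHz the 80-400 Hz range corresponds to lags of 100 down to 20 samples, so the search is restricted to that window.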
2.2.3. Durations. There are two possible ways to
estimate the duration of each phone. One is to calculate
it offline, like the MFCCs and F0; the other is to
estimate it during the training process. The offline method
is adequate if we only have a single-state HMM. In
most cases, however, we have a 3- or 5-state HMM, so
we do not know in advance how the state
alignments will be done. The durations of each state
are therefore estimated during the state-alignment step of the
Expectation Maximization (EM) algorithm, which is
used to train the HMMs.

2.3. Unique List

2.4.3. Generation of the Question file. The questions
are developed based on similarities in the
place and manner of articulation for the segmental context.
For the prosodic context, the number of syllables in a word,
their position, whether they are stressed, and so on are taken
into account.
The idea is to group the phones that have a similar
place and manner of articulation. This set of questions
has a dual role. In the training process the questions are used to
split the cluster nodes of the tree, whereas during
synthesis the generated trees are employed to
trace a phoneme with an unseen context.
Moreover, because these questions are created on a phone set
specific to a language, we cannot reuse the
structuring specified for another language (such as
English, Brazilian Portuguese or Japanese). The questions specific
to the Urdu language were created considering the grouping
in Table 1.

A list of the unique context-dependent as well as
context-independent phones in the training corpus is
generated. It is required to identify the number of possible
models that can exist. Each model is created and trained
with the examples available in the training
corpus. First, a context-independent (mono-phone)
model is generated, which is simply an average of all the
examples. Then these mono-phone models are copied to
context-dependent (full-context) models and re-estimated
using only the relevant examples.

2.4. Question File

Some criterion is required to tackle the
problem of the few training examples available per
model. The training examples are few
because, as the full-context style label format shows,
the number of possible contexts in which
a single phoneme can occur is huge. It is not possible to have
this much context available in the training data; on
the other hand, this much context is also rare in everyday
speech.
To address these issues, a methodology known as
clustering is employed. The notion of clustering is to
group the phonemes that are acoustically similar and
share a single model among closely related contexts. In
clustering, the question file plays an important role, as its
questions define how the grouping should be done.
The question file consists of a number of binary
questions with a YES or NO outcome related to the segmental
and prosodic context of the phoneme. It is helpful in the
Question format
For example, the question for the phoneme before the
previous phoneme is defined as:
Field 1

Field 2

Field 3



{R , R_H}

Field 1 is the label indicating that the line is a question.
Field 2 names the category being grouped, and
Field 3 lists the possible
phonemes for that category.
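A question line with these three fields can be generated from a phone grouping as sketched below. The sketch assumes the QS syntax and wildcard patterns of the English HTS demo question file, where the quinphone context is delimited as ll^l-c+r=rr@; the category name "Trill" for the phones R and R_H and the function name are hypothetical labels introduced here.

```python
def make_question(position, category, phones):
    """Return one 'QS "<name>" {<patterns>}' line for a phone category.

    position is one of LL, L, C, R, RR (the slot of the quinphone context
    the question refers to). The wildcard patterns assume the
    'll^l-c+r=rr@' delimiters of the English HTS demo labels.
    """
    templates = {
        "LL": "{}^*",
        "L": "*^{}-*",
        "C": "*-{}+*",
        "R": "*+{}=*",
        "RR": "*={}@*",
    }
    patterns = ",".join(templates[position].format(p) for p in phones)
    return f'QS "{position}-{category}" {{{patterns}}}'


# "Trill" is a hypothetical category name used only for this example.
line = make_question("LL", "Trill", ["R", "R_H"])
print(line)  # QS "LL-Trill" {R^*,R_H^*}
```

Generating the lines from one grouping table keeps the five positional variants of each category consistent.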



Table 1. Grouping of contextual factors³




All Vowels
All Consonants
P, P_H, B, B_H, T_D,
T_D_H, D_D, D_D_H, T,
T_H, D, D_H, K, K_H, G,
G_H, Q, Y
M, M_H, N, N_H, N_G,
N_G_H, U_U_N, O_O_N,
O_N, A_A_N, I_I_N,
A_E_N, A_Y_N
F, V, S, Z, S_H, Z_Z, X,
G_G, H
B, B_H, D_D, D_D_H, D,
D_H, G, G_H
P, P_H, T_D, T_D_H, T,
T_H, K, K_H
M, M_H, N, N_H, N_G,
P, P_H, T_D, T_D_H, T,
T_H, K, K_H
V, Z, Z_Z, G_G

P_H, B_H, T_D_H,

D_D_H, T_H, D_H, K_H,


T_S_H, D_Z_H



L, L_H
J, J_H
S, Z
T_S, T_S_H, D_Z, D_Z_H
I_I, I, A_E, E, A_Y, I_I_N,
A_E_N, A_Y_N
U_U, I_I, U_U_N, I_I_N
U, I
O, A_A, E, O_N, A_A_N
A_Y, A_Y_N
U_U, U, O_O, O, A_A,
U_U_N, O_O_N, O_N,

2.4.4. Synthesis. In the synthesis stage, the text to
be synthesized is first entered in CISAMPA format and then,
using the utilities developed in the training stage,
converted to full-context style labels.
The label format used in the synthesis part is
similar to that used in training, except for the timing information,
which is absent in this case. Using these labels, three
different sets of models are selected (spectrum,
fundamental frequency and duration). From the
spectrum and duration models the optimal state
sequence is selected. Finally, the optimal state sequence
along with the excitation signal is fed to the synthesis
filter to produce the final waveform, as illustrated in
Figure 4.

F, S, S_H, X




P_H, B_H, T_D_H,

D_D_H, T_H, D_H, K_H,
G_H, M_H, N_H, N_G_H,
R_H, R_R_H, L_H, J_H,
T_S_H, D_Z_H
T, T_H, D, D_H, N, N_H,
R, R_H, S, Z, L, L_H
T_D, T_D_H, D_D,
K, K_H, G, G_H, N_G,
N_G_H, X, G_G


P, P_H, B, B_H, M, M_H



R, R_H

Convert to

Y, H
R_R, R_R_H
N, N_H
M, M_H
N_G, N_G_H
X, G_G
S_H, Z_Z

Get Optimal




F, V


Figure 4. Overview of Synthesis Process

³ The Urdu CISAMPA phone set can be accessed at




2.4.5. Tree traversal for model selection. For example,
if we want to synthesize the word P A K I S T A N,
then for each phoneme a separate tree is used to
trace the appropriate model. In this example we may
have two different models for the A phoneme, because
it occurs twice with different left and right
contexts. For the case of P A K, the tree first checks whether the
left context is bilabial (L=Bilabial?). Depending on the
outcome it proceeds to the next node, and it finally reaches
a leaf node, from which a suitable model is selected, as
shown in Figure 5.
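The traversal for the P A K example can be sketched as follows. The tree shape, the model names and the phone sets are illustrative assumptions (the sets are subsets chosen for the example), not the trees actually trained by the system.

```python
# Illustrative phone classes (subsets, for this example only).
BILABIAL = {"P", "P_H", "B", "B_H", "M", "M_H"}
STOP = {"P", "P_H", "B", "B_H", "T", "T_H", "D", "D_H",
        "K", "K_H", "G", "G_H"}

# Each question maps (left phone, right phone) to a YES/NO answer.
QUESTIONS = {
    "L=Bilabial?": lambda left, right: left in BILABIAL,
    "R=Stop?": lambda left, right: right in STOP,
}

# A node is either a leaf (model name) or (question, yes_branch, no_branch).
TREE_FOR_A = ("L=Bilabial?",
              ("R=Stop?", "model_A_1", "model_A_2"),  # YES branch
              "model_A_3")                            # NO branch


def select_model(node, left, right):
    """Walk from the root to a leaf by answering each node's question."""
    while not isinstance(node, str):
        question, yes_branch, no_branch = node
        node = yes_branch if QUESTIONS[question](left, right) else no_branch
    return node


first_a = select_model(TREE_FOR_A, left="P", right="K")   # the A in P A K
second_a = select_model(TREE_FOR_A, left="T", right="N")  # the A in T A N
print(first_a, second_a)  # model_A_1 model_A_3
```

The two occurrences of A end at different leaves because their left contexts answer the root question differently.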

3.1. Methodology
To evaluate the system, a
list of 200 high-frequency words of the Urdu language was
selected using the greedy search algorithm [31]
developed at the Center for Language Engineering, KICS.

3.2. Experiment
To assess the quality of the speech synthesized
by the statistical models (HMMs), the Mean Opinion Score (MOS) was used as the measure of naturalness and
intelligibility. A total of 4 listeners
carried out the evaluation. Among the participants,
three were linguists (expert listeners) and one was
technical (naive listener).
For our system, naturalness and intelligibility are
interpreted as:
Naturalness: How close does the word seem to being produced by
a human?

[Figure 5 residue: decision-tree nodes L = Bilabial? and R = Stop?; example context |P|A|K| with questions L = Bilabial, R = Stop, L = Voiced, L = Vowel, R = Voiced]

Intelligibility: How easily was the word
recognized?


The MOS scale varies from 1 to 5, where 1 represents
the lowest score and 5 the highest. The experimental
results of the four listeners are listed in Table 2:


Table 2. Mean score for Intelligibility and Naturalness

Figure 5. Tree traversal for model selection

3. Evaluation and results

The main goal of any text-to-speech system is to
generate a voice that closely resembles a human
voice. So, for the assessment of a speech synthesizer,
human listeners should carry out the testing.
To test a system comprehensively there are a number
of tests, including the Diagnostic Rhyme Test (DRT)
[28] and the Modified Diagnostic Rhyme Test (M-DRT)
[29], which evaluate the system at the phoneme level.
For our system we focused only on the
naturalness and intelligibility of high-frequency words.
The Consonant-Vowel-Consonant (CVC) test [30] and the DRT
were not appropriate because the number of
correct words fitting exactly the CVC format was
small. Moreover, the phoneme
coverage of the 30 minutes of speech data was not balanced.
The phoneme coverage graph is shown in Figure 6.
Consequently, a set of 200 high-frequency words was
selected for testing.

Subject Type



Technical 1
Linguistic 1
Linguistic 2
Linguistic 3



The testing reveals that most of the words were
intelligible but not natural. The unnatural
voice can be attributed to the kind of training data used.
In training, words were available only within carrier sentences,
and none of the training utterances consisted of a single
word. A word spoken in isolation,
without any carrier sentence, is a little longer
and clearer, whereas within a carrier sentence some of the
phonemes are shortened or dropped completely.

4. Analysis and Discussion

The analysis shows that on average 92.5% of the words
were correctly identified, irrespective of whether they sounded less
natural or intelligible. On the other hand, there were also
a few cases where the listener was unable to identify the
correct word. These are listed in Table 3.






Figure 6. Phoneme Coverage Graph⁴

4.1. Phoneme coverage of training data

There were a total of 66 different phonemes
in the phone set defined for training the HMMs,
with a total frequency of 17793 (30-minute data). For a
completely phonetically balanced system we should therefore
have at least 270 (1.51% coverage) examples per
phoneme. In our case, however, vowels had a very high
frequency (A = 1810, A_A = 1646), and some of the
consonants did not occur at all (J_H, L_H, M_H,
N_G_H, R_H, Y, Z_Z). The phoneme coverage can be
visualized in Figure 6.
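The balanced-coverage figure follows directly from the two totals given above, as the short calculation below shows. Only the counts quoted in the text are used; the variable names are chosen here for illustration.

```python
import math

total_phones = 17793   # total phone tokens in the 30-minute corpus
phone_set_size = 66    # number of phonemes in the phone set

# Perfectly balanced coverage: every phoneme occurs equally often.
balanced_count = total_phones / phone_set_size   # about 269.6 tokens
balanced_share = 100.0 / phone_set_size          # about 1.5 percent

# Observed shares of the two most frequent vowels quoted in the text.
share_a = 100.0 * 1810 / total_phones
share_aa = 100.0 * 1646 / total_phones

print(math.ceil(balanced_count))  # 270
```

Each of the two vowels alone accounts for roughly a tenth of all tokens, several times the balanced share.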
The words that were not correctly identified are
listed in Table 3. The first column contains the word in
Nastalique style. The second gives the actual
pronunciation that should have been synthesized, in
CISAMPA format. The third column gives the
word as interpreted by the listener. Bold letters
highlight the phones on which there was disagreement, while
gray letters indicate phones that were missing in the
synthesized utterance. Finally, the last column gives
the coverage, in the training corpus, of the correct phoneme
that was wrongly produced.































5. Conclusion and Future Work

A reasonably good quality⁵ HMM-based speech
synthesizer for the Urdu language has been developed. The
utilities developed are unique in that they convert hand-labeled TextGrid files directly to the HTS label format,
without using any automatic data tagging
software (such as Sphinx [32]). The question file was
generated for the Urdu phone set, taking into account the
articulatory features of the language. Finally, the
synthesized quality was tested using the
Mean Opinion Score (MOS) for naturalness and
intelligibility.
In future work we plan to build the system
incrementally with a new database, which comprises
approximately 10 hours of speech and is being recorded
by a professional speaker.

Table 3. Words with errors



Listened Coverage

This work has been conducted through the project
"Enabling Information Access for Mobile based Urdu
Dialogue Systems and Screen Readers", supported
through a research grant from the ICTRnD Fund, Pakistan.

[1] A. W. Black and P. Taylor, "CHATR: a generic speech
synthesis system," in proc. of the 15th conference on
Computational linguistics, Stroudsburg, PA, USA, 1994.

⁴ Only those phonemes are shown whose occurrence counts are more than 50.

⁵ Some synthesized utterances can be accessed at:




system (HTS) version 2.0," in proc. of Sixth ISCA Workshop

on Speech Synthesis, Bonn, Germany, August, 2007.

[2] R. E. Donovan and P. C. Woodland, "Automatic speech

synthesiser parameter estimation using HMMs," in proc. of
ICASSP-95, Detroit, Michigan, May, 1995.

[18] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw,

X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey and others,
"The Hidden Markov Model Toolkit (HTK) version 3.4,"
Cambridge University Engineering Department, December,

[3] A. J. Hunt and A. W. Black, "Unit selection in a

concatenative speech synthesis system using a large speech
database," in proc. of ICASSP-96, IEEE International
Conference, Atlanta, Georgia, May, 1996.
[4] A. W. Black, "Unit selection and emotional speech," in
proc. of INTERSPEECH, Geneva, September, 2003.

[19] P. Boersma and D. Weenink, "Downloading Praat for

Windows," 10 September 2013. [Online]. Available:

[5] Z. Heiga, T. Tomoki, M. Nakamura and K. Tokuda,

"Details of the Nitech HMM-based speech synthesis system
for the Blizzard Challenge 2005," IEICE transactions on
information and systems, vol. 90, no. 1, pp. 325--333, 2007.

[20] G. van Rossum and others, "Python language website,"

World Wide Web: http://www.python.org, 2007.

[6] H. Zen, K. Tokuda and A. W. Black, "Statistical parametric

speech synthesis," Speech Communication, vol. 51, no. 11, pp.
1039-1064, 2009.

[21] S. Hussain, "Phonological Processing for Urdu Text to

Speech System," in Contemporary Issues in Nepalese
Linguistics (eds. Yadava, Bhattarai, Lohani, Prasain and
Parajuli), Linguistics Society of Nepal, Nepal, 2005.

[7] K. Tokuda, H. Zen and A. W. Black, "An HMM-based

speech synthesis system applied to English," in Speech
Synthesis, 2002. Proceedings of 2002 IEEE Workshop, Santa
Monica, California, IEEE, September, 2002, pp. 227-230.

[22] A. Raza, S. Hussain, H. Sarfraz, I. Ullah and Z. Sarfraz,

"An ASR System for Spontaneous Urdu Speech," in proc. of
Oriental COCOSDA, Kathmandu, Nepal, November, 2010.
[23] H. Zen, "An example of context-dependent label format
for HMM-based speech synthesis in English," The HTS
CMUARCTIC demo, July, 2011.

[8] Y. Qian, F. Soong, Y. Chen and M. Chu, "An HMM-based

Mandarin Chinese text-to-speech system," in Chinese Spoken
Language Processing, vol. 4274, Springer Berlin Heidelberg,
Singapore, December, 2006, pp. 223-232.

[24] S. Imai, T. Kobayashi, K. Tokuda, T. Masuko, K.

Koishida, S. Sako and H. Zen, "Speech signal processing
toolkit (SPTK)," 2009.

[9] O. Abdel-Hamid, S. M. Abdou and M. Rashwan,

"Improving Arabic HMM based speech synthesis quality," in
proc. of INTERSPEECH, Pittsburgh, Pennsylvania, USA,
September 17-21, 2006.

[25] ActiveState and ActiveTcl-User-Guide, "Incr Tk",

ActiveTcl, Nov, 2002.

[10] D. Bansal, A. Goel and K. Jindal, "Punjabi speech
synthesis using HTK," International Journal of Information
Sciences & Techniques, vol. 2, no. 4, July, 2012, pp. 57-69.

[26] S. J. Young, J. J. Odell and P. C. Woodland, "Tree-based
state tying for high accuracy acoustic modelling," in proc. of
the workshop on Human Language Technology, Plainsboro,
New Jersey, USA, March, 1994.

[11] I. Ipsic and S. Martincic-Ipsic, "Croatian HMM-based

speech synthesis," CIT. Journal of computing and information
technology, vol. 14, no. 4, December, 2006, pp. 307-313.

[27] K. Shinoda and T. Watanabe, "Acoustic Modeling Based

on the MDL Principle for speech recognition," in proc. of
EuroSpeech-97, Rhodes, Greece, September, 1997.

[12] Z. Ahmed and J. P. Cabral, "HMM BASED SPEECH

International Workshop On Spoken Language Technologies
For Under-resourced Languages, St. Petersburg, Russia, 2014.

[28] W. D. Voiers, "Diagnostic evaluation of speech

intelligibility," Benchmark papers in acoustics, vol. 11,
Stroudsburg, Pennsylvania, 1977, pp. 374-387.

[13] T. Fukada, K. Tokuda, T. Kobayashi and S. Imai, "An

adaptive algorithm for mel-cepstral analysis of speech," in
proc. of ICASSP-92., IEEE International Conference, San
Francisco, California, March, 1992.

[29] A. S. House, C. E. Williams, M. H. Hecker and K. D.

"Articulation-Testing Methods:
Differentiation with a Closed-Response Set," The Journal of
the Acoustical Society of America, vol. 37, no. 1, January
1965, pp. 158-166.

[14] H. Kabir, S. R. Shahid, A. M. Saleem and S. Hussain,

"Natural Language Processing for Urdu TTS System," in
Multi Topic Conference, 2002. Abstracts. INMIC 2002.
International, IEEE, 2002, pp. 58-58.

[30] U. Jekosch, "The cluster-based rhyme test: A segmental

synthesis test for open vocabulary," in Speech Input/Output
Assessment and Speech Databases, Noordwijkerhout, 1989.

[15] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi and

T. Kitamura, "Speech parameter generation algorithms for
HMM-based speech synthesis," in proc. of ICASSP'00, IEEE
International Conference, Istanbul, Turkey, June, 2000.

[31] B. Bozkurt, O. Ozturk and T. Dutoit, "Text design for

TTS speech corpus building using a modified greedy
selection," in INTERSPEECH, 2003.

[16] S. Imai, "Cepstral analysis synthesis on the mel frequency

scale," in proc. of ICASSP'83, IEEE International Conference,
Boston, Massachusetts, USA, April, 1983.

[32] "CMU Sphinx - Speech Recognition Toolkit," Carnegie

[Accessed 3 March 2014].

[17] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A.

Black and K. Tokuda, "The HMM-based speech synthesis


Proceedings of the Conference on Language & Technology 2014

Alphabet Signs Recognition using Pixels-based Analysis

Mohammad Raees1, Sehat Ullah2
Department of Computer Science and IT, University of Malakand, Pakistan.

Sign language is the language of gestures and postures used for
non-verbal communication. This paper presents a novel vision-based
approach for the detection of isolated signs of Pakistan Sign Language
(PSL). The signs representing the alphabet of Urdu (the national
language of Pakistan) are recognized by distinguishing fingers. The
algorithm, following a model of seven phases, identifies each of the
five fingers from their respective positions. After finger recognition,
signs are deduced from the fingers' states of being raised or down. For
quick recognition, signs are categorized into three groups based on the
thumb position. Five testers evaluated the system using a simple
low-cost USB camera in a semi-controlled environment. The results
obtained are encouraging, as the accuracy of the system exceeds 85.4%.

1. Introduction

Sign Language (SL) is used to communicate through gestures instead of
sounds. Apart from sign language, the language of mute people, hand
gestures are used in domains such as traffic control, robot
instruction, hands-free computing and remote video gaming. There is no
international sign language; every region of the world has its own.
Like American Sign Language (ASL) and British Sign Language (BSL),
Pakistan has its own sign language, called Pakistan Sign Language
(PSL), in which signs are used to represent Urdu letters and terms.
Mute people always need an interpreter to communicate in society.
Leaving aside the availability of interpreters, it is impossible for an
interpreter to accompany a mute individual in every matter and at every
time for the whole of his life. To solve this problem, this paper is
the first step towards our long-term goal of designing a Digitized Urdu
Sign Language Interpreter (DUSLI).
Unlike other sign language systems, most of which rely on prior
training of signs [1], the proposed algorithm dynamically finds out
which finger is the small one, which is the middle one and so on.
Keeping in view the position of each known finger, the system
recognizes all isolated signs in which the thumb is visible. The two
earlier efforts at recognizing the Urdu alphabet of PSL are those of
Aleem et al. [2] and Kausar et al. [3]. The former system is based on
the 5DT DataGlove and compares a scanned gesture image with stored
database images. The latter requires a glove of eleven different
colors.
One distinguishing feature of the proposed approach is that neither
colors nor gloves are used for finger identification; instead, a
finger-cap covers the upper part of the thumb. To further reduce
computation, instead of scanning the whole scene, the system captures
only the specified Region of Interest (ROI). Along with displaying the
appropriate Urdu letter, the system pronounces the letter on behalf of
the signer.

2. Literature review

Sign language recognition is becoming an active area of HCI, enabling
mute persons to hold their daily conversations. The crux of this HCI
domain is to make the computer able to recognize signs and symbols.
Several methods have been suggested to detect signs easily and
accurately. Common methodologies used for sign recognition are Template
Matching [4], Conditional Random Fields (CRF) [5] and Dynamic Time
Warping (DTW) [6].
The procedure of Transition-Movement Models (TMM) [7] is based on the
recognition of hand movements to deduce continuous signs. Although the
system recognizes a large vocabulary of 513 signs, it relies on costly
sensor-based gloves with magnetic trackers, so the technique is rarely
affordable.
HMM-based hand gesture recognition is proposed in [8], with 82.2%
accuracy in recognizing 20 signs of the Arabic alphabet. The system
needs pre-training of hand gestures, with which an input sign image is
compared using eight feature points. Most sign recognition systems are
glove-based because, with gloves, finger positions and orientations are
easily detectable. The system of Al-Jarrah [9] uses Cyber Gloves, while
that of Kadous et al. [10] needs Power



Gloves. The approach based on Data Gloves for FSL (French Sign
Language) [11] has 83% accuracy. Although all signs of the Arabic
alphabet can be recognized by the algorithm of Khaled et al. [12],
accuracy remains below 57% due to light sensitivity. The algorithm of
Simon et al. [13] makes use of Kinect which, using an infrared camera,
is independent of lighting conditions. As some signs need facial
expression and body motion, a Kinect-based system can hardly cope with
such dynamic signs. Skin color segmentation is followed in [14, 15],
where the hand section is extracted from the image. Boundary tracking
and fingertip methodology is pursued for sign detection in [16]. Though
the algorithm is robust, with only 5% errors in the detection of those
ASL signs in which fingers are open, the approach is applicable neither
to signs in which fingers are closed nor to motion-based signs. The
approach of thinning the segmented image is presented in [17].
Morphological operations have been used in [18] for hand gesture
recognition. The work of Nagarajan et al. [19] identifies hand postures
used for counting signs with the help of convexity defects. The
convexity defects of fingers vary widely with a slight change in the
orientation of the hand or of the fingers. Furthermore, a skin-colored
object in the background could be falsely treated as a finger.
Non-manual feature extraction is a sub-domain of sign language
recognition used mostly for continuous signs. Head pose and motion for
sign recognition is suggested in [20], where signs based on head shakes
are well recognized by trackers. Facial expression and gaze direction
are traced in the approaches discussed in [21] and [22] respectively
for continuous signs. The module of T. Shanableh extracts features by
K-NN and Polynomial Networks [23]. The algorithm of Tauseef et al. [24]
employs a neural network for alphabet recognition, where the learning
schemes require significant training and data sampling. The Visual
Command system is suggested in [25], where the hand posture is
surrounded by ordinary daily items. The work is interesting and can
pave the way for the use of new symbols in the sign language domain.

3. The proposed method

The proposed method is the first of its kind to make use of both
Template Matching and low-level pixel analysis for the recognition of
images captured from live video. The system's algorithm comprises seven
phases. In the first phase, the image tightly bounding the hand posture
is extracted, for which edges are detected in the second phase. The
third phase searches out the thumb's position through Template
Matching. Fingers are distinguished from one another in the fourth
phase. The group of the sign is identified in the fifth phase using the
positions and status of the distinguished fingers. Signs are recognized
in the sixth phase, while the last phase produces output in both audio
and visual forms. The abstract-level diagram of the algorithm is given
in Fig.1.

[Fig.1 Schematic of the proposed System: Core and Kernel Extractions,
Edge Detection, Template Matching, Distinguishing Fingers, Group
Detection, Sign Recognition]

3.1. Core and Kernel images extractions

The Core image, in the proposed algorithm, is the portion of the
captured scene most likely to hold the hand posture, while the Kernel
is the image that encloses exactly and only the hand gesture. The very
first phase is to segment out the portion containing the hand posture
from the rest of the input scene. This is achieved in two steps: first
extracting the Core image and then the Kernel image. The process of
obtaining the meaningful hand posture (Kernel) from a scanned frame is
shown in Fig.2.

[Fig.2 Schematic of Kernel extraction: Capturing whole scene, Core
image extraction, Binary of Core image, Sliding Scan, Kernel Image]
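
The kernel extraction summarized in Fig.2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, the Core image is assumed to be already binarized into a list of rows with 1 marking hand pixels, and, since only the topmost, leftmost and rightmost bounding pixels are named in the text, the crop is assumed to extend down to the last row of the Core image.

```python
def extract_kernel(binary_core):
    """Crop a binary Core image to the box bounded by the topmost,
    leftmost and rightmost foreground pixels (hypothetical helper)."""
    # Collect the (row, column) position of every foreground (hand) pixel.
    hits = [(r, c)
            for r, row in enumerate(binary_core)
            for c, value in enumerate(row)
            if value]
    if not hits:
        return None  # no hand posture inside the region of interest
    tm = min(r for r, _ in hits)  # topmost row (Tm)
    lm = min(c for _, c in hits)  # leftmost column (Lm)
    rm = max(c for _, c in hits)  # rightmost column (Rm)
    # Assumption: the kernel runs from Tm down to the bottom edge of the core.
    return [row[lm:rm + 1] for row in binary_core[tm:]]


core = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
]
kernel = extract_kernel(core)
```

Cropping before any further processing keeps the later phases (edge detection, template matching, finger scanning) confined to the few rows and columns that actually contain the hand.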


The Core image is a 265x245 pixels image extracted in RGB from the
dedicated region of the running video, as shown in Fig.3. The Core
image, which contains the hand posture along with unwanted background,
is assigned to an image variable Cr.

Fig.3 The extracted Core image

If Cr, Kr and HP represent the Core image, the Kernel image and the
Hand Posture respectively, then HP lies within Kr, which in turn lies
within Cr.

The white spaces of Cr, containing mostly meaningless background, are
removed to achieve faster computation. The Kernel (Kr) is the crux of
the Core image, tightly bounding the hand posture. To get Kr, Cr is
first assigned to a temporary image (T img), which is converted to
binary for the purpose of finding the Tm (Topmost), Lm (Leftmost) and
Rm (Rightmost) pixels enclosing the HP, as shown in Fig.4.

Fig.4 Binary of Cr with Lm, Tm and Rm Pixels.

Considering the rows and columns of Cr enclosed inside Tm, Lm and Rm,
Kr is obtained from the Core image.

3.2 Edge detection

Before the detection of edges, the slice of the scanned RGB image
confined by Tm, Lm and Rm is assigned to Kr, as shown in Fig.5(a). Kr,
with rows m and columns n extracted from rows r and columns c of Cr,
exactly circumscribes the hand posture.

For efficient results, edges are detected by the Sobel operator. For
each point of Kr, the gradient G is obtained. The edge-detected Kernel
is shown in Fig.5(b).

Fig.5 (a) The RGB of Kernel (b) Edge detected Kernel

3.3 Template matching

For easy identification of the thumb, a unique Template-sign is
designed over the thumb jacket (covering), which can be seen in all the
given figures. The thumb's template (Tt) is searched for in G of Kr
using the Square Difference Matching method, where Tx and Ty represent
the width and height of the template image respectively.
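
The template search of Section 3.3 can be illustrated with a minimal pure-Python sketch of square-difference matching. It assumes the plain sum-of-squared-differences score (the form OpenCV exposes as TM_SQDIFF); whether the authors used this variant or a normalized one is not stated. The function name and the tiny test arrays are hypothetical.

```python
def match_sqdiff(image, template):
    """Slide the template over the image and return ((x, y), score)
    for the position with the smallest sum of squared differences."""
    ty, tx = len(template), len(template[0])  # template height (Ty), width (Tx)
    best_pos, best_score = None, None
    for y in range(len(image) - ty + 1):
        for x in range(len(image[0]) - tx + 1):
            # Squared difference summed over the whole template window.
            score = sum(
                (template[j][i] - image[y + j][x + i]) ** 2
                for j in range(ty)
                for i in range(tx)
            )
            if best_score is None or score < best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score


gradient = [
    [0, 0, 0, 0],
    [0, 9, 1, 0],
    [0, 1, 9, 0],
    [0, 0, 0, 0],
]
thumb_template = [[9, 1],
                  [1, 9]]
position, score = match_sqdiff(gradient, thumb_template)
```

With this score a perfect match gives 0, so the best location is the minimum rather than the maximum of the result, which is the main practical difference from correlation-based matching.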



3.4 Distinguishing fingers

Excluding the thumb, the rest of the fingers are first detected and
then distinguished from each other. The algorithm follows the sliding
technique of scanning to detect a finger's tip: the image is scanned
from top-left to bottom-right until the first black pixel is
encountered. To make the system signer-independent and to avoid the
distance problem, the Finger Width is calculated once for the topmost
finger. This is achieved by going five pixels deep from the finger's
tip and calculating the Euclidean distance between the left and right
edge-pixels of the finger, as shown in Fig.6(b).

To absorb the slight width differences among fingers, the Finger Width
is enlarged by five pixels and is then assumed constant for all four
fingers.

Fig.6 (a) Finger's tip (b) Finger's width

The same sliding method is repeated to find the topmost pixel for the
detection of each of the remaining three fingers. To avoid scanning an
already identified finger, topmost pixels lying inside any of the
previously identified fingers are omitted for each finger i, where i
ranges from 1 to 4.
The next step is to distinguish each finger from the rest. For this
purpose, the leftmost of the known fingers is found; the finger
accessed first from the left is sure to be the SMALL finger, the next
will be the RING finger and so on, as shown in Fig.7.

Fig.7 Small (S), Ring (R), Middle (M) and Index (I) fingers recognized.

3.5 Group detection

Based on the respective positions of fingers, closely resembling signs
are grouped together. The logic structure of the groups, namely G1, G2
and G3, compares the x-positions of the thumb and the index finger:

G1: the thumb rests within 20 pixels at the right of the index finger.
G2: Thumb.x < index.x, i.e. the thumb lies at the left of the index finger.
G3: the thumb is at least 20 pixels away at the right of the index finger.
3.6 Sign recognition

The distinguished fingers with their respective positions are fed to
the Engine, specially designed for isolated signs. The system
dynamically assigns the x and y positions of each finger to a Point
variable declared with the name of that finger. The Engine, inferring
from the respective positions of the distinguished fingers, recognizes
the signs.
The standard deviation (SD) of the y-coordinates of all the fingers,
except the thumb, is calculated to find out whether the fingers are
down or not.
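
A minimal sketch of this check is given below. It assumes the population form of the standard deviation (the sample form would give slightly larger values) and uses the threshold of 6 that appears in the decision rules; the finger positions are invented for illustration.

```python
from statistics import pstdev

SD_THRESHOLD = 6  # pixel threshold used by the decision rules


def fingers_sd(fingers):
    """Standard deviation of the fingertip y-coordinates.
    The thumb is assumed to be excluded by the caller."""
    return pstdev([y for _, y in fingers.values()])


# Hypothetical fingertip positions as (x, y) pixels.
all_down = {"Small": (30, 100), "Ring": (50, 98),
            "Middle": (70, 97), "Index": (90, 101)}
mixed = {"Small": (30, 100), "Ring": (50, 98),
         "Middle": (70, 40), "Index": (90, 42)}

sd_down = fingers_sd(all_down)  # tips at nearly the same height
sd_mixed = fingers_sd(mixed)    # two raised, two down
```

When all four tips sit at nearly the same height the spread stays small, whereas a mix of raised and lowered fingers pushes it well above the threshold.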



If all four fingers are down, SD will be less than 6, while in signs
where some fingers are down and others are raised, the value of SD will
be greater than 6.

3.6.1 G1 Signs recognition. In G1 signs, the thumb, being adjacent to
the index finger, rests within a limit of 20 pixels at the right of the
index, as is clear from all the signs shown in Fig.8.

Fig.8 Signs of G1 (a) ALIF (b) JAA (c)

Signs of G1 are recognized using the following decision control
statements.

if ((SD<6)
AND (Thumb.y<Index.y))
if ((SD>6)
AND (Small.y>Ring.y)
AND (Middle.y<Index.y))
WAWO if ((SD<6)
CHOTE if ((SD>6)
AND (Thumb.y<Index.y)
AND (Small.y<Index.y)
AND (Index.y<=Middle.y))

3.6.2 G2 Signs recognition. In G2 signs, the thumb lies at the left of
the index finger, as shown in Fig.9.

Fig.9 Signs of G2 (a) BAA (b) KAF (c) YAA (d) ZOWAD and (e) SEEN

The logic structure of G2 signs is presented as,

ZOWAD if (SD>6)
AND (Index.y<Small.y)
AND (Small.y<Ring.y)
AND (Abs(Ring.y-Middle.y)<5)
SEEN if (SD<6)
AND (Thumb.y>Index.y+10)
KAAF if (Abs(Middle.y-Index.y)<5)
if (SD<6)
AND (Thumb.y>Index.y+40)
AND (Thumb.x<=Index.x)


3.6.3 G3 Signs recognition. The signs in which the thumb is at least 20
pixels away at the right of the index are grouped in G3, shown in
Fig.10.

Fig.10 Signs of G3 (a) LAAM (b) SOWAD

The two signs of G3 are distinguished using the following rule.

LAAM if ((Index.y<Middle.y)
SOWAD if ((Thumb.x>Index.x+20)
AND (SD<6))

3.7 Output

The Engine of the system, after recognizing the exact sign, invokes the
last module, which deals with both audio and visual outputs. The
detected sign is pronounced in the standard Urdu accent and is
displayed as a 200x200 pixels image.

4. Experimental result and analysis

We implemented the model in Microsoft Visual Studio 2010 with the
OpenCV library. A Core i5 laptop running at 2.5GHz was used for the
development and testing of the system. Each captured frame holds four
buttons, which are highlighted on mouse move-over, as shown in Fig.11.
A new frame is captured after allowing a time of 30ms for the signer to
pose. One can capture a sign directly by clicking the Capture button if
the hand gesture is posed before the expiry of 30ms. The ReStart button
resets the system for a new sign. Clicking the Still button halts the
whole system, while the Exit button quits. To test the system, twenty
trials were performed for each sign by five signers.

Fig.11 The captured image with specified ROI

Details of correct, false and missed detection of all the ten digits
are shown in Table 1. Accuracy was calculated using the following
formula.

Accuracy (in %age) = ( No. of Correct Recognition No of Failed Recognition )/Total Signs * 100

Table 1 (columns: correct detection, false detection, missed detection)



A noteworthy attribute of the method is its accurate recognition of
closely resembling signs, such as ALIF and SOWAD, already shown in
Fig.8(a) and Fig.10(b), and WAWO and YAA, shown in Fig.12(a) and
Fig.12(b) respectively. Earlier highly accurate systems, such as the
Fuzzy Classifier Method, fail to distinguish such signs.





Fig.12 The closely resembled signs of (a)
WAWO and (b) YAA



5. Conclusion and future work

The system is specifically designed for the recognition of the alphabet
of PSL, but the applicability and accuracy of the algorithm indicate
that it can be pursued for any sign language. Tilt or orientation of
the hand exceeding ten degrees on either axis is the main cause of
false detection. We aim to enhance the algorithm in future so that
orientation can be tolerated to a larger extent.




6. References

[1] P. S. Rajam, G. Balakrishnan. Real Time Indian Sign Language
Recognition System to aid Deaf-dumb People, IEEE 13th International
Conference on Communication Technology, China, 2011, pp.
[2] A. K. Alvi, M. Y. Azhar, M. Usman, S. Mumtaz, S. Rafiq, R. U.
Rehman and I. Ahmed, Pakistan Sign Language Recognition Using
Statistical Template Matching, International Conference On Information
Technology, Istanbul, Turkey, December 2005,
[3] S. Kausar, M. Younus, Javed and S. Sohail. Recognition of Gestures
in Pakistani Sign Language using Fuzzy Classifier, 8th WSEAS
International Conference on Signal Processing, Computational Geometry
and Artificial Vision, Rhodes, Greece, 2008, pp. 101-105.
[4] T. Starner, J. Weaver and A. Pentland. Real-time American Sign
Language recognition using desk and wearable computer based video, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 1998, vol.
20, pp. 1371-1375.
[5] S. Akram, J. Beskow and H. Kjellström, Visual Recognition of
Isolated Swedish Sign Language Signs, Ph.D thesis, Cornell University
Ithaca, New York, United States, 2012.
[6] P. Doliotis, C. Mcmurrough, D. Eckhard, and V. Athitsos, Comparing
gesture recognition accuracy using color and depth information,
Conference on Pervasive Technologies Related to Assistive Environments,
2011, pp. 20-20.
[7] G. Fang, W. Gao, and D. Zhao, Large-Vocabulary Continuous Sign
Language Recognition Based on Transition-Movement Models, IEEE
Transactions on Systems, Man and Cybernetics, 2007, Vol. 34, Issue 3,
pp. 305-314.
[8] W. Gao, J. Ma, J. Wu and C. Wang, Sign language recognition based
on HMM/ANN/DP, International Journal of Pattern Recognition and
Artificial Intelligent, 2000, pp. 587-602.
[9] O. Al-Jarrah and A. Halawani, Recognition of gestures in Arabic
sign language using neuro-fuzzy systems, Artificial Intelligence and
Soft Computing, 2006, pp. 117-138.
[10] Kadous and M. Waleed, Machine recognition of Auslan signs using
PowerGloves towards large-lexicon recognition of sign language,
Proceedings of the Workshop on the Integration of Gestures in Language
and Speech, 1996, pp. 165-174.
[11] H.D. Yang and S.W. Lee, Robust sign language recognition with
hierarchical conditional random fields, IAPR International Conference
on Pattern Recognition, Istanbul, Turkey, August 2010, pp. 2202-2205.
[12] K. Assaleh and M. Al-Rousan, Recognition of Arabic Sign Language
Alphabet Using Polynomial Classifiers, EURASIP Journal on Applied
Signal Processing, 2005, pp. 2136-2146.
[13] S. Lang, D. Marco and B. Berlitz, Sign Language Recognition with
Kinect, MS thesis, Freie Universität, Berlin, 2011.
[14] Vezhnevets, V. Sazonov and A. Andreeva, A Survey on Pixel-based
Skin Color Detection Techniques, Proc. Graphicon, Moscow, Russia, 2003,
pp. 85-92.
[15] X. Zabulis, H. Baltzakis and A. Argyros, Vision-based Hand Gesture
Recognition for Human-Computer Interaction, Ph.D. thesis, Institute of
Computer Science Foundation for Research and Technology, University of
Crete Heraklion, Crete, Greece, 2007.
[16] J. Ravikiran, K. Mahesh, S. Mahishi, R. Dheeraj, S. Sudheender and
N. V. Pujari, Finger Detection for Sign Language Recognition,
Proceedings of the International MultiConference of Engineers and
Computer Scientists, Hong Kong, 2009, Vol. 01,
[17] R. Rokade and M. Kokare, Hand Gesture Recognition by Thinning
Method, International Conference on Digital Image Processing, IEEE
Computer Society, Bangkok, 2009, pp. 284-287.
[18] J. Kuch and T. S. Huang, Vision-based hand modeling and tracking
for virtual teleconferencing and telecollaboration, IEEE Int. Conf.
Computer Vision, Cambridge, 1995, pp. 666-671.


[19] S. Nagarajan, T. Subashini and V. Ramalingam,

Vision Based Real Time Finger Counter for Hand
Gesture Recognition, International Journal of
Technology CPMR-IJT, December 2012. Vol. 2.
[20] U.M. Erdem and S. Sclaroff, Automatic detection of
relevant head gestures in American Sign Language
communication, IAPR International Conference on
Pattern Recognition, BRAC University, 2002, Vol.1,
[21] V. Agris, M. Knorr, K.F. Kraiss and J. Kim, The
significance of facial features for automatic sign
language recognition, 8th IEEE International
Conference on Automatic Face Gesture Recognition,
Amsterdam, Netherlands, Sept. 17-19 2008, pp. 1-6.
[22] L. Muir and S. Leaper, Gaze tracking and its
application to video coding for sign language,
Picture Coding Symposium, the Robert Gordon
University, Schoolhill, Aberdeen, UK, 2003, pp. 321-325.
[23] T. Shanableh and K. Assaleh, Arabic sign language
recognition in user independent mode, IEEE
International Conference on Intelligent and
Advanced Systems, Kuala Lumpur 2007, pp.597-600.
[24] H. Tauseef, M.A. Fahiem, and S. Farhan,
Recognition and Translation of Hand Gestures to
Urdu Alphabets Using a Geometrical Classification,
Second International Conference in Visualization,
page 213-217, Barcelona, Spain, July 2009
[25] Chalechale and F. Safae, Visual-based interface
using hand gesture Recognition and Object
Tracking, Iranian Journal of Science & Technology,
Transaction B, Engineering Shiraz University, 2008
Vol. 32, pp. 279-293.



Sense Tagged CLE Urdu Digest Corpus

Saba Urooj, Sana Shams, Sarmad Hussain, Farah Adeeba
Centre for Language Engineering, Al-Khawarizmi Institute of Computer Science,
University of Engineering and Technology, Lahore
Determining the correct meaning of words in their respective context
is called lexical disambiguation or Word Sense Disambiguation (WSD) in
the field of computational linguistics. WSD is defined as the process
of computationally determining which sense of a word is triggered by
the use of the word in a particular context [2]. A word's meaning is
formed by patterns of use; thus a new meaning arises with a new pattern
of use. To analyze this usage pattern, corpora need to be sense tagged.
A sense-tagged corpus is a significant linguistic
resource because it contains a lot of rich semantic
knowledge that can be used in theoretical linguistics. In
the field of computational linguistics, these resources
are critically used for natural language processing
(NLP) such as traditional and statistical machine
translation, creation and navigation of metadata,
computer-assisted lexicography and ontology-driven
frameworks such as the semantic web [3]. Moreover,
statistics extracted from analysis of sense tagged
corpus can be used in a number of other research
domains such as information retrieval, information
extraction, text summarization and automatic question
answering [4].
There are three distinct approaches to word sense disambiguation: the
knowledge-based approach, the unsupervised approach and the
supervised/semi-supervised approach, distinguished by the main source
of knowledge used for sense differentiation [1]. The knowledge-based
approach uses methods that depend upon dictionaries, thesauri and
lexical knowledge bases, but do not employ any corpus evidence.
Unsupervised approaches include methods that work directly from raw
un-annotated corpora, e.g. word-aligned corpora used to gather
cross-linguistic evidence for sense discrimination. The supervised
approach includes methods that use corpora annotated with sense IDs to
train from, or as seed data in a bootstrapping process. Therefore, the
presence of a large-scale sense tagged corpus is critical for
successful WSD programs.
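
To make the knowledge-based family concrete, the sketch below implements a simplified Lesk-style overlap count, one well-known dictionary-based method. It is offered only as an illustration and is not the approach taken in this paper; the function name is hypothetical and the two glosses are invented around the classic "bank" example.

```python
def simplified_lesk(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        # Overlap = number of distinct words common to gloss and context.
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses, invented for this illustration.
glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a river or lake",
}
sentence = "she sat on the river bank watching the water flow".split()
chosen = simplified_lesk(sentence, glosses)
```

Such a method needs no annotated corpus, which is exactly why it struggles when glosses are short; supervised methods trade that independence for the need for sense-tagged training data.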
This paper presents the construction of an Urdu
Sense Tagged Corpus Ver. 1.0. The final corpus
consists of 5,611 sentences with 100K words of which
17,006 words are sense tagged. The paper is organized

This paper presents the construction of an Urdu Sense Tagged corpus
using four main lexical resources: an Urdu wordlist consisting of 5000
high-frequency content words, a 100K-word corpus annotated with part of
speech (POS) tags, an Urdu WordNet with approximately 5058 senses and
an Urdu morphological analyzer. The paper also briefly presents the
Urdu word-sense annotation tool, a software tool developed to provide
an easy interface for sense tagging, ensure tagging consistency and
accelerate annotation. In this version of the Urdu sense tagged corpus,
17,006 words have been sense tagged with 2285 unique senses. The final
section discusses the linguistic and tool-specific challenges in the
construction of the sense tagged corpus and describes future work in
this context.

1. Introduction
Words possess multiple senses and each sense is
context sensitive where sense can be defined as
semantic value (content) of a word when compared to
other words; i.e. when it is part of a group or set of
related words. The process of assigning senses to a
word is not simple as those senses are clearly different
and in some cases completely unrelated to each other
[1]. Therefore, most sense distinctions are not as
obvious as the distinction between bank as financial
institution and bank as river side. For example 1,
bank as financial institution divides into following
related senses: the company or institution, the building
itself, the counter where money is exchanged, a fund or
reserve of money and a supply of something held in
reserve (blood bank). Similarly in Urdu language,
( pr:\paper) can have following related
senses: ( mth:n ke: sv:l:t
k: k:z/exam paper), ( k:z k:
k:/piece of paper) and /( hft:
v:r xb:r rs:l:/ daily or weekly publication).




as follows. Section 2 reviews various manually sense tagged corpora
constructed using the respective languages' WordNets. Section 3 then
details the development of the Urdu sense tagged corpus, first
presenting the lexical resources used and the process followed in sense
annotation, and Section 4 presents the current state of the Urdu sense
tagged corpus. Lastly, Sections 5 and 6 describe the research
challenges in the development of the Urdu sense tagged corpus and
future work in this context.

the senses of a targeted set of words that occurred most

frequently in an English text. The experiments carried
out with the DSO corpus prompted the subsequent
evaluation efforts of Senseval [6].
Sense tagged corpora have been developed for other
languages as well. For example, Japanese SemCor
(JSEMCOR) [9] has been constructed using annotation
transfer approach. According to this approach, sense
tagged corpus in one language is translated into the
target language and sense annotations are also
projected to the target language. The sense projection
is carried out using a WordNet in the target language
which is aligned with the WordNet that was used to
sense tag the source language text. The source corpus
used is English SemCor and the source WordNet is
Princeton3 (1.6) WordNet of English. The target
language WordNet is Japanese WordNet [9]. The final
corpus consists of 14,169 sentences with 150,555
content words of which 58,265 are sense tagged. The
license is similar to the Princeton WordNet License, so
the data is freely available.
Attempts have been made via DutchSemCor project
[10] to develop a Dutch corpus that is sense-tagged
with senses from the Cornetto lexical database [11]
using a semi-automatic approach. In DutchSemCor
about 282,503 tokens for 2,870 nouns, verbs and
adjectives (11,982 senses) have been manually tagged
by two annotators, resulting in 25 examples on average
per sense (more than 400,000 have been manually
tagged by at least one annotator and millions have been
automatically tagged). This corpus was built in two
phases. In the first manual phase 25 examples are
collected for each sense. These examples are used to
train a supervised WSD system for the second phase.
The supervised system searches for the remaining 75
examples of the different senses to complete the
corpus. Active learning [12] is used to steer the
supervised system in selecting appropriate examples.
Dutch-SemCor is not available, but excerpts and
statistics are freely downloadable.
Similarly, Bulgarian word sense tagged corpus4 [13]
has been constructed from Bulgarian Brown corpus. It
consists of 811 excerpts, each containing 100+ words;
the total size of the source corpus is 101,062 tokens.
The words in BulSemCor are assigned meaning
manually from the Bulgarian WordNet [14]. The
sense-annotated corpus consists of 99,480 lexical units
annotated with the most appropriate synset from the
Bulgarian WordNet (BulNet). The corpus excerpts are
offered free of charge under the MS No Redistribution Non Commercial
license; it is also possible to query the corpus

2. Literature Review
In a comprehensive survey conducted by Bond [5],
almost all available WordNet tagged corpora (along
with their availability) for English and for other
languages have been enlisted. Among these sense
tagged corpora for English language, HECTOR,
SemCor and DSO corpus have been explained below
in further details. HECTOR was a collaborative project
by Oxford University Press and Digital Education
project [6] in corpus lexicography. All corpus instances
in a 20 million word corpus (a pilot for the British
National Corpus) were tagged according to the senses
in a dictionary entry that was being developed
alongside the tagging process. The database comprises
200,000 tagged instances and an associated set of
dictionary entries. There are 300 words associated with
over 100 corpus instances [6].
The English SemCor corpus is a sense-tagged
corpus of English created at Princeton University by
the WordNet Project research team [7]. It was created
very early in the WordNet project and was one of the
first sense-tagged corpora produced for any language.
The corpus consists of a subset of the Brown Corpus
(700,000 words, with more than 200,000 sense-annotated) and it has been part-of-speech-tagged and
sense-tagged. It is distributed under the Princeton
WordNet License.
The DSO corpus includes about 192,800 occurrences of 121 frequent
nouns and 70 frequent verbs of English, which have been manually
sense-tagged with WordNet 1.5 senses [6, 8]. It is distributed on the
Linguistic Data Consortium Catalogue2 (LDC) under different licenses
for LDC Members (free for 1997 members) and non-members.
Unlike SemCor, which adopted the all-words approach, i.e. assigning
sense tags to all words in a running text, and hence resulted in an
insufficient number of training examples per word for a supervised
learning approach, the DSO corpus focuses on tagging


See catalog.ldc.upenn.edu/LDC97T12


See dcl.bas.bg/en/corpora_en.html#SemC


online. The restrictions on use and redistribution mean
that the corpus is not considered open source.
An attempt has been made to devise a word sense annotated corpus for
the Chinese language [15]. The resulting work contains three
components: a corpus annotated with word senses, a lexicon containing
sense distinctions and descriptions, and the linking between the
lexicon and the Chinese Concept Dictionary. In this corpus, 813 nouns
and 132 verbs have been analyzed and described in the lexicon with the
feature-based formalism, and 60,895 word occurrences have been sense
tagged from three months of texts of People's Daily, an official daily
newspaper of the government of China, with a move to extend this corpus
to other kinds of texts.
Undoubtedly, high-accuracy WSD needs a large-scale word sense tagged
corpus as training material [16]. It is argued that no fundamental
progress in WSD can be made until large-scale lexical resources are
built [17]. From the literature reviewed above, it is evident that the
number of sense tagged corpora for English and for other languages has
increased in the past years. In the field of Urdu lexical resource
development, some attention has been paid to corpus construction, POS
tagging and WordNet development by the Centre for Language Engineering,
by making the CLE Urdu Digest Corpus5 [18] and Urdu WordNet [19]
available for research, but significant research still needs to be
focused on sense

3.1.1. Urdu WordNet 1.0 Wordlist

The Urdu WordNet 1.0 Wordlist6 used in sense
tagging comprises 5000 content words. Approximately
3000 of these words were extracted from the following
three sources: (i) an 18 million word corpus crawled
from online news websites covering a wide range of
domains including sports, news, finance and culture;
(ii) the CLE Urdu Digest Corpus, covering domains
such as education, health, politics, international
affairs, sports, business, humor and literature; and
(iii) the Urdu Verblist7 extracted from the Online Urdu
Dictionary (OUD8). A further 2000 words were added
during the development of Urdu WordNet on the basis
of the initial 3000 words.
3.1.2. CLE Urdu Digest Corpus
The initial data for sense tagging has been taken from
the CLE Urdu Digest Corpus 100K [18], which contains
102,209 words of Urdu. It is a collection of written
Urdu from a wide range of domains, designed for the
purpose of linguistic research. The corpus covers a
range of subjects including education, health, politics,
international affairs, sports, business, humor and
literature.
The CLE Urdu Digest Corpus is divided into two major
categories, i.e. Informational (80%) and Imaginative
(20%). The Informational part includes texts from
letters, interviews, press, religion, sports, culture,
entertainment, health and science. The Imaginative part
includes texts from short stories and novels, translations
of foreign literature and book reviews. A total of
83,450 words have been collected in the Informational
domain, amounting to 81.6% of the corpus, and
18,759 words in the Imaginative domain, forming 18.4%
of the corpus. The data for this corpus has been taken
from Urdu Digest9 and covers the years 2003 to 2011.
The data is distributed in 348 UTF-8 files, arranged
according to the genres mentioned above. Each file
contains a minimum of three hundred words.
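The reported proportions can be checked directly from the two word counts. The short sketch below only redoes the arithmetic; the helper name `share` is introduced here for illustration.

```python
# Word counts reported for the CLE Urdu Digest Corpus 100K
informational = 83_450
imaginative = 18_759
total = informational + imaginative


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(total)                         # 102209
print(share(informational, total))   # 81.6
print(share(imaginative, total))     # 18.4
```

The sum of the two domains matches the stated corpus size of 102,209 words.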

3. Developing Urdu Sense Tagged Corpus

This paper describes the development of an Urdu
corpus annotated with word senses, in order to build a
comprehensive resource for Urdu lexical semantics. In
the construction of the sense tagged corpus, four
linguistic resources have been used: the Urdu WordNet
1.0 Wordlist, the CLE Urdu Digest Corpus, Urdu WordNet
and the Urdu Morphological Analyzer. These resources,
the annotation method and the annotation tool are
described in the following sections.

This corpus has been manually tagged with parts of
speech, using the revised POS tagset [20]. 80% of the
corpus was then used to train the Urdu POS tagger
available on the CLE website. The tagger was then tested

3.1. Linguistic Resources and Applications

Four major resources have been used to develop the
Urdu sense tagged corpus: the Urdu WordNet 1.0
Wordlist, the CLE Urdu Digest Corpus, Urdu WordNet
and the Urdu Morphological Analyzer.





on 20% of the corpus. The files for POS tagging were
selected randomly. The results showed a tagging
accuracy of 96.8%.
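A random 80/20 split and a token-level accuracy measure of the kind described above might be sketched as follows. This is only an illustration: the paper does not say whether the split was made by file count or by word count, so splitting by file count, the fixed seed and the file names are all assumptions.

```python
import random


def split_files(files, train_fraction=0.8, seed=0):
    """Shuffle file names reproducibly and split them into train and test lists."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hypothetical file names standing in for the 348 corpus files
files = [f"file_{i:03d}.txt" for i in range(348)]
train, test = split_files(files)
print(len(train), len(test))  # 278 70
```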

in the sequence in which it occurs. The second
approach is the targeted tagging method, in which all
occurrences of a single target word are listed with their
left and right context, as in a Key Word In Context
(KWIC)11 index, and are annotated by comparing the
contexts. The targeted tagging approach enables the
annotator to consider the different meanings of a word
only once and hence to compare the different contexts
more systematically and consistently. Sequential
tagging, on the other hand, requires the annotator to
shift focus to a different word at every step, which
results in substantial cognitive load. Considering this
advantage of the targeted approach, CLE Urdu Sense
Tagged Corpus Ver. 1.0 has been annotated using the
targeted tagging approach.
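The KWIC-style listing that underlies targeted tagging can be sketched in a few lines. This is a minimal illustration, not the CLE tool itself; the function name `kwic`, the window size and the English toy sentence are assumptions made for readability.

```python
def kwic(tokens, target, window=2):
    """Return (left context, keyword, right context) for every occurrence of target."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


tokens = "the bank of the river was near the bank that lends money".split()
for left, key, right in kwic(tokens, "bank"):
    print(f"{left:>10} | {key} | {right}")
```

Listing all occurrences of one word side by side is what lets the annotator compare contexts and apply the same sense distinctions consistently.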

3.1.3. Urdu WordNet

Urdu WordNet10 is a semantic dictionary of Urdu [19]
developed by the Center for Language Engineering. It
contains approximately 5058 senses. Each synset entry
has a POS, a unique synset ID, a definition, the synset
members and an example. An example entry is shown
in Figure 1.

3.2.2. Urdu Word-Sense Annotation tool (rdu

mfhu:m k:r /)
For word sense annotation, a tool has been developed
by the Center for Language Engineering which enables
manual disambiguation of large volumes of text. The
tool takes POS tagged files as input and generates
sense tagged files as output, in which words are tagged
with the synset IDs defined in Urdu WordNet. The user
interface of the tool provides three views, shown as
windows 1, 2 and 3 in Figure 2.

Figure 1 Layout of Urdu WordNet

3.1.4. Urdu Morphological Analyzer
An Urdu Morphological Analyzer [21] is used to
generate all the possible morphological forms of a
word. The analyzer is integrated into the sense tagging
tool so that, when corpus matches are displayed for a
word being sense tagged, occurrences of all
morphological forms of the base word are shown along
with occurrences of the base form itself. This gives an
annotator following the targeted annotation approach
additional data: besides the entries of the specific
word, entries of its various morphological forms are
also displayed for sense tagging.

Selection view

The selection view of the interface displays the list of
high frequency words and enables the annotator to
select a target word. This window makes use of the Urdu
wordlist (discussed in section 3.1.1). The tool then
matches these words with those in the corpus and
WordNet.

WordNet view
This view displays the linguistic information available
in WordNet for the selected lexical item. This window
integrates the file generated from Urdu WordNet and
enables the annotator to select the most appropriate
sense among the available senses, checking that its POS
matches the POS used in the corpus. With the help of
this view, the annotator can also compare the context of
a sense in the WordNet example with its context in the
corpus.

3.2. Sense Annotation Method and Tool

The following sub-sections give detail about the
method used for sense tagging and Urdu word sense
annotation tool i.e. rdu mfhu:m k:r / .
3.2.1. Sense Annotation Method
The word forms in the corpus were POS-tagged and
linked to the corresponding word senses in Urdu
WordNet, if available. Conventionally, two annotation
methods are used in corpus sense tagging. The first is
the sequential tagging method, in which the corpus text
is presented in its original order and each word is tagged


A KWIC index is formed by sorting and aligning the
words within an article title to allow each word (except
the stop words) in titles to be searchable alphabetically
in the index.





Figure 2 Layout of the annotation tool showing word sense annotation

Corpus view

The corpus view of the interface displays the corpus
with all occurrences of the target word. Through the
integration of the Morphological Analyzer, this view
displays not only all occurrences of the specific word
in the corpus but also all occurrences of the word's
morphological forms available in the corpus. As this
window shows the complete sentence in which the target
word occurs, the annotator can take in the full
preceding and following context of each occurrence in
order to mark the most appropriate sense among the
available word senses.
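The way the corpus view combines a target word with its morphological forms can be illustrated with a small lookup. This is a sketch under stated assumptions: the dictionary `forms_of` stands in for the output of the morphological analyzer, and the romanized Urdu words and sentences are illustrative stand-ins rather than corpus data.

```python
def find_occurrences(sentences, lemma, forms_of):
    """Return (sentence index, token index, token) for the lemma and all its listed forms."""
    targets = {lemma} | set(forms_of.get(lemma, ()))
    hits = []
    for s_idx, sentence in enumerate(sentences):
        for t_idx, token in enumerate(sentence.split()):
            if token in targets:
                hits.append((s_idx, t_idx, token))
    return hits


# Hypothetical analyzer output: base form -> inflected forms (romanized)
forms_of = {"kitab": ["kitaben", "kitabon"]}
sentences = [
    "yeh kitab achi hai",
    "kitaben mez par hain",
    "woh ghar gaya",
]
hits = find_occurrences(sentences, "kitab", forms_of)
print(hits)  # [(0, 1, 'kitab'), (1, 0, 'kitaben')]
```

Each hit identifies a full sentence that would be shown to the annotator, so inflected occurrences are tagged in the same pass as the base form.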

Figure 3 Snapshot of the annotation tool showing tagging options

3.2.3. Corpus Annotation Tags

The annotator selects a specific word from the
selection view, and all its occurrences and the
occurrences of its morphological forms are highlighted
(in a new color) in the Corpus view window. Based on
the senses of the selected (target) word available in
the WordNet view, the annotator carefully tags every
occurrence. Ideally, if the sense of a word occurrence
exactly matches a sense entered in WordNet, the
annotator selects that sense and the occurrence is
tagged with that sense ID. If the sense of a specific
target word is not available in the WordNet view, the
annotator reports the problem using one of four
options provided in the annotation utility. These are
also displayed in Figure 3.

Insufficient Context

This is the case where the context is too brief for a
sense to be assigned to the word, e.g. in the following
sentence the context of the word u:pr (/over) is too
brief to be sense tagged.


> Insufficient_Context</
s trh u:pr hi: u:pr ki: r:ste: bn ge:
this like over and over many ways made were
Like this, many ways were made.

Literary Reference/symbolic Sense
This tag is used for idiomatic phrases in the corpus, or
for cases where a word does not convey its literal
meaning but is used in a symbolic sense, e.g.



in the following sentence nbz: (/pulse) cannot
be sense tagged as it has been used as an idiom here.
> Literary_Reference_&_Idioms</

vkt ki: nbz: d i:m: l rhi: th:
slowly passing was
Time was passing slowly.

Non-standard Usage of language
This tag is used in contexts where people use the
semantic aspects of language imaginatively and
creatively, in the same way as they do not always follow
the rules of grammar and syntax. Apart from some
recognized figurative sense extensions in the
dictionary, this aspect of language use is unpredictable.

Table 2: Word count with no. of senses tagged






Work is in progress to add more examples per
sense to aid the development of an automatic sense

5. Improving WordNet via sense tagging process

Word Sense Not Available

This is the context where the particular sense of the
word being displayed in the corpus view is valid but
not available in the WordNet view. This option was
specifically designed to provide feedback to the Urdu
WordNet development team for inclusion of missing
senses for a word.

During the course of annotation, the annotators
followed certain consistency criteria which helped in
improving WordNet as well. The following issues were
identified during the process of tagging:

5.1. Consistency of sense definition with POS

Other
This is the category left for the addition of any other
comment by the annotators and is tagged for discussion
and mutual agreement.

In some cases, it was observed that the
interpretative definition (gloss) associated with the
synset does not match the POS of the word, e.g.
initially the sense of the word fsurd: (/sad) was
more like a verb than an adjective. After feedback to
the WordNet team, it was modified to convey the
adjective sense.

4. Current status of Urdu sense tagged corpus

The corpus used for sense tagging is the POS tagged
CLE Urdu Digest Corpus [20]. The sense layer has been
added to this corpus manually over a period of ten
months by a single annotator using the sense annotation
tool rdu mfhu:m k:r/ described in
section 3 above. The current status of the Urdu sense
tagged corpus is given in Tables 1 and 2. Table 1 shows
that the final corpus consists of 5611 sentences with
100K words, of which 17006 are sense tagged.

Table 3: Consistency of Definition with POS





5.2. Consistency of sense example with POS

Table 1: Current state of Urdu sense tagged corpus

Sense tagged corpus
Total no. of sentences in the corpus
Total no. of words in the corpus
Tagged total word types
Tagged total sense types
Tagged total word tokens


In some cases, it was observed that the example
associated with the synset does not match the POS of
the word, e.g.


Table 4: Consistency of Example with POS


Table 2 shows that there are 559 words which have
more than 2 senses tagged and 1522 words which have
one sense tagged in the corpus.







of the Urdu language, e.g. no sense mapping was found
for blnd fire xun, i.e. high blood pressure, and
msi txte, i.e. solar panels.

5.3. Consistency of definition across synset

6.2. Foreign language borrowed words

As the definition associated with the synset encodes
the meaning of all the members of the synset in an
explicit way, it is very important that all the members
relate equally to the sense meaning. Substitution
tests were applied to identify semantic equivalents of
words found in the corpus, and only those synonyms
were retained which have an equal chance of usage in
the example sentence, e.g. d e:rn: cannot be part of
the synset in this particular

No sense mapping was found in the dictionaries for
words which are borrowed from foreign languages and
have been lexicalized in Urdu, e.g. test match, basket
ball and interview.

6.3. Complex Predicates

A fundamental assumption underlying all syntactic
theories has been that the main verb plays the role of
predication within a clause and all other elements in
the clause are either arguments or modifiers [22].
Urdu, however, has complex predicates, defined as
containing two or more predicational elements which
jointly predicate within a mono-clausal structure.
Annotators faced difficulty in tagging complex
predicates, i.e. combinations of a main verb and a light
verb. For example, in the following sentence the word
p:e: :ne: is a sense tagging challenge, as
this sense was not found among the available senses of
the word p:n: in the dictionary.

s m: p:e: :ne: v:li: hj:ti:n hm:ri: :nk :
this in found
protein our
ko: sehtmnd rk ti: h:
healthy keeps
The protein found in this keeps our eyes healthy.

Table 5: Consistency of definition across synset

with POS






5.4. Addition of senses available in the corpus

In the development of WordNet, less frequent
senses were not added. During the process of
annotation, all the "Word Sense Not Available" tags
were reported back to the WordNet team so that those
senses can be added to WordNet according to their
usage in the corpus.

6.4. Normalization
The annotation tool was unable to match corpus
instances in some contexts, e.g. vow with hamza above.
The reason is that these combinations were typed in
different formats in the corpus and WordNet and hence
require normalization [23], which is the process of
representing texts in a consistent
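The kind of mismatch described here can be reproduced with Unicode's own normalization forms. The sketch below is an illustration only: the paper does not state which normalization scheme was applied or which exact code points differed, so the use of NFC and the two encodings of waw with hamza above are assumptions.

```python
import unicodedata

precomposed = "\u0624"         # ARABIC LETTER WAW WITH HAMZA ABOVE (single code point)
decomposed = "\u0648\u0654"    # ARABIC LETTER WAW + combining ARABIC HAMZA ABOVE


def to_nfc(text):
    """Normalize text to Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# The two strings render alike but do not compare equal before normalization.
print(precomposed == decomposed)                  # False
print(to_nfc(precomposed) == to_nfc(decomposed))  # True
```

Applying the same normalization to both the corpus and the WordNet entries before matching would let a string comparison succeed for such pairs.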

6. Challenges in the Process of sense tagging

Annotators faced two specific kinds of challenges
during the process of tagging: a) tool specific
limitations, i.e. in some cases the annotation tool
could not match the corpus instances; b) language
specific limitations, i.e. annotators faced difficulty in
tagging certain linguistic contexts such as
non-standardized translations, foreign language
borrowed words and complex predicates. The details of
these contexts are given below:

7. Conclusion and Future Work

This paper describes the construction of a sense-tagged
Urdu corpus. The goal of this research has been
to create a valuable resource both for word sense
disambiguation and for research on Urdu lexical
semantics. The current corpus consists of 5611
sentences with 100K words, of which 17006 words are
sense tagged. This manually annotated corpus can act

6.1. Non-standardized translations

It was difficult to tag non-standardized
translations of English words which have become part



as a seed corpus for automated methods to extract
additional senses and their relationships.


8. Acknowledgements
This work has been conducted through the Essential
(www.cle.org.pk/eulr), supported through a research
grant from DAAD, Germany.



9. References
[1] E. Agirre and P. Edmonds, Word Sense
Disambiguation: Algorithms and Applications,
Springer, 2007.
[2] A. Kilgarriff, "Word senses", Word Sense
Disambiguation, Springer Netherlands, 2006.
[3] P. Resnik, "WSD in NLP applications", Word
Sense Disambiguation: Algorithms and
Applications, 2006.
[4] S. J. Ker, C. R. Huang, J. F. Hong, S. Y. Liu, H. L.
Jian, I. L. Su and S. K. Hsieh, "Design and Prototype
of a Large-scale and Fully Sense-tagged Corpus",
Large-Scale Knowledge Resources. Construction
and Application, Springer Berlin Heidelberg, 2008.
[5] T. Petrolito and F. Bond, "A survey of WordNet
annotated corpora", Proceedings of the 7th Global
WordNet Conference (GWC 2014), 2014.
[6] A. Kilgarriff, "SENSEVAL: An Exercise in
Evaluating Word Sense Disambiguation
Programs", Proceedings of the First International
Conference on Language Resources and
Evaluation, Granada, 1998.
[7] S. Landes, C. Leacock and R. I. Tengi, "Building
semantic concordances", WordNet: An Electronic
Lexical Database, 1998: 199-216.
[8] M. Palmer, H. T. Ng and H. T. Dang, "Evaluation
of WSD systems", Word Sense Disambiguation:
Algorithms and Applications, 2006.
[9] F. Bond, T. Baldwin, R. Fothergill and K. Uchimoto,
"Japanese SemCor: A sense-tagged corpus of
Japanese", Proceedings of the 6th International
Conference of the Global WordNet Association
(GWC), 2012.
[10] P. Vossen, A. Görög, R. Izquierdo and A. van den
Bosch, "DutchSemCor: Targeting the ideal
sense-tagged corpus", LREC, 2012.
[11] P. Vossen, I. Maks, R. Segers and H. VanderVliet,
"Integrating Lexical Units, Synsets, and Ontology
in the Cornetto Database", Proceedings of the Sixth
International Conference on Language Resources
and Evaluation (LREC), 2008.
[12] J. Zhu, H. Wang, T. Yao and B. K. Tsou,
"Active learning with sampling by uncertainty and
density for word sense disambiguation and text
classification", Proceedings of the 22nd
International Conference on Computational
Linguistics, 2008.
[13] S. Koeva, S. Leseva, E. Tarpomanova, B. Rizov, T.
Dimitrova and H. Kukova, "Bulgarian Sense-Annotated
Corpus: Results and Achievements",
FASSBL7, 2010: 41.
[14] S. Koeva, S. Mihov and T. Tinchev, "Bulgarian
Wordnet: Structure and Validation", Romanian
Journal of Information Science and Technology 7,
[15] Y. Wu, P. Jin, Y. Zhang and S. Yu, "A Chinese
corpus with word sense annotation", Computer
Processing of Oriental Languages. Beyond the
Orient: The Research Challenges Ahead, Springer,
Berlin Heidelberg, 2006.
[16] H. T. Ng, "Getting serious about word sense
disambiguation", in Proceedings of the ACL
SIGLEX Workshop on Tagging Text with Lexical
Semantics: Why, What, and How, 1997.

[17] J. Véronis, "Sense tagging: does it make
sense?", in Corpus Linguistics Conference,
[18] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen and R.
Perveen, "CLE Urdu Digest Corpus", in the Proc.
of Conference on Language and Technology 2012
(CLT12), Lahore, Pakistan, 2012.
[19] A. Zafar, A. Mahmood, F. Abdullah, S. Zahid, S.
Hussain and A. Mustafa, "Developing Urdu
WordNet Using the Merge Approach", in the
Proceedings of Conference on Language and
Technology 2012 (CLT12), Lahore, Pakistan.
[20] T. Ahmed, S. Urooj, S. Hussain, A. Mustafa, R.
Parveen, F. Adeeba, A. Hautli and M. Butt, "The
CLE Urdu POS Tagset", poster presentation in
Language Resources and Evaluation Conference
(LREC 14), 2014, Reykjavik, Iceland.
[21] S. Hussain, "Finite-State Morphological Analyzer
for Urdu", National University of Computer and
Emerging Sciences, Lahore, Pakistan, 2004.
[22] M. Butt, "The light verb jungle: Still Hacking
Away", in Workshop on Multi-Verb Constructions,
[23] S. Hussain, S. Gul and A. Waseem, "Developing
lexicographic sorting: An example for Urdu", in
ACM Transactions on Asian Language Information
Processing (TALIP), Volume 6 Issue 3, 2007.

See http://cle.org.pk/eulr/ .



Structural Analysis of Linking Urdu WordNet to PWN 2.1

Ayesha Zafar*, Afia Mahmood**, Sana Shams*, Sarmad Hussain*
* Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of
Engineering and Technology, Lahore, ** University of Education, Lahore
firstname.lastname@kics.edu.pk, afiamahmood@ue.edu.pk
Urdu WordNet1 is the first semantic dictionary
of Urdu, developed by the Center for Language
Engineering. It contains 5058 senses. Each synset entry
has a POS, a unique synset ID, a definition, the synset
members and an example. An example entry is shown in
Figure 1.

Multiple cross-language WordNets, such as Euro
WordNet (EWN), Multi WordNet, Asian WordNet and
Indo WordNet, have been developed that involve
mapping Princeton WordNet (PWN) to the respective
language WordNet [1,2,3,4,5]. The majority of these
projects have employed the transfer-and-merge method
developed during the construction of Euro WordNet for
WordNet linkage. This paper discusses the process,
challenges and results of linking Urdu WordNet to
Princeton WordNet Version 2.1 from a linguistic and
lexicographic perspective. Based on the synset
alignment experience, cross-language (Urdu-English)
linkage issues are highlighted, followed by a
contextual strategy for their resolution. Urdu language
concepts that could not be aligned with PWN 2.1 are
also highlighted and discussed.

Figure 1: Layout of Urdu WordNet


The increasing number of language specific
WordNets has created interest in linking these
WordNets to Princeton WordNet to enhance their
usability. Linking the synsets of one language to those
of another facilitates the development of bilingual
dictionaries, which can be used for machine translation
and cross-language information retrieval. It also
improves the performance of word sense
disambiguation tasks even in the absence of sense
tagged corpora in a target language [3, 5, 9]. This paper
reports the research challenges of aligning Urdu synsets
with English synsets of PWN 2.1.
The paper is organized in the following
sections. Section 2 reviews the current literature
regarding various WordNet linkage projects and their
reported accuracy statistics. Section 3 describes the
approach of linking Urdu WordNet with PWN 2.1.
Section 4 presents in detail the challenges and
solutions for linking Urdu concepts with English
WordNet is a lexical resource whose design is
based on psycholinguistic theories of human memory
on the one hand and the British school of
structural/lexical semantics on the other [6]. Nouns,
verbs, and adjectives are organized into synonym sets,
each representing one underlying lexical concept [7].
Semantic and lexical relations between lexical items
govern their organization and express their meaning.
These relations occur more often between words
belonging to the same part of speech; thus nominal
lexical items are networked with other nominal lexical
items, verbal lexical items with verbal ones, etc.
Furthermore, WordNet is not composed of entries in the
traditional lexicographical sense. It assumes that
synonyms grouped in synsets stand for concepts, and
that most relations hold between concepts rather than
between single lexical items [8].




synsets. Section 5 documents concept categories that
remain un-linked. Alignment results are discussed in
section 6. Finally, Section 7 concludes the paper by
reporting the future work required in this direction.

Thai WordNet has been constructed using manual and
semi-automated approaches [14] [15]. This WordNet
contains 21,344 senses. The major difficulties in the
alignment of Thai WordNet to PWN were caused by
conceptual gaps between Thai and English. For
example, the meanings of retail store and store are
reversed in Thai: retail store denotes store and store
denotes retail store. Similarly, device, implement,
tool, equipment etc. are mapped onto only two Thai
words. Furthermore, one English word doctor cannot be
mapped on two
The Persian WordNet, which is also aligned with PWN,
was created using an automatic approach. The
approach used a bilingual dictionary as well as Persian
and English corpora to align the Persian and PWN
synsets. Montazery et al. [16] explain that their
approach calculates a score for each candidate synset
of a given Persian word and, for each of its
translations, selects the synset with the maximum score
as a link to the Persian word. They report that this
method was more accurate than the manual method.
The accuracy of the automatic approach has been
reported as 82.6%.
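The selection step described for the Persian WordNet, scoring every candidate synset and keeping the best one, has the following general shape. This is a generic sketch and not the scoring function of Montazery et al.; the gloss-overlap score, the function names and the toy English data are all assumptions made for illustration.

```python
def overlap_score(synset, context_words):
    """Toy score (an assumption): number of gloss words that also occur in the context."""
    return len(set(synset["gloss"].split()) & set(context_words))


def best_synset(candidates, context_words):
    """Return the candidate synset with the maximum score, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda s: overlap_score(s, context_words))


# Hypothetical candidate synsets for one translation of a source word
candidates = [
    {"id": "bank-1", "gloss": "sloping land beside a river"},
    {"id": "bank-2", "gloss": "financial institution that accepts deposits"},
]
chosen = best_synset(candidates, ["river", "water", "land"])
print(chosen["id"])  # bank-1
```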
Chinese [17] and Spanish [18] WordNets have
been created using automatic methods. Thai [15]
and Hindi [11][12] WordNets have been developed
using semi-automatic approaches. Urdu WordNet
[19] has been developed using the merge approach,
followed by manual linkage of Urdu synsets to PWN 2.1
synsets. The following section presents the procedure
of aligning Urdu WordNet with PWN 2.1 and then
describes in detail the specific alignment
challenges faced in the process.

Literature Review

Recently there have been multiple attempts to
build WordNets for different languages and to link
these WordNets to the English WordNet. The process of
linking involves matching a particular synset in
one WordNet to a synset in another WordNet and
requires a high level of accuracy, especially when the two
languages belong to different cultures. In addition,
conceptual gaps and differences in the coarseness of
the word senses are further challenges faced during
alignment. As reported in [10], three types of
difficulties were faced during the alignment of
Romanian WordNet (RoWN) to PWN: (i) difficulties
caused by similar or intersecting synsets and
non-differentiating or insufficiently distinguishing
examples in PWN; (ii) difficulties caused by structural
differences in wordnet development, e.g. all word
senses in PWN are equal, while the Romanian wordnet has
main and derived senses, and some idiomatic expressions
are missing in the Romanian wordnet; (iii)
difficulties caused by intrinsic differences between
the English and Romanian languages, i.e. at times English
meanings are missing in Romanian and vice versa.
Similar challenges were faced in the linkage of
Hindi WordNet to PWN [11]. A semi-automated system,
the WNSynsetMatcher tool [12], was used for linking
Hindi WordNet with the English WordNet. The authors
report that the main challenges were due to cultural
differences in the concepts of kinship relations,
musical instruments, grains, kitchen utensils, different
tools and certain species of birds and animals. The
solution proposed for alignment is to use direct and
hypernymy linkages.
The Ancient Greek WordNet (AGWN) was constructed
automatically, using digitized Greek-English
lexicons to extract Greek-English word pairs [13].
The Greek word of each extracted pair was then linked
to every synset in PWN. However, not all Greek synsets
were available in PWN. Thus, the AGWN contains 35,000
distinct lemmas, covering 28% of the Greek lexicon,
whereas the Greek lexicon contains 120,000 distinct
lemmas. Bizzoni [13] states that the polysemy of
English, the highly polysynthetic nature of English
and the relatively isolating character of Greek
contributed to major difficulties in the development
of AGWN.

Urdu WordNet to PWN 2.1 alignment


5000 nouns, verbs, adjectives and adverbs were
used to develop the Urdu WordNet (UWN) [19]. In the
next stage, these 5000 words were reviewed and
aligned to PWN 2.1. The following steps were
followed during this process.
Firstly, the finalized Urdu synset with its specific
POS, relevant details of the concept definition,
example sentence and a unique ID was entered
into the Urdu WordNet application. During Urdu
synset finalization it was verified that all the
senses of a specific synset were distinct (different
from each other) and comprehensive (i.e. embody
precise and adequate detail) for concept
Next, the verified Urdu senses were looked up in





the dictionaries for all the possible translations.
Based on this lookup, at least three candidate
words were selected for possible mapping.
Once the English candidate terms are generated,
the POS category of each of their senses is
carefully analysed. For example, an Urdu sense
depicting a state in its concept definition would
be mapped to the noun.state sense of the English
word rather than the noun.act or noun.artifact senses
of the same word, for consistency.
Once an English sense is finalized for mapping,
its PWN sense ID is recorded against the
particular Urdu sense. The following table shows
the process.
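The category check and the recording of the final link described in these steps can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, and all IDs are placeholders rather than real Urdu WordNet or PWN 2.1 identifiers.

```python
from dataclasses import dataclass


@dataclass
class EnglishSense:
    sense_id: str   # placeholder ID, not a real PWN 2.1 sense ID
    word: str
    category: str   # lexicographer category, e.g. "noun.state"


def filter_by_category(candidates, wanted_category):
    """Keep only candidate senses whose category matches the Urdu concept."""
    return [s for s in candidates if s.category == wanted_category]


# Three hypothetical senses of the same English candidate word
candidates = [
    EnglishSense("EN-0001", "rest", "noun.state"),
    EnglishSense("EN-0002", "rest", "noun.act"),
    EnglishSense("EN-0003", "rest", "noun.artifact"),
]

alignment = {}  # Urdu sense ID -> English sense ID
matches = filter_by_category(candidates, "noun.state")
if len(matches) == 1:
    alignment["UR-0001"] = matches[0].sense_id
print(alignment)  # {'UR-0001': 'EN-0001'}
```

In the process described above the final choice is made by a human; a filter of this kind would only narrow the candidates presented for that decision.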


Urdu is morphologically richer than English, as
it has morphological devices, such as inflection, that
change verbs into their causative forms. Causativization
[20] is a process in which a new argument is added,
which changes the meaning of the verb. In Urdu, infixes
like ( l-) and ( v-) create verb causatives. Verbs in
Urdu are categorized into three forms:
(i) Intransitive Verb / la:z m /, (ii) Transitive Verb
/ mu t d / and (iii) Di-transitive Verb / mu t
d l mu t d /. In most cases, (i) represents
the root verb, while (ii) and (iii) represent its
causatives, as shown in Table 2.

Table 1: Urdu to English sense mapping
Columns: Urdu Word | Urdu POS | Selected Eng | Eng Concept | Eng Sense ID
Surviving cells: peace, repose, | the absence of mental stress or anxiety | state; free | the state during the absence of



Morphological issue: Causative difference between Urdu and English

Table 2: Examples of Urdu root verbs and their causatives
s :n : (root verb)







As shown in the table above, ( s :n :/ sleep) is a
root verb and its causative is ( sl:n:/ to make
someone sleep). In contrast, morphological causatives
are not found in English. Therefore, during the
WordNet linkage, the causative verbs in Urdu could not
be mapped appropriately onto English verbs.
UWN Entry: <100795></ s :n : /sleep><N ><
/ n i:nd
::n:/ be asleep><
/b: s :n ::ht : h : / the baby wants to
PWN Entry: {00014762} <verb.body> (be asleep)

Lastly, the selected candidate word is entered in
the Urdu-English alignment utility. The utility
displays all the senses of the selected word, and
the chosen sense is then selected to complete
the mapping process.

Thus, s :n : maps onto sleep. However, no possible word
for sl:n: could be found in PWN.

Alignment challenges and proposed


Similarly, ( pkn:/ squeeze) is a root verb that
changes to ( pk:n:/ compressed) due to
causativization, and in the process it also changes its
meaning. Furthermore, it was also observed that at
times the Urdu root verb is passive whereas its
causative remains active. In this case, the causative
maps directly onto the English word. For example, the
causative ( b:n/ to play) of the base verb ( bn:/
During the alignment of UWN to PWN 2.1, the
challenges faced were those of equivalence. These issues
can be broadly categorized as syntactic, morphological
and semantic differences. The following section
discusses these alignment challenges and proposes
solutions for alignment.



automatic play) is mapped onto Play <01710937>, whereas
the base word automatic play (bn:/)
remains unmapped. A similar phenomenon can be
observed in other Urdu verbs like ( pkn/get
squeezed) and ( btn:/ get distributed).
These issues can be handled through VerbNet.
VerbNet associates the semantics of a verb with its
syntactic frames, and combines traditional lexical
semantic information, such as thematic roles and
semantic predicates, with syntactic frames and
selectional restrictions. Therefore, such causative verbs
can be clustered in semantically coherent classes. A verb
lexicon based on VerbNet can be linked to

putting an affect

Even though complex predicates structurally
consist of two words, syntactically and semantically
they behave like single constituents. Another example of
this issue is (br:bri) N + ( krn ) V.
The UWN to PWN alignment challenge arises
when a noun in Urdu always gives the meaning of a
verb. It then becomes unclear whether to map it to an
English verb or an English noun. As a solution, such
N/Adj+V constructions can be aligned with WordNet
by adopting either a list based approach or a rule based
approach. However, complex predicates are considered
highly productive with respect to their combinatorial
possibilities. This means it is impossible to construct a
static list of N/Adj+V combinations [23] [24]. In this
scenario, it is useful to investigate the actual syntactic
and semantic characteristics behind complex predicate
formation [24]. Thus the rule based approach is
recommended. Using the rule based approach, heuristics
are drawn from the semantic and syntactic features of
the N/Adj + V constituents in a complex predicate.
These generalizations are then used to predict the
nature of these complex N/Adj+V constructions on the
basis of the semantic features of the nouns or adjectives
Another alignment challenge arises from complex predicates, as Urdu employs different types of complex predicates to express its full range of verbal predication [21] [22]. Two types of complex predicates, noun+verb and adj+verb, were common in the data that could not be mapped.
In N+V and Adj+V complex predicates, the noun or adjective contains the predicational content, whereas the verb is usually referred to as the light verb [23]. For example, ( f: krn:/ to disclose) and ( srnd:zho: na:/ to influence) are complex predicates in which nouns or adjectives require a verb to denote their complete meaning. They do not give complete meanings in isolation. In the examples given above, ( ho: na) and ( krn) are used to convey the complete meaning; thus ( f) N will always be used with /V and ( srnd:z) Adj will always be used with /V. This is presented in Table 3 below.




Urdu Concept

Urdu Example

k si: i:zko:
z:hr j : j :
krne: k: ml

s ne: pn :
r:zsbprf: krdj

the act of

Semantic Issues

The following sub sections present detail of the

semantic challenges faced in alignment of UWN to

Table 3: The Case of Complex Predicates




predicates in Urdu causing POS mismatch in alignment



hm:re: mlkki:
:bo:h v:
grmh: ldi:
srnd:z ho: t i:
our countrys climate
is hot, it affects


Single Urdu concept for multiple PWN

During alignment, it was observed that some
Urdu words in a particular sense could be mapped to
multiple senses of a certain English word. For example,
UWN Entry: <100281></ bpn><
kmsnh:ne: ki: k:fj t , no: mri:>
is a noun in Urdu which can be accurately mapped on
the following two different senses of the word
childhood from PWN:
{14948030} <noun.time> (the time of person's life
when they are a child)

he has revealed his

secret to all



{14235403} <noun.state> (the state of a child between

infancy and adolescence)
This is because ( bpn) gives a generalized sense
of childhood.
Thus both noun.time sense and
noun.state senses of the word childhood can be
mapped. Similarly, UWN Entry: <100902 ><
/k:nt:/echidna> is a noun in Urdu which can be
accurately mapped on the following two senses of
Echidna in PWN:
1. {01853520} <noun.animal> (a burrowing
monotreme mammal covered with spines and having a
long snout and claws for hunting ants and termites;
native to New Guinea)
2. {01853149} <noun.animal> (a burrowing
monotreme mammal covered with spines and having a
long snout and claws for hunting ants and termites;
native to Australia)
This alignment challenge can be handled through one-to-many mapping of concepts. An Urdu sense that covers multiple PWN concepts, in terms of their relations and general meaning, can be aligned with all of those PWN senses.
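The one-to-many mapping described above can be sketched as a small data structure. This is a minimal illustration only, not the authors' alignment utility: the names `uwn_to_pwn`, `align` and `invert` are hypothetical, and the identifiers are the UWN and PWN IDs quoted in this paper (including the two senses of the verb glossed as "scared" that share one PWN synset).

```python
from collections import defaultdict

# Hypothetical in-memory alignment table: one UWN sense ID -> set of PWN synset offsets.
uwn_to_pwn = defaultdict(set)


def align(uwn_id, pwn_offset):
    """Record that a UWN sense is aligned with a PWN synset."""
    uwn_to_pwn[uwn_id].add(pwn_offset)


def invert(mapping):
    """Build the reverse view: PWN synset offset -> set of UWN sense IDs."""
    inverse = defaultdict(set)
    for uwn_id, offsets in mapping.items():
        for offset in offsets:
            inverse[offset].add(uwn_id)
    return inverse


# One Urdu sense mapped to several PWN senses (IDs as quoted in the paper).
align("100281", "14948030")  # childhood, noun.time
align("100281", "14235403")  # childhood, noun.state
align("100902", "01853520")  # echidna, first noun.animal sense
align("100902", "01853149")  # echidna, second noun.animal sense

# Several Urdu senses mapped to one PWN sense.
align("101339", "01762161")
align("101340", "01762161")

pwn_to_uwn = invert(uwn_to_pwn)
```

Storing sets on both sides keeps one-to-many and many-to-one alignments in the same structure, so neither case needs special handling.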

Blood relations
Urdu has distinct terms for blood relations. For example, nephew in PWN denotes a son of your brother or sister, whereas in UWN ( b:n :) means sister's son and ( bt i: :) is used for brother's son.
Similarly, niece is a daughter of your brother or sister in English, but ( b:ni:) is sister's daughter and ( bt i: i) is brother's daughter in Urdu.
Moreover, the concepts for brother's wife, called (b: bi:) in Urdu, and sister's husband, called (bno:i:) in Urdu, are absent from PWN 2.1.
These differences represent lexical gaps in the structuring of information in the case of blood relationships.
Relations with In-laws
Urdu lexicalizes the distinction between the blood relations of husband and wife. In English, however, only two senses exist for these relations, {09731744} <noun.person> a brother by marriage and {10444395} <noun.person> the sister of your spouse, whereas in Urdu ( s:l :) is used for wife's brother and two terms are used for husband's brothers, i.e. (e:th) elder brother of husband and ( de:v r) younger brother of husband. Also ( s:li:) is used for wife's sister and ( n n d) is used for husband's sister.
Maternal and paternal relations
Mapping maternal and paternal relationships posed another challenge, because English does not have specific concepts for these relationships. For example, ( ) younger paternal uncle, ( t :j :) elder paternal uncle, ( m:m :) maternal uncle, ( x :lu:) husband of mother's sister and (ph ph : ) husband of father's sister all have only one corresponding English sense in PWN, uncle {10575646} <noun.person> -- (the brother of your father or mother; the husband of your aunt).
Similarly, it is challenging to map other such relations like aunts, cousins, grandparents and grandchildren, where Urdu gives more than one sense for each of them based on gender and on paternal and maternal side relations.
This specific challenge of mapping personal relationships from Urdu to English can be resolved by constructing hypernymy linkage. This means that in the absence of an equivalent English concept, the nearest term capturing the sense would be assumed to be the hypernym of that concept and would be mapped to it. For example, and would be mapped to the English synset of uncle.
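The hypernymy linkage proposed above can be illustrated with a short sketch. It assumes a hypothetical `Link` record and `link_sense` helper that are not part of the paper; the only value taken from the text is the PWN offset of uncle, {10575646}. The relation labels "equivalent" and "hypernym" are likewise illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    uwn_gloss: str   # English gloss of the Urdu sense
    pwn_offset: str  # PWN synset the sense is linked to
    relation: str    # "equivalent" for direct linkage, "hypernym" for the fallback


def link_sense(uwn_gloss: str, exact_offset: Optional[str], nearest_offset: str) -> Link:
    """Link directly when an equivalent PWN synset exists, else to the nearest broader synset."""
    if exact_offset is not None:
        return Link(uwn_gloss, exact_offset, "equivalent")
    return Link(uwn_gloss, nearest_offset, "hypernym")


UNCLE = "10575646"  # PWN offset for uncle, as quoted in the paper

# The Urdu kinship senses have no exact PWN equivalent, so they fall back to uncle.
links = [
    link_sense("younger paternal uncle", None, UNCLE),
    link_sense("elder paternal uncle", None, UNCLE),
    link_sense("maternal uncle", None, UNCLE),
]
```

Recording the relation type alongside the target keeps the fallback links distinguishable from true equivalents when the alignment is reviewed later.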


Multiple Urdu concepts for single PWN

Another alignment challenge faced during UWN to PWN mapping was that multiple senses of a particular Urdu word could map onto a single English word. For example, the Urdu verb ( bdkn:/ scared) has two senses in UWN:
<101339> /
:nvrk:drkr j: bkr pi:he htn:/ animals
scared and retreats
and <101340>
se: drkrbdm:n
ho:nl ho: :n:/suddenly a man gets scared to
be skeptical
Here, both senses can be mapped on scared, {01762161} <verb.emotion> (cause fear in).
As a solution, it is proposed that both Urdu concepts be aligned to a single PWN concept to resolve such semantic issues.
Difference in personal relationship
Urdu organizes kinship terminology in classificatory terms, whereas English uses descriptive terms for relationships. Family relation hierarchies therefore differ between Urdu and English. This difference causes an alignment challenge because Urdu kinship terminology covers a wider array of relationships that do not have corresponding senses in PWN. These are explained in the following three types of relationships:



unmapped due to the unavailability of proper concept

in PWN. Similarly, there are certain utensils which only exist in Urdu, e.g. ( bh :l : ) large drum of clay which is used to store grains, ( d : I) a medium size wooden spoon used for cooking.

Differences in representation of utensils

It was observed during mapping that certain kitchen utensils form a category of words related to the food, cooking and eating habits of the indigenous culture. For example, ( b r t n) means kitchenware, utensils made of clay, metal or glass; equipment for cooking and eating. In this case, ( b r t n) represents a composite sense of various utensils, whereas a sense capturing this concept could not be found in PWN. This is similar to the concept cutlery (a composite of spoons, forks, etc.), for which there is no corresponding equivalent concept in Urdu.
Similarly ( d : ) is a culture specific sense that
implies (i) ( lki: k: b :
m : ) large wooden spoon, (ii)
( brt n s m : o:rb: v
:r: dst rx:npr nt e: h:) a bowl for curry,
( k si: be: brt n se: p:ni: nk:lne:
k: dndi: d :r pj : le: ki: klk: o:t : zrf)
a pot, which is used to extract water from any vessel.
However, PWN only gives a general concept of utensil, i.e. {04462854} <noun.artifact> an implement for practical use (especially in a household). Such issues can also be handled through direct linkage or hypernymy linkage. For example, the assumed hypernym of would be tableware (articles for use at the table (dishes and silverware and glassware)).



Differences in representation of fruits

There are many fruit names which are culture
specific and are discretely lexicalized in Urdu. For
example, ( k:ri: ) unripe mango fruit is commonly
used in Urdu. This issue can be handled by direct
linkage. For example, can be linked to English
synset mango.



Literal Concepts

There are many words in Urdu that are based on stereotypes and culturally inherited associations. Such metaphors do not hold true in all situations, as they are used as phrasal words. These also remained unmapped, as no parallel senses exist in English. Table 4 illustrates a few examples of these senses.

Un-mapped lexical and cultural senses

The different categories of alignment challenges discussed above can be resolved by adopting the proposed solutions; however, some Urdu senses still remain unmapped. This is because of the inevitable linguistic, cultural and semantic differences between Urdu and English. A few categories of the senses that remain unmapped are discussed below.



Urdu has borrowed many words from English. While mapping, it was found that the semantics of such English words has changed in Urdu usage, so that they no longer cover the same area of meaning as the originally borrowed word. For example, ( po:st/ any office or rank) is a word borrowed from English, but it could not be aligned to any of the PWN senses of the word Post.
Another example of the different semantic orientation of borrowed words is ( fsr/ an officer), one who has the right to give orders. The Urdu concept of this word is not available in any of the PWN senses of the <noun.person>, someone who is appointed or elected to an office and who holds a position of trust.
This change is an instance of semantic shift or progression, in which the usage of a word changes so that its current sense differs radically from its original meaning. Moreover, such words could not be mapped to a sense of a different English word.


borrowed words

Table 4: Example of missing literal concepts




bndu:qse: o:lj :
qt lkrde:n :

Cultural specific vegetables and

utensil names


There are a few vegetables which cannot be

mapped on any of the PWN senses as they only exist in
Urdu, e.g. ( s : ), ( bi: t H u: ) , remained


:r m :mu:li:
hrk t j: mb

f : ne: k hi: hml:

kj : :r dmnko:
bhu:nkrrkdj :

sb:h se: hi: ski:
:kh phkrhi: t hi:


Table 6: WordNet data

Un-categorized conceptual gaps

in Urdu and English


Total number of reviewed senses

There are many concepts in Urdu which remain

unmapped due to unavailability of corresponding
concepts in PWN. These concepts are of varied nature
thus, un-categorized and tabulated below.


prnd:ki: :n


:mj:n j:
su:fj:n: , mbt zl x
:s ki: nzr m :
t hzi:bse: r :


rpe: k : so:lhv:
hiss: o: qi:mt m :
e:k rpe:ke:
so:lhv: hisse: ke:
br:br ho:t : h:


Total number of unmapped senses


Within the total 1829 senses aligned, the following

table provided the total count of nouns, adjectives and

Table 5: Un-categorized conceptual gaps


Total number of UWN senses

Total number of senses aligned to PWN 2.0


Table 7: Count of Aligned Senses as per Parts of


mhe: vo: dni:

t rh: j:d h bhm:re:
kbu:t r ne: prv:zke: lj
e: prkh o:le:

Total count of Nouns, Adjective and Verbs from UWN

b:z:rift u: se:

Total number of Nouns


Total number of Adjectives


Total number of Verbs


1403 Urdu senses remained unmapped due to cultural, religious, semantic and linguistic differences. The percentage of unmapped senses is 39.79% (1403 of the 3526 reviewed senses), which is a substantial share. The issues behind the unmapped senses have already been discussed. On the basis of the proposed suggestions, these unmapped senses will be reviewed further, and continued research will attempt to align them to PWN 2.1. The work accomplished to date is available at the CLE website.

hmre: d: d: ke:
zm :ne: m do: :ne: ki:
roti: huv: krt i: t hi:


In the table above, ( prv:z/ flight) is an Urdu

concept depicting ( prnd:ki: :n /
flight of birds), which could potentially be mapped to
flight. However, it was observed that flight gives a
generic concept of flying, whereas Urdu WordNet
provides a specific concept for flight of birds which is
not available in PWN. Similar patterns are observed in
other Urdu words as well.

This paper reports the UWN to PWN mapping

methodology, issues and challenges while aligning
Urdu WordNet to PWN. It was observed that
morphological, syntactical, semantic and cultural issues
were a hindrance in accomplishing Urdu to English
mapping. However, possible solutions are suggested to
resolve these issues. Further research needs to be
conducted in hypernym relationship development and
Urdu VerbNet development in order to resolve the
alignment challenges for effective alignment.

Alignment Results


This work has been conducted through the

project, Essential Linguistic Resources project3,
supported through a research grant from DAAD,

The current status of the English-Urdu aligned senses is given in Table 6 below. During the alignment process, a total of 3526 Urdu senses from UWN were reviewed, out of which 1829 Urdu senses were aligned to PWN 2.0. This is shown in table





[1] P. Vossen, EuroWordNet: a Multilingual Database for Information Retrieval, workshop on Cross-language Information, Zurich.

[2] E. Pianta, L. Bentivogli, C. Girardi, MultiWordNet: In Proceedings of the First International Conference on Global WordNet, Mysore, India, 2002.

[3] V. Sornlertlamvanich, Review on Development of Asian WordNet, Japio year book, 2009.

[4] J. Ramanand, A. Ukey, B. Singh, P. Bhattacharyya, Mapping and Structural Analysis of Multi-lingual Wordnets, In Proceedings of IEEE, Bombay, 2007.

[5] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, Introduction to WordNet: An On-line Lexical Database, International Journal of Lexicography, Vol 3, No.4 (1990), pages 235-244.

[6] C. Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, 1998.

[7] C. Fellbaum, M. Palmer, L. Delfs, S. Wolf, Manual and Automatic Semantic Annotation with WordNet, In Proceedings of NAACL, Pittsburgh, 2001.

[8] P. Vossen, "EuroWordNet: A Multilingual Database with Lexical Semantic Networks." Kluwer Academic Publishers, Dordrecht, 1998.

[9] J. Daude, L. Padro, G. Rigau, Mapping Wordnets Using Structural Information. 38th Annual Meeting of the Association for Computational Linguistics, 2000.

[10] D. Cristea, C. Mihiala, C. Forascu, D. Trandabat, M. Husarciuc, G. Haja, O. Postolache, Mapping Princeton WordNet Synsets onto Romanian Wordnet Synsets, Romanian Journal of Information Science and Technology. Volume 7, 2004, (June, 27, 2014), pages 125-145.

[11] M. Shamsfard, A. Hesabi, H. Fadaei, N. Mansoor, A. Famian, S. Bagherbeigi, S. Assi, Semi Automatic Development of Farsnet; the Persian WordNet, In Proceedings of 5th Global WordNet Conference, Mumbai, India, 2010.

[12] J. Saraswat, S. Ripple, P. Goyal, P. Bhattacharyya, Hindi to English Wordnet Linkage: Challenges and

[13] Y. Bizzoni, F. Boschetti, R. Del Gratta, H. Diakoff, M. Monachini, G. Crane, The Making of Ancient Greek WordNet, In Proceedings of Language Resources and Evaluation Conference, Iceland, 2014.

[14] D. Leenoi, T. Supnithi, W. Aroonmanakun, Building a Gold Standard for Thai WordNet, In Proceeding of The International Conference on Asian Language Processing, Thailand, 2008. Available at:

[15] S. Thoongsup, K. Robkop, C. Mokarat, T. Sinthurahat, T. Charoenporn, V. Sornlertlamvanich, H. Isahara, Thai WordNet Construction, In Proceedings of the 7th Workshop on Asian Language Resources, Association for Computational Linguistics, 2009.

[16] M. Mortaza, H. Faili, "Automatic Persian WordNet Construction." In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, Beijing, China,

[17] R. Xu, Z. Gao, Y. Pan, Y. Qu, Z. Huang, An Integrated Approach for Automatic Construction of Bilingual Chinese-English WordNet, In proceedings of The Semantic Web, Heidelberg, 2008.

[18] J. Atserias, L. Villarejo, G. Rigau, Spanish WordNet 1.6: Porting the Spanish Wordnet Across Princeton Versions, In Proceedings of Language Resources and Evaluation Conference, Portugal, 2004.

[19] A. Zafar, A. Mahmood, F. Abdullah, S. Zahid, S. Hussain, and A. Mustafa, "Developing Urdu WordNet Using the Merge Approach", in the Proceedings of Conference on Language and Technology 2012 (CLT12), Lahore, Pakistan.

[20] S. M. J. Rizvi, Development of Algorithms and Computational Grammar for Urdu (Ch 4: Urdu Verbs Characteristics and Morphology), PhD thesis, Department of Computer & Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Islamabad, 2007.

[21] T. Mohanan, Argument Structure in Hindi, CSLI Publications, 1994.

[22] M. Butt, The Structure of Complex Predicates, PhD thesis, Department of Linguistics, Stanford University, 1995.

[23] M. Butt, T. Ahmed, Discovering Semantic Classes for Urdu N-V Complex Predicates, in proc. International Conference on Computational Semantics, UK, 2011.

[24] M. Butt, T. Bögel, A. Hautli, S. Sulger, T. Ahmed, Identifying Urdu Complex Predication via Bigram Extraction, in proc. COLING 2012, India, 2012.


CLE Urdu Books N-grams

Farah Adeeba, Qurat-ul-Ain Akram, Hina Khalid, Sarmad Hussain
Centre for Language Engineering, Al-Khawarizmi Institute of Computer Science,
University of Engineering and Technology, Lahore
firstname.lastname@kics.edu.pk, ainie.akram@kics.edu.pk
corpus cleaning and generation of N-grams are
discussed in this paper.

The paper presents the development of the first publicly available Urdu N-grams, extracted from different books. For the best representation of N-grams, a large Urdu corpus is collected from books covering different domains. The automatic cleaning of the 37-million-word corpus is discussed. Domain-wise N-grams are extracted, which can be used in different Natural Language Processing and Information Retrieval applications.

2. Literature Review
Considerable effort has gone into the development of structured, publicly available text corpora in different European and Asian languages. The majority of this work has focused on corpus development for English [8,3]. Researchers use these publicly available corpora to develop language models for different applications. Effort is now focused on the development of N-grams and annotated N-grams, so that language models can be made available to users. Google Web 1T 5-grams is a publicly available N-grams corpus collected from Google web books. Version 2 of this corpus contains more than 8 million books [6]. This corpus includes N-grams (up to 5-grams) annotated with part-of-speech tags.
A large Arabic text corpus is reported by the Leipzig Corpora Collection [7]. This corpus contains Arabic text from different countries, including Algeria, Bahrain, Egypt, Iraq, Jordan, Kuwait, Lebanon, Mauritania, Morocco, Oman, Palestine, Qatar, Sudan, Syria, Tunisia, United Arab Emirates and Yemen. The corpus is crawled from online news websites of these countries. The country wise distribution of this corpus is also
Persian is written in the Arabic script and shares the same complexities as Urdu, including writing style and word segmentation issues. AleAhmad et al. [5] report a standard text corpus for Persian called Hamshahri. This text corpus is crawled from the online website of the Hamshahri newspaper. The corpus is categorized into 82 different domains and contains a lexicon list of 417,335 words. Darrudi et al. [4] report character N-grams (up to 5-grams) computed on the Hamshahri text corpus.
Effort has also been made to develop Urdu text corpora. Ijaz and Hussain [2] report an Urdu text corpus of 18 million words crawled from

1. Introduction
A reliable, well balanced and sizeable corpus is important for the development of mature Natural Language Processing (NLP) and Information Retrieval (IR) applications. These applications rely on a language model, which represents the characteristics of a language. The N-gram model is one of the most explored and widely used probabilistic language models for developing such applications. Data sparsity normally appears if N-grams are computed from a corpus that covers limited contextual information of words. Hence, a large corpus is required that has rich contextual information of words, yielding a reasonably large number of N-grams with minimum data sparsity. In addition, a balanced corpus is required, which covers a reasonable range of domains for language coverage. Two widely used Urdu corpora [1,2] are reported in the literature. These corpora are extracted from an Urdu magazine and from news. CLE Urdu Digest [1] is a set of publicly available corpora ranging from 100K to 1M words that can be used for language modeling and N-gram extraction. These available Urdu corpora cover limited domains. In addition, the size of these corpora is not very large. Therefore, a new text corpus with reasonable domain coverage and size is collected and presented in this paper. The corpus distribution into domains, automatic

2 http://cle.org.pk/clestore/urdudigestcorpus1M.htm



two news websites. This corpus is collected from different news domains, including finance, culture, science, etc. It contains 104,341 unique words and is not publicly available due to licensing constraints. Urooj et al. [1] also report a 100K-word Urdu corpus, called the CLE Urdu Digest Corpus, extracted from Urdu Digest magazine. The extracted data is categorized into 13 different domains. Later, the same approach was used to develop a 1-million-word Urdu Digest corpus. The reported 100K Urdu corpus is also annotated with Part of Speech (POS) tags. All these Digest corpora are publicly available. These corpora are not sufficient to develop a well representing language model for Urdu. Hence there is a need to collect a large Urdu text corpus covering a diversity of domains so that a representative language model can be developed.

3.2. Domain Wise Corpus Distribution

The books belonging to the prose category are further analyzed to classify them into different domains. A complete manual pass has been carried out, and the prose books are categorized into 18 different domains: articles, biography, character representation, culture, foreign literature, health, history, interviews, letters, magazines, novels, plays, religion, reviews, science, short stories, travel and Urdu literature. The domain wise books information is given in the Results
3.3. Corpus Cleaning

Although the corpus is available in Unicode file format, some web based content still exists in the books. Therefore the books are further processed to remove erroneous text such as HTML tags, URLs and non-Unicode text. This raw corpus is then processed to extract words based on space tokenization. Analysis of the resulting word list shows that words are not properly space delimited in this corpus, so that a sub-word, or more than one word, is returned as a single Urdu word after tokenization. Some examples of erroneous words extracted from the raw corpus are listed in Table 1.

3. N-grams Development
For the development of N-grams, the first step is the acquisition of an Urdu text corpus that covers a diversity of domains. After acquisition, the corpus is segregated into two main genres, i.e. poetry and prose. Genre classification is done manually. Each genre specific corpus is cleaned based on the characteristics of Urdu and manual analysis of the books' content. After cleaning, the N-grams of the cleaned corpus are computed. The details of each process are given in subsequent sections.
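The final step of this process, computing N-grams from the cleaned corpus, can be sketched as follows. This is a minimal sketch that assumes the cleaned corpus is available as an iterable of lines and that words are separated by spaces after cleaning; the function names and the `max_n` parameter are illustrative and are not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(lines, max_n=3):
    """Count all N-grams of order 1..max_n, one line at a time."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.split()  # space tokenization of the cleaned text
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


# Toy example with placeholder tokens standing in for Urdu words.
example_counts = count_ngrams(["a b a b"], max_n=2)
```

Counting per line means N-grams never span line boundaries, which suits text where each line is a sentence or verse; a corpus without meaningful line breaks would need sentence segmentation first.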

Table 1: Examples of erroneous words



3.1. Corpus Acquisition

To address the need for coverage of various domains of Urdu text, Urdu books are crawled from the web. A total of 1,399 books are collected from an online Urdu library [9]. The licensing information of these books is unspecified; therefore the N-grams of this corpus are reported and released publicly under an institutional license. These books are available in Unicode format. In the first pass, each book is manually analyzed and categorized as a poetry book or a prose book. This classification is done by reading the content of the books. The percentage of content is used to categorize a book into a specific domain. After this manual analysis and categorization, a total of 861 books having 37,680,293 words belong to the prose category and 507 books having 309,486 words belong to the poetry category. During the distribution of books into different domains, 31 books were found to contain non-Urdu content; therefore these books are not considered for domain classification.


This analysis shows that N-grams extracted from this corpus would not give the desired information and would contain erroneous contextual information of words. To address this issue, the corpus must be cleaned so that spaces are properly inserted between words before N-grams are extracted. A cleaning tool3 is also available to assist the manual cleaning of Urdu text. Manual cleaning of these 37 million words is not feasible; therefore the extracted word list is analyzed and a semi-automatic corpus cleaning process is devised. After analysis of the word list and corpus, the following cleaning issues are identified. The automatic way to address each issue is also discussed in the subsequent sections.




3.3.1. Normalization. Urdu words which can be written using different sequences of Unicode characters also exist in the corpus, e.g. ۂ can be written as a single Unicode character (U+06C2) or as a combination of two characters (U+06C1 and U+0654). In the same way, ؤ can be written in two different ways, i.e. using two characters (U+0648 and U+0654) or a single character (U+0624). The extracted word list treats such variations as separate words because of the different Unicode values. Hence such issues need to be resolved through normalization of the Urdu text. A few examples of normalization issues are shown in Table 2.
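The kind of normalization described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' tool: it uses the standard Unicode NFC form to compose base-plus-hamza sequences (such as U+06C1 U+0654 into U+06C2), and a small hypothetical replacement table for look-alike letters such as Arabic kaf (U+0643) versus Urdu kaf (U+06A9), a replacement that Table 2 suggests was made. The actual replacement list used for the corpus is not given in the paper.

```python
import unicodedata

# Hypothetical look-alike replacements; only the kaf pair is taken from Table 2.
LOOKALIKE_MAP = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH (Urdu kaf)
})


def normalize_urdu(text):
    """Compose canonically equivalent sequences, then unify look-alike letters."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(LOOKALIKE_MAP)


# Two-character and single-character spellings collapse to one form.
heh_goal_hamza = normalize_urdu("\u06C1\u0654")
waw_hamza = normalize_urdu("\u0648\u0654")
```

NFC only covers sequences that Unicode defines as canonically equivalent, so letter substitutions such as the kaf pair have to be listed explicitly.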

such as , the ligature which ends with a joiner will be attached to the next ligature if the space is removed and ZWNJ is not inserted between the ligatures. Typists normally add a space between ligatures, which causes segmentation of a compound word into two words. This is due to unfamiliarity with ZWNJ (U+200C) and the unavailability of this character on keyboards. To address this issue, a separate list is prepared which contains the compound words having a space between ligatures. This space is automatically replaced with ZWNJ during the cleaning process.
The following categories of Urdu compound words are also identified and resolved automatically. The extracted word list is analyzed, different categories of compound word indicators are extracted and, against each category, the solution is
The compound words which are joined with are handled separately. All the words which start with the prefix are extracted from the corpus. This extracted list is manually analyzed to finalize the words which will be joined with the previous word in the corpus, e.g. . There are some words in Urdu which start with but are independently valid words, e.g. ; hence all such cases are removed from the extracted list. The finalized word list is applied in such a way that any word in this list is attached to the previous word by deleting the space between them, and if the previous word ends with a joiner then ZWNJ is inserted, e.g. is replaced with .
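The attach-to-previous-word rule above can be sketched as follows. This is a hedged illustration: the set of non-joining letters is an assumed, possibly incomplete list of Urdu letters that do not connect to the following letter, and the attach list and the example words are hypothetical stand-ins, since the paper's own word list is not reproduced in the text.

```python
ZWNJ = "\u200C"

# Assumed (possibly incomplete) set of letters that never join to the next letter:
# alef forms, dal, ddal, thal, reh, rreh, zain, jeh, waw forms, yeh barree, hamza.
NON_JOINERS = set(
    "\u0627\u0622\u0623\u062F\u0688\u0630\u0631\u0691"
    "\u0632\u0698\u0648\u0624\u06D2\u0621"
)


def join_words(left, right):
    """Join two words, inserting ZWNJ only if the left word ends with a joining letter."""
    separator = "" if left[-1] in NON_JOINERS else ZWNJ
    return left + separator + right


def attach_to_previous(tokens, attach_list):
    """Attach every token found in attach_list to the token before it."""
    result = []
    for token in tokens:
        if token in attach_list and result:
            result[-1] = join_words(result[-1], token)
        else:
            result.append(token)
    return result


# Hypothetical attach list with a single second element of a compound.
ATTACH = {"\u062F\u0627\u0631"}
ends_with_joiner = attach_to_previous(["\u0633\u0645\u062C\u06BE", "\u062F\u0627\u0631"], ATTACH)
ends_with_non_joiner = attach_to_previous(["\u0632\u0648\u0631", "\u062F\u0627\u0631"], ATTACH)
```

The ZWNJ keeps the two parts visually separate while making them a single token for space-based tokenization; when the first part already ends with a non-joining letter, no extra character is needed.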
The compound words which contain (Zeer-e-Azafat), i.e. between sub-words, are also handled automatically. A word which ends with ( Zeer-e-Azafat) is automatically attached to the next word in the corpus, e.g. . The attachment is done in such a way that if the word containing ( Zeer-e-Azafat) ends with a joiner, then ZWNJ is added between the indicated word and the next

Table 2: Examples of normalization issues found and the replacements made


U+06C1 U+0654)



(U+062C U+0630 U+0628


(U+062C U+0631 U+0627 U+0654


(U+062C U+0631 U+0623


(U+0644 U+0643 U+0645)

(U+0644 U+06A9 U+0645)

(U+0645 U+0634 U+06A9 U+0648


(U+0645 U+0634 U+06A9

U+0648 U+0670 U+06C3)

3.3.2. Aerabs. The extracted word list also contains words which are treated as separate words because of attached aerabs. In Urdu, aerabs are optionally used to give pronunciation guidance for words that share the same spelling but are used in different contexts. Urdu has a variety of words which are written using the same character sequence but have different meanings depending on the context in which they appear. Such words are pronounced differently, which is indicated using aerabs. In Urdu writing, aerabs are usually not used, and such words are distinguished by the context in which they appear, e.g. can be used in two different contexts, i.e. ( jg) and ( jg), but such words are normally written without aerabs, i.e. . Based on this analysis, aerabs are removed from the corpus.
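Aerab removal can be sketched with a single regular expression. The character range below (U+064B to U+0652, plus the superscript alef U+0670) is an assumption about which marks count as aerabs; the paper does not list the exact code points. The combining hamza above (U+0654) is deliberately left out of the range because it is part of a letter, not a pronunciation mark.

```python
import re

# Assumed aerab set: fathatan .. sukun (U+064B-U+0652) and superscript alef (U+0670).
AERAB_PATTERN = re.compile("[\u064B-\u0652\u0670]")


def remove_aerab(text):
    """Delete optional pronunciation marks so spelling variants collapse to one word."""
    return AERAB_PATTERN.sub("", text)


# The same two-letter word typed with and without a zabar (U+064E).
with_mark = "\u062C\u064E\u06AF"
without_mark = "\u062C\u06AF"
cleaned = remove_aerab(with_mark)
```

Because the zer (U+0650) also serves as the Zeer-e-Azafat indicator, a step that relies on that mark would have to run before this one, which is consistent with the order in which Table 6 lists the two steps.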

Table 3: Examples of seen Zeer-e-Azafat in corpus


Space Insertion and Omission.
Space insertion and space deletion issues also exist in this corpus and are handled separately. In Urdu text, spaces are often not properly inserted between words. The space deletion issues deal with the cases where a space is to be deleted inside a compound word and, if required, a Zero Width Non-Joiner (ZWNJ) is used to preserve the valid word shape. In Urdu, words


A word which ends with ( Yay-e-Azafat) can be a sub-word of a compound word, e.g. is a sub-word of . Therefore a complete list of the words which end with ( Yay-e-Azafat) is extracted from the corpus. This list is manually analyzed and finalized. There are Urdu words which



end with ( Yay-e-Azafat) but are not part of a compound word, e.g. . Therefore, all such words are removed from the extracted list. The resulting ( Yay-e-Azafat) word list contains all those sub-words which will be joined with the next word in the corpus by deleting the space. Some examples of ( Yay-e-Azafat) words are given in Table 4.

separately on complete corpus and then the next step is

Table 6: Automatic cleaning process


Table 4: Examples of Yay-e-Azafat found in the





Space insertion issues also exist in the corpus and
are discussed below. Urdu words which end with
( HAMZA) must be followed by a space. Because
(HAMZA) is a non-joiner, a space is usually not typed
after ( HAMZA) before the next word. The
(HAMZA) is a clear indicator of a word boundary, e.g.
and therefore a space is inserted after
(HAMZA) to separate the next word.
Special symbols and punctuation marks are also
handled automatically: a space is inserted before and
after each special symbol and punctuation mark so that
it cannot be attached to any Urdu word.
Normally a space is not added between an Urdu word
and a digit (Latin or Urdu digit) e.g. 8. To resolve
this issue, the corpus is processed and a space is
inserted between digits and Urdu characters/letters. In
the same way, Latin words and Urdu words are often not
separated by a space e.g. txt. Therefore a space is
inserted between Latin and Urdu letters. Some examples
of such space insertion issues are given in Table 5.
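The space insertion rules described above can be sketched with a few regular expressions. This is only an illustrative sketch, not the cleaning application used in the paper: the function name `insert_spaces`, the Unicode ranges chosen for "Urdu letter" and "Urdu digit", and the assumption that HAMZA means the standalone character U+0621 are all assumptions made for this example.

```python
import re

# Assumed character classes (not taken from the paper):
# Arabic-script letters commonly used in Urdu, and Urdu digits U+06F0-U+06F9.
URDU_LETTER = r"\u0621-\u063A\u0641-\u064A\u0671-\u06D3\u06D5"
URDU_DIGIT = r"\u06F0-\u06F9"
HAMZA = "\u0621"  # assumption: standalone hamza


def insert_spaces(text):
    """Insert the missing spaces described in the text (illustrative only)."""
    # 1. A word-final hamza marks a word boundary: add a space after it
    #    when it is directly followed by another Urdu letter.
    text = re.sub(f"({HAMZA})(?=[{URDU_LETTER}])", r"\1 ", text)
    # 2. Separate digits (Latin or Urdu) from Urdu letters, in both orders.
    text = re.sub(f"(?<=[{URDU_LETTER}])(?=[0-9{URDU_DIGIT}])", " ", text)
    text = re.sub(f"(?<=[0-9{URDU_DIGIT}])(?=[{URDU_LETTER}])", " ", text)
    # 3. Separate Latin letters from Urdu letters, in both orders.
    text = re.sub(f"(?<=[{URDU_LETTER}])(?=[A-Za-z])", " ", text)
    text = re.sub(f"(?<=[A-Za-z])(?=[{URDU_LETTER}])", " ", text)
    return text
```

Text that is already correctly spaced passes through unchanged, because every rule only fires where two character classes touch directly.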


Remove space between " " words list and previous word in the corpus and add ZWNJ where required.
Apply Normalization
Automatically insert ZWNJ between the words by using the cleaned list of ZWNJ insertion between words
Add space between special symbols and punctuation marks but not within Latin words
Join word with next word which exists in ( Yay-e-Azafat) word list
Join word with next word which exists in ( Zeer-e-Azafat) word list
Remove All Aerab
Separate Latin digits from Urdu
Separate Urdu Unicode from Latin
Separate Urdu digits(-)from Urdu characters using space
Add space after ( HAMZA)

3.4. Poetry Corpus Cleaning

As with the prose corpus, the poetry corpus is
processed to remove HTML tags, URLs, and non-Unicode
letters. Manual analysis of the corpus shows that the
poetry corpus cannot be cleaned using the automatic
cleaning application. Therefore, poetry books are
cleaned manually using the following cleaning

Table 5: Examples of space omission in corpus as in case of numerals and Urdu text







After careful analysis, it has been observed that the

suggested solution to address these cleaning issues
must be applied in an order so that proper words can be
extracted. Therefore the order of solutions which are
applied in automatic cleaning application is listed in
Table 6. The automatic cleaning application is
developed in such a way that each step is performed


The introductory section is removed from poetry books so that only poetry text remains in the book.
The prose portion dedicating the respective poem to someone is also removed.
Extra symbols such as ***** are removed.
A carriage return is inserted after each Verse ( ) and Couplet ( ) so that these can be separated automatically.
Footnotes are also removed.

After this manual cleaning, the poetry books

contain only poetry text so that these can be further
processed to extract the poetry N-grams.



Table 8: Domain wise corpus distribution of prose

3.5. N-grams Extraction

N-grams give useful information about a corpus which
can be used in different NLP applications. In this
paper, the N-grams are extracted from the prose and
poetry corpora separately. N-grams are extracted at
unigram, bigram and trigram levels for words and
ligatures.
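As an illustration of the extraction step, the following minimal sketch counts unigrams, bigrams and trigrams over a whitespace-tokenized line. The function names are hypothetical and whitespace tokenization is an assumption. The same counting function could be applied to a sequence of ligatures once a ligature segmenter (not shown here) has produced one.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def word_ngrams(line, max_n=3):
    """Return {n: Counter} for n = 1..max_n, using whitespace tokenization."""
    tokens = line.split()
    return {n: ngram_counts(tokens, n) for n in range(1, max_n + 1)}
```

For a corpus, the per-line counters can simply be added together, since `Counter` objects support `+` and `update`.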

4. Results
The crawled corpus is categorized into poetry and
prose genres. After manual corpus cleaning, the poetry
corpus information such as number of books, poets,
verses, words and unique words is given in Table 7. A
total of 309,486 words are collected from poetry.
Table 7: Poetry corpus size

Number of Books
Number of Poets
Number of Verse
Total Words
Unique Words


The prose is manually classified into 18 subdomains:
articles, biography, character representation, culture,
foreign literature, health, history, interviews, letters,
magazines, novels, plays, religion, reviews, science,
short stories, travel and Urdu literature. The corpus is
automatically cleaned using the process discussed above.
The number of books, words and unique words of each
domain are given in Table 8. A total of 37,680,293 words
are collected from the prose domain.
The N-grams are extracted from the cleaned poetry
corpus. The number of N-grams computed at each level is
given in Table 9.
The automatically cleaned prose corpus is further
processed to compute N-grams. Two different types of
N-grams are computed from prose: (1) N-grams computed
from the complete prose corpus and (2) N-grams computed
from each classified subdomain of prose. Ligature
N-grams and word N-grams are computed for each category.
The information about word N-grams and ligature N-grams
computed from the complete prose corpus is given in
Table 10 and Table 11 respectively. The domain-wise
information about word N-grams is given in Table 12.










Table 9: Poetry corpus N-grams




Table 10: Words N-grams




Table 11: Ligatures N-grams



Table 12: Domain wise N-grams
















5. Discussion and Future Work

To generate representative N-grams of Urdu, a
37-million-word Urdu corpus is collected and categorized
into two main domains. The automatic cleaning of prose
resolves many of the word segmentation errors in less
time. The N-grams extracted from prose give reasonable
contextual information for words and ligatures, but some
low-frequency errors remain which require a manual
cleaning pass.
Higher-order N-grams such as 4-grams and 5-grams are
also useful for advanced NLP applications such as
machine translation systems. These are left as future
work.
The word-level cleaning of the poetry is not
performed. Hence, to obtain more accurate N-grams, the
poetry corpus needs to be cleaned so that proper word
boundaries can be defined. In future, the analysis of the
poetry corpus will be carried out and some
semi-automatic way of cleaning poetry corpus will be
For future work, the collected text corpus will be
annotated with POS tags so that POS-tagged N-grams can
be extracted and reported.

6. Conclusion

In this paper, a 37-million-word corpus is processed
and cleaned, and a semi-automatic cleaning process is
devised for such a large corpus. The categorization of
the corpus into prose and poetry is also discussed. To
ensure the diversity of the text corpus and to extract
domain-specific N-grams, the prose is further
categorized into 18 different domains. In future, a
POS-tagged layer will be added to generate POS-tagged
N-grams. These N-grams can be used in any Urdu Natural
Language Processing and Information Retrieval
application. The presented Urdu books N-

7. Acknowledgement
This work has been conducted through the project,
Urdu Nastalique OCR supported through a research
grant from ICTRnD Fund, Pakistan.

8. References
[1] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen, R. Parveen,
"CLE Urdu Digest Corpus", in Proc. Conference on
Language and Technology (CLT12), Lahore, Pakistan. 2012
[2] M. Ijaz, S. Hussain, "Corpus Based Urdu Lexicon
Development", in Proc. of Conference on Language
Technology (CLT07), University of Peshawar, Pakistan,
[3] G. Kennedy. (1998). "An introduction to corpus
linguistics". Addison Wesley Longman Ltd. 1998
[4] E. Darrudi, M. R. Hejazi, F. Oroumchian. "Assessment of a
modern farsi corpus." in Proc. of the 2nd Workshop on
Information Technology & its Disciplines (WITID). 2004.
[5] Ale. Abolfazl, A. Hadi, D. Ehsan, R. Masoud, O.
Farhad, "Hamshahri: A standard Persian text collection."
Knowledge-Based Systems 22.5 (2009): 382-387.
[6] Y. Lin, J. B. Michel, E. L. Aiden, J. Orwant, "Syntactic
annotations for the google books ngram corpus." In Proc.
ACL 2012 System Demonstrations. Association for
Computational Linguistics, 2012.
[7] T. Eckart, F. Alshargi, U. Quasthoff, D. Goldhahn.
"Large Arabic Web Corpora of High Quality: The
Dimensions Time and Origin." In Workshop on Free/Open-Source
Arabic Corpora and Corpora Processing Tools
Workshop Program. 2014.
[8] F. Mayer, Corpus Linguistics: Introduction. (1st
edition), Cambridge University Press, 2002, Retrieved (06,
25, 2012).




Accent Classification among Punjabi, Urdu, Pashto, Saraiki and Sindhi
Accents of Urdu Language
UET, Lahore

Saad Irtza
UET, Lahore

Mahwish Farooq
UET, Lahore

Sarmad Hussain
UET, Lahore

Automatic Speech Recognition (ASR) is a key
component in Human Computer Interaction (HCI)
applications. Stability of ASR systems largely depends
on accent, gender, age of speakers, background noise
and channel variations. In this paper, a study has been
conducted to classify five different accents of Urdu
language spoken in Pakistan i.e. Punjabi, Urdu,
Pashto, Saraiki and Sindhi. Speech data has been
collected from native speakers of these accents. The
five accents have been classified using mel frequency
cepstral coefficients (MFCCs) and formant features.

1. Introduction
The objective of an automatic speech recognition
system is to convert speech into text. The performance
of ASR systems depends on the training data on which
acoustic models have been trained. Speaker dependent
(SD) ASR systems perform better than speaker
independent (SI) ASR systems, as the training speech
samples in SD ASR systems cover most of the possible
acoustic variations of the targeted speakers. The
acoustic variations in speakers are mainly due to age,
gender and regional accents. Among these factors,
accent is the leading factor that contributes to a
higher error rate in ASR systems.
Accent refers to the articulation pattern that a
speaker follows to produce a particular sound and it is
also related to the speaker's first language, which
affects the speaker's perception and production of
speech. Dialect and accent do not cause much trouble in
communication among humans but give poor recognition
results with ASR systems. An ASR system trained on a
particular accent gives poor accuracy on cross-accent
speech data. It is a challenging task to build an ASR
system which gives the best accuracy for all accents of
a language. ASR systems rely on pronunciation
dictionaries, which are based on the pronunciation of
words followed by a single accent group. Different
accents give rise to multiple pronunciations of a word
due to the use of different phonemes. But the addition
of multiple pronunciations in the ASR dictionary against
a single word generates additional confusions, resulting
in a higher recognition error rate [1].
The most widely used approach to address the accent
variation problem in speech recognition systems is to
build multiple accent dependent ASR systems. In order
to send accented speech to the accent-specific speech
recognizer, this approach requires a pre-processing step
of accent classification.
In different geographical regions of Pakistan, 59
languages are spoken. Based on the accent and native
language, there are six prominent accents used in
Pakistan, namely Urdu, Punjabi, Pashto, Saraiki,
Sindhi and Balochi [2]. In the past, major research has
been done on Urdu literature or grammar but no major
work has been done on the acoustic analysis of the
different accents of Urdu. Although less than 8% of
Pakistanis speak Urdu as their first language, it is
spoken and understood as a second language by almost all
Pakistanis. In learning a second language, vowel
articulation is the key factor to take care of [4]. In
phonetics, a vowel is a sound least obstructed by any of
the articulators during its articulation, and vowels are
generally called the directors of the sounds [5].
Therefore, in our experiments we use vowels to classify
five different accents of Urdu language.
The rest of the paper is structured as follows:
Section 2 describes the past techniques used to classify
accents of different languages, Section 3 details the
two methodologies followed in this paper to classify
five different accents of Urdu language, Section 4
reports the results, Section 5 discusses the results and
Section 6 concludes the paper.

2. Literature Review
The performance of speech recognition systems has
improved significantly over the last few years as this
field has received much attention from researchers. In
the past, many experiments have been performed in order
to classify Chinese, Indian, Korean and American
English accents.
Recent research shows that the native accent
pronunciation dictionary, used in ASR systems, can be
transformed to the accented speech dictionary by
simply using knowledge of the native language of the
foreign speakers. The use of an accent adapted
dictionary to classify different Chinese and English
accents reduces speech recognition error rate by 13.5%
[7, 11]. Acoustic level adaptation techniques such as
MLLR and an integration of both PDA and MLLR are also
performed to trace the detailed acoustic changes in the
speech of speakers due to change in speaking speed
and style [11]. Results in the paper showed that both
techniques are complementary.
Another approach used to identify Beijing,
Shanghai, Guangdong and Taiwan accents from a
multi-accent Mandarin corpus, consisting of male and
female speakers of each accent, is by using a set of
Gaussian mixture models (GMMs) [8, 11]. The GMMs are
used to estimate the probability that the input speech
comes from a particular gender and accent. This approach
shows that performance of accent dependent systems is
generally better than that of accent independent
systems.
Multivariate statistical analysis is also used to
classify different accents [9]. The corpus used for the
experiments consists of 4925 telephone utterances of
American English spoken by native speakers of 23
different languages. The study shows that
classification results of support vector machine are
better than other classifiers like ZeroR, Naive Bayes,
Logistic Regression, SMO and K-Nearest Neighbors
using MFCC, FBank and LPC features.
In order to classify Chinese, Indian, Korean and
Croatian accents of English speakers, Q factor is
introduced [10]. The paper reports that clustering of
different accents based on formants does not help to
classify accents with high accuracy.
Different geographic regions of Sweden that share
common accents are grouped using MFCC feature
vectors [6]. The Swedish Speech data FD5000
telephone speech database was used for these
experiments and different regions are clustered using
Bhattacharyya distance.
The above discussion shows that a lot of work has
been done to classify Chinese and English accents but
no research has been carried out before to classify
different accents spoken in Pakistan. In this paper we
propose a method to classify five widely spoken
accents of Urdu language in Pakistan i.e. Punjabi,
Urdu, Pashto, Saraiki and Sindhi.
3. Methodology
In this paper, we have classified Punjabi, Urdu,
Pashto, Saraiki and Sindhi accents of Urdu language
by using vowels. Two experiments, based on two different
acoustic features, have been conducted to classify the
above mentioned five accents. The vocabulary of the
experiments consists of 139 district names of Pakistan.
Proper nouns are chosen because they are language
independent words and are familiar to the speakers of
all five languages. Urdu is the national and official
language of Pakistan; therefore, district names can also
be considered to be words/phrases of Urdu language. The
training and testing corpus consists of 760 utterances
from native speakers of the above mentioned five
languages. The speaker distribution is listed in Table 1.
Table 1: Speaker distribution
Accents Male Speaker Female Speaker
The speech is recorded over telephone channel at
the sampling rate of 8 kHz with random background
noise.
In Urdu, there exist 13 vowels: three short, three
medial and seven long [3]. In our experiments, we used
eight vowels, three short and five long, having
sufficient data for analysis. The list of eight vowels
along with Urdu letters, CISAMPA and IPA symbols
is given in Table 2.



Table 2: Urdu vowel list with IPA and CISAMPA symbols

The data of each of the eight vowels listed above is
given in Table 3.
Table 3: Vowel data statistics
Punjabi Urdu Pashto Saraiki Sindhi
109 295

In Experiment 1, we used formant features, which
have been reported to be sensitive features for
classifying accents of Chinese language [10], to classify
accents of Urdu language. Preliminary analysis of the
first and second formant frequencies (F1 and F2) shows
differences in characteristics of vowels. Formants are
calculated from the midpoint of each vowel, to get a
standard value, using Praat, a package for analysis of
speech in phonetics. The mean and standard deviation of
F1 and F2 for five accents of each vowel have been
calculated. Each accent of a vowel is compared with the
remaining four accents of the same vowel and with all
vowels of the Urdu accent, to find out how much these
accents and vowels differ from each other.
In Experiment 2, we have used mel frequency
cepstral coefficients (MFCCs) and formants (F1 and
F2) as acoustic features to classify five different
accents of Urdu language by calculating the distance
between them. MFCCs are commonly used as features in
speech recognition. They are also common in speaker
recognition, which is the task of recognizing people
from their voices. The middle frame of 100 samples of
each vowel is used to calculate MFCCs. These cepstral
features are calculated using the Hidden Markov Model
Toolkit (HTK), a portable toolkit for building and
manipulating hidden Markov models. Only the first twelve
MFCCs have been used here to classify accents of Urdu
language. The MFCC data is first normalized using Weka
(a collection of machine learning algorithms for data
mining tasks) and then the mean MFCC vector for each
accent of a vowel has been computed.
Using the 12-dimensional mean MFCC vector, the
distance between two accents has been computed by the
Euclidean distance formula. For example, the distance
between Punjabi and Urdu accents (of a vowel of Urdu
language) is computed using equation 1.

\[
d(\mathrm{Pun},\mathrm{Urd}) = \sqrt{(\mathrm{mfcc1}_{PUN} - \mathrm{mfcc1}_{URD})^2 + \dots + (\mathrm{mfcc7}_{PUN} - \mathrm{mfcc7}_{URD})^2 + \dots + (\mathrm{mfcc12}_{PUN} - \mathrm{mfcc12}_{URD})^2} \qquad (1)
\]

The distance between Punjabi and Urdu accents of a
vowel of Urdu language using formants has been
calculated using the two-dimensional Euclidean distance
formula given in equation 2.

\[
d(\mathrm{Pun},\mathrm{Urd}) = \sqrt{(F1_{PUN} - F1_{URD})^2 + (F2_{PUN} - F2_{URD})^2} \qquad (2)
\]

The distances calculated using these two equations
are then compared and analyzed in the following sections.
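The distance computation of equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the dictionary layout (vowel label mapped to a mean feature vector) are assumptions. The same `euclidean` function serves both the 12-dimensional MFCC case and the 2-dimensional (F1, F2) case, and `accent_distance` sums the per-vowel distances over all vowels, as is done for the overall accent distances reported in the results.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def accent_distance(features_a, features_b):
    """Sum of per-vowel distances between two accents.

    Each argument maps a vowel label (e.g. "A", "AA") to the mean
    feature vector of that vowel for one accent (hypothetical layout).
    """
    shared = features_a.keys() & features_b.keys()
    return sum(euclidean(features_a[v], features_b[v]) for v in shared)
```

For example, `euclidean([0, 0], [3, 4])` evaluates to `5.0`, the familiar 3-4-5 case of equation 2.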

4. Classification Results
The classification results of Experiment 1 are shown
below in graphs. Each vowel with the five accents is
plotted in a separate graph along with the remaining
vowels of the Urdu accent, and each plot shows the
standard deviation of formants (F1 and F2) of a
particular vowel from their mean.
Figure 1 shows the standard deviation of F1 and
F2 of short vowel "A" (of the above mentioned five
accents) from their mean along with other vowels of
Urdu language.

Figure 1: Comparison of A vowel of 5 accents and all vowels of Urdu

Figure 2 shows the standard deviation of
formant frequencies of long vowel "AA" (of the above
mentioned five accents) from their mean along with the
remaining vowels of Urdu language.

Figure 2: Comparison of AA vowel of 5 accents and all vowels of Urdu

Figure 3 shows the standard deviation of formant
frequencies of long vowel "AE" of five accents from
their mean along with other vowels of Urdu language.

Figure 3: Comparison of AE vowel of 5 accents and all vowels of Urdu

Figure 4 shows the standard deviation of
formant frequencies of short vowel "I" (of the above
mentioned five accents) from their mean along with the
remaining vowels of Urdu language.

Figure 4: Comparison of I vowel of 5 accents and all vowels of Urdu

Figure 5 shows the standard deviation of
F1 and F2 of long vowel "II" (of Punjabi, Urdu,
Pashto, Saraiki and Sindhi accents) from their mean
along with the remaining vowels of Urdu language.

Figure 5: Comparison of II vowel of five accents and all vowels of Urdu

Figure 6 shows the standard deviation of formant
frequencies of "OO" vowel of five accents from their
mean along with the remaining vowels of Urdu language.

Figure 6: Comparison of OO vowel of 5 accents and all vowels of Urdu

Figure 7 shows the standard deviation of formants
of "U" vowel (of five accents) from their mean along
with the remaining vowels of Urdu language.

Figure 7: Comparison of U vowel of 5 accents and all vowels of Urdu

Figure 8 shows the standard deviation of formant
frequencies of "UU" vowel of five accents from their
mean along with the remaining vowels of Urdu language.

Figure 8: Comparison of UU vowel of 5 accents and all vowels of Urdu

In Experiment 2, distances calculated by the Euclidean
formulas have been used to classify accents. Table 4
shows the distance between different accents,
computed by summing the respective distances of these
accents over all vowels.
Table 4: Distance between different accents of
Urdu language

See Appendix for distance between different
accents of each vowel.

5. Discussion
Figures 1 to 8 of Experiment 1 show that using
F1 and F2, all vowels of the Urdu accent can be
classified, with the exception of short vowel I, which
is confused with long vowels II and AE of Urdu
language (shown in all figures 1 to 8).
On the other hand, it is clear from the above
graphs that using F1 and F2, it is difficult to classify
Punjabi, Urdu, Pashto, Saraiki and Sindhi accents of A,
AA, AE, I, II, OO, U and UU vowels.
Table 4 of Experiment 2 shows that the distance
calculated between two accents using MFCCs is
always greater than the distance calculated using
formant frequencies. Therefore, it can be concluded
that on the basis of distance calculated using MFCCs,
the probability of an accent being confused with other
accents is minimal. As given in Table 4, the calculated
distance between "Urdu and Saraiki" accents based on
formant frequency values is 0.402592 while based on
MFCC values it is 1.514395, which is about 3.8 times
greater than that calculated using formant frequencies.
Therefore, MFCCs can be used to classify Punjabi,
Urdu, Pashto, Saraiki and Sindhi accents of Urdu
language.



6. Conclusions
The results of the two experiments show that the
two-dimensional formant features F1 and F2 are not
sufficient to classify Punjabi, Urdu, Pashto, Saraiki
and Sindhi accents of Urdu language spoken in different
geographical regions of Pakistan. Therefore, there is a
need to explore more dimensions of speech data. This
need has been addressed by using twelve-dimensional
Mel-frequency cepstral features, which retain accent
related information of the speaker, to classify the
above mentioned five accents. Our results show that MFCC
vectors can be used to classify Punjabi, Urdu, Pashto,
Saraiki and Sindhi accents of Urdu language.

7. Acknowledgement
This work has been conducted through the project,
Enabling Information Access for Mobile based Urdu
Dialogue Systems and Screen Readers supported
through a research grant from ICTRnD Fund, Pakistan.

8. References
[1] M. Lincoln, S. Cox and S. Ringland, "A comparison of
two unsupervised approaches to accent identification",
ICSLP, Dec , 1998.
[2] T. Rahman, "Language, ideology and power : language
learning among the Muslims of Pakistan and North India",
Oxford University Press, Karachi, 2002.
[3] S. M. Saleem, S. Anjum and J. Jalibi, "Oxford Urdu
English Dictionary", Oxford University Press, Lahore, 2013.
[4] A. Ali, "Pakistani Pronunciation of RP vowels: An
Exploratory Study ", ELT, NUML, Lahore.
[5] IPA, "Handbook of the International Phonetic
Association: A guide to the use of the International Phonetic
Alphabet", Cambridge University Press, Cambridge, July 8,
[6] G. Salvi, "Accent Clustering in Swedish Using the
Bhattacharyya Distance", in proc. International Congress of
Phonetic Sciences (ICPhS), Barcelona, 2003.
[7] LIU Wai Kat and Pascale FUNG, "Fast Accent
Identification and Accented Speech Recognition", IEEE,
[8] T. Chen, C. Huang, E. Chang and J. Wang, "Automatic
Accent Identification using Gaussian Mixture Models", in
Proc. ASRU, 2001.
[9] P. Chen, J. Lee and J. Neidert, "Foreign Accent
Classification", CS 229, Fall 2011.
[10] D. Stantic and H. Jo, "Accent Identification by
Clustering and Scoring Formants", World Academy of
Science, Engineering and Technology, Vol: 63, Mar. 25,
[11] C. Huang, T. Chen and E. Chang, "Accent Issues in
Large Vocabulary Continuous Speech Recognition",
International Journal of Speech Technology 7, Pg: 141-153,

9. Appendix
The distance between "A" vowel of Punjabi, Urdu,
Pashto, Saraiki and Sindhi accents of Urdu language
calculated using formants and MFCCs is given in Table 5.
Table 5: Distance between "A" vowel of
different accents
MFCC Formants
0.25810 0.062430506
0.29642 0.112110215
Table 6 contains the distance between "AA" vowel
of Punjabi, Urdu, Pashto, Saraiki and Sindhi accents of
Urdu language.

Table 6: Distance between "AA" vowel of
different accents
MFCC Formants
0.068566 0.0151051
Punjabi-Pashto 0.091284 0.020927132
Punjabi-Saraiki 0.11807 0.049725722
Punjabi-Sindhi 0.258101 0.062430506
0.112472 0.035993314
0.155985 0.037388876
0.232043 0.075529297
Pashto-Saraiki 0.170231 0.067898458
0.27159 0.047876065
0.296429 0.112110215

Table 7 contains the distance between "AE" vowel
of Punjabi, Urdu, Pashto, Saraiki and Sindhi accents of
Urdu language.



Table 7: Distance between "AE" vowel of
different accents
MFCC Formants
Punjabi-Pashto 0.09128
Punjabi-Saraiki 0.11807
Punjabi-Sindhi 0.25810 0.062430506
Pashto-Saraiki 0.17023
0.29642 0.112110215
The distance between "I" vowel of Punjabi, Urdu,
Pashto, Saraiki and Sindhi accents of Urdu language is
given in table 8.

Table 8: Distance between "I" vowel of
different accents
MFCC Formants
Punjabi-Saraiki 0.22327
The distance between "II" vowel of Punjabi, Urdu,
Pashto, Saraiki and Sindhi accents of Urdu language is
given in table 9.

Table 9: Distance between "II" vowel of
different accents
0.086322 0.013677
0.087848 0.044865
0.223272 0.055401
0.209288 0.058085
0.125695 0.054643
0.100599 0.031453
0.274814 0.066827
The distance between different accents of long
vowel "OO" is given in table 10.
Table 10: Distance between "OO" vowel of
different accents
MFCC Formants
0.08632 0.013677
The distance between different accents of short
vowel "U" is given in table 11.

Table 11: Distance between "U" vowel of
different accents
MFCC Formants
0.08632 0.013677
0.06215 0.05581
0.10059 0.031453
The distance between different accents of long
vowel "UU" is given in table 12.

Table 12: Distance between "UU" vowel of
different accents
MFCC Formants
0.08632 0.013677
0.06215 0.05581
0.10059 0.031453


Text Processing for Urdu TTS System

Rida Hijab Basit
Center for Language Engineering,

Sarmad Hussain
Center for Language Engineering,

Natural Language Processing plays an important
role in any Text to Speech (TTS) system. The raw text
given as input to TTS may consist of numbers, dates,
time, acronyms or symbols. NLP processes the raw text
and converts it into a form that can be used by TTS to
generate its corresponding speech. NLP consists of
three parts: "Text Processing", "Text Annotation" and
"Phonological Annotation". This paper enhances
earlier work and details the text processing in NLP
from the perspective of Urdu and also reports the
results given by NLP.

1. Introduction
A Text to Speech (TTS) system for any language
takes a sequence of words as input and converts it into
speech. The accuracy of a TTS system lies in the
intelligibility and naturalness of the speech it produces.
A Text to Speech system can be divided into three parts:
Natural Language Processing, Text Parameterization
and Speech Generation [1].
Natural Language Processing normalizes the raw
input text and converts it to its corresponding phonetic
transcription. Text Parameterization then uses this
phonetic transcription to generate certain numeric
parameters. Based on these parameters, the Speech
Generation module synthesizes the corresponding speech.
The raw text given as input to a TTS system can be
of any form. It may consist of numbers, time, dates,
symbols and any miscellaneous characters. Therefore,
before converting it into speech it must be converted to
some form that can be spoken by the TTS system. For
this purpose, raw text first undergoes the process of
Natural Language Processing.
In the past, many researchers have proposed
different NLP frameworks for several languages. NLP
in some Romanian TTS has been divided into three
parts: text pre-processing, text normalization and
phonological processing [2]. It is capable of extracting
sentences, paragraphs, abbreviations, numerals, phone
numbers, time and punctuation. It then syllabifies and
marks stress positions on the words. NLP for many
English TTS systems also applies Part Of Speech (POS)
tags to the tokens [3, 4].
NLP for some Tamil TTS systems only handles
abbreviations, acronyms and numbers [5]. Numbers
include ordinary numbers, phone numbers, dates, time
and currency figures. It also handles some foreign
language words that the input text may contain.
Text to Speech systems for Urdu also use NLP
before generating the digital speech signal. NLP can be
divided into three parts: Text Processing, Text
Annotation and Phonological Annotation. This paper
focuses on Text Processing and is an extension of work
done in [6].
The rest of the paper is structured as follows:
Section 2 describes the NLP architecture, Section 3
details the Text processing in NLP from the Urdu
language perspective, Section 4 reports the results and
Section 5 discusses the results whereas Section 6
concludes the paper.

2. NLP Architecture
Natural Language Processing can be divided into
three categories - Text Processing, Text Annotation
and Phonological Annotation, as mentioned in the
previous section. Text Processing converts the raw text
that may consist of numbers, dates, time or symbols
into a simple text string [1]. Text annotation adds
morphological, syntactic and semantic information to
the input text, for example, assigns a grammatical tag
to each word in a text string representing its Part of
Speech (POS).
Finally, this annotated string undergoes
phonological processing, which annotates further
information including phonetic transcription (either
through letter to sound rules [7] or looking-up
pronunciation lexicon), syllable, stress and intonation.
The high level architecture diagram for NLP is shown
in Figure 1.

Figure 1: High level architecture diagram

The shaded portion in Figure 1 is discussed
in detail in this paper.

3. Text Processing Module

The Text Processing module takes raw input from the
user and converts it into normalized text. Raw input
may contain any form of text. The input text
undergoes sentence segmentation as shown in Figure 1.
These segmented sentences are then converted to 8-bit
Urdu Zabta Takhti (UZT) [8] format for internal
processing.
Text processing mainly consists of tokenization,
semantic tagging and text generation. The tokenization
module tokenizes the incoming sentences into words or
equivalent units and sends them to the semantic tagger.
The semantic tagger analyzes multiple tokens and, where
necessary, labels them as number, time, date, text, or
some other category. Each tagged token is then passed
to the text generation module, which converts the token
into its corresponding text string based on its label.
The flow diagram for the Text Processor is shown in
Figure 2.
The Text Processor module in NLP must be able to
convert numbers, symbols, time, date and
miscellaneous strings to text, which can then be
converted to speech by any TTS. This complexity of the
Text Processor module has been handled and is
discussed in detail in the following sections.

Figure 2: Text processor module

3.1. Sentence Segmentation

The input string undergoes the process of sentence
segmentation. This module breaks the input string into
sentences when it encounters a full stop, question mark,
line feed or carriage return character. In addition, if a
sentence is very long, it is segmented at a hard limit of
400 characters (slightly shorter strings are produced
where needed so that the cut falls on a word boundary).
The whole input text is first broken into sentences and
then sent to the other modules for further processing.
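The sentence segmentation step (Section 3.1) can be illustrated with the minimal sketch below. It is not the module described in the paper: the function names are hypothetical, the boundary characters are assumed to be the Urdu full stop (U+06D4), the Arabic and Latin question marks, line feed and carriage return, and the boundary characters themselves are discarded, which the paper does not specify. Overlong sentences are cut at the last space at or before the 400-character limit.

```python
import re

MAX_LEN = 400  # hard limit stated in the text

# Assumed boundary set: Urdu full stop, Arabic/Latin question mark, CR, LF.
_BOUNDARY = re.compile("[\u06D4\u061F?\r\n]+")


def split_long(sentence, max_len=MAX_LEN):
    """Cut an overlong sentence at word boundaries, at most max_len chars each."""
    parts = []
    while len(sentence) > max_len:
        cut = sentence.rfind(" ", 0, max_len + 1)
        if cut <= 0:          # no space available: fall back to a hard cut
            cut = max_len
        parts.append(sentence[:cut].strip())
        sentence = sentence[cut:].strip()
    if sentence:
        parts.append(sentence)
    return parts


def segment(text, max_len=MAX_LEN):
    """Split raw text into sentences, then enforce the length limit."""
    sentences = []
    for chunk in _BOUNDARY.split(text):
        chunk = chunk.strip()
        if chunk:
            sentences.extend(split_long(chunk, max_len))
    return sentences
```

The ASCII full stop is deliberately left out of the assumed boundary set so that decimal numbers such as 12.5 are not split at this stage.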

3.2. Conversion to Urdu Zabta Takhti (UZT)

The segmented sentence, initially in Unicode
format, is then converted to UZT format. A conversion
map that holds the corresponding UZT symbol for each
Unicode symbol is used to convert the sentence into
the desired format.
This conversion is needed because UZT takes fewer
bytes than Unicode characters, which makes processing
faster and easier. Conversion is limited to the
characters which are available in UZT. Other
characters are ignored at this time, except ASCII
digits, which are mapped onto Urdu digits. The
converted sentence is then passed on to the
tokenization module.
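
A minimal sketch of this map-based conversion is given below. The UZT code values are not listed in this section, so the table is passed in by the caller and any values used with it are placeholders, not real UZT 1.01 code points. The Urdu digits are taken to be the Unicode Extended Arabic-Indic digits (U+06F0 to U+06F9), which is also an assumption.

```python
# ASCII digits are first mapped onto Urdu digits (assumed U+06F0..U+06F9).
URDU_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ASCII_TO_URDU_DIGIT = str.maketrans("0123456789", URDU_DIGITS)


def to_uzt(sentence, table):
    """Convert a Unicode sentence to 8-bit codes using a conversion map.

    `table` maps one Unicode character to one 8-bit code. Characters that
    are missing from the table are ignored, as described in the text.
    """
    sentence = sentence.translate(ASCII_TO_URDU_DIGIT)
    return bytes(table[ch] for ch in sentence if ch in table)
```

For example, with a hypothetical table `{chr(0x06F1): 0x31, "\u0627": 0x81}`, the input `"1\u0627x"` yields two bytes: the ASCII digit is mapped to its Urdu form before lookup, and the unmapped `x` is dropped.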

3.3. Tokenization
The tokenization module separates words in the
input string according to space and the punctuation,
including ( ) ' " ! : / - etc. It also contains a few rules
for specific cases, for example, to separate a number
and text joined together like the string 12 into


Proceedings of the Conference on Language & Technology 2014

separate tokens, or to treat a decimal point between
digits differently from an end of sentence marker.
However, space is not a reliable cue for
tokenization. According to Naseem et al., 75% of the
errors in Urdu corpus are due to spaces [9]. Therefore,
Akram et al. [10] proposed a statistical technique to
address this issue, based on Urdu ligature and word
n-grams. The current work will be extended to include
this algorithm in the future.
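
The rule-based part of tokenization can be approximated with a single regular expression. The sketch below is an illustration under assumptions, not the module described in the paper: it keeps a decimal number as one token, splits digits from adjoining letters, and emits each punctuation mark as its own token. The range U+064B to U+0652 (plus U+0670) is added so that Arabic-script diacritics stay attached to their word.

```python
import re

TOKEN = re.compile(
    r"\d+\.\d+"                               # decimal number as one token
    r"|\d+"                                   # run of digits (any script)
    r"|(?:[^\W\d_]|[\u064B-\u0652\u0670])+"   # run of letters and diacritics
    r"|[^\w\s]|_"                             # one punctuation mark or symbol
)


def tokenize(sentence):
    """Split a sentence into digit, word and punctuation tokens."""
    return TOKEN.findall(sentence)
```

A date such as 22/10/2010 comes out as five tokens here; grouping them back into one semantic unit is left to the tagging stage.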

3.4.2. Number Processor. Number processor handles

both English and Urdu digits in the input string. It
checks the next tokens of the numbers and based on
this, tags them as whole numbers, fractional numbers,
time, date or miscellaneous strings.
Whole numbers are digit sequences that appear in
the input string without any context (relevant
next tokens), for example, 12.
Fractional numbers contain a '/', for example,
12/13. Such sequences are tagged as fractional
numbers. Similarly, sequences that have a ':' instead
of a '/', with digits within the hour and minute
ranges, for example, 12:03, are tagged as time.
For dates, it checks the context and obtains a
single semantic unit. Some of the examples include
(12-12-2001), (12/12/2001) and (12 June 2001).
The numbers which do not lie in any of the above
mentioned categories are tagged as miscellaneous
strings.

3.4. Semantic Tagger

It is necessary to determine how the tokens are to
be converted to text and read out. For example,
22/10/2010 should be read as a date ("Twenty
second October year two thousand ten") and not
symbol by symbol ("Twenty two slash ten slash
two thousand and ten"). This is done by first marking
the string "22/10/2010" as a single semantic unit and
tagging it as a date, whereas "22/10" is tagged as a
fractional number. The semantic tagger analyzes the
tokens and tags them as date, time, numbers (whole
numbers, fractional numbers and decimal numbers),
text, special symbols and miscellaneous strings [6], as
shown in Figure 2. This is done by considering each
token (separated in the tokenization phase) in its
context to see whether it can be grouped with its
neighbours into a larger semantic form.
Semantic tagger module, initially designed in [6],
has been extended to meet additional requirements for
Urdu TTS. The modules of the semantic tagger are
discussed in more detail below.
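
The tagging decision for numeric strings can be sketched as a short cascade of pattern checks. This is an illustrative approximation rather than the tagger of [6]: the function name, the label strings and the exact ranges (days 1 to 31, months 1 to 12, hours 0 to 23, minutes 0 to 59) are assumptions, and the input is assumed to be the already grouped token sequence joined into one string.

```python
import re


def tag_number_string(s):
    """Return a semantic label for a numeric token group."""
    if re.fullmatch(r"\d{1,2}([/-])\d{1,2}\1\d{2,4}", s):
        day, month = (int(x) for x in re.split(r"[/-]", s)[:2])
        if 1 <= day <= 31 and 1 <= month <= 12:
            return "date"
        return "miscellaneous"
    if re.fullmatch(r"\d{1,2}:\d{2}", s):
        hours, minutes = (int(x) for x in s.split(":"))
        return "time" if hours <= 23 and minutes <= 59 else "miscellaneous"
    if re.fullmatch(r"\d+/\d+", s):
        return "fraction"
    if re.fullmatch(r"\d+\.\d+", s):
        return "decimal"
    if re.fullmatch(r"\d+", s):
        return "whole"
    return "miscellaneous"
```

The date pattern is tested before the fraction pattern, so "22/10/2010" is tagged as a date while "22/10" falls through to the fraction rule. Python's `\d` and `int()` also accept Urdu digits, so the same checks apply to both digit sets.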

3.4.3. Decimal Number Processor. As mentioned

previously, tokenization module creates one token for
the decimal number in the input string. Decimal
number processor takes that decimal number as input
and tags it. Examples include . or . etc.
3.4.4. Special Symbol Processor. It handles symbols
and a date format that starts with a symbol. Symbols
include @, #, $ and %, among others, along with
some Urdu characters.
The date format tagged by this processor starts with
the Arabic sign Sanah followed by a year, for example
2001.

3.4.1. Text Processor. Text processor checks the
incoming token and its next tokens (context) to
determine if it is a date or text and then tags it
accordingly. Some of the date formats covered by this
module are (June 2001), (June year 2001),
(June 2001 A.D.), (June year 2001 A.D.),
(June 23, 2001) and (June 23, year 2001). It covers
both Urdu and English digits and the months of the
Islamic and Gregorian calendars. It also recognizes the
Arabic sign Sanah before the year and the Hijri or
Eeswin symbols after the year in a date.
The text which belongs to categories other than
date and time is considered miscellaneous by this text
processor module.

3.4.5. Miscellaneous String Processor. All the tokens

that contain colon (:), slash (/) or dash (-), in some
combination with numbers or texts, are considered
miscellaneous. These strings do not lie in any of the
above mentioned categories but can appear in the input
text. This processor tags such strings.
3.4.6. Punctuations. Semantic tagger also tags
punctuations in the input string. The punctuations that
are displayed in the output string include ! ' . This
is because these punctuations affect the speech and
must be there to ensure accurate speech synthesis.



3.5. Text Generation

Tagged tokens are then passed to text generation
module which converts them into their corresponding
UZT text equivalents [6]. String generation has been
divided into several different sub-modules depending
upon the categories as shown in Figure 2. These are
Number to Text Converter
Date to Text Converter
Time to Text Converter
Special symbol to Text Converter
Miscellaneous to Text Converter
These modules have been discussed in detail in the
following subsections.



3.5.1. Number to Text Converter. It deals with whole
numbers, decimal numbers and fractional numbers. The
numbers 0 to 99 have their UZT text equivalents
stored, which are referred to during conversion. The
basic pronunciations for "hundred", "thousand", "Lac",
"million", "billion", "by" (used in fractional numbers),
"decimal point", "half", "one third", "one fourth" etc.
are also stored for future reference.
It reads whole numbers from right to left and
pushes the text equivalent of each number onto a
stack. A counter checks where "hundred", "thousand",
"Lac", "million" or "billion" must be added and pushes
the required pronunciation onto the stack as well. At
the end, the stack holds the equivalent UZT text for
the whole number. The example for whole numbers is
shown in Table 1.

Table 1: Whole numbers in NLP (columns include
Transcription and Translation of Output)

Decimal numbers are also read from right to left
until a decimal point is encountered. The digits after
the decimal point are read one by one; for example,
123 in 12.123 will be converted to "one two three"
instead of "one hundred and twenty three". After their
conversion, the decimal point gets its equivalent
pronunciation ("decimal point"), whereas the rest of
the number (before the decimal point) is dealt with as
a whole number. Table 2 shows the examples for
decimal number inputs.

Table 2: Decimal numbers in NLP (columns include
Transcription and Translation of Output)

Fractional numbers are converted in the same way as
whole numbers except that they have a slash (/)
between them, which is given its text equivalent ("by",
used in fractions). This converter also recognizes
special fractions with text equivalents different from
the normal fractions. Special fractions include 1/2, 1/3,
1/4, 2/3, 2/4 and 3/4. For example, 1/2 is converted to
"half" instead of "one by two" in Urdu. Table 3 shows
examples for fractional numbers.

Table 3: Fractions with special pronunciations (columns
include Transcription and Translation of Output)

The numbers can have both Urdu and English

digits. Moreover, whole numbers can be represented in
the form "100,000". Due to this diversity, it covers 4
formats for whole numbers, 2 for decimal numbers and
2 for fractional numbers.
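
The right-to-left, stack-based reading of whole numbers can be sketched as below. This is an illustration only: English glosses stand in for the stored UZT strings, the grouping (two digits, then hundred, thousand, lac, crore) is an assumption based on the glosses in this section, larger units are left out, and only the three special fractions whose glosses are given in the text are included.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def two_digits(n):
    """Stand-in for the stored table of the numbers 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


# (digits in the group, unit word pushed for it), read from the right.
GROUPS = [(2, ""), (1, "hundred"), (2, "thousand"), (2, "lac"), (2, "crore")]


def whole_to_text(digits):
    digits = digits.replace(",", "")      # accepts forms such as "100,000"
    if int(digits) == 0:
        return ONES[0]
    stack, pos = [], len(digits)
    for width, unit in GROUPS:
        if pos == 0:
            break
        start = max(0, pos - width)
        value = int(digits[start:pos])
        pos = start
        if value:                         # empty groups push nothing
            if unit:
                stack.append(unit)
            stack.append(two_digits(value))
    if pos:
        raise ValueError("number too large for this sketch")
    return " ".join(reversed(stack))      # popping the stack gives the text


def decimal_to_text(s):
    whole, frac = s.split(".")
    spoken = " ".join(ONES[int(d)] for d in frac)   # digits read one by one
    return f"{whole_to_text(whole)} point {spoken}"


SPECIAL_FRACTIONS = {"1/2": "half", "1/3": "one third", "1/4": "one fourth"}


def fraction_to_text(s):
    if s in SPECIAL_FRACTIONS:
        return SPECIAL_FRACTIONS[s]
    numerator, denominator = s.split("/")
    return f"{whole_to_text(numerator)} by {whole_to_text(denominator)}"
```

Because each unit word is pushed before the value it belongs to, reversing the stack at the end yields the words in spoken order without any further bookkeeping.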

3.5.2. Date to Text Converter. Date to Text Converter

handles three different types of dates, which are:
D(D)-M(M)-Y(Y) & D(D)/M(M)/Y(Y)
D(D) Month-Text Y(Y) & Month-Text D(D),
Here, D(D) is the date, M(M) is the month in
numbers and Y(Y), a year. Month-Text is the month
already in Urdu string.
The D(D) should be in the range 1 to 31 whereas
M(M) should be in the range 1 to 12. Years before
2000 have a different corresponding Urdu string from
the years 2000 and later. For example, 1992 has
"nineteen ninety two" as its corresponding text
whereas 2002 has "two thousand and two". Date to
Text converter takes care of these two different formats
of years and gives the output accordingly.
Some formats of dates are shown in Table 4.

During text conversion, it also keeps track of
symbols like Arabic Sanah, Hijri and Eeswin which can
appear with the year in a date. Type 2 covers both
Islamic and Gregorian calendar months. Each type of
date can be represented with both Urdu and English
digits. They appear with different date symbols,
constituting different formats for each date type.
Type 1 has a total of 24 formats, Type 2 has 108,
whereas Type 3 consists of 10 different formats.
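
The two year readings can be sketched as follows, again with English glosses in place of the Urdu strings. The supported range (1000 to 2099) and the reading of round centuries as "nineteen hundred" are assumptions made for this sketch; the text only specifies the behaviour for years such as 1992 and 2002.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def two_digits(n):
    """Stand-in for the stored table of the numbers 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_to_text(year):
    if not 1000 <= year <= 2099:
        raise ValueError("year outside the range covered by this sketch")
    if year >= 2000:                       # read as a whole number
        rest = year - 2000
        return "two thousand" if rest == 0 else f"two thousand and {two_digits(rest)}"
    century, rest = divmod(year, 100)      # read as two pairs of digits
    if rest == 0:
        return f"{two_digits(century)} hundred"   # assumed reading
    return f"{two_digits(century)} {two_digits(rest)}"
```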
3.5.3. Time to Text Converter. It converts the hours
and minutes in a time just as whole numbers, whereas
the ':' in a time is replaced by the Urdu word used to
express time. It also checks the minutes in the time;
if it shows zero minutes then it only gives the
information about the hours. For example, in case of
10:30 it gives "ten thirty", whereas for 10:00 it just
gives "ten". This module covers 6 different formats for
time. Table 5 shows the examples for time.
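
A small sketch of the zero-minute rule is shown below, using English glosses. The `separator` parameter is a hypothetical placeholder for the Urdu word that replaces ':'; it defaults to an empty string so that the output matches the English glosses quoted above.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def two_digits(n):
    """Stand-in for the stored table of the numbers 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def time_to_text(value, separator=""):
    hours, minutes = (int(part) for part in value.split(":"))
    if minutes == 0:                  # zero minutes: only the hour is read
        return two_digits(hours)
    words = [two_digits(hours)]
    if separator:                     # word standing in for ':'
        words.append(separator)
    words.append(two_digits(minutes))
    return " ".join(words)
```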

Table 4: Date to text converter output (columns include
Transcription and Translation of Output)

Table 5: Time examples in NLP (columns include
Transcription and Translation of Output)


Table 7: Miscellaneous strings in NLP (columns include
Transcription and Translation of Output)

3.5.4. Special Symbol to Text Converter. Special
symbol to text converter handles 56 different symbols,
each of which is mapped onto its equivalent text by
this converter. The symbols covered by this converter
have already been discussed in Section 3.4.4. Some
examples of special symbols and their conversion are
shown in Table 6.

Table 6: Special symbols in NLP













4. Results
The Text Processor module has been tested with
real data. The data used for testing is taken
from Urdu newspapers, typed Urdu books, Urdu news
websites and Urdu digests. Sentences are selected from
each of the corpora and given to NLP. Each word is
checked to see whether it has given the desired output,
and the results are aggregated.
The coverage of formats for dates, time and
numbers depends on their frequency in the real data.
Time and decimal numbers have fewer occurrences in
the given corpora as compared to dates, whole numbers
and symbols.
In the testing process, 13297 words are covered,
which, along with text strings, contain numbers,
symbols and miscellaneous strings. Decimal numbers
and time found in the testing data have 100%
accuracy. Accuracies for whole numbers, dates and
miscellaneous strings are 99%, 91% and 93%
respectively. Accuracy for symbols is reduced to 50%.
This is because, for a lot of symbols, the output
is not correct and does not give the full equivalent
string, thus reducing the accuracy. Here, accuracy


3.5.5. Miscellaneous String to Text Converter. The
symbols (:), (/) and (-) in miscellaneous strings are
converted to their equivalent texts ("colon", "slash"
and "dash" respectively). Any text or number that
may be present in a miscellaneous string undergoes
the simple conversion of text or number described in
the previous sections. This module covers more than
142 different formats for miscellaneous strings. Some
examples are shown in Table 7.



determines the correctness of conversion of input string
to its Urdu format. Table 8 tabulates the results.

Table 8: Results
Moreover, there were some errors related to
tokenization and sentence segmentation. Tokenization
errors occurred due to typing errors and space issues in
the input file. Sentence segmentation errors resulted
because some of the sentences consisted of more than
400 characters and, as described in Section 3.1, NLP
segments a sentence if it contains more than 400
characters. This resulted in inappropriate sentence
segmentation.








6. Conclusion
This paper has discussed in detail the various steps
needed for converting a raw text string into a normalized
Urdu string. Raw input may consist of numbers, dates,
times or symbols that must be normalized by the text
processing module before being sent to the TTS system
for speech generation. The overall accuracy of the text
processing module is 90.5%, which is quite
acceptable. However, this is a work in progress
and some future goals are yet to be achieved. Future
goals include refining the NLP output and handling more
formats for each sub-module depending upon the
The majority of modules show accuracy above 90%, as
can be seen in Table 8. Issues encountered during the
testing phase are discussed in Section 5.

5. Discussion
The results given in Table 8 show sufficient
accuracy of NLP text processing module but there are
some issues that need to be resolved.
Space issues in the test data and invalid Unicode
code points resulted in inappropriate conversion of
whole numbers. In some cases there were no spaces
between a number and the adjacent text, resulting in a
wrong interpretation of the whole number. Some
of the numbers in the test data contained invalid
Unicode code points (not belonging to Urdu) which
were not recognized by NLP, thereby giving incorrect
output.
Similarly, some dates were not recognized because of
space issues: where a space was missing between two
parts of a date, NLP was unable to recognize the proper
date format. At another point, a date was not identified
because the English comma ',' was used instead of the
Urdu comma. Some of the other dates were used with
miscellaneous characters, due to which they were
tagged as miscellaneous strings, for example 1999 2000.
Symbols showing 50% accuracy are a major
concern. Several symbols were not
recognized properly during the testing phase. NLP
could not generate the whole text equivalent for these
symbols, which reduced the overall accuracy of this
module. For example, for one symbol it gave only the
text for "Arabic symbol" instead of the full equivalent
string.

7. Acknowledgement
This work has been conducted through the project
"Enabling Information Access for Mobile based Urdu
Dialogue Systems and Screen Readers", supported
through a research grant from ICTRnD Fund, Pakistan.

8. References
[1] Hussain S., "Phonological Processing for Urdu Text to
Speech System." Yadava, Y, Bhattarai, G, Lohani, RR,
Prasain, B and Parajuli, K (eds.) Contemporary issues in
Nepalese linguistics, 2005.
[2] Ungurean C., and Burileanu D., "An advanced NLP
framework for high-quality Text-to-Speech synthesis."
In Speech Technology and Human-Computer Dialogue
(SpeD). IEEE, 2011.
[3] Trilla A., "Natural Language Processing techniques in
Recognition.", 2009.



[4] Bhatt S., "Natural Language Processing with Text-to-Speech on Android", May 2011.
[5] Ramakrishnan A. G., Kaushil L. N., and Narayana. L.,
tur l
gu ge rocess g or
S. in Proc. of
the 3rd Language and Technology Conference, Poznan,
Poland, 2007.
[6] Hussain S., "Urdu localization project: Lexicon, MT and
TTS (ULP)." in Proc. of the Workshop on Computational
Approaches to Arabic Script-based Languages, Association
for Computational Linguistics, 2004.
[7] Hussain S., "Letter-to-sound conversion for Urdu
text-to-speech system." in Proc. of the Workshop on Computational
Approaches to Arabic Script-based Languages, Association
for Computational Linguistics, 2004.

[8] Hussain S., and Afzal M., "Urdu computing standards:

Urdu zabta takhti (uzt) 1.01." In Multi Topic Conference,
INMIC 2001. Technology for the 21st Century. Proceedings.
IEEE International, IEEE, 2001.
[9] Naseem, T., and Hussain, S., "Spelling Error Trends in
Urdu." in Proc. of Conference on Language Technology
(CLT07), University of Peshawar, Pakistan, 2007

[10] Akram M., and Hussain S., "Word segmentation for

urdu OCR system." in Proc. of the 8th Workshop on Asian
Language Resources, Beijing, China, 2010.



Urdu Keyword Spotting System using HMM

Saad Irtza, Khawer Rehman and Sarmad Hussain
Centre for Language Engineering,
Al-Khwarizmi Institute of Computer Science,
UET, Lahore, Pakistan
saad.irtaza@kics.edu.pk, sarmad.hussain@kics.edu.pk
vectors (512 point FFT) are extracted from speech and
acoustic vectors are prepared for each training sample.
In the decoding process, a sliding window is used to find the
distance between the acoustic vectors of the input speech file
and the acoustic vectors of the keyword. A vocabulary of
20,000 words has been used in this system. The testing data
set consists of 100 utterances from 14 male and female
speakers. The word error rate has been found to be 10.6%.
In recent years, mostly HMM based keyword spotting
techniques have been used [1][2][6][7][11]. In [1][6][7],
keyword spotting using a filler model is implemented. In the
filler model technique, non-keywords are modeled as fillers
while keywords are modeled [7]. A filler model can be built
at word level or phoneme level [1][2]. Different filler models
result in different hit and false alarm rates [1]. A Bengali
KWS has been developed on 12 keywords using the filler
modeling approach. The training data set consists of 350
utterances of keywords, and a subset of the TIMIT English
speech corpus has been used to develop the filler model. The
test data set consists of 240 speech utterances. The overall
accuracy of the system has been found to be 95.83%. The
performance of KWS has been improved by using phoneme
recognition in the first stage and, in the second stage,
searching for keywords using a phone lattice [4][5], an edit
distance algorithm [3] or a string searching algorithm [8].
These methods report good accuracy with a low false alarm
rate but require a large amount of training data covering the
whole vocabulary and are also computationally very
expensive [4]. A Spanish KWS on 80 keywords has been
developed using the Albayzin database. A confidence
measure method for keywords is implemented to decrease
false alarms. The hit and false alarm rates of the system
have been found to be 84.33% and 41.44% respectively.

This paper reports the development of an Urdu keyword
spotting system (KWS). The approach is based on filler
models to account for non-keyword speech intervals. The
impact of using different training datasets to develop the
filler models has been explored. In addition, a phoneme
recognizer (PR) based on an all-phone-model automatic
speech recognition (ASR) system has been developed on
keywords. Training and decoding parameters of the KWS
system have been tuned to get the optimum performance. In
the end, the KWS and PR systems are integrated and a string
matching algorithm is used to improve the performance of
the Urdu keyword spotter. The overall system accuracy is
94.59% on the data set used.
Keywords: Automatic speech recognition (ASR), Keyword
spotting system (KWS), Out of vocabulary words (OOV).

1. Introduction
Automatic Speech Recognition is a key component in
applications such as speech document retrieval (SDR) and
human-computer interaction via voice commands. Keyword
Spotting (KWS) is a technique used to decode only
particular words from continuous speech (Tejedor, 2006).
It is extensively used in large vocabulary ASR systems
which are subjected to out of vocabulary (OOV) words.
Generally, in dialogue systems users speak some extra words
besides the exact query [11]; therefore these ASRs often
encounter OOV words. KWS is used to spot the desired
words in continuous speech. For instance, in a weather
mobile service the user is instructed to speak the desired
district name to acquire its weather report, but in some
cases users speak complete sentences, e.g.
(I need to find the weather of Lahore). In such cases KWS
must spot Lahore in the input string. A good keyword
spotter should identify all the keywords and minimize the
false alarms, i.e. not decode non-keyword parts of speech as
keywords. This paper reviews some relevant work done on
KWS, followed by the experimental details and the results
of the current work on a system being developed for Urdu.

The performance of a dialogue system degrades because of
OOV words [11]. The objective is to develop a KWS to
address the OOV words and to spot keywords in
unconstrained Urdu speech with a high hit rate and minimal
false alarms.
The filler modeling technique has been implemented to
detect eight location names of district Lahore. Figure 1
shows the system architecture. It consists of a Keyword
Spotter (KWS) and a Phoneme Recognizer (PR). KWS is
implemented using filler modelling. An all-phone model is
implemented in PR. Speech input is processed by both KWS
and PR. The output of KWS is a stream of OOV words and
keywords, while PR outputs a string of phones. The Keyword
Detector (KWD) measures the confidence score of the
keywords spotted by KWS in the phoneme string decoded by
PR.

2. Literature Review
Different techniques have been deployed for keyword
spotting. A KWS has been developed on five keywords of
Urdu using word boundary detection [12]. The training and
testing data sets consist of 7500 and 3200 utterances of
isolated words respectively. The accuracy of the system has
been found to be 98.1%. The Sliding Model Method [8] has
been used to develop a KWS implemented with Dynamic Time
Warping (DTW) and HMM [10]. In this method, feature



training dataset has been used in this experiment to model filler
Figure 1: Architectural diagram

Table 1: Training datasets (training datasets of KWS and
datasets of PR; surviving labels: location names, all phone,
sampling rate)

Table 2 describes the data used to test the system trained on the
different training data sets.

Table 2: Testing datasets (row labels: keywords vocabulary size,
number of speakers, total utterances, sampling rate, duration in
minutes, sentence templates, language weight, word insertion
penalty)

Figure 2: HMM model of KWS

The Keyword Spotter is based on the Hidden Markov Model
(HMM). The acoustic model has been developed using the HTK
toolkit, and Julius is used to test the performance of the
acoustic model. In KWS, all the non-keywords are modeled as
fillers and transcribed at phoneme level. Keyword models are
for isolated words and transcribed at word level. The HMM
model of KWS is shown in Figure 2.
The Phone Recognizer (PR) is implemented using the CMU
Sphinx toolkit. A tri-phone based acoustic model has been
developed. The training data used in PR is the same as that
used for training of keywords. The Keyword Detector compares
the outputs of KWS and PR. It validates the presence of a
keyword by measuring its confidence in the output of PR. The
Bitap algorithm [2] is used for measuring confidence.
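
The confidence check performed by the Keyword Detector can be sketched as approximate matching of the keyword's phone sequence inside the phone string decoded by PR. The sketch below is not a Bitap implementation: it uses a plain dynamic-programming approximate substring search (Sellers' algorithm), which finds the same best match with errors without the bit-parallel trick. The confidence formula, the function names and the phone symbols are assumptions made for illustration; the 60% default is the threshold value reported in this paper.

```python
def best_match_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)        # a match may start anywhere in text
    for i, p in enumerate(pattern, 1):
        cur = [i]                       # pattern prefix against empty text
        for j, t in enumerate(text, 1):
            cur.append(min(prev[j] + 1,              # phone missing in text
                           cur[j - 1] + 1,           # extra phone in text
                           prev[j - 1] + (p != t)))  # match or substitution
        prev = cur
    return min(prev)                    # a match may end anywhere in text


def confidence(keyword_phones, decoded_phones):
    """Share of the keyword's phones recovered in the decoded string."""
    if not keyword_phones:
        raise ValueError("keyword must contain at least one phone")
    distance = best_match_distance(keyword_phones, decoded_phones)
    return max(0.0, 1 - distance / len(keyword_phones))


def confirm_keyword(keyword_phones, decoded_phones, threshold=0.6):
    """Accept a keyword spotted by KWS only if PR supports it."""
    return confidence(keyword_phones, decoded_phones) >= threshold
```

With a hypothetical five-phone keyword, one substituted phone in the PR output gives a confidence of 0.8 and the keyword is kept, while a PR output that shares no phones with the keyword gives 0.0 and the detection is rejected as a false alarm.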
In the first experiment, three different training data
sets have been used to model the filler words. The datasets used
are: 1) 49 location names of Lahore district, 2) 19 district names
of Pakistan, 3) continuous spontaneous speech with general
Urdu vocabulary coverage. Table 1 describes the details of the
training datasets. In experiment 2, different training and
decoding parameters have been tuned. The tuning
covers: 1) the number of HMM states of a keyword, 2) language
weight, 3) word insertion penalty. The best performance

Table 3 and Figure 3 give the recognition results of KWS
systems developed. The overall accuracy is higher when the
training set is from the same domain, but highest when general
Urdu corpus is used with larger amount of training data, giving
94.59% accuracy.


Table 3: Recognition results of KWS (columns: training datasets
of location names, district names and spontaneous speech)

Table 4 describes the recognition results of PR.

Table 4: Recognition results of PR (rows: test utterances,
accuracy (%))


Figure 3: Performance chart of KWS (series: Locations Data,
Districts Data, Dialogues Data; axis: number of keywords)

Figure 6: Effect of tweaking word insertion penalty on hit rate and
false alarm (axes: word insertion penalty, number of keywords)

Figure 4 shows the effect of varying number of states of
keywords on hit rate and false alarm.


Table 3 describes the hit rate, miss rate and false alarm on the
three different datasets. The false alarm in each dataset is the
same. Out of 37 utterances of keywords in 82 sentences, the miss
rate is maximum, i.e. 10 (27%), on the location names dataset and
minimum, i.e. 2 (5.4%), on spontaneous speech. The best hit rate
of 35 (94.59%) has been achieved on spontaneous speech. Figure 4
describes the effect of changing the number of states of
keywords on hit rate and false alarm. In the all-phone model, 5
states have been used for all phonemes. The keywords consist of
five to seven phonemes. It has been explored how many states are
required to model each keyword. Figure 4 shows that 15 states
are sufficient to model a keyword that consists of five to seven
phonemes.
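
The percentages quoted above follow directly from the counts over the 37 keyword utterances. A short check, with a hypothetical helper name:

```python
TOTAL_KEYWORD_UTTERANCES = 37  # keyword utterances in the 82 test sentences


def rate(count, total=TOTAL_KEYWORD_UTTERANCES):
    """Percentage of keyword utterances, rounded to two decimals."""
    return round(100 * count / total, 2)


best_hit_rate = rate(35)   # hits on spontaneous speech
max_miss_rate = rate(10)   # misses on the location names dataset
min_miss_rate = rate(2)    # misses on spontaneous speech
```

These evaluate to 94.59, 27.03 and 5.41, which the text reports as 94.59%, 27% and 5.4%. The 35 hits and 2 misses on spontaneous speech add up to the 37 keyword utterances.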


Figure 4: Effect of tweaking HMM states on hit rate and false alarm
(axis: number of HMM states)

Figure 5 shows the effect of varying language weight on hit rate
and false alarm.

Figure 5: Effect of tweaking language weight on hit rate and false
alarm (axis: language weight)

Figure 6 shows the effect of varying word insertion penalty on
hit rate and false alarm.

Figure 7: Effect of tweaking threshold value on hit rate
(axis: threshold value)


In the second experiment, the acoustic model has been
developed on the optimum value of number of states of keywords.
Figures 5 and 6 show the effect of tweaking the decoding
parameters, i.e. language weight and word insertion penalty.
They show that hit rate and false alarms increase with the
increase in language weight and word insertion penalty. Language
weight and word insertion penalty have been selected such that
hit rate is maximized. The false alarms have been reduced by
tweaking the KWD module. A keyword is considered correct if the
output of the Bitap algorithm reaches the minimum threshold
value. The threshold value is tweaked to minimize the false alarm
without affecting hit rate. Figure 7 shows the effect of tweaking
the threshold value on hit rate. The optimum value of threshold
comes out to be 60% for all keywords. False alarm has been
reduced from 16.2% to 5.4%.

It is concluded from this experiment that, to increase the
performance of an ASR system, the number of states of words or
phonemes should be tuned in the training process. Decoding
parameters have a significant effect on the performance of the
keyword spotter system. The performance of the string matching
algorithm also affects the accuracy of the keyword spotter.

[1] Tejedor, Javier, and José Colás. "Spanish keyword spotting
system based on filler models, pseudo N-gram language model
and a confidence measure." Proceedings of IV Jornadas de
Tecnología del Habla (2006): 255-260.
[2] S. Das and P. C. Ching, "Speaker Dependent Bengali
Keyword spotting in unconstrained English Speech", A Project
report, Indian Institute of Technology Guwahati, India, 2005.
[3] K. Audhkhasi and A. Verma, "Keyword search using
modified minimum edit distance measure", Proc. ICASSP 2007,
vol. 4, pp. 929-932, 2007.
[4] Lin, Hui, Alex Stupakov, and Jeff A. Bilmes. "Spoken
keyword spotting via multi-lattice alignment." INTERSPEECH.
[5] Lin, Hui, Alex Stupakov, and Jeff Bilmes. "Improving
multi-lattice alignment based spoken keyword spotting."
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009.
IEEE International Conference on. IEEE, 2009.
[6] Li, Weifeng, Aude Billard, and Hervé Bourlard. "Keyword
Detection for Spontaneous Speech." Image and Signal
Processing, 2009. CISP'09. 2nd International Congress on.
IEEE, 2009.
[7] Szöke, Igor, et al. "Phoneme based acoustics keyword
spotting in informal continuous speech." Text, Speech and
Dialogue. Springer Berlin Heidelberg, 2005.
[8] Silaghi, Marius-Calin. "Spotting Subsequences Matching an
HMM Using the Average Observation Probability Criteria with
Application to Keyword Spotting." Proceedings of the National
Conference on Artificial Intelligence. Vol. 20. No. 3. Menlo
Park, CA; Cambridge, MA; London; AAAI Press; MIT Press;
1999, 2005.
[9] Kim, Joo-Gon, Ho-Youl Jung, and Hyun-Yeol Chung. "A
keyword spotting approach based on pseudo N-gram language
model." 9th Conference Speech and Computer. 2004.
[10] Nitta, Tsuneo, et al. "Key-word spotting using phonetic
distinctive features extracted from output of an LVCSR
engine." ISCA & IEEE Workshop on Spontaneous Speech
Processing and Recognition. 2003.
[11] Wilpon, Jay G., et al. "Automatic recognition of keywords
in unconstrained speech using hidden Markov models."
Acoustics, Speech and Signal Processing, IEEE Transactions
on 38.11 (1990): 1870-1878.
[12] Juang, B. H. "Recent developments in speech recognition
under adverse conditions." First International Conference on
Spoken Language Processing. 1990.



HPSG Analysis of Arabic Verb Form Derivation

Md. Sadiqul Islam,
Bangladesh University of Engineering & Technology,
worked on Sign Based Construction Grammar (SBCG)
[2] version of HPSG.

Semitic languages exhibit rich nonconcatenative
morphological operations, which can generate a
myriad of derivational lexemes. In particular, the
feature-rich, root-driven morphology of the Arabic
language shows the construction of different trilateral
verb stems, which are called verb forms. Although
HPSG is a successful syntactic theory, it lacks a
treatment of morphology. In this paper, I propose a
novel HPSG representation of Arabic verb forms. I also
present the lexical type hierarchy and derivational rules
for generating these verb forms using the HPSG
framework.

In this paper, I propose and analyze the HPSG
constructs required for capturing the syntactic and
semantic effects of the rich morphology of Arabic verbs.
My contributions in this paper are as follows:

1. Introduction

I propose the type hierarchy of Arabic verb
lexemes. The proposed type hierarchy shows the
position of the different verb forms.
I propose lexical construction rules for the
formation of trilateral verb forms from roots.
I show an example of how the verb forms are
derived from the type hierarchy and construction
rules.

The rest of the paper is organized as follows. In
Section 2, I discuss the Arabic verbal system and
work related to this research. Section 3 presents an
overview of HPSG. I present my contribution
in Section 4, and Section 5 concludes this paper.

Semitic languages like Arabic, Amharic and Hebrew

exhibit rich morphological operations for construction
of lexicons. We can have a large coverage of
vocabulary in these languages by computational
linguistic modeling of their morphology. In this paper I
focus on Arabic verbs for morphological analysis.

2. Background
The Arabic verb system exhibits both concatenative and
nonconcatenative morphology, capable of lexically
expressing diverse syntactic and semantic phenomena.
The formalisms of existing morphological analyzers for
Arabic are not powerful enough to capture this higher
layer diversity. Here, I limit my discussion of verb
formation to the different types of trilateral verb stems
derived from trilateral sound roots. These stems are
termed verb forms by western scholars.

A. Arabic Verbal System

The Arabic language exhibits an extremely rich
morphology [3]-[4]. Both concatenative and
nonconcatenative operations take place in the
formation of an Arabic word. Inflection is carried out
by concatenative operations, whereas derivation is
carried out by nonconcatenative operations. In this
paper, I mainly focus on nonconcatenative operations
and give a mathematical formalism to capture their rich
diversity.

For modeling Arabic morphology, I have chosen
Head-driven Phrase Structure Grammar (HPSG) [1], which
is an attractive tool for capturing complex linguistic
constructs. HPSG is well suited to Natural
Language Processing (NLP) as it integrates all the
essential linguistic layers (Phonology, Morphology,
Syntax, Semantics, Context etc.) of NLP. I have

The Arabic verb is an excellent example of
nonconcatenative root-pattern based morphology. A
combination of root letters is plugged into a variety of
morphological patterns with a priori fixed letters and
a particular vowel melody, which generates a verb of a


Proceedings of the Conference on Language & Technology 2014

particular type which has some syntactic and semantic
information [5]. Figure 1 shows how different sets of
root letters plugged into one vowel pattern generate
different verbs with some common semantic and
syntactic properties.


2. Form II (Causative): kattaba (كتّب) He
   caused to write.
3. Form III (Ditransitive): kātaba (كاتب) He ...
4. Form IV (Factitive): aktaba (أكتب) He ...
5. Form V (Reflexive): takattaba (تكتّب) It was
   written on its own.
6. Form VI (Reciprocity): takātaba (تكاتب)
   They wrote to each other.
7. Form VII (Submissive): inkataba (انكتب) He
   was subscribed.
8. Form VIII (Reciprocity): iktataba (اكتتب)
   They wrote to each other.
9. Form IX (Color or bodily defect): iktabba (اكتبّ)
   No meaningful word is formed from root
   (k,t,b) for this form.
10. Form X (Control): istaktaba (استكتب) He
    asked to write.
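
The root-and-pattern derivation behind the ten verb forms can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HPSG analysis itself: the template notation (digits 1, 2, 3 for the root consonants), the names `FORM_TEMPLATES` and `apply_template`, and the choice of the vowel a on the middle letter of Form I (only one of the eight Form I subtypes) are all hypothetical. The transliteration follows the paper and does not mark the initial hamza of Form IV.

```python
# Hypothetical templates for the 3rd person masculine singular perfect.
# The digits 1, 2 and 3 stand for the first, second and third root consonant.
FORM_TEMPLATES = {
    "I": "1a2a3a",      # assumes the vowel "a" on the middle letter
    "II": "1a22a3a",
    "III": "1ā2a3a",
    "IV": "a12a3a",
    "V": "ta1a22a3a",
    "VI": "ta1ā2a3a",
    "VII": "in1a2a3a",
    "VIII": "i1ta2a3a",
    "IX": "i12a33a",
    "X": "ista12a3a",
}


def apply_template(root, form):
    """Plug a triliteral root into the template of the given verb form."""
    if len(root) != 3:
        raise ValueError("only triliteral roots are handled in this sketch")
    template = FORM_TEMPLATES[form]
    # Replace each digit by the corresponding root consonant, keep the rest.
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in template)


if __name__ == "__main__":
    for form in FORM_TEMPLATES:
        print(form, apply_template(("k", "t", "b"), form))
```

The same templates apply unchanged to any other sound root, for example `apply_template(("n", "s", "r"), "IV")`, which is the sense in which the pattern, and not the root, carries the shared syntactic and semantic information.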

Figure 1 Root-pattern morphology: 3rd person singular masculine sound perfect from same pattern

All these verb forms can be derived from the Form I
root verb, both syntactically and semantically. It is
worth mentioning that Form I has eight subtypes,
depending on the vowel on the middle letter in the
perfect and imperfect forms. Some types of verbal noun
formation depend on these subtypes. Any combination of
root letters for a Form I verb follows one of these
eight patterns. I refer to these patterns as Form IA,
IB, IC, ..., IH.

B. Related Works
Figure 2 Different vowel patterns applied to same root

HPSG analysis for nonconcatenative morphology in

general and for Semitic (Arabic, Hebrew and others)
languages in particular is relatively new [5], [7], [8],
[9], [10], [11], [12], [13].

Similarly, Figure 2 shows how the same set of root
letters plugged into different vowel patterns generates
two lexemes with completely different syntactic
information but related semantic meaning.

However, the intricate nature of Arabic morphology
has motivated several research projects addressing
these issues [3], [14]. HPSG representations of Arabic
verbs and morphologically complex predicates are
discussed in [5], [9] and [13]. An in-depth analysis of
declensions in Arabic nouns is presented in [12] and
[17]. In [12] and [15], an HPSG formalization of Arabic
verbal nouns and their construction from root verbs is
presented.

From any particular set of root letters, up to fifteen
different verb stems may be formed, each with its own
template. These stems carry different semantic
information. Western scholars usually refer to these
forms as Form I, II, ..., XV. Form XI to Form XV
are rare in Classical Arabic and even rarer in
Modern Standard Arabic. These forms are discussed in
detail in [6]. It should be mentioned that not all of
these fifteen forms exist for every triliteral root.

In [13], an overview of HPSG modeling of Arabic
verbs and the construction of the Arabic passive is
discussed. A complete AVM and the declension of the
Arabic verb are proposed in [16], where the
construction rules for verbal declension are also
discussed. Though [5] has given an overview of the
morphology of the

Here I give examples of each of the ten well-known
verb forms:

1. Form I (Transitive): kataba (كتب) He wrote.



The second derivation is grammatically wrong: the
grammar G does not capture accurate agreement
information. The basic problem with CFG is that its
terminals are completely non-informative. This leads to
Head-driven Phrase Structure Grammar (HPSG), a
constraint-based lexicalized formalism of natural
language.

Arabic verbal system, it did not show the derivation of
verb forms from the verb root. In this paper, I want to
capture the derivation of verb forms and model the
syntactic and semantic effects of this derivation
using HPSG.

3. An HPSG Primer

Two assumptions underlie the theory defining
head-driven phrase structure grammars. Firstly,
languages are systems of sorts of linguistic objects at
a variety of levels of abstraction, not just collections
of sentences. Secondly, grammars are best represented
as process-neutral systems of declarative constraints,
as opposed to constraints defined in terms of
operations on objects as in transformational grammars.
Thus, HPSG is seen as consisting of an inheritance
hierarchy of sorts, with constraints of various kinds on
the sorts of linguistic objects in the hierarchy [18].
HPSG includes grammar rules and lexical entries.
Normally, the latter are not considered to belong to a
grammar. The formalism is built around the lexicon.
This means that the lexicon is more than just a list of
entries; it is in itself richly structured. Individual
entries are marked with types, and the types form a
hierarchy.

Head-driven Phrase Structure Grammar (HPSG) is a
highly lexicalized, non-derivational generative
grammar theory. To understand the motivation of
HPSG, we may start from its predecessor, Context Free
Grammar (CFG). A CFG G is a 4-tuple
$G = (\Sigma, V, S, P)$ where

$\Sigma$ is a finite, non-empty set of terminals;
$V$ is a finite, non-empty set of non-terminals;
$S \in V$ is the start symbol;
$P$ is a finite set of production rules, each of the
form $A \rightarrow \alpha$, where $A \in V$ and
$\alpha \in (V \cup \Sigma)^*$.

For example, consider the following CFG for a very

In HPSG terminology, the basic grammatical construct
(or type) is the sign, which is a formal representation
of words, phrases and sentences. All human utterances
are captured by signs. A rule that licenses a sign is
captured by another object called a construct. Signs
and constructs are formalized as typed feature
structures, which are sets of attribute-value pairs.
Attributes are called linguistic objects. The value of
an attribute may be either atomic or complex, i.e. a
function. Functions are those feature structures which
are described using an attribute value matrix (AVM).

small fragment of English:

$\Sigma$ = {eat, eats, ..., rice, Sadique, ..., I, you, he, ...};
V = {S, VP, NP, V, N, P}
S = Start symbol

NP → N | P

The generic construct of a sign is presented in Figure 3
(An HPSG Sign). The AVM basically maps features to
feature structures. A feature in an AVM can be of two
types: (a) a category name, i.e., a sort description,
and (b) agreement (or constraints), which is a list of
attributes and their values.

V → eat | eats | ...
N → rice | Sadique | ...
P → I | you | he | ... }
Using the above CFG, the sentence "I eat rice" can be
analyzed by the following derivation:

A construct is represented using a feature structure
with a MOTHER (MTR) feature and a DAUGHTERS
(DTRS) feature. The value of the MTR feature is a sign
and the value of DTRS is a nonempty list of signs. A
typical description of a construct is shown in Figure 4
(An HPSG Construction). The licensing of signs
follows the Sign Principle, which states: "Every
sign must be lexically or constructionally licensed." A
sign is lexically licensed only if it satisfies some
lexical entry, and constructionally licensed only if it
is the mother of some construct [19].
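
The relation between signs, constructs and the Sign Principle can be illustrated with a small Python sketch. It is a deliberate simplification and not an HPSG implementation: the class names `Sign` and `Construct`, the two features `form` and `category`, and the function `is_licensed` are hypothetical, and "satisfying a lexical entry" is reduced here to plain membership in a set of entries.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sign:
    """A heavily reduced sign: only a surface form and a category."""
    form: str
    category: str


@dataclass(frozen=True)
class Construct:
    """A construct pairs a mother sign (MTR) with a nonempty DTRS list."""
    mtr: Sign
    dtrs: Tuple[Sign, ...]

    def __post_init__(self):
        if len(self.dtrs) == 0:
            raise ValueError("DTRS must be a nonempty list of signs")


def is_licensed(sign, lexical_entries, constructs):
    """Sign Principle: a sign is lexically or constructionally licensed."""
    lexically = sign in lexical_entries                         # simplification
    constructionally = any(c.mtr == sign for c in constructs)   # mother of a construct
    return lexically or constructionally


# Hypothetical example: kataba is listed, aktaba is licensed by a construct.
kataba = Sign("kataba", "verb")
aktaba = Sign("aktaba", "verb")
LEXICON = {kataba}
CONSTRUCTS = [Construct(mtr=aktaba, dtrs=(kataba,))]
```

With these definitions `is_licensed(aktaba, LEXICON, CONSTRUCTS)` holds only because of the construct, which is the situation the lexical construction rules of Section 4 are meant to describe.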

I eat NP ⇒ I eat rice
However, there are problems with CFG. The above
definition also generates the sentence "I eats rice":
I eats NP ⇒ I eats rice
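
The over-generation of the CFG can be checked mechanically. The sketch below enumerates every sentence of the toy grammar in Python. Two rules, S → NP VP and VP → V NP, are not legible in the text and are assumed here; the terminal sets are restricted to the words actually listed, and the names `RULES`, `expand` and `SENTENCES` are hypothetical.

```python
from itertools import product

# Toy grammar; the S and VP rules are assumptions (see the lead-in).
RULES = {
    "S": [["NP", "VP"]],   # assumed
    "VP": [["V", "NP"]],   # assumed
    "NP": [["N"], ["P"]],
    "V": [["eat"], ["eats"]],
    "N": [["rice"], ["Sadique"]],
    "P": [["I"], ["you"], ["he"]],
}


def expand(symbol):
    """Return every terminal string (a tuple of words) derivable from symbol."""
    if symbol not in RULES:          # a terminal expands to itself
        return [(symbol,)]
    results = []
    for rhs in RULES[symbol]:
        parts = [expand(s) for s in rhs]
        for combo in product(*parts):
            results.append(tuple(word for part in combo for word in part))
    return results


SENTENCES = {" ".join(words) for words in expand("S")}

if __name__ == "__main__":
    print("I eat rice" in SENTENCES)    # grammatical sentence is generated
    print("I eats rice" in SENTENCES)   # so is the ungrammatical one
```

Because the non-terminals V and P carry no person or number information, nothing in the grammar can block the second sentence; this is the gap that feature structures fill.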





MORPH morphobj


formatives and affixes. VDEC indicates the verbal
declension type.


The SYN feature contains three function features,
namely CAT, VAL and MRKG. The significant CAT
information is VFORM, VOICE and MOOD. The value of
VFORM can be perf or imperf, denoting a perfect or
imperfect verb respectively. The value of VOICE can be
act or pass, denoting active or passive. The value of
MOOD can be any of the three moods in Arabic, namely
subjunctive, indicative or jussive. For perf
VFORM, MOOD is always indicative. VAL contains the
valence of the sign. The SEM feature contains semantic
information. First, I consider the INDEX feature,
which is a reference to a discourse entity. Then the
list of FRAMES under the SEM feature contains modifier
frames.


Figure 3 An HPSG Sign



DTRS list(sign)



List of Daughters

In this paper, I have proposed a lexical type hierarchy
for the different types of verb forms.
Figure 6 (HPSG Type Hierarchy of Arabic Verb)
shows the proposed type hierarchy. In this figure I
mainly cover the types of perfect verb forms. Perfect
verbs have only one type of declension as they are
always indicative. I have not shown the imperfect verb
forms here, but their types can be readily inferred
from this figure. The number of types for imperfect
verbs is three times that of perfect verbs, because
imperfect verbs have all three moods: indicative,
subjunctive and jussive. From this diagram, it is clear
that all these verb forms are derived from triliteral
roots. In other words, these verb forms are variations
of stems derived from triliteral roots.


Figure 4 An HPSG Construction

4. HPSG Formalism for Verb Form

In this section, I model the verb form categories and
their derivation from the triliteral root. I adopt the
SBCG [2] version of HPSG for the analysis. The AVM for
the Arabic verb is discussed in subsection A. Then I
discuss the different HPSG types of root verbs and
finally propose a multiple-inheritance hierarchical
model for Arabic verb forms, which is the first
contribution of this paper. After this hierarchical
model, I propose a lexical construction rule which
recognizes the formation of a triliteral Form IV verb
from a triliteral Form I root verb. Then, I show an
example of how this construction rule helps to
recognize a Form IV verb lexeme. These contributions
are described in subsections B and C.


Type Hierarchy of Arabic Verb Forms


Construction Rules of Arabic Verb Forms

Complete construction rules for the derivation of all
these verb forms cannot be described within the page
limit, as covering every derivation may require a full
book. To keep the description precise, I have limited
the discussion to the derivation of the strong Form IV
perfect active indicative verb.

HPSG AVM for Arabic Verb

I use the verb AVM proposed in [16]. This AVM is given
in Figure 5 (SBCG AVM for Arabic verb).

Here, I propose a construction rule for the derivation
of a strong Form IV verb from a strong Form I verb.
This derivation can guide the derivation of the other
forms. Figure 7 (Lexical Rule for Form IV construction)
shows this construction rule. As is common with other
SBCG construction rules, it contains two parts, MTR
and DTRS. MTR contains the AVM of the Form IV verb

Here MORPH denotes the morphological features. It
contains three features, namely ROOT, SKELETON
and VDEC. ROOT is the sequence of root letters in the
verb, and its meaning is the same as in [15]. SKELETON
is a sequence of morphological objects which are
phonologically realized. It will include both lexical



and DTRS contains only one AVM, which is for Form I.

The construction rule contains four placeholders. The
first three placeholders are for the three root letters.
Thus, from this construction rule, a Form IV verb built
from the letters "k, t, b" or from "n, s, r" can be
recognized. From the SYN feature, it can be seen that a
Form IV verb is always transitive.
The SKELETON feature of the DTR AVM contains a vowel
placeholder. This means that the SKELETON of the Form
IV verb is the same for all types of Form I verb. In
other words, the Form IV verb stem is the same for the
Form IA through Form IH verb stems.
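
The behavior of the construction rule described above can be approximated in Python. This is a sketch under stated assumptions rather than the rule of Figure 7: AVMs are flattened into dictionaries, the feature names `TRANSITIVE` and `FRAME`, the index labels "i" and "j", and the helper names `form_i` and `form_iv_construct` are hypothetical, and the transliteration follows the paper in leaving the initial hamza unmarked.

```python
def form_i(root, vowel, actor="i"):
    """Build a reduced Form I daughter AVM; `vowel` is the middle-letter vowel."""
    c1, c2, c3 = root
    return {
        "ROOT": [c1, c2, c3],
        "SKELETON": [c1, "a", c2, vowel, c3, "a"],
        "FRAME": {"ACTOR": actor},
    }


def form_iv_construct(dtr):
    """Return a construct whose MTR is the Form IV verb derived from `dtr`."""
    # Four placeholders: three root letters and the Form I vowel.
    c1, _, c2, _vowel, c3, _ = dtr["SKELETON"]
    if [c1, c2, c3] != list(dtr["ROOT"]):
        raise ValueError("SKELETON does not match ROOT")
    mtr = {
        "ROOT": list(dtr["ROOT"]),
        # The Form I vowel is matched but not copied into the Form IV stem.
        "SKELETON": ["a", c1, c2, "a", c3, "a"],
        "TRANSITIVE": True,
        # The actor of Form I becomes the undergoer of Form IV;
        # "j" is a fresh index for the new actor.
        "FRAME": {"ACTOR": "j", "UNDERGOER": dtr["FRAME"]["ACTOR"]},
    }
    return {"MTR": mtr, "DTRS": [dtr]}


if __name__ == "__main__":
    cxt = form_iv_construct(form_i("ktb", "a"))
    print("".join(cxt["MTR"]["SKELETON"]))
```

Calling the function on daughters that differ only in the middle vowel yields the same mother SKELETON, which mirrors the observation that the Form IV stem does not depend on the Form I subtype.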

Figure 6 HPSG Type Hierarchy of Arabic Verb

[AVM diagram of the construction rule: the MTR (Form IV) and DTRS (Form I) AVMs, each with ROOT, SKELETON and SIT features; the DTRS SKELETON contains a VOWEL placeholder]




An interesting semantic effect is visible from the
construction rule. If the event frames of the MTR and
DTR AVMs are probed, it will be found that the undergoer
of the Form IV verb is the actor of the Form I verb, and
the actor index of Form I goes to the valence sign index
in Form IV.
Figure 8 (Example of Derivation from cxt rule) shows
an example of how the mentioned construction rule is
used to derive the stem aktaba (أكتب) from the root verb
kataba (كتب).

5. Conclusion
In this paper, I have captured the morphological
derivation of Arabic verb forms and their syntactic and
semantic effects by proposing a construction rule. I
have shown how Form IV can be generated from the Form I
root verb. The result of this paper can, however, be
used for the derivation of all Arabic verb forms. An
immediate improvement would be to show the construction
rules and their syntactic and semantic effects for all
the other verb forms.

Figure 7 Lexical Rule for Form IV construction





Figure 5 SBCG AVM for Arabic verb

Figure 8 Example of Derivation from cxt rule

6. References
[1] C. J. Pollard and I. A. Sag, Head-Driven Phrase
Structure Grammar, Chicago: University of
Chicago Press, 1994.
[2] I. A. Sag, Sign-Based Construction Grammar: An
informal synopsis, Stanford University, 2007.
[3] K. R. Beesley, Finite-State Morphological
Analysis and Generation of Arabic at Xerox
Research: Status and Plans in 2001, Workshop on
Arabic Language Processing: Status and ...
Linguistics, 2001.
[4] O. Smrž, Functional Arabic Morphology. Formal
System and Implementation, PhD Dissertation,
Charles University in Prague, 2007.
[5] M. S. I. Bhuyan and R. Ahmed, An HPSG
Analysis of Arabic Verb, The International Arab
Conference on Information Technology, 2008.
[6] http://en.wiktionary.org/wiki/Appendix:Arabic
verb forms
[7] A. M. Mutawa, S. Alnajem, and F. Alzhouri, An
HPSG Approach to Arabic Nominal Sentences,
Journal of the American Society For Information
Science and Technology, vol. 59(3), pp. 422-434.
[8] A. Kihm, Nonsegmental concatenation: a study of
Classical Arabic broken plurals and verbal nouns,
Morphology 16, pp. 69-105, 2006.
[9] M. S. I. Bhuyan and R. Ahmed, Nonconcatenative
Morphology: An HPSG Analysis, the 5th
International Conference on Electrical and
Computer Engineering, 2008.
[10] S. Z. Riehemann, Type-Based Derivational
Morphology, Journal of Comparative Germanic
Linguistics, Vol-2, 1998.
[11] S. Bird and E. Klein, Phonological Analysis in ...
Linguistics, Vol-20, 1994.
[12] M. S. Islam, M. H. Masum, M. S. I. Bhuyan, and
R. Ahmed, An HPSG Analysis of Declension in
Arabic Grammar, Proceedings of the 9th
International Arab Conference on Information
Technology, 2009.
[13] M. S. I. Bhuyan and R. Ahmed, An HPSG
Analysis of Arabic Passive, Proceedings of the
11th International Conference on Computer and
Information Technology, 2008.
[14] T. Buckwalter, Buckwalter Arabic Morphological
Analyzer Version 2.0, LDC catalog LDC2004L02.
[15] M. S. Islam, M. H. Masum, M. S. I. Bhuyan, and
R. Ahmed, Arabic nominals in HPSG: A verbal
noun perspective. In Proceedings of the 17th
International Conference on Head-Driven Phrase
Structure Grammar, Paris, France, 2010.
[16] M. H. Masum, M. S. Islam, M. S. Rahman and R.
Ahmed, Type-Based HPSG Analysis of Arabic
Verbal Declension, the 7th International
Conference on Electrical and Computer
Engineering, 2012.
[17] M. H. Masum, M. S. Islam, M. S. Rahman and R.
Ahmed, HPSG Analysis of Type-Based Arabic
Nominal Declension, Proceedings of the 12th
International Arab Conference on Information
Technology, 2012.
[18] G. M. Green, Elementary Principles of HPSG,
FIPS Publication, 1999.
[19] I. A. Sag and T. Wasow, Syntactic Theory: A
Formal Introduction, Stanford: CSLI Publications.


Spoken Dialog System: Direction Guide for Lahore City

Aneef Izhar ul Haq1, Aneek Anwar1, Aitzaz Ahmad1, Tania Habib1,
Sarmad Hussain1, Shafiq-ur-Rahman2

Centre for Language Engineering,

Al-Khwarizmi Institute of Computer Science,
UET, Lahore, Pakistan

FAST NUCES, Lahore, Pakistan


different end-to-end Spoken Dialog systems and their
comparison, section 3 discusses the architecture of the
prototype system design, and section 4 discusses the
RavenClaw Dialog Manager used in the current system.
The telephony framework is reviewed in Section 5, and
an example dialog is presented and explained in Section
6. At the end of the paper, the challenges, limitations
and future scope of this system are discussed in
sections 7 and 8.

A Direction Guide Spoken Dialog System is a system
that asks an Urdu-speaking user for his current and
destination locations over a telephone and then gives
the user direction guidance from the present
location to the destination. A prototype end-to-end
system has been developed as a distributed, Hub &
Spoke, message-based system using the open-source
GALAXY Communicator. The end-to-end system
uses a Telephony framework and a software-based
infrastructure that involves the modules of Automatic
Speech Recognizer, Text-to-Speech Synthesizer,
RavenClaw Dialog Manager, a Backend database and
an Interaction Manager. Currently the end-to-end
system works for a single session; future work
includes handling multiple sessions.

2. Literature Review
GALAXY is an open-source architecture for
developing new spoken dialog systems. It has a
centralized architecture where a Hub communicates
with individual modules or servers. The Hub and Spoke
GALAXY architecture -- HUB at the center while
spokes are the connections made to each of the servers

Keywords: GALAXY; RavenClaw; Spoken Dialog System; Urdu Path Finder for Lahore City

1. Introduction
A Spoken Dialog System is a system that aims to
provide services such as weather information and
direction guidance, in which the interaction between
the system and the user takes place through dialogs.
This paper focuses on a Spoken Dialog System framework
built to serve as a path finder for the city of Lahore.
The system covers 49 places in Lahore and provides
paths from the user's present location to the
destination. Generally, a dialog system includes a
telephony framework, a software-based infrastructure
and a dialog manager.

Outline-- This paper comprises 8 sections, in which
section 2 gives a brief literature review of

Figure 1: Galaxy Communicator [2]



from the HUB -- has been shown in Figure 1. GALAXY

II is an evolutionary form of MIT GALAXY System [1].
It retains the distributed client-server architecture of
GALAXY and became the reference architecture for the
DARPA Communicator Program [2] [3]. All the
subsequent systems developed under the program use
the GALAXY II architecture to build their respective
dialog systems.

is guaranteed. The complete system was not tested on

UNIX but each individual module is known to work on
UNIX so the system can be ported to UNIX. Its
distribution is available free under the BSD license
(permission to modify and distribute). Later dialog
systems by CMU like RoomLine, Olympus and
Conquest were built using the same architecture and
only certain modules have been replaced or modified.

2.1. Jupiter


Jupiter is a conversational system built on the
GALAXY architecture. It was developed by the MIT
SLS (Spoken Language Systems) group and went online
in 1997. It was a weather information retrieval system
accessed over a telephone. Later it was redesigned
using the Web GALAXY architecture [4] to provide
support for a graphical interface, and it was
restructured again in 1999 using the GALAXY II
programmable Hub architecture [2]. The system has a
vocabulary of over 2000 words and can provide weather
information for nearly 500 cities around the world [4].
Jupiter can respond to queries about the general
weather forecast as well as specific queries about
temperature, humidity, wind speed, sunrise and sunset
times etc. It can also inform the user about the
cities in a particular region whose weather information
it can provide [5].

GALATEA is a human-like visual agent which can
communicate via spoken language. It has been
developed by a team of Japanese researchers from
various universities and is currently maintained at
the University of Tokyo. Unlike other spoken dialog
systems, it has an additional module for face-image
synthesis, which mimics the human face while speaking,
for example by synchronizing lip movement with the
speech and nodding the head when answering in the
affirmative. It is an open-source system developed on
Ubuntu Linux and is available for use and
redistribution under a BSD license [8] [9]. Despite
being compact and customizable, the system has been
developed strictly as a desktop application and does
not support an interface for a telephone connection.
Also, most of the related literature is available in
Japanese instead of
2.2. Mercury


Mercury is a telephone-based conversational
flight-reservation system designed to allow end users
to plan their air travel between 226 cities around the
world [6]. Based on GALAXY II, it uses the same
architecture as Jupiter except for the dialog manager,
whose management and design are given in [6]. The
accuracy of the system, in the context of recognition by
ASR, is discussed in [7]. This system is now offline and
in its place a web-based version is available by the
name of Flight Browser.

The Olympus architecture is an improved version
of the CMU Communicator project. The motivation
behind the development of Olympus was to come up
with a framework which is open, transparent, flexible,
modular and scalable. The system provides detailed
accounts of the internal state of each module to aid the
analysis of the running system. The dialog management
module in Olympus uses the RavenClaw dialog
management framework [10], which has a two-tier
architecture that separates the domain-specific tasks
from the domain-independent dialog engine.
A key feature shared by all Olympus components is
the decoupling of the domain-specific resources from
the domain-independent programs, which promotes
reusability and also allows users to modify a
particular module as per their needs with less effort
while keeping the rest of the system intact. The Olympus
framework comes with a BSD license that allows the use
of the source code for research and commercial use [10].

2.3. CMU Communicator

The CMU Communicator (CMUnicator) is a
travel-planning system developed by the Carnegie Mellon
Sphinx Group under the DARPA Communicator program [7].
It uses a centralized architecture, with a GALAXY Hub
for inter-module communication. It comprises 10
different modules. Moreover, it provides both a
telephony and a desktop interface. A Process Monitor,
which starts the modules and the Hub, monitors the
whole end-to-end process as well as the individual
modules and restarts them in case they crash.
The complete end-to-end system was originally developed
on Windows 2000 and no forward compatibility
2.6. Comparison of Different Spoken Dialog Systems

The comparison between different spoken dialog
systems is shown in Table 1.



Table 1: Comparison of different Spoken Dialog Systems

















































server machine. Each server machine's IP address and
port number are listed in the Hub's program file. To
start up communication, all the servers first start on
their respective ports; then the Hub starts by loading
the routing rules, the Hub programs and the list of
servers with their corresponding addresses and ports,
and connects to the servers. Once all the servers and
the Hub are up, a new dialog session can be initiated. A
new session is created whenever a user calls the given
number from a telephone or mobile phone.
Figure 2 shows the high-level architecture of the
prototype's framework design. The GALAXY
framework consists of the dialog manager
(RavenClaw) and the components on the Fedora
machine; the Telephony framework consists of the
Asterisk server on the CentOS machine and the physical
hardware. Each block connected to the GALAXY
Hub is a server process, called a GALAXY server. The
Asterisk server, on the other hand, is not a GALAXY
server and thus cannot communicate with the GALAXY Hub
directly. Instead, it communicates with the servers on
the Fedora machine over socket connections, passing
the call initiation information and the user inputs and
receiving the system's response and signaling
information for synchronization. The speech recognizer
used in this prototype system is an Urdu Automatic
Speech Recognizer (ASR) that has been trained with the
Sphinx recognizer. The ASR vocabulary currently covers
49 places inside Lahore; these places were chosen
because they had large amounts of training data, which
resulted in good accuracy. The list of these places is
provided in Appendix I. The speech synthesizer is the
Festival Text-To-Speech Synthesizer (TTS). The TTS
outputs the system response in English, as the Urdu TTS
is currently under development. A Backend server
is used as an Application Database, which contains

Jupiter, Mercury, CMUnicator, Olympus and Conquest
are all based on the GALAXY architecture. Jupiter,
Mercury and Galatea run on Linux, while
CMUnicator, Olympus and Conquest are Windows
based. The Olympus Spoken Dialog System and the
GALAXY Communicator Hub & Spoke framework, which
contained only dummy servers/modules, were chosen for
building the current prototype system design.
The GALAXY Communicator Software Infrastructure was
used for building the current prototype design along
with the Dialog Manager module from the Olympus
Spoken Dialog System. Olympus was not chosen as an
end-to-end system for the current prototype because it
had many additional functionalities that were not
required. Hence, instead of scaling Olympus down by
removing the additional functional modules, the
RavenClaw Dialog Manager from Olympus was used as a
stand-alone application in the prototype system along
with the GALAXY

3. System Architecture
The Spoken Dialog System has been made using
two open source entities; the GALAXY framework,
based on a networking solution, and RavenClaw [11].
Each module in the GALAXY Communicator
communicates with the hub only [12]. In the present
system, the communication protocols that are used are
the same as those in the GALAXY architecture but
every module has been developed and implemented
from scratch. The hub uses a program file, written using
a high-level scripting language called the Hub Program
File Syntax, which has a set of rules defined for routing
of the information received from various modules and
to control the flow of communication. Each module runs
as a separate server which proves to be beneficial in
scenarios where certain modules need to run on separate



Figure 2: High level System architectural diagram

not required in the current prototype system. So these
dependencies had to be accounted for in RavenClaw.
In OLYMPUS, the decoded inputs from the
Recognizer reach the DM via the Parser
module, the Confidence Annotator (which assigns a
confidence score to the parsed decoded results) and the
Interaction Manager [14]. The output DM frame had to
pass through an NLG module that performed a lookup
to generate a complete information message, which was
then passed to the Synthesizer. In the current
framework, the dependency on the Confidence
Annotator module of the OLYMPUS system has been
handled by assigning a confidence score of either 0 or
1 to the decoded output of the Recognizer. A
correctly decoded result gets a confidence score of 1,
while a wrongly decoded result gets a confidence score
of 0. The dependency of the RavenClaw DM on the NLG
module of OLYMPUS was handled by writing a small
routine that performs a lookup based on the prompt_key
key of the DM's output frame. The prompt_key can be of
a welcome type, a going_to type, a going_from type,
etc. The prompt_key tells which text message is to be
synthesized, and the prompt_type tells the nature of the
system response (a question which shall lead to an
input, a simple informative message, or the requested
route to be retrieved from the database).
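
The two replacement routines can be sketched in Python as follows. This is an illustration under assumptions and not the code of the prototype: the function names, the prompt_type value "question", the wording of the going_to prompt, the place names and the rule that a result counts as correctly decoded when it is in the recognizer vocabulary are all hypothetical. Only the keys prompt_key and prompt_type and the key values welcome, going_to and going_from come from the text.

```python
# Hypothetical prompt table keyed by prompt_key.
PROMPTS = {
    "welcome": "Welcome to the CLE Direction Guide.",
    "going_from": "What is your current location?",
    "going_to": "What is your destination?",  # assumed wording
}

# Hypothetical stand-in for the 49-place recognizer vocabulary.
KNOWN_PLACES = {"Anarkali", "Liberty Market"}


def binary_confidence(decoded, vocabulary):
    """Stand-in for the Confidence Annotator: score is 1 or 0, nothing between."""
    return 1 if decoded in vocabulary else 0


def render_prompt(frame):
    """Stand-in for the NLG lookup: map a DM output frame to the text to speak.

    Returns the text and a flag telling whether user input is expected next.
    """
    text = PROMPTS[frame["prompt_key"]]
    expects_input = frame["prompt_type"] == "question"
    return text, expects_input


if __name__ == "__main__":
    print(binary_confidence("Anarkali", KNOWN_PLACES))
    print(render_prompt({"prompt_key": "going_from", "prompt_type": "question"}))
```

A lookup table keeps the dialog manager's output frames independent of the wording of the prompts, so the English messages could later be replaced by Urdu ones without touching the dialog tree.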

the direction information for the different locations
used in this location-based prototype system. The path
descriptions for various locations inside Lahore are
retrieved from the Google Directions API [13]. An
Interaction Manager acts as a mediator between the
Dialog Manager and the GALAXY framework. After
determining the nature of the Dialog Manager's
output, the Interaction Manager performs the required
pre-processing before routing the frame to the
appropriate destination (Backend or Synthesizer
modules).
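
The hub-and-spoke message flow of this section can be imitated with a toy router in Python. The sketch does not use the GALAXY Communicator API or the Hub Program File Syntax; the class `Hub`, the frame names, the rule table and the three handler functions are hypothetical and only illustrate that servers never talk to each other directly.

```python
class Hub:
    """Toy hub: routes each frame to a server chosen by a rule table."""

    def __init__(self, routing_rules):
        self.routing_rules = routing_rules  # frame name -> server name
        self.servers = {}                   # server name -> handler function
        self.trace = []                     # (frame name, server name) pairs

    def register(self, name, handler):
        self.servers[name] = handler

    def route(self, frame):
        # Keep routing until a server returns no further frame.
        while frame is not None:
            target = self.routing_rules[frame["name"]]
            self.trace.append((frame["name"], target))
            frame = self.servers[target](frame)


def dialog_manager(frame):
    # Decides the next system prompt from the recognized input.
    return {"name": "dm_output", "prompt_key": "going_to"}


def interaction_manager(frame):
    # Pre-processes the DM output before it goes to the synthesizer.
    return {"name": "synthesize", "text": "prompt:" + frame["prompt_key"]}


def synthesizer(frame):
    # End of the chain in this sketch; nothing is sent back to the hub.
    return None


hub = Hub({
    "recognized": "dialog_manager",
    "dm_output": "interaction_manager",
    "synthesize": "synthesizer",
})
hub.register("dialog_manager", dialog_manager)
hub.register("interaction_manager", interaction_manager)
hub.register("synthesizer", synthesizer)
hub.route({"name": "recognized", "text": "Anarkali"})
```

After the call, `hub.trace` lists the three hops in order. Changing where a frame goes only requires editing the rule table, which is the role the Hub program file plays in the real framework.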

4. Dialog Manager -- RavenClaw

RavenClaw is the dialog manager used in the
Olympus spoken dialog system. A product of the CMU
Speech Group, Olympus has been designed for
Microsoft Windows based dialog applications and is
built on the foundations laid by the GALAXY framework.
RavenClaw is designed for reuse: the dialog management
logic and responsibilities are defined separately,
along with generic error handling and error recovery
techniques. The dialog itself is defined separately in a
file as a dialog tree. For each new dialog system, the
system developer needs to write a new dialog tree. The
dialog manager RavenClaw had to be extracted from the
Olympus Spoken Dialog System so that it could be used
as a standalone application/module with the GALAXY
framework. However, using the dialog manager as an
independent entity was not directly possible, as it had
dependencies on the Parser, Confidence Annotator and
Natural Language Generation (NLG) modules that were
present in the Olympus Spoken Dialog System but were

5. Telephony Framework
A telephony framework/server is the most important
part of the Mobile Based Spoken Dialog System as it is
the telephony framework that enables the user to input
his speech to the Spoken Dialog System or to listen to



Gateway device to the dedicated Asterisk server, which
runs on the CentOS operating system; the
communication software used between the
Gateway device and the dedicated Asterisk server is
TrixBox. The Asterisk server communicates
with the Spoken Dialog System's GALAXY
Communicator over a wired or wireless connection.
Currently, the system uses a single telephone line, but
it can be scaled up by using
multiple analog trunk lines or a single E1 PRI
(Primary Rate Interface) line, which supports a data
rate of up to 2.048 Mbps and up to 32 simultaneous
the system generated responses and prompts. The

telephony framework in the prototype system consists
of the following core components:

Telephone line (Hardware)

Linksys-spa-400 VOIP box (Hardware)
x86-based server (Hardware)
CentOS (Software)
TrixBox IP PBX (Software)

6. Process Flow
Figure 4 shows the process flow diagram of the
GALAXY framework. The Asterisk server on the
Telephony framework side receives the phone call and
connects the caller with the GALAXY framework.
Other than that, the Asterisk server plays all the system
messages to the user (received from the Synthesizer)
and records the user input which it then sends to the
Recognizer. The communication protocol of the
GALAXY framework is a custom standard. The
messages exchanged between various servers are in the
form of GALAXY frames. The Hub receives the output
frame from each server and routes it to the intended
server. This routing information comes from the Hub
program file which has the list of rules defined for
routing the output frames from each server. The

Figure 3: Telephony framework

Figure 3 shows the current design of the telephony
framework. The current design of the telephony
framework includes a Telephone line that is provided by
PTCL(the service provider) and terminated at an FXO
port present in the LinkSys SPA 400 VOIP Gateway
device. An Ethernet cable is used to connect the

Figure 4: Process flow diagram



System: Welcome to the CLE Direction Guide.

System: What is your current location?

Recognizer receives the recorded speech file from
Asterisk and performs recognition before
sending the decoded output to the Dialog Manager. The
Synthesizer receives the message text of the system's
response, performs text-to-speech synthesis, and
sends the synthesized speech file to Asterisk.
It also signals the Dialog Manager to proceed with the
dialog after the system response has been played to the
user on the Asterisk end. The Dialog Manager controls and
executes the flow of the dialog. The Interaction Manager
receives the output of the Dialog Manager, performs
some pre-processing based on the type
of system response to be generated, and sends the
processed information to either the Backend (in case of
a Backend query) or to the Synthesizer otherwise.
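
The routing role of the Hub described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GALAXY Communicator API: the `ROUTING_RULES` table, the `route` function, and the frame fields `source` and `kind` are hypothetical names that stand in for the rules listed in the Hub program file.

```python
# Hypothetical routing table standing in for the Hub program file:
# (sending server, kind of output frame) -> receiving server.
ROUTING_RULES = {
    ("asterisk", "recorded_speech"): "recognizer",
    ("recognizer", "decoded_text"): "dialog_manager",
    ("dialog_manager", "prompt"): "interaction_manager",
    ("interaction_manager", "backend_query"): "backend",
    ("interaction_manager", "response_text"): "synthesizer",
    ("synthesizer", "speech_file"): "asterisk",
}


def route(frame):
    """Return the destination server for a frame, as the Hub would."""
    key = (frame["source"], frame["kind"])
    if key not in ROUTING_RULES:
        raise ValueError(f"no routing rule for {key}")
    return ROUTING_RULES[key]


# A frame is modelled here as a plain dictionary (an assumption).
frame = {"source": "recognizer", "kind": "decoded_text", "payload": "..."}
print(route(frame))  # dialog_manager
```

Keeping the rules in a table rather than in code mirrors the idea that servers never address each other directly; only the Hub knows who receives what.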
The process flow diagram is explained below.

System: Where are you headed to?


System: Just a minute let me check that for you.

System: Total distance is 2.3 km. Head southwest
towards Katchery Rd. After 77m take the 1st left towards
Katchery Rd. After 0.2 km at the roundabout, take the
1st exit onto Mayo Hospital Rd. After 0.5 km, turn left
onto Napier Rd. After 0.2 km, turn left onto Hospital Rd.
After 66 m, continue onto Mayo Hospital Rd. After 0.4
km, turn right onto Railway Rd. After 0.7 km, turn right
at Barafkhana chowk onto Flemming Rd. After 0.1 km,
Gawal Mandi, Lahore.

6.1. Flow diagram

System: Thank you for using the CLE Direction Guide.

For each system response, the Dialog Manager
sends a prompt key and a prompt type to the Interaction
Manager as output. Depending on the prompt type,
the Interaction Manager either routes the key to the
Backend (to retrieve the requested route) or uses
the key to construct the text message to be sent to the
Synthesizer. The Synthesizer synthesizes the system
response and sends the speech file to Asterisk along
with the prompt type. The Asterisk server plays the
system response to the user and then sends the playback
event completion signal to the Synthesizer. On receiving
this signal, the Synthesizer sends the system_utt_end
event to the Dialog Manager so that it may proceed with
the execution of the dialog.
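
The Interaction Manager's choice between the Backend and the Synthesizer can be illustrated as follows. This is a sketch under assumptions: the prompt keys, the prompt type names (`backend_query`, `inform`, `request`) and the `dispatch` function are hypothetical, and only the prompt texts are taken from the example dialog in this paper.

```python
# Hypothetical prompt templates keyed by prompt key.
PROMPT_TEMPLATES = {
    "welcome": "Welcome to the CLE Direction Guide.",
    "ask_origin": "What is your current location?",
    "ask_destination": "Where are you headed to?",
}


def dispatch(prompt_key, prompt_type):
    """Decide where the Interaction Manager sends a Dialog Manager output."""
    if prompt_type == "backend_query":
        # The key identifies the route request; the Backend resolves it.
        return ("backend", prompt_key)
    # Otherwise build the message text and pass it on with its prompt type,
    # so that Asterisk can later tell whether to record a reply.
    text = PROMPT_TEMPLATES[prompt_key]
    return ("synthesizer", {"text": text, "prompt_type": prompt_type})


print(dispatch("ask_origin", "request"))
print(dispatch("route_lookup", "backend_query"))
```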
If the Asterisk server receives a request-type prompt,
then after sending the playback event completion signal to
the Synthesizer it plays a beep and records the caller's
location, spoken after the beep. The recorded speech
input is sent to the Recognizer, which decodes it
and sends the decoded result to the Dialog
Manager along with the user_utt_end event. On
receiving this event, the Dialog Manager processes the
input and performs the necessary validation
before continuing the dialog. If there is an error in
decoding the user's input, the user is asked to
give the location again (up to a maximum of 3
iterations). If the system is still unable to decode
the input correctly, it informs the user with an error
message that the input was not decoded successfully,
and the call is then terminated.
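
The retry behaviour described above can be sketched as a bounded loop. The function and parameter names are hypothetical, the `recognize` callable stands in for the play-prompt, record and decode round trip, and the sketch reads the limit as three attempts in total, which is one possible interpretation of the text.

```python
MAX_ATTEMPTS = 3  # limit on repeated prompts, as stated in the text


def ask_location(prompt, recognize, vocabulary):
    """Prompt up to MAX_ATTEMPTS times; return a location, or None on failure."""
    for _ in range(MAX_ATTEMPTS):
        decoded = recognize(prompt)  # hypothetical: play prompt, record, decode
        if decoded in vocabulary:
            return decoded
    # The caller is expected to play the error message and end the call.
    return None


# Simulated recognizer that fails twice (no output) and then succeeds.
outputs = iter([None, None, "Gawal Mandi"])
result = ask_location("Where are you headed to?",
                      lambda prompt: next(outputs),
                      {"Anarkali", "Gawal Mandi", "Choona Mandi"})
print(result)  # Gawal Mandi
```

Note that a misrecognized location that is still in the vocabulary passes this check, which is exactly the first challenge discussed in the next section.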

7. Challenges
The current spoken dialog system faces a
few challenges due to the Automatic Speech
Recognizer. These challenges can be categorized as
misrecognition and non-recognition. In the first
scenario, the recognizer misrecognizes a location
as some other location in the vocabulary. In such a
situation, no error is generated, because the word exists
in the vocabulary, and the corresponding system
response is played back to the user. An example
of misrecognition is shown below:
System: What is your current location?
(Recognizer decoded output: )
System: Where are you headed to?

(Recognizer decoded output: )

System: Total Distance is 3.0 km, Head southwest
toward Katchery Rd, After 77 m, Take the 1st left onto
Katchery Rd, After 0.2 km, Take the 2nd left onto New
Anarkali Road, After 0.6 km, Slight left at Jalandhar
Moti Choor House onto Ganpat Rd, After 0.2 km, At
the roundabout, take the 2nd exit onto Circular Rd ,
After 0.4 km,
Keep left to stay on Circular Rd ,
After 0.2 km, Slight left onto Shahalami Rd, After 81 m,
Turn left onto Shah Alami Road, After 0.6 km, At the
roundabout, take the 2nd exit onto Dabbi Bazar, After
0.3 km, Turn left onto Gali Kabli Mal, After 0.3 km,
Turn right onto Jamadaran Gali, After 24 m, Turn left

6.2. Example Dialog

An example dialog conversation has been shown



toward Choona Mandi Bazar, After 31 m, Turn left onto

Choona Mandi Bazar, After 0.1 km, Choona Mandi,
In this example, the destination Gawal
Mandi was misrecognized as Choona Mandi and the system
returned the path from Anarkali to Choona Mandi.
The other scenario arises when the
Automatic Speech Recognizer does not recognize
anything. Non-recognition is mainly due to limited
training data and vocabulary, but it can also be caused by
background noise in the user's environment.
When the recognizer gives no output,
the system asks the user again, up to a
maximum of 3 times, until it finds a valid
location in the vocabulary. An example of non-recognition is given below:

System: What is your current location?

(Recognizer decoded output: )

System: What is your current location?

(Recognizer decoded output: Null )

System: Where are you headed to?

(Recognizer decoded output: Null )

System: Where are you headed to?

(Recognizer decoded output: Null )

System: Where are you headed to?

(Recognizer decoded output: )

System: Just a minute let me check that for you.
System: Total distance is 2.3 km. Head southwest
towards Katchery Rd

In the example above, if no useful output is
obtained from the recognizer for 3
consecutive iterations, the user is informed with the error
message "An error occurred during recognition.
Sorry for the inconvenience", and the call is terminated.
These two challenges, misrecognition and non-recognition, need to be addressed in future work.

8. Future Work

Currently, the prototype works in a
single session, but it can be scaled up by giving it the
ability to establish multiple sessions at the telephony
framework end using an E1 line. At the software end,
the Dialog Manager module used in the current
framework design does not allow concurrent sessions,
whereas the GALAXY Communicator
framework is capable of handling multiple
concurrent sessions. As a workaround, an
intermediate module (essentially a Conversation
Manager) must be built that spawns a new Dialog
Manager whenever a parallel call is received from a
new user. The Conversation Manager will act as an
interface between the Dialog Manager(s) and the rest of
the framework, which will be
enhanced accordingly. Work is currently under way on
an Urdu TTS, so the future scope of this
work also includes integrating the prototype Urdu
TTS into the current system. We also aim to expand the
ASR vocabulary to all the important places in Lahore
city (currently 49 locations are in the ASR's

This work has been conducted through the project
"Enabling Information Access for Mobile based Urdu
Dialogue Systems and Screen Readers", supported
through a research grant from ICT RnD Fund, Pakistan.

[1] P. R. Cohen, A. Cheyer, M. Wang and S. C. Baeg, "An
Open Agent Architecture," in Proc. AAAI Spring
Symposium, Stanford, California, March 1994.
[2] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid and V.
Zue, "GALAXY-II: A Reference Architecture for
Conversational System Development," in Proc.
ICSLP'98, Sydney, Australia, 30th November - 4th
December, 1998, pp. 931-934.
[3] J. Polifroni and S. Seneff, "GALAXY-II as an
Architecture for Spoken Dialogue Evaluation," in Proc.
Second International Conference on Language
Resources and Evaluation (LREC), Athens, Greece, May
31 - June 2, 2000.
[4] R. Lau, G. Flammia, C. Pao and V. Zue, "WebGalaxy:
Integrating spoken language and hypertext navigation,"
in Proc. of Eurospeech, Rhodes, Greece, September,



[5] V. Zue, S. Seneff, J. R. Glass, J. Polifroni, C. Pao, T. J.
Hazen and L. Hetherington, "Jupiter: A Telephone-Based
Conversational Interface for Weather
Information," IEEE Trans. Speech and Audio
Processing, vol. 8, no. 1, January, 2000.
[6] J. Polifroni and S. Seneff, "Dialogue Management in the
Mercury Flight Reservation System," in Satellite
Dialogue Workshop of the ANLP-NAACL Meeting,
Seattle, Washington, April 2000.
[7] [Online]. Available: http://www.speech.cs.cmu.edu/Communicator/Communicator/.
[Accessed 18 November
[8] [Online].
[Accessed 30 November 2013].
[9] [Online]. Available: http://hil.t.u-tokyo.ac.jp/~galatea/.
[Accessed 30 October 2013].
[10] January 2007. [Online]. Available: http://wiki.speech.cs.
[Accessed 7 January 2014].
[11] D. Bohus and A. I. Rudnicky, "The RavenClaw dialog
management framework: Architecture and systems,"
Computer Speech and Language, vol. 23, January, 2009.
[12] "GALAXY Communicator Documentation," [Online].
[Accessed 13 January 2014].
[13] Google. [Online]. Available: https://developers.google
[Accessed 13 November 2013].
[14] D. Bohus, A. Raux, T. K. Harris, M. Eskenazi and A. I.
Rudnicky, "Olympus: an open-source framework for
conversational spoken language interface research," in
Proc. HLT-NAACL 2007, Rochester, New York, USA,
April, 2007.

Appendix I



Computer-aided Error Analysis of L2 Spoken English:

A Data Mining Approach
Yuichiro Kobayashi
Japan Society for the Promotion of Science

Understanding learners' errors is significant for
language teachers, researchers, and learners.
Computer learner corpora enable us to carry out
computer-aided error analysis, which, compared to
traditional error analysis, has an advantage in the
storing and processing of enormous amounts of
information about various aspects of learner language.
The present study aims to explore the error patterns
across proficiency levels in second language spoken
English with data mining techniques. It also attempts to
identify error types that can be used to discriminate
between English learners at different proficiency levels.
Spoken data for the present study were sourced from
the NICT JLE Corpus, a computerized learner corpus
annotated with 46 different error tags. The results of
the present study indicate that there is a substantial
difference in the frequencies of five types of errors,
namely (a) article errors, (b) lexical verb errors, (c)
normal lexical preposition errors, (d) noun number
errors, and (e) tense errors, between lower- and
upper-level learners. The findings will be useful for L2
learner profiling research and for the development of
automated speech scoring systems.

1. Introduction
Understanding learners' errors is significant for
language teachers, researchers, and learners [1]. Error
analysis (EA) was a major topic in the field of second
language acquisition in the 1960s and early 1970s, and
it consists of the following five steps [2]:

(a) Collection of a sample of learner language
(b) Identification of errors
(c) Description of errors
(d) Explanation of errors
(e) Evaluation of errors

However, several limitations of the approach were
pointed out in the 1970s. Major limitations can be
summarized as follows [3]:

(a) Limitation 1: EA is based on heterogeneous
learner data.
(b) Limitation 2: EA categories are fuzzy.
(c) Limitation 3: EA cannot cater for phenomena
such as avoidance.
(d) Limitation 4: EA is restricted to what the learner
cannot do.
(e) Limitation 5: EA gives a static picture of L2

Nonetheless, the emergence of computer learner
corpora in the early 1990s enabled us to carry out
computer-aided error analysis (CEA). It was developed
to overcome most of the drawbacks of traditional EA,
and it has an advantage in the storing and processing of
enormous amounts of information about various
aspects of learner language [4].

2. Purposes
The present study aims to explore the error patterns
across proficiency levels in second language (L2)
spoken English with data mining techniques. It also
attempts to identify error types that can be used to
discriminate between English learners at different
proficiency levels.

3. Materials
3.1. Spoken data
Spoken data for the present study were sourced from
the NICT JLE (National Institute of Information and
Communications Technology Japanese Learner
English) Corpus, a corpus of more than 1,200 Japanese
English as a foreign language (EFL) learners' oral
interview transcripts [5]. It comprises 325 hours of
interviews conducted with 1,281 test takers. The
15-minute oral proficiency test is called the Standard
Speaking Test (SST) and its evaluation criteria



conform to the American Council on the Teaching of
Foreign Languages Oral Proficiency Interview (ACTFL
OPI). The present study used the error-tagged subset of
this corpus, which includes oral performance data from 209
test takers whose proficiency levels were assessed
as level 3 (novice high), level 4
(intermediate low), level 5 (intermediate low-plus),
level 6 (intermediate mid), level 7 (intermediate
mid-plus), level 8 (intermediate high), or level 9
(advanced). Since the numbers of test takers in levels 1
and 2 were low, they were not used in the present
study. Table 1 summarizes the number of test takers
and total words from each level.

Table 1: Corpus size of each level

Test takers
Total words
Level 3
Level 4
Level 5
Level 6
Level 7
Level 8
Level 9

3.2. Error tags

Limiting the number of error types targeted in an
EA can yield limited results. Therefore, we need to
analyze various types of errors and to compare their
frequencies with each other [6]. The error tagset for the
NICT JLE Corpus was designed to deal with as many
morphological, lexical, and grammatical errors as
possible, and includes 46 different error tags [7] (see
Appendix). The tags are based on XML syntax, as can
be seen in Figure 1.
Yes. Uh. Usually, <at odr="1" crr="the"></at>
crr="opens">open</v_agr> on <n_num odr="3"
crr="Saturdays">Saturday</n_num> and <n_num
odr="4" crr="Sundays">Sunday</n_num>.
Figure 1: A sample of error-tagged spoken

4. Procedure
In the present study, the frequencies of each error
tag inserted in the spoken data were automatically
calculated and employed for the data mining of
learners' error patterns. First, the distribution of
frequencies of all types of errors was visualized in
order to overview the errors found in the data. Then,
using heatmap analysis with hierarchical clustering, the
learner groups at each proficiency level and error types
were classified. Following these classifications,
co-occurrence patterns of errors of lower- and upper-level
learners were compared using association rule mining.
Finally, error types that can be used to discriminate
between learners at different proficiency levels were
identified with a decision tree model.

5. Results and discussion

5.1. Frequency distribution of errors
The present study began by examining which types
of errors are frequently observed in the oral
performance of Japanese EFL learners. Figure 2
presents the frequencies of all error types found in the
corpus. As can be seen in this figure, the most common
errors are article errors (at), followed by tense errors
(v_tns), normal lexical preposition errors (prp_lxc1),
unknown types of errors (o_uk), noun number errors
(n_num), and lexical verb errors (v_lxc).

5.2. The relationship of proficiency levels and error types
The next step was to investigate the complex
interrelationships among seven proficiency levels,
those among 42 error types, and association patterns
between proficiency levels and error types. In Figure 3,
they are visualized as a heatmap with hierarchical
clustering based on the complete linkage method on
Euclidean distances. It consists of the result of the
clustering of proficiency levels, the result of the
clustering of error types,
and the heatmap generated from the frequency matrix
of error types. In the heatmap, darker cells
represent errors that are more frequent at that proficiency
level than at other levels, and paler cells represent
errors that are less frequent. The
diagram shows that most error types are the most
frequent at level 3 and that their frequencies decrease
as learners' proficiency levels rise. The clustering results
are nested categorizations in which small clusters of highly
similar items are included within much larger clusters of less
similar items. As is clear from the result of the
clustering of proficiency levels, there is a substantial
difference in the frequencies of errors between
lower-level (levels 3, 4, and 5) and upper-level (levels 6, 7, 8,
and 9) learners.



Figure 2: Distribution of frequencies of 42 error types
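
The frequencies plotted in Figure 2 come from counting the error tags in each transcript. The following sketch shows one way such counts, and the relative frequency per 100 words used later in the paper, could be computed. It is an assumption-laden illustration, not the author's script: the regular expressions, the function names, and the `SAMPLE` sentence (written in the style of Figure 1, with the word "store" invented to make it well-formed) are all hypothetical.

```python
import re
from collections import Counter

# Opening error tags carry an odr attribute, e.g. <n_num odr="3" ...>.
TAG_OPEN = re.compile(r'<([a-z_0-9]+)\s+odr="\d+"')
ANY_TAG = re.compile(r"<[^>]+>")

# Hypothetical sample in the style of Figure 1 (not corpus data).
SAMPLE = ('Usually, <at odr="1" crr="the"></at>store '
          '<v_agr odr="2" crr="opens">open</v_agr> on '
          '<n_num odr="3" crr="Saturdays">Saturday</n_num> and '
          '<n_num odr="4" crr="Sundays">Sunday</n_num>.')


def count_error_tags(transcript):
    """Count how often each error tag occurs in a tagged transcript."""
    return Counter(TAG_OPEN.findall(transcript))


def count_words(transcript):
    """Count whitespace-separated words after stripping all tags."""
    return len(ANY_TAG.sub("", transcript).split())


def rates_per_100_words(transcript):
    """Relative frequency of each error tag per 100 words."""
    n_words = count_words(transcript)
    return {tag: 100.0 * n / n_words
            for tag, n in count_error_tags(transcript).items()}


print(count_error_tags(SAMPLE))
print(rates_per_100_words(SAMPLE))
```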

Figure 3: Relationship between proficiency levels and error types
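
The dendrograms in Figure 3 are built by hierarchical clustering with complete linkage on Euclidean distances. A minimal pure-Python sketch of that procedure is given below. The function name and the two-dimensional toy vectors are hypothetical and are not taken from the corpus; a real analysis would use one row per proficiency level and one column per error type.

```python
import math
from itertools import combinations


def complete_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in points]
    merges = []

    def dist(a, b):
        # Complete linkage: the distance between two clusters is the
        # largest Euclidean distance between any pair of their members.
        return max(math.dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > 1:
        # Merge the pair of clusters that is closest under complete linkage.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy error-frequency vectors per level (hypothetical values).
toy = {"L3": (5.0, 4.0), "L4": (4.5, 3.8), "L8": (1.0, 1.2), "L9": (0.8, 1.0)}
for a, b, d in complete_linkage(toy):
    print(a, b, round(d, 3))
```

In this toy input the two lower levels and the two upper levels each merge first, and the two groups are joined last at a much larger distance, which is the kind of nested structure the dendrogram displays.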


5.3. The co-occurrence of errors

It is crucial to consider the difference in frequencies
of error types between lower- and upper-level learners
in more detail. As Figure 3 indicates, complex patterns
of frequencies of errors underlie the differences between
proficiency levels. Therefore, the co-occurrence
patterns of errors deserve careful attention. Tables 2
and 3 show the top 10 co-occurrence patterns of errors
made by lower- and upper-level learners respectively.
The Apriori algorithm for association rule mining [8]
was used for extracting co-occurrence patterns, and
these patterns were ranked in descending order of their
support scores. Since a support score is based on the
co-occurring frequency of the left-hand-side and
right-hand-side items, a pattern including high-frequency
items tends to be placed in a higher rank.

Table 2: Co-occurrence patterns of errors at lower
levels, ranked by support score (top 10)
1 {v_lxc} => {at}
2 {at} => {v_lxc}
3 {v_tns} => {at}
4 {at} => {v_tns}
5 {n_num} => {at}
6 {at} => {n_num}
7 {prp_lxc1} => {at}
8 {n_num} => {v_lxc}
9 {v_lxc} => {n_num}
10 {v_lxc} => {v_tns}

Table 3: Co-occurrence patterns of errors at upper
levels, ranked by support score (top 10)
1 {prp_lxc1} => {at}
2 {at} => {prp_lxc1}
3 {v_tns} => {at}
4 {at} => {v_tns}
5 {o_lxc} => {at}
6 {at} => {o_lxc}
7 {n_num} => {at}
8 {v_tns} => {prp_lxc1}
9 {prp_lxc1} => {v_tns}
10 {n_num} => {prp_lxc1}

The top 10 co-occurrence patterns ranked in Table 2
and those in Table 3 are similar, but, interestingly
enough, collocation errors (o_lxc) can be found in the

Tables 4 and 5 present the top 10 co-occurrence
patterns extracted from lower- and upper-level learners
respectively, sorted in descending order of their lift
scores. Using lift scores instead of support scores can
mitigate the effect of high-frequency items on the
ranking of patterns.

Table 4: Co-occurrence patterns of errors at lower
levels, ranked by lift score (top 10)
1 {pn_agr} => {v_cmp}
2 {v_asp} => {v_cmp}
3 {v_qst} => {prp_lxc1}
4 {av_pst} => {v_cmp}
5 {o_je} => {n_num}
6 {pn_agr} => {prp_lxc1}
7 {pn_agr} => {n_num}
8 {v_ng} => {n_num}
9 {v_ng} => {v_tns}
10 {v_fml} => {v_lxc}

Table 5: Co-occurrence patterns of errors at upper
levels, ranked by lift score (top 10)
1 {v_ng} => {n_lxc}
2 {v_ng} => {v_cmp}
3 {o_uit} => {o_uk}
4 {v_ng} => {o_uk}
5 {o_uit} => {o_lxc}
6 {o_uit} => {v_lxc}
7 {v_fin} => {v_lxc}
8 {v_fin} => {n_num}
9 {v_vo} => {v_lxc}
10 {v_ng} => {o_lxc}

As Table 4 shows, lower-level learners make many
verb-related errors (e.g., v_lxc, v_cmp) as well as noun
number errors (n_num), pronoun number/sex
disagreement errors (pn_agr), and normal lexical
preposition errors (prp_lxc1). Furthermore, they
produce Japanese English (o_je) in their oral
performance. On the other hand, compared to
lower-level learners, upper-level learners make fewer simple
grammatical errors, but produce some unknown error
types (o_uk) and unintelligible utterances (o_uit).
Additionally, in order to overview the
patterns extracted by the Apriori algorithm,
co-occurrence patterns were visualized as networks using
the Fruchterman-Reingold algorithm [9] in Figures 4
and 5. As is clear in these figures, many more error
patterns can be found at lower levels than at
upper levels. On the other hand, collocation errors
(o_lxc) co-occur with many types of errors at upper



Figure 4: Co-occurrence patterns of errors at lower levels (116 patterns)

Figure 5: Co-occurrence patterns of errors at upper levels (97 patterns)
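
The patterns in Tables 2 to 5 and Figures 4 and 5 are ranked by support or by lift. The sketch below computes both scores (plus confidence) for a single rule, to show why lift is less sensitive to high-frequency items such as article errors. The function name and the toy transactions are hypothetical; treating each learner's set of error tags as one transaction is an assumption, since the paper does not state the transaction unit. The sketch also assumes that both sides of the rule occur at least once.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule {lhs} => {rhs}."""
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs in t)
    n_rhs = sum(1 for t in transactions if rhs in t)
    n_both = sum(1 for t in transactions if lhs in t and rhs in t)
    support = n_both / n            # how often both sides co-occur
    confidence = n_both / n_lhs     # P(rhs | lhs)
    lift = confidence / (n_rhs / n) # P(rhs | lhs) / P(rhs)
    return support, confidence, lift


# Toy transactions: one set of error tags per learner (hypothetical).
toy_transactions = [
    {"at", "v_tns"},
    {"at", "v_tns", "pn_agr", "v_cmp"},
    {"at"},
    {"at", "n_num"},
    {"at", "pn_agr", "v_cmp"},
]

print(rule_metrics(toy_transactions, "at", "v_tns"))
print(rule_metrics(toy_transactions, "pn_agr", "v_cmp"))
```

In this toy data both rules have the same support, but only the second has a lift well above 1: the frequent item `at` co-occurs with `v_tns` no more often than chance would predict, whereas `pn_agr` and `v_cmp` always appear together.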



5.4. Error types that characterize lower- and upper-level learners
Finally, the present study identified which error types
are the most crucial for the difference between
proficiency levels. Figure 6 presents the decision tree
model [10] of error patterns distinguishing lower- and
upper-level learners. It is a flowchart-like diagram
showing a sequence of branching operations based on
comparisons of the frequencies of prominent error

Figure 6: Decision tree of error patterns distinguishing lower- and upper-level learners

Figure 7: Error types that can be used to distinguish between lower- and upper-level learners
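
Each branch of the decision tree in Figure 6 compares one error type's frequency per 100 words with a threshold. The sketch below shows how a single such threshold can be chosen in the style of classification trees [10], by minimizing the weighted Gini impurity of the two resulting groups. It is an illustration only: the function names and the toy frequencies are hypothetical, the data are not from the NICT JLE Corpus, and the paper does not state which splitting criterion or software was used.

```python
def gini(labels):
    """Gini impurity of a group with two classes, 'lower' and 'upper'."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == "lower") / n
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for v, y in pairs if v < threshold]
        right = [y for v, y in pairs if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy article-error frequencies per 100 words (hypothetical values).
at_rate = [1.9, 1.4, 1.1, 0.9, 0.7, 0.5, 0.3]
group = ["lower", "lower", "lower", "lower", "upper", "upper", "upper"]
threshold, impurity = best_split(at_rate, group)
print(round(threshold, 2), impurity)
```

A full tree repeats this search within each resulting group and over all error types, which is how a sequence of branches such as the one in Figure 6 arises.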



In Figure 6, the first significant distinction is based
on the frequency of article errors (at). If the relative
frequency per 100 words in a learner's performance is
more than 0.8068, the learner is classified into the group at the
lower left in the diagram. On the other hand, if the
frequency is less than 0.8068, then the next distinction
concerns the frequency of lexical verb errors (v_lxc).
After this sequence of branching, all learners'
data were classified into six groups. The proportions of
lower-level and upper-level learners in
each group are shown as bar plots at the bottom of
the diagram. Lower-level learners are the majority in
the left four groups, while upper-level learners are the
majority in the right two groups. As these results
indicate, the frequencies of five types of errors, namely
(a) article errors (at), (b) lexical verb errors (v_lxc), (c)
normal lexical preposition errors (prp_lxc1), (d) noun
number errors (n_num), and (e) tense errors (v_tns),
can be used to discriminate between lower- and
upper-level learners. The frequency distributions of these
error types can also be seen in the form of box plots in
Figure 7, where the horizontal lines in the middle of the
boxes represent the median frequencies in each of the
seven proficiency levels. As the box plots show, these
error types characterize the oral performance of level 3,
4, and 5 learners.

6. Conclusion
The purpose of the present study was to explore the
error patterns across proficiency levels in L2 spoken
English with data mining techniques. The results of the
present study indicate that there is a substantial
difference in the frequencies of five types of errors,
namely (a) article errors, (b) lexical verb errors, (c)
normal lexical preposition errors, (d) noun number
errors, and (e) tense errors, between lower- and
upper-level learners. They also show that lower-level learners
make many more types of verb-related errors, noun
number errors, pronoun number/sex disagreement
errors, and normal lexical preposition errors compared
to upper-level learners. The findings will be useful for
L2 learner profiling research [11] and the development
of automated speech scoring systems [12]. It is
necessary to refine the error tagset for the NICT JLE
Corpus for more detailed future analyses. It is also
desirable to consider the context in which an error
occurred and to check the influence of topics and tasks
of the interview test on the frequency of the error.

Among the data of the 209 learners analyzed in the
present study, those of 159 are included in the
published version of the NICT JLE Corpus and the data
of the remaining 50 were kindly offered for research
purposes by Dr. Emi Izumi, one of the compilers of the
corpus. This work was supported by Grants-in-Aid for
Scientific Research Grant Numbers 12J02669 and

[1] S. P. Corder, "The Significance of Learners' Errors,"
International Review of Applied Linguistics, volume 5, 1967,
pp. 161-169.
[2] S. P. Corder, "Error Analysis," in J. P. B. Allen and S. P.
Corder (Eds.), Techniques in applied linguistics, Oxford
University Press, London, 1974, pp. 122-154.
[3] E. Dagneaux, S. Denness, and S. Granger,
"Computer-aided Error Analysis," System, volume 26, 1998, pp. 163-174.
[4] M. Abe, "A Corpus-Based Investigation of Errors across
Proficiency Levels in L2 Spoken Production," JACET
Journal, volume 44, 2007, pp. 1-14.
[5] E. Izumi, K. Uchimoto, and H. Isahara, A Speaking
Corpus of 1,200 Japanese Learners of English, ALC Press,
Tokyo, 2004.
[6] J. Schachter and M. Celce-Murcia, "Some Reservations
Concerning Error Analysis," TESOL Quarterly, volume 11,
1977, pp. 441-451.
[7] E. Izumi, K. Uchimoto, and H. Isahara, "Error Annotation
for Corpus of Japanese Learner English," in Proceedings of
the Sixth International Workshop on Linguistically
Interpreted Corpora, 2005, pp. 71-80.
[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining
Association Rules," in Proceedings of 20th International
Conference on Very Large Data Bases, 1994, pp. 487-499.
[9] T. M. J. Fruchterman and E. M. Reingold, "Graph
Drawing by Force-Directed Placement," Software-Practice
and Experience, volume 21(11), 1991, pp. 1129-1164.
[10] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen,
Classification and Regression Trees, Chapman and Hall,
New York, 1984.
[11] J. A. Hawkins and L. Filipović, Criterial Features in L2
English: Specifying the Reference Levels of the Common
European Framework, Cambridge University Press,
Cambridge, 2012.
[12] K. Zechner, D. Higgins, X. Xi, and D. Williamson,
"Automatic Scoring of Non-Native Spontaneous Speech in
Tests of Spoken English," Speech Communication, volume
51(10), 2009, pp. 883-895.



Appendix: Error tagset for the NICT JLE






Error type
Noun inflection
Noun number
Noun case
Countability of noun
Complement of noun
Verb inflection
Subject-verb disagreement
Verb form
Verb tense
Verb aspect
Verb voice
Usage of finite/infinite verb
Verb negation
Complement of verb
Adjective inflection
Usage of positive / comparative / superlative of adjectives
Adjective number
Number disagreement of adjective
Quantitative adjective
Complement of adjective
Adverb inflection
Usage of positive / comparative / superlative of adverb
Adverb position
Complement of preposition
Normal preposition
Dependent preposition
Pronoun inflection
Number/sex disagreement of pronoun
Pronoun case



Case of relative pronoun
Japanese English
Misordering of words
Unknown type errors
Unintelligible utterance