
Application of BIS POS Tagset for Sanskrit: Case of Verbs and Particles

Madhav Gopal
Centre for Linguistics
Jawaharlal Nehru University
New Delhi
mgopalt@gmail.com

Anil Pratap Giri
Department of Sanskrit
School of Humanities, Pondicherry University, Puducherry
apgiri.san@pondiuni.edu.in

Girish Nath Jha
Special Centre for Sanskrit Studies
Jawaharlal Nehru University
New Delhi
girishjha@gmail.com

Abstract: In this paper we address the issue of POS tagging of Sanskrit verbs and particles
using the BIS POS tagset. Traditionally, the number of grammatical categories for Sanskrit
varies from one to five. The language has been exhaustively described in the tradition, and this
description still prevails in today's grammar teaching. In such a situation, applying this tagset,
which is a new paradigm with respect to Sanskrit, is a challenge. The tagset has certain
subcategories for verbs and particles. We explore how these sub-tags are to be applied in
tagging a corpus of the language.
Keywords: POS tagging, tagset, morphology, Sanskrit, corpus, Pāṇinian grammar.

1. Introduction
The BIS POS Tagset is a national standard tagset for Indian languages that has recently been
designed by the POS standards committee at IIIT, Hyderabad. The tagset has 11 categories at
the top level, and each top-level category is further divided at subtype level 1 and subtype
level 2. The standard followed in this tagset takes care of the linguistic richness of Indian
languages. It is a hierarchical tagset and allows annotation of major categories along with
their types and subtypes. Most of its categories seem to have been adapted from either the MSRI
or the ILMT tagset. Morphological analysis is left to a morphological analyzer, so
morpho-syntactic features are not included in the tagset. The BIS scheme is comprehensive and
extensible and can spawn tagsets for Indian languages based on individual applications. It
captures appropriate linguistic information and also ensures the sharing, interchangeability
and reusability of linguistic resources. The Sanskrit-specific tagsets available so far, with
the exception of IL-POSTS, are not compatible with other Indian languages; they are flat and
brittle and do not capture the various kinds of linguistic information. IL-POSTS, a commendable
framework, captures various kinds of linguistic information in one go, and this, according to
the designers of the BIS tagset, makes the annotation task complex. It is also less well suited
to machine learning. The BIS tagset, as a middle path, is therefore suitable for tagging all
Indian languages (Gopal and Jha, 2011).

This paper explores how the BIS POS tagset can be used for tagging Sanskrit, focusing on verbs
and particles, as these items pose many problems when a Sanskrit corpus is tagged with this
tagset. In the tradition, the number of grammatical categories for Sanskrit varies from one to
five (Gopal et al. 2010), and various items have been described in various ways. In such a
situation, applying this tagset, which is a new paradigm with respect to Sanskrit, is a
challenge.

2. Sanskrit Verbs
Sanskrit verbs are generally classified into three categories: parasmaipada, ātmanepada and
ubhayapada. The difference between parasmaipada and ātmanepada is for the greater part only
a formal one. Many verbs are used in the parasmaipada but not in the ātmanepada, and vice
versa (Speijer, 1886). A verb having both kinds of forms is said to be ubhayapadī. In such a
verb, the parasmaipada form denotes that the fruit of the action goes to someone other than
the agent, whereas the ātmanepada form denotes that the fruit of the action goes to the agent
herself. There is a further classification into sakarmaka (transitive) and akarmaka
(intransitive) verbs. Their usage can also be divided into three categories: kartṛvācya,
karmavācya and bhāvavācya. They can again be classified into primary and derivative verbs,
depending on the type of verbal root. However, these classifications play no role in the BIS
paradigm: one has to work within the framework and apply the tags available in the tagset.
The category of verb is somewhat complicated in this framework. It has main and auxiliary
divisions under subtype level 1, and finite, non-finite, infinitive and gerund divisions under
subtype level 2, as the table below shows:

No.    Category     Tag    Hierarchical tag
4      Verb         V      V
4.1    Main         VM     V__VM
4.1.1  Finite       VF     V__VM__VF
4.1.2  Non-finite   VNF    V__VM__VNF
4.1.3  Infinitive   VINF   V__VM__VINF
4.1.4  Gerund       VNG    V__VM__VNG
4.2    Auxiliary    VAUX   V__VAUX

Table 1. Verb tags in BIS scheme

2.1 Main verb (VM)

At level 1, the main verb tag (VM) does not seem to be appropriate for Sanskrit. However, if
one insists on using it, it can be used to tag present-tense verbs followed by sma, and also
kta and ktavat pratyayāntas followed by an auxiliary; in that case the auxiliary verbs and sma
retain their Auxiliary tags. We have not used this tag in tagging the Sanskrit corpus.

2.2 Finite (VF)

All the conjugations of the dhātus are finite verbs (VF). However, when some of these forms
are used to express the aspectual meaning of a preceding kṛdanta, they will be tagged as
auxiliary, as stated above. In addition, kta and ktavat pratyayāntas will also be tagged as VF
when they are not followed by an auxiliary. As there is no separate tag for gerundives (like
kāryam, karaṇīyam, kartavyam), the VF tag can be applied to them as well.

/NNP /NNP /VF /PUNC /PRP /PRP H/NN 1/VF /PUNC
/NNP c /NNP /VF /PUNC

2.3 Non-finite (VNF)

kta and ktavat pratyayāntas (generally described as participles in the literature) will be
tagged as verb non-finite (VNF) when followed by an auxiliary. Other kṛdantas like śatṛ,
śānac and kānac will also get the same tag.

~/NNP 9 /NNP /VNF /PRP /PSP /NN */VF //PUNC
/RB /PRP /NNP /VNF 1/VAUX /PUNC

2.4 Infinitive (VINF)

Sanskrit infinitives differ from those of other Indian languages and of English. They
correspond to the infinitive of purpose in English and are formed by adding the tumun suffix
to the verb root. Only tumun pratyayāntas will be tagged as VINF.

/PRP /NNP /RPD /VINF /VF /PUNC

2.5 Gerund (VNG)

In the literature, ktvānta and lyabanta forms are described as gerunds. Such constructions
will therefore be labeled with the gerund (VNG) tag.

~/NNP /NNP /VNG 9 /NNP */VF /PUNC /RB /CCD
1| /PRF /NN /VNG q1 /NN */VF /PUNC

2.6 Auxiliary (VAUX)

In the language, some tiṅantas (verbal inflections of as, ās, sthā, kṛ, and bhū only) that
follow a kṛdanta to express its aspectual meaning will be tagged with the Auxiliary label.
The indeclinable sma will also get the same tag when it follows a verb in the present tense
and modifies the meaning of that verb.

/NST /CCD q/NNP /NNP /PSP *| /NN
/VNF /VAUX /PUNC 1 /DMD /NN /NNP /JJ /NN
9/VF /VAUX /PUNC /PRP /RB /NNP /VNF /VAUX
/PUNC
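
The verb-tagging decisions described in Sections 2.1 to 2.6 can be summarized as a small rule table. The following is only a minimal sketch in Python, not the authors' tool: it assumes that a morphological analyzer has already identified the kind of form and its root, and the names `Form`, `AUX_ROOTS` and `verb_tag` are hypothetical. The fallback tag for sma outside the auxiliary context is also an assumption, based on the default-particle rule for avyayas.

```python
from enum import Enum, auto


class Form(Enum):
    """Kinds of verbal form; assumed to come from a morphological analyzer."""
    TINANTA = auto()            # conjugated (finite) form of a dhātu
    KTA_KTAVAT = auto()         # kta / ktavat pratyayānta
    OTHER_PARTICIPLE = auto()   # śatṛ, śānac, kānac kṛdantas
    GERUNDIVE = auto()          # e.g. kāryam, karaṇīyam, kartavyam
    TUMUN = auto()              # tumun pratyayānta (infinitive)
    KTVA_LYAP = auto()          # ktvānta / lyabanta (gerund)
    SMA = auto()                # the indeclinable sma


# Roots whose inflections may act as auxiliaries after a kṛdanta (Section 2.6).
AUX_ROOTS = frozenset({"as", "ās", "sthā", "kṛ", "bhū"})


def verb_tag(form, root=None, after_krdanta=False,
             before_aux=False, after_present_verb=False):
    """Return the BIS verb tag for one word, following Sections 2.1-2.6."""
    if form is Form.TINANTA:
        # Auxiliary only when an eligible root follows a kṛdanta.
        if root in AUX_ROOTS and after_krdanta:
            return "VAUX"
        return "VF"
    if form is Form.KTA_KTAVAT:
        return "VNF" if before_aux else "VF"
    if form is Form.OTHER_PARTICIPLE:
        return "VNF"
    if form is Form.GERUNDIVE:
        return "VF"   # no separate gerundive tag in the tagset
    if form is Form.TUMUN:
        return "VINF"
    if form is Form.KTVA_LYAP:
        return "VNG"
    if form is Form.SMA:
        # Assumption: elsewhere sma falls back to the default particle tag.
        return "VAUX" if after_present_verb else "RPD"
    raise ValueError(f"unknown form: {form!r}")
```

The VM tag is deliberately never returned, in line with the decision not to use it for the Sanskrit corpus.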

3. Sanskrit Particles and Conjunctions
Particles are a very important category for Sanskrit: they are of many kinds, play many roles
and are used for a number of purposes. Some of the indeclinables described as avyayas in the
tradition fall in this category. The Sanskrit conjunctions are also described as avyayas in
the tradition, so we treat the two categories together here in order to understand them
clearly. In the tagset, the Particle category has default, classifier, interjection,
intensifier and negation subtypes, whereas the Conjunction category has co-ordinator and
subordinator subtypes at level 1 and a quotative subtype at level 2.

No.    Category       Tag    Hierarchical tag
8      Conjunction    CC     CC
8.1    Co-ordinator   CCD    CC__CCD
8.2    Subordinator   CCS    CC__CCS
8.2.1  Quotative      UT     CC__CCS__UT
9      Particles      RP     RP
9.1    Default        RPD    RP__RPD
9.2    Classifier     CL     RP__CL
9.3    Interjection   INJ    RP__INJ
9.4    Intensifier    INTF   RP__INTF
9.5    Negation       NEG    RP__NEG

Table 2. Conjunction and particle tags in BIS scheme

3.1 Default Particle (RPD)

In the current system this tag is applied to all avyayas which do not have a specific tag in
this framework. This will include the avyaya types T, , and 9 .

/RPD /PRQ /VF ?/PUNC /JJ /RPD /PRP ?/PUNC /RPD
/VF /PRP ?/PUNC /INJ ,/PUNC /PRP /RPD 7 /VINF 4/VF
/PUNC

3.2 Classifier Particle (CL)

The classifier tag is not applicable to Sanskrit and can be removed.

3.3 Interjection (INJ)

Words that express emotion are interjections, as are the particles used to get people's
attention, e.g., , , , , 1, , etc.

/INJ /NN !/PUNC /PRP /PRQ /VF ?/PUNC

3.4 Intensifier (INTF)

Adverbial elements with an intensifying role are intensifiers. They can be either positive
or negative. , , 7, 77 etc. will fall in this category.

/PRP +/VNG /VF /PRP /INTF /PUNC

3.5 Negation (NEG)

The indeclinables which are used to express negative meaning are treated under this category.

7/NN /NEG /VF /PUNC /PRP 7 /PRP /NEG */VF /PUNC
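
The tagged examples in this paper follow a token/TAG convention, with items separated by whitespace. As a rough illustration of how such a line could be read programmatically, the sketch below splits each item at its last slash, so that a token which is itself a slash (as in `//PUNC`) is still handled. The function name `parse_tagged` and the transliterated sample sentence are hypothetical and are not taken from the corpus.

```python
def parse_tagged(line):
    """Split a 'token/TAG token/TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits at the LAST slash, so '//PUNC' gives ('/', 'PUNC').
        token, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"no tag separator in {item!r}")
        pairs.append((token, tag))
    return pairs


# Made-up, transliterated sample for illustration only.
sample = "rāmaḥ/NNP vanam/NN gacchati/VF ।/PUNC"
for token, tag in parse_tagged(sample):
    print(token, tag)
```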

3.6. Conjunction (CC)

Conjunction is a major category in the tagset and has co-ordinator, subordinator and quotative
as subtypes. We first have to list the conjunctions belonging to each of these subcategories
and then tag accordingly.

3.6.1 Co-ordinator (CCD)

The conjunctions that join two or more items of equal syntactic importance will be assigned
the CCD label. The list mainly includes , , , and .

/NN /NN /CCD F /UNK /VF /PUNC

3.6.2 Subordinator (CCS)
The conjunctions that introduce a dependent clause are subordinators. The conjunctions , ,
etc. will be labelled as CCS.

/NNP /VF /CCS /PRP /NN */VF /PUNC

3.6.3 Quotative (UT)

The subordinators have a further subtype, 'quotatives'. Quotatives occur in many languages
and have the role of conjoining a subordinate clause to the main clause. They have therefore
been included at the third level of the hierarchy within Conjuncts. However, it is left to
each language to decide whether to go to this level of granularity or to remain at the higher
level, keeping only a two-level hierarchy for Conjuncts.
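
Because the hierarchical labels encode their levels with a double underscore (for example CC__CCS__UT), staying at a coarser level amounts to truncating the label. The following is a minimal sketch of that idea, assuming labels are handled as plain strings; the helper names `collapse` and `short_tag` are hypothetical and not part of the BIS specification.

```python
SEP = "__"  # level separator used in labels such as CC__CCS__UT


def collapse(label, depth):
    """Keep only the first `depth` levels of a hierarchical label."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return SEP.join(label.split(SEP)[:depth])


def short_tag(label):
    """Return the most specific (last) level of a hierarchical label."""
    return label.split(SEP)[-1]


# A language that keeps only a two-level hierarchy for Conjuncts:
print(collapse("CC__CCS__UT", 2))   # CC__CCS
print(short_tag("V__VM__VF"))       # VF
```

Labels that are already at or above the requested depth are returned unchanged, so the same function can be applied uniformly to a whole corpus.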

"/PUNC /PRP 7 /VF /NN "/PUNC /UT /PRQ 4 /VF ?/PUNC

4. Conclusion

We have described above how the BIS tagset can be used for tagging Sanskrit verbs and
particles. The main problem that human annotators, especially those trained in the traditional
grammatical system, are likely to face is identifying the exact category of a Sanskrit word
while tagging a corpus. It is often difficult for them to select a specific category for a
particular word, and they are prone to use a generic tag where a specific tag should be
applied. They need to be trained rigorously in order to be comfortable with the newly
introduced terms for describing their language. From the above account it appears that this
initiative will enrich Indian NLP and help remove the language barriers between different
linguistic communities, not only in India but across the world. Uniformity in tagging all
Indian languages will help in identifying linguistic differences and similarities among them,
and thus facilitate other NLP and linguistic research.

Moreover, the corpus annotated with this tagset would be more useful as it is tagged by a
standard tagset or paradigm. This will ensure the maximal use and sharing of the tagged data.
The initiative for tagging Indian languages with the present standard tagset is a promising effort
in this direction with the hope that all Indian language corpora annotation programmes will
follow these linguistic standards for enriching their linguistic resources.

References
Baskaran, S., Bali, K., Bhattacharya, T., Bhattacharyya, P., Choudhury, M., Jha, Girish Nath,
Rajendran, S., Saravanan, K., Sobha L., Subbarao, K.V.: A Common Parts-of-Speech
Tagset Framework for Indian Languages. LREC, Marrakesh, Morocco (2008).
Gopal, Madhav and Jha, Girish N.: Tagging Sanskrit Corpus Using BIS POS Tagset. In: Singh,
C., Lehal, G.S., Sengupta, J., Sharma, D.V., and Goyal, V. (eds.) Proceedings of the
International Conference, ICISIL 2011, Patiala, India, March 9-11, 2011, CCIS 139 pp.
191-194, Heidelberg: Springer.
Chandrashekar, R.: Parts-of-Speech Tagging For Sanskrit. Ph.D. thesis submitted to JNU, New
Delhi (2007)
Gopal, Madhav, Mishra, Diwakar and Singh, Priyanka Devi.: Evaluating Tagsets for Sanskrit. In:
Jha, Girish Nath (ed.) Proceedings of the Fourth International Sanskrit Computational
Linguistics Symposium, Dec.10-12, 2010, Heidelberg: Springer.
IIIT-Tagset. A Parts-of-Speech tagset for Indian Languages.
http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
Jha, Girish Nath, Gopal, Madhav, Mishra, Diwakar.: Annotating Sanskrit Corpus: adapting
IL-POSTS. In: Z. Vetulani (ed.) Proceedings of the 4th Language and Technology
Conference: Human Language Technologies as a challenge for Computer Science and
Linguistics, pp. 467-471 (2009)
Ramkrishnamacharyulu, K.V.: Annotating Sanskrit Texts Based on Sabdabodha Systems. In:
Kulkarni, A. and Huet G. (eds) Proceedings of the Third International Symposium on
Sanskrit Computational Linguistics, pp. 26-39 (2009), Heidelberg: Springer.
Kale, M.R.: A Higher Sanskrit Grammar. MLBD Publishers, New Delhi (1995)
Speijer, J.S.: Sanskrit Syntax. Motilal Banarsidass Pvt. Ltd., New Delhi (1886, repr.
2006)
