LANGUAGE RESOURCES FOR COMPUTER ASSISTED TRANSLATION FROM ITALIAN TO ITALIAN SIGN LANGUAGE OF DEAF PEOPLE
Davide Barberis, Nicola Garazzino, Paolo Prinetto, Gabriele Tiotto, Alessandro Savino, Umar Shoaib, Nadeem Ahmad
Politecnico di Torino, Italy
ABSTRACT
This paper discusses the use of language resources for Computer Assisted Translation (CAT) from Italian to Italian Sign Language (LIS). It gives an overview of a CAT system whose pipeline allows a user to obtain the translation of an Italian text as an animation of a virtual avatar signing in LIS. The paper describes the architecture of the translation system and how the language resources are integrated in the platform. It analyzes the features that characterize a dictionary linking two languages that differ both in their features and in their communication channels.

KEYWORDS
Deaf, Sign Language, Assisted Translation, Virtual Avatars, Sign Language Synthesis.

INTRODUCTION
Sign Languages (SLs) are visual languages used by deaf people to convey meaning. SLs rely on signs as lexical units, instead of the words used in spoken languages. Italian deaf people use Italian Sign Language (LIS) within their communities, and it can be considered the main means of communication of 60,000 Italian deaf individuals [1]. Inclusion of deaf people in society aims at providing them with access to services in their own language and at spreading sign language knowledge among communities. In this context, Machine Translation (MT) provides a way to translate written language into the visual language, just as it does from one written language to another. Research on sign language MT deals with several issues, generally related to both the translation and the generation of signs. Most of the translation approaches (e.g., television closed captioning or teletype telephone services) assume that the viewer has strong literacy skills [2]. Reading captions can thus be difficult for deaf viewers, even if they are fluent in sign language. In these cases, if a live interpreter is not feasible, deaf people are left without a translation.
A new approach, aimed at building a system able to produce both the translation and the visualization of signs at the same time, is thus very important. This paper describes a system for computer-assisted translation from Italian to LIS that delivers the translation output through a virtual avatar. It focuses on the linguistic knowledge and resources required for its development. Section 2 reviews the state of the art of sign language translation. Section 3 describes the approach to the design of a system for visual language translation. Section 4 details the architecture of the ATLAS project [3] and deals with the interaction between the tool and the language resources. Section 5 gives a set of conclusions and future perspectives.

STATE OF THE ART
The application of MT to SL is rather recent. Several projects targeting SL capture and synthesis have been developed, and important results have been achieved in improving the accessibility of the hearing world to deaf people. The VISICAST translator [2] and the eSIGN [3] project defined the architecture for sign language processing and for gesture visualization with a virtual character, using motion capture and synthetic generation to create signs. BlueSign [4], Zardoz [5] and SignSmith Studio [6] are additional examples of working systems. BlueSign targets LIS translation without taking LIS grammar into account. Some examples of translation to other sign languages are reported, most of them targeting the translation of single words or fingerspelling. Other examples rely on statistical or example-based techniques to perform translation [9], [10]. An MT system that translates to LIS using LIS grammar has not been reported in the literature. In [11] the authors analyze different statistical translation techniques applied to sign language and conclude that some of them give reasonably good results. Typical procedures, such as scaling factor optimization and selection of the best alignment, work reasonably well but can be improved. More sophisticated algorithms, such as syntactic enhancements, fail to improve over the baseline. The general unavailability of language resources for sign language translation is the main issue reported.
The available corpora and lexicons are not comparable with the resources used for written language translation. The creation and annotation of such resources (parallel corpora, lexicons, etc.) provides important material for testing linguistic hypotheses. The way the corpora are generated has to be considered because it influences both the quality of the annotations and the related information. Recent research on corpus creation has underlined that corpora need to be representative, well documented and machine-readable [12]. Several examples of corpora that satisfy these requirements are now listed in the literature [19].

The DGS Corpus project [15] aimed at creating a general-purpose corpus of German Sign Language (DGS) to study lexical variation, semantics, morphology and phonology. It collects data (340-400 h scheduled) from dialogues, free conversation, elicited narratives and elicited lexical items, all performed by fluent DGS signers. Another example for German Sign Language is the Berlin corpus, which collects formal structures of DGS and focuses on the interdependency of spoken and signed language; its data are dialogues and conversations in natural settings. The Deutschschweizerische Gebärdensprache (DSGS) corpus [16] collects 18 h of group discussions in Swiss-German Sign Language. Another important work on Swiss-German Sign Language is the Multimedia Bilingual Databank, which collects 60 h of informal group discussions on given topics (medicine, sport, schooling, etc.). The NGT Corpus [17] for Dutch Sign Language collects 2375 videos (as of September 2010), for a total of about 72 h of dialogues and narratives signed by 92 native signers. The corpus is fully transcribed at gloss level. The ECHO Project [18] is an example of a multilingual sign language corpus: it collects data in Dutch, British and Swedish Sign Languages, with 30 min of video per sign language transcribed at gloss level. A rich sign language lexical corpus is the SignLex Corpus, which includes 5616 videos of elicited lexical items signed by native signers; its domain is technical terms for educational purposes.

Statistical approaches are used for Computer Assisted Translation (CAT) [19] as well. In these systems, statistical approaches help the user while performing manual translation. Since significant intervention by the user is required, these systems are generally more robust than automatic ones. To our knowledge, they all miss the final step of translation: the visual adaptation, e.g., by virtual avatars.

THE ATLAS PLATFORM
The ATLAS system (Figure 1) is designed to take a written text as input and to perform the translation using two different translators: a statistical one and a rule-based one. The statistical translator is based on MOSES [8], an open source statistical translation toolkit that automatically trains translation models for any language pair. The rule-based translator follows a traditional rule-based approach: the input sentences are interpreted in terms of an ontology-based logical representation, which acts as input to a linguistic generator that produces the corresponding LIS sentence. The sentence is defined through a formalism called ATLAS Written LIS (AWLIS). A LIS sentence is basically a sequence of annotated glosses carrying a set of additional syntactic information. The semantic-syntactic interpreter relies on an ontology modeling the weather forecasting domain.
The interpreter performs the analysis of the syntactic tree and builds a logical formula by looking at the semantic role of the verbal dependents [22].
Figure 1 Architecture of the ATLAS Systems
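As an illustration of the AWLIS idea described above (a LIS sentence as a sequence of annotated glosses), the following Python sketch models such a sentence. It is only a sketch under stated assumptions: the class names, field names, gloss labels and annotation keys are hypothetical, and the actual AWLIS format is not specified in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Gloss:
    """One annotated gloss. All field names are hypothetical."""
    lemma: str                                  # gloss label (illustrative)
    lexicon_id: Optional[int] = None            # link to a lexicon entry, if resolved
    annotations: Dict[str, str] = field(default_factory=dict)  # extra syntactic info


@dataclass
class LisSentence:
    """A LIS sentence as an ordered sequence of annotated glosses."""
    glosses: List[Gloss]

    def unresolved(self) -> List[str]:
        """Return the lemmas that are not yet linked to a lexicon entry."""
        return [g.lemma for g in self.glosses if g.lexicon_id is None]


# Invented example sentence; labels, ids and roles are not taken from the corpus.
sentence = LisSentence([
    Gloss("DOMANI", lexicon_id=12, annotations={"role": "time"}),
    Gloss("PIOGGIA", annotations={"role": "subject"}),
])
```

Keeping the lexicon link optional makes it easy to list the glosses that still need a manual link before the sentence can be animated.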
The information extracted by the interpreter is collected in an AWLIS sentence, which is sent to the animation module. This module is composed of:
1. A Planner: it takes the signs and places them in the signing space according to relocation strategies.
2. An Animation Interpreter (AI): it takes the AWLIS and the sign planning and renders the signs by smoothing them together.

The animation engine retrieves from a database a basic form of the sign for each lexical entry specified in the AWLIS. Once the basic form is retrieved, it applies a set of modifiers to it, e.g., facial expression, automatic relocation, body movements. The Signary, henceforth referred to as the lexicon, is the ATLAS lexical database that collects all the basic forms of the signs, performed by the avatar, in video format.

Translation Reliability
Sign language translation is error prone. By increasing the amount of language resources, such as the lexicon and training corpora, we expect to achieve a good translation level. Currently, reliable translation is still a long way off because our corpus contains only 40 translated weather forecasts. To improve translation quality, the user has to be able to correct translation mistakes. The ATLAS Editor for Assisted Translation (ALEAT) is the CAT tool developed within the ATLAS project. It translates an Italian text into ATLAS Extended Written LIS (AEWLIS) through a set of manual translation steps. AEWLIS is an improved version of AWLIS. ALEAT provides access to the parallel corpus in order to retrieve a set of feasible translations. The user selects the most suitable translation and corrects any mistakes through the user interface. ALEAT also gives access to the lexicon in order to connect each lemma within the AEWLIS with its corresponding entry in the lexicon. This allows the AI to play the basic forms of the signs directly, taking the AEWLIS as input. All the modifiers are specified by means of a tagging window.
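The retrieval of feasible translations from the parallel corpus can be pictured with a small Python sketch. The paper does not state how ALEAT ranks candidates, so this example assumes plain string similarity from the standard `difflib` module; the corpus entries, the function name `candidate_translations` and its parameters are invented for illustration.

```python
import difflib

# Toy parallel corpus: Italian sentence -> gloss sequence (both invented).
PARALLEL_CORPUS = {
    "domani piove al nord": "DOMANI NORD PIOGGIA",
    "oggi sole al sud": "OGGI SUD SOLE",
    "domani nebbia al nord": "DOMANI NORD NEBBIA",
}


def candidate_translations(text, corpus=PARALLEL_CORPUS, limit=2, cutoff=0.5):
    """Return up to `limit` (source, translation) pairs, most similar first.

    Similarity is difflib's ratio between the input text and each stored
    source sentence; this ranking is an assumption, not ALEAT's method.
    """
    matches = difflib.get_close_matches(
        text.lower(), list(corpus), n=limit, cutoff=cutoff
    )
    return [(source, corpus[source]) for source in matches]


# The user would pick one of the candidates and correct it by hand.
candidates = candidate_translations("domani piove al sud")
```

An empty result means that no stored sentence is close enough, in which case the user would translate the sentence manually.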
ALEAT provides a user-friendly interface to perform all operations of the CAT process.

THE ATLAS CORPUS
A complete description of the ATLAS corpus is beyond the scope of this paper. An introduction has been provided in [20], while results on the extraction of sign relocation data from the ATLAS corpus are provided in [22]. The collected corpus consists of the translation from Italian into LIS of 40 weather forecast bulletins recorded from the national broadcasting network. The corpus has been annotated using the ATLAS Editor for Annotation (ALEA), a web application developed and used within the ATLAS project for the annotation of video content [25], [26]. The research teams involved in the ATLAS project are performing studies on the ATLAS corpus in order to extract useful data related to LIS linguistic phenomena such as relocation, co-articulation and iconicity. So far, the relocation process has been studied [22], and the results help in designing the Planner module. This corpus has been used to train the ATLAS statistical translator that, along with the rule-based translator, is involved in the core translation of the ATLAS pipeline.

The ATLAS corpus includes a lexicon of 3063 signs as high quality videos. Each video is linked to its translation in Italian. Meaningful fields include definitions and semantic domains. The lexicon contains two sets of signs: general-purpose and weather forecast related. The former includes general-purpose domain signs. Although it contains signs from the Rome area along with some variations used in other regions of Italy, our informants suggested that this set of signs is currently used as a standard lexicon in Italy. The second set of signs includes:
- 113 new signs, mostly related to the weather forecast domain
- 73 variations of signs from the LIS Dictionary
- 214 new standard LIS signs not present in the LIS Dictionary

WordNet Integration
The dictionary implementation is the main issue of a translation system. Since each natural language word can be related to a set of synonyms, the relation among words of different languages is not unique. Synonyms can be very useful when the corpus is not fully annotated. WordNet [24] is a dictionary that handles relations among words by defining the synset, which models a single sense of a headword. Thus, words sharing the same synset can be regarded as synonyms. Including the WordNet platform as part of the translation process requires storing the synset data in the lexicon too. The architecture performs a set of operations to translate an Italian word into a LIS sign (Figure 2). The search can result in three different scenarios:
1. The input word is not found.
2. The input word has only one correspondence.
3. The input word has multiple correspondences.
Figure 2 The Wordnet Integration Flow
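The lookup flow of Figure 2 can be sketched in Python as follows. This is a minimal sketch, not the ATLAS implementation: the lexicon layout, the sign and synset identifiers, the synonym table standing in for a WordNet query, and the function name `lookup` are all assumptions made for this example.

```python
# Toy lexicon: Italian lemma -> list of (sign_id, synset_id). All ids are invented.
LEXICON = {
    "pesca": [("SIGN_PEACH", "syn:peach"), ("SIGN_FISHING", "syn:fishing")],
    "pioggia": [("SIGN_RAIN", "syn:rain")],
}

# Toy synonym table standing in for a WordNet search (entries are illustrative).
SYNONYMS = {"acquazzone": ["rovescio", "pioggia"]}


def lookup(word, lexicon=LEXICON, synonyms=SYNONYMS):
    """Return (status, candidates) for an Italian word.

    status is 'not_found', 'single' or 'multiple'; candidates is the list of
    (sign_id, synset_id) pairs from which the user may have to choose.
    """
    entries = list(lexicon.get(word, []))
    if not entries:
        # Scenario 1: the word is missing, so try synonyms already in the lexicon.
        for synonym in synonyms.get(word, []):
            entries.extend(lexicon.get(synonym, []))
    if not entries:
        return ("not_found", [])
    if len(entries) == 1:
        # Scenario 2: immediate translation (may still be reviewed by the user).
        return ("single", entries)
    # Scenario 3: the user disambiguates using the synset of each candidate.
    return ("multiple", entries)
```

For instance, `lookup("pesca")` reports multiple candidates, and the synset identifier attached to each one is what the user would inspect to choose the intended sign.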
In the first scenario, the synset information associated with the word becomes the entry for a WordNet search. If any of the word's synonyms is already in the lexicon, cases 2 and 3 can arise. The second case gives an immediate translation, but it can also be supervised to check whether more than one LIS sign is linked to the original word. In this case, the user can manually select the preferred LIS sign. The third case occurs when the original word needs disambiguation, since some Italian words have multiple meanings (e.g., the Italian word "pesca" may refer both to the peach and to the verb "to fish"). In this case, using the synset data associated with the retrieved results, the user can select the right LIS sign.

CONCLUSION AND FUTURE WORK
In this paper we proposed the use of lexical resources and corpora in the CAT of sign languages. We described the workflow of our CAT tool, ALEAT. The main improvement introduced by the paper is the integration of the WordNet infrastructure, which compensates for the limited number of annotated words (using synonyms) and supports the disambiguation of word meanings. The tool is currently being integrated within the ATLAS platform. Future work aims at testing the intelligibility of translations produced with ALEAT and at comparing it with purely automatic translation.

ACKNOWLEDGEMENTS
The work presented in this paper has been developed within the ATLAS (Automatic Translation into sign LAnguageS) Project, co-funded by Regione Piemonte within the "Converging Technologies - CIPE 2007" framework (Research Sector: Cognitive Science and ICT).
REFERENCES
[1] Eud homepage: http://www.eud.eu/italy-i-187.html.
[2] M. Huenerfauth (2003), A survey and critique of American Sign Language natural language generation and machine translation systems, Technical Report MS-CIS-03-32.
[3] ATLAS project website: http://www.atlas.polito.it/.
[4] Visicast project website: http://www.visicast.co.uk/.
[5] e-Sign project website: http://www.sign-lang.uni-hamburg.de/esign/.
[6] Blue sign partners: Blue sign translator. Available at http://bluesign.dii.unisi.it/.
[7] T. Veale and A. Conway, Cross Modal Comprehension in ZARDOZ.
[8] S. Morrissey, A. Way, D. Stein, J. Bungeroth, and H. Ney (2007), Towards a hybrid data-driven MT system for sign languages, in Machine Translation Summit, Copenhagen, Denmark, Sept. 2007, pp. 329-335.
[9] G. Masso and T. Badia (2010), Dealing with sign language morphemes for statistical machine translation, in 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, Valletta, Malta, May 2010, pp. 154-157.
[10] D. Stein, J. Bungeroth, and H. Ney (2006), Morpho-Syntax Based Statistical Methods for Sign Language Translation, in 11th Annual Conference of the European Association for Machine Translation, Oslo, Norway, June 2006, pp. 169-177.
[11] G. Tiotto (2011), ATLAS: An Italian to Italian Sign Language Translation System Through Virtual Characters, PhD thesis, Politecnico di Torino, April 2011.
[12] DGS corpus webpage: http://www.sign-lang.uni-hamburg.de/dgs-korpus/index.php/welcome.html.
[13] DSGS corpus webpage: http://www.hfh.ch/projekte_detail-n70-r76-i574-sE.html.
[14] NGT corpus webpage: http://www.ru.nl/corpusngtuk/.
[15] ECHO project webpage: http://echo.mpiwg-berlin.mpg.de/home.
[16] S. Barrachina, O. Bender, F. Casacuberta, J. Civera, E. Cubel, S. Khadivi, A. Lagarda, H. Ney, J. Tomas, E. Vidal, et al. (2009), Statistical approaches to computer-assisted translation, Computational Linguistics, 35(1): 3-28.
[17] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst (2007), Moses: Open Source Toolkit for Statistical Machine Translation, Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
[18] A. Ruggeri, C. Battaglino, G. Tiotto, C. Geraci, D. Radicioni, A. Mazzei, R. Damiano, L. Lesmo (2011), Where should I put my hands? Planning hand location in sign languages, in Workshop on Computational Models for Spatial Language Interpretation and Generation, Boston, USA, 2011 (to appear).
[19] N. Bertoldi, G. Tiotto, P. Prinetto, E. Piccolo, F. Nunnari, V. Lombardo, A. Mazzei, R. Damiano, L. Lesmo, A. Del Principe (2010), On the creation and the annotation of a large-scale Italian-LIS parallel corpus, in 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies (CLSTLREC 2010), La Valletta, Malta, 22-23 May 2010.
[20] D. Barberis, N. Garazzino, P. Prinetto, G. Tiotto (2010), Avatar Based Computer Assisted Translation from Italian to Italian Sign Language of Deaf People, Proceedings of the 3rd International Conference on Software Development for Enhancing Accessibility and Fighting Info-exclusion, Oxford, UK, November 25-26, 2010, pp. 53-59.
[21] D. Barberis, N. Garazzino, E. Piccolo, P. Prinetto, G. Tiotto (2010), A Web Based Platform for Sign Language Corpus Creation, Lecture Notes in Computer Science, Vol. 2, pp. 193-199.
[22] Wordnet web site: http://multiwordnet.fbk.eu/english/home.php
