
Loquendo TTS 7 User's Guide

Introduction
Loquendo TTS is a multilingual/multivoice text-to-speech (TTS) synthesizer, able to convert any written
text into truly natural-sounding speech, in a large variety of languages and voices.

Loquendo TTS has a modular architecture, based on a core engine and a number of language and voice
modules, each shipped in a package including relevant libraries and data. Loquendo TTS is available for telephony,
multimedia and embedded applications, all with the same engine and wide range of languages and voices. The
system is implemented via software only, without requiring specialized hardware. The synthesized speech can
be output to a multimedia audio board, a telephone card or a file.

All Loquendo TTS features are accessed by a set of legacy APIs that allow control of every aspect of the TTS
process. The standard Microsoft Speech APIs (SAPI 4 and SAPI 5) are also supported. Instructions on the use of
the APIs for developing applications with Loquendo TTS (as well as on software and hardware requirements) are
given in the Loquendo TTS Programmer's Guide.

An SDK package, including a number of sample applications and GUIs, is provided to help the user develop
applications with Loquendo TTS. The SDK also includes the application 'Loquendo TTS Director' which provides the
user with tools for designing prompts to be synthesized. All the Loquendo TTS documentation (including this guide)
is also included in the SDK.

This guide is intended for the Loquendo TTS user who wishes to prepare texts to be synthesized into speech, and
who wants to tune the system parameters in order to obtain the best possible synthesis performance for the
application domain.

The user will approach Loquendo TTS via one of the above-mentioned Loquendo GUIs, or via a third-party
application. Most likely, he or she will be responsible for writing the texts that will subsequently be input into the
Loquendo TTS system. In that case, once the correct language is selected, the main rule to follow is:

"Write texts according to the standard orthographic and grammatical rules of the language"

Loquendo TTS is designed to be able to read out any text and to deal with any combination of characters,
achieving a high level of pronunciation accuracy and naturalness; an even higher standard of pronunciation will be
achieved, however, if the text is written correctly, is well punctuated, and contains no invalid characters. Even so,
the reading process is full of pitfalls and difficulties of which the human reader (and writer) is usually unaware, and
a written text may sometimes need to be interpreted and even modified before it can be read both correctly and
expressively. Consequently, the reading by the TTS system may occasionally sound inappropriate or unsatisfactory.
This guide is intended to offer suggestions on improving the TTS rendering of a text by means of careful sentence
construction and a sensible use of the Loquendo TTS advanced features and controls.

The rest of the document is organized as follows:

Chapter 1 - Base Notions gives a few basic concepts about TTS conversion, suggesting what to expect from
a TTS system, how to overcome its drawbacks and how to optimize texts to achieve a high quality reading.
Chapter 2 - Overview of Loquendo TTS Features illustrates the architecture of the system and some crucial
aspects of its reading performance.
Chapter 3 - Loquendo TTS User Controls describes in detail how to modify and tune the reading process by
controlling its parameters.
Appendix A - XML Support describes the support Loquendo TTS offers to the XML-based languages SSML
and VoiceXML.
Appendix B - MRCP Support describes the support Loquendo TTS offers to MRCP.
© 2001-2009 Loquendo. All Rights Reserved.

Base Notions

What is text-to-speech?

Text-to-speech (TTS) conversion simulates human reading behavior, where sequences of written symbols
(graphemes) are first interpreted as the words and sentences of a given language and then uttered as a sequence
of sounds typical of that language (phonemes), with pauses to separate phrases and suitable intonation to mark
intention and meaning.

Such simulation is achieved by a series of transformations of the input character sequence, according to linguistic
knowledge and the recorded speech databases.

The written text is first transcribed into a sequence of phonetic symbols, according to the grammar, the lexicon and
the phonetic rules of the language.

Then, the phoneme sequences are mapped onto the vocal database of the voice, which consists of a large
collection of utterances recorded by a professional speaker and carefully labeled with phonetic and prosodic tags.
The closest matching speech fragments are extracted from the vocal database and concatenated to form the
desired speech output. The intonation and context of these fragments are taken into account during their selection.
The same phoneme sequence can be rendered by different speech fragments depending on its context (e.g. the
word “all” would be rendered by different speech fragments in its two occurrences in the phrase “all the best to all
of you”).

In more detail, the conversion from text into speech is achieved in the following steps:

Split the character sequence into words and sentences (based on blank spaces and punctuation).
Interpret and expand numbers, abbreviations, symbols and acronyms, by means of tables and rules.
Assign grammatical tags to words and resolve ambiguities on the basis of context.
Assign pauses and intonation, based on punctuation and on a minimal syntactic parse.
Assign lexical stress and a phonetic transcription to each word, via rules or lexicons.
Transcribe the entire phrase into phonemes and assign prosodic labels.
Search the vocal database for the longest phoneme sequences matching the phrase transcription, taking
account of context.
Concatenate the corresponding speech signal fragments.
Apply signal processing for concatenation smoothing and minor prosodic variations.
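The steps above can be sketched as a toy pipeline. This is purely illustrative: the function names and the tiny grapheme-to-phoneme dictionary below are invented for the example, and the fragment-selection and signal-processing stages of the real engine are elided.

```python
import re

# Hypothetical toy grapheme-to-phoneme lexicon (not the real vocal database).
G2P = {"all": "O:l", "the": "D@", "best": "bEst", "to": "tu:", "of": "Qv", "you": "ju:"}

def split_words(text):
    # Step 1: split the character sequence into words (blanks and punctuation).
    return re.findall(r"[A-Za-z']+", text.lower())

def transcribe(words):
    # Steps 5-6: assign a phonetic transcription to each word via the lexicon;
    # a real system falls back on letter-to-sound rules for unknown words.
    return [G2P.get(w, w) for w in words]

def synthesize(text):
    # Steps 7-9 (fragment search, concatenation, smoothing) are elided here;
    # we simply join the phoneme strings to show the transcription stage.
    return " ".join(transcribe(split_words(text)))

print(synthesize("All the best to all of you."))
# O:l D@ bEst tu: O:l Qv ju:
```

Note that in the real engine the two occurrences of "all" would be rendered by different speech fragments depending on context; this sketch stops at the phonetic transcription.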

These processes should be kept in mind when judging pronunciation accuracy and the quality of the speech output.
When it comes to interpreting text, the system is only as intelligent as the information expressed in its rules and
lexicons, and can only produce sounds and intonations which have been recorded in the vocal database.
Furthermore, one should consider the widespread ambiguity intrinsic to any natural language, where sounds and
meaning depend on context and where the written representation is always underspecified with respect to the
corresponding speech.

Suggestions for an optimal use of TTS

The TTS process exploits only a subset of the complex knowledge base on which a human reader implicitly relies.
While it can access grammatical and phonetic knowledge, the artificial system does not come to a true
comprehension of the text, lacking the necessary semantic and pragmatic skills.

This is why the system cannot deal with ambiguous or misspelled text, nor give different emotional colors to its
voices according to text semantics. The system tries to pronounce exactly what is written, applying the standard
orthographic conventions for interpreting characters, symbols, numbers, word and sentence delimiters and punctuation.
The cues to a proper intonation are mainly punctuation marks and syntactic relationships between words.

This means that the best synthesis results will be obtained with well-formed sentences, correct and standard
orthography, unambiguous contexts and rich and appropriate punctuation.

If you are able to prepare or select in advance the texts that will be fed into the TTS system, then the main rule to
follow, once the correct language has been selected, is:

"Write texts according to the standard orthographic and grammatical rules of the language"

In more detail, we suggest that you keep to the following simple guidelines:

Spell words carefully (using the correct character set for the language).
Use capital letters when grammatically appropriate and apply standard conventions for representing numbers
and abbreviations.
Separate words according to the standard orthographic conventions (insert blanks between words and after
punctuation marks, when appropriate).
Avoid ambiguities.
Write short sentences with correct syntactic structure.
Insert punctuation marks frequently and carefully.
In the event of pronunciation errors, directly insert the correct phonetic transcription.
In the event of acoustic defects (or persisting pronunciation errors) just rephrase the text, by changing word
order or punctuation marks (remember that the same word can be rendered by different speech fragments,
depending on its position in the sentence). You can also tune the selection of speech fragments.
To fine-tune TTS parameters, apply the Loquendo TTS user controls.




Overview of Loquendo TTS Features

Languages and voices

Loquendo TTS has a modular architecture, based on a core engine and a number of separate language and voice
modules, enabling the implementation of linguistic rules and acoustic data that differ greatly from language to
language and from voice to voice. The same text will sound completely different according to the language and
voice employed, so the first choice for the user is clearly which language and which voice to use.

As language and voice modules are packaged and installed separately, which languages and voices are actually
available to the user depends on which have been installed.

For the best reading of a text, the TTS should use the language in which the text is written and a corresponding
mother-tongue voice.

However, Loquendo TTS also offers the innovative phonetic mapping feature, enabling voices to also speak
foreign languages. This means, for example, that the user can select the language of the text, but have it read by
any voice, switching language as required. The text will be interpreted and converted into its phonetic
representation according to the rules of the chosen language. The resulting phonemes will then be mapped onto
the phonemes applicable for the current voice. If the voice's mother tongue is very distant from the chosen
language (e.g. if the voice is French and the language is English), the resulting pronunciation will be approximate,
with a foreign accent, but still plausible and intelligible, similar to that of a foreign speaker who keeps his own
mother tongue accent.

This feature is especially useful for mixed language texts, where the same voice can also read foreign language
parts of the text with the correct pronunciation, albeit with a strong foreign accent. The best results are obtained
when foreign words and short phrases are embedded in a text written mainly in a single language, that should be
the mother tongue of the chosen voice.

The language and voice switch can be made automatically by configuring the Loquendo TTS language guesser
capability. In this case, the TTS guesses the language of each text portion (word, phrase or sentence, according to
the user settings) and switches accordingly to the correct language and voice. The user, for example, can choose
whether to always switch to a mother tongue voice, or whether to keep the same voice when the language change
occurs in the middle of a sentence (see text interpretation).

Interpretation of characters

The text comes to the TTS system as a sequence of characters, which can be coded according to different
standards, such as ANSI, ISO and UNICODE (UTF-8 and UTF-16). The user can choose the preferred coding (see
text interpretation). Note that some standards may interpret the same code differently according to the language.

The text can be either plain text, possibly marked with Loquendo TTS user controls, or SSML text, written
according to the SSML standard syntax.

As a first step in TTS synthesis, Loquendo TTS performs a superficial normalization of the text, trying to
standardize its form, delimiting sentences and words and converting numbers, abbreviations, acronyms and
conventional expressions (internet addresses, mathematical symbols, dates, currencies, etc.) into their fully
expanded graphemic form. Most of these transformations are governed by language-dependent rules.
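A minimal sketch of this normalization stage is given below. The abbreviation table and the digit-by-digit number expansion are invented illustrations, not Loquendo's actual language-dependent rules, which produce full number names and cover many more conventional expressions.

```python
import re

# Hypothetical language-dependent expansion table (English).
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}

def expand_number(n):
    # Toy digit-by-digit expansion; a real normalizer produces full number names.
    digits = "zero one two three four five six seven eight nine".split()
    return " ".join(digits[int(d)] for d in str(n))

def normalize(text):
    # Expand abbreviations via the table.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Expand currency expressions like "$42" into their graphemic form.
    text = re.sub(r"\$(\d+)", lambda m: expand_number(m.group(1)) + " dollars", text)
    return text

print(normalize("Dr. Smith paid $42 on Main St."))
# Doctor Smith paid four two dollars on Main Street
```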

Prosody and expressivity

The intonation and rhythm of synthesized speech depend on the prominence assigned to each word and on the
prosodic breaks delimiting phrases.

Loquendo TTS assigns word prominence (stress) and prosodic breaks based on punctuation and syntactic
structure. In general, content words are stressed, while function words are not. The appropriate user control can
override the standard behavior (see say-as).

Prosodic breaks are realized as short pauses and intonation variations. Three kinds of intonation contours are
realized: the continuation contour (generally corresponding to a comma in the text or to a strong syntactic
boundary), the conclusive contour (corresponding to a full stop, colon, semicolon) and the interrogative contour
(corresponding to a question mark).

The overall intonation of the synthesized speech is natural and neutral.

Gilded TTS and vocal add-ons

To add emotion and expressivity to its synthesized speech, Loquendo TTS provides the Gilded TTS feature, a way
of inserting expressive cues into the discourse. The use of expressive cues makes vocal messages lifelike,
suggesting expressive intention and varying the intonation. The repertoire of expressive cues consists of a set of
speech acts, pre-recorded formulas comprising conventional figures of speech such as greetings (e.g. "Hello!",
"Bye!", etc.) and exclamations (e.g. "Oh no!", "I'm sorry!", etc.), and a set of paralinguistic events (e.g. cough,
cry, whistle, etc.).

Speech acts can be selected from the TTS Director menu, or simply written directly into the text, with one or more
exclamation or question marks at the end.

Paralinguistic events have to be explicitly inserted into the text by a specific user control.

The content of Gilded TTS is intended for general-purpose speech synthesis. Additional sets of phrases commonly
used in specific domains (e.g. navigation messages) can be added to the system by installing, for a given voice,
optional packages called vocal add-ons.

From the Effects menu in Loquendo TTS Director, speech acts, paralinguistic events and phrases selected from the
vocal add-ons installed, can be inserted into a text using the submenus Speech Acts (presenting items grouped
into intuitive linguistic categories), Extras and Vocal Add-ons (presenting phrases grouped by vocal add-on). A list
of all the phrases and effects available for a given voice can be generated and saved using the menu Effects and
the submenus Save Effect List As.

Control of TTS pronunciation and lexical add-ons

Loquendo TTS allows the user to fine tune the TTS pronunciation of words and sentences. Although the system
attains a high level of accuracy in text-to-phoneme conversion, its rules are intended for general-purpose speech
synthesis. Some lexical domains (e.g. names and place names) or special kinds of text (e.g. SMS) cannot be
covered by general-purpose rules, while each application has its own particular requirements for the interpretation
of acronyms or for the preferred pronunciation of homographs. For this reason, the user is given some control over
TTS pronunciation, by means of two different approaches:

Specifying the preferred phonetic transcription directly in the input text, e.g. \sampa(%le#"gRa~Z) (see
phonetic input).
Listing the preferred transliterations and phonetic transcriptions within a pronunciation lexicon, which
Loquendo TTS will access at run-time, overriding its default transcription rules.

Additional lexicons commonly used in specific domains for a given language can also be added to the system,
either by installing optional packages called lexical add-ons, or by including lexicon files created directly by the
user.

Any additional lexicon files can be loaded by using the appropriate API, or directly in the text by using the
appropriate control tags (see lexicon).

Mixer

Loquendo TTS enables user prompts to be enriched with a number of audio effects, such as reverb, stereo
balance and audio mixing (see audio mixer capabilities).

Mime-types

Loquendo TTS uses the following MIME types for its resources:

Audio files
audio/x-alaw-basic: for a-law audio files associated with .al extension
audio/x-basic: for mu-law audio files associated with .ul extension
audio/x-wav: for wave audio files associated with .wav extension

Lexicon files
application/pls+xml: for Pronunciation Lexicon Specification files (W3C PLS Version 1.0) associated with .pls
extension
application/x-loquendo-lex: for Loquendo Lexicon files associated with .lex extension
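The extension-to-MIME-type mapping above can be expressed directly, e.g. in an application that must label resources before handing them to the engine. The table is taken from the list above; the `application/octet-stream` fallback for unknown extensions is our own assumption, not something the engine prescribes.

```python
import os

# Mapping taken from the MIME types listed above.
MIME_TYPES = {
    ".al": "audio/x-alaw-basic",        # a-law audio
    ".ul": "audio/x-basic",             # mu-law audio
    ".wav": "audio/x-wav",              # wave audio
    ".pls": "application/pls+xml",      # W3C PLS 1.0 lexicon
    ".lex": "application/x-loquendo-lex",  # Loquendo lexicon
}

def mime_for(filename):
    # Look up the extension, case-insensitively.
    ext = os.path.splitext(filename)[1].lower()
    return MIME_TYPES.get(ext, "application/octet-stream")

print(mime_for("beep.wav"))   # audio/x-wav
print(mime_for("names.pls"))  # application/pls+xml
```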




Loquendo TTS User Controls


Loquendo TTS allows the user to control certain aspects of the TTS reading, such as the language in which the
text will be read, the voice, speaking rate, loudness, the interpretation of numbers, the stress prominence of a word
and its pronunciation.

Such controls can be applied by means of three different approaches:

software APIs
setting parameters in the system configuration files
inserting commands directly into the input text

The first two methods are reserved for programmers, and enable the development of applications and GUIs - such
as those provided in the Loquendo TTS SDK - helping the user when preparing and tuning prompts.

The third method is intended for anyone editing texts to be fed into the TTS system, and who wishes to fine-tune
the speech synthesis output.

Below we will describe this third kind of control, available to users of applications based on the Loquendo TTS
proprietary interface. If the Speech API 4.0 or 5.1 interfaces are used, the commands must be given as described
in the Microsoft SAPI documentation. Alternatively, Loquendo TTS behavior can be controlled by preparing the
input text according to the SSML standard (Speech Synthesis Markup Language), described in Appendix A.

Note that the application or GUI through which the user accesses Loquendo TTS may interfere with the nominal
functioning of the User Controls, so please refer to the application manual for specific instructions. In particular, see
the specific User Guides for the Loquendo TTS tools provided in the SDK, which are intended to help the user in
applying text commands and to enhance control capabilities by means of API-based graphical interfaces.

Loquendo TTS User Controls are a way of marking up the text with commands that will be executed
synchronously with the TTS conversion. They should be inserted in the input text, delimited by an initial backslash
(\) and a final space. More than one command can be given in a single Control Sequence: in this case, only the
final space is required and spaces between commands may be omitted. The space acts as the delimiter of the
Control Sequence, unless it occurs within a string that is between parentheses.

The general syntax of the Loquendo TTS User Controls is the following:

<Marked-up text> ::= [<text-portion 1>] <Control Sequence> [<text-portion 2>]

<Control Sequence> ::= <Control> [<Control Sequence>] <white space>

<Control> ::= \<control tag> [<parameters>]

<white space> ::= <SPACE>|<TAB>|<RETURN>|<NEWLINE>|<FORMFEED>

<control tag> ::= <string of characters> ["="]

<parameters> ::= ({[<control tag>]<value>}) | <value> [{,<value>}]

<value> ::= <string of characters> | <number> | <filename>

<filename> ::= [<path>] <name>

<path> ::= [<disk>:/]{<name>/}

<name> ::= <string of characters>


Example:
This is \speed=70\pitch=30 a marked up text for \spell=yes LTTS.

Here \audio(play=c:/sounds/my first beep.wav) you will hear a beep.
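A rough tokenizer for this syntax might look like the following. It is a sketch only: it handles the `\tag=value` and `\tag(parameters)` forms from the examples above, including spaces inside parentheses, but not every production of the full grammar (e.g. comma-separated value lists).

```python
import re

# A control is "\tag", optionally followed by "=value" or "(parameters)".
# A value ends at whitespace or at the next backslash; spaces inside
# parentheses do not end the control (e.g. a filename containing blanks).
CONTROL = re.compile(r"\\(\w+)(?:=([^\s\\(]+)|\(([^)]*)\))?")

def parse_controls(text):
    # Return (tag, value-or-parameters) pairs found in the marked-up text.
    return [(m.group(1), m.group(2) or m.group(3))
            for m in CONTROL.finditer(text)]

print(parse_controls(r"This is \speed=70\pitch=30 a marked up text."))
# [('speed', '70'), ('pitch', '30')]
print(parse_controls(r"\audio(play=c:/sounds/my first beep.wav) a beep"))
# [('audio', 'play=c:/sounds/my first beep.wav')]
```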

The effect of a User Control can be local, e.g. the playing of a sound file or the spelling of the following word, or
global, e.g. a language change or a raise in the voice pitch. In the second case the effect of the control remains
until a new control cancels it. For a subset of the global parameters, their action scope is the single prompt fed to
Loquendo TTS (which in turn can be a phrase, a paragraph or an entire text), at the end of which the parameters
are reset to their default value.

In the following paragraphs, the Loquendo TTS User Controls are described, grouping them according to their
global or local effect.

Global controls

The family of Global Controls includes commands that change the value of some of the Reading Parameters of
Loquendo TTS, affecting the quality of the output speech:

Voice, Language and User Lexicons


Prosodic aspects of the voice (speaking rate, volume, voice pitch and timbre)
Sound effects (reverb, balance)
Text interpretation (SSML vs. plain text, spelling vs. word pronunciation, blanks read as pauses or not,
number interpretation, automatic language guessing)

The change will take place at the moment when the word following the Control is being synthesized. The new value
of the parameter will remain set until a new command changes it again (or, for the language, voice and prosody
parameters, until the end of the prompt is reached).

Voice, language, lexicon, style and say-as


Voice Control: forces a switch between voices. The mnemonic must be the name of an installed voice (see
Languages and Voices). At the end of the prompt the parameter is reset to its application default value.

\voice=<mnemonic>

Example:

\voice=Susan hello. \voice=Dave hi.

("hello" is read by the voice "Susan", then "hi" is read by the voice "Dave").

Language Control: forces a switch between languages. The mnemonic must be the name of an installed
language (see Languages and Voices). Aliases are accepted for the language name, from the generic name (e.g.
English), to the standard ISO codes for languages and variants (e.g. en-US, en-GB). At the end of the prompt the
parameter is reset to its application default value.

\language=<mnemonic>

Example:
\language=English Paris \language=French Paris.

(the first occurrence of the word Paris will be pronounced: p"}rIs , and the second:
paR"i).

Lexicon Control: loads and activates one or more User Lexicons, named <lexicon filename>. A User Lexicon can
be loaded in the <num> position in the Lexicon Array (see User Control of TTS Pronunciation). Any position
(number) can be chosen (although it is suggested that consecutive positions are used, starting from 0). If the
specified position is already taken, the old lexicon assigned to that position will be substituted with the new one.
Lexicons within a Lexicon Array are given an order of priority to resolve any conflicting entries they may
contain: lexicons assigned the highest number override lexicons with a lower number. A loaded lexicon can be
unloaded by referring to its <num> position in the Lexicon Array. The scope of the lexicon tag is the single prompt.
Loquendo TTS supports two formats of lexicon files: the Loquendo Format, and the W3C Pronunciation Lexicon
Specification (PLS) version 1.0 (see http://www.w3.org/TR/pronunciation-lexicon/). Use the Loquendo TTS Lexicon
Manager tool to write and modify files in either format.

\lexicon(load=<num>,<lexicon filename>)
\lexicon(unload=<num>)

Example:

Assume you have two User Lexicons (created by means of Loquendo TTS Lexicon Manager):

C:/tmp/first.lex
containing the following entries:
"ez" = "exit zone"
"o" = "open"

C:/tmp/second.pls
containing the following entries:
<lexeme>
<grapheme>ez</grapheme>
<alias>easy</alias>
</lexeme>
<lexeme>
<grapheme>j4f</grapheme>
<alias>just for fun</alias>
</lexeme>

The following text:

\lexicon(load=0,c:/tmp/first.lex)\lexicon(load=1,c:/tmp/second.pls)o ez j4f,

\lexicon(unload=1) o ez j4f.

will be read as:

"open easy just for fun, open exit zone j 4 f."

Style Control: activates one of the alternative reading styles available. Reading styles modify TTS behaviour by
setting reading parameters (see Text Interpretation) and by activating User Lexicons (several lexicons can be
activated at a time, including regular expression lexicons .rex). Reading styles take the form of text files that can be
created by the user (with the extension .ycf, in the Data directory). For some languages, Loquendo TTS provides a
number of pre-defined styles, e.g. for SMS reading, address reading. Language-specific styles can be activated by
direct reference to their filename, with the \style command (e.g. \style=SMSEnglishUs). Alternatively (see below),
they can be set via their language-independent mnemonic by means of the \sayas command (e.g. \sayas=sms). At
the end of the prompt, the previously set configuration, or the configuration defined by the application, is restored.
\style=<name> activates the reading style corresponding to the file <name>.ycf
\style=reader restores the previously set, or application-defined, reading style
\style=default restores the language-specific default reading style
\style=general restores the language-independent default reading style

Example:

Assume that your Data directory contains the first User Lexicon in the above example,
and the following style:

C:/tmp/sample.ycf
"lexicon" = "first.lex"
"SpellPunctuation" = "YES"
"DefaultNumberType" = "ordinalM"

The following text:

\style=sample 3 ez, o 7 door. \style=general 3 ez, o 7 door.

will be read as:

"third exit zone comma, open seventh door dot. three ez, o seven door."

Say-as Control: activates a reading style specified by a mnemonic (e.g. address, sms), if the corresponding pre-
defined style is available for the current language (see above). At the end of the prompt the application-defined
reading style is restored.

\sayas=<mnemonic> activates the reading style corresponding to the file <mnemonic><current language
name>.ycf
\sayas=reader restores the application-defined reading style
\sayas=default restores the language-specific default reading style
\sayas=general restores the language-independent default reading style

Example:

\sayas=SMS brb c u l8r, \sayas=default brb c u l8r.

(if the current language is English, the text will be read as "be right back see you
later, b r b c u l eight r").

Prosodic controls
The following commands allow the quality of the output voice to be controlled by modifying its rhythm, intonation,
volume and timbre. The output speech is modified from the word following the command, up until the end of the
prompt. In the case of a voice change, the new prosodic values are also applied to the new voice. A new prompt,
however, resets all the prosodic parameters to their default values.

The prosodic parameters can be set to a given value, or increased or decreased with respect to their current value.
The desired value can either be expressed in physical units (words per minute for speaking rate, Hertz or
semitones for pitch and timbre, decibels for volume), or as a number on an abstract scale from 0-100, where 50
corresponds to the default value of the parameter, and 0 and 100 to its minimum and maximum values respectively.
This 0-100 scale can be modified by the application, so please refer to the application manual for specific
instructions.

Note that the relative changes are with respect to the current value and are cumulative: for example, two
consecutive commands \pitch+20Hz within the same prompt will raise the pitch by 40Hz.
N.B. Although there are, in theory, no formal limits on the assignable values (expressed in physical units),
changes are constrained by the physical range possible for the voice.
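As an illustration of the abstract 0-100 scale and of cumulative relative changes, consider pitch. The mapping below from scale values to Hertz, and the particular pitch range, are invented for the example; they are not the engine's actual calibration.

```python
# Hypothetical pitch range for a voice, in Hertz.
PITCH_MIN, PITCH_DEFAULT, PITCH_MAX = 80.0, 140.0, 220.0

def pitch_from_scale(scale):
    # Abstract 0-100 scale: 50 = default, 0 = minimum, 100 = maximum,
    # interpolated linearly on either side of the default.
    if scale <= 50:
        return PITCH_MIN + (scale / 50.0) * (PITCH_DEFAULT - PITCH_MIN)
    return PITCH_DEFAULT + ((scale - 50) / 50.0) * (PITCH_MAX - PITCH_DEFAULT)

pitch = pitch_from_scale(50)  # 140.0 Hz: the default value
pitch += 20                   # \pitch+20Hz
pitch += 20                   # a second \pitch+20Hz in the same prompt
print(pitch)                  # 180.0: relative changes are cumulative
pitch = pitch_from_scale(50)  # a new prompt resets the parameter
```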

The general syntax for Prosodic Controls is the following:

\<tag>=<num> sets to <num> (abstract scale)


\<tag>±<num> increases or decreases by <num> (abstract scale)

\<tag>=<num><physical unit> sets to <num> (in physical unit of measurement)


\<tag>±<num><physical unit> increases or decreases by <num> (in physical unit of measurement)
\<tag>±<perc>% increases or decreases by <perc> percent of the current value

\<tag> resets to the standard default value

In detail, the specific syntax for the four prosodic parameters is the following:

Speed Control: allows the speaking rate to be modified, expressed in an abstract scale 0-100 or in words per
minute (wm).

\speed=<num>

Example:

\speed=60 (Scale 0-100)

\speed±<num>

Example:

\speed+20 (Scale 0-100 variation)

\speed=<num>wm

Example:

\speed=110wm (words per minute)

\speed±<num>wm

Example:

\speed-10wm (words per minute variation)

\speed±<perc>%

Example:

\speed+20% (percentage variation)

\speed

Example:

\speed (reset)
Pitch Control: allows the fundamental frequency (tone or pitch) to be modified, expressed in an abstract scale 0-
100 or in Hertz (hz) or semitones (st).

\pitch=<num>

Example:

\pitch=60 (Scale 0-100)

\pitch±<num>

Example:

\pitch+20 (Scale 0-100 variation)

\pitch=<num>hz

Example:

\pitch=200hz (Hertz)

\pitch±<num>hz

Example:

\pitch-40hz (Hertz variation)

\pitch±<nsemit>st

Example:

\pitch+2st (semitones variation)

\pitch±<perc>%

Example:

\pitch+30% (percentage variation)

\pitch

Example:

\pitch (reset)
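The semitone variant follows the standard musical relation: a shift of n semitones multiplies the fundamental frequency by 2^(n/12). As a quick sketch:

```python
# A pitch change of n semitones corresponds to multiplying the
# fundamental frequency by 2**(n/12); 12 semitones double it.

def shift_semitones(freq_hz, n_semitones):
    return freq_hz * 2 ** (n_semitones / 12)

# \pitch+2st on a 100 Hz voice lands near 112.2 Hz;
# \pitch+12st is exactly one octave up (200 Hz).
```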

Volume Control: allows the volume (loudness) to be modified, expressed in an abstract scale 0-100 or in decibels
(dB).

\volume=<num>

Example:

\volume=60 (Scale 0-100)

\volume±<num>
Example:

\volume+20 (Scale 0-100 variation)

\volume=<num>db

Example:

\volume=-3db (decibels)

\volume±<num>db

Example:

\volume+1db (decibel variation)

\volume±<perc>%

Example:

\volume+20% (percentage variation)

\volume

Example:

\volume (reset)
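For reference, decibel values map to linear amplitude gain by the standard relation gain = 10^(dB/20); a small sketch:

```python
import math

# Decibel volume changes map to linear amplitude gain as 10**(db/20):
# -6 dB roughly halves the amplitude, +6 dB roughly doubles it.

def db_to_gain(db):
    return 10 ** (db / 20)

def gain_to_db(gain):
    return 20 * math.log10(gain)
```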

Timbre Control: allows the voice timbre to be modified by a shift in frequency, expressed in an abstract scale 0-
100 or in semitones (st).

\timbre=<num>

Example:

\timbre=60 (Scale 0-100)

\timbre±<num>

Example:

\timbre+20 (Scale 0-100 variation)

\timbre

Example:

\timbre (reset)

Sound effects
The following commands create certain sound effects by acting on acoustic parameters of the speech output signal:
reverb gives the impression of a large hall or church, while delay (or echo) repeats the audio signal at ever
diminishing volume. If the Audio Destination device is stereo, the balance between the two loudspeakers can be
adjusted, locating the voice source in a given position in space. Additional effects include several robotic-voice
styles and the whispering effect, both of which can be applied to any voice in any language.

These effects remain active until a contrary command cancels or modifies them, i.e. they are NOT reset at
the end of the prompt.

The syntax of these controls is the following:

Reverb Effect: creates reverb with an intensity of <gain> and a delay of <delay> milliseconds

\reverb=<gain>,<delay>

Example:

\reverb=80,500 (0<gain<100, 0<delay<2000)

\reverb=0,0 (removes the reverb effect)

Balance Control: locates the voice source at <num> degrees with respect to the listener (<num> between -45 and
45), or on the right/front/left

\balance=<num>

Example:

\balance=-30 (locates the voice source at -30 degrees)

\balance=right (locates the voice source on the right)

\balance=center (locates the voice source in a central position)

\balance=left (locates the voice source on the left)
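The manual does not specify how the balance angle maps to channel gains; purely as an illustration, a common convention in audio software is constant-power panning:

```python
import math

# Illustrative constant-power panning: map a balance angle in
# [-45, +45] degrees (as used by \balance) to left/right gains.
# This is a common audio convention, NOT a statement about how
# the Loquendo engine implements \balance internally.

def pan_gains(angle_deg):
    if not -45 <= angle_deg <= 45:
        raise ValueError("balance angle must be between -45 and 45")
    # Map [-45, +45] onto [0, pi/2]: -45 -> full left, +45 -> full right.
    theta = (angle_deg + 45) / 90 * (math.pi / 2)
    return math.cos(theta), math.sin(theta)  # (left gain, right gain)
```

Constant-power panning keeps left² + right² equal to 1, so the perceived loudness stays constant as the source moves.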

Robot Control: applies the robotization effect to the voice currently active in the system. There are 9 robots
available: Robby, Gort, Twiki, Torg, Tobor, Ash, Hector, Max and Lynjx.

\robot=<robotName>

Example:

\robot=Max

\robot (removes the robotization effect)

Whisper effect: applies the whisper effect to the voice currently active in the system. The possible values are: on,
off.

Example:

\whisper=on

(the effect is active)

\whisper=off

(the effect is not active)

Text interpretation
The following commands control certain general aspects of text interpretation:

the decoding of the input character stream (OEM, ANSI, UNICODE, etc.) and the general text format (SSML
vs plain text)
where to insert prosodic pauses in speech, corresponding to typographical, punctuation or syntactic features
how to interpret digit sequences, special characters and acronyms
how to choose the language appropriate to the text
which Phonetic Alphabet to use for the Phonetic Input and Output

Such reading parameters can also be set in Configuration files or by the appropriate APIs (see the Loquendo TTS
Programmer's Guide). Here we describe how to adjust them synchronously with the text by means of a User
Control embedded in the text.

The general syntax for these User Controls is the following:

\@<key>=<value>

where <key> is the name of the Reading Parameter to be changed and <value> is its chosen value.

Example:

\@DefaultNumberType=telephone (all numbers will be read as telephone numbers)

Such User Controls remain active until a contrary command cancels or modifies them, i.e. a parameter that
has been changed with the \@ tag is NOT reset at the end of the prompt.
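The \@<key>=<value> form lends itself to programmatic generation; a minimal sketch (the helper is our own, not part of the Loquendo API):

```python
# Minimal helper that prefixes text with \@ reading-parameter tags.
# Parameter names must match those documented below; the helper
# itself is purely illustrative.

def with_reading_params(text, **params):
    tags = "".join(f"\\@{key}={value} " for key, value in params.items())
    return tags + text

prompt = with_reading_params("Call 5551234.", DefaultNumberType="telephone")
# -> "\@DefaultNumberType=telephone Call 5551234."
```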

Here is the complete list of the modifiable reading parameters:

Text coding and format


TextEncoding - Use this parameter to change the input text coding. Possible values:
"default"
"AutoDetect" or "auto" or "AutoDetectCoding"
"unicode" or "utf-16" or "utf16"
"utf-8" or "utf8"
"iso"
"ansi" or "windows" or "windows-ansi"
"oem" or "ms-dos"
"iso-8859-1" or "iso-latin-1" or "us-ascii"
"iso-8859-2" or "iso-latin-2" or "iso-central-european" [polish language only]
"iso-8859-7" or "iso-greek" [greek language only]
"iso-8859-9" or "iso-turkish" [turkish language only]
"windows-1250" or "windows-latin-2" or "windows-central-european" [polish language only]
"windows-1252" or "windows-latin-1" or "windows-western"
"windows-1253" or "windows-greek" [greek language only]
"windows-1254" or "windows-turkish" [turkish language only]
The default value is "autodetect" when the input is a file, and "ansi" ("windows") otherwise.
TextFormat - Use this parameter to change the input text format. Possible values are: "ssml", "plain"
(default) or "autodetect".

Prosodic Pause Insertion


MultiSpacePause - By default, multiple spaces or tabs in a text generate a pause. If you set this
parameter to "FALSE", no pause is generated.
MaxParPause - By default, lines shorter than 5 words (like titles or signatures) are automatically
terminated by a pause. You can change this threshold to a different value; use "0" (zero) to
disable this feature.
ProsodicPauses - Three possible values: "automatic" (pauses are automatically inserted at phrase
boundaries, e.g. at end of sentence, at commas, between subject and verb), "punctuation" (pauses are
inserted only at punctuation marks), "word" (pauses are inserted at each word boundary, "word by word"
reading)
ShortPauseLength - Sets the duration (in milliseconds) of automatically inserted short pauses. The value
"0" resets the pause duration to its default value (typically, 50 ms)
MediumPauseLength - Sets the duration (in milliseconds) of pauses inserted at commas. The value "0"
resets the pause duration to its default value (typically, 120 ms)
LongPauseLength - Sets the duration (in milliseconds) of end-of-sentence pauses. The value "0" resets
the pause duration to its default value (typically, 500 ms)

Digit and symbol pronunciation


SpellingLevel - Four possible values: "normal", "spelling" (all words are spelled out), "pronounce" (no
words are spelled out), "screenreader" (isolated characters are spelled out)
SpellPunctuation - If you set this parameter to "TRUE", punctuation marks are read out, e.g. :-] "colon
hyphen square bracket"
TaggedText - If you set this parameter to "FALSE", control tags are not processed but pronounced
DefaultNumberType - Defines the number type: “automatic” (by default, number type is detected
automatically), “telephone”, “cardinal”, “code”, “time”, “ordinalM”, “ordinalF”
DateReadingMode - Sets the reading mode for dates: "dmy", "mdy", "ymd", "dm", "my", "d", "m", "y"
TimeReadingMode - Sets reading mode for time values as either "time" (default) or "duration"
CurrencyReadingMode - Sets the reading mode for currencies: "decimal" (e.g. "two point forty five
dollars"), "short" (e.g. "two dollars and forty five"), "extended" (e.g. "two dollars and forty five cents")

Language guessing
AutoGuess - Activates and configures the AutoGuess mode, i.e. the ability to automatically switch to the
appropriate language according to the text. The syntax of this configuration parameter is:
AutoGuess=[Type]:[Language list].
Possible values for "type":
1. "no" - no AutoGuess mode
2. "VoiceParagraph" - Detects language and changes voice accordingly, paragraph by paragraph
3. "VoiceSentence" - Detects language and changes voice accordingly, sentence by sentence
4. "VoicePhrase" - Detects language and changes voice accordingly, phrase by phrase
5. "LanguageParagraph" - Detects and changes language, paragraph by paragraph, without changing
the active voice
6. "LanguageSentence" - Detects and changes language, sentence by sentence, without changing the
active voice
7. "LanguagePhrase" - Detects and changes language, phrase by phrase, without changing the active
voice
8. "LanguageWord" - Detects and changes language, word by word, without changing the active voice
9. "BothParagraphSentence" - Combines the effects of "VoiceParagraph" and "LanguageSentence"
10. "BothParagraphPhrase" - Combines the effects of "VoiceParagraph" and "LanguagePhrase"
11. "BothParagraphWord" - Combines the effects of "VoiceParagraph" and "LanguageWord"
12. "BothSentencePhrase" - Combines the effects of "VoiceSentence" and "LanguagePhrase"
13. "BothSentenceWord" - Combines the effects of "VoiceSentence" and "LanguageWord"
14. "BothPhraseWord" - Combines the effects of "VoicePhrase" and "LanguageWord"

The AutoGuess keyword requires a list of languages separated by commas (e.g. English, French, Spanish,
German). The intended languages can also be specified by voice names (e.g. Simon, Juliette, Carmen). For
AutoGuess types 9-14, the character '-' (minus) after the language (e.g. "Swedish-") means that voice
changes are permitted, but not "language only" changes (see the second example below). The character '-'
(minus) before the language means that only language changes are permitted (not voice changes).
Some examples:

\@AutoGuess=VoiceSentence:Italian,English

(sentence by sentence changes between Italian and English voices)

\@AutoGuess=BothSentenceWord:French-,Spanish-,English

(sentence by sentence detects the appropriate language, and changes voice accordingly.
In addition, when using non-English voices, English words are detected
and pronounced with the English phonetic rule set).

LanguageSetForGuesser - Defines the list of languages, separated by commas, from which the
Language Guesser should guess (e.g. "English, French, Spanish, Italian")
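The AutoGuess value is a plain string, so an application can assemble and validate it before embedding it in the text. A minimal illustrative helper (our own, not part of the product):

```python
# Assemble the value for \@AutoGuess=[Type]:[Language list].
# The set below covers the fourteen types documented above;
# the helper itself is illustrative, not part of the Loquendo API.

AUTOGUESS_TYPES = {
    "no",
    "VoiceParagraph", "VoiceSentence", "VoicePhrase",
    "LanguageParagraph", "LanguageSentence", "LanguagePhrase", "LanguageWord",
    "BothParagraphSentence", "BothParagraphPhrase", "BothParagraphWord",
    "BothSentencePhrase", "BothSentenceWord", "BothPhraseWord",
}

def autoguess_tag(kind, languages):
    if kind not in AUTOGUESS_TYPES:
        raise ValueError(f"unknown AutoGuess type: {kind}")
    return f"\\@AutoGuess={kind}:{','.join(languages)}"

tag = autoguess_tag("VoiceSentence", ["Italian", "English"])
# -> "\@AutoGuess=VoiceSentence:Italian,English"
```

A language suffixed or prefixed with '-' (for types 9-14) can be passed as part of the name, e.g. "French-".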

Phonetic input
SampaSecondAccent - If you set this parameter to “YES”, the SAMPA secondary stress (the symbol % )
is converted to the SAMPA primary stress (the symbol " ). By default, the SAMPA secondary stress is simply
ignored. (For more information on the SAMPA phonetic alphabet, see below).

Audio Mixer
AudioTagScope - If you set this parameter to “SinglePrompt”, the audio file you are mixing with speech
(see Audio mixer capabilities) will be played till the end of the current prompt; otherwise, if it is set to
“MultiplePrompt”, the audio file will be played over multiple prompts till the audio comes to an end, or, if the
loop command is set, until you stop the audio track

The audio file format is detected automatically based on the file extension; the recognised extensions are:

.alaw / .alw / .al: for 8 kHz, a-law, mono audio files


.ulaw / .ulw / .ul: for 8 kHz, u-law, mono audio files
.wav: for any frequency, a-law, u-law or linear encoding, and either mono or stereo audio files: the wave
header will be analysed in order to extract the correct parameters.

For raw files of unknown extension it is possible to set the correct audio parameter by using the following user
controls:

MixerAudioFileDefaultFrequency - sets the frequency of the audio file, which must be a value
expressed in Hz (e.g. 44100)
MixerAudioFileDefaultEncoding - sets the audio encoding as "linear", "alaw" or "ulaw"
MixerAudioFileDefaultChannels - can be "mono" or "stereo"
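The extension-based detection described above can be sketched as follows (the mapping mirrors the list of recognised extensions; the fallback tuple stands in for the Mixer* default parameters and is our own illustration):

```python
# Sketch of the extension-to-format mapping described above.
# Raw or unknown extensions fall back to the Mixer* default parameters.

EXT_FORMATS = {
    ".alaw": ("alaw", 8000, "mono"), ".alw": ("alaw", 8000, "mono"),
    ".al":   ("alaw", 8000, "mono"),
    ".ulaw": ("ulaw", 8000, "mono"), ".ulw": ("ulaw", 8000, "mono"),
    ".ul":   ("ulaw", 8000, "mono"),
}

def guess_audio_format(filename, defaults=("linear", 44100, "mono")):
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext == ".wav":
        return "from-wave-header"   # parameters come from the WAV header
    return EXT_FORMATS.get(ext, defaults)
```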

Local controls

Local Controls force a particular reading mode at the exact point in the text where they are inserted, overriding the
current values of any Reading Parameters which have been set. The effect is local, changing the pronunciation of
the following word or text portion (say-as, stress, duration), or inserting a given event (a word or phrase
represented in phonemes, a paralinguistic event, a sound mixed with the voice, a bookmark).

The Local Controls can be grouped as follows:

Interpretation - modifying the pronunciation of the following word (numbers, spelling, stress)
Pauses - inserting a pause of a given duration or imposing a given duration on the following pause
Phrase duration - imposing a given duration on the following delimited text portion
Special event - (insertion of a bookmark, playing of a demo sentence or of a sound file)
Phonetic Input - specifying the synthesis of words by their phonetic transcription
Audio Mixer - mixing the synthesized voice with sound files

Interpret
The interpret commands control the pronunciation of the following word (the string of characters up to the next
blank or phrase terminator), governing number interpretation, spelling and stress. The functioning of these
commands may differ from language to language.

Number interpretation

\number=ordinalM masculine ordinal number, e.g. \number=ordinalM 4 (English: fourth, Italian: quarto)
\number=ordinalF feminine ordinal number, e.g. \number=ordinalF 4 (Italian: quarta)
\number=cardinal cardinal number, e.g. \number=cardinal 254667 (two hundred fifty four thousand six
hundred sixty seven)
\number=code numerical code, e.g. \number=code 254 (read: two five four)
\number=time time e.g. \number=time 10:24:45 (read: ten twenty-four and forty five seconds)
\number=telephone telephone number, e.g. \number=telephone 254667 (two five four six six seven)

Spelling

\spell=yes the following word is spelled out, e.g. \spell=yes but (bee you tee)
\spell=no the following word is not spelled out, e.g. \spell=no IBM (ibm)
\spell=auto default behavior, e.g. \spell=auto but \spell=auto IBM (but i bee em)

Stress

\stress=yes the following word is stressed, e.g. \stress=yes da casa (Italian: dà càsa)
\stress=no the following word is de-accented, e.g. da \stress=no casa (Italian: da casa)
\stress=auto default behavior, e.g. \stress=auto da \stress=auto casa (Italian: da càsa)

Pause
The pause commands can be used to modify the duration of pauses corresponding to punctuation marks, or to
insert pauses where punctuation is lacking. Note that the presence of a pause always modifies the intonation of the
preceding phrase.
The resulting intonation depends on the punctuation mark:

,()- produce a continuation or suspensive intonation


? produces an interrogative intonation
.;! produce a conclusive intonation

Pause insertion
Inserts a pause (silence) in the absence of punctuation marks. It has no effect if punctuation is present.

\pause inserts a medium-length pause (120 ms), preceded by a ‘comma intonation’


\pause, inserts a medium-length pause (120 ms), preceded by a ‘comma intonation’
\pause. inserts a long pause (500ms), preceded by a ‘conclusive intonation’
\pause? inserts a long pause (500ms), preceded by a ‘question intonation’

Example:

Here \pause is a comma pause. (inserts a 120ms ‘comma intonation’ pause between "Here"
and "is")

Here \pause, is a comma pause. (leaves unaltered the ‘comma pause’ between "Here" and
"is")

Here \pause. is a conclusive pause. (inserts a 500ms ‘conclusion’ pause between "Here"
and "is")

Here \pause? is a question pause. (inserts a 500ms ‘question’ pause between "Here"
and "is")

Pause duration
When followed by a punctuation mark, forces the duration of the corresponding pause to <num> milliseconds. In
the absence of punctuation, inserts a ‘comma intonation’ pause of <num> milliseconds.

\pause=<num> sets to <num> milliseconds the duration of the following pause

Example:

This \pause=10 , is a comma pause. (reduces to 10ms the following ‘comma intonation’
pause)

This \pause=10, is a comma pause. (reduces to 10ms the following ‘comma intonation’
pause)

This \pause=10 is a comma pause. (inserts a 10ms ‘comma intonation’ pause)

No final pause \pause=0. (reduces the final silence to a minimum duration, while
keeping the conclusive intonation)

Phrase duration
Forces the subsequent text to be synthesized in <msec> milliseconds. The text should be preceded by the tag
\duration(start,value=) and followed by a final tag \duration(end). The delimited text should not include
punctuation marks. The forced duration should be realistic (at least 30% of the normal speaking time), otherwise
there will be no effect.

\duration(start,value=<msec>) forces the duration of the phrase to <msec> milliseconds, up to the tag
\duration(end)
\duration(end) delimits the phrase for which the duration is forced

Example:

Standard reading.
\duration(start,value=500) Fast reading \duration(end) .
\duration(start,value=1000) Slow reading \duration(end) .
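Given the 30% threshold above, an application can decide whether a \duration request will have any effect before emitting the tags. A small illustrative sketch (the helper and the natural-duration estimate are our own assumptions):

```python
# The forced duration only takes effect if it is realistic; the manual
# gives "at least 30% of the normal speaking time" as the threshold.

def duration_tag(text, msec, natural_msec):
    if msec < 0.3 * natural_msec:
        # Below the threshold the engine would ignore the request,
        # so don't bother emitting the tags.
        return text
    return f"\\duration(start,value={msec}) {text} \\duration(end)"
```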

Special events
These commands trigger particular events (the playing of a sound or demo sentence, the signaling to the
application that a bookmark has been reached) at the moment when the synthesis output reaches the exact point in
the text where they have been inserted.

Play sound
Plays one of the paralinguistic sounds recorded for the voice in use. The names of the available sounds can be
found in the SDK prompt editing tool or in the Voice Package. For most voices, the following sounds at least are
available: Cough, Cry, Eh, Kiss, Laugh, Mmm, Oh, Sniff, Swallow, Throat, Whistle, Yawn.

\item=<sound name>

Example:
\item=Laugh

Play demo sentence


Plays a predefined sentence in the mother tongue of the current Voice. Allows any voice to be tested without
having to insert an input text. The active language should be the mother tongue of the voice. This control cannot be
mixed with other text in the same prompt.

\demosentence synthesizes the standard sentence predefined for the voice in use

Example:

\demosentence . (if the voice is Simon, the following sentence will be played:

"My name is Simon. I am one of the Loquendo male voices and my mother tongue is
British English")

Bookmark
When the TTS engine encounters the bookmark, it notifies the application by calling the user callback and signaling
that the bookmark has been reached.
N.B. Not all Audio Destinations/devices allow bookmarks.

\book=<string>

Phonetic input
These commands allow the user to write a word or phrase phonetically rather than graphemically. If the usual
written form of a word or phrase does not produce the desired pronunciation, the user can insert it phonetically,
selecting the appropriate phonemes.

The phonemes can be represented using the standard SAMPA alphabet, in its X-SAMPA extension (see
http://www.phon.ucl.ac.uk/home/sampa/), or by using the International Phonetic Alphabet (see
http://www.langsci.ucl.ac.uk/ipa/).

The interpretation of phonetic transcriptions by Loquendo TTS is, in principle, language-independent, thanks to the
Phonetic Mapping feature (see Languages and Voices). Their acoustic rendering, however, depends on the
phonemes available for that voice.

The syntax of the Phonetic Input command is the following:

SAMPA
Enables the insertion of a phonetic transcription expressed in the SAMPA (X-SAMPA) alphabet.

The phonetic transcription <phonemes> should be a string of X-SAMPA phonemes, representing a single word
or a short phrase. The character "#" can be used as a word separator, while "$" can be used as a syllable
separator.

\sampa(<phonemes>)

Example:

\sampa(%le#"gRa~Z) (Les Granges in French)

The older variants of the SAMPA alphabet, which are language-dependent, are also accepted if the <variant>
parameter is specified, where <variant> can be "ucl" or "teleatlas" or "navteq" (or "x-sampa" for the default variant).
A <language> parameter can also be specified. For example, the value "en-us" forces the interpretation of the
symbols "e" and "o" as diphthongs, while the value "spanish" allows the interpretation of the symbol "Y".

\sampa=<variant>;(<phonemes>)
\sampa=<variant>;<language>;(<phonemes>)

Example:

\sampa=x-sampa;("p_hleIs) (place in English) equivalent to: \sampa("p_hleIs)

\sampa=ucl;("pleIs) (place in English)

\sampa=teleatlas;("pleIs) (place in English)

\sampa=navteq;en-us;("ples) (place in English)

IPA
Allows the insertion of a phonetic transcription expressed in the IPA alphabet.

The phonetic transcription <phonemes> should be a string of IPA phonemes, written in Unicode or, if the
document or the application does not support Unicode encoding, by using character entities.

\ipa(<phonemes>)

Example:

\ipa(le gʀˈɑ̃ʒ) (Les Granges in French)

or, with the character entities:

\ipa(le g&#640;&#712;&#593;&#771;&#658;) (Les Granges in French)

Dynamic range control


The Dynamic Range Control applies compression and compensation gain to the waveform, thus enhancing the
synthetic speech output. The feature is useful, for example, in improving comprehension in noisy environments.
Seven presets with different degrees of compression can be selected through this user control, from built-in(1),
which provides the least compression, to built-in(7), which provides the maximum. The first five presets tend to
keep the signal within the available range, while the last two presets may produce saturation and, for this reason,
should be used only in particular contexts where, for instance, background noise is very high.

The syntax of the Dynamic Range Control is the following:


\dynamic=<built-in curve>

Example:

\dynamic=built-in(1)

In this way, DRC static curve 1 is selected, activating the first preset of the dynamic range control.

\dynamic

This command is used to reset and disable the dynamic range controller.

Audio mixer features


The audio mixer allows synthetic speech to be mixed with sound files. It's possible to mix one or more sound files
simultaneously. Every sound file (audio source) is treated as an independent audio track, with independent
volume, timeline and sampling rate.
The sampling rate of the audio sources is automatically converted to match the frequency of the voice in
use. The audio mixer supports 16 bit sound files, both mono and stereo, with arbitrary sampling rates.
".wav" files are supported and played.
".mp3" files cannot be played, but can be generated by using a tool (like Loquendo TTS Director); it is only
necessary to specify the correct file extension (.mp3). ".wma", ".asf", ".ogg", ".avi" and ".mpg" files are not
supported and are not played.
".raw", ".pcm" and files with any other extension are played as raw files.

The syntax of the Audio Mixer control is the following:


\audio(command[;command...])

Example:

\audio(play=beep.wav)
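Since multiple commands are joined with ';' inside a single control, an application can assemble the tag programmatically; a minimal illustrative helper (not part of the product API):

```python
# Build an \audio(...) control from one or more command strings,
# joined with ';' as described above. Illustrative helper only.

def audio_tag(*commands):
    if not commands:
        return "\\audio"          # bare tag reinitializes the mixer
    return "\\audio(" + ";".join(commands) + ")"

tag = audio_tag("play=beep.wav", "volume=50")
# -> "\audio(play=beep.wav;volume=50)"
```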

One or more commands can be given in a single control. The available commands are described below:

Command Syntax:
play \audio(play=<filename>)

Description:
This command allows the playing of an audio file at the specified position in the text.
The filename can contain a forward slash ( / ) in order to specify the full path, but
backslashes are not permitted, i.e. the syntax must be UNIX like, regardless of the
operating system.
The <filename> can also be a URL.

Example 1:
This is \audio(play=music.wav) a test.

Result:
"This is" will be read, then the music.wav will be played, then "a test" will be read.

Example 2:
This is \audio(play=music.wav;volume=50) a test.

Result:
"This is" will be read, then music.wav will be played at volume 50% (see volume
command below), then "a test" will be read.

Example 3:
This is \audio(play=music1.wav;play=music2.wav) a test.
Can also be written:
This is \audio(play=music1.wav) \audio(play=music2.wav) a test.

Result:
"This is" will be read, then music1.wav will be played, then music2.wav will be played,
finally "a test" will be read.

Command Syntax:
mix \audio(mix=<filename>) or
\audio(mix=<filename>,loop) or
\audio(mix=<filename>,<count>)

Description:
This command mixes an audio file with the synthesized speech, starting at the specified position in the text.
The filename can contain a forward slash ( / ) in order to specify the full path, but
backslashes are not permitted, i.e. the syntax must be UNIX like, regardless of the
operating system.

Example 1:
This is \audio(mix=music.wav) a test.

Result:
Speech and music.wav will be mixed and heard together. The current track is
music.wav (see the track command below for details).

Example 2:
This is \audio(mix=music.wav,loop) a long test.

Result:
Speech and music.wav will be mixed and heard together. If the end of the audio file is
reached, it will restart from the beginning. The current track is music.wav (see the
track command below for details).

Example 3:
This is \audio(mix=music.wav,3) a long test.

Result:
Speech and music.wav will be mixed and heard together. If the end of the audio file is
reached, it will restart from the beginning 3 times. The current track is music.wav (see
the track command below for details).

Note:
\audio(mix=music.wav) and \audio(mix=music.wav,1) are equivalent.

Command Syntax:
name \audio(name=<track name>)

Description:
This command allows a mnemonic name to be assigned to the current track. For
convenience, this mnemonic name can be used in the track command in place of
the file name.

Command Syntax:
volume \audio(volume=<range(0-200)>)

Description:
This command allows the volume of the current audio track to be set. To specify
the current track use the track command.
Default volume is 100%. The range values are percentages of the default volume.

Example 1:
This is \audio(mix=music.wav) \audio(volume=50) a test.

Result:
The volume of the audio file is set to 50% (from the start).

Example 2:
This is \audio(mix=music.wav) a test. Now I set the volume \audio(volume=50) to
50%.

Result:
The audio file volume starts at 100%, then is set to 50% after a brief period.

Command Syntax:
pause \audio(pause[=filename])

Description:
This command allows the current audio track to be paused. To specify the current
track use the track command.

Example 1:
\audio(mix=music.wav) Music mixing \audio(pause) is now in pause.

Result:
The audio file is paused before the words "is now in pause".

Example 2:
\audio(mix=music1.wav;mix=music2.wav) Music mixing \audio(pause=music1.wav)
is now in pause.
The current track is now music1.wav.

Command Syntax:
resume \audio(resume[=filename])

Description:
This command allows the current audio track to be resumed. To specify the current
track use the track command.
If the track is not paused, this command has no effect.

Example 1:
\audio(mix=music.wav) Audio mixing \audio(pause) is now in pause. \audio(resume)
Audio mixing has been restarted.

Result:
Audio mixing is suspended before the words "is now in pause", then restarts.

Example 2:
\audio(mix=music1.wav;mix=music2.wav;mix=music3.wav) Audio mixing
\audio(pause=music1.wav;pause=music2.wav) is now in pause.
\audio(resume=music2.wav) Audio mixing restarts.
The current track is now music2.wav.

Command Syntax:
pauseall \audio(pauseall)

Description:
This command allows all the audio tracks to be paused. It is possible to resume audio
tracks which have been paused by using the resume command, or the resumeall
command.

Example:
\audio(mix=music1.wav) \audio(mix=music2.wav) This is a test using \audio(pauseall)
the mixing feature.

(equivalent)

\audio(mix=music1.wav;mix=music2.wav) This is a test using \audio(pauseall) the


mixing feature.

Result:
The command will stop both the audio files.

Command Syntax:
resumeall \audio(resumeall)

Description:
This command allows all the paused audio tracks to be resumed.

Example:
\audio(mix=music1.wav)\audio(mix=music2.wav) Audio mixing \audio(pauseall) is
now in pause. \audio(resumeall) Audio mixing has been restarted.

Result:
The mixing is suspended before the words "is now in pause", then resumes.

Command Syntax:
stop \audio(stop[=filename])

Description:
This command allows the current audio track to be stopped. To specify the current
track use the track command.
It is not possible to resume an audio track using the resume command, after a stop
command.

Example 1:
\audio(mix=music.wav) The music mixer \audio(stop) has now been stopped.

Example 2:
\audio(mix=music1.wav;mix=music2.wav) This is a test. \audio(stop=music1.wav)
music1 has now stopped.

Command Syntax:
stopall \audio(stopall)

Description:
This command allows all the audio tracks to be stopped. It is not possible to resume
an audio track using the resume command, after a stopall command.

Example:
\audio(mix=music1.wav) \audio(mix=music2.wav) This is a test using \audio(stopall)
the mixing feature.

(equivalent)

\audio(mix=music1.wav;mix=music2.wav) This is a test using \audio(stopall) the


mixing feature.

Result:
The command will stop both the audio files.

Command Syntax:
path \audio(path=<path>)

Description:
This command allows a common path to be specified where the audio files are stored.

Example:
\audio(path=c:/signals) \audio(mix=music1.wav) This is a test. \audio(mix=music2.wav)
Hello world. \audio(path=c:/oldsignals) \audio(play=music3.wav) .

(equivalent)

\audio(path=c:/signals;mix=music1.wav) This is a test. \audio(mix=music2.wav) Hello


world. \audio(path=c:/oldsignals;play=music3.wav) .

Result:
The file music1.wav and music2.wav will be searched in the local folder c:\signals.
The file music3.wav will be searched in the local folder c:\oldsignals.

Command Syntax:
track \audio(track=<filename.wav>)

Description:
This command specifies which track is treated as the current track.

Example:
\audio(mix=music1.wav) The current track is music1.wav.
\audio(mix=music2.wav) Now the current track is music2.wav.
\audio(track=music1.wav;pause) The "pause" command refers to the music1.wav
track.
Now the current track is music1.wav.
\audio(track=music2.wav;volume=50) The volume of music2.wav is set to 50%. Now
the current track is music2.wav

Note:
If the current track ends or is stopped, a new current track can be selected from among
the active ones by using the track command.

Command Syntax:
mix2play \audio(mix2play[=filename])

Description:
This command switches the current track from mix mode to play mode. It is useful for
concluding the playing of a file of unknown duration.

Example 1:
\audio(mix=music.wav) The audio file is mixed with this sentence. \audio(mix2play)
This sentence will be read after the end of music.wav

Example 2:
\audio(mix=music.wav,loop) The audio file is mixed with this sentence.
\audio(mix2play) This sentence will be read after the end of music.wav. The 'loop'
directive in the mixing command is ignored by mix2play.
Command Syntax:
fadein \audio(fadein=<msec>)

Description:
This command enables a 'fade in' effect to be set for the current track. To specify
the current track, use the track command.

Example:
\audio(mix=music.wav) \audio(fadein=500) The audio file is mixed with this sentence
and then faded.

Command Syntax:
fadeout \audio(fadeout=<msec>)

Description:
This command allows a 'fade out' effect to be set for the current track. To specify
the current track, use the track command.

Example:
\audio(mix=music.wav) The audio file is mixed with \audio(fadeout=500) this
sentence and then faded.

Command Syntax:
recstart/recstop \audio(recstart=<track name>)
\audio(recstop)

Description:
These commands allow speech to be recorded and then used in another
part of the text.

Example:
\audio(recstart=MyTrack1) Try this example using the recording capability.
\audio(recstop;resume) 1234567890.

Result:
The phrase and the numbers will be pronounced together.

Command Syntax:
close \audio(close)

Description:
This command closes the mixer. All tracks are stopped and the memory freed.
Further \audio or \audio(...) tags will reinitialize the audio mixer.

Example:
\audio(mix=music.wav) The audio file is mixed with this sentence. \audio(close) Mixer
flushed.
\audio Now the audio mixer is initialized.
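Since the mixer commands above are plain text tags embedded in the input, a prompt can be assembled programmatically. The following Python sketch is a minimal illustration; the audio_tag and build_prompt helpers are invented for this example and are not part of the Loquendo API, and "music.wav" is a placeholder file name.

```python
# Minimal sketch: assemble a Loquendo \audio tagged prompt as a plain string.
# The helper names and the file name "music.wav" are illustrative only.

def audio_tag(*directives):
    """Format a \\audio control tag from one or more directives."""
    return "\\audio(" + ";".join(directives) + ")"

def build_prompt():
    parts = [
        audio_tag("mix=music.wav"),   # start mixing the background track
        "The audio file is mixed with this sentence.",
        audio_tag("fadeout=500"),     # fade the current track out over 500 ms
        "This part is spoken while the music fades.",
        audio_tag("close"),           # stop all tracks and free the mixer
        "Mixer flushed.",
    ]
    return " ".join(parts)

print(build_prompt())
```

The resulting string can then be passed to the synthesizer like any other input text.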

© 2001-2009 Loquendo. All Rights Reserved.


Loquendo TTS 7 User's Guide

Appendix A - XML Support


Loquendo™ TTS supports VoiceXML 1.0 and SSML 1.0, provided that its text format has been set to
TTSSSML by means of the appropriate API (ttsSetTextFormat) described in the Loquendo™ TTS Programmer's Guide.

The VoiceXML 1.0 variant is recognized by its first-level tag <PROMPT>, SSML 1.0 by its first-level
tag <SPEAK>.

The attribute values of the <pros> (VoiceXML) and <prosody> (SSML) elements can be specified as follows:


Mode Meaning

n Specifies the attribute value (e.g. rate=“110”, 110 words per minute)

+n Increase by n the attribute value (e.g. pitch=+15, increase pitch by 15 Hz)

-n Decrease by n the attribute value (e.g. pitch=-15, decrease pitch by 15 Hz)

+n% Increase the attribute value by n percent (e.g. vol=“+30%”)

-n% Decrease the attribute value by n percent (e.g. vol=“-30%”)

reset Resets the attribute value (to default)
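These value modes can be generated mechanically. The sketch below, in Python, formats a change into the corresponding attribute value string; the pros_value helper is invented for this example and is not part of any Loquendo API.

```python
def pros_value(n=None, delta=None, percent=False, reset=False):
    """Format a <pros>/<prosody> attribute value in one of the documented
    modes: absolute (n), relative (+n / -n), relative percentage (+n% / -n%),
    or reset."""
    if reset:
        return "reset"
    if n is not None:                  # absolute value, e.g. rate="110"
        return str(n)
    sign = "+" if delta >= 0 else "-"  # explicit sign for relative changes
    suffix = "%" if percent else ""
    return f"{sign}{abs(delta)}{suffix}"

print(pros_value(n=110))                    # absolute: 110 words per minute
print(pros_value(delta=15))                 # relative: increase pitch by 15 Hz
print(pros_value(delta=-30, percent=True))  # percentage: decrease volume by 30%
```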

VoiceXML 1.0: supported tags and formats

TAGS | SUPPORT | FORMATS | EXAMPLES

break msecs | supported | Standard | This <break msecs="5000"/> is a 5 seconds pause.
break size (none, small, medium, large) | supported | Standard | This <break size="large"/> is a long pause.
div type="sentence" | supported | Standard | <div type="sentence"> my sentence </div>
div type="paragraph" | supported | Standard | <div type="paragraph"> my paragraph </div>
emp level (strong, moderate, none, reduced) | supported | Standard | Today is a <emp level="strong"> very </emp> important day.
pros rate [1] | supported | Standard | <pros rate="-20%"> Reduced speaking rate </pros>
pros vol | supported | Standard | <pros vol="+20"> Loud sentence </pros>
pros pitch | supported | Standard | <pros pitch="+10%"> High pitch sentence </pros>
pros range | not supported | |
sayas phon | not supported | |
sayas sub | supported | standard | <sayas sub="hi"> hello </sayas>
sayas class="phone" | supported | standard | <sayas class="phone"> 349 4640690 </sayas>
sayas class="date" | supported | standard | <sayas class="date"> 12/12/2000 </sayas>
sayas class="digits" | supported | standard | <sayas class="digits"> 12345 </sayas>
sayas class="literal" | supported | standard | <sayas class="literal"> 12345 </sayas>
sayas class="currency" | not supported | |
sayas class="number" | supported | standard | <sayas class="number"> 12345 </sayas>
sayas class="time" | supported | standard | <sayas class="time"> 23:12:23 </sayas>

SSML 1.0 (W3C WD 02 December 2002): supported elements and formats

The table below lists the SSML elements and attributes supported by Loquendo TTS. For more information on this
W3C specification, go to:
http://www.w3.org/TR/2004/REC-speech-synthesis-20040907

Note: it's not advisable to use Loquendo control tags within SSML formatted text, especially if the equivalent SSML
element exists.
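Since curly quotes or unescaped characters make XML invalid, SSML prompts are safest built with an XML library rather than by string concatenation. The following Python sketch uses only the standard library; the prompt text and values are illustrative, while the elements and attributes used (speak, say-as, break) are those documented in the table below.

```python
import xml.etree.ElementTree as ET

# Minimal sketch: build a well-formed SSML 1.0 prompt with the standard
# library, so that quoting and escaping are handled automatically.
speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en"})
speak.text = "Your code is "

# Read 12345 digit by digit, using the documented number/digits format.
say_as = ET.SubElement(speak, "say-as",
                       {"interpret-as": "number", "format": "digits"})
say_as.text = "12345"
say_as.tail = ". "

ET.SubElement(speak, "break", {"time": "1s"})  # a one-second pause

ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```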

ELEMENTS AND ATTRIBUTES | SUPPORT | NOTE | EXAMPLES

speak | supported | required | <speak version="1.0" xml:lang="en-GB"> 123. </speak>
version (speak attribute) | supported | required |
xml:lang (attribute) | supported | required |
xml:base (speak attribute) | supported | |
xmlns (speak attribute) | not supported | |
xmlns:xsi (speak attribute) | not supported | |
xsi:schemaLocation (speak attribute) | not supported | |
lexicon | supported | Absolute path + filename, URI format: file://..... May occur as immediate children of the speak element. | <speak version="1.0" xml:lang="en"> <lexicon uri="file://mypcname/lexicon.lex"/> Hello. </speak>
meta | supported | not used |
name (meta attribute) | supported | cross control with "http-equiv" |
http-equiv (meta attribute) | supported | cross control with "name" |
content (meta attribute) | supported | required |
metadata | supported | not used |
p | supported | | <speak version="1.0" xml:lang="en"> <p> my paragraph </p> </speak>
xml:lang (p attribute) | supported | | <speak version="1.0" xml:lang="it"> 123 <p xml:lang="en"> my paragraph </p> </speak>
s | supported | | <speak version="1.0" xml:lang="it"> <s> my sentence </s> </speak>
xml:lang (s attribute) | supported | | <speak version="1.0" xml:lang="it"> 123 <s xml:lang="en"> my sentence </s> </speak>
say-as (interpret-as / format / detail) | SUPPORT | EXAMPLES

letters | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="letters"> USA </say-as> </speak>
words | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="words"> USA </say-as> </speak>
number | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="number"> 234512 </say-as> </speak>
number, format="cardinal" | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="number" format="cardinal"> 234512 </say-as> </speak>
number, format="ordinal" | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="number" format="ordinal"> VIII </say-as> </speak>
number, format="telephone" | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="number" format="telephone"> 347 2324769 </say-as> </speak>
number, format="digits" | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="number" format="digits"> 234512 </say-as> </speak>
date (formats: mdy, ymd, dmy, ym, my, md, dm, y, m, d) | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="date" format="ymd"> 2002/12/02 </say-as> </speak>
time (formats: hh:mm:ss, hh:mm) | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="time"> 23:05:16 </say-as> </speak>
currency | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="currency">13,23$</say-as> </speak>
measure | not supported |
telephone | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="telephone"> 347 2324769 </say-as> </speak>
name | not supported |
net, format="email" | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="net" format="email"> name.surname@loquendo.com </say-as> </speak>
net, format="uri" | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="net" format="uri"> http://www.loquendo.com </say-as> </speak>
vxml:boolean | not supported |
vxml:date | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="vxml:date">19630510</say-as> </speak>
vxml:digits | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="vxml:digits">123456</say-as> </speak>
vxml:currency [2] | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="vxml:currency">eur10.32</say-as> </speak>
vxml:number | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="vxml:number">123454</say-as> </speak>
vxml:phone | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="vxml:phone">+39 333 866592</say-as> </speak>
vxml:time | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="vxml:time">0921p</say-as> </speak>
address | supported |
dictate | supported | <speak version="1.0" xml:lang="en"> <say-as interpret-as="" detail="dictate"> It's simple, isn't it? </say-as> </speak>

phoneme | supported | | <speak version="1.0" xml:lang="en"> <phoneme ph="&#x2A7;&#xe6;&#x254;&#x2C8;&#x2D0;">hello</phoneme> </speak>
ph (phoneme attribute) | supported | required |
alphabet (phoneme attribute) | supported | optional; IPA phonemes [3] | <speak version="1.0" xml:lang="en"> <phoneme alphabet="ipa" ph="&#x2A7;&#xe6;&#x254;&#x2C8;&#x2D0;">hello</phoneme> </speak>
sub | supported | | <speak version="1.0" xml:lang="en"> <sub alias="World Wide Web Consortium">W3C</sub> </speak>
voice xml:lang | supported | |
voice gender | supported | | <speak version="1.0" xml:lang="en"> <voice gender="female">This is a female voice.</voice> </speak>
voice age | supported | |
voice variant | supported | | <speak version="1.0" xml:lang="en"> <voice gender="female" variant="2">This is another female voice.</voice> [4] </speak>
voice name [5] | supported | | <speak version="1.0" xml:lang="en"> <voice name="Dave">This sentence is read by Dave.</voice> </speak>
emphasis level | supported | | <speak version="1.0" xml:lang="en"> Today is a <emphasis level="strong">very</emphasis> important day. </speak>
break strength | supported | | <speak version="1.0" xml:lang="en"> Break test <break strength="strong"/>Goodbye. </speak>
break time | supported | | <speak version="1.0" xml:lang="en"> This <break time="4s"/> is a very long pause. </speak>
prosody pitch | supported | standard + absolute variation (Hz) + percentage variation |
  <prosody pitch="high"> High pitch sentence </prosody>
  <prosody pitch="+20"> High pitch sentence </prosody>
  <prosody pitch="+60%"> High pitch sentence </prosody>
prosody contour | supported | | <speak version="1.0" xml:lang="en"> <prosody contour="(0%,+20Hz)(10%,+30%)(40%,+10Hz)">good morning</prosody> </speak>
prosody range | supported | | <speak version="1.0" xml:lang="en"> <prosody range="x-high">good morning</prosody> </speak>
prosody rate | supported | standard + percentage variation |
  <prosody rate="fast"> Fast rate sentence </prosody>
  <prosody rate="230"> Fast rate sentence </prosody>
  <prosody rate="-80.5%"> Slow rate sentence </prosody>
prosody duration | supported | | <speak version="1.0" xml:lang="en"> <prosody duration="3s"> good morning </prosody> </speak>
prosody volume | supported | standard + absolute variation + percentage variation |
  <prosody volume="loud"> High volume sentence </prosody>
  <prosody volume="60.0"> Medium volume sentence </prosody>
  <prosody volume="+10"> High volume sentence </prosody>
  <prosody volume="-40.4%"> Low volume sentence </prosody>
audio [6] | supported | Absolute path + filename, URI format: file://..... | <speak version="1.0" xml:lang="en"> <audio src="file://localhost/welcome.wav">Hello</audio> </speak>
mark | supported | | <speak version="1.0" xml:lang="en"> Go from <mark name="here"/> here, to <mark name="there"/> there! </speak>
desc | supported | Loquendo TTS does not use text-only output mode |

[1] The permitted formats are summarised in the previous table.

[2] Currency Indicators

Italian: EUR, USD, GBP, JPY
French: EUR, USD, GBP, JPY
German: EUR, USD, GBP, JPY
Spanish (and sub-languages, e.g. Mexican): EUR, USD, GBP, JPY, ESP
English (and sub-languages, e.g. American): EUR, USD, GBP, JPY

Currency indicators are admitted in the above languages only.

[3] Use a space to separate phonetic transcriptions of different words.

[4] The variant is the sequence number of the preloaded voices. E.g. if the sequence of the preloaded voices is:
Sonia, Mario, Valentina, Silvana, Roberto, then female variant 2 is Valentina.

[5] IMPORTANT: Do not mix prosody tags and voice switch tags, as the results may be unreliable. The XML parser
raises errors when the requested voice has not been loaded.

[6] The audio element supports 16-bit sound files, mono and stereo, with arbitrary sampling frequency.
".wav" files are supported and played.
".mp3", ".wma", ".asf", ".ogg", ".avi", ".mpg" files are not supported and are not played.
".raw", ".pcm" and files with any other extension are played as raw files.




Appendix B - MRCP Support


The Media Resource Control Protocol (MRCP) specifies a common interface for Media Processing Resources that
provide text-to-speech (TTS) capabilities.
Any MRCP compliant Speech Server supporting Loquendo TTS as its TTS technology (such as the Loquendo Speech
Suite) offers several suitable standard ways to manage text.
In addition, according to the MRCPv1 and MRCPv2 protocols, a technology vendor can extend the standard
protocol capabilities by introducing new "vendor-specific" headers, which an MRCP client can use to invoke
specific server functionalities.
If you are using an MRCP compliant Speech Server with Loquendo TTS as its TTS engine, you can tune
Loquendo TTS by setting its vendor-specific parameters on the MRCP client. The vendor-specific parameters
offered by Loquendo TTS coincide with the modifiable reading parameters described in the section Loquendo TTS User
Control.
For this reason, each vendor-specific parameter description refers you to the Loquendo TTS parameter with
the same name (note that the string "com.loquendo.tts" is prefixed in accordance with the MRCP
specifications).
The way you get/set vendor-specific parameters depends on the speech server implementation: see the MRCP
compliant server documentation for details.

Vendor-specific parameters

Here is the list of the vendor-specific parameters offered by Loquendo TTS:

com.loquendo.tts.MultiCRPause
See MultiCRPause parameter
com.loquendo.tts.MultiSpacePause
See MultiSpacePause parameter
com.loquendo.tts.MultiParPause
See MultiParPause parameter
com.loquendo.tts.SpellingLevel
See SpellingLevel parameter
com.loquendo.tts.TaggedText
See TaggedText parameter
com.loquendo.tts.DefaultNumberType
See DefaultNumberType parameter
com.loquendo.tts.AutoGuess
See AutoGuess parameter
com.loquendo.tts.LanguageSetForGuesser
See LanguageSetForGuesser parameter
com.loquendo.tts.PhoneticCoding
See PhoneticCoding parameter
com.loquendo.tts.SampaSecondAccent
See SampaSecondAccent parameter
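At the protocol level, these parameters travel in the MRCP Vendor-Specific-Parameters header (typically in a SET-PARAMS request) as semicolon-separated name=value pairs. The Python sketch below formats such a header line; the parameter values shown are invented for the example, and the full MRCP message framing (start line, message length) is left to the client stack.

```python
# Hedged sketch: render Loquendo vendor-specific parameters as an MRCP
# Vendor-Specific-Parameters header line. The parameter values ("1", "yes")
# are made up; consult the Loquendo TTS User Control section for real ranges.

def vendor_params_header(params):
    """Render name=value pairs as one MRCP header line, CRLF-terminated."""
    pairs = ";".join(f"{name}={value}" for name, value in params.items())
    return f"Vendor-Specific-Parameters: {pairs}\r\n"

header = vendor_params_header({
    "com.loquendo.tts.SpellingLevel": "1",
    "com.loquendo.tts.TaggedText": "yes",
})
print(header)
```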

