
Viewing Multimedia Messages with 3GPP SMIL in a

Size Driven Mobile Terminal


Master of Science Thesis

Supervisors: Professor Kaisa Sere (ÅA) and M.Sc. Reijo Siira (NMP)
Software Engineering Laboratory
Faculty of Chemical Engineering
Åbo Akademi University
August 2004

Mobile messaging has become a huge business all around the world. Currently, the
richness of multimedia messages is limited by the language used for describing
media timing and interactions, for which Synchronized Multimedia Integration
Language (SMIL) is the de facto standard. A richer SMIL version proposed for
multimedia messages is 3rd Generation Partnership Project (3GPP) SMIL.
This thesis studies the use of 3GPP SMIL for scene description in multimedia
messages. Nokia Series 40 mobile platform is used as the target.
Multimedia Messaging System (MMS), SMIL, and the Series 40 platform are
presented in sufficient detail. Then the problems that may arise are discussed and
solutions presented.
The conclusion is that 3GPP SMIL is a feasible standard and can be utilized on a
Series 40 mobile phone. The 3GPP SMIL profile is much more complex than the
SMIL profile currently used in multimedia messages, and provides a richer user
experience. Interoperability will be an issue, especially for messages sent from a PC.
The small screen of the terminal and its limited processor and memory capacity will
cause problems. User interface design for a 3GPP SMIL viewer is non-trivial,
because there are so many actions the user should be able to take. A user interface
that supports all of the features is presented.

Keywords: Synchronized Multimedia Integration Language (SMIL), Multimedia

Messaging Service (MMS), mobile phones

Mobile messaging has become a big business all around the world. Nowadays the
richness of multimedia messages is limited by the language used to describe the
interaction between, and the synchronization of, the media. Synchronized
Multimedia Integration Language (SMIL) has become the industry standard for this
purpose, and 3rd Generation Partnership Project (3GPP) SMIL is a proposal for the
coming version of SMIL for multimedia messages.
This Master's thesis examines the use of 3GPP SMIL for describing messages.
The Nokia Series 40 is used as the target platform.
The Multimedia Messaging Service, SMIL, and the Series 40 platform are
described in sufficient detail. Then problems that can arise when implementing a
3GPP SMIL viewer are discussed, together with solutions to them.
The conclusion of the work is that 3GPP SMIL is a feasible standard and can be
utilized in a Series 40 mobile phone. The 3GPP SMIL profile is considerably more
extensive than the one currently used in multimedia messages, and provides a much
richer user experience.
Interoperability between different kinds of terminals will cause problems,
especially with messages sent from a personal computer, because the receiving
terminal then has a much smaller screen as well as less processor and memory
capacity. User interface design for a 3GPP SMIL viewer is challenging because
there are so many operations the user must be able to perform. A user interface that
supports all of the functions is presented.

Keywords: Synchronized Multimedia Integration Language (SMIL), mobile phones,
Multimedia Messaging Service (MMS)



ABSTRACT

REFERAT

CONTENTS

PREFACE


1 INTRODUCTION

2 MULTIMEDIA MESSAGING SERVICE

2.1 FROM SMS TO MMS – VIA EMS?
2.2 THE FEATURES OF MMS
2.3 MMS ARCHITECTURE AND MULTIMEDIA MESSAGE DELIVERY
2.4 STRUCTURE OF A MULTIMEDIA MESSAGE
2.5 USE CASES FOR MMS
3 SMIL
3.1 HISTORY
3.2 SYNTAX AND STRUCTURE
3.3 FUNCTIONAL AREAS AND MODULES
3.4 SPATIAL DESCRIPTION
3.5 TEMPORAL DESCRIPTION
3.6 OTHER SMIL FEATURES
3.7 ACTUAL SMIL PROFILES
4 MMS VERSIONS
4.1 MMS WITHOUT SCENE DESCRIPTION
4.2 OMA MMS
4.3 3GPP MMS
5 SERIES 40 PLATFORM
5.1 USER INTERFACE
5.2 MEMORY AND PROCESSOR
5.3 SOFTWARE
6 DEALING WITH TERMINAL CONSTRAINTS
6.1 DISPLAY SIZE
6.2 PROCESSING POWER
6.3 MEMORY
7 USER INTERFACE DESIGN
7.1 AN ADVANCED USER INTERFACE
7.2 SIMPLIFYING THE USER INTERFACE
7.3 AN ALTERNATIVE TO MEDIA SCROLLING

8 DISCUSSION

9 SUMMARY

10 REFERENCES

A SVENSK SAMMANFATTNING (SWEDISH SUMMARY)
A.1 INTRODUCTION
A.2 MULTIMEDIA MESSAGING SERVICE
A.3 SMIL
A.4 3GPP MULTIMEDIA MESSAGES
A.5 SERIES 40 PLATFORM
A.6 3GPP SMIL ON A SERIES 40 TERMINAL
A.7 USER INTERFACE DESIGN
A.8 DISCUSSION

This work is dedicated to my Päivi; for making me whole, for always being by my
side, and for bringing light to the occasional hours of darkness.

This thesis was written during 2003–2004 for Nokia Mobile Phones. It has been a
preliminary study for a software project, and thus I have been able to work full-time
on the thesis. I would like to give my warmest thanks for this opportunity, especially
to my bosses Teija Pääkkönen and Reijo Siira. The latter acted as the supervisor of
the thesis and helped a lot with finding a proper subject, forming the structure of this
work, and with proofreading. Thank you!
My supervisor from the Åbo Akademi University’s side has been Kaisa Sere. I
thank her for advice and comments. Jaakko Arvilommi was of great help with
proofreading, and Thomas “Toma” Andersson helped with guiding me through the
mysteries of the Swedish language.
I would still like to send my general greetings to (in alphabetical order): Anna,
Hese, Honkkari, Härtsi, isä, Jenki, Jesse, Keso, Leski, Marcus, Olavi, Osku, Paksa,
Pauski, Pete, Tanja, Turku Terror, the good people of DaTe and äiti.

Salo, August 13th (a Friday), 2004

Asmo Soinio

3GPP 3rd Generation Partnership Project
CDMA Code Division Multiple Access
CSD Circuit Switched Data
DSP Digital Signal Processor
EMS Enhanced Messaging Service
GPRS General Packet Radio Service
GSM Global System for Mobile Communications
HTML Hypertext Markup Language
MCU Microprocessor Control Unit
MIME Multipurpose Internet Mail Extensions
MM Multimedia Message
MMS Multimedia Messaging Service
MMSE MMS Environment
MSISDN Mobile Station Integrated Service Digital Network Number
OMA Open Mobile Alliance
RAM Random Access Memory
SMIL Synchronized Multimedia Integration Language
SMS Short Message Service
SVG Scalable Vector Graphics
UCS Universal Character Set
URI Uniform Resource Identifier
URL Uniform Resource Locator
VAS Value Added Service
W3C World Wide Web Consortium
WAP Wireless Application Protocol
XML Extensible Markup Language

Mobile messaging has become a part of the daily lives of many people, and a huge
business all around the world. Short Messaging Service (SMS) has been an
unforeseen success, and the more advanced Multimedia Messaging Service (MMS)
is catching on as well. To make multimedia messaging even richer, Synchronized
Multimedia Integration Language (SMIL) was introduced to describe relationships
between the media in a message. The first version of SMIL for MMS was greatly
simplified in order for it to be easy to implement on different platforms, ensuring
conformance. A proposed next SMIL version for MMS is 3GPP SMIL, defined by
the 3rd Generation Partnership Project (3GPP).
The purpose of this thesis is to study whether 3GPP SMIL is a valid standard for
MMS scene description. This is done by studying the problems that arise when it is
implemented on a size driven mobile terminal, and by presenting solutions to those
problems.
The Nokia Series 40 mobile platform is the target for this study. It was a natural
choice due to the needs of the mandator of this thesis, but also because it is a widely
spread platform and known for its good user interface. It is a clearly size driven
platform, whereas so-called smart phones, mostly built on Symbian OS, are more
feature driven, and have better capabilities. It should be noted that every mass
product also fundamentally has to be price driven, thus making it infeasible to create
a product that would be both small and among the most powerful.
This thesis starts with presenting MMS, its architecture, its relation to SMS, and
some use cases. Chapter 3 presents the SMIL language. It introduces practically all
features of SMIL, focusing on those included in the 3GPP SMIL profile. Chapter 3
can also be used, and has actually been used in our department, independently as an
introduction to the SMIL language. Chapter 4 lists the current MMS versions, their
media types and formats, and their relation to SMIL. Chapter 5 introduces the Series
40 platform, with focus on the features related to viewing SMIL presentations.
Chapter 6 lists terminal constraints that may cause problems when implementing a
3GPP SMIL viewer, and proposes solutions to these problems. User interface design
is among these problems, and Chapter 7 is devoted to discussing a user interface that
enables all of the 3GPP SMIL features on the Series 40 platform.
Chapter 8 discusses the problems and solutions presented, and Chapter 9 briefly
summarizes the whole thesis.
The ideas in Chapters 6 and 7 are my own, except for the alternative scrolling
mechanism presented in sub-section 7.3, which was a result of the discussions during
the design process of the actual SMIL viewer software specification at Nokia.


This chapter will give an overview of the Multimedia Messaging Service (MMS),
focusing on the overall architecture and the end-user experience. Differences
between the versions of MMS conformance documents and specifications are
discussed in later chapters. The Short Message Service (SMS) and Enhanced
Messaging Service (EMS) are also briefly introduced, as they are the predecessors of
MMS and their features served as a base for defining it.
Technical details about transmission of the multimedia message (MM) data
between two mobile devices or a server and a mobile device are mostly outside the
scope of this thesis. For such information, please refer to the 3GPP and OMA
documents listed in the references.

2.1 From SMS to MMS – via EMS?

2.1.1 Short Message Service

SMS was introduced to Global System for Mobile Communications (GSM)
networks, and commercially launched in 1992. The service allows the transfer of
textual messages with at most 140 octets of data. The number of characters available
with different encoding standards is shown in Table 2.1. Transfer of the messages is
based on the store-and-forward principle, i.e. if the receiving device is not available,
the message is stored in the service center until it can be delivered or the validity
period of the message expires. Most new phones also support concatenation of
several messages (with a few octets less payload each), but with most operators, the
user has to pay for each of them separately.
SMS has been an enormous success and nowadays hundreds of billions of short
messages are sent yearly. SMS has also been implemented on other network
technologies like GPRS and CDMA, and messages can be sent across national
borders and to networks using a different network technology than the sender's.
Despite the limited amount of data, there is a wide range of additional services that
use SMS, for example news, email and weather services. Short messages are also
widely used for public voting and feedback in TV and radio shows.
Many manufacturers have expanded the SMS standard on application-level with
possibilities for richer media. Nokia’s Smart Messaging is one such expansion and it
enables sending of black and white images and monophonic ringing tones. It is an
open specification, but has not been adopted by other manufacturers.

Table 2.1 The amount of data in a short message

Encoding                                    Amount of data in one short message
GSM alphabet, 7 bits                        160 characters
8-bit data                                  140 octets
UCS2, 16 bits (supports Asian characters)   70 characters
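The character counts in Table 2.1 follow directly from the fixed 140-octet payload; a small sketch of the arithmetic (plain division, no SMS library involved):

```python
# Maximum characters that fit in a single 140-octet SMS payload,
# derived from the bit width of each character encoding.

PAYLOAD_OCTETS = 140

def max_chars(bits_per_char: int) -> int:
    """Characters that fit when each one occupies bits_per_char bits."""
    return (PAYLOAD_OCTETS * 8) // bits_per_char

print(max_chars(7))   # GSM 7-bit default alphabet -> 160
print(max_chars(8))   # 8-bit data                 -> 140
print(max_chars(16))  # UCS-2                      -> 70
```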

2.1.2 Enhanced Messaging Service

EMS is an application-level extension to SMS messaging introduced by Ericsson and
standardized by 3GPP [3GPP23.040] in 1999. It adds richer media to SMS, but
requires no infrastructure update from the network operators. The standardization of
EMS has been evolutionary, but [Le Bodic] divides it into two distinguishable steps:
Basic EMS introduced in [3GPP23.040] release 99 and extended EMS introduced in
[3GPP23.040] release 5 (June 2002).
Basic EMS messages can contain simple text formatting, monophonic melodies,
black and white pictures, and black and white animations (up to four pictures, 16 x
16 pixels each). Such messages can be concatenated, but an enhanced element
(melody, image or animation) cannot be spread over several segments. Therefore, the
maximum size of an element in Basic EMS is about 140 bytes. This limits, for
example, the melodies to a few seconds. The standard also specifies some predefined
animations and melodies that can be used without sending the full representation of
the element.
Extended EMS breaks many limitations of the Basic EMS and has the following
additional features:
• The elements can be distributed over many message segments, and the
maximum size of an element is limited to 255 messages (about 34 kilobytes).
In practice, elements bigger than 8 messages (about one kilobyte) should be
compressed.
• There is support for compression of objects, to cope with these bigger
elements.
• New media types:
o Up to 64-colour bitmap images
o Up to 64-colour bitmap animations
o vCards (phonebook entries, business cards)
o vCalendar data (calendar entries)
o Polyphonic MIDI melodies
o Vector graphics
• Text background and foreground color formatting
• Hyperlinks
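The size figures quoted above can be checked with simple arithmetic. The per-segment payload of 134 octets below is an assumption (roughly 140 octets minus concatenation and EMS headers), so treat the results as approximations:

```python
# Back-of-the-envelope check of the extended EMS element size limits.
# SEGMENT_PAYLOAD is an assumption: about 140 octets minus the
# concatenation / EMS header overhead per segment.

SEGMENT_PAYLOAD = 134  # octets usable per concatenated segment (approx.)

def element_bytes(segments: int) -> int:
    """Approximate maximum element size spread over this many segments."""
    return segments * SEGMENT_PAYLOAD

print(element_bytes(255))  # about 34 kilobytes
print(element_bytes(8))    # about one kilobyte
```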

The first EMS-capable devices were introduced in 2001, and many manufacturers
support the basic version of EMS. However, according to [Le Bodic], in 2002 there
were no commercial products supporting extended EMS. In addition, the original
author of EMS, Sony Ericsson, lists in their EMS guide [SNE-EMS-Dg] (published
in August 2003) only the features of Basic EMS. Therefore, it can be assumed that
there is no wide support, if any, for extended EMS at the time of writing this thesis.
The fact that EMS uses standard SMS as the transfer method makes the service
available in almost all of today’s mobile networks, but SMS was never designed for
multimedia content transfer and its bandwidth is very limited. In addition, operators
do not currently have different billing schemes for EMS and SMS, which can make
the costs of sending an EMS message quite high. This limits, for example, the usage
of big color images that would take up tens of message segments.
Multimedia Messaging Service (MMS), on the other hand, can take advantage of
the high bandwidth of GPRS and UMTS network technologies, and there have been
commercial products supporting MMS since the second quarter of 2002. Therefore,
it is likely that commercial use of EMS will be limited to Basic EMS, at least if the
introduction of MMS continues rapidly and successfully.
The rest of this chapter will focus on MMS.

2.2 The features of MMS

As the name implies, Multimedia Messaging Service (MMS) allows the transfer of
truly multimedia content. To the end-user, it might not seem that different from SMS
or especially EMS, but the technology behind these services is very different. MMS
was designed to enable transfer of any kind of content and is not tied to a single
transport technology: It can make use of the advantages of third generation mobile
networks (3G), but can also be used in standard GSM networks (2G) with Circuit
Switched Data (CSD). It provides interoperability with Internet electronic mail
(email) and adopts many of the transport protocols and message formats that are
already in use on the Internet. Such features as group sending, delivery and read-
reply reports, message priorities, and message classes have been adopted from
Internet messaging systems.
The following sub-section will give an overview of the quite complicated
standardization work done to realize MMS and the rest of this chapter will focus on
the details of MMS.

2.2.1 MMS specifications

The definition of MMS has required a great amount of work from different
standardization organizations. The main responsible bodies have been the 3rd
Generation Partnership Project (3GPP) and the Open Mobile Alliance (OMA).
Earlier versions of the OMA documents were produced by the WAP Forum, which
was merged into OMA in June 2002. In addition, standards by many other
organizations play a role in the realization of the service, especially definitions of
media formats and Internet messaging standards.
[Le Bodic] clarifies the division of work: 3GPP focuses on the high-level service
requirements, architectural aspects of MMS and content formats, and OMA focuses
on technical realization of MMS on the basis of WAP and Internet transport
protocols. The documents defining the fundamental service are listed in Table 2.2.

Table 2.2 Documents that define MMS (not an exhaustive list)

Author   Documents
3GPP     • Multimedia Messaging Service (MMS); Stage 1 (TS 22.140)
         • Multimedia Messaging Service (MMS); Functional description; Stage 2
           (TS 23.140) [3GPP23.140]
OMA      • Multimedia Messaging Service: Architecture Overview [OMA-MMSArc]
         • Multimedia Messaging Service: Client Transactions [OMA-MMSCTr]
         • Multimedia Messaging Service: Encapsulation Protocol [OMA-MMSEnc]

2.3 MMS architecture and Multimedia Message delivery

This section introduces the high-level architectural elements of MMS and the
basic delivery technique. These are mostly specified in [3GPP22.140], [3GPP23.140]
and [OMA-MMSArc].
The Multimedia Messaging Service Environment (MMSE), illustrated by an
ellipse in Figure 2.1, includes all the MMS specific network elements under the
control of a single MMS provider (often a mobile network operator). Outside it are
the MMS User Agents – the mobile devices capable of viewing, composing and
handling multimedia messages (MM) – and a Wired Email Client representing the
interoperability with Internet Email.

[Figure 2.1 depicts the MMSE as an ellipse containing the MMS Relay/Server, a
message store, and user databases (e.g. profiles, MMS subscription data, HLR),
reached via 2G and 3G mobile networks. Outside the MMSE are external servers on
the Internet/IP network, MMS User Agents (including a roaming MMS User Agent
in another mobile network), and a Wired Email Client.]
Figure 2.1 MMS Architectural Elements [3GPP23.140]

The heart of the MMSE is the MMS Relay/Server, often referred to as the MMS
Center (MMSC). It is in charge of storing and managing messages, reports, and
notifications. The MMS Server provides storage services and operational support for
the system and the MMS Relay transfers messages to and from other MMSCs and
other messaging systems (SMS centers and email servers). The MMSC may also
perform content adaptation according to an MMS user agent's capabilities (for example
support for colors or screen size) or when sending a message to a legacy messaging
system (i.e. SMS).
The MMS VAS Applications element is a server that acts much like a user agent,
i.e. it can receive and send messages. It provides machine-to-person services to the
end-users and may additionally be able to create Charging Data Records (CDR) for
service specific charging. See section 2.5.2 for examples.
The delivery of a multimedia message through these elements is clarified by the
following use case from [OMA-MMSArc], adapted to suit the elements in Figure
2.1. This use case example concerns person-to-person messaging between two
mobile terminals.

1. User activates MMS User Agent.

2. User selects or enters MM target address.
3. User composes/edits MM to be sent.
4. User requests that MM is sent.
5. MMS Client submits the message to its associated MMS Relay.
6. MMS Relay resolves the MM target address.
7. MMS Relay routes forward the MM to the target MMS Relay (included in
External Servers).
8. The MM is stored by the MMS Server associated with the target MMS
Relay.
9. Target MMS Relay sends a notification to target MMS User Agent.
10. Target MMS User Agent retrieves the MM from the MMS Server.
11. Target MMS User Agent notifies target user of new MM available.
12. Target user requests rendering of received MM.
13. Target MMS User Agent renders MM on target user’s terminal.
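Steps 5–11 above can be modeled as a toy store-and-forward exchange. All class and method names below are illustrative and not taken from any MMS specification:

```python
# A toy model of the delivery steps: the target server stores the MM,
# the target user agent is notified, and retrieval happens either
# immediately or only after user approval (the deferred case).

class MmsServer:
    """Stores submitted messages until they are retrieved (step 8)."""
    def __init__(self):
        self.store = {}      # message id -> message body
        self.next_id = 0

    def accept(self, message):
        self.next_id += 1
        self.store[self.next_id] = message
        return self.next_id  # reference carried in the notification

    def retrieve(self, msg_id):
        return self.store.pop(msg_id)

class UserAgent:
    """Receives notifications (step 9) and retrieves MMs (step 10)."""
    def __init__(self, server, immediate_retrieval=True):
        self.server = server
        self.immediate = immediate_retrieval  # False e.g. while roaming
        self.pending = []    # notifications awaiting user approval
        self.inbox = []

    def notify(self, msg_id):
        if self.immediate:
            self.inbox.append(self.server.retrieve(msg_id))
        else:
            self.pending.append(msg_id)  # retrieval deferred

    def approve(self, msg_id):
        self.pending.remove(msg_id)
        self.inbox.append(self.server.retrieve(msg_id))

server = MmsServer()
agent = UserAgent(server, immediate_retrieval=False)
ref = server.accept("holiday photo + greeting text")
agent.notify(ref)
assert agent.inbox == [] and agent.pending == [ref]
agent.approve(ref)
assert agent.inbox == ["holiday photo + greeting text"]
```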

The user agent may be configured so that steps 10 and 11 occur in reverse order,
meaning that the MM is retrieved only after the user has approved the retrieval. This
is useful especially when the user is roaming (using a network other than the home
network), for billing reasons.
Some or even all of the actual contents might also be transferred using streaming
protocols. In streaming, data chunks are directly rendered on the recipient’s device
without waiting for the whole message to be retrieved. These data chunks can then be
discarded. This enables the user to start viewing a message before it has been fully
transferred, and saves memory of the receiving device.
The data transfer between the user agent and the MMSC can be implemented on
top of the Wireless Application Protocol (WAP). This configuration can use the
whole range of wireless networks from 2G to 3G. In this architectural configuration,
an additional network element – a WAP Gateway – is introduced as illustrated in
Figure 2.2. The communication between the gateway and the MMSC is done using
Hypertext Transfer Protocol (HTTP) over an Internet Protocol (IP) network whereas
the user agent and the gateway communicate using the WAP protocol stack and the
Wireless Session Protocol.

[Figure 2.2 shows the MMS User Agent exchanging the payload with a gateway over
the wireless network, and the gateway relaying it to the MMS Relay/Server over the
Internet/IP network.]
Figure 2.2 MMS data transfer over WAP [3GPP23.140]


2.4 Structure of a Multimedia Message

A multimedia message is divided into headers and a message body. This is done
according to [RFC-2822], which specifies a message as an envelope and contents.
The envelope contains so-called header fields with information like recipient address,
address of the user who sent the message, date and time when the message was sent
and the message’s subject. The addressing scheme of MMS combines email [RFC-
2822] and MSISDN addresses [ITU-E.164]. Examples of possible address fields in
multimedia messages are shown in Figure 2.3.

To: 0401234567/TYPE=PLMN
To: +358501234567/TYPE=PLMN
To: Joe User <joe@user.org>

Figure 2.3 Examples of MMS addresses [OMA-MMSEnc]
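A receiving client must tell the two address forms apart. A minimal sketch, assuming anything without a /TYPE=PLMN suffix is treated as an RFC 2822 email address (the real handling is defined in [OMA-MMSEnc]):

```python
# Distinguishing the MMS address forms shown in Figure 2.3:
# MSISDN addresses carry a /TYPE=PLMN suffix; everything else is
# treated as an email address. Illustrative, not a full parser.

PLMN_SUFFIX = "/TYPE=PLMN"

def classify_address(addr: str):
    """Return (kind, value) where kind is 'PLMN' or 'EMAIL'."""
    addr = addr.strip()
    if addr.upper().endswith(PLMN_SUFFIX):
        return ("PLMN", addr[: -len(PLMN_SUFFIX)])
    return ("EMAIL", addr)

print(classify_address("+358501234567/TYPE=PLMN"))
print(classify_address("Joe User <joe@user.org>"))
```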

The RFC 2822 representation restricts the contents of a message to a single part of
US-ASCII text. This format is therefore extended with the Multipurpose Internet
Mail Extension (MIME) [RFC-2045, RFC-2046, RFC-2047, RFC-2387]. It allows
the representation of multiple non-textual parts in one message. The types of these
parts are described with a content type, also known as MIME-type, which is
composed of a media type, a media subtype and optional parameters. Examples of
content types are listed in Table 2.3.

Table 2.3 Examples of MIME content types

MIME content type               Description
image/jpeg                      An image in the JPEG format
text/plain; charset="us-ascii"  Text with the character set US-ASCII
application/octet-stream        A sequence of octets with an unknown format
multipart/mixed                 The basic multipart subtype, a message
                                composed of one or more parts

Even though MIME can represent binary data, it is a textual format with human-
readable field names and values. In order to decrease the size of the data sent over
the mobile network, the WAP Forum has defined a straightforward translation from a
MIME message to a binary format [WAP-230WSP: Section 8.5]. In this format the
most often occurring character strings are replaced with a short binary value.
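The flavour of that translation can be sketched as a token-substitution table. The token values below are invented for illustration and are NOT the real assignments, which are listed in [WAP-230WSP], Section 8.5:

```python
# Sketch of WSP-style header compression: well-known header names are
# replaced by one-octet codes. The token values here are hypothetical,
# NOT the real WSP code page from [WAP-230WSP].

FIELD_TOKENS = {
    "Content-Type": 0x91,  # made-up value
    "From": 0x89,          # made-up value
    "To": 0x97,            # made-up value
}

def encode_header(name: str, value: str) -> bytes:
    """Replace a well-known header name with its one-octet token;
    unknown header names stay textual."""
    token = FIELD_TOKENS.get(name)
    if token is None:
        return name.encode("ascii") + b": " + value.encode("ascii")
    return bytes([token]) + value.encode("ascii") + b"\x00"

print(encode_header("To", "+358501234567/TYPE=PLMN"))
print(encode_header("X-Custom", "hello"))
```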
A multimedia message is meant to be a presentation, not just a pile of unrelated
files. To achieve this, a multimedia message may contain a file that describes the
graphical layout and temporal synchronization of the message’s elements, called
scene description. The binary MIME-transformation includes a Start header field
that defines which of the parts is that description. The model of a multimedia
message is shown in Figure 2.4. The de facto standard for the format and language of
a scene presentation in MMS is Synchronized Multimedia Integration Language
(SMIL), but also XHTML and WML are mentioned in the specifications. SMIL can
describe slideshows with timed and interactive events, and can include links to parts

of the slideshow and to external resources. Chapter 3 will focus on the features and
different versions of SMIL.

Figure 2.4 Model of a multimedia message [OMA-MMSEnc]
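In the textual MIME representation, the model in Figure 2.4 maps naturally onto a multipart/related body whose Content-Type "start" parameter plays the role of the binary Start header field. Below is a sketch using Python's standard email package; it is not a real MMS encoder, and the SMIL and text contents are placeholders:

```python
# Building an MM-like multipart/related body: the "start" parameter of
# the Content-Type header points at the SMIL scene description part,
# which in turn references the other parts. Contents are placeholders.

from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText

SMIL_DOC = b'<smil><body><par><text src="hello.txt"/></par></body></smil>'

mm = MIMEMultipart("related", start="<scene>", type="application/smil")
mm["To"] = "+358501234567/TYPE=PLMN"
mm["Subject"] = "Greetings"

scene = MIMEApplication(SMIL_DOC, "smil")   # the scene description part
scene["Content-ID"] = "<scene>"
mm.attach(scene)

text = MIMEText("Hello from the trip!", "plain", "us-ascii")
text["Content-ID"] = "<hello.txt>"
mm.attach(text)

print(mm.as_string())
```

The same structure is what the binary MIME translation described earlier compacts for transfer over the mobile network.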

2.5 Use cases for MMS

Even though advertising, at least nowadays in Europe, presents MMS as a picture
messaging service only, there is much more to it. The messages can contain many
kinds of media files and can even be interactive in the sense that the user can browse
through them, select links and so on. As the type of data sent via MMS is not limited,
it could be possible to send arbitrary data and install a third-party plug-in that
handles it.
Due to the fact that MMS is such a wide and, in many senses, unlimited standard,
it is extremely hard to predict the killer application, or to give even a nearly
exhaustive list of the actual use cases. The ones listed here represent according to [Le
Bodic] the basis that standardization organizations used when determining high-level
requirements for MMS.

2.5.1 Person-to-person

The main use case for SMS, covering about 80% of the operators' revenues from it,
is person-to-person messaging. Therefore it is likely that it will also be the main
application for MMS, even though creating the media used in MMS is not that trivial
for an end-user.
Picture or video messaging is the use case that has been mostly advertised so far.
Many modern mobile phones have a camera either built-in or available as an
accessory. The first phones with cameras were only able to capture still photos, but

video recording is possible in most of the newer models. These images and video
clips can be instantly sent in multimedia messages to a phone supporting MMS or to
an email address. A typical scenario is a subscriber on a trip taking a photo with a
mobile phone, adding some text and sending it to friends back home like an instant
postcard. Figure 2.5 shows an example of this case.

Figure 2.5 An example multimedia message, viewed with RealPlayer 8.

Voicemail can also be sent via MMS, as the standard supports AMR voice clips. In
GSM systems, voicemails are usually stored on the operator's servers if the called
device is unreachable, and the receiver is notified by an SMS. With MMS there are
at least two ways to handle voicemail: the network operator could upgrade the usual
voicemail system by sending the voicemails directly to the receiver's phone as MMS,
possibly with some extra data about when the message was sent, who sent it and so
on. The user could also record a voice memo, attach it to an MMS and send it to the
receiver directly, without disturbing the receiver by making a phone call.
An operator could provide storage services for a user's own content. When the
user has created some content with his phone that he would like to store or share, he
sends it to the operator’s server where it is stored. A web interface could be provided
to the contents so that the user can easily share these with his friends by sending the
Uniform Resource Locator (URL) of his contents.

2.5.2 Machine-to-person

Another category of MMS usage is the messages sent by servers to a user or a list of
users. These are also referred to as value added services (VAS). In this category, a
service provider provides the contents. The user often needs to subscribe to a service
in order to receive the content.

An example service could be a weather service: the user subscribes to a weather
report for a given region by sending a multimedia message (or short message) with
the region's name to the server, and it responds with a graphical weather report for
the region. The server could also send these reports daily until the user cancels the
service. These kinds of services will most likely be the ones that benefit most from
the advanced features of 3GPP SMIL, as professional designers create the content on
a workstation.
Another already implemented and moderately used service that uses MMS is the
purchasing of additional applications for mobile phones. Application archives
containing Java Midlets or Symbian software can be ordered with short messages
and are then delivered to the user in multimedia messages. Most phones that support
such additional software can also handle the installation from a multimedia message.
Figure 2.6 shows an example of an advanced advertisement presentation. When
the user focuses on the buttons, the picture on the left changes and shows the phone
from different viewpoints. There is also a sub-menu, which is shown when the fourth
item is focused. (Used by courtesy of Grassel Guido / Nokia)

Figure 2.6 An advertisement multimedia message, viewed with InterObject SMIL


This chapter will introduce Synchronized Multimedia Integration Language (SMIL)
[W3C-SMIL2], how and where it is used, its profiling mechanism, and differences
between some of the current profiles. Every major feature of the language will be
presented but the parts that are relevant for MMS will be emphasized.
SMIL, pronounced ‘smile’, is an XML [W3C-XML] based language for
describing multimedia presentations. It was created for describing multimedia
contents on a PC, but was also adopted into MMS for scene description.

3.1 History

Multimedia presentations have been around for a long time, but a common, open
format for describing such presentations has not existed. Temporal elements such as
audio and video have also increasingly been taken into use on the Web. The World
Wide Web Consortium (W3C), an organization that develops common Web
protocols, recognized the need for a declarative format for expressing media
synchronization. A working group focusing on the design of such a language was
established in 1997, and it gave rise to SMIL 1.0 [W3C-SMIL1] in 1998. Another
working group was founded to continue on the subject, and SMIL 2.0 [W3C-SMIL2]
became a W3C recommendation in September 2001.
Usage of SMIL on the Internet is still quite limited, even though major media
players like RealPlayer and QuickTime have been supporting it almost since its
advent. Multimedia components on the Web are still mostly done using proprietary
players like Flash by Macromedia. SMIL has been selected as one of the scene
description languages for MMS, and it seems to have gained de facto status in that
application. Therefore it is likely that if phone-to-email messaging with MMS
becomes successful, SMIL players will spread to the PCs of average users. This
might also push forward the usage of SMIL on the Internet.

3.2 Syntax and structure

The idea of SMIL is to enable description of multimedia presentation where audio,

video, text, and graphics are combined in a timed fashion. It is a language for
describing how and when the contents are shown, with the actual contents kept in
separate files; it acts as the glue that holds everything together. One of its merits is that the language
can be easily authored with a simple text editor.
The two major versions of SMIL, 1.0 and 2.0, are syntactically mostly similar but
the newer version adds a lot of functionality. Particularly the addition of profiling,
the ability to easily define sub-sets of the full SMIL 2.0 language, is valuable in the
context of mobile devices. Therefore, this work will focus solely on SMIL 2.0, and
the term SMIL is used to refer to SMIL 2.0. In the few references to SMIL 1.0 the
full version is stated.
To anyone familiar with Hypertext Markup Language (HTML), the syntax of
SMIL will be immediately recognizable. Like any XML based language, SMIL consists of
elements that may contain attributes and other elements, forming a tree structure. The
root element in SMIL is, unsurprisingly, smil and it is always present in a well-
formed SMIL document. The root element may contain two sub elements, head and
body, like in HTML. The head element contains header information concerning the

whole document, and the body describes the actual contents. A simple SMIL
document is shown in Figure 3.1. It describes the example presentation in Figure 2.5.
Note that the lines containing the region elements are split across two rows for
layout reasons; there is no newline character at those points in the actual SMIL
document.

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <layout>
      <root-layout width="170" height="208" />
      <region id="Text" width="100%" height="25%" left="0%"
              top="75%" fit="scroll" />
      <region id="Image" width="100%" height="75%" left="0%"
              top="0%" fit="slice" />
    </layout>
  </head>
  <body>
    <par dur="8000ms">
      <img src="lapland.jpg" region="Image" />
      <text src="lapland.txt" region="Text" />
    </par>
  </body>
</smil>

Figure 3.1 A simple SMIL document

Table 3.1 The functional areas of SMIL

Functional area      Modules  Description of modules

Timing                 19     Temporal descriptions and events
Time Manipulations      1     Controls the rate or speed of time for media
                              elements; fast forward, rewind etc.
Animation               2     Manipulates media element properties with time
Content Control         4     Selects media contents depending on system
                              properties
Layout                  4     Positioning of elements
Linking                 3     Hyperlinking and navigation
Media Objects           7     Description of media objects and their parameters
Metainformation         1     Description of data in a SMIL document
Structure               1     The basic structure of SMIL documents
Transitions             3     Transitions between media objects; fades, wipes

3.3 Functional areas and modules

The specification of SMIL is divided into ten major functional areas, listed in Table
3.1. Each of the areas is composed of several modules, adding up to 45 different
modules. These modules consist of a number of semantically related SMIL elements,
attributes, and attribute value definitions.

This organization enables other parties to easily define a SMIL profile by
selecting which of the modules to implement. There are dependencies between the
modules, meaning that a module builds on the functionality produced by some other
module and cannot be used without it. SMIL modules are also used to integrate
SMIL features into other XML based languages, like in the XHTML + SMIL profile.
The next three sections will discuss SMIL features in general, and Section 3.7 will
define which features are included in which SMIL profiles.

3.4 Spatial description

SMIL allows the definition of complex, layered layouts by defining regions that hold
the visual contents of the presentation. These spatial definitions are always in the
head-section of a SMIL document, under the layout element, and apply to the
whole presentation. The regions are defined without any content, and actual visual
layouts during the presentation are determined by how and when some visual
components are visible in these regions.

3.4.1 Region

The base for layouts in SMIL is the region element. A region is a rectangular area
that can contain a number of visual components. Regions can be placed arbitrarily,
even on top of each other, or partly or entirely outside the visible area. The previous
example in Figure 3.1 defines two regions with dimensions and position relative to
the element containing the region (discussed in the next section), an identifier, and a fit
attribute. The layout of these regions is shown in Figure 3.2.

Figure 3.2 The regions defined by the SMIL document in Figure 3.1

The fit attribute specifies how a graphical element is displayed in this region.
The possible values are: fill, hidden, meet, scroll and slice. The effect of
these values is demonstrated in Figure 3.3 (page 14). Note that the difference
between fill and slice is that with slice the aspect ratio of the original image
is preserved and parts of the image are not shown, whereas with fill the image is just
scaled to fit the region. If the fit attribute is not defined, it defaults to hidden.
The identifiers, specified with the id attribute, are unique character strings used to
refer to these regions (or to any other elements in a SMIL document). The attribute

regionName is actually the primary reference for regions, and many regions may
share the same name, so that a media object assigned to this destination will be
shown in all of those regions simultaneously (if possible). The id attribute is used only
if no region with the given regionName is found.
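As an illustration (the identifiers and file name below are invented for this sketch), two regions sharing a regionName could be defined as follows; the image would then be shown in both regions at the same time:

<region id="topBanner" regionName="banner" top="0%" height="20%" />
<region id="bottomBanner" regionName="banner" top="80%" height="20%" />
...
<img src="logo.png" region="banner" dur="5s" />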
Dimensions and position for a region may also be given as exact pixel values, like
"100px", where the unit qualifier px (pixels) can also be omitted. If the
dimensions of a region are omitted, partly or fully, the region will be placed
according to the limits of the parent region. Dimensions can also be defined using the
right and bottom attributes of the region instead of the width and height, or any
combination of these.
If two regions overlap, their order may be defined with the z-index attribute that
is defined as an integer. A region with a bigger z-index value is stacked on top of a
region with a smaller z-index value. The color of the parts of the region that are not
filled by the media can be defined with the backgroundColor1 attribute.
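The following sketch (region names invented) combines these positioning features: the first region is positioned with pixel values using the right and bottom attributes, and the second region is stacked on top of it with its own background color:

<region id="base" left="10" top="10" right="10" bottom="10"
        z-index="0" backgroundColor="white" />
<region id="overlay" left="25%" top="25%" width="50%" height="50%"
        z-index="1" backgroundColor="black" />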

(Image grid omitted: for each fit value, i.e. fill, hidden, scroll and slice, the figure
shows the result of placing a smaller and a bigger source image in the same region.)

Figure 3.3 Effect of the fit attribute

1 The name for this attribute in SMIL 1.0 is background-color. A SMIL 2.0 player may also
support that name, but its use is deprecated. All attributes in SMIL 2.0 have ‘lower camel case’
names, whereas SMIL 1.0 uses hyphenated names.

3.4.2 Other layout elements

In the example earlier, the first child of the layout element was a root-layout
element. This defines the size of the main visual element of the presentation, and in a
PC viewer usually the size of the application window used to show the presentation.
There may be only one root-layout element in a SMIL document, and it has no
children. Note that this is actually not completely logical: in the example, the
region elements that are siblings of the root-layout are conceptually clearly
‘inside’ the root-layout and should therefore be its children. The pixel sizes of the
regions are defined relative to the size of the root-layout.
A SMIL presentation can also be defined to use several separate windows by using
any number of topLayout elements. Regions that are children of a topLayout element
will be placed in that separate window. The HierarchicalLayout module, among
some other things, extends the basic layout model with support for hierarchical
region layouts: regions nested inside other regions. An example of these features is
shown in Figure 3.4.

The layout element

<topLayout width="640px" height="480px">
  <region id="left" top="0%" left="0%" width="50%" height="100%" />
  <region id="right" top="0%" left="50%" width="50%" height="100%">
    <region id="inset" top="25%" left="25%"
            width="50%" height="50%" />
  </region>
</topLayout>

The resulting layout

Figure 3.4 Use of topLayout and nested regions [W3C-SMIL2: Chapter 5.9]

3.5 Temporal description

The layout section of a SMIL document specifies regions that describe where
graphical objects can be shown. This is defined inside the head element. In the body
of the document, the rendering of media objects is specified using these regions and a
number of synchronization elements and attributes. The two basic timing elements

are seq and par. Both of these function as containers for media objects or other
timing elements. SMIL temporal description is very complicated (about a hundred
pages in the specification), and this section only covers the basic cases.

3.5.1 Sequential and parallel containers

The objects inside a seq element are played sequentially in the order that they are
specified. Each of the objects is rendered in turn, and the rendering of an object starts
when the rendering of the previous one has ended. An example of the timeline of a seq
element is shown in Figure 3.5. It also shows how rendering of a media object can be
limited using the dur (duration) attribute and delayed with the begin attribute.
These will be discussed in more detail later in this section. Also note that the audio
elements have no region attribute, unlike the img and text elements in the example in Figure
3.1, because audio objects have no graphical representation.

<seq>
  <audio id="sound1" src="FourSecSample.amr" />
  <audio id="sound2" dur="3s" src="ReallyLongSample.amr" />
  <audio id="sound3" begin="1s" src="ThreeSecSample.amr" />
</seq>

Figure 3.5 An example seq element and its timing

The par element was already introduced in the first SMIL example, where its
function was to allow the image and the text to be rendered simultaneously. All the
elements inside a par element will start rendering at the same time, if not delayed by
some special attributes. These two basic containers can be freely nested to create
timing schemes that are more advanced. Figure 3.6 illustrates this behavior: sound
one and the sequential container start at the same time, sound five starts after one second,
and sounds three and four play sequentially after sound two. The body-tags have
been omitted from the rest of the examples.

<par>
  <audio id="sound1" src="FourSecSample.amr" />
  <seq>
    <audio id="sound2" dur="3s" src="ReallyLongSample.amr" />
    <audio id="sound3" src="ThreeSecSample.amr" />
    <audio id="sound4" src="FourSecSample.amr" />
  </seq>
  <audio id="sound5" begin="1s" src="FiveSecSample.amr" />
</par>

Figure 3.6 Nested seq and par elements and their timing

3.5.2 Duration attribute

The dur attribute was introduced in the previous examples. It is used to specify how
long a media object is active. If it is not specified, the media’s implicit duration will be
used. This is zero for images, so if an image is a single element of a seq
container, it will not be shown unless a duration is specified. The dur attribute can also
be defined for containers, and the children of a container always end when the
duration of their parent ends. In that case, an image will be shown for the whole
duration of the parent container.
If the duration of a video object is set to be longer than its implicit duration, its
audio will end normally and the last frame will be shown for the rest of the specified
duration. If the duration of an audio object is extended, it will be silent after its
implicit end, but active in the sense of element timing.
An example of these features is shown in Figure 3.7. The duration of the parallel
container is set to five seconds, which limits the duration of all of the media objects.
Image two is ignored because its implicit duration is zero. The active duration for
sound one is six seconds, even though the sample is only three seconds long,
resulting in that sound two is never played. The video element has an implicit
duration of three seconds, but this is extended to six seconds by its specified duration
and finally limited to five by the container’s duration.

<par dur="5s">
  <img id="image1" src="RedSquare.jpg" region="LeftImage"/>
  <seq>
    <img id="image2" src="lapland.jpg" region="RightImage"/>
    <audio id="sound1" dur="6s" src="ThreeSecSample.amr" />
    <audio id="sound2" src="AnotherSample.amr" />
  </seq>
  <video id="video1" dur="6s" src="ThreeSec.3gp" region="Video"/>
</par>

Figure 3.7 Duration attribute example

Durations are defined as SMIL clock values, including a number and a qualifier (h,
min, s or ms). The number may also include fractions, or define hours, minutes,
seconds, and milliseconds. If no qualifier is given, it defaults to seconds. Some
examples are given in Table 3.2. The value "indefinite" is used when the duration
should be determined by the elements inside a container or by the elements parallel to
the element in question.
Table 3.2 SMIL clock values

Clock value Value

"12s"           12 seconds
" 5.3 "         5.3 seconds (leading and trailing spaces ignored)
"0.001h"        3.6 seconds (1/1000th of an hour)
"02:30:03.36"   2 hours, 30 minutes, 3 seconds and 360 milliseconds
"01:20"         1 minute and 20 seconds
" 6 s "         Invalid, spaces are not allowed inside the definition
"02:100:10"     Invalid, only two characters allowed for minutes

3.5.3 Repeating elements

The playback of an object can be repeated by defining either the repeatCount or

repeatDur attributes. These can be applied to both media objects and containers.
The former attribute defines how many times an object is played, using a
numeric value or "indefinite", which means that the object is repeated until its parent
time container ends. The repeatDur attribute is used to determine how long an
object should be repeated, using a clock value or "indefinite". The object will
repeat as many times as is needed to fill this duration. Figure 3.8 illustrates these

behaviors. The sequence will be repeated until stopped by the user. Note also the
dur attributes of the first two elements.

<seq repeatCount="indefinite">
  <audio dur="2.5s" src="TwoSecSample.amr" repeatCount="2" ... />
  <video dur="2s" src="ThreeSec.3gp" repeatDur="3s" ... />
  <audio src="TwoSecSample.amr" repeatCount="2.5" ... />
</seq>

Figure 3.8 Repeating playback example

3.5.4 Begin and end attributes

As shown in the previous examples, the begin attribute can be used to delay the
rendering of an object relative to the start time of its parent (parallel container) or the
ending time of the previous object (sequence). This can be extended by using
identifiers to relate the time to the beginning or end of an arbitrary object. For
example begin="video1.end+1" will make an object start one second after
video1 has ended. It is also possible to define a list of begin values, separated by
semicolon, for example begin="1s; video1.end+2min; 5min". In this case,
the element will start, or restart if already rendering, at each of these times.
Section 3.5.2 introduced the dur attribute, used for defining durations for objects.
The same effect can be achieved with the end attribute, which defines the ending
time of an object relative to the same reference point as its begin time. In addition, the
relative values can be used. Figure 3.9 shows two code fragments that define similar timing.

<par dur="10s">
  <img id="img1" begin="2s" dur="5s" ... />
  <video begin="img1.begin+3s" dur="4s" ... />
</par>

<par end="10s">
  <img id="img1" begin="2s" end="7s" ... />
  <video begin="img1.begin+3s" end="img1.end+2s" ... />
</par>

Figure 3.9 Use of end and dur tags – these two codes result in similar timing

3.5.5 Event based timing

Nothing introduced so far has really justified the existence of both end and dur
attributes, because everything could have been defined using only one of them. This
is because the timing components so far have been static. To achieve dynamic
timings, events can be used in begin and end attributes.
SMIL Events always occur on a SMIL object, i.e. a media object, container or top-
level window. The format of a reference is familiar from many programming
languages: begin="image1.activateEvent" will make the object begin when
an activateEvent occurs on object image1. If no object name is given, the event
is assumed to happen on the object itself.
Table 3.3 lists the SMIL events specified for SMIL 2.0 Language Profile (more
information about it follows in section 3.7.4). The general SMIL documentation also
mentions the events click, load and repeat (same as repeatEvent). Note that the
events endEvent and beginEvent have in practice2 the same results as the
synchronizing attribute values end and begin introduced in the previous sub-sections.
To delay the beginning or ending of an object, a clock time can be added after the
event, for example "img.inBoundsEvent+1s". The repeatEvent can be
given an integer argument so that the media object responds to a specific repeat time,
not every repeat.

2 The only differences occur when using negative begin or end times, which are not covered in this work.

Table 3.3 Events in SMIL 2.0 Language Profile

Event name Description

activateEvent The media object is activated by the user, e.g. by
clicking on the object
focusInEvent The media object gets the keyboard focus, e.g. is
selected by the user by some means
focusOutEvent The media object loses the keyboard focus
beginEvent The object (could be container) begins playback
endEvent The object ends playback
repeatEvent The object’s playback repeats (due to the repeat attribute,
not due to multiple begin times)
inBoundsEvent The mouse pointer, or some other implementation
specific “cursor”, enters this media element’s area
outOfBoundsEvent Opposite of the inBoundsEvent
topLayoutCloseEvent A top-level window (topLayout) is closed
topLayoutOpenEvent A top-level window (topLayout) is opened

The example in Figure 3.10 shows how easily play and stop-buttons can be
implemented for a video using the activateEvent. The video starts playing when
playImg is clicked, and stops when either the video itself or stopImg is clicked.

<par dur="indefinite">
  <img id="playImg" src="play.png" ... />
  <img id="stopImg" src="stop.png" ... />
  <video begin="playImg.activateEvent"
         end="activateEvent; stopImg.activateEvent" ... />
</par>

Figure 3.10 Play and stop buttons for a video using events

The use of the argument for the repeatEvent is demonstrated by the example in
Figure 3.11. The image is shown at each repeat of the video, but the audio is played
only one second after the video begins for the third time.

<video id="vid" src="ThreeSec.3gp" repeatCount="4" .../>
<img dur="2s" src="repeated.png" begin="vid.repeatEvent" ... />
<audio src="LastTime.amr" begin="vid.repeatEvent(3)+1s" .../>

Figure 3.11 Using repeatEvent with an argument

3.5.6 Other timing related elements and attributes

If better control of element timing is needed when using events, the attributes min
and max can be used. They set the lower and upper bound for an element’s duration.
These override the value of the end and dur attributes. In the example in Figure
3.12, the video is viewed at least once, and at most three times. If the user clicks on
the stop image before ten seconds, the video will be played once and then stopped.

<img id="stop" ... />
<video src="TenSeconds.3gp" end="stop.activateEvent" min="10s"
max="30s" repeatCount="indefinite" ... />

Figure 3.12 Example of min and max attributes

SMIL also includes a third time container that was not introduced with seq and
par, because it is only useful with event timing. This container is excl, and it is
otherwise like a parallel container, but allows only one of its children to play at any
given time. This is useful when there are many media objects from which the user
should select one that is played. The example in Figure 3.13 defines a video for
which the user can select the audio of the preferred language. When the user selects a
language, the new audio will replace any previous selection. The audio definitions
are defined inside parallel containers in order to preserve the synchronization with
the video – the audios all begin in sync with the video, even though they are not
active at that time.

<par>
  <video id="vid1" .../>
  <excl>
    <par begin="englishBtn.activateEvent">
      <audio begin="vid1.begin" src="english.au" />
    </par>
    <par begin="frenchBtn.activateEvent">
      <audio begin="vid1.begin" src="french.au" />
    </par>
    <par begin="swahiliBtn.activateEvent">
      <audio begin="vid1.begin" src="swahili.au" />
    </par>
  </excl>
</par>

Figure 3.13 Example of an excl container [W3C-SMIL2: Chapter 10.3.2]

The clipBegin and clipEnd attributes can be used to limit the playable part of
a continuous media object. For example clipBegin="2s" clipEnd="5s"
would make an audio object play only the part between the offsets of two and five seconds.
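As a short sketch (the file name is invented), such a clip definition would play a three-second excerpt of a longer recording:

<audio src="LongSample.amr" clipBegin="2s" clipEnd="5s" />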

3.6 Other SMIL features

The two previous sections introduced the spatial and temporal descriptions of SMIL.
This section will show how actual media objects and links are defined, as well as
some special tricks and goodies.

3.6.1 Media object definitions

The inclusion of media objects into a SMIL presentation has already been introduced
in some of the previous examples. The syntax follows closely the syntax of an img
tag in HTML. The seven media object elements in SMIL are listed in Table 3.4.
They all have similar syntax, and any type of media can be defined with any of these;
the different elements exist only to improve readability. The generic ref element
is used when the category of a media object is unclear.

Table 3.4 SMIL media object elements [W3C-SMIL2: Chapter 7.3.1]

Media object Description

ref Generic media reference
animation Animated vector graphic or other animation format
audio Audio clip
img Still image
text Text reference
textstream Streaming text
video Video clip

The most important attribute of these elements is the source attribute, src. It defines
object’s Uniform Resource Identifier (URI), which is used by the SMIL viewer to
fetch the content. The URI could be an HTTP address (like
"http://www.abo.fi/image.jpg") or, in the case of a multimedia message, the random
content id given to a file (like "cid:rBWJlxq1YW"). Some SMIL viewers also
support the data URL scheme [RFC-2397]. It allows the insertion of small data
directly to the URL, and can be handy for embedding small texts or very small
images inside a SMIL document, for example "data:,A%20brief%20note".
Note the coding of the two spaces (%20).
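As a small illustrative sketch (the region name is invented), a short text could thus be embedded directly in the SMIL document without a separate text file:

<text src="data:,A%20brief%20note" region="Text" dur="3s" />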
As mentioned earlier, the region attribute is used for graphical objects to define
the region in which the object is rendered, and the id attribute is used to give any
element in SMIL a specific identifier. The alt attribute should be used to define an
alternative text for an object, and a URI to a longer description should be defined
with longdesc.
There is also an additional brush ‘media object’, which can be used to paint a solid
color in a region. This color is defined by the color attribute. A brush does not have the
src attribute, and it is not interchangeable with the other media object elements.
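A minimal sketch (the region name is invented) that fills a region with solid red for two seconds could look like this:

<brush color="red" region="Highlight" dur="2s" />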

3.6.2 Linking

A SMIL document may contain links to other SMIL presentations, to a specified time
in a SMIL presentation, or to external files like HTML documents. As with many SMIL
elements, the syntax is closely related to the syntax of similar features in HTML. The
basic linking element is a, which contains the elements that can be used to open the
link. The element a always has an attribute href, which defines the destination of
the link. The hash separator (#) is used to define a specific element in a SMIL
presentation; the destination presentation will be played from the beginning of the
element with the given id. The link may be opened to a region of the current
presentation, an HTML frame or a new SMIL or browser window. This is specified
with the target attribute. In the example in Figure 3.14 the first link would restart
the presentation from the beginning of video1. The second link could be handled by
opening a web browser with the given address. The third link would open the
presentation another.smil to the region region_A. The last link would open
third.smil and start playing it from the beginning of the element video2.

<video id="video1" ... />
<a href="#video1">
  <img src="jump_to_first.png" ... />
</a>
<a href="http://www.abo.fi/help.html" target="new"> ... </a>
<a href="another.smil" target="region_A"> ... </a>
<a href="third.smil#video2"> ... </a>

Figure 3.14 Linking example

The element area can be used to associate a link to only a part of a visual media
object. This element is a child of the media object element, and it can contain the

same attributes as an a element, with some additions. These include coords, which
defines the coordinates of the area, begin and end, which specify the time when the
link is active, and shape, which defines a shape other than the rectangular default.

3.6.3 Content control

One valuable SMIL feature, especially for mobile devices, is content control. It
can be used to select among a number of layouts or media objects based on system
properties. It is based on the switch element, which allows only one of its child
elements to be chosen: the first acceptable one. It can be used anywhere in a
SMIL document. There are twelve defined test attributes, for example
systemBitrate, systemLanguage and systemScreenSize. For a numeric
attribute, any value bigger than the limit will be accepted. Figure 3.15 shows two
examples of switch: The first one selects an appropriate audio file based on preferred
language and the second one selects the layout based on screen size. Note that this is
the only case when multiple layout elements are allowed in a SMIL document.
Another content control related element is prefetch, which suggests to the player
that the file specified by its src attribute should be fetched, usually from a server,
before it is needed for display. When the file is rendered, the data will be directly
available. This element gives the author the ability to control content download so
that the presentation will be shown smoothly.
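As a sketch (the URL is invented), a large video that begins ten seconds into the presentation could be prefetched at the start so that it is already available when needed:

<prefetch src="http://www.example.com/bigvideo.3gp" />
...
<video src="http://www.example.com/bigvideo.3gp" begin="10s" ... />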

<par>
  <video src="video.mpg" .../>
  <switch>
    <audio src="finnish.au" systemLanguage="fi"/>
    <audio src="dutch.au" systemLanguage="nl"/>
    <!-- English as the default, no tests -->
    <audio src="english.au" />
  </switch>
</par>

<switch>
  <layout systemScreenSize="1024X1280">
    ... define a big, complicated layout ...
  </layout>
  <layout systemScreenSize="480X640">
    ... define a smaller layout ...
  </layout>
  <layout>
    ... define a small and simple layout ...
  </layout>
</switch>
Figure 3.15 Selecting content and layout with the switch element

3.6.4 Transitions

To easily allow enhanced slide shows, SMIL supports transition effects at the
beginning and end of media objects. Transitions are defined in the head section of
a SMIL document, and then applied to media objects using the transIn and
transOut attributes. There are over a hundred different transitions supported by
SMIL, categorized by types and subtypes, but only four of the transitions are
mandatory. These are the default subtypes of barWipe, irisWipe, clockWipe
and snakeWipe. Figure 3.16 illustrates the use and timing of irisWipe. Note the
fill attribute, which makes the first image stay visible until the transition has
finished. Without it, the first image would disappear when the second one begins.

<transition id="iris1s" type="irisWipe" subtype="rectangle"
            dur="1s" />
<img id="redSquare" transIn="iris1s" dur="4s" fill="transition" ... />
<img id="lapland" transIn="iris1s" dur="4s" ... />

Figure 3.16 irisWipe transition with subtype rectangle

3.6.5 Metadata

There are two SMIL elements for defining data about the document or its parts, also
known as metadata. These elements are meta and metadata and they are always in
the head part of a SMIL document. The meta element is a simple element that
defines data about the whole document using the attributes name and content. This
could be for example that “Publisher” (name) is “W3C” (content). The metadata
element acts as the root element for a Resource Description Framework tree, which
may include data about any of the elements in the SMIL document.
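A head section using meta elements could, for example (the second name/content pair is invented), look like this:

<head>
  <meta name="Publisher" content="W3C" />
  <meta name="Creation date" content="2004-08-01" />
</head>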

3.6.6 Animation

The SMIL animation modules define a complex set of functionality and cover about
fifty pages of the documentation. The animation elements allow the manipulation of
virtually any attributes of a SMIL element as a function of time, and the same syntax
is used to define animations in Scalable Vector Graphics (SVG).

3.6.7 Time manipulation

The time manipulations allow control of speed or rate of time for a SMIL element.
They can be applied to both time containers and single media objects. They could be
used to make a part of a presentation be played with double speed, or to be played
first forwards and after that backwards. The speed of time can also be defined to
accelerate and decelerate. It should be noted that some media formats might not
support playing backwards or different playback rates.
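As a hedged sketch (file names invented; a player must support the TimeManipulations module for these attributes), the speed attribute could make a container play at double speed, and autoReverse could make a clip play first forwards and then backwards:

<seq speed="2">
  <video src="clip1.3gp" />
  <video src="clip2.3gp" autoReverse="true" />
</seq>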

3.7 Actual SMIL profiles

The previous sections presented the fundamental features of the full SMIL language,
which is far from suitable for mobile or other resource limited devices. The profiling
mechanism introduced in Section 3.3 supports the definition of sub-sets of the full
language. When designing SMIL, W3C also defined two SMIL profiles: the SMIL 2.0
Language Profile and the SMIL 2.0 Basic Profile. Phone and network manufacturers
have defined a greatly limited set to be used in MMS, the so-called MMS SMIL, and
3GPP has since defined the much richer 3GPP PSS SMIL Profile for the same
purpose. These four profiles will be introduced in this section. Their relations are
shown in Figure 3.17.

Figure 3.17 Relations between the SMIL profiles presented in this section

3.7.1 MMS SMIL

MMS SMIL was introduced in the MMS Conformance Document, created by Nokia
and Ericsson in 2001. The specification has since been revised by the original
authors together with six other mobile industry companies, and was transferred to the
Open Mobile Alliance (OMA) in 2002 [OMA-MMSCon].
Even though the MMS Conformance Document has been updated several times,
the SMIL definition in it has not changed significantly. MMS SMIL defines a
presentation as a collection of slides, which all have the same layout, as in Figure
3.18. Each of these slides is presented by a par element in the SMIL, containing at
most one image, one text, and one audio element. The size of the layout is set using
the root-layout element. Each slide defines its duration in the dur attribute of
the par element.
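A minimal two-slide presentation following this structure could look like the sketch below; the region names and media file names follow common practice but are not mandated:

```xml
<smil>
  <head>
    <layout>
      <root-layout width="160" height="120"/>
      <region id="Image" top="0" left="0" width="160" height="80"/>
      <region id="Text" top="80" left="0" width="160" height="40"/>
    </layout>
  </head>
  <body>
    <!-- Slide 1: at most one image, one text and one audio, shown for 5 seconds -->
    <par dur="5s">
      <img src="photo1.jpg" region="Image"/>
      <text src="text1.txt" region="Text"/>
      <audio src="sound1.amr"/>
    </par>
    <!-- Slide 2: uses the same layout -->
    <par dur="5s">
      <img src="photo2.jpg" region="Image"/>
      <text src="text2.txt" region="Text"/>
    </par>
  </body>
</smil>
```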
This limited functionality allows easy implementation of an MMS viewer for a
resource-constrained device, as there is no interactivity (except the possibility for the
user to go to next or previous slide) and the timing is simple. One important feature
is also that adapting the original layout to fit the display of the receiving device is
simple because of the simple layout. These presentations are still viewable on a
viewer supporting a richer set of SMIL. MMS SMIL is not SMIL Host Language
Conforming, as defined in [W3C-SMIL2: Chapter 2.4.1].

Figure 3.18 An MMS SMIL presentation [OMA-MMSCon]

3.7.2 SMIL 2.0 Basic

The smallest set of modules that are required for SMIL 2.0 Host Language
Conformance is the set defined by SMIL 2.0 Basic Profile. It was designed by W3C
as a language for resource constrained devices, but was too rich for the first MMS
implementations, which led to the creation of MMS SMIL. On the other hand, 3GPP
– with future MMS implementations in mind – considered the features of SMIL 2.0
Basic too limited, and defined a richer profile that is presented in the next section.
This leaves SMIL 2.0 Basic outside the scope of this work. Many PDA and PC SMIL
players support it, some as the intermediate stage before full SMIL 2.0 Language
Profile support.

3.7.3 3GPP PSS SMIL Profile

Introduced as a part of 3GPP’s Packet-switched Streaming Service (PSS)
specification [3GPP26.234], the 3GPP PSS SMIL Profile - or just 3GPP SMIL - was
first designed for scene description in streaming services. It was later defined as the
mandatory scene description language in the MMS specifications by 3GPP
[3GPP26.140]. This section will focus on 3GPP SMIL release 5; the one defined in
release 4 does not include the BasicTransitions module.

The 17 modules included in 3GPP SMIL are listed in Table 3.5. Even though these
numerically cover only about one third of the full set of modules, all the basic
features are included. From the features presented in the previous sections only the
following are not included in 3GPP SMIL: Multi-window layouts, nested regions,
exclusive (excl) container, animations (on SMIL attributes, animations as media
objects are included), brush media object and time manipulations.
This means that supporting 3GPP SMIL in a SMIL viewer is a lot more complex
than supporting MMS SMIL. Whereas MMS SMIL is quite static in nature, having
only a number of slides that advance in a timed fashion, 3GPP SMIL is dynamic and
much more complex because of the nested timing elements, events, links to external
resources, and links to the beginning of specific elements in the presentation. This
will be discussed in detail in Chapter 6.
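To illustrate why the timing becomes dynamic, consider the following 3GPP SMIL sketch, in which the start times of two elements depend on a user event and thus cannot be resolved when playback begins (the identifiers and file names are invented for the example):

```xml
<par>
  <img id="cover" src="cover.gif" region="Image" begin="0s"/>
  <!-- Starts only when the user activates the cover image -->
  <video src="clip.3gp" region="Image" begin="cover.activateEvent"/>
  <!-- Starts one second after the cover image is activated -->
  <audio src="music.amr" begin="cover.activateEvent+1s"/>
</par>
```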

Table 3.5 SMIL modules in 3GPP PSS SMIL (release 5)

SMIL Module Contents

BasicContentControl switch element
SkipContentControl Allows compatibility between different SMIL versions
PrefetchControl prefetch element
BasicLayout Basic regions (no hierarchical regions)
BasicLinking a and area elements
LinkingAttributes A lot of attributes for the above elements, like target
BasicMedia Basic media elements
MediaClipping clipBegin and clipEnd attributes
MediaAccessibility alt, longDesc and other accessibility attributes for media elements
MediaDescription author, title and other descriptive attributes for media elements
Metainformation Metadata about a SMIL presentation
Structure smil, head and body elements
BasicInlineTiming begin, end and dur attributes for media
MinMaxTiming min and max attributes for media
BasicTimeContainers par and seq elements
RepeatTiming repeatDur and repeatCount attributes for media
EventTiming Support for events in begin and end attributes
BasicTransitions Transitions between two visual media objects

3.7.4 SMIL 2.0 Language Profile

Only six of all of the SMIL modules are not included in the SMIL 2.0 Language
Profile, among those the Time Manipulations module. The profile was created by
W3C for Web clients that support SMIL. Currently, there are no players with support
for all of the features, but many PC players come close. This profile is inappropriate
for small, resource-constrained mobile devices, and will probably be used only in the
PC world. The older SMIL 1.0 is a subset of the SMIL 2.0 Language Profile,
meaning that SMIL 2.0 players will also be able to play the old SMIL 1.0 files.

This chapter introduces the variety in MMS features caused by its ongoing
development. For this I have divided MMS into three different versions: MMS without
scene description, MMS according to the OMA’s MMS Conformance Document
(‘OMA MMS’) and MMS as specified by the specifications of 3GPP (‘3GPP
MMS’). Focus will be on describing in which ways 3GPP MMS is more complex
than OMA MMS, because OMA MMS is roughly the current implementation level
in Series 40, and 3GPP MMS is the object of this work. The relationship between all
of these versions is shown in Figure 4.1.
There lies a possible source of misunderstanding in this naming: MMS as a whole
is specified by both OMA and 3GPP, and most of their MMS documents apply to all
of the versions listed here. The document that separates OMA MMS and 3GPP MMS
is OMA’s MMS Conformance Document. It originates from Nokia and Ericsson but
has later been adopted as part of the OMA MMS documents.
It should also be noted that in this comparison it is crucial to take into account the
versions of the specifications listed – future versions of the OMA Conformance
Document will probably introduce more advanced features. The current OMA
Conformance Document is meant to be an intermediate phase to enable conformance
between phones from different manufacturers even in this early stage of MMS.

Figure 4.1 Relations between the MMS versions introduced in this chapter

4.1 MMS without scene description

Some of the early MMS phones, though not the first one (Ericsson T68i), send
multimedia messages without a scene description. This stage of MMS is not actually
specified in any document, it is merely something that happened to be implemented
in practice. These messages most often include one textual part and additional
images or audios as attachments, like an email message. The encoding of these
messages is done according to the MMS specifications [OMA-MMSEnc], using

MIME as introduced in Section 2.4. The formats for media are not specified, but the
ones listed in MMS Conformance Document (see next section) are widely used.
Many of the first MMS phones ignore the SMIL scene description when viewing a
multimedia message, but still send messages with a valid SMIL part. Such implementations do
not belong to this category.

4.2 OMA MMS


Media formats, media codecs, scene description, and other features of OMA MMS
are defined in MMS Conformance Document v2.0.0 [OMA-MMSCon], a part of
OMA MMS 1.1 specification. As already mentioned in Chapter 3.7.1, Nokia and
Ericsson created the document in order to improve interoperability of first MMS
implementations, because the features specified by 3GPP documents were not all
feasible in the time span that MMS was supposed to be commercially introduced.
The contents of the document have not changed significantly since the first release in
2001.
The media formats for both OMA MMS and 3GPP MMS are listed in Table 4.1
(on the next page). The mandatory scene description of OMA MMS is MMS SMIL,
which was introduced in Chapter 3.7.1. The standard includes the most widely used
image formats, as well as Wireless Bitmap (WBMP) that is mostly used in WAP
pages. It is worth noticing that interoperability is guaranteed only for images with a
maximum resolution of 160x120 pixels. This is because images taken for example
with digital cameras can as unpacked contain several megabytes of data, which
would be inappropriate in a mobile phone with a memory size of a few megabytes, or
even less. AMR, a codec specified by 3GPP for speech, is the only audio-codec
supported. Text is supported without any formatting. For sending phonebook entries
the vCard format must be supported and, in a phone with a calendar, calendar notes
in vCalendar format. The total size of a message is limited to 30 kilobytes.
OMA MMS does not mention MIDI, a format for synthetic audio, or any video
formats, even though such might be useful in modern phones. This is because OMA
MMS was designed to be possible to implement even on the low-end products.
The level of features in OMA MMS is supported by most of the newest phone
models, naturally with the exception of the most price-driven phones without MMS.
However, because of its limitations, many manufacturers have added some features
to OMA MMS. Most phones with a built-in or accessory camera can optionally send
the photos in the full resolution, for example 352x288 or 640x480. Many phones that
support polyphonic ringing tones support receiving of SP-MIDI files. As the newest
phones with cameras can be used to record video clips, such phones can usually send
these clips in multimedia messages. The formats defined for 3GPP MMS are widely
used for these kinds of enhancements.

Table 4.1 Media formats in 3GPP MMS and OMA MMS

                                          3GPP MMS (TS     OMA MMS
                                          26.140 V5.2.0)   Conformance Document
Audio
AMR                                            x                x
MPEG-4 AAC-LC / 48 kHz, mono & stereo          x                -
SP-MIDI, format 0 or 1                         x                -

Video
H.263 Profile 0 level 10                       1                -
H.263 Profile 3 level 10                       o                -
MPEG-4 Visual Simple profile L0                o                -

Personal information
vCalendar 1.0                                  -                2
vCard 2.1                                      -                x

Still images
JPEG, baseline DCT                             x                x
JPEG, progressive DCT                          o                -

Bitmap graphics 3
GIF87a                                         x                x
GIF89a                                         x                x
PNG                                            x                x
WBMP                                           -                x

Vector graphics
SVG Tiny profile                               4                -
SVG Basic profile                              o                -

Text
XHTML Mobile Profile (no images)               x                -
UTF-8                                          x                x
UTF-16                                         x                x
UCS-2 Unicode                                  x                -

Scene description
3GPP PSS SMIL Profile                          x                -
MMS SMIL                                       5                x
XHTML Mobile Profile                           o                -

x = Mandatory
o = Optional

1 = Mandatory for terminals supporting media type video
2 = Mandatory if the phone has a calendar
3 = Interoperability guaranteed for 160x120 pixels
4 = Mandatory for terminals supporting media type "2D vector graphics"
5 = MMS SMIL is a subset of 3GPP SMIL and thus included in support for 3GPP SMIL

The newest version of Nokia Series 40 MMS viewer, which is in use for example
in the model 6220, supports OMA MMS with the following additional features:

• The maximum message size is 100 kilobytes

• There is no specified limit for image resolution
• SP-MIDI format for receiving and sending polyphonic ringing tones
• Video clips can be used (H.263 codec)
• Digital rights management (DRM) is supported (Mobile DRM)

The two additional file formats of the Series 40 player, SP-MIDI and H.263, are
both mandatory in 3GPP MMS. Digital rights management has not been specified for
MMS at all, but has been seen as an important feature because of the rich media
contents in MMS. It allows a file to be encrypted and viewed only with a special key,
so that the user is able to use the file but not copy it – if the content is forwarded it
cannot be viewed because the key is not present. This is useful for example if the
user purchases a ringing tone that should not be copied to anybody else. The DRM
scheme used is Mobile DRM.
Version 1.2 of the OMA MMS specifications, which currently exists only as a
candidate, takes into account the need for interoperability in richer media. The
conformance document divides features into five categories, currently called Text,
Image Basic, Image Rich, Video Basic, and Video Rich. This enables a manufacturer
to easily specify the level of functionality in a phone by referring to these categories.
In addition, version 1.3 is currently being discussed. It will most probably include
3GPP SMIL with some restrictions.

4.3 3GPP MMS

The term ‘3GPP MMS’ is in this thesis used to refer to MMS with the media formats
defined in [3GPP26.140] version 5.2.0 and the scene description in [3GPP26.234]
version 5.5.0. It is worth pointing out that [3GPP26.234] is a part of the 3GPP
streaming specifications, and only the chapter defining 3GPP PSS SMIL is related to
MMS. The specification also defines media formats and other details, but these are
for streaming, not for MMS.
The specifications of 3GPP MMS are of a different type and style than those of
OMA MMS. This is because the latter is defined by a document striving for
conformance, whereas the specifications for 3GPP MMS are general service
specifications. It might be that not all of the features of 3GPP MMS will ever be fully
utilized by commercial products, but it still serves as a good source of features to be
studied, and the future conformance documents will probably require more of these
features. This thesis work will also present the features specified as optional, not just
the smallest set that enables calling a product 3GPP SMIL compliant.
The full listing of the media formats in 3GPP MMS is shown in Table 4.1 (page
32). The mandatory media types are audio, image, and text. The only media formats
that are included in OMA MMS but not in 3GPP MMS are WBMP images and
vCard and vCalendar files. For both video and vector graphics the 3GPP
specification defines mandatory media formats in case the terminal supports the
media type. The format defined for the latter, Scalable Vector Graphics (SVG), is an
XML language for describing two-dimensional vector graphics. It is specified by

W3C, and modularized like SMIL, so that profiles with limited functionality can be
defined. If supported, the profile Tiny is mandatory and Basic optional.
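As a small, hedged example of the media type, an SVG Tiny graphic embedded as a media object in a message might look like this:

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="160" height="120">
  <rect x="10" y="10" width="140" height="100" fill="navy"/>
  <circle cx="80" cy="50" r="25" fill="yellow"/>
  <text x="55" y="100" fill="white">Hello!</text>
</svg>
```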
3GPP MMS specifications do not state anything about the maximum size of a
message, because it is clearly a figure that will change with time with the continuous
improvement of network and hardware technologies.
The mandatory scene description language in 3GPP MMS is 3GPP SMIL;
XHTML Mobile Profile is in addition mentioned as optional. 3GPP SMIL is the
fundamental difference between OMA and 3GPP MMS, as it is much more complex
to support than MMS SMIL. This is caused especially by the following details in
3GPP SMIL:

• The spatial layout can include an arbitrary number of arbitrarily placed
regions, requiring the software to keep track of z-indexes of all of the items
drawn, and then take care of the overlapping parts, or to do a lot of
unnecessary redrawing. Especially partly transparent images or animations on
top of video are tricky to render
• There can be many simultaneous video and audio objects, which can lead to
very processor demanding rendering
• The timing of 3GPP SMIL is very complex, especially because of sporadic
events, which make the whole timing dynamic. Much more advanced data
structures are needed to handle all the timing relations than in MMS SMIL.
Also the ability to link to the beginning of a specified SMIL element
increases complexity of the timing of a presentation
• Besides the impact on timing in case of links within a presentation, the
linking features add the need for a pointer or a cursor that points to the active
element, so that the user can select a link. This is further complicated by the
possibility to associate links with parts of a visual element, not just the whole element
• Transitions, even though the specification states that transitions may be
implemented partly or not at all
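The linking-related complexity can be sketched as follows; the area element associates a link with only the top half of an image, and an href with a fragment identifier jumps to the beginning of the element carrying that id (all names are invented for the example):

```xml
<seq>
  <par id="intro" dur="10s">
    <img src="menu.gif" region="Image">
      <!-- Only the top half of the image is a link; activating it
           moves playback to the element with id="main" -->
      <area shape="rect" coords="0,0,160,60" href="#main"/>
    </img>
    <text src="help.txt" region="Text"/>
  </par>
  <par id="main" dur="20s">
    <video src="clip.3gp" region="Image"/>
  </par>
</seq>
```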

Many of the problems of a mobile device can be easily handled with appropriate
use of SMIL’s content control, using the switch element (introduced in Chapter
3.6.3) to control the behavior of the presentation depending on the properties of the
devices. This is likely to be used in SMIL presentations created by professional
designers, mostly in machine-to-person scenarios, but the MMS generator software
in mobile phones will probably not be able to do this so elegantly, because the set of
possible target devices is not known when the software is written.
It is understood that the whole feature set of 3GPP SMIL is very demanding, and
will not be fully supported on all mobile devices. The 3GPP streaming specification
contains a chapter about SMIL authoring guidelines [3GPP26.234: Annex B]. These
are valid also in the scope of MMS. The major points are:

• The linking of a presentation should not depend on the area element;
valuable links should be made using the a element
• Because the layout may be discarded by the target device if it is unsuitable,
the switch element should be used to define different layouts for the whole
range of targeted devices

• The fit attribute value "scroll" should only be used for text components,
not for image or video
• Scaling of video or even images might not be possible on a constrained
device, "hidden" fit and a suitable size in pixels is therefore recommended
especially for video
• The events inBoundsEvent and outOfBoundsEvent assume that the
terminal has a pointer device for focusing elements, and should thus be used
with care
• XHTML as media element:
o No images should be defined in the XHTML parts, these should be
included in the SMIL document
o Tags that are not in XHTML Basic might not be rendered correctly
(3GPP MMS supports XHTML Mobile Profile, which is a superset of
XHTML Basic)
The first phone supporting 3GPP MMS, Nokia 6600, was released in November
2003. It is built on the Series 60 platform and uses the Symbian operating system.
The MMS viewer of the phone supports the full 3GPP MMS, even though there are
some minor details that are not according to the specifications. Because there are not
yet any other products supporting this set of MMS features, the MMS composer does
not support creating such messages, so they can in practice only be used for
machine-to-person messages if the receiver’s phone is known to be a Nokia 6600. The phone also
supports some additional features, like Mobile DRM.
The next chapter will introduce the Series 40 platform and Chapter 6 will focus on
how the feature set of 3GPP MMS can be implemented, possibly with some
adaptation, on the Series 40 platform. Chapter 7 focuses on user interface design.

This chapter introduces Nokia’s Series 40 mobile phone platform, with emphasis
on the details that are related to viewing SMIL presentations. Platforms with a similar
usage profile from other manufacturers most probably have somewhat similar features.
The Series 40 platform is a size driven, but still feature rich platform. This means
that small size is favored over excessive amounts of features. It is used in many of
the mobile phones Nokia has released lately, for example the products 7210, 6220,
and 6800, shown in Figure 5.1. The simpler, price driven Nokia phones are based on
the Series 30 platform, and the most advanced one-hand operated feature driven
Nokia phones are built on the Series 60 platform, which uses the Symbian operating
system.

Figure 5.1 Examples of Series 40 products: Nokia 7210, 6220 and 6800

There is some variation among the features of the Series 40 phones. For example,
the Nokia 6800 includes a full QWERTY-keyboard that can be revealed by flipping
the face of the phone. This chapter will focus on the features of the newest models,
because such a set of features is definitely available in the phones that 3GPP SMIL
will be implemented to. Some ideas will also be given about how the platform might
be developed in the future and which details are unlikely to change.

5.1 User interface

5.1.1 Display

Most of the Series 40 terminals so far have a display composed of 128x128 square-shaped
pixels, capable of showing 4096 separate colors. The physical size of
this display is about 27 x 27 mm. There are two models with a bigger display of
128x160 pixels – these are in some occasions referred to as Series 45 phones. Two of
the most recently released models have a display with 65536 (64K) colors, and it is
likely that such, and even better, displays will be used in all Series 40 products in the
future.
The Series 40 platform is size driven, which limits the maximum physical size of
the display. The face of the phones is not totally utilized yet, so some improvement
can be done if the size of the other components in the phone can be reduced. It is

likely that display technologies will be developed so that the size of the pixels is
reduced, which will also increase the number of pixels that can be displayed. The
extra pixels gained by reducing pixel size cannot, however, change the fact that there
is a limit to the amount of information the human eye can conveniently interpret from
a given area. Smaller pixels will, at least after some point, only add to the clarity of
the image, not to the amount of information that can be shown simultaneously.

5.1.2 Keypad

There is some variation in the keypads of the Series 40 phones. All of the phones
have the following keys:

• Number keys: 0-9, star and hash

• Scroll up, down, left and right (4-way scroll key)
• 2 softkeys (left and right)
• Send key (dial) and end call key
• Power key

Many of the Series 40 phones have separate volume keys; otherwise, volume is
adjusted with scroll left and right. Some models have a full QWERTY-keyboard that
can be flipped out from under the normal keypad, but these are a minority. The
model 6108 has a separate pen input pad enabling the input of characters using
a stylus. It has been designed especially for inputting Chinese.
One feature that has been introduced to the newer models is that the 4-way scroll
key, which can also be seen as one key with 4 different functions, can also be clicked
inwards. This is often called a 5-way scroll key. It is used as a third softkey, and is
assumed to be present in the next chapter when user interface for the MMS viewer is
discussed. Softkeys are introduced in the next sub-section.

Figure 5.2 shows the keypad of Nokia 6230. It has the standard keys listed above
complemented with volume keys and the 5-way functionality in the scroll key.

Figure 5.2 The keypad of the Nokia 6230

5.1.3 User interface logic

The basic ideas of the Series 40 user interface logic are presented in this sub-section
to give background information to the next chapter that also discusses the user
interface of the 3GPP MMS viewer application.
The idea of the softkeys is that the functionality of these keys changes depending
on the state of the current application. In Series 40 user interface the functionality for
each key is shown in the bottom of the display, right above the particular key. Figure
5.3 exemplifies this with the camera layout for a Series 40 product that has three
softkeys. The left softkey opens the options list, middle softkey captures the image,
and the right softkey exits the camera application.

Figure 5.3 Camera viewfinder layout in a Series 40 product with three softkeys

[NOK-S40UI] describes the logic of the user interface only with two softkeys (a
newer version of the document will probably be released before this thesis is
published), but this can easily be adapted to three softkeys. Basic usage of an
application builds around the softkeys and the scroll keys.
The left soft key is used for positive and forward-going actions, like Select, OK,
Options and Yes. If there are multiple possible actions, they are collected in an
options list, which is accessible via the left softkey. The middle softkey contains the
action that is most important, like Capture in the case of camera viewfinder. The
right softkey is used for negative and back-stepping actions, like Exit, Delete, Back,
and No.
It should be noted that the example layout, camera viewfinder, does not really
work according to this logic in the case of two softkeys, because accessing Capture
via an options list would not be appropriate. Therefore, in the case of two softkeys
the Capture action is in the left softkey, and the options list is not accessible when
viewfinder is active.
The scroll keys are naturally used to move the cursor or focus to the four possible
directions, or to scroll the data visible if it does not fit the display. When entering a
number (i.e. time or date), scroll up and down change the currently focused value,
and scroll left and right move the focus. In a product without volume keys scroll left
and right are used to adjust volume. This might cause problems in an application
where both horizontal scrolling and volume adjusting should be easily accessible,
like the MMS viewer, because the user interface might change significantly between
products that have the volume keys and products that do not.
The other keys of the keypad are not used for basic application usage. The end call
key always functions as a global exit and it should always end the currently active
application. The send key is used for making a call or as a shortcut to activate a
sending operation, like sending the current image taken with the camera. A long
press of the power key always turns the phone off. The number keys, especially star
and hash, are used in many applications to implement shortcuts for an advanced user.
The keys 2, 4, 6, and 8 are also used as optional up, left, right and down controls in
many games.
There are no touch screens in the Series 40 products; the only Nokia with a touch
screen is the 7700, built on the Series 90 platform. Neither is there anything that
would function like a mouse on a workstation. Therefore, the user interface does not
have a concept of a pointer other than the cursor in text editing and the focus in a
selection list or grid. This must be taken into account when designing the user
interface for the MMS viewer.

5.2 Memory and processor

The total memory of a Series 40 product consists of Random Access Memory

(RAM) and a flash memory, which is memory that does not need power to maintain
its contents but is much slower to access, especially write, than RAM. The RAM is
used for volatile, dynamic application data. The flash memory is used to store the
software, data needed by the software, e.g. images, ringing tones and texts, and the
user data, for example additional applications, multimedia messages, and calendar
entries. There are also some smaller cache memories and a small unchangeable
read-only memory. Some products also support extending the user data storage with a
removable Multimedia Card (MMC).

The memory sizes of the Series 40 products vary – the newer phones tend to have
more memory as the capacity of memory chips is constantly growing and the price is
getting lower. In general, memory is still quite an expensive component in a mobile
phone, and a target of optimization because even small saving per unit will mean big
amounts of money when the volume is up to hundreds of millions.
The size of the RAM in the current Series 40 products is between 4 and 8
megabytes, depending on the features present. It is hard to say even for a specific product
how much of this is available for an application, but the phone’s network-related and
other basic functionality will take up much of it. Something between a few hundred
kilobytes in the older products and a couple of megabytes in the newer ones is a fair
estimation. Compared to a PC this is not much, but the user interface and the
applications are also much simpler, and more optimized for the specific purpose.
There is not any memory to waste for the MMS viewer, and for example huge
images (millions of pixels) and hundreds of slides might cause problems with
memory, but there should be no problems with somewhat normal multimedia messages.
The size of the flash memory varies much more than that of RAM, but most of the
current Series 40 products have 16 megabytes of it. The software and the data needed
by it take most of this, and about one megabyte is left for user data. There are
currently two models that have exceptional user data capacities: the 7600, which has
about 29 megabytes of flash memory available, and the 6230, which supports MMCs
with at least 32 megabytes for user data. Both of them support playback of MP3 and
AAC audio files, which require a lot of space.
There are two processors in the Series 40 terminals, one Microprocessor Control
Unit (MCU) and one Digital Signal Processor (DSP), a processor specially designed
for high-performance, repetitive, numerically intensive tasks. The former is mostly
used for executing standard application logic and the latter is used for audio and
video processing. The speeds of the MCUs used are about 50-100 MHz, whereas the
DSPs run at approximately 100-200 MHz. If the terminal is not doing anything else –
e.g. a phone or data call – then most of the processing power should be available to
the MMS viewer, even though some is naturally used by the basic network
functionality. In practice, these figures mean that, for example, video can be decoded
in real-time, but resizing of video or many videos and audios that should be decoded
simultaneously will cause problems. Processor speed is fortunately a figure that will
definitely improve a lot in the course of time.

5.3 Software

The Series 40 platform is based on the proprietary Nokia Operating System, which
has been developed by Nokia especially for mobile phones. The general architecture
of the system is based on clients and servers: Each of the phone’s resources is
controlled by one or more servers, and these servers provide services to clients. A
client could be a user application or a server needing access to some other resources
than the ones it controls. A resource could be for example the SIM-card,
loudspeaker, or microphone.
Besides such low level services, the software platform has servers and other
software components that provide services like image viewing, video rendering, and
audio file playback. This means that the MMS viewer does not have to implement
such things, but merely controls the playback of media elements using the

appropriate components. This also means that if support for some media type is
implemented, it is available for all the applications in a terminal, not just the MMS
viewer.

This chapter will discuss the problems that arise when multimedia messages
according to the 3GPP standards are played on a mobile terminal. The Series 40
platform is used as reference for specific details, but the same kinds of limitations
apply also to other platforms. The limitations are discussed, and solutions to the
problems arising from them are then presented.
Even though the task of implementing a 3GPP MMS viewer is far from trivial, there
are only a few very troublesome features. The hardest problems are caused by the
need for interoperability between all the devices supporting SMIL. When composing
a multimedia message on a Series 40 terminal the features of the presentation will
naturally be limited by the limitations of the viewer, and viewing such messages will
not cause any problems. In addition, most other mobile phones have quite similar
limitations to those of the Series 40 terminals, and messages from those will not be
problematic, even though somewhat bigger displays are common in high-end products
and presentations optimized for such may cause problems. The biggest issue is SMIL
presentations created for a PC. The display size, runtime memory, and processing
power available on the original target device of such presentations are enormous
compared to those of a Series 40 terminal.
It is possible for the Multimedia Messaging Service Center (MMSC) to do content
adaptation according to the destination device’s properties. This means that viewing
messages sent via an MMSC with this feature is an easy task. Unfortunately, this
adaptation is optional and the phone software must be able to work even without it in
order for the phone to be able to function in all possible network environments.
If there is no way to render a multimedia message according to the scene
description the phone can, as a last resort, show the user a list of the files included in
the message, with the possibility to open files separately. There should be options to
save the files to the phone’s memory or to view them individually with the Media Player,
a standard component in the Series 40 software, or some other appropriate application.
This way the user can at least view the data included in the message, even though
much of the content might be uninteresting when separated from the presentation.
Any unreferenced files in a multimedia message, meaning files that are not referred
to by the presentation, should also be handled this way.

6.1 Display size

The display size of a Series 40 phone is smaller than that of many devices that the
terminals share content with; the width of the display compared to a PC screen might
be one tenth, or even less. It is likely that in the future the Series 40 displays will be a
bit bigger, both physically and in pixels per inch. This will improve the situation, but
the problem of viewing content designed for bigger screens will always exist.
There are many studies about viewing web content on a mobile device, discussing
both content and user interface adaptation. Regrettably, the methods used for HTML
are not appropriate for SMIL content, because an HTML page is mostly static and
consists of media and links laid out on one level, whereas a SMIL presentation has
temporal behavior and the content is on many visual levels.

6.1.1 A presentation that is bigger than the display

If a presentation is bigger than the display size of the rendering device, the MMS
viewer application has the following four options for handling the rendering:

1. Just show the presentation in its original size so that some of the content is
not shown
2. Resize the presentation to fit the display
3. Show the presentation in its original size, but with scrollbars so that the
user can view the whole area
4. Use the last resort explained earlier, i.e. just show the content files

These approaches are not mutually exclusive. The last one should probably
always be available as an option, even if the presentation is not bigger than
the display.
Using the first approach is appropriate only if the presentation is slightly bigger
than the display, maybe up to ten pixels, so that only some pixels near the border are
missed. In case of a bigger presentation, the user might miss much of the content.
Figure 6.1 exemplifies methods two and three. It shows a presentation designed to
be viewed in full screen on a Series 60 phone, and how it looks when scrolling versus
resizing is used to fit it on the Series 40 display. The softkey texts and the header in
the rightmost example will be discussed later on.

Figure 6.1 Viewing a presentation designed for a bigger screen (the screens are
printed approximately in their physical size)

Resizing the presentation to fit the display is in many cases a good solution,
especially because it is so seamless to the user. Nevertheless, if the presentation
includes visual details that are crucial to the content, e.g. small text in images,
such details might become illegible after the resizing. This can be seen in Figure 6.1:
the texts in the last two cases are mostly unreadable, even though the original
presentation has only been resized by factors of 0.62 and 0.46. A presentation for
a PC might be, for example, 800 by 600 pixels, and would completely lose details like
small texts if resized to fit the Series 40 display. The MMS viewer cannot know the
types of the contents in a presentation, so the user should also be able to select some
other viewing method than resizing: scrolling or viewing the files separately.
In the example, the resizing is done so that the aspect ratio of the original
presentation is preserved. In cases like this, where the aspect ratio of the original
presentation is not extremely different from that of the target size, it could be more
beneficial to fit the presentation to the whole available area in order to utilize the
screen more effectively. Of course, this will not look good if the original
presentation is stretched so much that it totally loses its form, so the aspect ratio
should not be changed too much.
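For illustration, the aspect-ratio-preserving scaling described above can be sketched as follows. The 176x208 source size (a full-screen Series 60 presentation) and the 128x128 and 128x96 target areas are assumptions inferred from the example factors 0.62 and 0.46, not values stated in the text:

```python
def fit_scale(src_w, src_h, dst_w, dst_h):
    """Uniform scale factor that fits the source presentation inside
    the target area while preserving the aspect ratio."""
    return min(dst_w / src_w, dst_h / src_h)

def fitted_size(src_w, src_h, dst_w, dst_h):
    """Pixel size of the presentation after fit-to-screen resizing."""
    s = fit_scale(src_w, src_h, dst_w, dst_h)
    return round(src_w * s), round(src_h * s)
```

With the assumed sizes, `fit_scale(176, 208, 128, 128)` gives approximately 0.62 and `fit_scale(176, 208, 128, 96)` approximately 0.46, matching the factors mentioned above.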
When the presentation is shown with scrollbars, the content can be shown in the
size originally intended and no details are missed. Even this approach has some
drawbacks: Some timed visual parts might be easily missed and the user interface
becomes more complex. There are also limits to how big the scrollable area can be
before it becomes unusable – presentations bigger than, for example, four by four
display sizes should probably be reduced to that limit in order to maintain usability.
When scrolling is used, the user views only a part of the whole presentation at a
time. This works fine for viewing a static presentation, or one that only responds to
user input. If there are any components with timed behavior, the user may miss
some of the content, because, for example, an animation might be running in a part of
the presentation that is outside the user's viewport.
The possibility to scroll the whole presentation adds one more feature to the user
interface. This is problematic, because the user interface of any application in a
phone should be as simple as possible, and there are already many activities in the
MMS viewer that should be easily accessible to the user. This will be discussed more
in Chapter 7.
Because resizing and scrolling are best suited for different kinds of content, it
might be the best solution to provide the user with both of the methods. It would be
convenient to start the MMS viewer with the whole presentation fitted to screen and
provide zoomed viewing, i.e. natural size with scrollbars, as an optional view.

6.1.2 Optimizing the use of the display

The display size is a limitation for the MMS viewer, and therefore the usage of the
display’s pixels should be optimized. In standard Series 40 applications, the header
uses 14 pixel rows and softkey texts use 18 pixel rows, like in the camera layout in
Figure 5.3 (page 38). This means that the area left for application-specific
content is only 128x96 pixels. The header and the softkey texts could be hidden in
order to use the screen area more effectively.
The header and the softkey texts are visible in all of the current native Series 40
applications, and removing them from the MMS viewer would make it look quite
different. It is questionable whether this is acceptable. From a usability point of view,
the header is not so necessary – it would probably hold content like the timer and
possibly an icon to notify the user if the presentation includes sound. These could be
implemented by drawing them on top of the contents. Removing the header would
increase the usable area width to 110 pixels.

The softkey texts are an essential part of the whole softkey idea – without them the
user would not know how exactly the phone responds to a key press in each
situation. It could still be feasible to hide these texts while playing a SMIL
presentation, if the viewer paused the presentation and showed the softkey texts
whenever any key was pressed. Scrolling, and possibly selecting, media objects should
still be available while a presentation is playing, which complicates the matter.

6.1.3 A presentation that is smaller than the display

It is worth noticing that a presentation might define a root layout smaller than the
display. The first idea for viewing such presentations might be to show them in the
original, intended size. This is not the best solution, because the screen is quite
small and it would be foolish not to utilize it as fully as possible. Therefore,
presentations that are smaller than the display should definitely be resized to fill as
much of the display as possible, probably so that the media elements are also resized.
This will help the user to see as much of the presentation as possible.

6.2 Processing power

The speed of the processors in a Series 40 device can be a limiting factor for the
rendering of audio, video, transitions, and possibly images if a really fast pace is
needed. This sub-section discusses some processor-hungry situations and how they
could be resolved.

6.2.1 Transitions

Transitions demand a large amount of processing power. They can be implemented
by generating a mask – the state of the transition – and copying the appearing
content on top of the old one using that mask. This operation has to be done many
times per second for the transition to be smooth. Therefore, many
simultaneous transitions are too heavy to calculate, and a simple solution is to skip
any new transitions if some given number of them is already active.
As the drawing of the mask and the result consumes most of the processing power,
multiple transitions for small areas might be less demanding to render than one
occupying the full screen. As a result, the optimal way to limit the number of
simultaneous transitions would be to keep track of the amount of processing
power left at each moment and, based on that, decide whether a new transition should
be skipped. This would, however, add unnecessary complexity to the software, and the
gain is probably not worth the trouble.
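The mask-and-copy idea above can be sketched as follows. The frame representation (lists of rows of pixel values) and the cap of two simultaneous transitions are illustrative assumptions, not Series 40 platform values:

```python
MAX_ACTIVE_TRANSITIONS = 2  # assumed cap; the real limit would come from measurements

def should_start(active_transitions):
    """Skip a new transition if too many are already running."""
    return active_transitions < MAX_ACTIVE_TRANSITIONS

def wipe_step(old, new, progress):
    """One frame of a left-to-right wipe. The transition state is a mask
    (here simply a column boundary); pixels of the appearing content are
    copied on top of the old frame wherever the mask is set.
    Frames are lists of rows of pixel values; 0.0 <= progress <= 1.0."""
    h, w = len(old), len(old[0])
    boundary = int(w * progress)
    return [[new[y][x] if x < boundary else old[y][x] for x in range(w)]
            for y in range(h)]
```

Called many times per second with increasing `progress`, this produces the smooth wipe; the per-pixel copy is what makes many simultaneous transitions expensive.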
Rendering transitions is especially heavy for video, because each video frame
drawn should be manipulated according to the transition. The current Series 40
hardware and software architecture, which has been optimized for the speed of
standard video decoding, does not effectively support this. Therefore, transitions for
video will most probably have to be skipped.

6.2.2 Video

Real-time decoding of video is a demanding process, because the codecs use
complex algorithms to reduce file sizes at enormous compression ratios.
The decoding has to be done in real time because, unpacked, even a few seconds of
a small video would take up several hundred kilobytes of space and the
unpacking would take a remarkably long time. A video in a mobile phone usually
includes about 10 to 20 frames per second, so the screen must also be updated at
a rapid pace. Normal viewing of a video requires much of the processing
power on the current terminals and therefore operations like resizing, transitions, z-
indexing, and simultaneous rendering of many instances, are non-trivial.
The current hardware and software architecture may also cause problems. If a
separate Digital Signal Processor is used for decoding the video, which is often the
case, it can operate independently of the Main Control Unit (MCU) and decode the
video directly to the screen. This makes decoding very efficient, but makes the
implementation of z-index and transitions complicated for video, as the MCU should
be able to modify each decoded frame before it is shown. It may also make decoding
many videos simultaneously inefficient, if the architecture has been optimized for
one video instance at a time.
To implement all the fit parameter values for video elements, it should be possible
to resize a video to an arbitrary size. This is not feasible with the current processors,
even though clever design of the decoder software component makes this more
efficient than resizing each of the video frames separately like an image. Hence, the
fit parameter values have to be implemented by cropping the video to achieve the
needed size, so that the center of the video is preserved. This way the intended layout
can be maintained, but parts of the video will be invisible. Figure 6.2 exemplifies this.

Figure 6.2 Replacing video resizing with cropping
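The centre-preserving crop can be sketched as follows. The QCIF (176x144) video size and 128x96 region used in the test are illustrative values only, not dimensions taken from the text:

```python
def center_crop(video_w, video_h, region_w, region_h):
    """Crop rectangle (x, y, w, h) in video coordinates that preserves
    the centre of the frame when the region is smaller than the video.
    If the region is larger in a dimension, the full extent is kept."""
    w = min(video_w, region_w)
    h = min(video_h, region_h)
    return (video_w - w) // 2, (video_h - h) // 2, w, h
```

The decoder would then blit only this rectangle into the region, keeping the intended layout while discarding the border areas of the video.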

Drawing something on top of a video element demands that the elements on top
be redrawn after each frame of video, probably to some buffer and then to the
display. This can be an expensive operation, especially if the element on top of the
video is transparent. For elements on top of video, the overlapping areas might thus
be left invisible in order to keep the processing feasible. There are two types of transparent
images: binary transparent – meaning that selected pixels are fully transparent – and
alpha transparent, meaning that each pixel has a visibility level value. Alpha
transparency is more expensive to render than binary transparency. GIF animations
and vector graphics may also include transparency. Figure 6.3 shows examples of
these features.

Figure 6.3 A transparent element on top of video

Decoding multiple videos simultaneously is not feasible with the current Series 40
hardware, and neither does the current software support it effortlessly. Therefore,
only one video should be active at a time. If the original timing of the presentation is
preserved and simultaneous videos are skipped, the user will miss some of the contents.
If this is to be avoided, the timing of the presentation needs to be redone. A
possible solution to this is the following:

• If a video starts while some other video is active, freeze the active one at
its current frame
• The new video is shown for the duration intended – this time does not
affect the timing of the whole presentation, and other objects should not
change their state unless some interactive timing event is fired
• The frozen video then continues, and the presentation timing continues as usual

This solution works with interactive timing and allows the user to see all of the
contents, but may result in somewhat odd timing. Thus, a choice has to be made
either to respect the original timing or to always let the user see all of the contents.
It should be noted that in many cases the new video should replace the former one,
as that is the timing specified by the SMIL. In such cases, there is naturally no need
for special handling.
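The freeze-and-resume policy listed above can be sketched as a small scheduler. The class and method names are hypothetical and not part of any Series 40 API:

```python
class SingleVideoScheduler:
    """Sketch of the freeze-and-resume policy: only one video is decoded
    at a time. Starting a new video freezes the current one at its last
    frame; when the new one finishes, the frozen video resumes."""

    def __init__(self):
        self.active = None
        self.frozen = []  # stack of videos frozen at their current frame

    def start(self, video):
        if self.active is not None:
            self.frozen.append(self.active)  # freeze, do not stop
        self.active = video

    def finish(self):
        """The active video ends; resume the most recently frozen one."""
        ended = self.active
        self.active = self.frozen.pop() if self.frozen else None
        return ended
```

A stack handles the (rare) case of several nested interruptions; in the common case it simply holds the one background video.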

6.2.3 Audio

Like video files, audio files also have to be decoded in real time, as storing them
unpacked would need great amounts of memory. This means that processing power
can become a limiting factor, especially with high-quality audio like MP3 or AAC
files with a high bit rate. Besides, as is the case with video, the current Series 40
audio architecture has been designed for decoding one signal at a time. This
limitation will most probably be solved in the future, but on the current terminals,
only one audio file can play at a time.

The obvious solution for multiple simultaneous audios is that newer sounds
override older ones. It would be good not to stop the older audios while this happens,
so that after the overriding sample has ended they continue as if there had been
no interruption. This way, if there is for example background music and some short
click that is played when some event occurs, the music will seem to continue
normally even when it is interrupted for a while. The only problem is that the user
will miss some of the contents if a presentation depends on playing many audios
simultaneously, but this is hardly a common use case.
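The override policy can be sketched as follows; the data layout (id and start-time pairs) is an illustrative assumption:

```python
def mixer_state(audios, now):
    """audios: list of (id, start_time) pairs for sounds currently within
    their scheduled duration. Only the most recently started one is
    audible, but every position keeps advancing, so an overridden
    background track resumes exactly where it would have been had there
    been no interruption."""
    positions = {aid: now - start for aid, start in audios}
    audible = max(audios, key=lambda a: a[1])[0] if audios else None
    return audible, positions
```

For background music started at t=0 and a click at t=4, at t=5 only the click is audible, but the music's position is already 5 seconds, so it resumes mid-stream once the click ends.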

6.3 Memory

The amount of run-time memory available does not set any specific limitations for
viewing 3GPP MMS – there is enough memory to handle normal messages sent from
mobile phones. However, 3GPP MMS does not have the same content size and
amount limitations as OMA MMS, and it is therefore possible, with any memory size,
to receive a presentation that will cause the MMS viewer application to run out of
memory. The application should be aware of this and monitor the amount of free
memory. In case of an out of memory error, it should show an error message and let
the user view the content files separately.
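A minimal sketch of this fallback, with hypothetical callback names standing in for the real parsing and file-listing components:

```python
def view_message(message, build_presentation, list_files):
    """Last-resort handling: if building the presentation exhausts
    run-time memory, fall back to showing the plain file list.
    build_presentation and list_files are hypothetical callbacks."""
    try:
        return ("presentation", build_presentation(message))
    except MemoryError:
        # An error message would be shown here before falling back.
        return ("file_list", list_files(message))
```

The same fallback path can serve unreferenced files and presentations that cannot be rendered at all, as described at the beginning of this chapter.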


This chapter shows that it is possible to design a user interface supporting 3GPP
SMIL features in a small mobile terminal. The Series 40 platform is again used as a
reference, but most of today’s mobile terminals have a somewhat similar user
interface. Designing a truly useful and optimized user interface that complies with
the conventions of Series 40 applications is out of the scope of this thesis, as it would
require usability studies and creation of simulation software.
The added complexity in 3GPP MMS compared to OMA MMS has much impact
on the user interface design. There is much more functionality, as there can for
example be many scrollable items on the screen simultaneously.
Additionally, all the functions in a phone should be easily usable and have similar
user interfaces. This requirement is non-trivial for the MMS viewer application, because
the content viewed requires many kinds of interaction. This sub-section proposes two
different user interfaces for the MMS viewer in a Series 40 terminal with three
softkeys and volume keys. The Nokia 6230, shown in Figure 5.2 (on page 38), is an
example of such a device.
When viewing a 3GPP MMS the user should be able to:

1. Select any of the interactive objects (media elements that cause events that
are used by other elements)
2. Select any of the links available
3. Vertically and horizontally scroll any of the visible media objects that have
a scroll bar
4. Pause or stop the presentation
5. Restart the presentation
6. Fast forward or rewind the presentation
7. Set audio volume
8. Mute the audio
9. Select viewing of the content files separately
10. Send (forward) the presentation as MMS
11. Select between resized and scrolled mode
12. Vertically and horizontally scroll the whole presentation

The last two actions are only needed if resizing and scrolling of the whole
presentation are both supported viewing modes and the currently active presentation
does not fit the screen.
The next sub-section will present a user interface that includes all of these actions.
Because it is quite advanced and might be considered too complex, a simplified
version that lacks some functionality is presented afterwards.

7.1 An advanced user interface

The scrolling of the whole presentation is a valuable feature, because it improves
MMS interoperability between the phone and devices with a bigger screen. It is
therefore included in this user interface design.
To be consistent with other Series 40 applications and the user interface style of
the whole system, the user interface of the MMS viewer should be built around using

the softkeys and the four scroll keys. Number keys and the send key should not be
needed for standard operation, but they may be used for shortcuts.
The logic of the whole Series 40 user interface is based on the user always being
able to return to the previous state with the right softkey, mostly labeled
"Back". This should also be implemented in the MMS viewer. Also, as in any
Series 40 application, all the actions that are not needed instantly while viewing a
presentation are accessed via an options list, which is opened with the left softkey.
The actions that can be placed in the options list are 4, 5, 8, 9, 10, and 11, written
in italics in the previous listing. Pausing the presentation should actually happen
always when the options list is opened, so that action does not have to be visible in
the list. Setting the volume is naturally handled with the volume keys. The hash key
could be used as a shortcut to mute audio, like in some current Series 40 applications.
The user interface has no mouse pointer or touch screen, but there should
be means to select an active block, i.e. any element whose activateEvent has
been used by some other element (more in section 3.5.5) or any element or part of an
element that is a link (more in section 3.6.2). Additionally, multiple scrollable
regions can be visible simultaneously, and the user should be able to scroll any of
them. To achieve this, the concept of focus must be introduced – the focus shows
which element will currently be selected, opened, or scrolled, and it
can be moved around the screen. The focus could be visible, for example, as a
rectangle around the active block, as shown in Figure 7.1.

Figure 7.1 Example visualization of the focus and the scroll bar

Because it is the most complex case, the features of the user interface will be
presented assuming the presentation does not fit the screen and scrolling of the whole
area has been enabled. If the presentation fits or is resized to fit the screen, some
details become simpler.
The focus is moved using the four scroll keys so that a key press in any of the four
directions moves the focus to the next active block in that direction. If that block is
not fully visible, the presentation will be scrolled so that it is, if possible, fully
visible. If there are no active blocks visible in the key press direction, the
presentation is scrolled to that direction. This functionality is exemplified in Figure
7.2, in which the balls at the cities in the map are the active blocks, maybe links to
other states of the presentation. The scroll right key is pressed between the screens.

Figure 7.2 Moving the focus and scrolling the whole presentation – the scroll right
key is used
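The directional focus movement can be sketched as follows. The Manhattan-distance tie-breaking is an assumption; the text does not specify how the nearest block in a direction is chosen:

```python
def next_focus(blocks, current, direction):
    """Move the focus to the nearest active block in the given direction.
    blocks maps a block name to its centre (x, y); returns the new focus,
    or None when no block lies in that direction (the viewer would then
    scroll the whole presentation instead)."""
    cx, cy = blocks[current]
    ahead = {"right": lambda x, y: x > cx, "left": lambda x, y: x < cx,
             "down": lambda x, y: y > cy, "up": lambda x, y: y < cy}[direction]
    hits = [(abs(x - cx) + abs(y - cy), name)
            for name, (x, y) in blocks.items()
            if name != current and ahead(x, y)]
    return min(hits)[1] if hits else None
```

A `None` result is the cue to scroll the presentation in the key-press direction, as described above.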

The middle softkey is used for selecting the currently focused element, which
means opening the link or firing the activateEvent of that media element.
So far, the actions have been quite clear, but there is also a need to enable scrolling
of any of the scrollable regions. To achieve this, these elements must also be
focusable. When a scrollable region is focused, like in the second screen shot in
Figure 7.1, the middle softkey can be used to enable region scrolling. This enters
the region-scrolling mode, in which the scroll keys are used to scroll the focused
element. To quit this mode, the user must press the middle softkey again.
One exception has to be covered – selecting an element that is both scrollable and
a link or a source for events. This case is probably rarely relevant in practice. It can
be handled so that the middle softkey enables region scrolling and the link or event
activation is placed in the Options list. This must also somehow be shown to the user,
probably so that the scroll bar for such an element also shows a hint that the element
is also selectable.
Action 6, fast forwarding and rewinding the presentation, is somewhat
troublesome. In the existing OMA MMS viewer the user can skip to the next or
previous slide with scroll up and down. 3GPP SMIL does not include the
concept of a slide, so this is no longer a useful action, and the scroll keys are fully
occupied by other actions. Most PC SMIL players allow the user to scroll the
timeline using the mouse – this way the user can easily jump to any part of a
continuous presentation. A scroll bar visible on the screen could be a solution, so that
it is selectable with the focus like the active areas. It would, however, use up valuable
screen space, and not be that easily selectable when whole-screen scrolling is enabled.
A better solution is to use some keys, for example 4 and 6, for seeking backwards
and forwards in time. A first-timer will not easily notice the option, but it is
conveniently available for an advanced user. This should be acceptable, as the
function is not a vital one. A seek could be for example five seconds, or to the next
element in the currently active sequential container, as exemplified in Figure 7.3 (on
the next page). A good thing about the latter approach, which might seem a bit
illogical with some presentations, is that it would work as changing from slide to
slide when an OMA MMS compliant message is played, because each slide consists
of a par-element, and those elements reside successively in a sequential container.

Figure 7.3 Seeking forwards in time to the next sequential element – the arrows
show the element that is activated with seek forwards at a given time
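The seek-to-next-sequential-element behaviour can be sketched as follows, assuming the resolved begin times of the children of the active seq container are known:

```python
def seek_forward(begin_times, now):
    """Jump to the begin time of the next child in the active sequential
    container; begin_times are the children's start times in seconds.
    Returns now unchanged if there is no later child."""
    later = [t for t in begin_times if t > now]
    return min(later) if later else now
```

For an OMA MMS message, where each slide is a par element inside a seq container, this behaves exactly like the familiar slide-to-slide skipping.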

To deal with the complexity of this user interface, the different actions of the
middle softkey should be clearly visible, and the user should always be aware of the
state of the focus and the scrollbars. The softkey text for the middle softkey could be
shown and changed according to the focused element and whether scrolling mode is
enabled or not. If optimized screen space usage is wanted, the softkey texts should be
hidden. This is achievable if the graphical elements for focus and scroll bars are
easily recognizable, so that they show the user what is going on and what the
current action of the scroll keys and the middle softkey is.
The user interface proposed here enables horizontal and vertical scrolling of both
individual regions and the whole presentation, opening links and selecting active
media elements. It enables viewing of 3GPP MMS messages with even complex
features. A downside is its complexity. A simpler user interface with slightly limited
functionality is presented in the next sub-section.

7.2 Simplifying the user interface

To simplify the user interface, two limitations are added to the functionality: The
whole presentation is not scrollable, meaning that only resized mode is available, and
the regions can only be scrolled vertically. The scrolling is probably mostly used
with text components, and the natural way to render text is to wrap rows so that the
text fits the destination region's width. Thus, the limitation to vertical-only
scrolling is sensible.
What made the previous interface complex was the fact that the middle
softkey and the scroll keys had many different actions depending on what was
focused and whether region-scrolling mode was enabled. The two limitations make the set
of actions smaller and achievable without such complex behavior. The concept of
focus is also somewhat complex, at least to a user that is not familiar with the
functionality of the Series 40 xHTML browser, but it cannot be skipped because
active elements and links are a crucial part of the 3GPP SMIL and there has to be a
way to select these.
The functionality is otherwise similar to the previous interface, but with the
following changes:

• The middle softkey always activates the currently focused media,
i.e. opens the link or fires an event

• The scroll up and scroll down keys are only used to scroll the currently
focused region up and down. If there is only one scrollable region, it does not
need to be focusable at all – it is always scrolled when these keys are pressed
• The scroll left and scroll right keys are always used to select the previous and
next active block, respectively. The concept of next and previous could be
according to the horizontal order of the blocks, or according to some other ordering

With these changes, the user interface logic becomes simpler and easier to
follow. One problem might be the sequencing of the active blocks, especially if there
are plenty of them on the screen simultaneously. Additionally, if the positions of the
blocks differ only in the vertical direction, it is quite illogical that the focus is moved
with the left and right scroll keys, not up and down.
If a presentation depends on horizontally scrolling a region, then this interface will
not be able to deal with it, but this is a minor issue. The biggest problem might be
that showing the presentation in its original size with scroll bars is not enabled – this
might make some presentations designed for a bigger screen unusable. Nevertheless,
as mentioned earlier, there is always the last resort of showing the content files
separately, which partly solves this issue.
As the functionality of the middle softkey is more static in this version, the softkey
texts are not that important. Still, in order to create a simple interface, they should be
visible so that the user knows what the available options are. The text for the middle
softkey might also change between something like “Activate” and “Open link”,
depending on the focused element. The hiding of the softkey texts might be available
via the options list.
As phone user interfaces should be as simple as possible, it might be reasonable to
use this simpler interface. How relevant the limitations are depends on the content
viewed – when showing messages from other Series 40 phones the limitations will be
irrelevant, but SMIL presentations for PCs might cause problems.

7.3 An alternative to media scrolling

Scrolling is a good method on for example PCs, but may not be the best for SMIL
presentations on a Series 40 terminal. This sub-chapter presents another approach
that can be used instead of scrolling.
Even though the SMIL standard has the region property fit="scroll", there
are two clear disadvantages to implementing this as such on a mobile phone with a
small screen. Firstly, especially for text media, the achieved usability is not very
good. Let us use a standard MMS SMIL presentation (see Chapter 3.7.1) as an
example. A slide will most probably have both an image and some text visible on the
screen, so for example only a third of the screen size might be utilizable for viewing
the text. Even with a small font, only a few (three on Series 40) lines of text fit into
the text region. If the text is, for example, a few sentences long, the user will have to scroll
down many screens to read the whole text, and may not see even one full sentence at
a time.
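The arithmetic behind this usability concern can be sketched roughly. The default line width is an assumption; the three visible lines follow the Series 40 figure quoted above:

```python
import math

def screens_of_scrolling(text, chars_per_line=18, lines_visible=3):
    """Rough estimate of how many screenfuls of vertical scrolling a
    text needs in a small region. The 18-characters-per-line default is
    an assumption, not a measured Series 40 value."""
    lines = math.ceil(len(text) / chars_per_line)
    return math.ceil(lines / lines_visible)
```

Even a modest 180-character message already needs four screenfuls under these assumptions, which illustrates why region scrolling is so tedious for text.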
Secondly, as seen in the previous two sub-chapters, scrolling of media adds
complexity to the user interface, as it increases the number of actions the user has to
be able to perform while viewing a presentation.

An alternative to scrolling in presentation mode is to let the user open the media in
full-screen mode, so that the presentation is paused in the background. Figure 7.4
illustrates this behavior. The media may still require scrolling, but the screen area is
utilized much more effectively. Text in particular is much easier to comprehend when
more of it, if not all, is visible.

Figure 7.4 An alternative to scrolling, exemplified with text media

The second advantage is that this approach makes the user interface more
straightforward. In the advanced user interface from Chapter 7.1, the state change
from viewing the presentation to the "scrolling" mode becomes more clearly visible.
Selecting, or opening a link on, a scrollable item could be done with the middle
softkey in the full-screen view mode, or, as previously, via the Options menu when
viewing the presentation.
If this approach is used with the simplified user interface from the previous
sub-chapter, all four scroll keys can be used for moving the focus, because dedicated
scrolling keys are not needed. This is a clear improvement, because using only
left/right for moving the focus is far from intuitive, especially if the focusable
elements are in a vertical sequence. Selecting a scrollable text element becomes
somewhat less usable, as in the standard scroll approach it is done with the middle
softkey while viewing the presentation. That is, however, a much less needed action
than scrolling a media object.
This design can also be used for scrolling images. The only difference to scrolling
text is that images may need to be scrolled also horizontally, which does not cause
any problems because the scrolling happens in a separate user interface state and can
use all of the scroll buttons.

This chapter goes through the most important points of this thesis and discusses the
conclusions drawn from them.
MMS is a strong competitor in the future of mobile messaging. Its capabilities are
currently limited by the simple scene description language profile used, so there is a
clear need for 3GPP SMIL. The advanced features will be most valuable for content
created by professionals on a workstation, used for services and advertisement
provided by companies to end-users. However, also person-to-person content based
on templates can benefit from 3GPP SMIL.
Many features of SMIL make it a particularly good scene description language.
Possibly the most important is its profiling mechanism that makes it easy to create
different kinds of SMIL profiles for different usage. This makes the same language
usable for describing the current, very simple multimedia messages, but also for
describing complex presentations for workstation usage. The content control features
allow content creators to make the content adapt to system limitations. Moreover,
even the somewhat limited 3GPP SMIL is very powerful and has very few
limitations as to what can be described.
SMIL syntax is still quite simple, allowing simple presentations to be authored
using a standard text editor. It should still be noticed that viewing complex 3GPP
presentations on the current low-end mobile phones is not feasible, and thus the
added features will decrease interoperability.
Besides the MMS SMIL profile, the media types and formats also affect the
complexity of a multimedia message viewer application. Fortunately, there are
already standards for audio and video formats for mobile phones, and these formats
are supported by most mobile phones that support advanced media, so this should not
be a problem for MMS.
The Nokia Series 40 platform was used as the target platform for this thesis. The
platform is rich enough for utilizing 3GPP SMIL, but some features will cause
problems. The biggest difficulties will be encountered with presentations that have
been designed for workstation usage, due to the different level of many platform
features, especially screen size. Besides this, video is a problematic feature, and the
following limitations are probably needed for video: nothing can be shown on top of
it (z-index is ignored), transitions are not supported, no simultaneous instances, and
possibly no scaling support. Additional troublesome features are synchronized audio,
rapid transitions, and other rapid events. Nevertheless, the resulting user experience
can be very rich even with these limitations, particularly compared to that of current
MMS.
The physical user interface of a mobile phone is quite simple and limited,
especially compared to that of a PC. It was shown that it is still possible with the
Series 40 user interface to support the 3GPP SMIL features. However, in order to
make the user experience as good as possible, it needs to be evaluated which of the
features presented are really needed and what the priorities for using them are. The
advanced user interface presented is probably too complex for a mass-market
product.
As a whole, 3GPP SMIL is a good next step for MMS. It gives possibilities for
richer user experiences, but is still feasible to implement on at least the Series 40
platform. The current lowest end platforms will presumably not be rich enough to
support all of the needed features.

Some issues may still become problematic, mostly regarding interoperability.
There are phones with different kinds of screen sizes and other features, and PC-to-
phone messaging is increasing as well. 3GPP SMIL does have mechanisms to deal
with this diversity, so that the presentation can, for example, use a simpler layout on
a smaller screen, or leave out some media depending on the platform characteristics.
Much of the burden is on the content creators – the people designing the
presentations or the templates to be used for phone-to-phone messaging.
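One such adaptation mechanism is the content control module of SMIL 2.0. The sketch below illustrates how a switch element could offer a richer layout only to terminals that can guarantee a large enough rendering area; the region names, screen size, and file names are made up for illustration:

```xml
<!-- The player evaluates the children of <switch> in document order
     and plays the first one whose system test attributes match. -->
<switch>
  <!-- Rich alternative for terminals with a 176x208 or larger screen
       (systemScreenSize takes the value "HeightXWidth") -->
  <par systemScreenSize="208X176">
    <video src="clip.3gp" region="Video"/>
    <text src="caption.txt" region="Text"/>
  </par>
  <!-- Fallback: a still image for smaller screens -->
  <img src="still.jpg" region="Image"/>
</par>
</switch>
```

A terminal that cannot satisfy the screen-size test simply skips the first alternative, so the same message degrades gracefully on smaller phones.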
It is also clear that because the content is richer, there will be more use cases, and
thus more diversity in how different phones handle certain special cases. Therefore,
content creators will most probably not be able to use all of the available features in
content targeted at multiple platforms, or at least the content should not depend on
advanced features like SMIL events or showing something on top of video.
Guidelines of this kind will most probably be listed in future MMS documents.

This thesis studies viewing multimedia presentations according to the 3GPP SMIL
standard on a size driven mobile phone. The Nokia Series 40 platform is used as the
target platform.
The thesis shows that 3GPP SMIL is a feasible standard, and can be utilized on a
Series 40 mobile phone. The 3GPP SMIL profile is much richer than the SMIL
profile currently used in the Multimedia Messaging Service (MMS), and can thus
provide a better user experience.
Interoperability will be an issue, especially for messages sent from a PC. The small
screen size of the terminal is the biggest problem, but the smaller processor and
memory capacity may also cause problems. Phone-to-phone messaging should not
be much affected by these matters.
User interface design for a 3GPP SMIL viewer is non-trivial, because there are so
many actions the user should be able to take. The thesis includes a user interface
design that demonstrates, despite its complexity, that a design supporting all of the
features can be made.

[3GPP22.140] 3GPP (June 2003), Technical Specification 22.140 V6.2.0:
Multimedia Messaging Service (MMS); Stage 1
[3GPP23.040] 3GPP (June 2003), Technical Specification 23.040 V6.1.0:
Technical realization of the Short Message Service (SMS)
[3GPP23.140] 3GPP (June 2003), Technical Specification 23.140 V6.2.0:
Multimedia Messaging Service (MMS); Functional
description; Stage 2
[3GPP26.140] 3GPP (December 2002), Technical Specification 26.140
V5.2.0: Multimedia Messaging Service (MMS); Media
formats and codecs
[3GPP26.234] 3GPP (June 2003), Technical Specification 26.234 V5.5.0:
Transparent end-to-end Packet-switched Streaming Service
(PSS); Protocols and codecs
[ITU-E.164] International Telecommunication Union (May 1997), E.164:
The international public telecommunication numbering plan
[Le Bodic] Le Bodic, Gwenaël (December 2002), Mobile Messaging
technologies and services: SMS, EMS and MMS, John Wiley
& Sons Ltd
[NOK-S40UI] Nokia Mobile Phones (January 2003), Nokia Series 40 UI
Style Guide v1.0, available from http://forum.nokia.com
[OMA-MMSArc] OMA (November 2002), Multimedia Messaging Service:
Architecture Overview, Version 1.1 (OMA-WAP-MMS-
[OMA-MMSCTr] OMA (October 2002), Multimedia Messaging Service: Client
Transactions, Version 1.1 (OMA-WAP-MMS-CTR-v1_1-
[OMA-MMSCon] OMA (February 2002), MMS Conformance Document,
Version 2.0.0 (OMA-IOP-MMSCONF-2_0_0-20020206C)
[OMA-MMSEnc] OMA (October 2002), Multimedia Messaging Service:
Encapsulation Protocol, Version 1.1 (OMA-MMS-ENC-
[RFC-2045] Freed, N. and Borenstein, N. (November 1996), Request
for Comments 2045: Multipurpose Internet Mail Extensions:
Part One: Format of Internet Message Bodies
[RFC-2046] Freed, N. and Borenstein, N. (November 1996), Request for
Comments 2046: Multipurpose Internet Mail Extensions: Part
Two: Media Types
[RFC-2047] Moore, K. (November 1996), Request for Comments 2047:
Multipurpose Internet Mail Extensions: Part Three: Message
Header Extensions for Non-ASCII Text
[RFC-2387] Levinson, E. (August 1998), Request for Comments 2387:
The MIME Multipart/Related Content-type, The Internet
Society
[RFC-2397] Masinter, L. (August 1998), Request for Comments 2397:
The "data" URL scheme, The Internet Society
[RFC-2822] Resnick P. (editor) (April 2001), Request for Comments
2822: Internet Message Format, The Internet Society

[SNE-EMS-Dg] Sony Ericsson (August 2003), Enhanced Messaging Service
(EMS): Developers Guidelines, Fourth edition
[W3C-SMIL1] W3C (June 1998), Recommendation, Synchronized
Multimedia Integration Language (SMIL) 1.0,
[W3C-SMIL2] W3C (August 2001), Recommendation, Synchronized
Multimedia Integration Language (SMIL 2.0),
[W3C-XML] W3C (October 2000), Recommendation, Extensible Markup
Language (XML) 1.0 (Second Edition),
[WAP-230WSP] Wireless Application Protocol Forum (July 2001), Wireless
Application Protocol: Wireless Session Protocol
Specification (WAP-230-WSP-20010705-a)


A.1 Introduction

Mobile messaging has become an ever larger business all around the world. The
richness of multimedia messages is limited by the language used for describing the
interaction between and the synchronization of the media. Synchronized Multimedia
Integration Language (SMIL) has become the industry standard for this purpose, and
Third Generation Partnership Program (3GPP) SMIL is a proposal for an upcoming
SMIL version for the Multimedia Messaging Service (MMS).
This thesis studies the use of 3GPP SMIL for describing messages. Nokia Series
40 is used as the target platform.
MMS, SMIL, and the Series 40 platform are described in sufficient detail, after
which possible problems and solutions to them are discussed.

A.2 Multimedia Messaging Service

A.2.1 From SMS to MMS

The Short Message Service (SMS) was introduced commercially in 1992 for GSM.
Nowadays, hundreds of millions of text messages are sent per year, also between
different kinds of networks. A text message can contain only 140 octets of data, i.e.
160 characters with seven-bit encoding. The system is nevertheless used for many
different purposes, e.g. news, e-mail, and weather services.

A.2.2 Characteristics of MMS

As the name says, MMS supports the transfer of true multimedia content. To the
user, the service may appear quite similar to SMS, but the technology behind it is
very different. MMS has been designed not to depend on any particular transport
technology – the same service works in third generation (3G) networks as well as in
standard GSM networks.
A multimedia message can contain several files, combined according to
[RFC-2822]. The files can contain anything, typically different types of media such
as text, images, audio, or video. For the message to form a presentation, it can
contain a file that describes the graphical layout and the temporal dependencies of
the media, a so-called scene description.
Multimedia messages can be sent to e-mail addresses, have several recipients, and
contain priorities. The message size is limited to 100 kilobytes in current systems,
but this will change in the future.
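The multipart structure of such a message can be illustrated with Python's standard e-mail library. This is a sketch only: the actual over-the-air format is the binary encapsulation defined in [OMA-MMSEnc], and the part contents and file names below are made up for illustration.

```python
# Sketch of a multimedia message body as a MIME multipart/related
# entity. NOTE: the real over-the-air format is the binary
# encapsulation of [OMA-MMSEnc]; part contents and file names
# here are made up for illustration.
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# The container holds parts that are related to each other; the
# "type" parameter names the root part, the scene description.
msg = MIMEMultipart("related", type="application/smil")

smil = MIMEApplication(b"<smil>...</smil>", "smil")
smil.add_header("Content-ID", "<scene>")
msg.attach(smil)

# Media parts are referenced from the scene description by name.
text = MIMEText("Greetings from Lapland!")
text.add_header("Content-Location", "lappland.txt")
msg.attach(text)

image = MIMEImage(b"<jpeg bytes elided>", "jpeg")
image.add_header("Content-Location", "lappland.jpg")
msg.attach(image)

print(msg.get_content_type())  # multipart/related
print([part.get_content_type() for part in msg.get_payload()])
```

The same multipart/related structure is what an MMS client reassembles before handing the scene description to the SMIL player.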
MMS is defined by two different organizations, the Open Mobile Alliance (OMA)
and the Third Generation Partnership Project (3GPP). The most important documents
are listed in Table A.1.

Table A.1. The documents defining MMS

By     Document
3GPP   • Multimedia Messaging Service (MMS); Stage 1 (TS 22.140)
       • Multimedia Messaging Service (MMS); Functional description; Stage 2
         (TS 23.140) [3GPP23.140]
OMA    • Multimedia Messaging Service: Architecture Overview [OMA-MMSArc]
       • Multimedia Messaging Service: Client Transactions [OMA-MMSCTr]
       • Multimedia Messaging Service: Encapsulation Protocol [OMA-MMSEnc]


A.3 SMIL

The abbreviation SMIL stands for "Synchronized Multimedia Integration
Language". It is a language based on XML [W3C-XML] that is used for describing
multimedia presentations. The language was created for describing multimedia on
the web, but it has also been adopted for multimedia messages. The newest version,
SMIL 2.0, is defined in [W3C-SMIL2].

A.3.1 Syntax and structure

SMIL is easy to understand for anyone who knows elementary HTML. The language
is briefly presented here with an example that contains the most commonly used
features. Figure A.1 shows the code of a working SMIL presentation, and the
following paragraphs describe the different tags and their functions.

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <layout>
      <root-layout width="170" height="208" />
      <region id="Text" width="100%" height="25%" left="0%"
              top="75%" fit="scroll" />
      <region id="Image" width="100%" height="75%" left="0%"
              top="0%" fit="slice" />
    </layout>
  </head>
  <body>
    <seq>
      <audio src="fem_sekunder.mid"/>
      <par dur="5s">
        <img src="lappland.jpg" region="Image" dur="4s"/>
        <text src="lappland.txt" region="Image" begin="1s"/>
        <text src="slutet.txt" region="Text" dur="3s"/>
      </par>
    </seq>
  </body>
</smil>

Figure A.1 A simple SMIL document


A SMIL document always begins with the tag smil, which can contain the
attribute xmlns for defining the version of the language. Inside this main element
there are two different elements, with different purposes:

• The element head contains definitions that affect the whole presentation

• The element body defines how the different media behave in time

In the example, the head element contains only a layout element, which
determines the visual layout of the presentation. Its first child element,
root-layout, gives the size of the presentation in pixels. The two following
child elements each define a region, a rectangular area that visual media can be
bound to. A region has the following attributes:

• id, the name of the region, used for binding media to it
• width, height, left, and top, which determine the size and placement of
the region, in this example relative to the size of the whole presentation
• fit, which defines how visual media are displayed in the region

The body element defines the different media and their temporal behavior. The
media are defined with the tags audio, img, and text, all of which have the
attribute src, which gives the address of the media source, here as a file name. For
visual media, a region is also defined.
In addition, the example contains two different timing attributes for media, dur
and begin. The former defines the duration of the media, i.e. how long the media is
active, and the latter defines the time at which the media is activated. All times in
the example are given in seconds.
Besides the media definitions, the body element contains two different so-called
time containers, namely seq and par. These can be nested within each other in any
way. The element seq causes its child elements, which can be media or time
containers, to be played as a sequence. The other element, par, causes its child
elements to be played in parallel. The result of these elements is illustrated in
Figure A.2.
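The timing of this example can be made concrete with a small script. This is a sketch only: it assumes that the audio clip is five seconds long (as its file name suggests) and that the audio clip and the par container play as a sequence, with the media inside the par resolved as described above.

```python
# Resolve the timeline of the Figure A.1 example: an audio clip
# followed by a 5-second par container holding three visual media.
# Assumption: fem_sekunder.mid has an intrinsic duration of 5 s.

def resolve_par(par_start, par_dur, children):
    """children: (name, begin, dur) triples; dur=None means the media
    stays active until the par container ends. Returns absolute
    (start, end) intervals in seconds."""
    timeline = {}
    par_end = par_start + par_dur
    for name, begin, dur in children:
        start = par_start + begin
        end = par_end if dur is None else min(start + dur, par_end)
        timeline[name] = (start, end)
    return timeline

AUDIO_DUR = 5  # assumed intrinsic duration of the audio clip

timeline = {"fem_sekunder.mid": (0, AUDIO_DUR)}  # first child of seq
timeline.update(resolve_par(AUDIO_DUR, 5, [
    ("lappland.jpg", 0, 4),     # dur="4s"
    ("lappland.txt", 1, None),  # begin="1s"
    ("slutet.txt",   0, 3),     # dur="3s"
]))
print(timeline)
```

Under these assumptions the image is shown for seconds 5–9, the caption from second 6 until the par ends at 10, and the closing text for seconds 5–8.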

Figure A.2 Synchronization of the media in the SMIL example (Figure A.1)

Beyond what was presented in the example, SMIL can, for instance, define media
that depend on user input, links to different points in time within the presentation or
to web addresses, and visual transitions between media.

A.4 3GPP multimedia messages

Different multimedia messaging versions differ from each other not only in SMIL
features but also in which media types are supported. The target version for this
work is the 3GPP multimedia messaging documents, version 5.2.0. These specify
that the SMIL player shall support audio, video, image, and text media, as well as
the SMIL profile 3GPP PSS. The exact formats can be found in the documents.

A.5 The Series 40 platform

The target platform for this work is Nokia's Series 40, and the Nokia 6230
(Figure A.3) was chosen as the exact terminal. The phone has 8 Mbytes of memory
and two processors: a central processor and a digital signal processor. The screen is
128x128 pixels, and the keypad consists of a power key, volume keys, a 5-way
scroll key, two selection keys, a call key, an end key, as well as

Figure A.3 The keypad of the Nokia 6230

The user interface of the terminal is built so that applications can freely use the
scroll key and the left and right selection keys, while the other keys are used more
rarely. The scroll key also works as a third, middle, selection key. The volume keys
naturally keep their function also in the SMIL player.

A.6 3GPP SMIL on a Series 40 terminal

3GPP SMIL is considerably less limited than the language currently supported by
Series 40 terminals. The biggest problems are caused by SMIL presentations made
for viewing on a PC, since the original target terminal then has a much bigger
screen, more processing power, and more memory. The biggest problem is the
screen size.
When the presentation is larger than the screen, there are two alternative
solutions: scrolling the whole presentation, or shrinking the presentation so that it
can be shown at once. Scrolling seldom gives good results, since only a part of the
presentation is then visible, and it may contain many different active elements.
Moreover, it also increases the complexity of the user interface. When the
presentation is scaled down to the actual screen size, the whole presentation is at
least visible, even if small details may be lost. Usability should be preferred.
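The shrinking approach amounts to a simple computation. The sketch below uses a made-up helper name; the presentation size comes from the example in Figure A.1 and the screen size from the Nokia 6230.

```python
def scale_layout(pres_w, pres_h, screen_w, screen_h, regions):
    """Scale a presentation layout down to the terminal screen,
    preserving the aspect ratio. Regions are (x, y, w, h) in pixels."""
    # Pick the tighter of the two dimension constraints; never enlarge.
    factor = min(screen_w / pres_w, screen_h / pres_h, 1.0)
    return [tuple(round(v * factor) for v in r) for r in regions]

# A 170x208 presentation on the 128x128 screen of the Nokia 6230;
# the two regions are the 75% image area and the 25% text area.
regions = [(0, 0, 170, 156), (0, 156, 170, 52)]
print(scale_layout(170, 208, 128, 128, regions))
# [(0, 0, 105, 96), (0, 96, 105, 32)]
```

As the output shows, the whole layout fits the screen, but each region shrinks to roughly 60% of its original size, which is why small details can become illegible.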
As a last resort for showing a problematic presentation, all the media it contains
can be listed, letting the user open each of them separately. The user can then at
least make use of the information contained in the message.

A.7 User interface design

The user interface design presented here is based on the following choices (to
simplify the interface): the whole presentation cannot be scrolled, and media can be
scrolled only in the vertical direction. The latter is natural, since scrolling is mostly
needed for text media, which can be wrapped into lines of suitable length.
The right selection key is used for exiting the application, which is standard on
the Series 40 platform. The left selection key opens a menu containing features that
are needed more rarely.
Series 40 phones have no touch screen or mouse pointer. The concept of focus is
used to make it possible to select links, and the focus is visualized with a rectangle
drawn around the media. All media that have links or are scrollable can be focused.
The middle selection key is used for opening the link defined for the focused media.
The right and left scroll keys are used for moving the focus. The up and down
keys scroll the focused media, if it is scrollable.
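The focus handling described above can be sketched as follows; the class and media names are made up for illustration.

```python
# Minimal sketch of the focus model: right/left scroll keys move the
# focus between focusable media (those with links or scrollable
# content), wrapping around at the ends of the list.
class FocusModel:
    def __init__(self, focusable):
        self.items = focusable
        self.index = 0 if focusable else None

    def key_right(self):
        if self.items:
            self.index = (self.index + 1) % len(self.items)

    def key_left(self):
        if self.items:
            self.index = (self.index - 1) % len(self.items)

    def focused(self):
        return self.items[self.index] if self.items else None

model = FocusModel(["image", "long_text", "link_text"])
model.key_right()
print(model.focused())  # long_text
model.key_left(); model.key_left()
print(model.focused())  # link_text (wraps around the start)
```

Up and down key events would then be routed to the focused media object for scrolling, which keeps the four scroll directions unambiguous.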

A.7.1 An alternative to media scrolling

As an alternative to scrolling media within a presentation, the following could be
used: scrollable media can be activated with the middle selection key, and when a
media object is selected, the presentation pauses and only the selected media is
opened in full screen. The user can then scroll the media to see all of it, and return
to the presentation with the right key. With this alternative, the screen is used more
efficiently when there is, for example, a longer text. Moreover, scrolling need not
be limited to the vertical direction only.

A.8 Discussion

This work studied the viewing of multimedia messages according to the 3GPP
standard on a Series 40 mobile phone. The conclusion is that 3GPP SMIL is a
feasible standard. The 3GPP SMIL profile is much more extensive than the one
currently used in multimedia messages, and provides a much richer user experience.
Interoperability between different kinds of terminals will cause problems,
especially with messages sent from a PC, since the receiving terminal then has a
much smaller screen as well as less processor and memory capacity.