
Solving the Arabic UTF-8 Characters Transaction Issues in an Online Malay-Arabic Dictionary

Khirulnizam Abd Rahman1, Syuria Amiruddin2, Che Wan Shamsul Bahri Che Wan Ahmad3,
Wan Harun Hussaini4, Siti Zaharah Mohid5
Kolej Universiti Islam Antarabangsa Selangor
Bandar Seri Putra
43000 Bangi, Selangor
{1khirulnizam, 2syuria, 3cwshamsul, 5zaharahm}@kuis.edu.my, 4dr.husaini@yahoo.com

Abstract
The Malay-Arabic online dictionary stores and manipulates both Latin and Arabic characters. Malay is written with
Latin characters, while Arabic characters are stored in the computer system using the Unicode representation. One of
the Unicode character encoding schemes is UTF-8. Because the Arabic words are stored in Unicode, additional effort is
needed to transmit an Arabic word from the client to the server and vice versa. The paper discusses an implementation
of information retrieval for Arabic UTF-8 characters in an online Malay-Arabic dictionary. The processes involved are
displaying Arabic words, searching for Arabic words in the database, and inserting or updating an Arabic word in the
database.

Keywords: Arabic string manipulation, UTF-8, Malay-Arabic online dictionary.

1. Introduction
Malay is widely used in Malaysia, a multiracial country in Asia. Meanwhile, Arabic is an old language that is used
internationally, especially in the Middle East. Many Malays, who are mostly Muslim, learn Arabic to perform and
practice their religion [3]. This is because Arabic is the language of the Muslims’ holy book, the Quran.

Unlike the Latin alphabet, the Arabic script is considerably more complicated. It is written right-to-left, and
the letters within a word are joined [2]. A single letter takes different forms depending on its position in the
word. For example, “ن”, pronounced “noon” (similar to “N”), has one form at the beginning of a word, as in “نثفشة”,
another in the middle, as in “النهار”, and another at the end, as in “النسن”.
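The positional forms described above can be illustrated with Unicode's legacy Arabic presentation forms. The sketch below is in Python, chosen only for illustration (the dictionary itself is built with PHP). It assumes the code points U+FEE5 to U+FEE8 for the four shapes of noon and shows that compatibility normalization maps each of them back to the single letter U+0646, which is what is normally stored in text.

```python
import unicodedata

NOON = "\u0646"  # ARABIC LETTER NOON, the abstract letter stored in text

# Legacy presentation-form code points for the four positional shapes of noon.
FORMS = {
    "isolated": "\uFEE5",
    "final": "\uFEE6",
    "initial": "\uFEE7",
    "medial": "\uFEE8",
}


def base_letter(form: str) -> str:
    """Map a presentation form back to its base letter via NFKC normalization."""
    return unicodedata.normalize("NFKC", form)


if __name__ == "__main__":
    for position, form in FORMS.items():
        print(position, unicodedata.name(form), "->", unicodedata.name(base_letter(form)))
```

In ordinary text only the base letter is stored, and choosing the joined shape is left to the software that renders the word.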
To facilitate the learning of Arabic among the Malays, we are developing an Online Malay-Arabic Multimedia
Dictionary. As the name suggests, this application will be available online and will contain the basic multimedia
features, such as the pronunciation sound of each word and a related picture or video for suitable words.

We had difficulty manipulating the Arabic words, especially when saving and searching the word entries. We
therefore decided to represent the Arabic words in UTF-8, one of the Unicode character encodings. However, UTF-8
text needs some additional processing to be manipulated correctly.

This paper proposes solutions for manipulating UTF-8 text correctly. In the context of this paper, manipulating
UTF-8 text means typing and displaying the characters in the browser, transmitting the characters from the browser
to the web server and vice versa, and querying from and inserting into the database.

2. Why UTF-8?
Unicode is a collection of standards, algorithms and tables of properties [4]. It provides a unique number for every
character, regardless of platform, program or language [11]. One of the character encoding schemes in Unicode is
UTF-8. UTF-8 encodes each character as a sequence of octets (8-bit units): 1 to 6 octets in the original definition
(RFC 2279), restricted to 1 to 4 octets by the current one (RFC 3629) [1]. The PHP 5 string type is a sequence of
8-bit units [4]. Hence, PHP 5 can store and pass UTF-8 text without any compatibility issue.
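The octet counts mentioned above can be checked directly. The following is a minimal sketch in Python (used here only for illustration, not the paper's PHP code); the helper name utf8_octets is hypothetical. It shows that Latin letters occupy one octet each in UTF-8 while the Arabic letters of the sample word occupy two, so a byte-oriented string type sees eight units for a four-letter Arabic word.

```python
def utf8_octets(text):
    """Return each character paired with the number of octets in its UTF-8 form."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


MALAY_WORD = "makanan"
ARABIC_WORD = "\u0637\u0639\u0627\u0645"  # the four-letter sample word used in the paper

if __name__ == "__main__":
    print(utf8_octets(MALAY_WORD))
    print(utf8_octets(ARABIC_WORD))
    print(len(ARABIC_WORD), "characters,", len(ARABIC_WORD.encode("utf-8")), "octets")
```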

3. The environment
The web application is being developed using HTML for the user interface, PHP 5 as the middleware (server-side
scripting), Apache 2 as the web server and MySQL 5 as the database server. The testing is done on a Windows XP
client using the five most popular browsers [6]: Internet Explorer, Mozilla Firefox, Safari, Netscape Navigator
and Opera.

4. The implementation
PHP provides the MySQLi extension, an API that handles the interaction between PHP and the MySQL database server,
including for UTF-8 text. The API contains several functions that proved useful for character set handling:
mysqli_connect, mysqli_set_charset, mysqli_query and mysqli_fetch_array. Further information about the API can be
found at http://php.net/mysqli .

4.1 Preparing the HTML pages.

To prepare the browser for handling UTF-8 characters, declare the content type in the head of every HTML page
involved, using a meta element with http-equiv="content-type" and content="text/html; charset=utf-8" [8].

<html>
<head>
<title>…</title>
<meta http-equiv="content-type"
content="text/html; charset=utf-8">
</head>
<body>
…
</body>
</html>

This makes sure the browser interprets the HTML page as UTF-8. Figure 1 shows a correctly displayed Arabic word.
If the declaration is missing, the browser may display the text incorrectly, as shown in Figure 2.
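The kind of garbling that results from a missing declaration can be reproduced outside the browser. The sketch below is in Python (for illustration only) and assumes, hypothetically, that the browser falls back to ISO-8859-1; the actual fallback encoding depends on the browser and its settings. It decodes the UTF-8 bytes of the sample word with the wrong encoding and then reverses the mistake.

```python
WORD = "\u0637\u0639\u0627\u0645"  # the sample Arabic word from Figures 1 and 2


def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the misreading by reversing the two steps."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    garbled = misread_as_latin1(WORD)
    print(len(WORD), "letters become", len(garbled), "unrelated characters")
    print("repaired correctly:", repair(garbled) == WORD)
```

The bytes themselves are intact in both cases; only the interpretation differs, which is why declaring the charset is enough to fix the display.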

Figure 1. The correct display of the word “طعام” in Arabic

Figure 2. The incorrect display of the word “طعام” in Arabic

4.2 Preparing the database

Since this web application deals with Arabic, the database must be able to handle UTF-8 text. MySQL 5 Community
Edition supports UTF-8 [9]. We chose utf8_unicode_ci as the database collation. This matters for future
development, since we also plan to expand this project to handle other languages such as Thai.

Collation is the general term for the process and function of determining the sorting order of strings of characters
[5].
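The effect of a collation on sorting order can be seen with a small example. The sketch below is in Python (for illustration only) and merely approximates the idea: it contrasts ordering by raw code point with a case-insensitive ordering, which is roughly what the "_ci" suffix of a collation name stands for. It is not an implementation of the Unicode Collation Algorithm, and the word list is made up for the example.

```python
WORDS = ["makan", "air", "Buku"]  # illustrative Malay words, mixed case


def binary_order(words):
    """Sort by raw code point, so every uppercase Latin letter precedes lowercase."""
    return sorted(words)


def case_insensitive_order(words):
    """Sort ignoring letter case, roughly what a case-insensitive collation does."""
    return sorted(words, key=str.casefold)


if __name__ == "__main__":
    print(binary_order(WORDS))
    print(case_insensitive_order(WORDS))
```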

4.3 The HTML codes to UTF-8 converter

A function contributed by a PHP developer converts HTML codes into UTF-8 characters. The function, html_to_utf8,
is available in the online PHP manual [10]. To submit an Arabic word to the web server, the browser needs to
convert the word into an encoded form, referred to in this paper as HTML codes. When the word arrives at the web
server, this function turns the HTML codes back into the actual word in UTF-8.

For example, the word “طعام” is converted into “%D8%B7%D8%B9%D8%A7%D9%85”, the percent-encoded form of its UTF-8
bytes, in order to transmit the Arabic word from the client to the server. This conversion is done automatically
by most browsers. On the server, the code “%D8%B7%D8%B9%D8%A7%D9%85” needs to be converted back to “طعام” for the
next process (the query). The function is needed to do this conversion.
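The round trip in this example can be reproduced with a few lines of code. The following is a minimal sketch in Python using the standard urllib.parse module; it stands in for the browser and the server-side conversion and is not the html_to_utf8 function itself. The helper names to_wire and from_wire are hypothetical.

```python
from urllib.parse import quote, unquote

WORD = "\u0637\u0639\u0627\u0645"  # the sample Arabic word


def to_wire(text):
    """Percent-encode the UTF-8 bytes of the text, as a browser does on submit."""
    return quote(text, safe="")


def from_wire(encoded):
    """Decode the percent-encoded bytes and interpret them as UTF-8."""
    return unquote(encoded, encoding="utf-8")


if __name__ == "__main__":
    encoded = to_wire(WORD)
    print(encoded)
    print("round trip ok:", from_wire(encoded) == WORD)
```

Each Arabic letter becomes two percent-encoded octets, which matches the eight-octet code quoted in the text.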

4.4 Searching the record for a Malay word to be translated into Arabic.

Since the Malay word uses Latin characters, handling it requires nothing special. An HTML form is provided for the
user to key in a Malay word, and the form receives the user’s search request. It is not compulsory to set up this
HTML page as specified in 4.1. However, we think it is good practice to make sure all the HTML pages are capable
of handling UTF-8 text.

The Malay word is sent to the web server when the user clicks the submit button. The requested word is received
by another file that contains the script to handle the searching procedure.

The accepted word is stored in a variable and the SQL command will be generated to search the Malay word
from the table in the database. The SQL command is then sent to the database server.

The search is carried out on the database server, and any matching record is sent back to the web server. The web
server receives the Arabic word and generates the HTML page to be sent back to the requesting client. This HTML
page, which contains the Arabic word, must be set up as specified in 4.1. The process is summarized in the diagram
in Figure 3.

Receive the Malay word from the user
Send the word to the web server
Web server receives the word
Construct and send the SQL command to search the word in the database server
Web server receives the Arabic word (if any match is found)
Display the Arabic word

Figure 3. The process of searching for a Malay word and displaying the word in Arabic
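The lookup step of this process can be sketched as follows. The example is in Python with an in-memory SQLite database standing in for the MySQL server, purely for illustration; the table name kamus, its column names and the single sample entry are hypothetical. The query is parameterized so that the submitted word is never pasted into the SQL text.

```python
import sqlite3

ARABIC_FOOD = "\u0637\u0639\u0627\u0645"  # the sample Arabic word


def build_demo_db():
    """Create an in-memory table holding one illustrative Malay-Arabic pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kamus (malay TEXT PRIMARY KEY, arabic TEXT NOT NULL)")
    conn.execute("INSERT INTO kamus VALUES (?, ?)", ("makanan", ARABIC_FOOD))
    conn.commit()
    return conn


def malay_to_arabic(conn, malay_word):
    """Return the Arabic entry for a Malay word, or None when no record matches."""
    row = conn.execute(
        "SELECT arabic FROM kamus WHERE malay = ?", (malay_word,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_demo_db()
    print(malay_to_arabic(conn, "makanan"))
    print(malay_to_arabic(conn, "tiada"))
```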

4.5 Searching the record for an Arabic word to be translated into Malay.

Arabic letters are much more complex than Latin letters, and we need the Unicode representation for them. Unicode
handling differs from the regular handling of Latin characters. An HTML form is provided for the user to enter an
Arabic word, and the form receives the user’s search request. Since we are dealing with Unicode characters, it is
compulsory to set the content attribute of the page's meta tag to content="text/html; charset=utf-8", as mentioned
in 4.1.

The Arabic word is then sent to the web server when the user clicks the submit button. Before the browser sends
the Arabic word, it converts the Arabic characters into the HTML codes. This is done automatically by the browser
in order to maintain the integrity of the information transmitted. The requested word is received by another file
that contains the script to handle the searching procedure.

The received word is stored in a variable. Since the Arabic word is in HTML code form, it needs to be converted
back into UTF-8. The conversion is done using the function html_to_utf8, as mentioned in 4.3.

Later, the SQL command will be generated to search the requested Arabic word from the table in the database.
The SQL command is then sent to the database server.

The search is carried out on the database server, and any matching record is sent back to the web server. The web
server receives the Arabic and Malay words and generates the HTML page to be sent to the requesting client. This
HTML page, which contains the Arabic word, must be set up as specified in 4.1. The process is summarized in the
diagram in Figure 4.

Receive the Arabic word from the user
Send the word (HTML codes) to the web server
Receive the word (HTML codes)
Convert word in HTML codes into UTF-8
Construct and send the SQL command to search the word in the database server
Web server receives the Malay word (if any match is found)
Display the Arabic word

Figure 4. The process of searching Arabic word, and display the translation in Malay
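The receive, convert and search steps of this process can be sketched together. The example below is in Python with an in-memory SQLite database, for illustration only; the form field name kata, the table kamus and the sample entry are hypothetical. The standard parse_qs function performs the conversion from the encoded form back to UTF-8 before the query is built.

```python
import sqlite3
from urllib.parse import parse_qs


def demo_connection():
    """Create an in-memory table holding one illustrative Malay-Arabic pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kamus (malay TEXT, arabic TEXT)")
    conn.execute("INSERT INTO kamus VALUES (?, ?)", ("makanan", "\u0637\u0639\u0627\u0645"))
    conn.commit()
    return conn


def arabic_to_malay(conn, form_body):
    """Decode a submitted form body and return the matching Malay words."""
    # parse_qs percent-decodes the value and interprets the bytes as UTF-8.
    fields = parse_qs(form_body, encoding="utf-8")
    arabic_word = fields.get("kata", [""])[0]
    rows = conn.execute(
        "SELECT malay FROM kamus WHERE arabic = ?", (arabic_word,)
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = demo_connection()
    print(arabic_to_malay(conn, "kata=%D8%B7%D8%B9%D8%A7%D9%85"))
```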

4.6 Inserting or updating a record in the database.

This is a process to insert or update an entry that contains both the Arabic and the Malay words. The process is
simplified using a diagram in Figure 5.

An HTML form is provided for the user to enter a Malay word and its meaning in Arabic, and the form receives the
user’s entry. Since we are dealing with Unicode characters, it is compulsory to set the content attribute of the
page's meta tag to content="text/html; charset=utf-8", as mentioned in 4.1.

The information (the Malay and Arabic words) is then sent to the web server when the user clicks the submit
button. Before the browser sends the information, the Arabic word is converted into the HTML codes. The
information is received by another file that contains the script to handle the insert/update procedure.

The two words are stored in separate variables. Since the Arabic word is in HTML code form, it needs to be
converted back into UTF-8. No conversion is needed for the Malay word, since its Latin characters are transmitted
unchanged.

The SQL command will be generated to insert/update the Malay and Arabic words to the table in the database. The
command is then sent to the database server, and executed there.

Receive the Arabic and Malay word from the user
Send the word (HTML codes) to the web server
Receive the word (in HTML code form)
Convert the Arabic word in HTML code into UTF-8
Construct and send the SQL command to insert/update the entry to the database server
Save the changes
Display the insert/update status

Figure 5. The process of inserting/updating an entry in the database
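The save step and the status it reports can be sketched as follows. The example is in Python with an in-memory SQLite database standing in for the MySQL server, for illustration only; the table kamus, the function save_entry and the status strings are hypothetical. It tries an update first and falls back to an insert when no existing row was changed.

```python
import sqlite3


def open_dictionary():
    """Create an empty in-memory dictionary table keyed by the Malay word."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kamus (malay TEXT PRIMARY KEY, arabic TEXT NOT NULL)")
    return conn


def save_entry(conn, malay_word, arabic_word):
    """Update the Arabic word of an existing entry, or insert a new pair."""
    cur = conn.execute(
        "UPDATE kamus SET arabic = ? WHERE malay = ?", (arabic_word, malay_word)
    )
    if cur.rowcount == 0:
        # No row carried this Malay word, so create the entry.
        conn.execute(
            "INSERT INTO kamus (malay, arabic) VALUES (?, ?)", (malay_word, arabic_word)
        )
        status = "inserted"
    else:
        status = "updated"
    conn.commit()
    return status


if __name__ == "__main__":
    conn = open_dictionary()
    # First save a mistyped form (last letter missing), then correct it.
    print(save_entry(conn, "makanan", "\u0637\u0639\u0627"))
    print(save_entry(conn, "makanan", "\u0637\u0639\u0627\u0645"))
```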

5. Testing & Results

5.1 Preparing the server’s environment.

The server runs Microsoft Windows XP as the operating system, Apache 2.2.6 as the web server, MySQL Community
Edition 5.0.45 as the database server and PHP 5.2.4 for server-side scripting. For the time being, the application
is at an early stage of development, and the testing is done in an intranet environment.

5.2 Preparing the client’s environment.

The testing is done on a client computer with Windows XP as the operating system. The client machine is fully
equipped with the Arabic (right-to-left languages) input facilities, and the language bar is turned on with the
Arabic language option. To key in Arabic characters, the user needs to select the Arabic keyboard or the built-in
On-Screen Keyboard. More information on managing Arabic character input can be found on Microsoft’s website [7].

5.3 Testing against several popular web browsers.

The testing is done using selected popular browsers [6]: Mozilla Firefox 2, Internet Explorer 7, Opera 9.24,
Netscape Navigator 9 and Safari 3.0, on a Windows XP client computer.

5.4 The results.

The results are displayed in Table 1. The symbol √ means capable, while × means incapable.

Table 1. The result of testing the application using selected browsers, in Windows XP client

Browser                Display        Handling        Search for     Insert/update
                       Arabic word    Arabic input    Arabic word    Arabic word
Mozilla Firefox 2      √              √               √              √
Internet Explorer 7    √              √               √              √
Opera 9.24             √              √               √              √
Netscape Navigator 9   √              √               √              √
Safari 3.0             ×              ×               ×              √

Among the five browsers used for testing, four of them are capable of displaying and handling input for Arabic
characters correctly. They are also capable of handling the process of searching, querying and inserting the Arabic
words from/into the database server.

However, Safari 3.0 could not display and handle the Arabic characters perfectly. In Safari 3.0, the Arabic
characters in a word are not connected. For example, “طعام” is displayed as “م ا ع ط”. We therefore conclude that
Safari 3.0 is not fully capable of handling Arabic UTF-8 text properly (or maybe Safari needs a different
approach). Surprisingly, Safari 3.0 is capable of sending the Arabic characters to the web server, and the Arabic
word is stored perfectly in the database.

The word “طعام” is displayed unconnected.

Figure 6. The display of UTF-8 Arabic characters in Safari 3.0

6. Conclusions
The paper suggests a solution for handling UTF-8 Arabic text transactions in the web environment. The web server
has been set up and is capable of manipulating UTF-8 characters. Four web browsers are capable of handling the
characters correctly: Internet Explorer 7, Mozilla Firefox 2, Netscape Navigator 9 and Opera 9.24. However, Safari
3.0 is not yet capable of displaying Arabic characters correctly using the proposed solution. The client testing
should be widened to Mac and Linux machines, since these environments are also among the key players in the
market [6].

7. References
[1] F. Yergeau, “UTF-8, A Transformation Format of ISO 10646”, Request for Comments 2279, 1998, retrieved
from http://www.ietf.org/rfc/rfc3629.txt on January 2008.

[2] H. Moukdad and A. Large, “Information Retrieval from Full-Text Arabic Database: Can Search Engine Designed for
English do the Job”, Libri, 2001, pp. 63-74.

[3] Ibrahim Suliman Ahmad, “The Role of Languages During the Era of Technology”, Proceedings of the Malaysia
International Conference on Foreign Languages, Universiti Putra Malaysia, 2007, pp. 132-137.

[4] J. DeLaHunt, “Unicode and PHP: A Gentle Introduction”, php|architect, Marco Tabini & Associates, 2007, Vol 6 Issue
5, pp. 38-49.

[5] M. Davis and K. Whistler, “Unicode Technical Standards #10: Unicode Collation Algorithm”, Unicode.org, 2003,
retrieved from http://unicode.org/reports/tr10/ on January 2008.

[6] Market Share for Browsers, Operating Systems and Search Engines by NetApplications, retrieved from
http://marketshare.hitslink.com on January 2008.

[7] Microsoft Products and Arabic Support v 3.5, retrieved from
http://www.microsoft.com/middleeast/arabicdev/windows/winxp/DigitsSupport.aspx on January 2008.

[8] R. M. Lerner, “At the Forge: Unicode”, Linux Journal, Specialized Systems Consultants, Inc., 2003, Vol 2003 Issue 107.

[9] Unicode Support, retrieved from http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html on January 2008.

[10] utf8_encode, retrieved from http://php.net/utf8_encode on January 2008.

[11] What is Unicode?, retrieved from http://www.unicode.org/standard/WhatIsUnicode.html on January 2008.
