Unicode:

The Language of Universe

Written by,
Mr. Kute T. B.
(Lecturer in Information Technology,
K. K. Wagh Polytechnic, Nashik, Maharashtra, India)
tbkute@gmail.com

Introduction

Unicode is a multi-byte character representation system for computers that provides
for the encoding and exchange of the text of all of the world's languages. This text
discusses this important international language support and the concepts involved in
designing and incorporating Unicode support into Linux applications.
Unicode is not just a programming tool, but also a political and economic tool.
Applications that do not incorporate world language support can often be used only by
individuals who read and write a language supported by ASCII. This puts computer
technology based on ASCII out of reach of most of the world's people. Unicode allows
programs to utilize any of the world's character sets and therefore support any
language.
Unicode allows programmers to provide software that ordinary people can use in
their native language. The prerequisite of learning a foreign language is removed and
the social and monetary benefits of computer technology are more easily realized. It is
easy to imagine how little computer use would be seen in America if the user had to
learn Urdu to use an Internet browser. The Web would never have happened.
Linux has a large degree of commitment to Unicode. Support for Unicode is
embedded into both the kernel and the code development libraries. It is, for the most
part, incorporated into a program automatically through a few simple calls.
The basis of all modern character sets is the American Standard Code for
Information Interchange (ASCII), published in 1968 as ANSI X3.4. The notable exception
to this is IBM's EBCDIC (Extended Binary Coded Decimal Interchange Code), which was
defined before ASCII. ASCII is a coded character set (CCS), in other words, a mapping
from integer numbers to character representations. ASCII proper defines 128 characters
using seven bits; its common eight-bit extensions allow 256 characters in a single byte,
since a byte of eight binary (0 or 1) digits can take 2^8 = 256 distinct values. This is a
highly limited CCS that does not allow the representation of all of the characters of the
many different languages (like Chinese and Japanese), scientific symbols, or even
ancient scripts (runes and hieroglyphics) and music. It would be useful, but entirely
impractical, to change the size of a byte to allow a larger set of characters to be coded;
virtually all computers are based on the eight-bit byte. The solution is a character
encoding scheme (CES) that can represent numbers larger than 256 using a multi-byte
sequence of either fixed or variable length. These values are then mapped
through the CCS to the characters they represent.
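To make the CCS/CES distinction concrete, here is a minimal C sketch (the variable
names and the example code point U+4E2D are mine, chosen for illustration): a CCS such
as ASCII maps an integer like 65 to the character 'A', while any code point above 255
cannot fit in one eight-bit byte and therefore needs a CES.

    #include <stdio.h>

    int main(void)
    {
        /* A CCS maps integers to characters: in ASCII, 65 maps to 'A'. */
        unsigned char byte = 65;
        printf("%d -> %c\n", byte, byte);            /* prints: 65 -> A */

        /* One eight-bit byte holds only 2^8 = 256 distinct values, so a
           code point outside that range, e.g. U+4E2D (a CJK ideograph),
           cannot be stored in a single byte; a CES must spread it over
           a multi-byte sequence. */
        unsigned int code_point = 0x4E2D;
        printf("U+%04X does not fit in one byte (max = 255)\n", code_point);
        return 0;
    }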

Technical Introduction

The Unicode Standard is the universal character encoding standard used for
representation of text for computer processing. Versions of the Unicode Standard are
fully compatible and synchronized with the corresponding versions of International
Standard ISO/IEC 10646. For example, Unicode 3.0 contains all the same characters
and code points as ISO/IEC 10646-1:2000. Unicode 3.1 adds all the characters and
code points of ISO/IEC 10646-2:2001. The Unicode Standard provides additional
information about the characters and their use. Any implementation that is conformant
to Unicode is also conformant to ISO/IEC 10646.
Unicode provides a consistent way of encoding multilingual plain text and brings
order to a chaotic state of affairs that has made it difficult to exchange text files
internationally. Computer users who deal with multilingual text -- business people,
linguists, researchers, scientists, and others -- will find that the Unicode Standard
greatly simplifies their work. Mathematicians and technicians, who regularly use
mathematical symbols and other technical characters, will also find the Unicode
Standard valuable.
The design of Unicode is based on the simplicity and consistency of ASCII, but
goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode
Standard provides the capacity to encode all of the characters used for the written
languages of the world. To keep character coding simple and efficient, the Unicode
Standard assigns each character a unique numeric value and name.
The original goal was to use a single 16-bit encoding that provides code points
for more than 65,000 characters. While 65,000 characters are sufficient for encoding
most of the many thousands of characters used in major languages of the world, the
Unicode standard and ISO/IEC 10646 now support three encoding forms that use a
common repertoire of characters but allow for encoding as many as a million more
characters. This is sufficient for all known character encoding requirements, including
full coverage of all historic scripts of the world, as well as common notational systems.

Unicode definition

Unicode is usually used as a generic term referring to a two-byte character-
encoding scheme. The Unicode CCS 3.1 is officially known as the ISO/IEC 10646-1
Universal Multiple Octet Coded Character Set (UCS). Unicode 3.1 adds 44,946 new
encoded characters. With the 49,194 characters already existing in Unicode 3.0, the
total is now 94,140.
The Unicode CCS utilizes a four-dimensional coding space of 128 three-
dimensional groups. Each group has 256 two-dimensional planes. Each plane consists
of 256 one-dimensional rows, and each row has 256 cells. Each cell in this coding space
either codes a character or is declared unused. This coding concept is called UCS-4;
four octets of bits are used to represent each character specifying the group, plane, row
and cell.
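A small C sketch of this decomposition (the function name and the example value are
illustrative only): a UCS-4 value is split into its group, plane, row and cell octets
with shifts and masks.

    #include <stdio.h>
    #include <stdint.h>

    /* Split a UCS-4 value into the four octets described above. */
    static void ucs4_split(uint32_t ucs, unsigned parts[4])
    {
        parts[0] = (ucs >> 24) & 0x7F;  /* group: 128 groups           */
        parts[1] = (ucs >> 16) & 0xFF;  /* plane: 256 planes per group */
        parts[2] = (ucs >> 8)  & 0xFF;  /* row:   256 rows per plane   */
        parts[3] =  ucs        & 0xFF;  /* cell:  256 cells per row    */
    }

    int main(void)
    {
        unsigned p[4];
        ucs4_split(0x00004E2D, p);      /* a BMP character: group 0, plane 0 */
        printf("group %u, plane %u, row %u, cell %u\n",
               p[0], p[1], p[2], p[3]); /* prints: group 0, plane 0, row 78, cell 45 */
        return 0;
    }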
The first plane (plane 00 of the group 00) is the Basic Multilingual Plane (BMP).
The BMP defines characters in general use in alphabetic, syllabic and ideographic
scripts as well as various symbols and digits. Subsequent planes are used for additional
characters or other coded entities not yet invented. This full range is needed to cope
with all of the world's languages; specifically, some East Asian languages that have
almost 64,000 characters.
The BMP is used as a two-octet coded character set identified as the UCS-2
form of ISO 10646. ISO 10646 UCS-2 is commonly referred to as (and is identical to)
Unicode. This BMP, like all UCS planes, contains 256 rows each of 256 cells, and a
character is coded at a cell by just the row and cell octets in the BMP. This allows 16-bit
character codes to be used for writing most commercially important languages. UCS-2
requires no code page switching, code extensions or code states. UCS-2 is a simple
method to incorporate Unicode into software, but it is limited to supporting only the
Unicode BMP.
To represent a coded character set (CCS) of more than 2^8 = 256
characters with eight-bit bytes, a character encoding scheme (CES) is required.

A brief history

The concept of the Unicode character set began in 1987, thanks to Joe Becker from
Xerox and Mark Davis from Apple. The following year, Becker, Davis, and Lee Collins
(currently of Xerox; formerly of Apple) began investigating the design and soon made
the case for Han unification to ANSI and ISO. Unicode is, indeed, based on the historic
evolution of the Chinese character set (Han).
Several people from various high tech companies began holding bimonthly
meetings in 1989. By the end of 1990, an initial, full-review draft was created.
In 1991, the group became the Unicode Consortium, a non-profit organization
incorporated as Unicode, Inc.
Version 1.0 became available to the public for the first time in 1992.

1. Principles And Implementation Of Unicode

Unicode's basic principles

Briefly, the Unicode character set encompasses the following 10 basic principles:
• 16-bit characters
• Full encoding
• Characters, not glyphs
• Semantics
• Plain text
• Logical order
• Unification
• Dynamic composition
• Equivalent sequence
• Convertibility
Relative to the first principle, characters use a 2-byte representation and are
uniform in width (there's a slight exception to this, which I'll address later).
As noted, Unicode relies upon characters, not glyphs. A character is the smallest
component of written language that has semantic value (including phonetic value). On
the other hand, a glyph is the shape representation of a character. For example, glyphs
of the Latin lowercase "a" can include italic, bold, serif, and sans-serif renderings
of the same letter.
A character shape that depends on position or semantic context, as with Arabic
letters, is likewise a matter of glyphs.
A ligature is considered to be a glyph resulting from a combination of two
characters, such as "fi" in some fonts and "æ" as it's used in English (but not as it's
used in Norwegian).
As for semantics, it is important to note that each individual character has
properties. The character can be numeric, can have spacing properties, and can have
directionality. It becomes very important to know the properties of a character when
you receive an input stream, in order to render it properly or to perform comparisons.
Regarding plain text, Unicode encodes only the information needed to adequately
render the characters themselves. There is no additional encoding in Unicode for layout,
language, font, style, color, hypertext links, kerning, etc.

Unicode relies upon logical order. Information is stored in the order it is typed in,
which may be different from rendering order. The principle of unification refers to the
fact that no duplicate encodings of a character occur simply because of varying uses.
Use in different languages is just another use (see prior point).
It is also important to note that round trip compatibility is more important than
unification. This can get a little fuzzy, for example, when there are several "hyphen"
symbols (which one could argue are just varying uses of the same character).
Dynamic composition, also known as combining characters, is the process of
following a character with diacritic marks that a rendering engine can then display as
one character.
As for equivalent sequence, pre-composed characters should be equivalent to
their combining counterparts, e.g. è = e + `. These are guidelines, however, and are not
necessary for conformance.
Java, for example, does not treat the two forms in the previous example as
equivalent. And, as a last point regarding equivalent sequence, the order of combining
characters can make a difference in equivalency, but not always.
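To illustrate in C (a sketch using the è example above; no normalization library is
involved), the pre-composed and combining forms are simply different element
sequences, which is why a naive comparison, like the Java behavior just noted, does
not treat them as equal:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Pre-composed form: one code point. */
        uint16_t composed[]   = { 0x00E8 };           /* è          */
        /* Combining counterpart: base letter plus combining grave accent. */
        uint16_t decomposed[] = { 0x0065, 0x0300 };   /* e + grave  */

        /* An element-by-element comparison sees different lengths and
           different values, even though the sequences are canonically
           equivalent; recognizing that requires normalization. */
        printf("%zu element(s) vs %zu element(s): naive compare fails\n",
               sizeof composed / sizeof *composed,
               sizeof decomposed / sizeof *decomposed);
        return 0;
    }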
As of May 1993, accurate round-trip convertibility is guaranteed between the
Unicode Standard and other standards in wide usage. If variant forms are given
separate codes within one of the widely used standards, these were preserved in
Unicode.

Implementation levels

Independently of the two encoding forms of UCS, the standard ISO/IEC 10646
also draws a distinction between three different implementation levels. The full coded
character set is available on level 3. On the lower levels certain subsets of the
characters are not usable. This restricts the range of languages that can be coded on
these levels. On the other hand it makes simpler implementations possible.
A full implementation of the Unicode standard amounts to an implementation at
level 3 of UCS.
The simplest implementation level 1 works exactly like the older simple coded
character sets, such as ASCII and ISO/IEC 8859-1: Each graphic character occupies
one position and moves the active position one step in the writing direction (even
though the movement need not be constant; it is not if a proportional font is used). This
model works well for, among others, the Latin, Greek, and Cyrillic scripts. On this level
the composite letters, consisting of a base letter and one or more diacritical marks,
which are used in certain languages, are included as single characters in their own
right. UCS includes the composite letters of all official languages and also of most other
languages with a well-established orthography using these scripts. The Arabic and
Hebrew scripts are also handled on this level, but they introduce an extra complication:
Arabic and Hebrew are normally written from right to left, but when words in e.g. the
Latin script are included within such text, these are written in their normal direction, from
left to right. In computer memory all characters are stored in the logical order, i.e. the
order in which the sounds of the words are pronounced and the letters normally are
input. When displayed, the writing direction of some words must be changed, relative to
the order in memory. Two alternative methods to handle bi-directional text can be used
together with UCS, one based on the international standard ISO/IEC 6429 and one
defined for Unicode.

Other languages for which implementation level 1 is sufficient are Japanese and
Chinese. These are not affected by either of the two complications noted above. For
these languages it is the large number of different characters that makes
implementations difficult.
Implementation level 2 also handles the South Asian scripts, e.g. Devanagari,
used on the Indian subcontinent. These cause further complications for display
software, since in many cases both the appearance and the relative position of a certain
letter are determined by the nearest surrounding letters.
On the full implementation level 3, conforming programs must also be able to
handle independent combining characters, e.g. accents and other diacritical marks that
are printed over, under or through ordinary letters. Such characters can be freely
combined with other characters and UCS sets no limit on the number of combining
characters attached to a base character. A difference compared to some other coded
character sets is that the codes for combining characters are stored after the code of
the base character, not before it.

2. Encoding Forms

Character encoding standards define not only the identity of each character and
its numeric value, or code point, but also how this value is represented in bits.
The Unicode Standard defines three encoding forms that allow the same data to
be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per
code unit). All three encoding forms encode the same common character repertoire and
can be efficiently transformed into one another without loss of data. The Unicode
Consortium fully endorses the use of any of these encoding forms as a conformant way
of implementing the Unicode Standard.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming
all Unicode characters into a variable length encoding of bytes. It has the advantages
that the Unicode characters corresponding to the familiar ASCII set have the same byte
values as ASCII, and that Unicode characters transformed into UTF-8 can be used with
much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact and all the heavily
used characters fit into a single 16-bit code unit, while all other characters are
accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width, single
code unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32-bits) of data for each
character.
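As a concrete sketch of one encoding form, the following C function maps a code point
to its UTF-8 byte sequence. It is a simplified illustration (it does not reject the
surrogate range), not a production encoder:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code point as UTF-8; returns the byte count (1-4),
       or 0 for values beyond the Unicode range. */
    static int utf8_encode(uint32_t cp, unsigned char out[4])
    {
        if (cp <= 0x7F) {                    /* same byte values as ASCII */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp <= 0xFFFF) {
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp <= 0x10FFFF) {
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = utf8_encode(0x00FC, buf);    /* U+00FC needs two bytes */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);
        printf("\n");                        /* prints: C3 BC */
        return 0;
    }

In UTF-16 the same character is the single code unit 0x00FC, and in UTF-32 the single
code unit 0x000000FC; all three forms carry the same code point.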

Defining Elements of Text

Written languages are represented by textual elements that are used to create words
and sentences. These elements may be letters such as "w" or "M"; characters such as
those used in Japanese Hiragana to represent syllables; or ideographs such as those
used in Chinese to represent full words or concepts.

The definition of text elements often changes depending on the process handling
the text. For example, in historic Spanish-language sorting, "ll" counts as a single text
element. However, when Spanish words are typed, "ll" is two separate text elements: "l"
and "l".
To avoid deciding what is and is not a text element in different processes, the
Unicode Standard defines code elements (commonly called "characters"). A code
element is fundamental and useful for computer text processing. For the most part,
code elements correspond to the most commonly used text elements. In the case of the
Spanish "ll", the Unicode Standard defines each "l" as a separate code element. The
task of combining two "l" together for alphabetic sorting is left to the software processing
the text.

Text Processing

Computer text handling involves processing and encoding. Consider, for
example, a word processor user typing text at a keyboard. The computer's system
software receives a message that the user pressed a key combination for "T", which it
encodes as U+0054. The word processor stores the number in memory, and also
passes it on to the display software responsible for putting the character on the screen.
The display software, which may be a window manager or part of the word processor
itself, uses the number as an index to find an image of a "T", which it draws on the
monitor screen. The process continues as the user types in more characters.
The Unicode Standard directly addresses only the encoding and semantics of
text. It addresses no other action performed on the text. For example, the word
processor may check the typist's input as it is being entered, and display misspellings
with a wavy underline. Or it may insert line breaks when it counts a certain number of
characters entered since the last line break. An important principle of the Unicode
Standard is that it does not specify how to carry out these processes as long as the
character encoding and decoding is performed properly.

Interpreting Characters and Rendering Glyphs

The difference between identifying a code point and rendering it on screen or
paper is crucial to understanding the Unicode Standard's role in text processing. The
character identified by a Unicode code point is an abstract entity, such as "LATIN
CAPITAL LETTER A" or "BENGALI DIGIT FIVE." The mark made on screen or paper
-- called a glyph -- is a visual representation of the character.
The Unicode Standard does not define glyph images. The standard defines how
characters are interpreted, not how glyphs are rendered. The software or hardware-
rendering engine of a computer is responsible for the appearance of the characters on
the screen. The Unicode Standard does not specify the size, shape, or style of on-
screen characters.

Character Sequences

Text elements are encoded as sequences of one or more characters. Certain of
these sequences are called combining character sequences, made up of a base letter
and one or more combining marks, which are rendered around the base letter (above it,
below it, etc.). For example, a sequence of "a" followed by a combining circumflex "^"
would be rendered as "â".
The Unicode Standard specifies the order of characters in a combining character
sequence. The base character comes first, followed by one or more non-spacing marks.
If there is more than one non-spacing mark, the order in which the non-spacing marks
are stored isn't important if the marks don't interact typographically. If they do interact,
then their order is important. The Unicode Standard specifies how successive non-
spacing characters are applied to a base character, and when the order is significant.
Certain sequences of characters can also be represented as a single character,
called a pre-composed character (or composite or decomposable character). For
example, the character "ü" can be encoded as the single code point U+00FC "ü" or as
the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The
Unicode Standard encodes pre-composed characters for compatibility with established
standards such as Latin 1, which includes many pre-composed characters such as "ü"
and "ñ".
Pre-composed characters may be decomposed for consistency or analysis. For
example, in alphabetizing (collating) a list of names, the character "ü" may be
decomposed into a "u" followed by the non-spacing character "¨". Once the character
has been decomposed, it may be easier to work with the character because it can be
processed as a "u" with modifications. This allows easier alphabetical sorting for
languages where character modifiers do not affect alphabetical order. The Unicode
Standard defines the decompositions for all pre-composed characters. It also defines
normalization forms in Unicode Normalization Forms, to provide for unique
representations of characters.
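A minimal C sketch of decomposition used this way; the two-entry table below is a toy
stand-in for the full decomposition data defined by the Unicode Standard:

    #include <stdio.h>
    #include <stdint.h>

    /* Toy canonical decompositions for two Latin 1 letters. */
    static int decompose(uint16_t cp, uint16_t out[2])
    {
        switch (cp) {
        case 0x00FC: out[0] = 0x0075; out[1] = 0x0308; return 2; /* u + diaeresis    */
        case 0x00F1: out[0] = 0x006E; out[1] = 0x0303; return 2; /* n + tilde        */
        default:     out[0] = cp;                      return 1; /* no decomposition */
        }
    }

    int main(void)
    {
        uint16_t parts[2];
        int n = decompose(0x00FC, parts);
        /* A collation that ignores the marks compares only the base
           letter, so "ü" sorts together with "u". */
        printf("base letter U+%04X (%d element(s) after decomposition)\n",
               parts[0], n);
        return 0;
    }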

Assigning Character Codes

A single number is assigned to each code element defined by the Unicode
Standard. Each of these numbers is called a code point and, when referred to in text, is
listed in hexadecimal form following the prefix "U+". For example, the code point U+0041
is the hexadecimal number 0041 (equal to the decimal number 65). It represents the
character "A" in the Unicode Standard.
Each character is also assigned a unique name that specifies it and no other. For
example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A."
U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode
names are identical to the ISO/IEC 10646 names for the same characters.
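A one-line C illustration of the same point, printing U+0041 in hexadecimal, in
decimal, and as the character itself:

    #include <stdio.h>

    int main(void)
    {
        unsigned int cp = 0x0041;   /* U+0041 LATIN CAPITAL LETTER A */
        printf("U+%04X = decimal %u = '%c'\n", cp, cp, cp);
        return 0;                   /* prints: U+0041 = decimal 65 = 'A' */
    }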
The Unicode Standard groups characters together by scripts in code blocks. A
script is any system of related characters. The standard retains the order of characters
in a source set where possible. When the characters of a script are traditionally
arranged in a certain order -- alphabetic order, for example -- the Unicode Standard
arranges them in its code space using the same order whenever possible. Code blocks
vary greatly in size. For example, the Cyrillic code block does not exceed 256 code
points, while the CJK code blocks contain many thousands of code points.
Code elements are grouped logically throughout the range of code points, called
the codespace. The coding starts at U+0000 with the standard ASCII characters, and
continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts; then followed by
symbols and punctuation. The code space continues with Hiragana, Katakana, and
Bopomofo. The unified Han ideographs are followed by the complete set of modern
Hangul. The range of surrogate code points is reserved for use with UTF-16. Towards
the end of the BMP is a range of code points reserved for private use, followed by a
range of compatibility characters. The compatibility characters are character variants
that are encoded only to enable transcoding to earlier standards and old
implementations, which made use of them.
A range of code points on the BMP and two very large ranges in the
supplementary planes are reserved as private use areas. These code points have no
universal meaning, and may be used for characters specific to a program or by a group
of users for their own purposes. For example, a group of choreographers may design a
set of characters for dance notation and encode the characters using code points in
user space. A set of page-layout programs may use the same code points as control
codes to position text on the page. The main point of user space is that the Unicode
Standard assigns no meaning to these code points, and reserves them as user space,
promising never to assign them meaning in the future.

3. Unicode And ISO/IEC 10646

The Unicode Standard is very closely aligned with the international standard ISO/IEC
10646 (also known as the Universal Character Set, or UCS, for short). Close
cooperation and formal liaison between the committees has ensured that all additions to
either standard are coordinated and kept in synch, so that the two standards maintain
exactly the same character repertoire and encoding.
Version 3.0 of the Unicode Standard is code-for-code identical to ISO/IEC 10646-
1:2000. This code-for-code identity is true for all encoded characters in the two
standards, including the East Asian (Han) ideographic characters. Subsequent versions
of the Unicode Standard track additional parts and amendments to ISO/IEC 10646.
The Unicode encoding forms correspond exactly to forms of use and
transformation formats also defined in ISO/IEC 10646. UTF-8 and UTF-16 are defined
in Annexes to ISO/IEC 10646-1:2000. And UTF-32 corresponds to the four-octet form
UCS-4 of 10646.

Unicode vs. ISO 10646

We should examine how Unicode differs from ISO 10646.


ISO 10646 is simply a character set that maps characters to binary code
numbers. Unicode, on the other hand, is the 2-byte/4-byte form of that same character
set plus character properties, implementation rules, and guidelines.
ISO is an international standards organization, with academic and governmental
focus. Unicode remains a private consortium, made up of several commercial entities,
plus some academic and user groups.
It should also be noted that ISO and Unicode have been working together
cooperatively since 1991 in an effort to ensure that there are no conflicts or
discrepancies between the two character sets.

4. General Layout of Unicode

Code point section descriptions

The following list illustrates how the character set is divided up. The first range
holds the General scripts, followed by Symbols, CJK Miscellaneous (CJK stands for
Chinese-Japanese-Korean), the CJK ideographs, Hangul, Surrogates, Private use, and,
lastly, Compatibility and Special.
U+0000--U+1FFF - General scripts
Basic Latin: U+0000--U+007F
Latin-1 Supplement: U+0080--U+00FF
Latin Extended-A: U+0100--U+017F
Latin Extended-B: U+0180--U+024F
IPA Extensions: U+0250--U+02AF
Spacing Modifier Letters: U+02B0--U+02FF
Combining Diacritical Marks: U+0300--U+036F
Greek: U+0370--U+03FF
Cyrillic: U+0400--U+04FF
Armenian: U+0530--U+058F
Hebrew: U+0590--U+05FF
Arabic: U+0600--U+06FF
Devanagari: U+0900--U+097F
Bengali: U+0980--U+09FF
Gurmukhi: U+0A00--U+0A7F
Gujarati: U+0A80--U+0AFF
Oriya: U+0B00--U+0B7F
Tamil: U+0B80--U+0BFF
Telugu: U+0C00--U+0C7F
Kannada: U+0C80--U+0CFF
Malayalam: U+0D00--U+0D7F
Thai: U+0E00--U+0E7F
Lao: U+0E80--U+0EFF
Tibetan: U+0F00--U+0FBF
Georgian: U+10A0--U+10FF
Hangul Jamo: U+1100--U+11FF
Latin Extended Additional: U+1E00--U+1EFF
Greek Extended: U+1F00--U+1FFF
U+2000--U+2FFF - Symbols
General Punctuation: U+2000--U+206F
Superscripts and Subscripts: U+2070--U+209F
Currency Symbols: U+20A0--U+20CF
Combining Marks for Symbols: U+20D0--U+20FF
Letterlike Symbols: U+2100--U+214F
Number Forms: U+2150--U+218F
Arrows: U+2190--U+21FF
Mathematical Operators: U+2200--U+22FF
Miscellaneous Technical: U+2300--U+23FF
Control Pictures: U+2400--U+243F
Optical Character Recognition: U+2440--U+245F
Enclosed Alphanumerics: U+2460--U+24FF
Box Drawing: U+2500--U+257F
Block Elements: U+2580--U+259F
Geometric Shapes: U+25A0--U+25FF
Miscellaneous Symbols: U+2600--U+26FF
Dingbats: U+2700--U+27BF
U+3000--U+33FF - CJK miscellaneous
CJK Symbols and Punctuation: U+3000--U+303F
Hiragana: U+3040--U+309F
Katakana: U+30A0--U+30FF
Bopomofo: U+3100--U+312F
Hangul Compatibility Jamo: U+3130--U+318F
Kanbun: U+3190--U+319F
Enclosed CJK Letters and Months: U+3200--U+32FF
CJK Compatibility: U+3300--U+33FF
U+4E00--U+9FFF - CJK ideographs
U+AC00--U+D7A3 - Hangul
U+D800--U+DFFF - Surrogates (other planes)
U+E000--U+F8FF - Private use
U+F900--U+FFFF - Compatibility and Special
CJK Compatibility Ideographs: U+F900--U+FAFF
Alphabetic Presentation Forms: U+FB00--U+FB4F
Arabic Presentation Forms-A: U+FB50--U+FDFF
Combining Half Marks: U+FE20--U+FE2F
CJK Compatibility Forms: U+FE30--U+FE4F
Small Form Variants: U+FE50--U+FE6F
Arabic Presentation Forms-B: U+FE70--U+FEFF
Halfwidth and Fullwidth Forms: U+FF00--U+FFEF
Specials: U+FEFF, U+FFF0--U+FFFF

The general layout

Based upon Unicode's general layout, about 8K code points are allocated to
alphabetic scripts. This includes everything from Roman letters through Indic scripts to
Georgian, Armenian, Arabic, Hebrew, Greek, and Cyrillic. There is also a section of
about 4K for symbols, and a little area called the CJK auxiliary area, which encodes
auxiliary characters used with the Chinese-Japanese-Korean scripts. The unified Han
area holds nearly 21,000 Chinese characters, unified across the three languages; as a
result, you cannot look at a character code alone and tell whether it came from Chinese,
Japanese, or Korean text. There is also reserved space, an area for Hangul, and the
Surrogate Area, which has been described. Korean can be written with a very elegant
writing system that uses very few symbols (no more than 40 vowels and consonants),
which are combined to make syllables. These syllables are then written in a block, and
each block is encoded as a Hangul syllable character. Historically, Korean was written
with Chinese characters, the older way of writing; the Hangul system is in great use
now, but it has taken a long time since its inception in 1446.
As for the Private Use area, you can do anything you want here, as long as the
parties exchanging the data are in agreement as to what that use is. The Compatibility
Zone is for roundtrip mappings to and from older character sets.

Unicode extension mechanism

Unicode can encode a maximum of 64K characters with only 16 bits of
information, yet some Chinese dictionaries list 70,000 or more characters. And while
ancient scripts are missing from this range, they are required for scholarly work.
Extension to 1 million characters has been added by pairing two 16-bit elements.
This allows for representation of rare characters in future extensions of Unicode. A 2K
region of Unicode, called the Surrogate Area, has been set aside (high surrogate area
U+D800 to U+DBFF; low surrogate area U+DC00 to U+DFFF).
A surrogate pair is a sequence of two Unicode values in which the first value is a
high surrogate and the second is a low surrogate. The mechanism provides a
1 million-plus character extension. Currently, no characters are defined there, but that
will not remain the case for long.
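The arithmetic behind a surrogate pair can be sketched in C as follows (the helper
name is mine, not a standard API): subtract 0x10000, then put the top ten bits in the
high surrogate and the bottom ten bits in the low surrogate.

    #include <stdio.h>
    #include <stdint.h>

    /* Map a supplementary code point (U+10000..U+10FFFF) to its pair. */
    static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
    {
        uint32_t v = cp - 0x10000;              /* 20 bits remain       */
        *hi = (uint16_t)(0xD800 + (v >> 10));   /* high: top ten bits   */
        *lo = (uint16_t)(0xDC00 + (v & 0x3FF)); /* low: bottom ten bits */
    }

    int main(void)
    {
        uint16_t hi, lo;
        to_surrogates(0x10000, &hi, &lo);  /* first code point beyond the BMP */
        printf("U+10000 -> %04X %04X\n", hi, lo);   /* prints: D800 DC00 */
        return 0;
    }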

Conformance to the Unicode Standard

The Unicode Standard specifies unambiguous requirements for conformance in
terms of the principles and encoding architecture it embodies. A conforming
implementation has the following characteristics, as a minimum requirement:
• Characters are from the common repertoire;
• Characters are encoded according to one of the encoding forms;
• Characters are interpreted with Unicode semantics;
• Unassigned codes are not used; and,
• Unknown characters are not corrupted.
Implementations of the Unicode Standard are conformant as long as they follow
the rules for encoding characters into sequences of bytes, words or double words
that are in effect for the chosen encoding form and otherwise interpret characters
according to the Unicode specification.

Stability

The Unicode Standard has a lot of room to grow, and there are a considerable number
of scripts that will be encoded in upcoming versions. This process is strictly additive, in
other words, while characters may be added or new character properties may be
defined, no characters will be removed -- or reinterpreted in incompatible ways. These
stability guarantees make it possible to encode data in Unicode and expect that future
implementations that conform to a later version of the Unicode Standard will be able to
interpret them in the same way as implementations conforming to The Unicode
Standard, Version 3.2.

5. Benefits and Some Misunderstandings

Benefits of Unicode to application developers:

Support for the Unicode standard provides many benefits to application
developers. These benefits include:
• Global source and binary
• Support for mixed-script computing environments
• Reduced time-to-market for localized products
• Expanded market access
• Improved cross-platform data interoperability through a common codeset
• Space-efficient encoding scheme for data storage
Unicode is a building block that designers and engineers can use to create truly
global applications. By making use of one flat codeset, end-users can exchange data
more freely without relying on elaborate code conversions to make characters
comprehensible.

Advantages of Unicode

One advantage of Unicode is that each character has a unique code point, a very
powerful feature. This means there's no confusion about how to interpret a code point.
(Compare this situation with multibyte character sets, where a character must either be
tagged with its code page or some state must be maintained.) This facilitates data exchange and
requires no stateful encoding.
Another advantage is that each character is assigned a set of properties. These
properties capture semantics, direction, case, type, relations to other characters, etc.
This also eliminates ambiguity of certain characters and provides for precise character
typing.

Disadvantages of Unicode

One disadvantage of Unicode is that it takes more space to store plain text. Each
character derived from the single-byte ISO standards requires an extra eight bits of
space over its single-byte representation. However, compression techniques are
available if required. UTF-8, for example, is compact for ASCII text: it stores those
characters in one byte rather than two.
Transmission of Unicode data can also use more bandwidth. Despite this
drawback, Unicode, in the form of UTF-8, is becoming a popular item on the Web.
It also requires significant extensions to tools and libraries. The C library requires
new functions with new names that work with Unicode data. This can bifurcate the API
set of an operating system; Win32, for instance, splits the subset of its API functions
that use strings into two parts.
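The C library's split can be seen in parallel function pairs such as strlen and wcslen;
a minimal sketch (note that the width of wchar_t itself varies: 16 bits on Windows,
typically 32 bits on Linux):

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        const char    *narrow = "Unicode";   /* byte-oriented API      */
        const wchar_t *wide   = L"Unicode";  /* wide-character variant */

        /* Two names for the same operation, one per string type. */
        printf("strlen: %zu, wcslen: %zu\n", strlen(narrow), wcslen(wide));
        return 0;
    }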

Some misunderstandings

Now, let us address some misunderstandings about the system.


Misunderstanding #1: Unicode is a two-byte encoding system
The data structure for Unicode text is not an array of single-byte characters;
routinely, it is an array of 16-bit elements. So don't think of it as a two-byte system,
except possibly when you have to exchange data between little-endian and big-endian
systems.
Misunderstanding #2: A single null byte terminates a Unicode string
If Unicode is being stored as a 16-bit array, then the terminator is the NULL
Unicode character U+0000.
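A short C sketch of why a single zero byte is not a terminator: viewed byte by byte,
even the one-character string "A" already contains a zero byte (the byte order shown
assumes a little-endian machine):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* "A" as a 16-bit array: U+0041 terminated by U+0000. */
        uint16_t text[] = { 0x0041, 0x0000 };

        /* Scanning for one NULL byte would stop inside the first
           character; the real terminator is the 16-bit value 0x0000. */
        const unsigned char *bytes = (const unsigned char *)text;
        printf("%02X %02X %02X %02X\n",
               bytes[0], bytes[1], bytes[2], bytes[3]); /* 41 00 00 00 */
        return 0;
    }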
Misunderstanding #3: One character = one Unicode element
In the Surrogate area, two Unicode elements are required for one character.
Also, in terms of composite representation, an unlimited number of elements can be
combined to make up a character.
Misunderstanding #4: Unicode pointer arithmetic is the same as ASCII
In short, say goodbye to p++ and p--. A character may require more than one
Unicode element to represent it. As a result, techniques taken from multi-byte
processing must be used. Also, Win32 supports composite representations in CharNext
and CharPrev.
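A hedged C sketch of what such multi-byte-style stepping looks like for a 16-bit string
(it handles only surrogate pairs; combining sequences would need further logic, and the
helper name is illustrative):

    #include <stdio.h>
    #include <stdint.h>

    /* Advance to the next character: plain p++ is wrong when the
       current element is a high surrogate, because the character
       then occupies two 16-bit elements. */
    static const uint16_t *unicode_next(const uint16_t *p)
    {
        if (*p >= 0xD800 && *p <= 0xDBFF)
            return p + 2;                    /* skip the surrogate pair */
        return p + 1;
    }

    int main(void)
    {
        /* "A" followed by one supplementary character (a surrogate pair). */
        const uint16_t s[] = { 0x0041, 0xD800, 0xDC00, 0x0000 };
        int chars = 0;
        for (const uint16_t *p = s; *p != 0; p = unicode_next(p))
            chars++;
        printf("%d characters in %d elements\n",
               chars, (int)(sizeof s / sizeof *s - 1)); /* prints: 2 in 3 */
        return 0;
    }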
Misunderstanding #5: Unicode solves all I18N and L10N problems
We wish. Unicode trades one set of problems for another. However, the new set
is much more interesting and offers greater possibilities for worldwide operations.

Unicode realized in Windows NT

Some of the highlights of Unicode in Windows NT include:


• Kernel Mode is mostly Unicode.
• User-mode applications can be Unicode or multibyte (which means you can write
an application that uses Unicode exclusively).
• GDI text calls assume Unicode.
• NTFS file names are in Unicode.
• TrueType fonts are Unicode encoded.
• Resource strings are compiled to Unicode.

There are probably some bugs here and there, but, in general, Unicode works
well in the Windows NT 5.0 language kit.

Partial support in Windows 95/98

If running in Windows 95 or 98, you'll find that resource strings are compiled to
Unicode. You'll also see that TrueType fonts are Unicode encoded, and note some Unicode
functions such as TextOutW and related calls.
You'll also note Unicode character semantics and sorting.

The Unicode Consortium

The Unicode Consortium is responsible for defining the behavior and
relationships between Unicode characters, and providing technical information to
implementers. The Consortium cooperates with ISO in refining the specification and
expanding the character set. It has liaison status "C" with ISO/IEC JTC 1/SC 2/WG 2,
which is responsible for ISO/IEC 10646.
Members of the Consortium include major computer corporations, software
producers, database vendors, research institutions, international agencies, various user
groups, and interested individuals. The Consortium Members section lists these
members.
The Consortium's Directors and Officers come from a variety of organizations,
representing a wide spectrum of text-encoding and computing applications.

The Consortium's technical activities are conducted by the Unicode Technical
Committee, which is responsible for the creation, maintenance, and quality of the
Unicode Standard.
The Unicode Consortium sponsors an occasional Bulldog award given to various
personalities for their "outstanding personal contributions to the philosophy and
dissemination of the Unicode Standard".

6. Conclusion

The purpose of this text is to give a brief technical overview of the new character
set standard ISO/IEC 10646 and the closely related Unicode standard, a character
coding system designed to support the interchange, processing, and display of the
written texts of the world's diverse languages.
To support the world's languages, a coded character set (CCS) of more than the
2^8 = 256 characters of extended ASCII is required, together with a character encoding
scheme (CES) that can represent it using eight-bit bytes. Unicode provides this: a CCS
with a four-dimensional coding space of 128 three-dimensional groups and 94,140
defined character values, supported by numerous CES methods, the most popular of
which, on Linux, is the Unicode transformation format UTF-8.

7. Selected Reference

www.unicode.org
