
computer science technology · 22-01-2010, 03:34 PM


INTRODUCTION
UNICODE
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
Fundamentally, computers just deal with numbers. They store
letters and other characters by assigning a number for each one. Before
Unicode was invented, there were hundreds of different encoding systems
for assigning these numbers. No single encoding could contain enough
characters: for example, the European Union alone requires several
different encodings to cover all its languages. Even for a single
language like English, no single encoding was adequate for all the
letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is,
two encodings can use the same number for two different characters, or
use different numbers for the same character. Any given computer
(especially servers) needs to support many different encodings; yet
whenever data is passed between different encodings or platforms, that
data always runs the risk of corruption.
This paper is intended for software developers interested in
support for the Unicode standard in the Solaris™ 7 operating
environment. It discusses the following topics:
• An overview of multilingual computing, and how Unicode and the
internationalization framework in the Solaris 7 operating environment
work together to achieve this aim
• The Unicode standard and support for it within the Solaris
operating environment
• Unicode in the Solaris 7 Operating Environment
• How developers can add Unicode support to their applications
• Codeset conversions

UNICODE AND MULTILINGUAL COMPUTING
It is not a new idea that today's global economy demands global
computing solutions. Instant communications and the free flow of
information across continents - and across computer platforms -
characterize the way the world has been doing business for some time.
The widespread use of the Internet and the arrival of electronic
commerce (e-commerce) together offer companies and individuals a new
set of horizons to explore and master. In the global audience, there
are always people and businesses at work - 24 hours of the day, 7 days
a week. So global computing can only grow.
What is new is the increasing demand of users for a computing
environment that is in harmony with their own cultural and linguistic
requirements. Users want applications and file formats that they can
share with colleagues and customers an ocean away, application
interfaces in their own language, and time and date displays that they
understand at a glance. Essentially, users want to write and speak at
the keyboard in the same way that they always write and speak. Sun
Microsystems addresses these needs at various levels, bringing together
the components that make possible a truly multilingual computing
environment.
It begins with the internationalization framework in the
Solaris operating environment. Developers have different ways to
internationalize their applications to meet the requirements of
specific cultural regions. This framework continues by incorporating
the Unicode encoding standard, a standard that provides users and
developers with a universal codeset. Unicode is well-suited to
applications such as multilingual databases, electronic commerce, and
government research and reference. The Solaris 7 operating environment
supports multilingual computing with multiple character sets and
multiple cultural attributes.
MULTILINGUAL COMPUTING
The concept "multilingual" in practice takes different forms.
It is important to distinguish among the following types of
environments:
• Multilanguage
• Multiscript
• Multilingual
The movement from multilanguage to multiscript to multilingual
implies an increasing level of complexity in the underlying operating
environment.
Multilanguage Environment
A multilanguage environment means that a locale supports one
writing system, or script, and one set of cultural attributes. In this
environment, an application inherits all the language and cultural
attributes of the current locale, with text manipulated according to
the language rules of this current locale. Because the locale is
limited to supporting one writing script and one set of cultural
attributes, the application is also limited to creating documents
containing text in one script.
Thus, in a multilanguage environment, the user must launch a
separate instance of an application in different locales for the
application to take advantage of differing language and cultural
attributes. For example, if a user is using the English-based
operating environment and wishes to create a document containing
Chinese characters, the user must first set up the Chinese locale and
then launch the application to begin creating Chinese content.
If the user then wishes to enter Russian text, the Russian
locale must be set up and another instance of the application launched
- this time within the Russian locale - for the user to generate
Russian text. In this environment, the Chinese and Russian text cannot
be mixed and the user must alternate between locales to create text of
different scripts.
Multiscript Environment
A multiscript environment means that a locale may support more
than one script, but the locale is still associated with only one set
of cultural attributes.
In this environment, an application can create a document with text in
multiple scripts. However, the application must tag or otherwise mark
each separate run of text of the same script (script run) to apply the
appropriate language attributes for proper input and display.

Note - In a Unicode-enabled locale, tagging script runs is not
necessary because all language attributes are inherent in the Unicode
codeset.

The multiscript environment supports text written in multiple
scripts but is still limited to one set of cultural attributes. This
means, for example, that text is sorted according to the sorting rules
of the current locale. To expand on the Chinese example from the
multilanguage section, in a multiscript environment, rather than create
two separate documents, the user can now create one multiscript
document containing both Chinese and Russian text. The cultural
attributes of the active locale still apply. Therefore, if the user is
in the Chinese locale, the Chinese sorting rules (for example) will be
applied to the mixed script text.

Multilingual Environment
A multilingual environment means that a locale can support
multiple scripts and multiple cultural attributes. In this environment,
an application can transparently make use of more than one set of
language and cultural attributes within a single locale. In this case,
an application can create a document in multiple
scripts, and because the application has access to multiple cultural
attributes, it has greater control over how text is manipulated. For
example, a document containing text in multiple scripts can sort text
according to its script rather than the sort order of the current
locale. In a Unicode-enabled locale, the application can eliminate the
step of tagging script runs.
To extend the Chinese multiscript document example further to a
multilingual environment, you are no longer constrained to applying
only the sorting rules of the active Chinese locale. In a multilingual
environment, you can apply the sorting rules of the Chinese locale to
the Chinese portion of the multiscript text, and then call upon the
Russian sorting rules to apply to the Russian portion.
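As a rough illustration of applying a chosen locale's sorting rules to one portion of a text, the following C sketch sorts an array of strings with strcoll() under a named LC_COLLATE locale and then restores the previous setting. The function name sort_with_locale is hypothetical, and any locale name passed to it must be one that is actually installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders two strings by the collation rules of the
   currently active LC_COLLATE locale. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sorts `count` strings with the collation rules of `locale_name`.
   Returns 0 on success, -1 if the locale cannot be activated.
   The previous LC_COLLATE setting is restored before returning. */
int sort_with_locale(const char **items, size_t count, const char *locale_name)
{
    char saved[256];
    const char *current;

    assert(items != NULL && locale_name != NULL);

    /* Copy the current setting: the buffer returned by setlocale()
       may be overwritten by the next call. */
    current = setlocale(LC_COLLATE, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;

    qsort(items, count, sizeof items[0], compare_collated);

    setlocale(LC_COLLATE, saved);
    return 0;
}
```

An application could call this once for the Chinese strings with a Chinese locale name and once for the Russian strings with a Russian locale name. Note that setlocale() changes process-wide state, so this sketch is not suitable for multithreaded code as written.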
The multilingual environment brings you closest to the ideal of
multilingual computing. An application can make use of locale data from
any number of locales, while at the same time allowing the user to
easily manipulate text in a variety of scripts. Every user can
communicate and work in his or her own language and still understand
and be understood by other users anywhere in the world.

Software Internationalization
Sun Microsystems defines the following levels at which an
application can support a customer's international needs:
• Internationalization
• Localization
Software internationalization is the process of designing and
implementing software to transparently manage different cultural and
linguistic conventions without additional modification. The same binary
copy of an application should run on any localized version of the
Solaris operating environment, without requiring source code changes or
recompilation. Software localization is the process of adding language
translation (including text messages, icons, buttons, and so on),
cultural data, and components (such as input methods and spell
checkers) to a product to meet regional market requirements.
The Solaris operating environment is an example of a product
that supports both internationalization and localization. The Solaris
operating environment is a single internationalized binary that is
localized into various languages such as French, Japanese, and Chinese,
and supports the associated cultural and language conventions of each
language.
When properly designed, applications can easily accommodate a
localized interface without extensive modification. One suggestion for
creating easy-to-localize software is to first internationalize the
software and then encapsulate the language- and cultural-specific
elements in a locale-specific database or file. This greatly simplifies
the localization process, should a developer choose to localize in the
future.
At a minimum, Sun Microsystems strongly encourages developers
to internationalize their software. In this way, their applications can
run on any localized version of the Solaris operating environment. As a
result, such an application can easily manage the user's language and
cultural preferences.
Internationalization Framework
A major aspect of developing a properly internationalized
application is to separate language and culturally-specific information
from the rest of the application code. The internationalization
framework in the Solaris operating environment uses the following
concepts to fulfill this aim:
• Locales
• Localizable interface
• Codeset independence
A locale is a set of language and cultural data that is
dynamically loaded into memory at runtime. Users can set the cultural
aspects of their local work environment by setting specific variables
in a locale. These settings are then applied to the operating system
and to subsequent application launches.
The Solaris operating environment includes APIs for developers
to access directly the cultural data of the current locale. For
example, an application does not need to hard-code the currency symbol
for a particular region. When the application calls the appropriate
system API, it receives the currency symbol defined by the locale that
the user has specified. Applications can run in any locale without
having special knowledge of the cultural or language information
associated with the locale.
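The currency example can be sketched with the standard C localeconv() call, which returns the formatting data of the active locale. The helper name format_price is hypothetical. The sketch only retrieves the symbol and places it before the amount; real monetary formatting also depends on locale fields such as symbol placement, which this sketch ignores.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Writes `amount` into `out`, prefixed with the currency symbol of the
   active LC_MONETARY locale. Returns the snprintf() result, that is,
   the number of characters the full text requires. */
int format_price(char *out, size_t out_size, double amount)
{
    const struct lconv *lc;

    assert(out != NULL);

    /* No currency symbol is hard-coded: it comes from the locale.
       In the "C" locale the symbol is an empty string. */
    lc = localeconv();
    return snprintf(out, out_size, "%s%.2f", lc->currency_symbol, amount);
}
```

An application would typically call setlocale(LC_ALL, "") once at startup so that the user's environment selects the locale before format_price() is used.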
Creating a localizable interface means accounting for the
variations that take place when an interface is translated into another
language. In addition, the Solaris operating environment provides
messaging APIs and utilities that can be used to collect, generate, and
use messages in applications. Codeset independence means designing
applications that do not make assumptions about the underlying codeset.
For example, text-handling routines should not define in advance the
size of the character codeset being manipulated.
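One way to avoid assumptions about character size is to let the C library determine where each character ends. The following sketch counts characters, not bytes, with mbrlen(), so the same code works whether the active codeset is single-byte or multibyte. The function name count_characters is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Counts the characters in a multibyte string encoded in the codeset of
   the active LC_CTYPE locale. Returns -1 if the string contains an
   invalid or truncated multibyte sequence. */
long count_characters(const char *text)
{
    mbstate_t state;
    size_t remaining;
    long count = 0;

    assert(text != NULL);

    memset(&state, 0, sizeof state);   /* initial conversion state */
    remaining = strlen(text);

    while (remaining > 0) {
        /* mbrlen() reports how many bytes the next character occupies. */
        size_t len = mbrlen(text, remaining, &state);
        if (len == (size_t)-1 || len == (size_t)-2)
            return -1;
        if (len == 0)
            break;                     /* embedded null character */
        text += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

Nothing in the loop depends on a fixed character width; only the locale decides how many bytes each character uses.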
Supporting the Unicode Standard
Unicode, or Universal Codeset, is a universal character
encoding scheme developed and promoted by the Unicode Consortium, a
non-profit organization that includes Sun Microsystems. The Unicode
standard encompasses most alphabetic, ideographic, and symbolic
characters used on computers today.
Using this one codeset enables applications to support text
from multiple scripts in the same documents without elaborate marking
of text runs. At the same time, applications need to treat Unicode as
just another codeset - that is, apply the principle of codeset
independence to Unicode as well. Unicode locales in the Solaris
operating environment are called the same way, and function the same
way, as all other locales. These locales provide the extra benefits
that the Unicode codeset brings to the work environment, including the
ability to create text in multiple scripts without having to switch
locales. Sun Microsystems provides the same level of Unicode locale
support for both 32-bit and 64-bit Solaris environments.

Benefits of Unicode
Support for the Unicode standard provides many benefits to
application developers. These benefits include:
• Global source and binary
• Support for mixed-script computing environments
• Reduced time-to-market for localized products
• Expanded market access
• Improved cross-platform data interoperability through a common
codeset
• Space-efficient encoding scheme for data storage
Unicode is a building block that designers and engineers can
use to create truly global applications. By making use of one flat
codeset, end-users can exchange data more freely without relying on
elaborate code conversions to make characters comprehensible. In
adopting the internationalization framework in the Solaris operating
environment, Unicode can be thought of as "just another codeset". By
following the concepts of codeset-independent design, applications will
be able to handle different codesets without the need for extensive
code rework to support specific languages.

THE UNICODE STANDARD
About Unicode
On most computer systems supporting writing systems such as
Roman or Cyrillic, user input (usually via keypresses) is converted
into character codes that are stored in memory. These stored character
codes are then converted into glyphs of a particular font before being
passed to the application for display and printing. Each locale has one
or more codesets associated with it. A codeset includes the coded
representation of characters used in a particular language. Codesets
may use one byte per character (for alphabetic languages) or two or
more bytes per character (for ideographic languages). Each codeset
assigns its own code-point values to each character, without any
inherent relationship to other codesets. That is, a code-point value
that represents the letter 'a' in a Roman
character set will represent another character entirely in the Cyrillic
or Arabic system, and may not represent anything in an ideographic
system.
In Unicode, every character, symbol, and ideograph has its own
unique character code. As a result, there is no overlap or confusion
between the code-point values of different codesets. There is, in fact,
no need to define multiple codesets because each character code used in
each writing system finds its place in the Unicode scheme. Unicode
includes not only characters of the world's languages, but also
includes characters such as publishing characters, mathematical and
technical symbols, and punctuation characters.
Unicode version 2.1 contains alphabetic characters from languages
including Anglo-Saxon, Russian, Arabic, Greek, Hebrew, Thai, and
Sanskrit. It also contains ideographic characters in the unified Han
subset defined by national and industry standards for China, Japan,
Korea, and Taiwan.
Coded Representations of Unicode
In recent years, the Unicode Consortium and other related
organizations have developed different formats for representing and
storing a Unicode codeset. The ISO/IEC International Standard 10646-1
(commonly referred to as 10646) defines the Universal Multiple-Octet
Coded Character Set (UCS) format for representing characters from all
the significant world languages in multibyte format. The 10646
specification contains the following basic forms for representing
characters:
• Universal Coded Character Set-2 (UCS-2) - Also known as Basic
Multilingual Plane (BMP). Characters encoded in two bytes on a single
plane.
• Universal Coded Character Set-4 (UCS-4) - Characters encoded in
four bytes on multiple planes and multiple groups.
• UCS Transformation Format 16-bit form (UTF-16) - Extended
variant of UCS-2 with characters encoded in 2-4 bytes.
• UCS Transformation Format 8-bit form (UTF-8) - A transformation
format using characters encoded in 1-6 bytes.
UCS-2 defines a 64K coding space, or BMP, for representing
character codes in a two-octet row and cell format. The row and cell
octets designate the cell location of a particular character code
within a 256 by 256 (00-FF) plane. UCS-4 defines a four-octet coding
space divided into four units: group, plane, row, and cell. The row and
cell octets designate the cell location of a particular
character code within a plane. The plane octet designates the plane
number (00-FF), and the group octet the group number (00-7F) to which
the plane belongs. In total, there are 128 groups of 256 planes each.
Figure 1 illustrates the UCS-2 and UCS-4 coding schemes.

Figure 1: UCS-2 and UCS-4 coding schemes
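The four-octet layout described above can be made concrete with a small C sketch that splits a UCS-4 value into its group, plane, row, and cell octets. The type and function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The four octets of a UCS-4 code, most significant first. */
struct ucs4_octets {
    uint8_t group;   /* 00-7F */
    uint8_t plane;   /* 00-FF */
    uint8_t row;     /* 00-FF */
    uint8_t cell;    /* 00-FF */
};

/* Splits a UCS-4 value into group, plane, row, and cell. */
struct ucs4_octets ucs4_split(uint32_t code)
{
    struct ucs4_octets o;

    /* UCS-4 is a 31-bit space: the top bit of the group octet is 0. */
    assert(code <= 0x7FFFFFFFu);

    o.group = (uint8_t)((code >> 24) & 0x7F);
    o.plane = (uint8_t)((code >> 16) & 0xFF);
    o.row   = (uint8_t)((code >> 8) & 0xFF);
    o.cell  = (uint8_t)(code & 0xFF);
    return o;
}
```

A UCS-2 code is the special case in which group and plane are both 00, leaving only the row and cell octets of the BMP.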
In addition to the UCS forms of 10646, the Unicode standard
also defines another form called UTF, or UCS Transformation Format. One
version of UTF is an extended UCS-2 encoding form designed to include
characters from outside the BMP 64K coding space. This form was first
called UCS-2E (extended UCS-2), but is now known as UTF-16 (UCS
Transformation Format 16-bit form). The UTF-16 form translates a range
of UCS-4 codes into a two-octet encoded string. It does this by
reserving an area of codes in the BMP coding space for mapping to and
from 16 planes of group 00 of UCS-4. Each plane is assigned a certain
set of code positions in the two-octet UCS-2 scheme. Specifically,
Planes 01 to 0E (14 planes, or 14 x 65,536 = 917,504 characters) are
reserved for standard encodings and Planes 0F and 10 (2 planes, or
2 x 65,536 = 131,072 characters) are reserved for private use.
Although UCS-4 and UTF-16 provide comprehensive ways of
representing several character sets, they do not preserve the byte
values for ASCII characters. All UNIX® systems, because they are based
on an ASCII kernel, reserve certain character codes for I/O operations,
such as the null character as a string terminator, the slash (/)
character as a path name separator, and the DEL and SPACE control
characters.
To circumvent this problem, another version of UTF was devised,
called FSS-UTF (File System Safe-UTF), or now commonly known as UTF-8.
UTF-8 is an encoding scheme that maps the entire UCS-4 character set to
a series of single-octet and multi-octet strings. In this scheme, the
most significant bit is 0 for ASCII characters, and 1 for all other
characters. The ASCII character range is contained in a single-byte
encoding, and all other characters in a range from 2 up to 6-byte
encoding. Table 1 describes the UTF-8 encoding scheme.
Bits  Hex Min   Hex Max   UTF-8 Binary Encoding
7     00000000  0000007F  0xxxxxxx
11    00000080  000007FF  110xxxxx 10xxxxxx
16    00000800  0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
21    00010000  001FFFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
26    00200000  03FFFFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
31    04000000  7FFFFFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
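The rows of Table 1 translate fairly directly into code. The following C sketch encodes one UCS-4 value with the 1- to 6-byte scheme shown in the table; the function name utf8_encode is hypothetical. Note that the current UTF-8 definition (RFC 3629) restricts the encoding to at most 4 bytes and to code points up to 10FFFF, so the 5- and 6-byte forms are included here only to mirror the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one UCS-4 value following Table 1. Writes 1 to 6 bytes to
   `out` and returns the byte count, or 0 if `code` exceeds 31 bits. */
size_t utf8_encode(uint32_t code, unsigned char out[6])
{
    size_t len, i;
    unsigned char lead;

    assert(out != NULL);

    if (code <= 0x7F) {                /* ASCII: stored unchanged */
        out[0] = (unsigned char)code;
        return 1;
    }
    else if (code <= 0x7FF)      { len = 2; lead = 0xC0; }  /* 110xxxxx */
    else if (code <= 0xFFFF)     { len = 3; lead = 0xE0; }  /* 1110xxxx */
    else if (code <= 0x1FFFFF)   { len = 4; lead = 0xF0; }  /* 11110xxx */
    else if (code <= 0x3FFFFFF)  { len = 5; lead = 0xF8; }  /* 111110xx */
    else if (code <= 0x7FFFFFFF) { len = 6; lead = 0xFC; }  /* 1111110x */
    else return 0;

    /* Fill the 10xxxxxx continuation bytes from the end, six payload
       bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    out[0] = (unsigned char)(lead | code);
    return len;
}
```

For example, the euro sign (hex 20AC) falls in the 16-bit row of the table and is encoded as the three bytes E2 82 AC.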
Note - The UTF-8 scheme does not use any ASCII byte values in
its 2- to 6-byte sequences, and ASCII characters remain single bytes
within the new byte structure. Thus, UTF-8 is compatible with all
legacy file systems and other systems that parse for ASCII bytes, while
UCS-2/UTF-16 and UCS-4 are not compatible with ASCII. This is
important because it allows applications that support Unicode to use
existing data in ASCII format without applying a conversion utility. In
addition, there is support within the Internet community for adopting
UTF-8 as the encoding standard for the Internet.
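The property described in the note can be checked byte by byte. In the C sketch below, utf8_classify() sorts a byte into the categories implied by Table 1, and last_path_separator() shows why a plain byte search for the slash character stays safe on UTF-8 path names. Both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum utf8_byte_kind {
    UTF8_ASCII,          /* 0xxxxxxx: a complete ASCII character */
    UTF8_LEAD,           /* first byte of a 2- to 6-byte sequence */
    UTF8_CONTINUATION,   /* 10xxxxxx: later byte of a sequence */
    UTF8_INVALID         /* FE and FF never occur in this scheme */
};

/* Classifies a single byte according to the bit patterns of Table 1. */
enum utf8_byte_kind utf8_classify(unsigned char byte)
{
    if (byte < 0x80)
        return UTF8_ASCII;
    if ((byte & 0xC0) == 0x80)
        return UTF8_CONTINUATION;
    if (byte <= 0xFD)
        return UTF8_LEAD;
    return UTF8_INVALID;
}

/* Every byte of a multibyte sequence has its top bit set, so a byte
   equal to '/' can only be a real path separator. */
const char *last_path_separator(const char *path)
{
    assert(path != NULL);
    return strrchr(path, '/');
}
```

The same search on UCS-2 or UCS-4 data could match the byte value of the slash inside an unrelated character, which is the incompatibility the note refers to.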
In addition to its backward compatibility with 7-bit ASCII,
UTF-8 is a space-efficient encoding scheme when each character of the
encoded data needs only one byte (as for English and other Roman
character-based writing systems). Because UTF-8 stores one-byte data as
one byte rather than, for example, the two bytes required by UTF-16,
this can significantly decrease the storage space required to hold
large blocks of international data.
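The storage comparison can be worked through with a short C sketch that totals the bytes a sequence of code points needs in each format. The function names are hypothetical, and the UTF-8 lengths follow Table 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one code point in UTF-8 (1 to 6, per Table 1). */
static size_t utf8_length(uint32_t code)
{
    if (code <= 0x7F)      return 1;
    if (code <= 0x7FF)     return 2;
    if (code <= 0xFFFF)    return 3;
    if (code <= 0x1FFFFF)  return 4;
    if (code <= 0x3FFFFFF) return 5;
    return 6;
}

/* Bytes needed for one code point in UTF-16: one 2-byte unit inside
   the BMP, two units (4 bytes) outside it. */
static size_t utf16_length(uint32_t code)
{
    return code <= 0xFFFF ? 2 : 4;
}

/* Total UTF-8 storage for `count` code points. */
size_t utf8_total(const uint32_t *codes, size_t count)
{
    size_t i, total = 0;
    assert(codes != NULL);
    for (i = 0; i < count; i++)
        total += utf8_length(codes[i]);
    return total;
}

/* Total UTF-16 storage for `count` code points. */
size_t utf16_total(const uint32_t *codes, size_t count)
{
    size_t i, total = 0;
    assert(codes != NULL);
    for (i = 0; i < count; i++)
        total += utf16_length(codes[i]);
    return total;
}
```

For the five ASCII letters of "Hello" this gives 5 bytes in UTF-8 against 10 in UTF-16. The saving applies to one-byte data only: a BMP ideograph takes 3 bytes in UTF-8 and 2 in UTF-16.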
Because of its flexibility, and its compatibility with ASCII and
the UNIX operating environment, UTF-8 is the format on which Unicode
support in the Solaris operating environment is based. UTF-8 provides
developers with a format that is compatible with existing
internationalized environments and provides an easy path for Internet
and legacy data interoperability. As a file system safe format, UTF-8
supports one-byte unit I/O operations and can represent the Unicode
formats UCS-2 and UCS-4. UTF-8 also fits well within the XPG
internationalization framework.

UNICODE IN THE SOLARIS OPERATING ENVIRONMENT
The Solaris operating environment provides support for Unicode
through the use of several Unicode locales. The Unicode locales provide
a framework for developing applications with multiple-script support
for any country or region of the world. Solaris also benefits from the
advantages that Unicode brings to creating global applications.
Properly internationalized applications require no changes to support
the Unicode locales. All internationalized CUI and GUI utilities and
commands in the Solaris operating environment are available for use
with the Unicode locales without modification.
All Unicode locales in the Solaris operating environment are based on
the UTF-8 format. Each locale includes a base language in the UTF-8
codeset and the regional data related to this base language and its
cultural conventions. This includes local formatting rules, text
messages, help messages, and other related files. Each locale also
supports several other scripts for input, display, code conversion, and
printing.
Unicode UTF-8 en_US.UTF-8 Locale
en_US.UTF-8 is the flagship Unicode locale in the Solaris 7
operating environment. The en_US.UTF-8 locale is an American English-
based locale with multiscript processing support for characters of many
different languages. Features of all Unicode locales include enhanced
input modes and input mode switching, support for MIME character sets
in DtMail, expanded iconv code conversions, and a PostScript print
filter.
The en_US.UTF-8 locale supports multiscript processing, with
eight input modes: English/European, Cyrillic, Greek, Arabic, Hebrew,
Thai, and for the Asian languages, the Unicode hexadecimal code input
method and the Table lookup input method. The end user can input
characters from any combination of these scripts and from the entire
Unicode coding space.
End users switch between input modes using the Compose key or a
Control key sequence. The Arabic, Hebrew, and Thai input modes provide
full complex text layout features, including right-to-left display and
context-sensitive character rendering. The Unicode hexadecimal code
input method lets the user generate Unicode characters by typing in the
hexadecimal equivalents. The Table lookup input method is the easiest
method for non-native speakers to input characters of a foreign
language. This method provides a lookup window on the desktop for
choosing a script and then selecting characters of the script from the
available look-up table.
The UTF-8 locales provide a print utility for printing text
files. This utility can print flat text files written in UTF-8 using
bitmap fonts available on the system. Because the output from the
utility is standard PostScript, users can send the output to any
PostScript printer.
The en_US.UTF-8 locale supports various MIME character sets in
DtMail, including Latin, Greek, Cyrillic, Complex Text Layout, and
Asian character sets. With this support, users can send and receive e-
mail messages encoded in MIME character sets from almost any region of
the world (Figure 2). DtMail decodes received e-mail messages by
recognizing the MIME character set in use and the content transfer
encoding provided in the message. The user sending a message specifies
the MIME character set to use for the recipient mail user agent.

Figure 2: Using DtMail with multiple character sets
Codeset Conversion
The en_US.UTF-8 locale supports code conversions among the
major codesets of several countries. Figure 3 illustrates the available
codeset conversions between UTF-8 and other codesets under the Solaris
operating environment.

Figure 3: UTF-8 Codeset Conversions
Users can perform codeset conversions via a new graphical
Sdtconvtool utility or via the iconv(1) command. The Sdtconvtool
(Figure 4) detects all possible iconv code conversions available on the
user's system and presents them in an easy-to-use format.

Figure 4: Graphical Sdtconvtool for converting between codesets
Developers can use the iconv(3) function to access the same
functionality. This includes conversions between UTF-8 and many
ISO-standard and other codesets: UCS-2, UTF-16, UCS-4, UTF-7, KOI8-R,
Japanese EUC, Korean EUC, Simplified Chinese EUC, Traditional Chinese
EUC, GBK, PCK (Shift JIS), BIG5, Johab, ISO-2022-JP, ISO-2022-KR,
ISO-2022-CN, and so on.
See Appendix A for a detailed listing of the supported code
conversions.
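A minimal sketch of a conversion through the iconv interface might look like the following. The wrapper name convert_codeset is hypothetical. The codeset names accepted by iconv_open() differ between platforms, so the names an application passes in are assumptions that must be checked against the conversions installed on the target system. Some systems also declare the input argument of iconv() as const char **, in which case the cast below needs adjusting.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Converts `in_len` bytes of `in` from codeset `from` to codeset `to`.
   Returns the number of bytes written to `out`, or (size_t)-1 if the
   conversion is unavailable, the input is invalid, or `out` is too
   small. */
size_t convert_codeset(const char *from, const char *to,
                       const char *in, size_t in_len,
                       char *out, size_t out_size)
{
    iconv_t cd;
    char *in_ptr, *out_ptr;
    size_t in_left, out_left, rc;

    assert(from != NULL && to != NULL && in != NULL && out != NULL);

    /* Note the argument order: target codeset first, then source. */
    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    in_ptr = (char *)in;     /* iconv() does not modify the input bytes */
    out_ptr = out;
    in_left = in_len;
    out_left = out_size;

    rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t)-1) {
        /* Flush any shift sequence required by stateful codesets. */
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);
    }
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_size - out_left;
}
```

The iconv(1) command can be used to try a pair of codeset names interactively before relying on them in code.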
European Unicode Locales
The Solaris 7 operating environment provides five European
Unicode locales that offer the same level of support as en_US.UTF-8,
with modifications to regional definitions in keeping with the
requirements of each language and cultural area. The five other UTF-8
locales are fr.UTF-8 (French), de.UTF-8 (German), it.UTF-8 (Italian),
es.UTF-8 (Spanish) and sv.UTF-8 (Swedish). Each of these locales
contains the same feature set as en_US.UTF-8, with regional definitions
for national currency, date and time, numerical notation, and
translated text messages. These locales also support the new euro
monetary symbol and related conventions.

Figure 5: Euro symbol glyph
Asian Unicode Locales
The Solaris 7 operating environment for the Asian language
environment includes two UTF-8 locales: ja_JP.UTF-8 for the Japanese
Solaris 7 operating environment and ko.UTF-8 for the Korean Solaris 7
operating environment. These locales provide the same scope of support
as en_US.UTF-8 and the European Unicode locales. The ja_JP.UTF-8 locale
for Japanese language in the Solaris operating environment contains all
regional definitions and the same input methods as the other Japanese
locales. The Japanese language version of the Solaris 7 operating
environment provides enhanced code conversions for UTF-8, new Japanese
UCS-2/UTF-8 PostScript fonts, and support for Unicode codesets in the
User-Defined Character editor.
The ko_KR.UTF-8 locale for Korean language in the Solaris
operating environment contains all regional definitions and the same
input methods as other Korean locales. It also contains expanded code
conversion support and a Hanja Tool for dictionary editing.
Font Resources Under Unicode
Unicode version 2.1 supports a total of 38,887 characters from
various regions of the world, with over 20,000 of them ideographic
characters for the Chinese, Japanese, and Korean languages. The font
resources for representing these characters, however, do not always
operate in a one-to-one relationship. That is, some Unicode code points
have multiple, different glyphs associated with them. This is to ensure
that these specific code points can be rendered correctly based upon
the various contexts in which they may appear. This is especially the
case with the Asian languages, where, for example, the unified Han
(Chinese/Japanese/Korean ideographs) glyphs are written and displayed
differently among Simplified Chinese, Traditional Chinese, Japanese
kanji, and Korean hanja ideographs. To manage these
difficulties, the Solaris operating environment contains an output
method that combines existing fonts to form a Unicode font set, instead
of providing a single Unicode font. The Solaris 7 operating environment
supports the following range of scripts:
• Western/Eastern/Northern European scripts
• Greek, Turkish, Cyrillic
• Simplified Chinese, Traditional Chinese, Japanese, Korean
• Arabic, Hebrew, Thai
For European scripts, there is one-to-one direct mapping
between Unicode characters and corresponding glyphs. For text from
Complex Text Layout languages (Arabic, Hebrew, Thai), the Solaris
layout engines pre-process the text (that is, perform right-to-left
swapping, contextual analysis, and so on) before rendering the
associated glyphs. For Asian characters, the Solaris operating
environment output methods provide dynamic remapping of the font and
glyph index according to the locale definition. Each locale contains a
font table with mapping mechanisms that specify which font and glyph to
use for each character code. The mechanism remaps the Unicode code
point values to existing Chinese, Japanese, and Korean fonts and glyph
index pairs. A locale administrator can define the priority among
fonts, that is, which font is searched first. For example, the
mechanism may search the Simplified Chinese fonts first, then, if it
cannot find the appropriate glyph, search the Traditional Chinese
fonts, and so on.
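The priority search described above can be pictured with a small C sketch. It is purely illustrative: the structures, field names, and font names are hypothetical and do not reflect the actual format of the Solaris font tables. Each table entry maps a contiguous range of Unicode code points to glyph indexes in one font, and the entries are searched in priority order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical font-table entry: one font covering a contiguous range
   of Unicode code points. */
struct font_entry {
    const char *font_name;
    uint32_t first_code;     /* first code point covered */
    uint32_t last_code;      /* last code point covered */
    uint32_t first_glyph;    /* glyph index of first_code in the font */
};

/* Result of a lookup: which font to use and which glyph in it. */
struct glyph_ref {
    const char *font_name;
    uint32_t glyph_index;
};

/* Searches `table` in order, so earlier entries have higher priority.
   Returns 1 and fills `result` on a match, 0 if no font has a glyph. */
int lookup_glyph(const struct font_entry *table, size_t count,
                 uint32_t code, struct glyph_ref *result)
{
    size_t i;

    assert(table != NULL && result != NULL);

    for (i = 0; i < count; i++) {
        if (code >= table[i].first_code && code <= table[i].last_code) {
            result->font_name = table[i].font_name;
            result->glyph_index =
                table[i].first_glyph + (code - table[i].first_code);
            return 1;
        }
    }
    return 0;
}
```

Reordering the table entries changes which font wins when several fonts cover the same code point, which corresponds to the priority a locale administrator can define.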

UNICODE TECHNICAL CONSIDERATIONS
Internationalized Applications with Unicode
The use of the Unicode codeset enables developers to write
applications that support multiple scripts simultaneously. Depending on
the Unicode locale being used, the user will have access to one or
several additional scripts, along with the base language script, for
input, display, and printing. This way, network environments with
distributed applications can provide individual users access to
different language environments simultaneously.
By itself, using Unicode in an application does not mean that
the application is fully internationalized. If, for example, an
application customizes data handling for Unicode directly, then it
needs to provide codeset converters as wrappers to support a codeset
other than Unicode.
This approach is direct Unicode localization - not
internationalization. With direct localization, developers may localize
applications in a manner that duplicates or conflicts with the
localization provided by the operating system. In addition, an
application may assume that all characters are represented in two-octet
cells, which conflicts with UTF-8.
To properly internationalize an application, use the following
guidelines:
1. Avoid accessing Unicode data directly. This is a task of the
platform's internationalization framework.
2. Use the POSIX model for multibyte and wide-character
interfaces. See the following section, "Unicode Application
Interfaces."
3. Only call the APIs that the internationalization framework
provides for language and cultural-specific operations. All POSIX, X11,
Motif, and CDE interfaces are available to Unicode locales.
4. Remain codeset-independent.
Unicode Application Interfaces
When internationalizing applications for Unicode, it is
recommended that developers use the POSIX or X Window model. These
models define two sets of interfaces, one for multibyte characters and
one for wide characters, without specifying the encoding methods to
use. Standard multibyte codesets contain characters of varying widths,
from one to several bytes, so that each character is represented in the
fewest bytes possible. Because multibyte codesets contain characters of
varying widths, they are not convenient to process with standard
functions. Wide-character representations give every character the same
width, which makes them easier to process.
The Unicode codeset provides the necessary format for both
multibyte and wide-character representation. In the Solaris Unicode
locales, multibyte interfaces use the UTF-8 representation of the
character sets and wide-character interfaces use the UCS-4
representation.
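The usual pattern with these two sets of interfaces is to convert multibyte text to wide characters, process it one fixed-width character at a time, and convert it back. The C sketch below does this to uppercase a string. The function name uppercase_multibyte is hypothetical, and the code never names UTF-8 or UCS-4: the active locale decides both representations.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Uppercases the multibyte string `in` using the rules of the active
   locale and stores the result in `out`. Returns 0 on success, -1 on
   an invalid sequence, allocation failure, or insufficient space. */
int uppercase_multibyte(const char *in, char *out, size_t out_size)
{
    size_t n, i, written;
    wchar_t *wide;

    assert(in != NULL && out != NULL);

    /* First pass: count the wide characters needed. */
    n = mbstowcs(NULL, in, 0);
    if (n == (size_t)-1)
        return -1;

    wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL)
        return -1;
    mbstowcs(wide, in, n + 1);         /* multibyte to wide */

    /* Each element is exactly one character, so a simple loop works. */
    for (i = 0; i < n; i++)
        wide[i] = (wchar_t)towupper((wint_t)wide[i]);

    written = wcstombs(out, wide, out_size);   /* wide to multibyte */
    free(wide);

    /* A result equal to out_size means there was no room for the
       terminating null byte. */
    if (written == (size_t)-1 || written >= out_size)
        return -1;
    return 0;
}
```

As with the other locale-dependent interfaces, the application is expected to call setlocale() at startup so that the user's locale, Unicode or not, is in effect.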
Font Resources
Properly internationalized applications require only a few
changes to run properly in the Solaris Unicode locales. One change that
is required is to set the proper resource definitions for font sets
(FontSet) or font list (XmFontList) in the application's resource file.
The en_US.UTF-8 locale supports the following set of font
character sets as the FontSet:
• ISO 8859-1 (Latin-1)
• ISO 8859-2 (Latin-2)
• ISO 8859-4 (Latin-4)
• ISO 8859-5 (Latin/Cyrillic)
• ISO 8859-7 (Latin/Greek)
• ISO 8859-9 (Latin-5)
• ISO 8859-15 (Latin-9)
• BIG5 (Traditional Chinese)
• GB 2312-1980 (Simplified Chinese)
• JIS X0201-1976, JIS X0208-1983 (Japanese)
• KS C 5601-1992 Annex 3 (Korean)
• ISO 8859-6 based one (Arabic)
• ISO 8859-8 (Hebrew)
• TIS 620-2533 based one (Thai)
Setting Resource Definitions
To create a font set for an application, the resource
definition should contain the complete set of fonts supported by the
Unicode locale. For example:
/* Base font name list: one XLFD pattern per character set of the
 * locale. Adjacent string literals are joined by the compiler. */
fs = XCreateFontSet(display,
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-1,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-2,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-4,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-5,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-6,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-7,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-8,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-9,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-15,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-big5-1,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-gb2312.1980-0,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0201.1976-0,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0208.1983-0,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-ksc5601.1992-3,"
    "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-tis620.2533-0",
    &missing_ptr, &missing_count, &def_string);
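Because every entry in the font set example above shares the same
prefix and differs only in its character set suffix, the list can also
be assembled at run time. The following is a sketch only: the helper
name build_font_list and its parameters are hypothetical, and it does
plain string handling without calling Xlib.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build a comma-separated base font name list by appending each
 * charset suffix to a common XLFD prefix. Returns a malloc'ed string
 * (empty when n is 0) or NULL if memory runs out. */
char *build_font_list(const char *prefix,
                      const char *const charsets[], size_t n)
{
    size_t i;
    size_t total = 1;                  /* terminating NUL */
    char *list;
    char *p;

    assert(prefix != NULL && charsets != NULL);

    /* Each entry needs prefix + suffix + one separator at most. */
    for (i = 0; i < n; i++)
        total += strlen(prefix) + strlen(charsets[i]) + 1;

    list = malloc(total);
    if (list == NULL)
        return NULL;

    p = list;
    *p = '\0';
    for (i = 0; i < n; i++)
        p += sprintf(p, "%s%s%s", i ? "," : "", prefix, charsets[i]);

    return list;
}
```

The resulting string could be passed as the second argument of
XCreateFontSet() and released with free() afterwards, with a suffix
array such as "iso8859-1", "iso8859-2", and so on.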
The XmFontList resource definition of an application should
also include all fonts for every character set supported by the locale.
For example:
!
! This is an example XmNFontList definition for en_US.UTF-8 locale:
*fontList:\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-1;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-2;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-4;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-5;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-6;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-7;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-8;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-9;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-15;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-big5-1;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-gb2312.1980-0;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0201.1976-0;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0208.1983-0;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-ksc5601.1992-3;\
-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-tis620.2533-0:
CONCLUSION
UNICODE IS CHANGING ALL THAT
Unicode provides a unique number for every character, no matter
what the platform, no matter what the program, no matter what the
language. The Unicode Standard has been adopted by such industry
leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun,
Sybase, Unisys and many others. Unicode is required by modern standards
such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc.,
and is the official way to implement ISO/IEC 10646. It is supported in
many operating systems, all modern browsers, and many other products.
The emergence of the Unicode Standard, and the availability of tools
supporting it, are among the most significant recent global software
technology trends.
Incorporating Unicode into client-server or multi-tiered
applications and websites offers significant cost savings over the use
of legacy character sets. Unicode enables a single software product or
a single website to be targeted across multiple platforms, languages
and countries without re-engineering. It allows data to be transported
through many different systems without corruption.

REFERENCES
• Unicode Consortium
http://unicode
• Sun Microsystems, Inc.
http://wwws.sun
• Sun White Papers
http://wwws.sunsoftware/whitepapers.html
• Sun Online Documentation
http://docs.sun
• Sun Developer Programs
http://wwws.sundevelopers/

ACKNOWLEDGEMENT
I express my sincere gratitude to Dr. Agnisarman Namboodiri, Head of
Department of Information Technology and Computer Science, for his
guidance and support in shaping this paper in a systematic way.
I am also greatly indebted to Mr. Saheer H. and
Ms. S.S. Deepa, Department of IT, for their valuable suggestions in the
preparation of the paper.
In addition, I would like to thank all staff members of the IT department
and all my friends of S7 IT for their suggestions and constructive
criticism.

CONTENTS
1. INTRODUCTION
2. UNICODE AND MULTILINGUAL COMPUTING
3. THE UNICODE STANDARD
4. UNICODE IN THE SOLARIS OPERATING ENVIRONMENT
5. UNICODE TECHNICAL CONSIDERATIONS
6. CONCLUSION
7. REFERENCES
