ISO/IEC 10646-1:1993
Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
Technologies de l'information — Jeu universel de caractères codés à plusieurs octets — Partie 1: Architecture et table multilingue
General Information
- Status: Withdrawn
- Publication Date: 19-May-1993
- Withdrawal Date: 19-May-1993
- Technical Committee: ISO/IEC JTC 1/SC 2 - Coded character sets
- Drafting Committee: ISO/IEC JTC 1/SC 2/WG 2 - Universal coded character set
- Current Stage: 9599 - Withdrawal of International Standard
- Start Date: 05-Oct-2000
- Completion Date: 30-Oct-2025
Relations
- Effective Date
- 06-Jun-2022
Frequently Asked Questions
ISO/IEC 10646-1:1993 is a standard published jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Its full title is "Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane".
ISO/IEC 10646-1:1993 is classified under the following ICS (International Classification for Standards) categories: 35.040 - Information coding; 35.040.10 - Coding of character sets. The ICS classification helps identify the subject area and facilitates finding related standards.
ISO/IEC 10646-1:1993 is linked to the following amendments: ISO/IEC 10646-1:1993/Amd 3:1996, ISO/IEC 10646-1:1993/Amd 8:1997, ISO/IEC 10646-1:1993/Amd 12:1998, ISO/IEC 10646-1:1993/Amd 6:1997, ISO/IEC 10646-1:1993/Amd 16:1998, ISO/IEC 10646-1:1993/Amd 11:1998, ISO/IEC 10646-1:1993/Amd 23:1999, ISO/IEC 10646-1:1993/Amd 2:1996, ISO/IEC 10646-1:1993/Amd 1:1996, ISO/IEC 10646-1:1993/Amd 21:1999, ISO/IEC 10646-1:1993/Amd 7:1997, ISO/IEC 10646-1:1993/Amd 17:1999, ISO/IEC 10646-1:1993/Amd 10:1998, ISO/IEC 10646-1:1993/Amd 19:1998, ISO/IEC 10646-1:1993/Amd 18:1999. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.
Standards Content (Sample)
INTERNATIONAL STANDARD
ISO/IEC 10646-1
First edition
1993-05-01
Information technology - Universal Multiple-Octet Coded Character Set (UCS) -
Part 1: Architecture and Basic Multilingual Plane
Technologies de l'information - Jeu universel de caractères codés à plusieurs octets -
Partie 1: Architecture et table multilingue
Reference number
ISO/IEC 10646-1:1993(E)
Contents
1 Scope
2 Conformance
3 Normative references
4 Definitions
5 General structure of the UCS
6 Basic structure and nomenclature
7 Special features of the UCS
8 The Basic Multilingual Plane
9 Other planes
10 The Restricted Use zone
11 Private Use groups and planes
12 Revision and updating of the UCS
13 Subsets
14 Coded representation forms of the UCS
15 Implementation levels
16 Use of control functions with the UCS
17 Declaration of identification of features
18 Structure of the code tables and lists
19 Block names
20 Characters in bi-directional context
21 Special characters
22 Order of characters
23 Combining characters
24 Hangul syllable composition method
25 Code tables and lists of character names
26 CJK unified ideographs
Annexes
A Collections of graphic characters for subsets
B List of combining characters
C Mirrored characters in Arabic bi-directional context
D Alternate format characters
© ISO/IEC 1993
All rights reserved. No part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from the publisher.
ISO/IEC Copyright Office • Case Postale 56 • CH-1211 Genève 20 • Switzerland
Printed in Switzerland
E Alphabetically sorted list of character names
F The use of “signatures” to identify UCS
G UCS transformation format (UTF-1)
H Recommendation for combined receiving/originating devices with internal storage
J Notations of octet value representations
K Character naming guidelines
L Sources of characters
M External references to character repertoires
N Scripts under consideration for future editions of ISO/IEC 10646
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75 % of the national bodies casting a vote.
International Standard ISO/IEC 10646-1 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Sub-Committee SC 2, Character sets and information coding.
ISO/IEC 10646 consists of the following parts, under the general title Information technology - Universal Multiple-Octet Coded Character Set (UCS):
- Part 1: Architecture and Basic Multilingual Plane
Additional parts will specify other planes.
Annexes A and B form an integral part of this part of ISO/IEC 10646. Annexes C to N are for information only.
Introduction
ISO/IEC 10646 specifies the Universal Multiple-Octet Coded Character Set (UCS). It is applicable to the representation, transmission, interchange, processing, storage, input and presentation of the written form of the languages (scripts) of the world as well as additional symbols.
This part of ISO/IEC 10646 specifies the overall architecture and the Basic Multilingual Plane (BMP) of the UCS.
INTERNATIONAL STANDARD ISO/IEC 10646-1:1993 (E)
Information technology - Universal Multiple-Octet
Coded Character Set (UCS) -
Part 1:
Architecture and Basic Multilingual Plane
1 Scope
ISO/IEC 10646 specifies the Universal Multiple-Octet Coded Character Set (UCS). It is applicable to the representation, transmission, interchange, processing, storage, input and presentation of the written form of the languages of the world as well as additional symbols.
This part of ISO/IEC 10646 specifies the overall architecture, and
- defines terms used in ISO/IEC 10646;
- describes the general structure of the coded character set;
- specifies the Basic Multilingual Plane (BMP) of the UCS, and defines a set of graphic characters used in scripts and the written form of languages on a world-wide scale;
- specifies the names for the graphic characters of the BMP, and the coded representations;
- specifies the four-octet (32-bit) canonical form of the UCS: UCS-4;
- specifies a two-octet (16-bit) BMP form of the UCS: UCS-2;
- specifies the coded representations for control functions;
- specifies the management of future additions to this coded character set.
The UCS is a coding system different from that specified in ISO 2022. The method to designate UCS from ISO 2022 is specified in 17.2.
2 Conformance
2.1 General
Whenever Private Use characters are used as specified in ISO/IEC 10646, the characters themselves shall not be covered by these conformance requirements.
2.2 Conformance of information interchange
A coded-character-data-element (CC-data-element) within coded information for interchange is in conformance with ISO/IEC 10646 if
a) all the coded representations of graphic characters within that CC-data-element conform to clauses 6 and 7, to an identified form chosen from clause 14, and to an identified implementation level chosen from clause 15;
b) all the graphic characters represented within that CC-data-element are taken from those within an identified subset (clause 13);
c) all the coded representations of control functions within that CC-data-element conform to clause 16.
A claim of conformance shall identify the adopted form, the adopted implementation level and the adopted subset by means of a list of collections and/or characters.
2.3 Conformance of devices
A device is in conformance with ISO/IEC 10646 if it conforms to the requirements of item a) below, and either or both of items b) and c).
NOTE - The term device is defined (in 4.17) as a component of information processing equipment which can transmit and/or receive coded information within CC-data-elements. A device may be a conventional input/output device, or a process such as an application program or gateway function.
A claim of conformance shall identify the document that contains the description specified in a) below, and shall identify the adopted form(s), the adopted implementation level, the adopted subset (by means of a list of collections and/or characters), and the selection of control functions adopted in accordance with clause 16.
a) Device description: A device that conforms to ISO/IEC 10646 shall be the subject of a description that identifies the means by which the user may supply characters to the device and/or may recognise them when they are made available to the user, as specified respectively in subclauses b) and c) below.
b) Originating device: An originating device shall allow its user to supply any characters from an adopted subset, and be capable of transmitting their coded representations within a CC-data-element in accordance with the adopted form and implementation level.
c) Receiving device: A receiving device shall be capable of receiving and interpreting any coded representation of characters that are within a CC-data-element in accordance with the adopted form and implementation level, and shall make any corresponding characters from the adopted subset available to the user in such a way that the user can identify them.
Any corresponding characters that are not within the adopted subset shall be indicated to the user in a way which need not allow them to be distinguished from each other.
NOTES
1 An indication to the user may consist of making available the same character to represent all characters not in the adopted subset, or providing a distinctive audible or visible signal when appropriate to the type of user.
2 See also annex H for receiving devices with re-transmission capability.
3 Normative references
The following standards contain provisions which, through reference in this text, constitute provisions of this part of ISO/IEC 10646. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this part of ISO/IEC 10646 are encouraged to investigate the possibility of applying the most recent editions of the standards listed below. Members of IEC and ISO maintain registers of currently valid International Standards.
ISO 2022:1986 Information processing - ISO 7-bit and 8-bit coded character sets - Code extension techniques.
ISO/IEC 6429:1992 Information technology - Control functions for coded character sets.
4 Definitions
For the purposes of ISO/IEC 10646, the following definitions apply:
4.1 Basic Multilingual Plane (BMP): Plane 00 of Group 00.
4.2 block: A contiguous collection of characters that share common characteristics, such as script.
4.3 canonical form: The form with which characters of this coded character set are specified using four octets to represent each character.
4.4 CC-data-element (Coded-Character-Data-Element): An element of interchanged information that is specified to consist of a sequence of coded representations of characters, in accordance with one or more identified standards for coded character sets.
4.5 cell: The place within a row at which an individual character may be allocated.
4.6 character: A member of a set of elements used for the organisation, control, or representation of data.
4.7 character boundary: Within a stream of octets the demarcation between the last octet of the coded representation of a character and the first octet of that of the next coded character.
4.8 coded character: A character together with its coded representation.
4.9 coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation.
4.10 code table: A table showing the characters allocated to the octets in a code.
4.11 combining character: A member of an identified subset of the coded character set of ISO/IEC 10646 intended for combination with the preceding non-combining graphic character, or with a sequence of combining characters preceded by a non-combining character (see also 4.13).
NOTE - This part of ISO/IEC 10646 specifies several subset collections which include combining characters.
4.12 compatibility character: A graphic character included as a coded character of ISO/IEC 10646 primarily for compatibility with existing coded character sets.
4.13 composite sequence: A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters (see also 4.11).
NOTES
1 A graphic symbol for a composite sequence generally consists of the combination of the graphic symbols of each character in the sequence.
2 A composite sequence is not a character and therefore is not a member of the repertoire of ISO/IEC 10646.
4.14 control function: An action that affects the recording, processing, transmission or interpretation of data, and that has a coded representation consisting of one or more octets.
4.15 default state: The state that is assumed when no state has been explicitly specified.
4.16 detailed code table: A code table showing the individual characters, and normally showing a partial row.
4.17 device: A component of information processing equipment which can transmit and/or receive coded information within CC-data-elements. (It may be an input/output device in the conventional sense, or a process such as an application program or gateway function.)
4.18 graphic character: A character, other than a control function, that has a visual representation normally handwritten, printed, or displayed.
4.19 graphic symbol: The visual representation of a graphic character or of a composite sequence.
4.20 group: A subdivision of the coding space of this coded character set; of 256 x 256 x 256 cells.
4.21 interchange: The transfer of character coded data from one user to another, using telecommunication means or interchangeable media.
4.22 interworking: The process of permitting two or more systems, each employing different coded character sets, meaningfully to interchange character coded data; conversion between the two codes may be involved.
4.23 octet: An ordered sequence of eight bits considered as a unit.
4.24 plane: A subdivision of a group; of 256 x 256 cells.
4.25 presentation; to present: The process of writing, printing, or displaying a graphic symbol.
4.26 presentation form: In the presentation of some scripts, a form of a graphic symbol representing a character that depends on the position of the character relative to other characters.
4.27 private use planes: Planes within this coded character set the contents of which are not specified in ISO/IEC 10646 (see 10.1).
4.28 repertoire: A specified set of characters that are represented in a coded character set.
4.29 row: A subdivision of a plane; of 256 cells.
4.30 script: A set of graphic characters used for the written form of one or more languages.
4.31 supplementary planes: Planes that accommodate characters which have not been allocated to the Basic Multilingual Plane.
4.32 user: A person or other entity that invokes the service provided by a device. (This entity may be a process such as an application program if the “device” is a code converter or a gateway function, for example.)
4.33 zone: A sequence of cells of a code table, comprising one or more rows, either in whole or in part, containing characters of a particular class (see clause 8).
5 General structure of the UCS
The general structure of the Universal Multiple-Octet Coded Character Set (referred to hereafter as “this coded character set”) is described in this explanatory clause, and is illustrated in figures 1 and 2. The normative specification of the structure is given in later clauses.
The value of any octet is expressed in hexadecimal notation from 00 to FF in ISO/IEC 10646 (see annex J).
The canonical form of this coded character set - the way in which it is to be conceived - uses a four-dimensional coding space, regarded as a single entity, consisting of 128 three-dimensional groups.
NOTE - Thus, bit 8 of the most significant octet in the canonical form of a coded character can be used for internal processing purposes within a device as long as it is set to zero within a conforming CC-data-element.
Each group consists of 256 two-dimensional planes.
Each plane consists of 256 one-dimensional rows, each row containing 256 cells. A character is located and coded at a cell within this coding space or the cell is declared unused.
In the canonical form, four octets are used to represent each character, and they specify the group, plane, row and cell, respectively. The canonical form consists of four octets since two octets are not sufficient to cover all the characters in the world, and a 32-bit representation follows modern processor architectures.
The four-octet canonical form can be used as a four-octet coded character set in which case it is called UCS-4.
The first plane (Plane 00 of Group 00) is called the Basic Multilingual Plane. The Basic Multilingual Plane includes characters in general use in alphabetic, syllabic and ideographic scripts together with various symbols and digits. The BMP also has a Restricted Use (RU) zone in which the characters have special characteristics.
The subsequent planes are regarded as supplementary or private use planes, which will accommodate additional graphic characters.
The 32 planes with Plane-octet values E0 to FF of Group 00 are for Private Use. The 32 groups with Group-octet values 60 to 7F of this coded character set are also for Private Use. The contents of the cells in Private Use zones are not specified in ISO/IEC 10646.
Each character is located within the coded character set in terms of its Group-octet, Plane-octet, Row-octet, and Cell-octet.
In addition to the canonical form, a two-octet BMP form is specified. Thus, the Basic Multilingual Plane can be used as a two-octet coded character set identified as UCS-2.
Subsets of the coding space may be used in order to give a sub-repertoire of graphic characters.
A UCS Transformation Format (UTF-1) is specified in annex G which can be used to transmit text data through communication systems which are sensitive to octet values for control characters coded according to the structure of ISO 2022.
6 Basic structure and nomenclature
6.1 Structure
The Universal Multiple-Octet Coded Character Set as specified in ISO/IEC 10646 shall be regarded as a single entity.
This entire coded character set shall be conceived of as comprising 128 groups of 256 planes. Each plane shall be regarded as containing 256 rows of characters, each row containing 256 cells. In a code table representing the contents of a plane (such as in figure 2), the horizontal axis shall represent the least significant octet, with its smaller value to the left; and the vertical axis shall represent the more significant octet, with its smaller value at the top.
Each axis of the coding space shall be coded by one octet. Within each octet the most significant bit shall be bit 8 and the least significant bit shall be bit 1. Accordingly, the weight allocated to each bit shall be
bit 8  bit 7  bit 6  bit 5  bit 4  bit 3  bit 2  bit 1
128    64     32     16     8      4      2      1
6.2 Coding of characters
In the canonical form of the coded character set, each character within the entire coded character set shall be represented by a sequence of four octets. The most significant octet of this sequence shall be the group-octet. The least significant octet of this sequence shall be the cell-octet. Thus this sequence may be represented as
m.s.                                            l.s.
Group-octet | Plane-octet | Row-octet | Cell-octet
where m.s. means the most significant octet, and l.s. means the least significant octet.
For brevity, the octets may be termed
m.s.                                l.s.
G-octet | P-octet | R-octet | C-octet
Where appropriate, these may be further abbreviated to G, P, R, and C.
The value of any octet shall be represented by two hexadecimal digits, for example: 31 or FE. When a single character is to be identified in terms of the values of its group, plane, row and cell, this shall be represented such as:
0000 0030 for DIGIT ZERO
0000 0041 for LATIN CAPITAL LETTER A
When referring to characters within a plane, the leading four zeros (for G-octet and P-octet) may be omitted. For example, 0030 may be used to refer to DIGIT ZERO.
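The coding rule of 6.2 can be illustrated with a short Python sketch. It is not part of ISO/IEC 10646: the function names `split_ucs4` and `format_code_position` are hypothetical, and the upper bound of 7FFFFFFF is an assumption drawn from the 128 groups described in clause 5.

```python
def split_ucs4(value):
    """Return the (G, P, R, C) octets of a canonical-form (UCS-4) value."""
    # Assumption: 128 groups (clause 5) limit the G-octet to 00..7F.
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value does not fit the four-octet canonical form")
    return (
        (value >> 24) & 0xFF,  # G-octet (group), most significant
        (value >> 16) & 0xFF,  # P-octet (plane)
        (value >> 8) & 0xFF,   # R-octet (row)
        value & 0xFF,          # C-octet (cell), least significant
    )


def format_code_position(value, within_plane=False):
    """Format a code position with two hexadecimal digits per octet."""
    g, p, r, c = split_ucs4(value)
    if within_plane:
        # The leading four zeros (G-octet and P-octet) may be omitted.
        return f"{r:02X}{c:02X}"
    return f"{g:02X}{p:02X} {r:02X}{c:02X}"


print(format_code_position(0x30))                     # 0000 0030 (DIGIT ZERO)
print(format_code_position(0x41))                     # 0000 0041 (LATIN CAPITAL LETTER A)
print(format_code_position(0x30, within_plane=True))  # 0030
print(bytes(split_ucs4(0x41)).hex())                  # 00000041
```

The last line lists the four octets with the most significant first, which is the order in which 6.2 writes the sequence.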
[Figure 1 not reproduced: the coding space drawn as Groups 00 to 7F, each group containing Planes 00 to FF, and each plane comprising 256 x 256 cells.]
Figure 1 - Entire coding space of the Universal Multiple-Octet Coded Character Set
[Figure 2 not reproduced: Group 00 drawn with the Plane-octet (00 to FF) on one axis and the Row-octet (00 to FF) on the other. Plane 00 is the Basic Multilingual Plane, divided by row into the A-zone (from row 00), I-zone (from row 4E), O-zone (from row A0) and R-zone (rows E0 to FF). The planes from 01 onwards are supplementary planes, and planes E0 to FF are Private Use planes.]
Labels A-zone, I-zone, O-zone, and R-zone are specified in clause 8.
Figure 2 - Group 00 of the Universal Multiple-Octet Coded Character Set
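The dimensions shown in figures 1 and 2 can be checked with simple arithmetic. The following Python sketch is illustrative only; the constant names are hypothetical, and the counts are taken from clause 5 (128 groups, 256 planes per group, 256 rows per plane, 256 cells per row).

```python
GROUPS = 128            # G-octet values 00 to 7F
PLANES_PER_GROUP = 256  # P-octet values 00 to FF
ROWS_PER_PLANE = 256    # R-octet values 00 to FF
CELLS_PER_ROW = 256     # C-octet values 00 to FF

cells_per_plane = ROWS_PER_PLANE * CELLS_PER_ROW
cells_per_group = PLANES_PER_GROUP * cells_per_plane
total_cells = GROUPS * cells_per_group

# Private Use planes of Group 00 run from Plane E0 to Plane FF inclusive.
private_use_planes = 0xFF - 0xE0 + 1

print(cells_per_plane)        # 65536
print(cells_per_group)        # 16777216
print(total_cells)            # 2147483648
print(total_cells == 2**31)   # True
print(private_use_planes)     # 32
```

The total of 2^31 cells is why the most significant bit of the group-octet is always zero in a conforming four-octet value.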
6.3 Octet order
The sequence of the octets that represent a character, and the most significant and least significant ends of it, shall be maintained as shown above. When serialised as octets, a more significant octet shall precede less significant octets. When not serialised as octets, the order of octets may be specified by agreement between sender and recipient (see 17.1 and annex F).
7 Special features of the UCS
The following characteristics apply to the entire coded character set.
1. The values of P-, and R-, and C-octets used for representing graphic characters shall be in the range 00 to FF. The values of G-octets used for representation of graphic characters shall be in the range 00 to 7F. On any plane, code positions FFFE and FFFF shall not be used.
NOTE - Code position FFFE is reserved for “signature” (see annex F). Code position FFFF can be used for internal processing uses requiring a numeric value that is guaranteed not to be a coded character such as in terminating tables, or signaling end-of-text. Since it is the largest two-octet value, it may also be used as the final value in binary or sequential searching index.
2. Code positions to which a character is not allocated, except for the positions reserved for Private Use characters, are reserved for future standardisation and shall not be used for any other purpose. Future editions of ISO/IEC 10646 will not allocate any characters to code positions reserved for Private Use characters.
3. The same graphic character shall not be allocated to more than one code position. There are graphic characters with similar shapes in the coded character set; they are used for different purposes and have different character names.
4. Compatibility characters are included in ISO/IEC 10646 primarily for compatibility with existing coded character sets to allow two-way code conversion without loss of information.
8 The Basic Multilingual Plane
Plane 00 of Group 00 shall be the Basic Multilingual Plane (BMP). The BMP can be used as a two-octet coded character set in which case it shall be called UCS-2 (see 14.1).
The Basic Multilingual Plane shall be divided into four zones:
A-zone: code positions 0000 to 4DFF
I-zone: code positions 4E00 to 9FFF
O-zone: code positions A000 to DFFF
R-zone: code positions E000 to FFFD
[Diagram of the BMP by row-octet, 00 to FF: A-zone from row 00; I-zone from row 4E (20992 positions); O-zone from row A0 (16384 positions); R-zone from row E0 (8190 positions).]
Code positions 0000 to 001F in the BMP are reserved for control characters, and code position 007F is reserved for the character DELETE (see clause 16). Code positions 0080 to 009F are reserved.
In the Basic Multilingual Plane, the A-zone is used for alphabetic and syllabic scripts together with various symbols. The I-zone is used for Chinese/Japanese/Korean (CJK) unified ideographs (unified East Asian ideographs). The O-zone is reserved for future standardisation. The R-zone shall be used for the Restricted Use zone in the BMP which contains Private Use characters, presentation forms, and compatibility characters (see clause 10).
9 Other planes
Planes 01 to DF in Group 00 and planes 00 to FF in Groups 01 to 5F are reserved for future standardisation, and thus those code positions shall not be used for any other purpose.
10 The Restricted Use zone
Sets of graphic characters that are used in particular ways are provided in the Restricted Use zone. These sets include:
a) Private Use characters,
b) Presentation forms of characters,
c) Compatibility characters (see item 4 in clause 7).
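The zone boundaries of clause 8, the unused positions of clause 7 and the reserved planes of clause 9 lend themselves to a small lookup. The Python sketch below is an illustration under stated assumptions, not a normative algorithm: the name `classify` and its return labels are hypothetical, and the Private Use ranges (planes E0 to FF of Group 00, groups 60 to 7F) are taken from clause 5.

```python
BMP_ZONES = [
    ("A-zone", 0x0000, 0x4DFF),
    ("I-zone", 0x4E00, 0x9FFF),
    ("O-zone", 0xA000, 0xDFFF),
    ("R-zone", 0xE000, 0xFFFD),
]


def classify(value):
    """Classify a four-octet code position (hypothetical labels)."""
    if not 0 <= value <= 0x7FFFFFFF:
        return "outside the coding space"
    group = value >> 24
    plane = (value >> 16) & 0xFF
    position = value & 0xFFFF  # R-octet and C-octet together
    if position >= 0xFFFE:
        # On any plane, FFFE and FFFF shall not be used.
        return "not used (FFFE/FFFF)"
    if group >= 0x60 or (group == 0x00 and plane >= 0xE0):
        return "Private Use"
    if group == 0x00 and plane == 0x00:
        for name, first, last in BMP_ZONES:
            if first <= position <= last:
                return name
    return "reserved for future standardisation"


# Size of each zone's code-position range.
for name, first, last in BMP_ZONES:
    print(name, last - first + 1)
# A-zone 19968
# I-zone 20992
# O-zone 16384
# R-zone 8190

print(classify(0x0041))      # A-zone
print(classify(0x4E00))      # I-zone
print(classify(0x00010000))  # reserved for future standardisation
print(classify(0x00E00000))  # Private Use
```

The range sizes computed for the I-zone, O-zone and R-zone agree with the position counts shown in the clause 8 diagram.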
ISO/IEC 10646-I : 1993 (E)
10.1 Private Use characters
12 Revision and updating of the UCS
Private Use characters are not restrained in any way
The revision and updating of this coded character
by ISO/IEC 10646. Private Use characters can be
set will be carried out by ISO/IEC JTClSC2.
used to provide user-defined characters. For
NOTE - It is intended that in future editions of ISOAEC
example, this is a common requirement for users of
10646, the names and allocation of the characters in this
ideographic scripts.
edition will remain unchanged.
NOTE 1 - For meaningful interchange of Private Use
characters, an agreement, independent of ISOAEC 10646,
is necessary between sender and recipient.
13 Subsets
Private Use characters can be used for
ISO/IEC 10646 provides the specification of subsets
dynamically-redefinable characters (DRCS)
of coded graphic characters for use in interchange,
applications.
by originating devices and by receiving devices.
NOTE 2 - For meaningful interchange of DRCS, an agreement, independent of ISO/IEC 10646 is necessary between sender and recipient. ISO/IEC 10646 does not specify the techniques for defining or setting up dynamically-redefinable characters.
10.2 Presentation forms of characters
Each presentation form of character provides an alternative form, for use in a particular context, to the nominal form of the character or sequence of characters from the other zones of graphic characters. The transformation from the nominal form to the presentation forms may involve substitution, superimposition, or combination.
The rules for the superimposition, choice of differently shaped characters, or combination into ligatures, or conjuncts - which are often of extreme complexity - are not specified in ISO/IEC 10646.
In general, presentation forms are not intended to be used as a substitute for the nominal forms of the graphic characters specified elsewhere within this coded character set. However, specific applications may encode these presentation forms instead of the nominal forms for specific reasons among which is compatibility with existing devices. The rules for searching, sorting and other processing operations on presentation forms are outside the scope of ISO/IEC 10646.
11 Private Use groups and planes
The code positions of 32 planes from Plane E0 to Plane FF of Group 00 shall be for Private Use.
The code positions of the 32 groups from Group 60 to Group 7F shall be for Private Use.
The contents of these code positions are not specified in ISO/IEC 10646 (see 10.1).
There are two alternatives for the specification of subsets; limited subset and selected subset. An adopted subset may comprise either of them, or a combination of the two.
13.1 Limited subset
A limited subset consists of a list of graphic characters in the specified subset. This specification allows applications and devices that were developed using other codes to interwork with this coded character set.
A claim of conformance referring to a limited subset shall list the graphic characters in the subset by the names of graphic characters or code positions as defined in ISO/IEC 10646.
13.2 Selected subset
A selected subset consists of a list of collections of graphic characters as defined in ISO/IEC 10646. The collections from which the selection may be made are listed in annex A of each part of ISO/IEC 10646. A selected subset shall always automatically include the Cells 20 to 7E of Row 00 of Plane 00 of Group 00.
A claim of conformance referring to a selected subset shall list the collections chosen as defined in ISO/IEC 10646.
14 Coded representation forms of the UCS
ISO/IEC 10646 provides two alternative forms of coded representation of characters.
NOTE - The characters from the ISO 646 IRV repertoire are coded by simple zero extensions to their coded representations in ISO 646 IRV. Therefore, their coded representations have the same integer values when represented as 8-bit, 16-bit, or 32-bit integers. For implementations sensitive to a zero valued octet (e.g. for use as a string terminator), use of 8-bit based array data
type should be avoided as any zero valued octet may be interpreted incorrectly. Use of data types at least 16-bits wide is more suitable for UCS-2, and use of data types at least 32-bits wide is more suitable for UCS-4.
14.1 Two-octet BMP form
This coded representation form permits the use of characters from the Basic Multilingual Plane with each character represented by two octets.
Within a CC-data-element conforming to the two-octet BMP form, a character from the Basic Multilingual Plane shall be represented by two octets comprising the R-octet and the C-octet as specified in 6.2.
NOTE - A coded graphic character using the two-octet BMP form may be implemented by a 16-bit integer for processing.
14.2 Four-octet canonical form
The canonical form permits the use of all the characters of ISO/IEC 10646, with each character represented by four octets.
Within a CC-data-element conforming to the four-octet canonical form, every character shall be represented by four octets comprising the G-octet, the P-octet, the R-octet and the C-octet as specified in 6.2.
NOTE - A coded graphic character using the four-octet canonical form may be implemented by a 32-bit integer for processing.
15 Implementation levels
ISO/IEC 10646 specifies three levels of implementation. Combining characters are described in clause 23 and listed in annex B.
15.1 Implementation level 1
When implementation level 1 is used, a CC-data-element shall not contain coded representations of combining characters (see clause B.1) nor of characters from HANGUL JAMO block (see clause 24).
15.2 Implementation level 2
When implementation level 2 is used, a CC-data-element shall not contain coded representations of characters listed in clause B.2.
15.3 Implementation level 3
When implementation level 3 is used, a CC-data-element may contain coded representations of any characters.
16 Use of control functions with the UCS
This coded character set provides for use of control functions encoded according to ISO 2022, ISO/IEC 6429 or similarly structured standards for control functions, and standards derived from these. A set or subset of such coded control functions may be used in conjunction with this coded character set. These standards encode a control function as a sequence of one or more octets.
When a C0 control character of ISO/IEC 6429 is used with this coded character set, its coded representation as specified in ISO/IEC 6429 shall be padded to correspond with the number of octets in the adopted form (see clause 14). Thus, the least significant octet shall be the bit combination specified in ISO/IEC 6429, and the more significant octet(s) shall be zeros.
For example, the control character FORM FEED is represented by “000C” in the two-octet form, and “0000 000C” in the four-octet form.
For escape sequences, control sequences, and control strings (see ISO/IEC 6429) consisting of a coded control character followed by additional bit combinations in the range 20 to 7F, each bit combination shall be padded by octet(s) with value 00.
For example, the escape sequence “ESC 02/00 04/00” is represented by “001B 0020 0040” in the two-octet form, and “0000 001B 0000 0020 0000 0040” in the four-octet form.
When using a C1 control character of ISO/IEC 6429 with this coded character set, it shall be coded as ESC Fe sequence (see ISO/IEC 6429) padded as specified above.
For example, the control character PARTIAL LINE BACKWARD - PLU (08/12 in ISO/IEC 6429 representation) is represented by “001B 004C” in the two-octet form, and “0000 001B 0000 004C” in the four-octet form.
Code extension control functions for the ISO 2022 code extension techniques (such as designation escape sequence, single shift and locking shift) shall not be used with this coded character set.
17 Declaration of identification of features
17.1 Purpose and context of identification
CC-data-elements conforming to ISO/IEC 10646 are
intended to form all or part of a composite unit of coded information that is interchanged between an originator and a recipient. The identification of ISO/IEC 10646 (including the form), the implementation level, and any subset of the coding space that have been adopted by the originator must also be available to the recipient. The route by which such identification is communicated to the recipient is outside the scope of ISO/IEC 10646.
However, some standards for interchange of coded information may permit, or require, that the coded representation of the identification applicable to the CC-data-element forms a part of the interchanged information. This clause specifies a coded representation for the identification of UCS with an implementation level and a subset of ISO/IEC 10646, and also of a C0 and a C1 set of control functions from ISO/IEC 6429 for use in conjunction with ISO/IEC 10646. Such coded representations provide all or part of an identification data element, which may be included in information interchange in accordance with the relevant standard.
If two or more of the identifications are present, the order of those identifications shall follow the order as specified in this clause.
NOTE - An alternative method of identification is described in annex M.
17.2 Identification of UCS coded representation form with implementation level
When the escape sequences from ISO 2022 are used, the identification of a coded representation form of UCS (see clause 14) and an implementation level (see clause 15) specified by ISO/IEC 10646 shall be by a designation sequence chosen from the following list:
ESC 02/05 02/15 04/00  UCS-2 with implementation level 1
ESC 02/05 02/15 04/01  UCS-4 with implementation level 1
ESC 02/05 02/15 04/03  UCS-2 with implementation level 2
ESC 02/05 02/15 04/04  UCS-4 with implementation level 2
ESC 02/05 02/15 04/05  UCS-2 with implementation level 3
ESC 02/05 02/15 04/06  UCS-4 with implementation level 3
If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 16.
17.3 Identification of subsets of graphic characters
When the control sequences of ISO/IEC 6429 are used, the identification of subsets (see clause 13) specified by ISO/IEC 10646 shall be by a control sequence IDENTIFY UNIVERSAL CHARACTER SUBSET (IUCS) as shown below.
CSI Ps... 02/00 06/13
Ps... means that there can be any number of selective parameters. The parameters are to be taken from the subset collection numbers as shown in annex A of each part of ISO/IEC 10646. When there is more than one parameter, each parameter value is separated by an octet with value 03/11.
Parameter values are represented by digits where octet values 03/00 to 03/09 represent digits 0 to 9.
If such a control sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 16.
17.4 Identification of control function set
When the escape sequences from ISO 2022 are used, the identification of each set of control functions (see clause 16) of ISO/IEC 6429 to be used in conjunction with ISO/IEC 10646 shall be an identifier sequence of the type shown below.
ESC 02/01 04/00  identifies the full C0 set of ISO/IEC 6429
ESC 02/02 04/03  identifies the full C1 set of ISO/IEC 6429
For a subset of C0 or C1 sets, the final octet F shall be obtained from the International Register of Coded Character Sets. The identifier sequences for these sets shall be:
ESC 02/01 F  identifies a C0 set
ESC 02/02 F  identifies a C1 set
If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 16.
17.5 Identification of return from UCS to ISO 2022
When the escape sequences from ISO 2022 are used, the identification of the return from UCS to the coding system of ISO 2022 shall be by the escape sequence ESC 02/05 04/00, padded in accordance with clause 16.
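The padding rule of clause 16 and the identification sequences of 17.2 and 17.3 can be sketched in a few lines of Python. This is a minimal illustration, not part of ISO/IEC 10646: the function and table names (pad, c1_as_esc_fe, identify_ucs, identify_subsets, UCS_FINAL) are hypothetical, octets are written most significant first as in the examples of clause 16, and the sketch assumes that CSI, being a C1 control character (09/11), is written as ESC 05/11 under the ESC Fe rule.

```python
# Illustrative sketch only. All function and table names here are
# hypothetical; ISO/IEC 10646 defines the octet values, not this API.

ESC = 0x1B
FORM_WIDTH = {"UCS-2": 2, "UCS-4": 4}  # octets per character in each form

# Final octets of the designation sequences listed in 17.2,
# keyed by (form, implementation level).
UCS_FINAL = {
    ("UCS-2", 1): 0x40, ("UCS-4", 1): 0x41,
    ("UCS-2", 2): 0x43, ("UCS-4", 2): 0x44,
    ("UCS-2", 3): 0x45, ("UCS-4", 3): 0x46,
}


def pad(octets, form):
    """Pad every octet with leading zero octets to the width of the form."""
    zeros = bytes(FORM_WIDTH[form] - 1)
    return b"".join(zeros + bytes([octet]) for octet in octets)


def c1_as_esc_fe(c1_octet):
    """Rewrite a C1 control (08/00 to 09/15) as ESC Fe: ESC, octet - 0x40."""
    if not 0x80 <= c1_octet <= 0x9F:
        raise ValueError("not a C1 control character")
    return [ESC, c1_octet - 0x40]


def identify_ucs(form, level):
    """Unpadded designation sequence ESC 02/05 02/15 F from 17.2."""
    return [ESC, 0x25, 0x2F, UCS_FINAL[(form, level)]]


def identify_subsets(collection_numbers):
    """Unpadded IUCS sequence from 17.3: CSI Ps... 02/00 06/13.

    Assumption: CSI (09/11) is written as ESC 05/11, per the ESC Fe rule.
    """
    params = ";".join(str(n) for n in collection_numbers)  # 03/11 is ";"
    return c1_as_esc_fe(0x9B) + list(params.encode("ascii")) + [0x20, 0x6D]


print(pad([0x0C], "UCS-2").hex())              # FORM FEED, two-octet form
print(pad(c1_as_esc_fe(0x8C), "UCS-4").hex())  # PLU, four-octet form
```

Passing the result of identify_ucs or identify_subsets through pad gives the form required when the sequence appears inside a CC-data-element. Keeping the unpadded sequence and the padding step separate mirrors the way clause 17 refers back to clause 16.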
18 Structure of the code tables and lists
The clauses 25 and 26 set out the detailed code tables and the lists of character names for the graphic characters. Together, these specify graphic characters, their coded representation, and the character name for each character.
The graphic symbols are to be regarded as typical visual representations of the characters. ISO/IEC 10646 does not attempt to prescribe the exact shape of each character. The shape is affected by the design of the font employed, which is outside the scope of ISO/IEC 10646.
Graphic characters specified in ISO/IEC 10646 are uniquely identified by their names. This does not imply that the graphic symbols by which they are commonly imaged are always different. Examples of graphic characters with similar graphic symbols are LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA, and CYRILLIC CAPITAL LETTER A.
The meaning attributed to any character is not specified by ISO/IEC 10646; it may differ from country to country, or from one application to another.
For the alphabetic scripts, the general principle has been to arrange the characters within any row in approximate alphabetic sequence; where the script has capital and small letters, these are arranged in pairs. However, this general principle has been overridden in some cases. For example, for those scripts for which a relevant standard exists, the characters are allocated according to that standard. This arrangement within the code tables will aid conversion between the existing standards and this coded character set. In general, however, it is anticipated that conversion between this coded character set and any other coded character set will use a table lookup technique.
It is not intended, nor will it often be the case, that the characters needed by any one user will be found all grouped together in one part of the code table. Furthermore, the user of any script will find that characters he needs may have been already coded earlier in this coded character set. This especially applies to the digits, to the symbols, and to the use of Latin letters in dual-script applications.
Therefore, in using this coded character set, the reader is advised to refer first to the block names list in clause 19 or an overview of the BMP in figure 3, and the
...
addition, annex E contains an alphabetically sorted list of character names.
19 Block names
The following list contains the blocks defined in the BMP. The block names are used in providing for the specification of subsets (see annex A for subset collections).
Block name  from - to
BASIC LATIN  0020 - 007E
LATIN-1 SUPPLEMENT  00A0 - 00FF
LATIN EXTENDED-A  0100 - 017F
LATIN EXTENDED-B  0180 - 024F
IPA EXTENSIONS  0250 - 02AF
SPACING MODIFIER LETTERS  02B0 - 02FF
COMBINING DIACRITICAL MARKS  0300 - 036F
BASIC GREEK  0370 - 03CF
GREEK SYMBOLS AND COPTIC  03D0 - 03FF
CYRILLIC  0400 - 04FF
ARMENIAN  0530 - 058F
HEBREW EXTENDED-A  0590 - 05CF
BASIC HEBREW  05D0 - 05EA
HEBREW EXTENDED-B  05EB - 05FF
BASIC ARABIC  0600 - 0652
ARABIC EXTENDED  0653 - 06FF
DEVANAGARI  0900 - 097F
BENGALI  0980 - 09FF
GURMUKHI  0A00 - 0A7F
GUJARATI  0A80 - 0AFF
ORIYA  0B00 - 0B7F
TAMIL  0B80 - 0BFF
TELUGU  0C00 - 0C7F
KANNADA  0C80 - 0CFF
MALAYALAM  0D00 - 0D7F
THAI  0E00 - 0E7F
LAO  0E80 - 0EFF
BASIC GEORGIAN  10D0 - 10FF
GEORGIAN EXTENDED  10A0 - 10CF
HANGUL JAMO  1100 - 11FF
LATIN EXTENDED ADDITIONAL  1E00 - 1EFF
GREEK EXTENDED  1F00 - 1FFF
GENERAL PUNCTUATION  2000 - 206F
SUPERSCRIPTS AND SUBSCRIPTS  2070 - 209F
CURRENCY SYMBOLS  20A0 - 20CF
COMBINING DIACRITICAL MARKS FOR SYMBOLS  20D0 - 20FF
LETTERLIKE SYMBOLS  2100 - 214F
NUMBER FORMS  2150 - 218F
ARROWS  2190 - 21FF
MATHEMATICAL OPERATORS  2200 - 22FF
MISCELLANEOUS TECHNICAL  2300 - 23FF
CONTROL PICTURES  2400 - 243F
OPTICAL CHARACTER RECOGNITION  2440 - 245F
ENCLOSED ALPHANUMERICS  2460 - 24FF
BOX DRAWING  2500 - 257F
BLOCK ELEMENTS  2580 - 259F
GEOMETRIC SHAPES  25A0 - 25FF
MISCELLANEOUS SYMBOLS  2600 - 26FF
DINGBATS  2700 - 27BF
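Clause 18 anticipates that conversion and lookup will be table-driven, and the block list of clause 19 lends itself to the same approach. The following Python sketch, under stated assumptions, maps a BMP code position to its block name with a binary search over a small excerpt of the list. The names BLOCKS and block_name are hypothetical, only a few rows are copied in, and the code positions used in the usage lines (0041, 0391, 0410 for the three similar-looking capital letters named in clause 18) are supplied here as an assumption rather than taken from the text above.

```python
from bisect import bisect_right

# Excerpt of the block list in clause 19: (first, last, block name).
# Rows must stay sorted by their first code position.
BLOCKS = [
    (0x0020, 0x007E, "BASIC LATIN"),
    (0x00A0, 0x00FF, "LATIN-1 SUPPLEMENT"),
    (0x0370, 0x03CF, "BASIC GREEK"),
    (0x0400, 0x04FF, "CYRILLIC"),
    (0x2700, 0x27BF, "DINGBATS"),
]
_STARTS = [first for first, _, _ in BLOCKS]


def block_name(code_position):
    """Return the block containing code_position, or None if no listed block does."""
    index = bisect_right(_STARTS, code_position) - 1
    if index >= 0:
        first, last, name = BLOCKS[index]
        if first <= code_position <= last:
            return name
    return None


# LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA, CYRILLIC CAPITAL LETTER A
for position in (0x0041, 0x0391, 0x0410):
    print(f"{position:04X}", block_name(position))
```

A sorted list with binary search is only one possible design; because the ranges do not overlap, a plain linear scan would give the same answers for a table of this size.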









