ISO/IEC 19757-7:2020
Information technology — Document Schema Definition Languages (DSDL) — Part 7: Character Repertoire Description Language (CREPDL)
This document specifies a Character Repertoire Description Language (CREPDL). A CREPDL schema describes a character repertoire. A stream of UCS code points can be validated against a CREPDL schema.
Technologies de l'information — Langages de définition de schéma de documents (DSDL) — Partie 7: Langage de description de répertoire de caractères (CREPDL)
INTERNATIONAL STANDARD
ISO/IEC 19757-7
Second edition
2020-08
Information technology — Document Schema Definition Languages (DSDL) —
Part 7:
Character Repertoire Description Language (CREPDL)
Technologies de l'information — Langages de définition de schéma de documents (DSDL) —
Partie 7: Langage de description de répertoire de caractères (CREPDL)
Reference number
ISO/IEC 19757-7:2020(E)
© ISO/IEC 2020
COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2020
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Notation
5 Overview
5.1 Basic constructs and compound constructs
5.2 Characters and code points
5.3 Grapheme clusters
5.4 Kernel and Hull
6 Syntax
6.1 General
6.2 RELAX NG schema
6.3 NVDL script
6.4 Regular Expressions
7 Semantics
7.1 General
7.2 char
7.3 union
7.4 intersection
7.5 difference
7.6 ref
7.7 repertoire
8 Validation
Annex A (informative) Differences of conformant processors
Annex B (informative) Example CREPDL schemas
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that
are members of ISO or IEC participate in the development of International Standards through
technical committees established by the respective organization to deal with particular fields of
technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other
international organizations, governmental and non-governmental, in liaison with ISO and IEC, also
take part in the work.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for
the different types of document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents) or the IEC
list of patent declarations received (see http://patents.iec.ch).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see
www.iso.org/iso/foreword.html.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 34, Document description and processing languages.
This second edition cancels and replaces the first edition (ISO/IEC 19757-7:2009), which has been
technically revised. It also incorporates the Technical Corrigendum ISO/IEC 19757-7:2009/Cor 1:2015.
The main changes compared to the previous edition are as follows:
— addition of validation of grapheme clusters such as 'n' followed by COMBINING GRAVE ACCENT
(U+0300) and a CJK unified ideograph followed by a variation selector.
— addition of the Unicode Ideographic Variation Database as a registry.
A list of all parts in the ISO/IEC 19757 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
ISO/IEC 19757 (all parts) defines a set of Document Schema Definition Languages (DSDL) that can
be used to specify one or more validation processes performed against Extensible Markup Language
(XML) documents. A number of validation technologies are standardized in DSDL to complement those
already available as standards or from industry.
The main objective of ISO/IEC 19757 (all parts) is to bring together different validation-related
technologies to form a single extensible framework that allows technologies to work in series or in
parallel to produce a single or a set of validation results. The extensibility of DSDL accommodates
validation technologies not yet designed or specified.
This document provides a language for describing character repertoires. Descriptions in this language
can be referenced from schemas. Furthermore, they can also be referenced from forms and stylesheets.
Descriptions of character repertoires do not need to be exact. Non-exact descriptions are made
possible by kernels and hulls, which provide the lower and upper limits, respectively.
The structure of this document is as follows. Clause 5 provides an informal overview of CREPDL.
Clause 6 specifies the syntax of CREPDL schemas. Clause 7 specifies the semantics of a correct CREPDL
schema; the semantics specify when a code point or code point sequence is in a character repertoire
described by a CREPDL schema. Clause 8 defines the behaviour of CREPDL processors. Finally, Annex A
describes differences of conformant CREPDL processors; Annex B provides examples of CREPDL
schemas.
Although the first edition was restricted to the validation of characters, this edition can also enable the
validation of grapheme clusters such as 'n' followed by COMBINING GRAVE ACCENT (U+0300) and a CJK
unified ideograph followed by a variation selector.
CREPDL schemas conformant to the first edition do not conform to this edition. In particular, this
edition changes the namespace name for CREPDL schemas.
INTERNATIONAL STANDARD ISO/IEC 19757-7:2020(E)
Information technology — Document Schema Definition
Languages (DSDL) —
Part 7:
Character Repertoire Description Language (CREPDL)
1 Scope
This document specifies a Character Repertoire Description Language (CREPDL). A CREPDL schema
describes a character repertoire. A stream of UCS code points can be validated against a CREPDL schema.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS)
ISO/IEC 19757-2, Information technology — Document Schema Definition Language (DSDL) — Part 2:
Regular-grammar-based validation — RELAX NG
ISO/IEC 19757-4, Information technology — Document Schema Definition Languages (DSDL) — Part 4:
Namespace-based Validation Dispatching Language (NVDL)
W3C XML, Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation, 16 August
2006, available at http://www.w3.org/TR/2006/REC-xml-20060816
W3C XML-Names, Namespaces in XML (Second Edition), W3C Recommendation, 16 August 2006,
available at http://www.w3.org/TR/2006/REC-xml-names-20060816
IETF RFC 3987, Internationalized Resource Identifiers (IRIs), Internet Standards Track Specification,
January 2005, available at http://www.ietf.org/rfc/rfc3987.txt
Charsets I.A.N.A. IANA CHARACTER SETS, available at http://www.iana.org/assignments/character-sets
Unicode, The Unicode Standard, The Unicode Consortium, available at http://www.unicode.org/
CLDR, Unicode Common Locale Data Repository, The Unicode Consortium, available at
http://www.unicode.org/cldr/
UAX29, Unicode Standard Annex #29: Unicode Text Segmentation, The Unicode Consortium, available at
http://unicode.org/reports/tr29/
UTS35, Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), The Unicode
Consortium, available at https://www.unicode.org/reports/tr35/
UTS37, Unicode Technical Standard #37: Unicode Ideographic Variation Database, The Unicode
Consortium, available at http://www.unicode.org/reports/tr37/
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
CREPDL processor
computer program that validates a stream of code points not containing high- or low-surrogate code
points against CREPDL schemas (3.2)
3.2
CREPDL schema
machine-readable description of a repertoire (3.8)
3.3
grapheme cluster
base character followed by zero or more continuing characters
Note 1 to entry: A grapheme cluster typically represents what the user thinks of as a basic unit of a writing system
for a language.
[SOURCE: UAX 29]
3.4
hull
set of code points or code point sequences (excluding high- or low-surrogate code points) that are not
guaranteed to be excluded from the repertoire (3.8)
3.5
kernel
set of code points or code point sequences (excluding high- or low-surrogate code points) that are
guaranteed to be included by the repertoire (3.8)
3.6
mode
option to specify whether characters or grapheme clusters (3.3) are examined
Note 1 to entry: The first edition did not have modes. Thus, characters can be examined, but grapheme
clusters cannot.
3.7
registry
collection of named repertoires (3.8)
3.8
repertoire
description of a set of code points or code point sequences excluding high- or low-surrogate code points
4 Notation
in(x, A): code point or code point sequence x is in the repertoire described by a CREPDL element A;
not-in(x, A): code point or code point sequence x is not in the repertoire described by a CREPDL element A;
unknown(x, A): it is unknown whether code point or code point sequence x is in the repertoire described
by a CREPDL element A.
NOTE 1 This predicate-like notation captures the combination of three-valued logic and the interpretation
of a formula for a given character or grapheme cluster. In other words, in(x, A) implies that the interpretation of
A under x is truth in three-valued logic. Likewise, not-in(x, A) and unknown(x, A) imply the interpretations of A
under x are false and unknown, respectively.
NOTE 2 This document is intended to ensure that exactly one of in(x, A), not-in(x, A), and unknown(x, A) holds.
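As an illustrative sketch only, the three predicates can be modelled in Python as the three values of one enumeration, so that exactly one of them holds for a given x and A by construction. The names Membership and describe are assumptions made for this example and are not defined by CREPDL.

```python
from enum import Enum


class Membership(Enum):
    """Three-valued result for a code point (sequence) x and a CREPDL element A."""
    IN = "in"            # in(x, A) holds
    NOT_IN = "not-in"    # not-in(x, A) holds
    UNKNOWN = "unknown"  # unknown(x, A) holds


def describe(x: str, element_name: str, result: Membership) -> str:
    """Render a result in the predicate-like notation used in this clause."""
    code_points = " ".join(f"U+{ord(c):04X}" for c in x)
    return f"{result.value}({code_points}, {element_name})"
```

Because a function returning Membership yields a single value, the mutual exclusivity described in NOTE 2 does not need to be checked separately in such a model.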
5 Overview
5.1 Basic constructs and compound constructs
Basic constructs of CREPDL schemas are created from regular expressions or references to registries
of repertoires. Compound constructs of CREPDL schemas are created by combining basic constructs by
set operators such as union, intersection, and difference.
5.2 Characters and code points
Although the title of this document is "Character Repertoire Description Language", this document uses
code points more often than characters. This is because CREPDL allows the use of unassigned code
points, which are not characters. For example, U+1CBB is an unassigned code point, and is thus not a
character. It is possible to create a CREPDL schema that allows this code point. A stream containing it is
valid against such a CREPDL schema.
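The distinction between code points and characters can be sketched in Python as follows. This is a minimal illustration, not a CREPDL processor: the function names and the set-based "schema" are assumptions made for the example, and the general category reported by unicodedata depends on the Unicode version bundled with the interpreter.

```python
import unicodedata
from typing import Iterable, Set


def is_unassigned(code_point: int) -> bool:
    """True when the code point has general category Cn (unassigned or
    noncharacter) in the Unicode version shipped with this interpreter."""
    return unicodedata.category(chr(code_point)) == "Cn"


def stream_is_valid(stream: Iterable[int], allowed: Set[int]) -> bool:
    """Validate a stream of code points against a set of allowed code points.
    The set stands in for a repertoire; it may contain unassigned code points."""
    return all(cp in allowed for cp in stream)
```

With allowed = {0x006E, 0x1CBB}, the stream [0x006E, 0x1CBB] is accepted even though U+1CBB is not a character, which mirrors the situation described above.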
5.3 Grapheme clusters
CREPDL can enable the validation of grapheme clusters, which are sequences of code points. For
example, a CREPDL schema can allow LATIN CAPITAL LETTER N (U+004E) or LATIN SMALL LETTER
n (U+006E) followed by COMBINING GRAVE ACCENT (U+0300) while disallowing other characters
followed by COMBINING GRAVE ACCENT (U+0300). Likewise, a CREPDL schema can indicate which
variation selector can follow which CJK unified ideograph.
NOTE The first edition cannot enable the validation of sequences of code points. It was thus not possible
to allow LATIN CAPITAL LETTER N (U+004E) or LATIN SMALL LETTER n (U+006E) followed by COMBINING
GRAVE ACCENT (U+0300) without allowing other characters followed by COMBINING GRAVE ACCENT (U+0300).
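The following Python sketch illustrates why validation of sequences is needed for the example above. It assumes a deliberately simplified notion of grapheme cluster (a base character followed by characters of general category M*); UAX29 defines considerably more rules, and the function names are hypothetical.

```python
import unicodedata
from typing import List

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT


def simple_clusters(text: str) -> List[str]:
    """Split text into a base character plus any following marks.
    Simplification of UAX29: only general category M* continues a cluster."""
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def grave_only_after_n(text: str) -> bool:
    """Accept COMBINING GRAVE ACCENT only directly after 'N' or 'n'."""
    allowed = ("N" + GRAVE, "n" + GRAVE)
    for cluster in simple_clusters(text):
        if GRAVE in cluster and cluster not in allowed:
            return False
    return True
```

A check that looks at one code point at a time would have to either accept or reject U+0300 everywhere; examining whole clusters is what makes the distinction between "n" plus accent and "e" plus accent possible.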
5.4 Kernel and Hull
It is sometimes difficult to precisely specify a repertoire. As an example, consider collections in
ISO/IEC 10646, which are numbered and named repertoires. Some collections are open: they contain
assigned code points as well as unassigned code points, which can be assigned in the future.
Recall that some basic constructs of CREPDL schemas are created from regular expressions. Such basic
constructs have pairs of regular expressions. One regular expression specifies what is guaranteed to
be included, while the other specifies what is not guaranteed to be excluded. The former and latter
are called kernel and hull, respectively. If a code point matches the kernel regular expression, the code
point is definitely included in the repertoire. Even if it does not, it is not guaranteed to be excluded from the
repertoire if it matches the hull regular expression.
NOTE Kernel and hull are reproduced from W3C Note-charcol [3]. Some examples in Annex B are also
reproduced from W3C Note-charcol [3].
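To make the idea of an open collection concrete, the following Python sketch uses code point ranges instead of regular expressions. The ranges are chosen purely for illustration (a block spanning U+1C90 to U+1CBF in which U+1CBB and U+1CBC are treated as unassigned) and are an assumption of this example, not a definition of any ISO/IEC 10646 collection.

```python
def classify_open_collection(code_point: int) -> str:
    """Classify a code point against an illustrative open collection.

    kernel: code points assumed to be assigned (guaranteed to be included)
    hull:   the whole block, including positions that may be assigned later
    """
    kernel = set(range(0x1C90, 0x1CBB)) | set(range(0x1CBD, 0x1CC0))
    hull = range(0x1C90, 0x1CC0)
    if code_point in kernel:
        return "in"
    if code_point in hull:
        return "unknown"   # not guaranteed to be excluded
    return "not-in"
```

Under these assumptions U+1C90 is reported as "in", U+1CBB as "unknown", and U+0041 as "not-in".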
6 Syntax
6.1 General
A CREPDL schema shall be an XML document (which shall be as specified in W3C XML and shall further
conform to W3C XML-Names) valid against the NVDL (ISO/IEC 19757-4) script in 6.3, which in turn
relies on the RELAX NG (ISO/IEC 19757-2) schema in 6.2. The elements allowed in the RELAX NG schema
are of the namespace http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0. Further constraints on
the character content of the char, kernel or hull elements are shown in 6.4.
NOTE 1 W3C XML specifies that characters in XML documents are either U+0009 (CHARACTER TABULATION),
U+000A (LINE FEED), U+000D (CARRIAGE RETURN), or a character in the ranges from U+0020 to U+D7FF,
U+E000 to U+FFFD, or U+10000 to U+10FFFF. Since CREPDL schemas are represented by XML documents, other
characters cannot directly occur in CREPDL schemas.
NOTE 2 The first edition used a different namespace name.
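The character ranges listed in NOTE 1 can be expressed as a small predicate. The sketch below is illustrative only; the function name is_xml_char is an assumption of this example.

```python
def is_xml_char(code_point: int) -> bool:
    """True when the code point may occur directly in an XML 1.0 document,
    following the ranges listed in NOTE 1."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )
```

For example, U+0000, the surrogate range starting at U+D800, and U+FFFE fall outside these ranges and therefore cannot be written directly in a CREPDL schema.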
6.2 RELAX NG schema
# The following permission notice and disclaimer shall be included in
# all copies of this schema ("the Schema"), and derivations of
# the Schema:
#
# Permission is hereby granted, free of charge in perpetuity, to any
# person obtaining a copy of the Schema, to use, copy, modify, merge and
# distribute free of charge, copies of the Schema for the purposes of
# developing, implementing, installing and using software based on the
# Schema, and to permit persons to whom the Schema is furnished to do
# so, subject to the following conditions:
#
# THE SCHEMA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SCHEMA OR THE USE OR
# OTHER DEALINGS IN THE SCHEMA.
#
# In addition, any modified copy of the Schema shall include the following
# notice:
#
# THIS SCHEMA HAS BEEN MODIFIED FROM THE SCHEMA DEFINED IN ISO/IEC 19757-7,
# AND SHOULD NOT BE INTERPRETED AS COMPLYING WITH THAT STANDARD.
default namespace = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"
start = coll
coll =
union | intersection | difference | ref | repertoire | char
union = element union { commonAtts, coll+ }
intersection = element intersection { commonAtts, coll+ }
difference = element difference { commonAtts, coll+ }
ref =
element ref {
commonAtts,
attribute href { xsd:anyURI }
}
repertoire =
element repertoire {
commonAtts,
attribute registry { text },
attribute version { text }?,
(attribute name { text } | attribute number {xsd:int})
}
char =
element char {
commonAtts,
(text
| element kernel { commonAtts, text }
| element hull { commonAtts, text }
| (element kernel { commonAtts, text },
element hull { commonAtts, text }))
}
commonAtts =
attribute minUcsVersion { text }?,
attribute maxUcsVersion { text }?,
attribute mode { "character" | "graphemeCluster" }?
#
# Note that xml:id is allowed, since any foreign attribute is
# allowed by the NVDL script.
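As a rough illustration of a document shaped by the grammar above, the following Python sketch parses a small schema instance and lists its element names. The instance EXAMPLE_SCHEMA and the helper local_names are hypothetical, written for this example; the check performed here is far weaker than RELAX NG or NVDL validation and is not a substitute for either.

```python
import xml.etree.ElementTree as ET
from typing import List

CREPDL_NS = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"

# Hypothetical schema instance: a union of a plain char element and a
# char element with separate kernel and hull children.
EXAMPLE_SCHEMA = f"""
<union xmlns="{CREPDL_NS}" mode="character">
  <char>[a-z]</char>
  <char>
    <kernel>[0-9]</kernel>
    <hull>[0-9A-F]</hull>
  </char>
</union>
"""


def local_names(xml_text: str) -> List[str]:
    """Return local element names in document order, rejecting any element
    that is not in the CREPDL namespace."""
    root = ET.fromstring(xml_text.strip())
    names = []
    for element in root.iter():
        namespace, _, local = element.tag[1:].partition("}")
        if namespace != CREPDL_NS:
            raise ValueError(f"unexpected element {element.tag!r}")
        names.append(local)
    return names
```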
6.3 NVDL script
schemaType="application/relax-ng-compact-syntax">
This NVDL script allows foreign elements and attributes everywhere.
6.4 Regular Expressions
The character content of a char, kernel or hull element shall be a Unicode set as specified in UTS35,
5.3.3 (Unicode Sets) of Part 1 (Core).
NOTE A Unicode set is guaranteed to be a regular expression as specified in UTS18 [8].
7 Semantics
7.1 General
This clause shall specify which character repertoire is represented by a CREPDL element. Specifically,
given a code point (which shall be as specified in ISO/IEC 10646) or code point sequence x, this clause
shall specify when x is in the repertoire, when x is not in the repertoire, and when it is unknown whether
x is in the repertoire.
7.2 char
First, the semantics of Unicode sets occurring in kernel and hull elements shall be as specified in UTS35.
The semantics of char shall be defined below.
— Case 1: the char element has neither kernel nor hull as a child element.
It is assumed that this element has a kernel element, the content of which is identical to the
character content of this char element, and also has a hull element, the content of which is identical
to the character content of this char element. The rest shall be the same as in Case 4.
— Case 2: the char element has a kernel element but does not have a hull element.
— in(x, this char element) when x matches the regular expression specified as the content of the
kernel element.
— not-in(x, this char element) never holds.
— unknown(x, this char element) when x does not match the regular expression specified as
the content of the kernel element.
— Case 3: the char element has a hull element but does not have a kernel element.
— in(x, this char element) never holds.
— not-in(x, this char element) when x does not match the regular expression specified as the
content of the hull element.
— unknown(x, this char element) when x matches the regular expression specified as the
content of the hull element.
— Case 4: the char element has a hull element and a kernel element.
— in(x, this char element) when x matches the regular expression specified as the content of the
kernel element.
— not-in(x, this char element) when x does not match the regular expression specified as the
content of the kernel element and x does not match the regular expression specified as the
content of the hull element.
— unknown(x, this char element) when x does not match the regular expression specified as
the content of the kernel element and x matches the regular expression specified as the content
of the hull element.
NOTE 1 It is possible but not a good practice to specify a hull that disallows some code point or code
point sequence in the corresponding kernel. Note that the condition that a code point or code point
sequence is in a repertoire does not mention the hull.
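The four cases above can be sketched as a single Python function. This is a minimal illustration under stated assumptions: Python's re module stands in for Unicode sets (only simple bracket expressions such as [a-z] behave alike in both notations), and the names classify_char and classify_plain_char are hypothetical.

```python
import re
from typing import Optional


def classify_char(x: str, kernel: Optional[str], hull: Optional[str]) -> str:
    """Return "in", "not-in" or "unknown" for x, given the kernel and hull
    expressions of a char element (None when the child element is absent)."""
    in_kernel = kernel is not None and re.fullmatch(kernel, x) is not None
    in_hull = hull is not None and re.fullmatch(hull, x) is not None
    if kernel is not None and hull is None:        # Case 2
        return "in" if in_kernel else "unknown"
    if kernel is None and hull is not None:        # Case 3
        return "unknown" if in_hull else "not-in"
    if kernel is not None and hull is not None:    # Case 4
        if in_kernel:
            return "in"
        return "unknown" if in_hull else "not-in"
    raise ValueError("Case 1: use classify_plain_char with the char content")


def classify_plain_char(x: str, content: str) -> str:
    """Case 1: the character content serves as both kernel and hull."""
    return classify_char(x, content, content)
```

Note that in Case 4 the kernel is tested first, so a hull that excludes part of the kernel does not change the "in" result, in line with NOTE 1.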
Since the semantics of regular expressions depend on the version of the Unicode standard, the
author of a CREPDL schema may specify the intended versions by specifying the minUcsVersion and
maxUcsVersion attributes.
EXAMPLE \p{Nd} represents the set of
characters of the category "Nd" in Unicode Version 4.0.
NOTE 2 It is not guaranteed that every version between these two attribute values specifies the same properties
for every character. However, the CREPDL schema author is assumed to accept the discrepancies.
If the CREPDL processor cannot use some version between these two attribute values, it should report
an error and may stop normal processing.
When a char element does not explicitly specify the minUcsVersion attribute, the nearest ancestor
element having this attribute is searched. If it is found, its attribute value is used. If not found, there is
no lower bound on Unicode versions. When a char element does not explicitly specify the maxUcsVersion
attribute, the nearest ancestor element having this attribute is searched. If it is found, its attribute value
is used. If it is not found, there is no upper bound on Unicode versions.
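The nearest-ancestor lookup described above can be sketched in Python with the standard ElementTree module. The helper name inherited_attribute is an assumption of this example; because ElementTree elements do not record their parent, the sketch first builds a child-to-parent map.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def inherited_attribute(root: ET.Element, target: ET.Element,
                        name: str) -> Optional[str]:
    """Return the value of attribute `name` on `target` or, failing that, on
    its nearest ancestor that has it; None means no bound is specified."""
    parents = {child: parent for parent in root.iter() for child in parent}
    node: Optional[ET.Element] = target
    while node is not None:
        value = node.get(name)
        if value is not None:
            return value
        node = parents.get(node)
    return None
```

A result of None for minUcsVersion corresponds to "no lower bound", and a result of None for maxUcsVersion corresponds to "no upper bound".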
7.3 union
First, define the semantics of union elements <union>A B</union>, which contain two child elements A
and B. A code point or code point sequence shall be in the union repertoire described by this element if
and only if it is in the one described by A or the one described by B. It shall not be in the union repertoire
if and only if it is in neither the one described by A nor the one described by B.
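The two-child rule just stated can be sketched as a small Python function over the three result values. The function name is hypothetical, and the "unknown" branch is an inference of this example from NOTE 2 in Clause 4 (exactly one predicate holds), since it covers every combination not addressed by the two conditions above.

```python
VALUES = ("in", "not-in", "unknown")


def union_membership(a: str, b: str) -> str:
    """Combine the results for child elements A and B of a union element."""
    if a not in VALUES or b not in VALUES:
        raise ValueError("results must be 'in', 'not-in' or 'unknown'")
    if a == "in" or b == "in":
        return "in"          # in the repertoire of A or of B
    if a == "not-in" and b == "not-in":
        return "not-in"      # in neither repertoire
    return "unknown"         # remaining combinations (assumption, see lead-in)
```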
—
...
FINAL DRAFT INTERNATIONAL STANDARD
ISO/IEC FDIS 19757-7
ISO/IEC JTC 1/SC 34
Secretariat: JISC
Voting begins on: 2020-05-12
Voting terminates on: 2020-07-07
Information technology — Document Schema Definition Languages (DSDL) —
Part 7:
Character Repertoire Description Language (CREPDL)
Technologies de l'information — Langages de définition de schéma de documents (DSDL) —
Partie 7: Langage de description de répertoire de caractères (CREPDL)
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
Reference number
ISO/IEC FDIS 19757-7:2020(E)
© ISO/IEC 2020
FINAL DRAFT INTERNATIONAL STANDARD ISO/IEC FDIS 19757-7:2020(E)
Information technology — Document Schema Definition
Languages (DSDL) —
Part 7:
Character Repertoire Description Language (CREPDL)
1 Scope
This document specifies a Character Repertoire Description Language (CREPDL). A CREPDL schema
describes a character repertoire. A stream of UCS code points can be validated against a CREPDL schema.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS)
ISO/IEC 19757-2, Information technology — Document Schema Definition Language (DSDL) — Part 2:
Regular-grammar-based validation — RELAX NG
ISO/IEC 19757-4, Information technology — Document Schema Definition Languages (DSDL) — Part 4:
Namespace-based Validation Dispatching Language (NVDL)
W3C XML, Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation, 16 August
2006, available at http://www.w3.org/TR/2006/REC-xml-20060816
W3C XML-Names, Namespaces in XML (Second Edition), W3C Recommendation, 16 August 2006,
available at http://www.w3.org/TR/2006/REC-xml-names-20060816
IETF RFC 3987, Internationalized Resource Identifiers (IRIs), Internet Standards Track Specification,
January 2005, available at http://www.ietf.org/rfc/rfc3987.txt
Charsets I.A.N.A. IANA CHARACTER SETS, available at http://www.iana.org/assignments/character-sets
Unicode, The Unicode Standard, The Unicode Consortium, available at http://www.unicode.org/
CLDR, Unicode Common Locale Data Repository, The Unicode Consortium, available at
http://www.unicode.org/cldr/
UAX29, Unicode Standard Annex #29: Unicode Text Segmentation, The Unicode Consortium, available at
http://unicode.org/reports/tr29/
UTS35, Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), The Unicode
Consortium, available at https://www.unicode.org/reports/tr35/
UTS37, Unicode Technical Standard #37: Unicode Ideographic Variation Database, The Unicode
Consortium, available at http://www.unicode.org/reports/tr37/
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
CREPDL processor
computer program that validates a stream of code points not containing high- or low-surrogate code
points against CREPDL schemas (3.2)
3.2
CREPDL schema
machine-readable description of a repertoire (3.8)
3.3
grapheme cluster
base character followed by zero or more continuing characters
Note 1 to entry: A grapheme cluster typically represents what the user thinks of as a basic unit of a writing
system for a language.
[SOURCE: UAX 29]
3.4
hull
set of code points or code point sequences (excluding high- or low-surrogate code points) that are not
guaranteed to be excluded from the repertoire (3.8)
3.5
kernel
set of code points or code point sequences (excluding high- or low-surrogate code points) that are
guaranteed to be included by the repertoire (3.8)
3.6
mode
option to specify whether characters or grapheme clusters (3.3) are examined
Note 1 to entry: The first edition did not have modes. Thus, in the first edition, characters could be
examined, but grapheme clusters could not.
3.7
registry
collection of named repertoires (3.8)
3.8
repertoire
description of a set of code points or code point sequences excluding high- or low-surrogate code points
4 Notation
in(x, A): code point or code point sequence x is in the repertoire described by a CREPDL element A;
not-in(x, A): code point or code point sequence x is not in the repertoire described by a CREPDL element A;
unknown(x, A): it is unknown whether code point or code point sequence x is in the repertoire described
by a CREPDL element A.
NOTE 1 This predicate-like notation captures the combination of three-valued logic and the interpretation of
a formula for a given character or grapheme cluster. In other words, in(x, A) implies that the interpretation of A
under x is truth in three-valued logic. Likewise, not-in(x, A) and unknown(x, A) imply the interpretations of A
under x are false and unknown, respectively.
NOTE 2 This document is intended to ensure that exactly one of in(x,A), not-in(x,A), and unknown(x,A) holds.
5 Overview
5.1 Basic constructs and compound constructs
Basic constructs of CREPDL schemas are created from regular expressions or references to registries
of repertoires. Compound constructs of CREPDL schemas are created by combining basic constructs by
set operators such as union, intersection, and difference.
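The following non-normative Python sketch illustrates how results in the three-valued notation of Clause 4 might be combined by set operators. It assumes Kleene's strong three-valued logic; the normative semantics of union, intersection and difference are given in Clause 7, and the names Membership, union, intersection, negate and difference are hypothetical.

```python
from enum import Enum


class Membership(Enum):
    """The three possible results of testing x against a CREPDL element."""
    NOT_IN = 0
    UNKNOWN = 1
    IN = 2


def union(*results):
    # Assumed Kleene OR: the strongest result among the operands wins.
    return max(results, key=lambda m: m.value)


def intersection(*results):
    # Assumed Kleene AND: the weakest result among the operands wins.
    return min(results, key=lambda m: m.value)


def negate(result):
    # IN and NOT_IN swap; UNKNOWN stays UNKNOWN.
    return Membership(2 - result.value)


def difference(first, *rest):
    # Assumed reading: in the first operand and in none of the others.
    return intersection(first, *(negate(r) for r in rest))
```

Under these assumptions, union(Membership.NOT_IN, Membership.UNKNOWN) yields Membership.UNKNOWN, because the second operand does not rule x out.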
5.2 Characters and code points
Although the title of this document is "Character Repertoire Description Language", this document uses
code points more often than characters. This is because CREPDL allows the use of unassigned code
points, which are not characters. For example, U+1CBB is an unassigned code point, and is thus not a
character. It is possible to create a CREPDL schema that allows this code point. A stream containing it is
valid against such a CREPDL schema.
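The following non-normative Python sketch illustrates validation at the level of code points rather than characters. The repertoire is modelled as a plain set of integers, which is a simplification of a CREPDL schema; the function name invalid_code_points and the chosen repertoire are hypothetical.

```python
def invalid_code_points(stream, allowed):
    """Return the code points of the stream that are outside the allowed set."""
    return [cp for cp in stream if cp not in allowed]


# Hypothetical repertoire: LATIN CAPITAL LETTER A plus the code point U+1CBB
# that the text above uses as its example of an unassigned code point.
ALLOWED = {0x0041, 0x1CBB}

# A stream containing U+1CBB is valid because the repertoire allows it.
problems = invalid_code_points([0x0041, 0x1CBB], ALLOWED)
```

An empty result means that every code point of the stream is in the repertoire, whether or not a character is assigned to it.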
5.3 Grapheme clusters
CREPDL can enable the validation of grapheme clusters, which are sequences of code points. For
example, a CREPDL schema can allow LATIN CAPITAL LETTER N (U+004E) or LATIN SMALL LETTER
n (U+006E) followed by COMBINING GRAVE ACCENT (U+0300) while disallowing other characters
followed by COMBINING GRAVE ACCENT (U+0300). Likewise, a CREPDL schema can indicate which
variation selector can follow which CJK unified ideograph.
NOTE The first edition cannot enable the validation of sequences of code points. It was thus not possible
to allow LATIN CAPITAL LETTER N (U+004E) or LATIN SMALL LETTER n (U+006E) followed by COMBINING
GRAVE ACCENT (U+0300) without allowing other characters followed by COMBINING GRAVE ACCENT (U+0300).
5.4 Kernel and Hull
It is sometimes difficult to precisely specify a repertoire. As an example, consider collections in
ISO/IEC 10646, which are numbered and named repertoires. Some collections are open: they contain
assigned code points as well as unassigned code points, which can be assigned in the future.
Recall that some basic constructs of CREPDL schemas are created from regular expressions. Such basic
constructs have pairs of regular expressions. One regular expression specifies what is guaranteed to
be included, while the other specifies what is not guaranteed to be excluded. The former and latter
are called kernel and hull, respectively. If a code point matches the kernel regular expression, the code
point is definitely included in the repertoire. Even if it does not, it is not guaranteed to be excluded
from the repertoire if it matches the hull regular expression.
NOTE Kernel and hull are reproduced from W3C Note-charcol [3]. Some examples in Annex B are also
reproduced from W3C Note-charcol [3].
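The following non-normative Python sketch illustrates a kernel and hull pair for a hypothetical open repertoire. Python regular expressions stand in for the regular expressions of a CREPDL schema. The ranges are chosen only for illustration: the kernel lists the code points treated as assigned, and the hull covers the whole range, including U+1CBB.

```python
import re

# Hypothetical kernel: code points guaranteed to be included.
KERNEL = re.compile(r"[\u1C90-\u1CBA\u1CBD-\u1CBF]")
# Hypothetical hull: code points not guaranteed to be excluded.
HULL = re.compile(r"[\u1C90-\u1CBF]")


def status(ch):
    """Classify a single code point (given as a one-character string)."""
    if KERNEL.fullmatch(ch):
        return "in"
    if HULL.fullmatch(ch):
        return "unknown"  # outside the kernel, but the hull does not exclude it
    return "not-in"
```

Here status("\u1CBB") returns "unknown": the code point is not in the kernel, but a future assignment could bring it into the repertoire.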
6 Syntax
6.1 General
A CREPDL schema shall be an XML document (which shall be as specified in W3C XML and shall further
conform to W3C XML-Names) valid against the NVDL (ISO/IEC 19757-4) script in 6.3, which in turn
relies on the RELAX NG (ISO/IEC 19757-2) schema in 6.2. The elements allowed in the RELAX NG schema
are of the namespace http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0. Further constraints on
the character content of the char, kernel or hull elements are shown in 6.4.
NOTE 1 W3C XML specifies that characters in XML documents are either U+0009 (CHARACTER TABULATION),
U+000A (LINE FEED), U+000D (CARRIAGE RETURN), or a character in the ranges from U+0020 to U+D7FF,
U+E000 to U+FFFD, or U+10000 to U+10FFFF. Since CREPDL schemas are represented by XML documents, other
characters cannot directly occur in CREPDL schemas.
NOTE 2 The first edition used a different namespace name.
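The following non-normative Python sketch restates the character ranges of NOTE 1 as a predicate on code points. The function name is_xml_char is hypothetical.

```python
def is_xml_char(cp):
    """Return True if code point cp may occur directly in an XML 1.0 document."""
    return (
        cp in (0x0009, 0x000A, 0x000D)
        or 0x0020 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

For example, is_xml_char(0xFFFE) is False, so U+FFFE cannot occur directly in a CREPDL schema.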
6.2 RELAX NG schema
# The following permission notice and disclaimer shall be included in
# all copies of this schema ("the Schema"), and derivations of
# the Schema:
#
# Permission is hereby granted, free of charge in perpetuity, to any
# person obtaining a copy of the Schema, to use, copy, modify, merge and
# distribute free of charge, copies of the Schema for the purposes of
# developing, implementing, installing and using software based on the
# Schema, and to permit persons to whom the Schema is furnished to do
# so, subject to the following conditions:
#
# THE SCHEMA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SCHEMA OR THE USE OR
# OTHER DEALINGS IN THE SCHEMA.
#
# In addition, any modified copy of the Schema shall include the following
# notice:
#
# THIS SCHEMA HAS BEEN MODIFIED FROM THE SCHEMA DEFINED IN ISO/IEC 19757-7,
# AND SHOULD NOT BE INTERPRETED AS COMPLYING WITH THAT STANDARD.
default namespace = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"
start = coll
coll =
  union | intersection | difference | ref | repertoire | char
union = element union { commonAtts, coll+ }
intersection = element intersection { commonAtts, coll+ }
difference = element difference { commonAtts, coll+ }
ref =
  element ref {
    commonAtts,
    attribute href { xsd:anyURI }
  }
repertoire =
  element repertoire {
    commonAtts,
    attribute registry { text },
    attribute version { text }?,
    (attribute name { text } | attribute number { xsd:int })
  }
char =
  element char {
    commonAtts,
    (text
     | element kernel { commonAtts, text }
     | element hull { commonAtts, text }
     | (element kernel { commonAtts, text },
        element hull { commonAtts, text }))
  }
commonAtts =
  attribute minUcsVersion { text }?,
  attribute maxUcsVersion { text }?,
  attribute mode { "character" | "graphemeCluster" }?
#
# Note that xml:id is allowed, since any foreign attribute is
# allowed by the NVDL script.
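The following non-normative Python sketch shows a small document intended to follow the grammar above and reads it with the standard library. The schema content is a hypothetical illustration, not an example taken from Annex B. The sketch checks well-formedness and element names only; it does not validate against the RELAX NG schema or the NVDL script.

```python
import xml.etree.ElementTree as ET

NS = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"

# Hypothetical schema: a union of two char elements, the second of which
# has a kernel element followed by a hull element.
SCHEMA = f"""\
<union xmlns="{NS}" mode="character">
  <char>[a-z]</char>
  <char>
    <kernel>[0-9]</kernel>
    <hull>[0-9A-F]</hull>
  </char>
</union>"""


def local_name(element):
    # ElementTree spells qualified names as "{namespace}local".
    return element.tag.split("}", 1)[1]


root = ET.fromstring(SCHEMA)
children = [local_name(child) for child in root]
grandchildren = [local_name(child) for child in root[1]]
```

The root element is union in the CREPDL namespace, its children are two char elements, and the second char element contains kernel and hull in that order, as the grammar requires.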
6.3 NVDL script
schemaType="application/relax-ng-compact-syntax">
This NVDL script allows foreign elements and attributes everywhere.
6.4 Regular Expressions
The character content of a char, kernel or hull element shall be a Unicode set as specified in UTS35,
5.3.3 (Unicode Sets) of Part 1 (Core).
NOTE A Unicode set is guaranteed to be a regular expression as specified in UTS18 [8].
7 Semantics
7.1 General
This clause shall specify which character repertoire is represented by a CREPDL element. Specifically,
given a code point (which shall be as specified in ISO/IEC 10646) or code point sequence x, this clause
shall specify when x is in the repertoire, when x is not in the repertoire, and when it is unknown whether
x is in the repertoire.
7.2 char
First, the semantics of Unicode sets occurring in kernel and hull elements shall be as specified in UTS35.
The semantics of char shall be defined below.
— Case 1: the char element has neither kernel nor hull as a child element.
It is assumed that this element has a kernel element, the content of which is identical to the
character content of this char element, and also has a hull element, the content of which is identical
to the character content of this char element. The rest shall be the same as in Case 4.
— Case 2: the char element has a kernel element but does not have a hull element.
— in(x, char) when x matches the regular expression specified as the content of the
kernel element.
— not-in(x, char) never holds.
— unknown(x, char) when x does not match the regular expression specified as
the content of the kernel element.
— Case 3: the char element has a hull element but does not have a kernel element.
— in(x, char) never holds.
— not-in(x, char) when x does not match the regular expression specified as the
content of the hull element.
— unknown(x, char) when x matches the regular expression specified as the
content of the hull element.
— Case 4: the char element has a hull element and a kernel element.
— in(x, char) when x matches the regular expression specified as the content of the
kernel element.
— not-in(x, char) when x does not match the regular expression specified as the
content of the kernel element and x does not match the regular expression specified as the
content of the hull element.
— unknown(x, char) when x does not match the regular expression specified as
the content of the kernel element and x matches the regular expression specified as the content
of the hull element.
NOTE 1 It is possible but not a good practice to specify a hull that disallows some code point or code
point sequence in the corresponding kernel. Note that the condition that a code point or code point
sequence is in a repertoire does not mention the hull.
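The four cases above can be sketched in Python as follows. This is a non-normative illustration: Python regular expressions stand in for Unicode sets as specified in UTS35, and the function name char_semantics and its parameters text, kernel and hull are hypothetical.

```python
import re


def char_semantics(x, text=None, kernel=None, hull=None):
    """Return "in", "not-in" or "unknown" for x against one char element.

    text is the character content of a char element without child elements;
    kernel and hull are the contents of the child elements, if present.
    """
    if kernel is None and hull is None:
        # Case 1: the character content acts as both kernel and hull.
        kernel = hull = text
    if kernel is not None and re.fullmatch(kernel, x):
        return "in"  # Cases 2 and 4 (and Case 1 via Case 4)
    if hull is None:
        return "unknown"  # Case 2: not-in never holds
    if re.fullmatch(hull, x):
        return "unknown"  # Cases 3 and 4
    return "not-in"  # Cases 3 and 4
```

Because the kernel is tested first, a code point matching the kernel is reported as "in" even when the hull would disallow it, in line with NOTE 1.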
Since the semantics of regular expressions depend on the version of the Unicode standard, the
author of a CREPDL schema may specify the intended versions by specifying the minUcsVersion and
maxUcsVersion attributes.
EXAMPLE \p{Nd} represents the set of
characters of the category "Nd" in Unicode Version 4.0.
NOTE 2 It is not guaranteed that every version between these two attribute values specifies the same
properties for every character. However, the CREPDL schema author is assumed to accept the discrepancies.
If the CREPDL processor cannot use some version between these two attribute values, it should report
an error and may stop normal processing.
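The following non-normative Python sketch shows one way a processor might check its own Unicode version against the minUcsVersion and maxUcsVersion attributes. It assumes that the processor has a single Unicode version available (here taken from the unicodedata module) and that version strings are dot-separated integers; the function names parse_version and supports are hypothetical.

```python
import unicodedata


def parse_version(text):
    """Turn a version string such as "4.0" into a comparable tuple (4, 0, 0)."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad so that "4.0" equals "4.0.0"
    return tuple(parts)


def supports(processor_version, min_version=None, max_version=None):
    """Return True if processor_version lies within the requested bounds."""
    version = parse_version(processor_version)
    if min_version is not None and version < parse_version(min_version):
        return False
    if max_version is not None and version > parse_version(max_version):
        return False
    return True


# The Unicode version of the running Python interpreter.
usable = supports(unicodedata.unidata_version, min_version="4.0")
```

When supports returns False, a processor following this sketch would report an error, as recommended above.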
When a char element does not explicitly specify the minUcsVersion attribute, the nearest ancestor
element having this attribute is searc
...