ISO 24611-1:2025
Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model
This document establishes a framework for the representation of annotations of word-sized units in texts. Such annotations describe tokens, their relationship with lexical units (word-forms), and the relevant morphosyntactic properties. This document proposes a metamodel for morphosyntactic annotation that can be augmented with references to data categories contained in a data category repository conforming to ISO 12620-2. It also defines an XML serialization for morphosyntactic annotations, according to the principles laid out in the TEI Guidelines (see Reference [31]). This document does not apply to structural ambiguities or the structure and composition of morphosyntactic tagsets. This document does not address the linguistic choices that identify tokens or determine the language- or context-particular relationships between tokens and word-forms.
Gestion des ressources linguistiques — Cadre d'annotation morphosyntaxique (MAF) — Partie 1: Modèle de base
Upravljanje z jezikovnimi viri - Ogrodje za oblikoskladenjsko označevanje (MAF) - 1. del: Jedrni model
International Standard
ISO 24611-1
First edition
2025-11
Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model
Gestion des ressources linguistiques — Cadre d'annotation morphosyntaxique (MAF) — Partie 1: Modèle de base
Reference number
© ISO 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 MAF metamodel
4.1 Levels of description in the MAF metamodel
4.2 MAF in the standards landscape
4.3 Metadata
4.4 Structural ambiguities
4.5 MAF metamodel in detail
5 Token-level segmentation
5.1 General remarks
5.2 Formal description:
5.3 Embedding notation
5.4 Stand-off notation
5.5 Normalization and script conversion
5.6 Inline token annotation strategies for token separation
5.6.1 General remarks
5.6.2 Adjacent tokens in embedded mode
5.6.3 Overlapping tokens
6 Word-forms as linguistic units
6.1 General remarks
6.2 Formal description:
6.3 Token attachment
6.3.1 One token: one word-form
6.3.2 Several contiguous tokens: one word-form
6.3.3 Several discontinuous tokens: one word-form
6.3.4 Zero token: one word-form
6.3.5 One token: several word-forms
6.4 Referencing lexical entries
6.5 Compound word-forms
6.6 Identification of word-forms
7 Morphosyntactic content
7.1 General remarks
7.2 Using feature structures
7.3 Compact morphosyntactic tags
7.4 FSR libraries
7.5 Designing morphosyntactic tagsets
8 Handling ambiguities
8.1 General
8.2 Word-form content ambiguities
8.3 Lexical and structural ambiguities
9 Conformance
Annex A (informative) Examples
Annex B (informative) Referencing externally defined data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO’s adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 4, Language resource management.
This first edition of ISO 24611-1 cancels and replaces ISO 24611:2012, which has been technically revised.
The main changes are as follows:
— the data model is fully serialized in TEI XML;
— definitions and text have been revised;
— conformance conditions have been added;
— most of the former Clause 8, dealing with word lattices, has been removed and delegated to a planned
ISO 24611-2;
— the annex of sample data categories has been removed in favour of an external repository of data
categories.
A list of all parts in the ISO 24611 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated
language resources. To this end, it has generalized the modelling strategy initiated by its sister committee,
ISO/TC 37/SC 3, for the representation of terminological data (see Reference [21]), through which linguistic
data models are seen as the combination of a generic data pattern (a metamodel), which is further refined
through a selection of data categories that provide the descriptors for this specific annotation level.
Such models are defined independently of any specific formats and ensure that an implementer has the
necessary conceptual instrument with which to design and compare formats with regard to their degrees of
interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or
directly as objects in a representation that is expressed, for instance, in XML. In order to be shared across
various annotation schemas and encoding applications, such semantics should be implemented as a
centralized repository of concepts: these concepts will henceforth be referred to as data categories. These
data categories are envisioned as having the following two properties:
— From a technical point of view, they should provide unique, stable references (implemented as persistent
identifiers, in the sense of ISO 24619) that specific encoding schemas can use to express their relatedness.
By virtue of that, two annotations will be deemed equivalent if they are defined in relation to the same
data categories (as feature and feature value).
— From a descriptive point of view, each unique semantic reference should be associated with precise
documentation combining a full text elicitation of the meaning of the descriptor with the expression of
specific constraints that bear upon the category.
In the ISO 12620 series, a general framework for representing and maintaining such a repository of data
categories has been developed, potentially encompassing all domains of language resources.
A possible instantiation of ISO 12620-1 is a “flat” marketplace of semantic objects, providing only a limited
set of ontological constraints. The objective of such a setup would be to facilitate the maintenance of a
comprehensive descriptive environment where new categories are easily inserted and re-used without the
need for any strong consistency check with the repository at large. Indeed, the following kinds of constraints
are part of the data category model, as defined in ISO 12620-1:
— Simple generic-specific relations, when these are useful for the proper identification of interoperability
descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of
/noun/ makes it possible to compare morphosyntactic annotations based on different descriptive levels
of granularity.
— The description of conceptual domains that make it possible to identify, when known or applicable, the
range of the possible values of so-called “complex data categories”. For instance, it can be used to record
that possible values of /grammaticalGender/ (limited to a small group of languages, see Reference [21]),
can be a subset of {/masculine/, /feminine/ and /neuter/}.
— Language-specific constraints, either in the form of specific application notes or as explicit restrictions
bearing upon the conceptual domains of complex data categories. For instance, it is possible to express
explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and
/feminine/}.
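The three kinds of constraints listed above can be illustrated with a minimal, non-normative Python sketch. The class `DataCategory`, the dictionaries `REPOSITORY` and `LANGUAGE_DOMAINS`, the language code "fr" and both helper functions are hypothetical names introduced only for this illustration; they are not part of ISO 12620-1 or of any existing repository interface. Only the category names and values come from the examples in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DataCategory:
    """A data category with an optional broader category and value domain."""
    name: str
    broader: Optional[str] = None            # simple generic-specific relation
    domain: FrozenSet[str] = frozenset()     # conceptual domain, if any


# Hypothetical in-memory repository, filled with the examples from the text.
REPOSITORY = {
    "noun": DataCategory("noun"),
    "properNoun": DataCategory("properNoun", broader="noun"),
    "grammaticalGender": DataCategory(
        "grammaticalGender",
        domain=frozenset({"masculine", "feminine", "neuter"}),
    ),
}

# Language-specific restrictions on conceptual domains (assumed keying
# by category name and language code).
LANGUAGE_DOMAINS = {
    ("grammaticalGender", "fr"): frozenset({"masculine", "feminine"}),
}


def is_subcategory(name: str, ancestor: str) -> bool:
    """Follow 'broader' links upwards to test a generic-specific relation."""
    current = REPOSITORY.get(name)
    while current is not None:
        if current.name == ancestor:
            return True
        current = REPOSITORY.get(current.broader) if current.broader else None
    return False


def allowed_values(category: str, language: Optional[str] = None) -> FrozenSet[str]:
    """Return the conceptual domain, narrowed when a language restricts it."""
    general = REPOSITORY[category].domain
    return LANGUAGE_DOMAINS.get((category, language), general)
```

With these definitions, `is_subcategory("properNoun", "noun")` holds, and `allowed_values("grammaticalGender", "fr")` contains only the two French values, while the unrestricted call returns all three.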
This document provides a comprehensive framework for the representation of morphosyntactic annotations
(in their simplest form also referred to as “part of speech annotations” or “POS annotations”). This annotation
level corresponds to the first lexical abstraction level over language data (textual or spoken) and, depending
on the language to be annotated, as well as the characteristics of the annotation tool or annotation scheme
that is being used, can vary enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morphosyntactic annotation,
this document introduces a metamodel that draws a clear distinction between, on the one hand, the level
of tokens (representing the surface segmentation of the source) and, on the other, the level of word-forms
(identifying lexical abstractions associated with groups of tokens). Both these levels can be represented as
simple sequences and as local graphs in order to model constructions such as multiple segmentations and
ambiguous compounds. Elements of these two levels can enter into any kind of n-to-n relationships.
As linguistic segments (sometimes called “markables” in the literature (see, for instance, Reference [18])),
tokens can be delimited in the source document by means of inline mark-up, or they can be identified
remotely (separately from the source document) by means of so-called “stand-off annotations”.
As linguistic abstractions, word-forms can be qualified by various linguistic features characterizing the
morphosyntactic properties that are instantiated in the realization of the lexical entry within the annotated
text. Such properties can range from the simple identification of a lemma up to an explicit reference
to a lexical entry in a dictionary. In most existing applications of morphosyntactic annotation, linguistic
properties are expressed by means of so-called “tags”. These codes refer to basic feature structures (see
early examples in Reference [20]). Such codes can also provide morphological information about the word-form, including
its part of speech (e.g. noun, adjective, verb), and features such as number, gender, person, mood or tense.
In keeping with the general modelling strategy of ISO/TC 37, this document provides means of relating
morphosyntactic tags expressed as feature structures (conforming to ISO 24610-1) to data categories
(conforming to ISO 12620-1). Implementers are encouraged to use external reference taxonomies as
described by ISO 12620-2 either directly, or by building on them in defining their own categories (appropriate
in the coverage, scope or semantics to the requirements of the given encoding project), in conformity with
ISO/TC 37 principles.
Associated with the metamodel, this document also provides a default XML syntax that can be used to serialize
annotation models conforming to the morphosyntactic annotation framework (MAF). Since many existing
projects are based on the Text Encoding Initiative (TEI) Guidelines, see Reference [31] (particularly in digital
humanities, where a proper encoding of textual sources is essential), and since the TEI Guidelines already
offer a variety of constructs and mechanisms to cope with many issues relevant to spoken corpora and their
annotations (see Reference [22] and ISO 24624), the metamodel provided by this document is serialized
as TEI XML. Many word-level annotation mechanisms used in this document elaborate on the proposal of
Reference [23], implemented in the TEI Guidelines.
MAF consists of two parts, referred to as MAF Core (this document) and MAF Lattice (planned as
ISO 24611-2).
Finally, this document forms the conceptual basis for the development of the ISO 24614 series on word
segmentation, whereby all general principles and rules defined in ISO 24614-1, as well as the constraints
expressed in additional parts for specific languages, can be understood according to the token versus word-
form dichotomy.
International Standard ISO 24611-1:2025(en)
Language resource management — Morphosyntactic
annotation framework (MAF) —
Part 1:
Core model
1 Scope
This document establishes a framework for the representation of annotations of word-sized units in texts.
Such annotations describe tokens, their relationship with lexical units (word-forms), and the relevant
morphosyntactic properties. This document proposes a metamodel for morphosyntactic annotation that
can be augmented with references to data categories contained in a data category repository conforming
to ISO 12620-2. It also defines an XML serialization for morphosyntactic annotations, according to the
principles laid out in the TEI Guidelines (see Reference [31]).
This document does not apply to structural ambiguities or the structure and composition of morphosyntactic
tagsets.
This document does not address the linguistic choices that identify tokens or determine the language- or
context-particular relationships between tokens and word-forms.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 24610-1, Language resource management — Feature structures — Part 1: Feature structure representation
W3C XML Recommendation, Extensible Markup Language (XML) 1.0 (Fifth Edition), 26 November 2008,
http://www.w3.org/TR/xml/
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
morphology
description of the structure and formation of words (3.7)
Note 1 to entry: Morphology is traditionally divided into:
a) word-formation (3.2), dealing with the formation of complex lexemes (3.6) out of simpler lexemes: by means
of derivation (often signalled by affixation, i.e. addition of a morpheme (3.11)) or by means of compounding
(combining two or more lexemes);
b) inflection (3.3) that creates inflected forms (3.4).
3.2
word-formation
branch of morphology (3.1), dealing with the creation of new lexemes (3.6) by the processes of derivation or
compounding
3.3
inflection
branch of morphology (3.1), dealing with contextual realizations of lexemes (3.6) as inflected forms (3.4)
3.4
inflected form
concrete form that a lexeme (3.6) can take when used in a sentence or a phrase
3.5
word-form
abstract instantiation of a lexeme (3.6) with the values of morphosyntactic features (3.12) fixed in a
syntactic context
Note 1 to entry: Word-forms can have no acoustic or graphic realization, or can correspond to one or more tokens
(3.21), not necessarily forming a contiguous sequence.
3.6
lexeme
abstract, fundamental unit in the lexicon (3.10) of a language, comprising semantic, formal (phonetic and/or
graphemic) and grammatical information
Note 1 to entry: A complex lexeme is the result of word-formation (3.2) (derivation or compounding) processes; a
simple lexeme can be thought of as the base for such processes. In a lexical entry (3.9), a lexeme is identified by a lemma
(3.8). Word-forms (3.5) are results of the interaction of lexemes with the grammatical system of the given language.
3.7
word
lexeme (3.6), word-form (3.5) or token (3.21)
Note 1 to entry: The term “word” is notoriously ambiguous, standing (at least) for lexeme, word-form or token,
depending on the context of its use. This document attempts to disambiguate this term where relevant.
3.8
lemma
base form
canonical form
lemmatized form
conventional form chosen to represent a lexeme (3.6)
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the
masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are
defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to
be the third person singular with the accomplished aspect.
Note 2 to entry: The term “lemma” is most often used in the context of corpora, as a device to capture the identity of
tokens (3.21) and establish basic correspondence between a token and a lexical entry (3.9). The term that corresponds
to lemma in the context of lexicons (3.10) is “headword”. Mismatches between the two are possible due to the varying
macro- and microstructure of lexical entries. In order to handle such mismatches, apart from lemmas, direct references
to dictionary entries are sometimes added to tokens or word-forms (3.5) in corpora.
3.9
lexical entry
container for managing a set of word-forms (3.5) and possibly one or more meanings that describe a lexeme (3.6)
3.10
lexicon
resource comprising a collection of lexical entries (3.9) for a language
3.11
morpheme
exponent that signals a modification of a lexeme (3.6)
Note 1 to entry: This definition adheres to a lexeme-based approach to morphology (3.1) where it is the lexeme, not
the morpheme, that encodes the linguistic sign. On this approach, the morpheme is a unit of form (an exponent) that
marks various kinds of modifications (e.g. derivation or inflection (3.3)) of a lexeme.
Note 2 to entry: Morphemes can usually be divided into derivational and inflectional (signalling a morphosyntactic
category). Sometimes a modification of a lexeme is not overtly marked, and sometimes the morpheme is a combined
(fused) exponent of various kinds of morphosyntactic information.
Note 3 to entry: On morpheme-based (as opposed to lexeme-based) approaches, the morpheme is defined as the
minimal linguistic sign (a combination of the meaning and the form). On these approaches, the term “morph” is used
roughly in the meaning that is used for the term “morpheme” in this document.
3.12
morphosyntactic feature
feature induced from either the inflected form (3.4) of a lexeme (3.6) or from its syntactic context, or both
EXAMPLE “grammaticalGender”.
Note 1 to entry: Universal Dependencies (see Reference [26]) offer a set of general and language-specific features and
values, designed for pragmatically uniform cross-linguistic grammatical description.
3.13
part of speech
POS
grammatical category
word class
category assigned to a word (3.7) based on its grammatical and semantic properties
EXAMPLE Noun, verb.
3.14
morphosyntactic tag
label identifying a feature structure (3.16) used to qualify a word-form (3.5) within an established taxonomy
Note 1 to entry: Morphosyntactic tags can be atomic labels (“N” for “noun”), but very often they are mnemonic
representations for the feature structures that they identify (“NNL2” for “plural locative noun” in the CLAWS-7 tagset,
see Reference [28]). The relevant feature structures can also be encoded by character vectors, as in “N12201” for
“common noun, feminine, plural, countable” in the EAGLES intermediate tagset (see Reference [29]) or by agglutinated
shorthand feature identifiers, as in “subst:pl:gen:m3” for “noun, plural, genitive, masculine, inanimate” in the NKJP
tagset (see Reference [30]).
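As a non-normative illustration of how an agglutinated shorthand tag identifies a feature structure, the following Python sketch expands the tag "subst:pl:gen:m3" into feature specifications. The table `SHORTHAND`, the function `expand_tag` and the feature names are assumptions made for this example; only the tag and its gloss ("noun, plural, genitive, masculine, inanimate") come from the note above, and the table does not reproduce the actual tagset definition.

```python
from typing import Dict

# Hypothetical expansion table: shorthand identifier -> (feature, value).
# Only the identifiers needed for the example are listed.
SHORTHAND = {
    "subst": ("partOfSpeech", "noun"),
    "pl": ("grammaticalNumber", "plural"),
    "gen": ("case", "genitive"),
    "m3": ("grammaticalGender", "masculine inanimate"),
}


def expand_tag(tag: str) -> Dict[str, str]:
    """Expand a colon-separated tag into a flat feature structure."""
    features: Dict[str, str] = {}
    for part in tag.split(":"):
        if part not in SHORTHAND:
            raise KeyError(f"unknown shorthand identifier: {part!r}")
        feature, value = SHORTHAND[part]
        features[feature] = value
    return features
```

A flat dictionary is enough here because each shorthand identifier fixes exactly one feature; nested feature structures would need a richer representation.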
3.15
morphosyntactic tagset
comprehensive set of morphosyntactic tags (3.14) used for the morphosyntactic description of a language
3.16
feature structure
set of feature specifications (3.17)
[SOURCE: ISO 24610-1:2006, 3.10, modified — Note 1 to entry deleted.]
3.17
feature specification
assignment of a value to a feature
Note 1 to entry: Formally, it is treated as a pair of a feature and its value.
[SOURCE: ISO 24610-1:2006, 3.9]
3.18
phoneme
minimal unit in the sound system of a language
3.19
phonetic transcription
representation or modelling of spoken language based on the sound system of the respective language
[SOURCE: ISO 24624:2016, 3.5]
3.20
character
graphic character
element of a writing system, whether or not alphabetical, that represents a phoneme (3.18), a syllable, a word
(3.7) or even prosodic characteristics of the language, by using graphical symbols (letters, diacritical marks,
syllabic signs, punctuation marks, prosodic accents, etc.) or a combination of these signs (a letter having an
accent or a diacritical mark)
EXAMPLE a, B, ω or Γ are, therefore, characters as well as basic letters.
Note 1 to entry: See also ISO/IEC 2382:2015, 2121335, and ISO 15919:2001, 4.3 (graphic character).
[SOURCE: ISO 7098:2015, 2.1, modified — Note 1 to entry added.]
3.21
token
non-empty contiguous sequence of characters (3.20) in a document
Note 1 to entry: For editorial reasons, some annotation schemes extend the notion of token to an empty sequence.
3.22
tokenization
process that segments a language data stream into individual tokens (3.21)
3.23
transcription
representation of the sounds of a source language by graphic characters (3.20) associated with a target
language
[SOURCE: ISO 15919:2001, 4.6]
3.24
transliteration
representation of the graphic characters (3.20) of a source script (3.26) by the graphic characters of a target script
Note 1 to entry: In transcription, pronunciation conventions are of primary importance, while in transliteration,
writing conventions are of primary importance.
[SOURCE: ISO 15919:2001, 4.7]
3.25
script conversion
representing graphic characters (3.20) from a source script (3.26) by the graphic characters of a target
script, most commonly by romanization (3.31)
Note 1 to entry: The two basic methods of conversion of a system of writing are transliteration (3.24) and transcription
(3.23). The use of the terms “source script” and “target script” in transliteration is analogous to the terms “source
language” and “target language” in translation.
[SOURCE: ISO 15919:2001, 4.1, modified — “script” used as attribute of the main term.]
3.26
script
set of graphic characters (3.20) used for the written form of one or more languages
Note 1 to entry: A script, as opposed to an arbitrary subset of characters, is defined in distinction to other scripts; it
is possible that readers of one script are unable to read another script easily, even where there is a historic relation
between them.
[SOURCE: ISO 15924:2022, 3.7, modified — “in general” deleted and “may” replaced with “it is possible” in
Note 1 to entry. Note 2 to entry deleted.]
3.27
word lattice
set of possible alternative decompositions of a text or speech segment into word-forms (3.5)
Note 1 to entry: A word lattice has the algebraic properties of a directed acyclic graph (3.28) with an initial node and a
final node.
Note 2 to entry: See also directed acyclic graph (3.28) and finite state automata (3.29).
Note 3 to entry: Word lattices are the topic of ISO 24611-2 (see footnote 1).
3.28
directed acyclic graph
DAG
digraph
graph with directed edges and no cycles
Note 1 to entry: Directed acyclic graphs are a subset of finite state automata (3.29).
3.29
finite state automata
FSA
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also directed acyclic graph (3.28).
3.30
data category
class of data items that are closely related from a formal or semantic point of view
EXAMPLE /part of speech/, /subject field/, /definition/.
Note 1 to entry: A data category can be viewed as a generalization of the notion of a field in a database.
Note 2 to entry: In running text, such as in this document, data category names are enclosed in forward slashes
(e.g. /part of speech/).
[SOURCE: ISO 30042:2019, 3.8]
3.31
romanization
conversion of non-Latin graphic characters (3.20) into Latin graphic characters, using either transliteration
(3.24) or transcription (3.23)
1) Planned.
4 MAF metamodel
4.1 Levels of description in the MAF metamodel
Morphosyntactic annotations provide an important layer of linguistic information in a document. This
document is based on a metamodel that draws a clear distinction between two levels of description:
— the level of tokens (representing the surface segmentation of the source);
— the level of word-forms (identifying lexical abstractions associated with groups of tokens).
These two levels have the following property in common: they can be represented as simple sequences and
as local graphs (for the purpose of describing, for example, multiple possible segmentations or ambiguous
compounds). Any n-to-n relationship can obtain between word-forms and tokens. Word-forms can be
aggregated to form maximal units (such as compound words or multi-word units) that act as elementary
units for other levels of linguistic analysis, particularly syntax. In particular, word-forms in many cases
correspond directly to the terminal level defined in ISO 24615-1.
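The two levels and the n-to-n relationship between them can be sketched as follows. This is a minimal, non-normative Python illustration: the classes `Token` and `WordForm`, their field names, the identifiers and the helper `word_forms_by_token` are hypothetical and are not the serialization defined by this document. The sample data are unrelated French items chosen for illustration (they do not form a sentence): the token "du" realizing the two word-forms "de" and "le", and the three tokens of "pomme de terre" realizing one word-form.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """Surface segment of the source."""
    ident: str
    form: str


@dataclass
class WordForm:
    """Lexical abstraction attached to zero or more tokens."""
    ident: str
    lemma: str
    token_ids: List[str] = field(default_factory=list)


TOKENS = [
    Token("t1", "du"),
    Token("t2", "pomme"),
    Token("t3", "de"),
    Token("t4", "terre"),
]

WORD_FORMS = [
    WordForm("w1", "de", ["t1"]),                          # one token,
    WordForm("w2", "le", ["t1"]),                          # several word-forms
    WordForm("w3", "pomme de terre", ["t2", "t3", "t4"]),  # several tokens
    WordForm("w4", "PLACEHOLDER", []),                     # zero tokens (placeholder lemma)
]


def word_forms_by_token(word_forms: List[WordForm]) -> Dict[str, List[str]]:
    """Invert the attachment: token identifier -> word-form identifiers."""
    index: Dict[str, List[str]] = {}
    for word_form in word_forms:
        for token_id in word_form.token_ids:
            index.setdefault(token_id, []).append(word_form.ident)
    return index
```

Storing the attachment on the word-form side keeps tokens free of linguistic interpretation; the inverse index is derived when needed.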
4.2 MAF in the standards landscape
Figure 1 presents a simplified view of the proposed metamodel for morphosyntactic annotations, together
with the place of MAF in the context of other standards for language description.
An annotated document comprises an original document and a set of annotations. Annotations are in most
cases associated with word-forms, which correspond to zero or more tokens in the original document.
A word-form can also be associated with a lexical entry providing information about its underlying lemma
and its inflected form(s). The morphosyntactic annotation associated with a word-form is represented
by a tag, which can also be expressed as a feature structure. A set of such tags used by a particular
annotation scheme is referred to as a “morphosyntactic tagset” and corresponds to what is defined in the
ISO 24610-1-specified feature structure representation (FSR) as a feature-structure library. Each discrete
category within such a tagset should be describable in terms of data categories as described in ISO 12620-1,
and implemented in a centralized repository of data categories conforming to ISO 12620-2. See Annex B for
an illustration.
Figure 1 — Simplified view of MAF metamodel in the context of other ISO standards
4.3 Metadata
The metadata needed to properly describe language resources can be handled by standards of the CMDI
family (CMDI = Component Metadata Infrastructure, see ISO 24622-1 and others) or similar established
standards, beyond the mechanisms provided by the TEI, in particular by the set of vocabulary and structural
relationships defined for the TEI Header.
NOTE 1 See Chapter 2 of Reference [31].
TEI-external metadata descriptions can even be made part of the TEI Header, by virtue of its
element.
NOTE 2 See Reference [32].
4.4 Structural ambiguities
Because annotation can be applied both to tokens and to word-forms, and given the n-to-n relationship
between the elements of both levels, combined annotations can be in structural conflict. Hence, annotation
is typically conceptualized as one or more streams, each represented as a word lattice or more formally as
a directed acyclic graph (DAG).
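The idea of a word lattice as a directed acyclic graph with an initial and a final node can be illustrated by a small, non-normative Python sketch. The edge representation, the node numbers and the function `paths` are assumptions made for this example and do not anticipate the model planned for ISO 24611-2. Each path from the initial node to the final node corresponds to one alternative analysis of the same segment.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical lattice: node -> list of (target node, word-form label).
# Node 0 is the initial node and node 3 the final node.
LATTICE: Dict[int, List[Tuple[int, str]]] = {
    0: [(1, "pomme"), (3, "pomme de terre")],
    1: [(2, "de")],
    2: [(3, "terre")],
}


def paths(edges: Dict[int, List[Tuple[int, str]]], start: int, end: int) -> Iterator[List[str]]:
    """Yield the label sequence of every path from start to end.

    The graph is assumed to be acyclic, so the recursion terminates.
    """
    if start == end:
        yield []
        return
    for target, label in edges.get(start, []):
        for rest in paths(edges, target, end):
            yield [label] + rest
```

For the lattice above, the function yields the three-word analysis and the single compound analysis as two separate paths.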
4.5 MAF metamodel in detail
This subclause presents the MAF metamodel in more detail. The primary focus is on the core model, which
describes the relationship between the two levels of description as well as the function and placement of
annotations. Word lattices are the focus of ISO 24611-2. The metamodel is diagrammed in Figure 2.
Figure 2 — UML view of MAF metamodel
5 Token-level segmentation
5.1 General remarks
Clause 5 looks at the serialization of the MAF metamodel shown in Clause 4. Unless explicitly stated, all
constructs serializing the MAF metamodel shall be valid TEI representations, which means that the
specification described in this document is a customization of the TEI Guidelines (see Reference [31]; for a
thorough discussion of the concepts of customization and conformance, see Chapter 23 of the TEI Guidelines). MAF
documents shall thus also be well-formed XML documents as specified by the W3C XML Recommendation.
In all XML examples in this document and in order to simplify the actual representations, it is assumed,
unless otherwise stated, that all XML elements belong to the TEI namespace, as defined by the following
XML namespace declaration:
xmlns="http://www.tei-c.org/ns/1.0"
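As a non-normative illustration, the following Python sketch checks two of the conditions stated above: that a document is well-formed XML and that its elements belong to the TEI namespace. The function name `elements_outside_tei` and the sample document are assumptions made for this example. The check does not validate a document against the TEI schema or the ODD specification; it only inspects namespaces.

```python
import xml.etree.ElementTree as ET
from typing import List

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical minimal sample; not a complete or valid TEI document.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<text><body><p>La petite fille</p></body></text>"
    "</TEI>"
)


def elements_outside_tei(xml_text: str) -> List[str]:
    """Return the tags of all elements that are not in the TEI namespace.

    ET.fromstring raises xml.etree.ElementTree.ParseError when the
    input is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    prefix = "{" + TEI_NS + "}"
    # ElementTree writes namespaced tags as "{namespace}localname".
    return [el.tag for el in root.iter() if not el.tag.startswith(prefix)]
```

An empty result means every element is in the TEI namespace; elements from other namespaces are reported with their expanded names.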
An updated TEI ODD specification illustrating this document is available from Reference [24] together with
additional examples.
Morphosyntactic annotations reference segments, called “tokens”, that are present in the document flow,
but this does not imply that the resulting segmentation constitutes a sequence of adjacent segments
partitioning the original document. Some parts of a document can carry no annotations (e.g. typographic
marks, stage directions and markup elements), while other parts do not always correspond exactly to their
segmented form (e.g. abbreviations, brachygraphies, orthographic errors and variations, and typographic
and morphological contractions).
It is particularly important to distinguish word-forms from their realizations. A word-form does not always
correspond exactly to a segment delimited solely by orthographic marks such as white spaces or hyphens
(e.g. for German compound words, speech transcription, Sanskrit writing).
The following list shows typical examples of tokenized inputs in three languages, with the original linguistic
segment on the left and the corresponding representation of tokens as vertical-bar-separated strings on
the right:
— La petite fille La|petite|fille
— 白菜和猪肉 白|菜|和|猪|肉
— Don’t wanna do that Do|n’t|wan|na|do|that
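The notation used in the list above can be reproduced with a short sketch. The Python below is an illustration only: the helper names are hypothetical, and the token lists are copied by hand from the three examples, since this document does not prescribe a tokenization algorithm.

```python
def bar_notation(tokens: list) -> str:
    """Render a list of tokens as a vertical-bar-separated string."""
    return "|".join(tokens)


def squeeze(text: str) -> str:
    """Remove all white space from a string."""
    return "".join(text.split())


def covers_source(source: str, tokens: list) -> bool:
    """Check that the tokens, read in order, spell out the source
    once white space is ignored."""
    return squeeze("".join(tokens)) == squeeze(source)


# Segment and hand-copied token list for each example in the list above.
EXAMPLES = [
    ("La petite fille", ["La", "petite", "fille"]),
    ("白菜和猪肉", ["白", "菜", "和", "猪", "肉"]),
    ("Don’t wanna do that", ["Do", "n’t", "wan", "na", "do", "that"]),
]

for source, tokens in EXAMPLES:
    print(source, "->", bar_notation(tokens))
```

The `covers_source` check holds for these three examples, but it is not a general requirement: as noted above, a segmentation does not have to partition the original document.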
The TEI element is used to represent those segments of the original document which, in approximate
terms, are delimited by orthographic, morphological or phonological boundaries. This document does not
review the linguistic correlates of tokens. Depending on the language, a token can be identified through
typographic properties (presence of white-space, hyphens or special characters), phonological properties
(e.g. linking phenomena, hiatus, elision or final-obstruent devoicing, such as the “Auslautverhärtung” in
German), morphological properties (constituting a root, stem, affix, morpheme, etc.), or by a mixture of them.
Also not covered by this document are those aspects of a writing system that are used to format pages or to
separate words and paragraphs, or to provide similar encoding information, since these do not constitute
morphosyntactic annotation.
5.2 Formal description:
The token level in MAF is implemented by means of the TEI element. The element can use the attributes
shown in Table 1.
Table 1 — attributes
Attribute    Description
@startPos    initial span boundary
@endPos      final span boundary
@join        relationship with neighbouring tokens
@norm        normalized form of the token
@phon        phonetic transcription
@transcr     general transcription
@translit    transliteration to some other script
Table 1 contains a selection of attributes defined for the element by the TEI Guidelines (see
Reference [31]) and by the TEI ODD specification document (see Reference [24]) that customizes the
TEI Guidelines for the purpose of serializing the abstract MAF datamodel. The normalization strategy, the
kind and granularity level of phonetic transcription, the type of general transcription used (if any), as well
as the type of transliteration should all be described in the TEI header accompanying the given MAF-TEI
document.
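For readers who process MAF data programmatically, the attributes in Table 1 can be mirrored in a small record type. The Python sketch below is an illustration under stated assumptions: the class name `TokenAttrs`, the use of plain strings for all values, and the choice to treat every attribute as optional are not taken from this document or from the TEI Guidelines.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TokenAttrs:
    """Hypothetical record mirroring the attribute names listed in Table 1."""

    startPos: Optional[str] = None   # initial span boundary
    endPos: Optional[str] = None     # final span boundary
    join: Optional[str] = None       # relationship with neighbouring tokens
    norm: Optional[str] = None       # normalized form of the token
    phon: Optional[str] = None       # phonetic transcription
    transcr: Optional[str] = None    # general transcription
    translit: Optional[str] = None   # transliteration to some other script

    def as_xml_attributes(self) -> dict:
        """Return only the attributes that are set, in Table 1 order."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


# Hypothetical usage: a token whose normalized form differs from its surface form.
attrs = TokenAttrs(norm="not")
print(attrs.as_xml_attributes())
```

Keeping unset attributes out of the result reflects the fact that a token is not required to carry all of them at once.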
Many approaches to lightweight grammatical annotation do not apply the full distinction between tokens
and word-forms systematically. Instead, realizations of these two annotation layers are merged into a single
stream of markup. In this vein, the TEI Guidelines (see Reference [31]) offer an option of encoding tokens
by means of (“word”) and (“punctuation character”) elements, while at the same time making it
possible to encode morphosyntactic and morpholexical information directly in these elements. That should
not be seen as “wrong” but rather “pragmatic” or “shorthand” – very often, the token versus word-form
mismatches are not significant for the given language or the given data set, and it is more important to
provide explicit tokenization with part-of-speech identification, even if some fringe cases are then left to be
dealt with, often in ad hoc ways. A representation such as go!
represents a union of the level of tokens (with two anonymous sequences of characters, "go" and "!") and the
level of word-forms, which provides the part-of-speech identification and adds to the anonymous segments
the redundant information concerning their status, respectively, as “word” and as a punctuation character.
A standard should tease these layers apart and demonstrate how they can be combined for practical reasons,
and that is why, in this document, tokens are identified by means of anonymous elements, agnostic
with respect to anything that properly belongs at the word-form level of description.
5.3 Embedding notation
It is not always necessary to separate the original document from its annotations. In simple cases, textual
content can be directly embedded within elements in the form of inline annotation. An example is
shown in Figure 3.
Figure 3 — Inline annotation of tokens for the sentence “The victim’s friends told the police
that Krueger drove into the quarry and never surfaced.” (en), with significant whitespace
Given the fragile nature of whitespace across different editing and publishing environments, this style of
inline encoding, while attested in some text corpora, is not commonly used for text-technological purposes,
which often arrange all the tokens in a single vertical sequence, as shown in Figure 4.
NOTE The @join attribute compensates for the absence of significan
...
SLOVENIAN STANDARD
1 October 2024
Upravljanje z jezikovnimi viri - Ogrodje za oblikoskladenjsko označevanje (MAF) - 1. del: Jedrni model
Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model
Gestion des ressources linguistiques - Cadre d'annotation morphosyntaxique (MAF) — Partie 1: Modèle de base
This Slovenian standard is identical to: ISO/DIS 24611-1
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
DRAFT International Standard
ISO/DIS 24611-1
ISO/TC 37/SC 4
Secretariat: KATS
Voting begins on: 2024-07-25
Voting terminates on: 2024-10-17
Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model
Gestion des ressources linguistiques - Cadre d'annotation morphosyntaxique (MAF) — Partie 1: Modèle de base
ICS: 01.020
THIS DOCUMENT IS A DRAFT CIRCULATED FOR COMMENTS AND APPROVAL. IT IS THEREFORE SUBJECT TO CHANGE AND MAY NOT BE REFERRED TO AS AN INTERNATIONAL STANDARD UNTIL PUBLISHED AS SUCH.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
This document is circulated as received from the committee secretariat.
Reference number
ISO/DIS 24611-1:2024(en)
© ISO 2024
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF metamodel
4.1 Levels of description in the MAF metamodel
4.2 MAF in the standards landscape
4.3 Metadata
4.4 Structural ambiguities
4.5 MAF metamodel in detail
5 Token-level segmentation
5.1 Introduction
5.2 Formal description:
5.3 Embedding notation
5.4 Stand-off notation
5.5 Normalization and script conversion
5.6 Inline token annotation strategies for token separation
5.6.1 Introduction
5.6.2 Adjacent tokens in embedded mode
5.6.3 Overlapping tokens
6 Word-forms as linguistic units
6.1 Introduction
6.2 Formal description:
6.3 Token attachment
6.3.1 One token : one word-form
6.3.2 Several contiguous tokens : one word-form
6.3.3 Several discontinuous tokens : one word-form
6.3.4 Zero token : one word-form
6.3.5 One token : several word-forms
6.4 Referencing lexical entries
6.5 Compound word-forms
6.6 Identification of word-forms
7 Morphosyntactic content
7.1 Introduction
7.2 Using feature structures
7.3 Compact morphosyntactic tags
7.4 FSR libraries
7.5 Designing morphosyntactic tagsets
8 Handling ambiguities
8.1 Introduction
8.2 Word-form content ambiguities
8.3 Lexical and structural ambiguities
9 Conformance
Annex A (informative) Examples
Annex B (informative) Referencing externally defined data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 4, Language resource management.
This first edition of ISO 24611-1 cancels and replaces ISO 24611:2012, which has been technically revised.
The main changes are as follows:
— the data model is fully serialised in TEI XML;
— definitions and text have been revised;
— conformance conditions have been added;
— most of Clause 8, dealing with word lattices, has been removed and delegated to a planned part 2 of the
ISO 24611 series;
— informative annex of sample data categories has been removed in favour of an external repository of
data categories.
A list of all parts in the ISO 24611 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated
language resources. To this end, it has generalized the modelling strategy initiated by its sister committee,
ISO/TC 37/SC 3, for the representation of terminological data (see [22]), through which linguistic data models
are seen as the combination of a generic data pattern (a metamodel), which is further refined through a
selection of data categories that provide the descriptors for this specific annotation level. Such models are
defined independently of any specific formats and ensure that an implementer has the necessary conceptual
instrument with which to design and compare formats with regard to their degrees of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or
directly as objects in a representation that is expressed, for instance, in XML. In order to be shared across
various annotation schemas and encoding applications, such semantics should be implemented as a
centralized repository of concepts: we will henceforth refer to these concepts as data categories. These data
categories are envisioned as having the following two properties:
— From a technical point of view, they should provide unique, stable references (implemented as persistent
identifiers, in the sense of ISO 24619) that specific encoding schemas can use to express their relatedness.
By virtue of that, two annotations will be deemed equivalent if they are defined in relation to the same
data categories (as feature and feature value).
— From a descriptive point of view, each unique semantic reference should be associated with precise
documentation combining a full text elicitation of the meaning of the descriptor with the expression of
specific constraints that bear upon the category.
In the ISO 12620 series, a general framework for representing and maintaining such a repository of data
categories has been developed, potentially encompassing all domains of language resources. That initiative
makes it possible to implement an online environment providing access to data categories that various
language resource-related activities within ISO should align with.
A possible instantiation of ISO 12620-1 is a ‘flat’ marketplace of semantic objects, providing only a limited
set of ontological constraints. The objective of such a setup would be to facilitate the maintenance of a
comprehensive descriptive environment where new categories are easily inserted and re-used without the
need for any strong consistency check with the repository at large. Indeed, the following kinds of constraints
are part of the data category model, as defined in ISO 12620-1:
— simple generic-specific relations, when these are useful for the proper identification of interoperability
descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of
/noun/ makes it possible to compare morphosyntactic annotations based on different descriptive levels
of granularity;
— the description of conceptual domains, in the sense of the ISO/IEC 11179 series, to identify, when known
or applicable, the possible value of so-called complex data categories. For instance, it can be used to
record that possible values of /grammaticalGender/ (limited to a small group of languages, see [22]),
could be a subset of {/masculine/, /feminine/ and /neuter/};
— language-specific constraints, either in the form of specific application notes or as explicit restrictions
bearing upon the conceptual domains of complex data categories. For instance, it is possible to express
explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and
/feminine/}.
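The three kinds of constraints listed above can be pictured with a miniature in-memory repository. The Python sketch below is purely illustrative: the dictionary layout, the function names and the language code "fr" are assumptions of this example and do not reflect the data model or any interface defined in ISO 12620-1. Only the category names and values come from the examples in the list.

```python
# Generic-specific relations: each category maps to its broader category.
BROADER = {"properNoun": "noun"}

# Conceptual domains: possible values of complex data categories.
CONCEPTUAL_DOMAIN = {
    "grammaticalGender": {"masculine", "feminine", "neuter"},
}

# Language-specific restrictions on a conceptual domain.
LANGUAGE_RESTRICTION = {
    ("grammaticalGender", "fr"): {"masculine", "feminine"},
}


def is_subcategory(category, ancestor):
    """Return True if category equals ancestor or is below it in BROADER."""
    while category is not None:
        if category == ancestor:
            return True
        category = BROADER.get(category)
    return False


def allowed_values(category, lang=None):
    """Return the conceptual domain, narrowed by a language restriction if one exists."""
    domain = CONCEPTUAL_DOMAIN.get(category, set())
    restriction = LANGUAGE_RESTRICTION.get((category, lang))
    if restriction is None:
        return set(domain)
    return domain & restriction


print(is_subcategory("properNoun", "noun"))
print(sorted(allowed_values("grammaticalGender", "fr")))
```

With such a structure, an annotation that uses /properNoun/ can be compared with one that only distinguishes /noun/ by walking up the generic-specific relation.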
This document provides a comprehensive framework for the representation of morphosyntactic annotations
(in their simplest form also referred to as ‘part of speech’ or ‘POS’). This annotation level corresponds to the
first lexical abstraction level over language data (textual or spoken) and, depending on the language to be
annotated, as well as the characteristics of the annotation tool or annotation scheme that is being used, can
vary enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morphosyntactic annotation,
this document introduces a metamodel that draws a clear distinction between, on the one hand, the level
of tokens (representing the surface segmentation of the source) and, on the other, the level of word-forms
(identifying lexical abstractions associated with groups of tokens). Both these levels can be represented as
simple sequences and as local graphs such as multiple segmentations and ambiguous compounds; elements
of these two levels can enter into any kind of n-to-n relationships.
As linguistic segments (sometimes called ‘markables’ in the literature; see, for instance, [19]), tokens may be
delimited in the source document by means of inline mark-up, or they may be identified remotely (separately
from the source document) by means of so-called stand-off annotations.
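As an informal illustration of remote identification, the Python sketch below keeps the tokens apart from the source text as pairs of character offsets. The zero-based, end-exclusive offset convention and all names are assumptions of this sketch, not requirements of this document.

```python
# Source text, left untouched by the annotation.
SOURCE = "La petite fille"

# Stand-off tokens: (start, end) character offsets into SOURCE.
# Assumed convention: zero-based start, end-exclusive.
STANDOFF_TOKENS = [(0, 2), (3, 9), (10, 15)]


def resolve(source, spans):
    """Return the substring of source that each (start, end) span points to."""
    return [source[start:end] for start, end in spans]


print(resolve(SOURCE, STANDOFF_TOKENS))
```

Because the annotation stores offsets only, the source text stays byte-for-byte unchanged, which is the main practical difference from inline mark-up.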
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morphosyntactic properties that are instantiated in the realization of the lexical entry within the annotated
text. Such properties may range from the simple identification of a lemma up to an explicit reference to
a lexical entry in a dictionary. In most existing applications of morphosyntactic annotation, linguistic
properties are expressed by means of so-called tags; these codes refer to basic feature structures (see early
examples in [21]). Such codes may also provide morphological information, including part of speech (e.g.
noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
In keeping with the general modelling strategy of ISO/TC 37, this document provides means of relating
morphosyntactic tags expressed as feature structures (compliant with ISO 24610-1) to data categories
(compliant with ISO 12620-1). Implementers are encouraged to use external reference taxonomies as
described by ISO 12620-1 either directly, or by building on them in defining their own categories (appropriate
in the coverage, scope or semantics to the requirements of the given encoding project), in compliance with
ISO/TC 37 principles.
Alongside the metamodel, this document also provides a default XML syntax that can be used to serialize
annotation models compliant with the Morphosyntactic Annotation Framework (MAF). Since many
existing projects are based on the Text Encoding Initiative (TEI) guidelines (see [32]) — particularly in
Digital Humanities, where a proper encoding of textual sources is essential — and since the TEI guidelines
already offer a variety of constructs and mechanisms to cope with many issues relevant to spoken corpora
and their annotations (see [23] and ISO 24624), the metamodel provided by this document is serialized as
TEI XML. Many word-level annotation mechanisms used here elaborate on the proposal of Reference [24],
implemented in the TEI Guidelines.
Finally, it should be noted here that this document forms the conceptual basis for the development of the
ISO 24614 series on word segmentation, whereby all general principles and rules defined in ISO 24614-1, as
well as the constraints expressed in additional parts for specific languages, are to be understood according
to the token vs. word-form dichotomy.
Language resource management — Morphosyntactic
annotation framework (MAF) —
Part 1:
Core model
1 Scope
This document provides a framework for the representation of annotations of word-sized units in texts.
Such annotations describe tokens, their relationship with lexical units (word-forms), and the relevant
morphosyntactic properties. This document proposes a metamodel for morphosyntactic annotation that
can be augmented with references to data categories contained in an ISO-12620-1-compliant data category
repository. It also defines an XML serialization for morphosyntactic annotations, according to the principles
laid out in the TEI Guidelines.
The Morphosyntactic Annotation Framework consists of two parts, referred to as MAF Core (this document)
and MAF Lattice (planned as ISO 24611-2).
Structural ambiguities are not in the scope of this document, and neither is the structure and composition of
morphosyntactic tagsets.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 24610-1:2006, Language resource management — Feature structures — Part 1: Feature structure
representation
TEI P5, Guidelines for Electronic Text Encoding and Interchange. Version 4.7.0. Last updated on 16 November
2023. TEI Consortium. https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html
W3C XML Recommendation, Extensible Markup Language (XML) 1.0 (Fifth Edition), 26 November 2008,
http://www.w3.org/TR/xml/
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
morphology
description of the structure and formation of words (3.7)
Note 1 to entry: Morphology is traditionally divided into (a) word-formation (3.2) – dealing with the formation of
complex lexemes (3.6) out of simpler lexemes: by means of derivation (often signalled by affixation, i.e., addition of
a morpheme (3.11)) or by means of compounding (combining two or more lexemes), and into (b) inflection (3.3) that
creates inflected forms (3.4).
3.2
word-formation
branch of morphology (3.1), dealing with the creation of new lexemes (3.6) by the processes of derivation and
compounding
3.3
inflection
branch of morphology (3.1), dealing with contextual realizations of lexemes (3.6) as inflected forms (3.4)
3.4
inflected form
concrete form that a lexeme (3.6) can take when used in a sentence or a phrase
3.5
word-form
abstract instantiation of a lexeme (3.6) with the values of morphosyntactic features (3.12) fixed in a
syntactic context
Note 1 to entry: Word-forms may have no acoustic or graphic realization, or may correspond to one or more tokens
(3.21), not necessarily forming a contiguous sequence.
3.6
lexeme
abstract, fundamental unit in the lexicon of a language, comprising semantic, formal (phonetic and/or
graphemic) and grammatical information
Note 1 to entry: A complex lexeme is the result of word-formation (3.2) (derivation or compounding) processes; a
simple lexeme can be thought of as the base for such processes. In a lexical entry (3.9), a lexeme is identified by a lemma
(3.8). Word-forms (3.5) are results of the interaction of lexemes with the grammatical system of the given language.
3.7
word
lexeme (3.6), word-form (3.5) or token (3.21)
Note 1 to entry: The term word is notoriously ambiguous, standing (at least) for lexeme, word-form or token, depending
on the context of its use. This document attempts to disambiguate this term where relevant.
3.8
lemma
conventional form chosen to represent a lexeme (3.6)
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the masculine
form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are defective in
the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be the third
person singular with the accomplished aspect.
Note 2 to entry: The term lemma is most often used in the context of corpora, as a device to capture the identity of
tokens (3.21) and establish basic correspondence between a token and a lexical entry (3.9). The term that corresponds
to lemma in the context of lexicons is headword. Mismatches between the two are possible due to the varying macro-
and microstructure of lexical entries. In order to handle such mismatches, apart from lemmas, direct references to
dictionary entries are sometimes added to tokens or word-forms (3.5) in corpora.
3.9
lexical entry
container for managing a set of word-forms (3.5) and possibly one or more meanings that describe a lexeme (3.6)
3.10
lexicon
resource comprising a collection of lexical entries (3.9) for a language
3.11
morpheme
exponent that signals a modification of a lexeme (3.6)
Note 1 to entry: This definition adheres to a lexeme-based approach to morphology where it is the lexeme, not the
morpheme, that encodes the linguistic sign. On this approach, the morpheme is a unit of form (an exponent) that marks
various kinds of modifications (e.g. derivation or inflection) of a lexeme.
Note 2 to entry: Morphemes can usually be divided into derivational and inflectional (signalling a morphosyntactic
category); sometimes a modification of a lexeme is not overtly marked, and sometimes the morpheme is a combined
(fused) exponent of various kinds of morphosyntactic information.
Note 3 to entry: On morpheme-based (as opposed to lexeme-based) approaches, the morpheme is defined as the
minimal linguistic sign (a combination of the meaning and the form). On these approaches, the term morph is used
roughly in the meaning that is used for the term morpheme in this document.
3.12
morphosyntactic feature
feature induced from either the inflected form (3.4) of a lexeme (3.6) or from its syntactic context, or both
EXAMPLE “grammaticalGender”
Note 1 to entry: Universal Dependencies (see [27]) offer a set of general and language-specific features and values,
designed for pragmatically uniform cross-linguistic grammatical description.
3.13
part of speech
POS
grammatical category
category assigned to a word (3.7) based on its grammatical and semantic properties
EXAMPLE Noun, verb.
3.14
morphosyntactic tag
tag
label identifying a feature structure (3.16) used to qualify a word-form (3.5) within an established taxonomy
Note 1 to entry: Morphosyntactic tags can be atomic labels (“N” for ‘noun’), but very often they are mnemonic
representations for the feature structures that they identify (“NNL2” for ‘plural locative noun’ in the CLAWS-7 tagset
(3.15), see [29]). The relevant feature structures can also be encoded by character vectors, as in “N12201” for ‘common
noun, feminine, plural, countable’ in the EAGLES intermediate tagset (see [30]) or by agglutinated shorthand feature
identifiers, as in “subst:pl:gen:m3” for ‘noun, plural, genitive, masculine, inanimate’ in the NKJP tagset (see [31]).
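The relation between an agglutinated tag and the feature structure it identifies can be sketched as a simple lookup. The Python below is a hedged illustration: the table covers only the four identifiers of the NKJP example in the note above, and the feature names ("pos", "number", "case", "gender") are assumptions of this sketch rather than part of any tagset definition.

```python
# Hypothetical lookup table, limited to the identifiers in the example tag.
EXAMPLE_TABLE = {
    "subst": ("pos", "noun"),
    "pl": ("number", "plural"),
    "gen": ("case", "genitive"),
    "m3": ("gender", "masculine inanimate"),
}


def expand_tag(tag):
    """Expand a colon-separated tag into a flat feature structure (a dict).

    Spaces around the identifiers are tolerated. An identifier missing
    from EXAMPLE_TABLE raises KeyError.
    """
    features = {}
    for identifier in tag.split(":"):
        feature, value = EXAMPLE_TABLE[identifier.strip()]
        features[feature] = value
    return features


print(expand_tag("subst:pl:gen:m3"))
```

A character-vector tag such as the EAGLES example would need a positional table instead of a keyed one; that variant is not shown here.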
3.15
morphosyntactic tagset
tagset
comprehensive set of morphosyntactic tags (3.14) used for the morphosyntactic description of a language
3.16
feature structure
set of feature specifications (3.17)
[SOURCE: ISO 24610-1:2006, 3.10, modified: Note removed]
3.17
feature specification
assignment of a value to a feature
Note 1 to entry: Formally, it is treated as a pair of a feature and its value.
[SOURCE: ISO 24610-1:2006, 3.9]
3.18
phoneme
minimal unit in the sound system of a language
3.19
phonetic transcription
representation or modelling of spoken language based on the sound system of the respective language
[SOURCE: ISO 24624:2016, 3.5]
3.20
character
element of a writing system, whether or not alphabetical, that represents a phoneme (3.18), a syllable, a word
(3.7) or even prosodic characteristics of the language, by using graphical symbols (letters, diacritical marks,
syllabic signs, punctuation marks, prosodic accents, etc.) or a combination of these signs (a letter having an
accent or a diacritical mark)
EXAMPLE a, B, ω or Γ are, therefore, characters as well as basic letters.
Note 1 to entry: See also ISO/IEC 2382:2015, 2121335.
[SOURCE: ISO 7098:2015, 2.1]
3.21
token
non-empty contiguous sequence of characters (3.20) in a document
Note 1 to entry: For editorial reasons, some annotation schemes may extend the notion of token to an empty sequence.
3.22
tokenization
process that segments a language data stream into individual tokens (3.21)
3.23
transcription
general transcription
form resulting from a type of script conversion (3.25) whereby characters (3.20) of one script (3.26) are
mapped onto characters of another script
3.24
transliteration
form resulting from the conversion (3.25) of one script (3.26) into another, usually through a one-to-one
correspondence between characters (3.20)
3.25
script conversion
transcription (3.23) and transliteration (3.24)
[SOURCE: ISO 5127:2017, 3.1.6.13]
3.26
script
set of graphic characters (3.20) used for the written form of one or more languages
Note 1 to entry: A script, as opposed to an arbitrary subset of characters, is defined in distinction to other scripts;
in general, readers of one script may be unable to read another script easily, even where there is a historic relation
between them.
[SOURCE: ISO 15924:2022, 3.7, modified – Note 2 to entry deleted.]
3.27
word lattice
set of possible alternative decompositions of a text or speech segment into word-forms (3.5)
Note 1 to entry: A word lattice has the algebraic properties of a directed acyclic graph (3.28) with an initial node and a
final node.
Note 2 to entry: See also DAG (3.28) and FSA (3.29).
Note 3 to entry: Word lattices are the topic of the planned Part 2 of this international standards series.
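To make the graph reading of this definition concrete, the Python sketch below stores a small word lattice as labelled edges and enumerates the alternative decompositions from the initial node to the final node. The edge list is a hypothetical example constructed for this illustration (treating 白菜 either as one unit or as two), and the function name is an assumption. Termination relies on the graph being acyclic.

```python
from collections import defaultdict

# Hypothetical lattice: (source node, label, target node).
# Node 0 is the initial node and node 3 the final node.
EDGES = [
    (0, "白", 1),
    (1, "菜", 2),
    (0, "白菜", 2),
    (2, "和", 3),
]


def all_paths(edges, start, end):
    """Return every label sequence leading from start to end.

    Assumes the edges form a directed acyclic graph; a cycle would
    cause unbounded recursion.
    """
    outgoing = defaultdict(list)
    for source, label, target in edges:
        outgoing[source].append((label, target))

    def walk(node):
        if node == end:
            yield []
            return
        for label, target in outgoing[node]:
            for rest in walk(target):
                yield [label] + rest

    return list(walk(start))


for path in all_paths(EDGES, 0, 3):
    print("|".join(path))
```

Each path through the lattice corresponds to one alternative decomposition of the segment.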
3.28
directed acyclic graph
digraph
DAG
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.29).
3.29
finite state automata
FSA
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.28).
3.30
data category
class of data items that are closely related from a formal or semantic point of view
EXAMPLE /part of speech/, /subject field/, /definition/
Note 1 to entry: A data category can be viewed as a generalization of the notion of a field in a database.
Note 2 to entry: In running text, such as in this document, data category names are enclosed in forward slashes (e.g. /
part of speech/).
[SOURCE: ISO 30042:2019, 3.8]
4 The MAF metamodel
4.1 Levels of description in the MAF metamodel
Morphosyntactic annotations provide an important layer of linguistic information in a document. This
document is based on a metamodel that draws a clear distinction between two levels of description: the level
of tokens (representing the surface segmentation of the source) and the level of word-forms (identifying
lexical abstractions associated with groups of tokens). These two levels have the following property in
common: they can be represented as simple sequences and as local graphs (for the purpose of describing,
e.g., multiple possible segmentations or ambiguous compounds). Any n-to-n relationship can obtain between
word-forms and tokens. Word-forms can be aggregated to form maximal units (such as compound words or
multi-word units) that act as elementary units for other levels of linguistic analysis, particularly syntax. In
particular, word-forms in many cases correspond directly to the terminal level defined in ISO 24615-1.
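The two levels and the n-to-n relationship between them can be pictured with a small Python sketch. It is illustrative only and not part of the metamodel: the class names, the identifiers and the French analysis (a three-token compound, and a contraction read as two word-forms) are assumptions chosen for the example.

```python
from dataclasses import dataclass

SOURCE = "pomme de terre du jardin"


@dataclass(frozen=True)
class Token:
    """A surface segment of the source, addressed by character offsets."""
    ident: str
    start: int
    end: int

    def text(self, source):
        return source[self.start:self.end]


@dataclass(frozen=True)
class WordForm:
    """A lexical abstraction linked to zero or more tokens by identifier."""
    ident: str
    lemma: str
    token_ids: tuple = ()


TOKENS = [
    Token("t1", 0, 5),    # pomme
    Token("t2", 6, 8),    # de
    Token("t3", 9, 14),   # terre
    Token("t4", 15, 17),  # du
    Token("t5", 18, 24),  # jardin
]

# Assumed analysis: three tokens form one compound word-form,
# and the single token "du" realizes two word-forms ("de" + "le").
WORD_FORMS = [
    WordForm("w1", "pomme de terre", ("t1", "t2", "t3")),
    WordForm("w2", "de", ("t4",)),
    WordForm("w3", "le", ("t4",)),
    WordForm("w4", "jardin", ("t5",)),
]


def word_forms_by_token(word_forms):
    """Invert the word-form -> token links to show the token -> word-form side."""
    index = {}
    for wf in word_forms:
        for token_id in wf.token_ids:
            index.setdefault(token_id, []).append(wf.ident)
    return index


INDEX = word_forms_by_token(WORD_FORMS)
```

Keeping the links on the word-form side leaves the token sequence untouched, so several competing analyses can refer to the same tokens.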
ISO/DIS 24611-1:2024(en)
4.2 MAF in the standards landscape
Figure 1 presents a simplified view of the proposed metamodel for morphosyntactic annotations, together
with the place of MAF in the context of other standards for language description.
An annotated document comprises an original document and a set of annotations. Annotations are in most
cases associated with word-forms, which correspond to zero or more tokens in the original document. A
word-form may also be associated with a lexical entry providing information about its underlying lemma and
its inflected form(s). The morphosyntactic annotation associated with a word-form is represented by a tag,
which may also be expressed as a feature structure. A set of such tags used by a particular annotation scheme
is referred to as a ‘morphosyntactic tagset’ and corresponds to what is defined in the ISO 24610-1-specified
feature structure representation (FSR) as a feature-structure library. Each discrete category within such a
tagset should be describable in terms of data categories as described in ISO 12620-1, and implemented in a
centralized repository of data categories compliant with ISO 12620-2. See Annex B for an illustration.
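The idea that a tag can also be expressed as a feature structure, and that a tagset acts as a library of such structures, can be sketched in Python as follows. The tag names, feature names and values are hypothetical; they are not taken from ISO 24610-1, from a data category repository or from any existing tagset.

```python
# Hypothetical tagset: each tag name maps to a flat feature structure.
TAGSET = {
    "NCFS": {"pos": "noun", "type": "common", "gender": "feminine", "number": "singular"},
    "NCMP": {"pos": "noun", "type": "common", "gender": "masculine", "number": "plural"},
}


def expand_tag(tag, tagset=TAGSET):
    """Return a copy of the feature structure that a tag stands for."""
    try:
        return dict(tagset[tag])
    except KeyError:
        raise ValueError(f"unknown tag: {tag!r}") from None


def subsumes(general, specific):
    """True if every feature-value pair of `general` also occurs in `specific`."""
    return all(specific.get(name) == value for name, value in general.items())


fille = expand_tag("NCFS")
is_noun = subsumes({"pos": "noun"}, fille)
is_plural = subsumes({"number": "plural"}, fille)
```

With tags expanded in this way, a query such as "all singular nouns" can be answered by feature comparison rather than by matching on tag strings.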
Figure 1 — Simplified view of MAF metamodel in the context of other standards
4.3 Metadata
The metadata needed to properly describe language resources is assumed to be handled by standards
of the CMDI family (CMDI = Component Metadata Infrastructure; see ISO 24622-1 and others). Given the
restrictions of the chosen serialization format (e.g. the impossibility of directly specifying the language
and/or notation of some information attributes), the relevant information may be specified by the TEI header
mechanisms.
4.4 Structural ambiguities
Because annotation may be applied both to tokens and to word-forms, and given the n-to-n relationship
between the elements of both levels, it is possible that combined annotations may be in structural conflict.
Hence annotation is typically conceptualized as one or more streams, each represented as a word lattice or
more formally as a directed acyclic graph (DAG).
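As an informal illustration of a word lattice viewed as a DAG, the Python sketch below enumerates the alternative decompositions of one segment. It is not the representation planned for Part 2; the edge format (source position, target position, label) and the segmentation alternatives are assumptions made for the example.

```python
# Nodes are character positions 0..5 in the segment; each edge carries one unit.
EDGES = [
    (0, 1, "白"), (0, 2, "白菜"),
    (1, 2, "菜"),
    (2, 3, "和"),
    (3, 4, "猪"), (3, 5, "猪肉"),
    (4, 5, "肉"),
]


def all_paths(edges, initial, final):
    """List every path from the initial node to the final node of the lattice."""
    outgoing = {}
    for src, dst, label in edges:
        if src >= dst:
            # Forward-only edges guarantee that the graph has no cycles.
            raise ValueError(f"edge {src}->{dst} does not point forward")
        outgoing.setdefault(src, []).append((dst, label))

    def walk(node):
        if node == final:
            yield []
            return
        for dst, label in outgoing.get(node, []):
            for rest in walk(dst):
                yield [label] + rest

    return list(walk(initial))


PATHS = all_paths(EDGES, 0, 5)
```

Each path corresponds to one complete decomposition; here the two local alternatives combine into four paths.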
4.5 MAF metamodel in detail
This subclause presents the MAF metamodel in more detail. The primary focus is on the core model, which
describes the relationship between the two levels of description as well as the function and placement of
annotations. Word lattices are the focus of a planned Part 2 of this standards series.
Figure 2 — UML view of MAF metamodel
5 Token-level segmentation
5.1 Introduction
Clause 5 looks at the serialization of the MAF metamodel shown in Clause 4. Unless explicitly stated, all
constructs serializing the MAF metamodel shall be valid TEI representations, which means that the
specification described in this document is a customization of the TEI Guidelines (for a thorough discussion
of the concepts of customization and conformance, see chapter 23 of Reference [32]). MAF documents shall thus also be
well-formed XML documents as specified by the W3C XML recommendation.
In all XML examples in this document and in order to simplify the actual representations, it is assumed,
unless otherwise stated, that all XML elements belong to the TEI namespace, as defined by the following
XML namespace declaration:
xmlns="http://www.tei-c.org/ns/1.0"
An updated TEI ODD specification corresponding to this document is available from Reference [25] together
with additional examples.
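The practical effect of the namespace declaration can be seen in a short Python sketch that uses only the standard library. It checks well-formedness and the namespace of the root element, not TEI validity or conformance to the ODD; the sample document is a deliberately minimal assumption, and it also assumes a root element named TEI.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Minimal, hypothetical sample: well-formed, but not a complete TEI document.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>La petite fille</p></body></text>
</TEI>"""


def parse_tei(xml_text):
    """Parse XML text and check that the root is a TEI element in the TEI namespace."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    if root.tag != f"{{{TEI_NS}}}TEI":
        raise ValueError(f"unexpected root element: {root.tag}")
    return root


root = parse_tei(SAMPLE)
# Element names must be qualified with the namespace when searching the tree.
paragraphs = [p.text for p in root.iter(f"{{{TEI_NS}}}p")]
```

Because the default namespace applies to all descendants, an unqualified search for "p" would find nothing in this tree.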
Morphosyntactic annotations reference segments, called tokens, that are present in the document flow, but
this does not imply that the resulting segmentation constitutes a sequence of adjacent segments partitioning
the original document. Some parts of a document may carry no annotations (e.g. typographic marks, stage
directions and markup elements), while other parts may not correspond exactly to their segmented form
(e.g. abbreviations, brachygraphies, orthographic errors and variations, and typographic and morphological
contractions).
It is particularly important to distinguish word-forms from their realizations. A word-form may not
correspond exactly to a segment delimited solely by orthographic marks such as white spaces or hyphens
(e.g. for German compound words, speech transcription or Sanskrit writing).
The following list shows typical examples of tokenized inputs in three languages, with the original linguistic
segment on the left and the corresponding representation of tokens as vertical-bar-separated strings on
the right:
La petite fille          La|petite|fille
白菜和猪肉               白|菜|和|猪|肉
Don’t wanna do that      Do|n’t|wan|na|do|that
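The relation between a segment and its tokens in the list above can be made concrete with a small Python sketch. The function names are hypothetical, the offsets are plain Python character indices, and the token lists are given as input, since this document does not prescribe how tokens are identified.

```python
def bar_separated(tokens):
    """Render a token list in the vertical-bar notation used above."""
    return "|".join(tokens)


def locate_tokens(segment, tokens):
    """Find each token in the segment, left to right, as (start, end) offsets."""
    spans = []
    cursor = 0
    for token in tokens:
        start = segment.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def uncovered(segment, spans):
    """Return the offsets of characters that belong to no token."""
    covered = {i for start, end in spans for i in range(start, end)}
    return [i for i in range(len(segment)) if i not in covered]


FRENCH = locate_tokens("La petite fille", ["La", "petite", "fille"])
ENGLISH = locate_tokens("Don’t wanna do that", ["Do", "n’t", "wan", "na", "do", "that"])
GAPS = uncovered("La petite fille", FRENCH)
```

The uncovered offsets are the white-space characters, which shows that the tokens need not partition the original segment.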
The TEI element is used to represent those segments of the original document which, in approximate
terms, are delimited by orthographic, morphological or phonological boundaries. This document does not
review the linguistic correlates of tokens. Depending on the language, a token may be identified through
typographic properties (presence of white-space, hyphens or special characters), phonological properties
(e.g. linking phenomena, hiatus, elision or final-obstruent devoicing, such as the "Auslautverhärtung" in
German), morphological properties (constituting a root, stem, affix, morpheme, etc.), or by a mixture of them.
Also not covered by this document are those aspects of a writing system that are used to format pages or to
separate words and paragraphs, or to provide similar encoding information, since these do not constitute
morphosyntactic annotation.
5.2 Formal description:
The token level in MAF is implemented by means of the TEI element. The element may use the
attributes shown in Table 1.
Table 1 — attributes
Attribute    Description
@startPos    initial span boundary
@endPos      final span boundary
@join        relationship with neighbouring tokens
@norm        normalized form of the token
@phon        phonetic transcription
@transcr     general transcription
@translit    transliteration to some other script
Table 1 contains a selection of attributes defined for the element by the TEI Guidelines and by the
TEI ODD specification document (see [25]) that customizes the Guidelines for the purpose of serializing the
abstract MAF datamodel. It is expected that the normalization strategy, the kind and granularity level of
phonetic transcription, the type of general transcription used (if any), as well as the type of transliteration
will all be described in the TEI header accompanying the given MAF-TEI document.
Note that many approaches to lightweight grammatical annotation do not apply the full distinction between
tokens and word-forms systematically. Instead, realizations of these two annotation layers are squished
into a singl
...









