ISO 24611:2012
Language resource management — Morpho-syntactic annotation framework (MAF)
ISO 24611:2012 provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties. It describes a metamodel for morpho-syntactic annotation that refers to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding Initiative).
Language resource management - Morpho-syntactic annotation framework (MAF)
ISO 24611:2012 provides a framework for representing annotations of word-forms in texts; these annotations concern segments (tokens), their relationships with lexical units, and their morpho-syntactic properties. It presents a metamodel for morpho-syntactic annotation that references the data categories in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotation, with equivalences to the guidelines of the TEI (Text Encoding Initiative).
Language resource management - Morpho-syntactic annotation framework (MAF)
This International Standard provides a framework for representing the annotation of word-forms in texts; this annotation covers tokens, their relationship with lexical units, and their morpho-syntactic properties. It describes a metamodel for morpho-syntactic annotation that refers to the data categories in the ISOCat data category registry (as defined in ISO 12620). It also describes an XML serialization of morpho-syntactic annotation that takes account of the guidelines of the TEI (Text Encoding Initiative).
Standards Content (Sample)
SLOVENIAN STANDARD
SIST ISO 24611:2013
1 July 2013
Language resource management -- Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
This Slovenian standard is identical to: ISO 24611:2012
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
SIST ISO 24611:2013 en,fr,de
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
INTERNATIONAL STANDARD ISO 24611
First edition 2012-11-01
Language resource management — Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières — Cadre d'annotation morphosyntaxique (MAF)
Reference number: ISO 24611:2012(E)
© ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents Page
Foreword . v
Introduction . vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 The MAF meta-model . 4
4.1 Overview . 4
4.2 MAF Meta-model . 4
5 Segmenting with tokens . 6
5.1 General . 6
5.2 Formal description: . 7
5.3 Embedding notation . 7
5.4 Alternate representation for TEI based documents . 8
5.5 Stand-off notation . 9
5.6 Informative attributes . 9
5.7 Completing the inline token notation . 10
5.7.1 Joining tokens in embedded mode . 10
5.7.2 Overlapping tokens . 11
6 Word-forms as linguistic units . 11
6.1 Formal description: . 12
6.2 Token attachment . 12
6.2.1 One token; one word-form . 12
6.2.2 Several contiguous tokens; one word-form . 12
6.2.3 Several discontinuous tokens; one word-form . 13
6.2.4 Zero token; one word-form . 13
6.2.5 One token; several word-forms . 14
6.3 Referring to lexical entries . 14
6.4 Compound word-forms . 15
6.5 Identification of word-forms within a TEI-compliant document . 15
7 Morpho-syntactic content . 18
7.1 General . 18
7.2 Using feature structures . 18
7.3 Compact morpho-syntactic tags . 18
7.4 FSR libraries . 19
7.5 Designing tagsets . 20
7.6 Formal description: . 22
8 Handling ambiguities . 22
8.1 Word-form content ambiguities . 22
8.2 Lexical Ambiguities . 23
8.3 Structural ambiguities . 23
8.3.1 Structural ambiguities with word-forms . 23
8.3.2 Structural ambiguities with tokens . 24
8.4 Simplified structuring variants . 24
8.4.1 Non-ambiguous linear representation . 24
8.4.2 Mixed linear and lattice representation . 25
8.5 Expanding the simplified variants . 26
8.5.1 Separating tokens and word-forms . 26
8.5.2 Wrapping into local lattices . 26
8.5.3 Merging local lattices .27
8.5.4 Removing .28
8.6 Formal description: and .29
Annex A (informative) Encoded example using the MAF serialization .30
Annex B (normative) MAF specification .33
B.1 Elements .33
B.1.1 .33
B.1.2 .34
B.1.3 .34
B.1.4 .35
B.1.5 .35
B.1.6 .36
B.1.7 .36
B.1.8 .37
B.2 Model classes .38
B.3 Attribute classes .38
B.3.1 att.token.information .38
B.3.2 att.token.join .39
B.3.3 att.token.span .39
B.3.4 att.wordForm.content .39
B.3.5 att.wordForm.tokens .40
B.4 Macros .40
B.4.1 data.certainty .40
B.4.2 data.code .40
B.4.3 data.count .40
B.4.4 data.duration.w3c .41
B.4.5 data.enumerated .41
B.4.6 data.key .41
B.4.7 data.language .42
B.4.8 data.name .43
B.4.9 data.numeric .43
B.4.10 data.pointer .43
B.4.11 data.probability .44
B.4.12 data.temporal.w3c .44
B.4.13 data.truthValue .44
B.4.14 data.word .45
B.4.15 data.xTruthValue .45
Annex C (normative) Morpho-syntactic data categories .46
Bibliography .58
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language
resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the
representation of terminological data [Romary, 2001], through which linguistic data models are seen as the
combination of a generic data pattern (a meta-model), which is further refined through a selection of data
categories that provide the descriptors for this specific annotation level. Such models are defined
independently of any specific formats, and ensure that an implementer has the necessary conceptual
instrument with which to design and compare formats with regard to their degrees of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or directly
as objects in a representation that is expressed, for instance, in XML. In order to be shared across various
annotation schemas and encoding applications, such a semantics should be implemented as a centralised
registry of concepts: we will henceforth refer to these as data categories. As such, data categories should
bear the following constraints.
- From a technical point of view, they must provide unique, stable references (implemented as persistent
  identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to
  them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they
  are in fact defined in relation to the same data categories (as feature and feature value).
- From a descriptive point of view, each unique semantic reference should be associated with precise
  documentation combining a full text elicitation of the meaning of the descriptor with the expression of
  specific constraints that bear upon the category.
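The equivalence criterion in the first constraint can be pictured with a small sketch. The Python below is an illustration only: the identifier strings, the two tagsets and the helper `same_annotation` are hypothetical and are not taken from ISO 12620, ISO 24619 or ISOCat.

```python
# Hypothetical persistent identifiers standing in for registry entries.
# Real identifiers would be issued by the data category registry.
PART_OF_SPEECH = "dc:partOfSpeech"
NOUN = "dc:noun"
VERB = "dc:verb"

# Two independent tagsets, each mapping its own tag to a
# (feature, feature value) pair of data category identifiers.
TAGSET_A = {"NN": (PART_OF_SPEECH, NOUN), "VB": (PART_OF_SPEECH, VERB)}
TAGSET_B = {"subst": (PART_OF_SPEECH, NOUN), "verb": (PART_OF_SPEECH, VERB)}


def same_annotation(tag_a: str, tag_b: str) -> bool:
    """Return True when both tags resolve to the same data categories."""
    return TAGSET_A[tag_a] == TAGSET_B[tag_b]


print(same_annotation("NN", "subst"))  # True
print(same_annotation("NN", "verb"))   # False
```

The point of the sketch is that the comparison never looks at the tag strings themselves, only at the shared references they resolve to.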
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories, encompassing all domains of language resources. This initiative, described in ISO 12620,
has led to the implementation of an online environment providing access to all data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically as
part of the maintenance of the data category registry. It also provides access to the various data categories
that individual language technology practitioners have defined in the course of their own work and decided to
share with the community.
The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended
as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective
there is to facilitate the maintenance of a comprehensive descriptive environment where new categories are
easily inserted and reused without the need for any strong consistency check with the registry at large.
Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the proper identification of interoperability
  descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/
  makes it possible to compare morpho-syntactic annotations based on different descriptive levels of
  granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable,
  the possible values of so-called complex data categories. For instance, it can be used to record that the
  possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) could be
  a subset of {/masculine/, /feminine/ and /neutral/};
- language-specific constraints, either in the form of specific application notes or as explicit restrictions
  bearing upon the conceptual domains of complex data categories. For instance, it is possible to express
  explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
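The three kinds of constraint listed above can be sketched as plain lookups. This is a minimal illustration, assuming a dictionary-based registry fragment; the category names and values come from the examples in the text, while the language code `"fr"`, the table layout and the helper names are assumptions made for the example.

```python
from typing import Optional

# Generic-specific relation: child category -> broader category.
BROADER = {"properNoun": "noun"}

# Conceptual domain of a complex data category.
DOMAIN = {"grammaticalGender": {"masculine", "feminine", "neutral"}}

# Language-specific restriction of that conceptual domain.
RESTRICTION = {("grammaticalGender", "fr"): {"masculine", "feminine"}}


def is_a(category: str, ancestor: str) -> bool:
    """Follow generic-specific links upward from category to ancestor."""
    current: Optional[str] = category
    while current is not None:
        if current == ancestor:
            return True
        current = BROADER.get(current)
    return False


def allowed_values(category: str, language: str) -> set:
    """Values permitted for a complex category in a given language."""
    return RESTRICTION.get((category, language), DOMAIN[category])


def is_valid(category: str, value: str, language: str) -> bool:
    """Check a feature value against the (possibly restricted) domain."""
    return value in allowed_values(category, language)


print(is_a("properNoun", "noun"))                        # True
print(is_valid("grammaticalGender", "neutral", "fr"))    # False
```

A language without an explicit restriction falls back to the full conceptual domain, which matches the "when known or applicable" wording of the second constraint.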
This International Standard provides a comprehensive framework for the representation of morpho-syntactic
(also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical
abstraction level over language data (textual or spoken) and, depending on the language to be annotated,
together with the characteristics of the annotation tool or annotation scheme that is being used, can vary
enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this
International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens
(representing the surface segmentation of the source) and word-forms (identifying lexical abstractions
associated with groups of tokens). These two levels share the following specificities: on the one hand, they
can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous
compounds; on the other hand, any n-to-n combination can hold between word-forms and tokens.
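The two-level distinction can be sketched as a pair of record types. The Python below is an illustration under stated assumptions: the class names `Token` and `WordForm`, their fields, and the French examples are chosen for the sketch and do not reproduce the normative MAF serialization.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    """A surface segment of the source text."""
    id: str
    form: str


@dataclass
class WordForm:
    """A lexical abstraction attached to zero or more tokens."""
    lemma: str
    tokens: List[Token] = field(default_factory=list)


# One token, several word-forms: French "du" analysed as "de" + "le".
t_du = Token("t1", "du")
contraction = [WordForm("de", [t_du]), WordForm("le", [t_du])]

# Several tokens, one word-form: the compound "pomme de terre".
parts = [Token("t2", "pomme"), Token("t3", "de"), Token("t4", "terre")]
compound = WordForm("pomme de terre", parts)


def tokens_to_word_forms(word_forms: List[WordForm]) -> Dict[str, List[str]]:
    """Index lemmas by token id to expose the many-to-many links."""
    index: Dict[str, List[str]] = {}
    for wf in word_forms:
        for tok in wf.tokens:
            index.setdefault(tok.id, []).append(wf.lemma)
    return index


print(tokens_to_word_forms(contraction + [compound]))
```

Keeping the token list on the word-form, rather than a single token reference, is what lets the same structure cover one-to-one, one-to-many and many-to-one attachments.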
As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]),
tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by
means of so-called stand-off annotations.
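The difference between inline and stand-off tokens can be shown on a single sentence. This is a hedged sketch: the character-offset convention, the sample sentence and the element name `token` used for the inline variant are assumptions for illustration, and the function does no XML escaping.

```python
source = "The cat sleeps."

# Stand-off tokens: (start, end) character offsets into the untouched source.
standoff = [(0, 3), (4, 7), (8, 14), (14, 15)]


def resolve(text, spans):
    """Recover the token strings that the stand-off offsets point to."""
    return [text[start:end] for start, end in spans]


def embed(text, spans, tag="token"):
    """Build an inline variant by wrapping each span in mark-up."""
    out, cursor = [], 0
    for start, end in spans:
        out.append(text[cursor:start])                   # gap between tokens
        out.append(f"<{tag}>{text[start:end]}</{tag}>")  # the token itself
        cursor = end
    out.append(text[cursor:])                            # trailing text, if any
    return "".join(out)


print(resolve(source, standoff))
print(embed(source, standoff))
```

The stand-off form leaves the source string unchanged, which matters when the source is read-only or already carries other mark-up.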
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text.
Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry
in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are
expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in
Monachini and Calzolari, 1994). Such codes may also provide morphological information, including its part of
speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
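A compact tag can be read as a short code for a feature structure. The sketch below assumes an invented three-tag tagset; the tag strings and the feature names other than `grammaticalGender` are hypothetical, and flat dictionaries stand in for ISO 24610 feature structures.

```python
# Hypothetical compact tags, each standing for a flat feature structure.
TAGSET = {
    "NCMS": {"partOfSpeech": "noun", "grammaticalGender": "masculine",
             "grammaticalNumber": "singular"},
    "NCFP": {"partOfSpeech": "noun", "grammaticalGender": "feminine",
             "grammaticalNumber": "plural"},
    "AMS": {"partOfSpeech": "adjective", "grammaticalGender": "masculine",
            "grammaticalNumber": "singular"},
}


def expand(tag):
    """Return a copy of the feature structure that a tag abbreviates."""
    return dict(TAGSET[tag])


def agree(tag_a, tag_b, features=("grammaticalGender", "grammaticalNumber")):
    """True when both tags carry the same values for the given features."""
    fs_a, fs_b = expand(tag_a), expand(tag_b)
    return all(fs_a.get(f) == fs_b.get(f) for f in features)


print(expand("NCMS")["partOfSpeech"])  # noun
print(agree("NCMS", "AMS"))            # True
print(agree("NCMS", "NCFP"))           # False
```

Once tags are expanded, checks such as agreement operate on named features instead of on positions within the tag string.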
In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides a means
of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data
categories available in ISOCat. A normative annex of this International Standard elicits a core set of data
categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual
context. However, when implementers of this International Standard find these categories inappropriate in
either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in
compliance with ISO/TC 37 principles.
Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise MAF-compliant
annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI)
guidelines (www.tei-c.org) — particularly in digital humanities, where a proper encoding of textual sources is
essential — this International Standard will also provide clues about how to articulate the MAF model with TEI-
compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope
with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).
Finally, it should be noted here that this International Standard forms the conceptual basis for the
development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined
in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be
understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts;
such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation that refers to the data categories
contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML
serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding
Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24610-1, Language resource management — Feature structures — Part 1: Feature structure
representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-
syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
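The DAG and FSA definitions can be illustrated with a small lattice that has one initial and one final state. The sketch below is an assumption-laden example: the adjacency-list layout, the state numbers and the French labels are invented for illustration. It relies on `graphlib` from the Python standard library (3.9 or later) to test for cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Transitions of a small lattice: state -> list of (label, next state).
# State 0 is the initial state and state 3 the final state.
LATTICE = {
    0: [("pomme de terre", 3), ("pomme", 1)],
    1: [("de", 2)],
    2: [("terre", 3)],
    3: [],
}


def is_acyclic(graph):
    """Return True when the directed graph contains no cycle."""
    links = {node: {dst for _, dst in edges} for node, edges in graph.items()}
    try:
        tuple(TopologicalSorter(links).static_order())
        return True
    except CycleError:
        return False


def paths(graph, state, final):
    """Yield every label sequence from state to final (acyclic graphs only)."""
    if state == final:
        yield []
        return
    for label, nxt in graph[state]:
        for rest in paths(graph, nxt, final):
            yield [label] + rest


print(is_acyclic(LATTICE))
print(list(paths(LATTICE, 0, 3)))
```

Each path through the lattice is one candidate reading, so enumerating paths is only safe after the acyclicity check has passed.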
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as
grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the
masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are
defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be
the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller
meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.
3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.
3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set o
...
INTERNATIONAL STANDARD ISO 24611
First edition 2012-11-01
Language resource management - Morpho-syntactic annotation framework (MAF)
Responsibility for preparing the Russian version lies with GOST R (Russian Federation) in accordance with Article 18.1 of the ISO Statutes
Reference number: ISO 24611:2012(R)
© ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in
any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior
written permission from either ISO at the address below or ISO's member body in the country of the
requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents Page
Foreword . v
Introduction . vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 The MAF meta-model . 4
4.1 Overview . 4
4.2 MAF meta-model . 5
5 Segmenting with tokens . 6
5.1 General . 6
5.2 Formal description: . 7
5.3 Embedding notation . 7
5.4 Alternate representation for TEI-based documents . 8
5.5 Stand-off notation . 8
5.6 Informative attributes . 9
5.7 Completing the inline token notation . 10
5.7.1 Joining tokens in embedded mode . 10
5.7.2 Overlapping tokens . 10
6 Word-forms as linguistic units . 11
6.1 Formal description of the word-form: . 12
6.2 Token attachment . 12
6.2.1 One token; one word-form . 12
6.2.2 Several contiguous tokens; one word-form . 12
6.2.3 Several discontinuous tokens; one word-form . 12
6.2.4 Zero token; one word-form . 13
6.2.5 One token; several word-forms . 14
6.3 Referring to lexical entries . 14
6.4 Compound word-forms . 15
6.5 Identification of word-forms within a TEI-compliant document . 15
7 Morpho-syntactic content . 18
7.1 General . 18
7.2 Using feature structures . 18
7.3 Compact morpho-syntactic tags . 19
7.4 FSR libraries . 19
7.5 Designing tagsets . 20
7.6 Formal description: . 22
8 Handling ambiguities . 22
8.1 Word-form content ambiguities . 22
8.2 Lexical ambiguities . 23
8.3 Structural ambiguities . 23
8.3.1 Structural ambiguities with word-forms . 23
8.3.2 Structural ambiguities with tokens . 24
8.4 Simplified structuring variants . 24
8.4.1 Non-ambiguous linear representation . 24
8.4.2 Mixed linear and lattice representation . 25
8.5 Expanding the simplified variants . 26
8.5.1 Separating tokens and word-forms . 26
8.5.2 Wrapping into local lattices . 26
8.5.3 Merging local lattices . 27
8.5.4 Removing the element . 28
8.6 Formal description of the elements and . 29
Annex A (informative) Encoded example using the MAF serialization . 30
Annex B (informative) MAF specification . 33
B.1 Elements . 33
B.1.1 . 33
B.1.2 . 34
B.1.3 . 34
B.1.4 . 35
B.1.5 . 35
B.1.6 . 36
B.1.7 . 36
B.1.8 . 37
B.2 Model classes . 38
B.3 Attribute classes . 38
B.3.1 att.token.information . 38
B.3.2 att.token.join . 39
B.3.3 att.token.span . 39
B.3.4 att.wordForm.content . 39
B.3.5 att.wordForm.tokens . 40
B.4 Macros . 40
B.4.1 data.certainty . 40
B.4.2 data.code . 40
B.4.3 data.count . 40
B.4.4 data.duration.w3c . 41
B.4.5 data.enumerated . 41
B.4.6 data.key . 41
B.4.7 data.language . 42
B.4.8 data.name . 43
B.4.9 data.numeric . 43
B.4.10 data.pointer . 43
B.4.11 data.probability . 44
B.4.12 data.temporal.w3c . 44
B.4.13 data.truthValue . 44
B.4.14 data.word . 45
B.4.15 data.xTruthValue . 45
Annex C (normative) Morpho-syntactic data categories . 46
Bibliography . 62
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language
resources. To this end, it has extended the modelling strategy defined by its sister subcommittee, SC 3, for the
representation of terminological data [14]; under this strategy, linguistic data models are seen as generic data
patterns (meta-models) that are further refined through a selection of data categories serving as descriptors
for a specific annotation level. Such models are defined independently of any specific formats and give an
implementer the conceptual instruments needed to design and compare formats with regard to their degrees
of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or as
objects in a formalised representation expressed, for instance, in XML. In order to be shared across various
annotation schemas and encoding applications, such a semantics should be implemented as a centralised
registry of concepts, referred to here as data categories. As such, data categories should satisfy the following
constraints.
- From a technical point of view, they must provide unique, stable references (implemented as persistent
  identifiers, in the sense of ISO 24619) so that the designer of a specific encoding schema can refer to
  standardized data categories in his or her specification. With this approach, two annotations will be deemed
  to be equivalent when they are defined in relation to the same data categories (as feature and feature value).
- From a descriptive point of view, each unique semantic reference should be associated with precise
  documentation combining a full text description of the meaning of the descriptor with the expression of
  specific constraints that bear upon the data category.
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories, covering all domains of language resources. This initiative, described in ISO 12620, has led
to the implementation of an online environment providing access to all data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically as
part of the maintenance of the data category registry. The registry also provides access to the many data
categories that language technology practitioners have defined in the course of their own work and decided
to share with the community.
The ISO data category registry, as available through the ISOCat site (www.isocat.org), is intended as a 'flat'
marketplace of semantic objects that imposes only a minimal set of ontological constraints. The objective is
to facilitate the maintenance of a comprehensive descriptive environment where new categories are easily
inserted and reused without the need for any strong consistency check against the registry as a whole. The
following basic constraints are nevertheless part of the data category model, as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the proper identification of interoperability
  descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/
  makes it possible to compare morpho-syntactic annotations based on different levels of descriptive detail;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable,
  the possible values of so-called complex data categories. For instance, it can be used to record that the
  possible values of /grammaticalGender/ (limited to a small group of languages [15]) could be a subset of
  {/masculine/, /feminine/ and /neutral/};
- language-specific constraints, either in the form of application notes or as explicit restrictions bearing
  upon the conceptual domains of complex data categories. For instance, it is possible to state explicitly that
  /grammaticalGender/ in French can take only two values: {/masculine/ and /feminine/}.
This International Standard provides a broad basis for the representation of morpho-syntactic (also referred to as part-of-speech) annotations. Such annotation corresponds to a first level of lexical abstraction over language data (textual or spoken) and, depending on the language being annotated and on the characteristics of the annotation tool or scheme in use, can vary very widely in structure and complexity.
To help deal with such complex annotation issues as ambiguity and determinism, this International Standard defines a meta-model that draws a clear distinction between the two levels of tokens (representing the surface segmentation of the source) and word-forms (identifying lexical abstractions associated with groups of tokens). Both levels share the following characteristics: on the one hand, they can be represented as simple sequences and as local graphs of multiple segmentations and ambiguous compounds; on the other hand, any n-to-n combination can hold between word-forms and tokens.
As linguistic segments (sometimes called 'markables' in the literature; see, for instance, [12]), tokens may be embedded in the source document as inline mark-up, or may point to it remotely by means of so-called stand-off annotations.
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the morpho-syntactic properties assigned to a particular realisation of a lexical entry within the annotated text. Such properties may range widely, from the simple indication of a lemma to an explicit reference to a lexeme in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are expressed by means of so-called tags, which are coded representations of basic feature structures (early examples are given in Monachini and Calzolari [13]). These codes may also carry morphological information, including the part of speech (e.g. noun, adjective or verb) and features such as number, gender, person, mood and verbal tense.
In keeping with the general modelling strategy adopted by Technical Committee ISO/TC 37, the morpho-syntactic annotation framework (MAF) presented in this International Standard provides the means of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data categories available in ISOCat. A normative annex of this International Standard establishes a core set of data categories that can be used as a reference for most current morpho-syntactic annotation tasks in a multilingual context. Users of this International Standard who find these categories inappropriate in coverage, scope or semantics are encouraged to use the ISOCat registry to define their own data categories in compliance with ISO/TC 37 principles.
Together with the meta-model, MAF also provides a default XML syntax that may be used to serialise MAF-compliant annotation models. Since many existing linguistic projects are based on the guidelines of the Text Encoding Initiative (TEI) (www.tei-c.org), which are essential for the digital representation of textual sources, this International Standard also aims to explain how the MAF model can be used in combination with TEI-compliant encodings. The TEI guidelines already offer a variety of constructs and mechanisms for dealing with a wide range of issues related to spoken corpora and their annotation [15].
Finally, it should be noted that this International Standard forms the conceptual basis for the development of the ISO 24614 series on word segmentation, whose general principles and rules are defined in ISO 24614-1, as well as for understanding the constraints set out in additional parts of the series for specific languages, in accordance with the token/word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(R)
Language resource management. Morpho-syntactic annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation with reference to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, in line with the guidelines of the TEI (Text Encoding Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 24610-1, Language resource management. Feature structures. Part 1: Feature structure representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
state-transition graphs with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation mark.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with morphological features such as grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular where a plural exists, the masculine form where there is variation in gender, and the infinitive for verbs. In some languages, certain nouns have a defective paradigm in the singular; in these cases, the plural is chosen to represent the lemma. For verbs in Arabic, the lemma is usually the third person singular of the perfective aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the form of a word
Note 1 to entry: The data category registry
...
INTERNATIONAL STANDARD
ISO 24611
First edition
2012-11-01
Language resource management —
Morpho-syntactic annotation framework
(MAF)
Gestion des ressources langagières — Cadre d'annotation
morphosyntaxique (MAF)
Reference number: ISO 24611:2012(E)
© ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language
resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the
representation of terminological data [Romary, 2001], through which linguistic data models are seen as the
combination of a generic data pattern (a meta-model), which is further refined through a selection of data
categories that provide the descriptors for this specific annotation level. Such models are defined
independently of any specific formats, and ensure that an implementer has the necessary conceptual
instrument with which to design and compare formats with regard to their degrees of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or directly
as objects in a representation that is expressed, for instance, in XML. In order to be shared across various
annotation schemas and encoding applications, such a semantics should be implemented as a centralised
registry of concepts: we will henceforth refer to these as data categories. As such, data categories should
bear the following constraints.
- From a technical point of view, they must provide unique, stable references (implemented as persistent
identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to
them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they
are in fact defined in relation to the same data categories (as feature and feature value).
- From a descriptive point of view, each unique semantic reference should be associated with precise
documentation combining a full text elicitation of the meaning of the descriptor with the expression of
specific constraints that bear upon the category.
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories, encompassing all domains of language resources. This initiative, described in ISO 12620,
has led to the implementation of an online environment providing access to all data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically as
part of the maintenance of the data category registry. It also provides access to the various data categories
that individual language technology practitioners have defined in the course of their own work and decided to
share with the community.
The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended
as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective
there is to facilitate the maintenance of a comprehensive descriptive environment where new categories are
easily inserted and reused without the need for any strong consistency check with the registry at large.
That said, the following basic constraints are part of the data category model, as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the proper identification of interoperability
descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/
makes it possible to compare morpho-syntactic annotations based on different descriptive levels of
granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable,
the possible values of so-called complex data categories. For instance, it can be used to record that the
possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) could be
a subset of {/masculine/, /feminine/ and /neutral/};
- language-specific constraints, either in the form of specific application notes or as explicit restrictions
bearing upon the conceptual domains of complex data categories. For instance, it is possible to express
explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
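The three kinds of constraint listed above can be pictured with a small in-memory registry. This is only an illustrative sketch: the `DataCategory` class, the `REGISTRY` dictionary and the helper functions are hypothetical names invented here, not part of ISO 12620 or of the ISOCat implementation, and the language code `fr` is an assumed key.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class DataCategory:
    """Hypothetical, simplified view of one data category."""
    name: str
    broader: Optional[str] = None                    # generic-specific relation
    conceptual_domain: FrozenSet[str] = frozenset()  # possible values, if complex
    language_domains: Dict[str, FrozenSet[str]] = field(default_factory=dict)


# Toy registry holding only the examples quoted in the text.
REGISTRY = {
    "noun": DataCategory("noun"),
    "properNoun": DataCategory("properNoun", broader="noun"),
    "grammaticalGender": DataCategory(
        "grammaticalGender",
        conceptual_domain=frozenset({"masculine", "feminine", "neutral"}),
        language_domains={"fr": frozenset({"masculine", "feminine"})},
    ),
}


def is_a(name, ancestor, registry=REGISTRY):
    """Follow generic-specific links upward from `name` until `ancestor` or the top."""
    current = name
    while current is not None:
        if current == ancestor:
            return True
        current = registry[current].broader
    return False


def allowed_values(name, lang=None, registry=REGISTRY):
    """Return the language-specific restriction if one is recorded, else the full domain."""
    category = registry[name]
    return category.language_domains.get(lang, category.conceptual_domain)
```

With such a structure, an annotation tagged /properNoun/ can be compared with one tagged only /noun/ by calling `is_a("properNoun", "noun")`, and `allowed_values("grammaticalGender", "fr")` yields the two-value French restriction.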
This International Standard provides a comprehensive framework for the representation of morpho-syntactic
(also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical
abstraction level over language data (textual or spoken) and, depending on the language to be annotated,
together with the characteristics of the annotation tool or annotation scheme that is being used, can vary
enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this
International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens
(representing the surface segmentation of the source) and word-forms (identifying lexical abstractions
associated with groups of tokens). These two levels share the following specificities: on the one hand, they
can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous
compounds; on the other hand, any n-to-n combination can stand between word-forms and tokens.
As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]),
tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by
means of so-called stand-off annotations.
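The token/word-form distinction and the stand-off style of reference can be sketched as follows. The classes and the English example sentence are assumptions made for illustration; they are not the MAF XML serialization, which is defined later in the standard.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Surface segment, pointing into the source by character offsets (stand-off)."""
    id: str
    start: int
    end: int


@dataclass(frozen=True)
class WordForm:
    """Lexical abstraction attached to zero, one or several tokens."""
    lemma: str
    tokens: tuple  # ids of the tokens realising this word-form


SOURCE = "I don't like New York"

# Naive whitespace tokenization, used here only to obtain offsets.
TOKENS = {
    f"t{i}": Token(f"t{i}", m.start(), m.end())
    for i, m in enumerate(re.finditer(r"\S+", SOURCE), start=1)
}

WORD_FORMS = [
    WordForm("I", ("t1",)),
    WordForm("do", ("t2",)),               # one token, several word-forms
    WordForm("not", ("t2",)),
    WordForm("like", ("t3",)),
    WordForm("New York", ("t4", "t5")),    # several tokens, one word-form
]


def surface(word_form, source=SOURCE, tokens=TOKENS):
    """Recover the surface text of a word-form from the offsets of its tokens."""
    return " ".join(source[tokens[t].start:tokens[t].end] for t in word_form.tokens)


def word_forms_for(token_id, word_forms=WORD_FORMS):
    """List the lemmas of all word-forms attached to a given token."""
    return [wf.lemma for wf in word_forms if token_id in wf.tokens]
```

Because tokens carry only offsets, the source text stays untouched, while the word-form layer is free to group or split tokens as the linguistic analysis requires.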
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text.
Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry
in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are
expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in
Monachini and Calzolari, 1994). Such codes may also provide morphological information, including the part of
speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
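As a rough illustration of how a compact tag can stand for a basic feature structure, the sketch below expands a three-character positional code into feature-value pairs. The tag layout, the code letters and the feature names `partOfSpeech` and `grammaticalNumber` are invented for this example; only /grammaticalGender/ and its values appear in the text above.

```python
# Hypothetical positional tagset: position 1 = part of speech,
# position 2 = gender, position 3 = number; "-" means "not applicable".
POS = {"N": "noun", "A": "adjective", "V": "verb"}
GENDER = {"m": "masculine", "f": "feminine", "-": None}
NUMBER = {"s": "singular", "p": "plural", "-": None}


def expand_tag(tag):
    """Expand a compact tag such as 'Nfp' into a flat feature structure (a dict)."""
    pos, gender, number = tag  # raises ValueError unless the tag has exactly 3 characters
    features = {"partOfSpeech": POS[pos]}
    if GENDER[gender] is not None:
        features["grammaticalGender"] = GENDER[gender]
    if NUMBER[number] is not None:
        features["grammaticalNumber"] = NUMBER[number]
    return features
```

For instance, `expand_tag("Nfp")` describes a feminine plural noun, and `expand_tag("V-s")` a singular verb form with no gender feature. Tying each feature name and value to a registered data category is what makes two such tagsets comparable.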
In keeping with the general modelling strategy of ISO/TC 37, this International Standard/MAF provides means
of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data
categories available in ISOCat. A normative annex of this International Standard elicits a core set of data
categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual
context. However, when implementers of this International Standard find these categories inappropriate in
either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in
compliance with ISO/TC 37 principles.
Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise
MAF-compliant annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI)
guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is
essential, this International Standard also gives guidance on how to articulate the MAF model with
TEI-compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope
with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).
Finally, it should be noted here that this International Standard forms the conceptual basis for the
development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined
in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be
understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts;
such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation that refers to the data categories
contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML
serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding
Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24610-1, Language resource management — Feature structures — Part 1: Feature structure
representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-
syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as
grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the
masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are
defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be
the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller
meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.
3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.
3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set of graphic characters used for the written form of one or more languages
3.20
syntagmatic relation
relation by which linguistic units in a discourse are associated
3.21
token
non-empty contiguous sequence of graphemes or phonemes in a document
Note 1 to entry: For editorial reasons, some annotation schemes may extend the notion of token to an empty sequence.
See the section on token attachment (6.2).
3.22
tokenization
process identifying tokens
3.23
transcription
form resulting from a coherent method of writing down speech sounds
3.24
transliteration
form resulting from the conversion of one script into another, usually through a one-to-one correspondence
between characters
3.25
word-form
morpho-syntactic unit
contiguous or non-contiguous linguistic unit identified as corresponding to a lexical entity in a language
Note 1 to entry: Word-forms may have no acoustic or graphic realization, or may correspond to one or more tokens.
3.26
word lattice
set of possible alternative decompositions of a text or speech segment into word-forms
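A word lattice in the sense of 3.26 can be pictured as a directed acyclic graph (3.1) whose edges carry word-forms, each path from the initial to the final state being one alternative decomposition. The sketch below assumes a simple edge-list encoding and a French example chosen for illustration; neither comes from the standard.

```python
from collections import defaultdict


def lattice_paths(edges, start, end):
    """Enumerate every decomposition encoded by an acyclic lattice.

    `edges` is a list of (source_state, target_state, word_form) triples.
    The graph is assumed to be acyclic; a cycle would make the recursion endless.
    """
    outgoing = defaultdict(list)
    for src, dst, label in edges:
        outgoing[src].append((dst, label))

    def walk(state):
        if state == end:
            yield []
            return
        for dst, label in outgoing[state]:
            for rest in walk(dst):
                yield [label] + rest

    return list(walk(start))


# "pomme de terre" read either as three word-forms or as one compound.
EDGES = [
    (0, 1, "pomme"),
    (1, 2, "de"),
    (2, 3, "terre"),
    (0, 3, "pomme de terre"),
]
PATHS = lattice_paths(EDGES, start=0, end=3)
```

Here `PATHS` holds the two readings, the three-word sequence and the single compound, which is the kind of structural ambiguity the lattice is meant to keep side by side.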
...
SLOVENIAN STANDARD
SIST ISO 24611:2013
1 July 2013
Upravljanje z jezikovnimi viri - Ogrodje za oblikoskladenjsko označevanje (MAF)
Language resource management -- Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières -- Cadre d'annotation morphosyntaxique (MAF)
This Slovenian standard is identical to: ISO 24611:2012
ICS:
01.020 Terminologija (načela in koordinacija) / Terminology (principles and coordination)
SIST ISO 24611:2013 en,fr,de
2003-01. Slovenski inštitut za standardizacijo (Slovenian Institute for Standardization). Reproduction of this standard, in whole or in part, is not permitted.
---------------------- Page: 1 ----------------------
SIST ISO 24611:2013
---------------------- Page: 2 ----------------------
SIST ISO 24611:2013
INTERNATIONAL ISO
STANDARD 24611
First edition
2012-11-01
Language resource management —
Morpho-syntactic annotation framework
(MAF)
Gestion des ressources langagières — Cadre d'annotation
morphosyntaxique (MAF)
Reference number
ISO 24611:2012(E)
©
ISO 2012
---------------------- Page: 3 ----------------------
SIST ISO 24611:2013
ISO 24611:2012(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2012 – All rights reserved
---------------------- Page: 4 ----------------------
SIST ISO 24611:2013
ISO 24611:2012(E)
Contents Page
Foreword . v
Introduction . vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 The MAF meta-model . 4
4.1 Overview . 4
4.2 MAF Meta-model . 4
5 Segmenting with tokens . 6
5.1 General . 6
5.2 Formal description: . 7
5.3 Embedding notation . 7
5.4 Alternate representation for TEI based documents . 8
5.5 Stand-off notation . 9
5.6 Informative attributes . 9
5.7 Completing the inline token notation . 10
5.7.1 Joining tokens in embedded mode . 10
5.7.2 Overlapping tokens . 11
6 Word-forms as linguistic units . 11
6.1 Formal description: . 12
6.2 Token attachment . 12
6.2.1 One token; one word-form . 12
6.2.2 Several contiguous tokens; one word-form . 12
6.2.3 Several discontinuous tokens; one word-form . 13
6.2.4 Zero token; one word-form . 13
6.2.5 One token; several word-forms . 14
6.3 Referring to lexical entries . 14
6.4 Compound word-forms . 15
6.5 Identification of word-forms within a TEI-compliant document . 15
7 Morpho-syntactic content . 18
7.1 General . 18
7.2 Using feature structures . 18
7.3 Compact morpho-syntactic tags . 18
7.4 FSR libraries . 19
7.5 Designing tagsets . 20
7.6 Formal description: . 22
8 Handling ambiguities . 22
8.1 Word-form content ambiguities . 22
8.2 Lexical Ambiguities . 23
8.3 Structural ambiguities . 23
8.3.1 Structural ambiguities with word-forms . 23
8.3.2 Structural ambiguities with tokens . 24
8.4 Simplified structuring variants . 24
8.4.1 Non-ambiguous linear representation . 24
8.4.2 Mixed linear and lattice representation . 25
8.5 Expanding the simplified variants . 26
8.5.1 Separating tokens and word-forms . 26
8.5.2 Wrapping into local lattices . 26
© ISO 2012 – All rights reserved iii
---------------------- Page: 5 ----------------------
SIST ISO 24611:2013
ISO 24611:2012(E)
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language
resources. To this end, it has generalised the modelling strategy initiated by its sister committee, SC 3, for the
representation of terminological data [Romary, 2001], through which linguistic data models are seen as the
combination of a generic data pattern (a meta-model), which is further refined through a selection of data
categories that provide the descriptors for this specific annotation level. Such models are defined
independently of any specific formats, and ensure that an implementer has the necessary conceptual
instrument with which to design and compare formats with regard to their degrees of interoperability.
One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable
semantics for the various descriptors used, either in the form of formal features and feature values, or directly
as objects in a representation that is expressed, for instance, in XML. In order to be shared across various
annotation schemas and encoding applications, such a semantics should be implemented as a centralised
registry of concepts, henceforth referred to as data categories. Data categories should
satisfy the following constraints:
- From a technical point of view, they must provide unique, stable references (implemented as persistent
  identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to
  them in his or her specification. Two annotations are then deemed equivalent when they
  are defined in relation to the same data categories (as feature and feature value).
- From a descriptive point of view, each unique semantic reference should be associated with precise
  documentation combining a full-text explanation of the meaning of the descriptor with the expression of
  specific constraints that bear upon the category.
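The equivalence criterion in the first constraint can be sketched in code. The Python fragment below is a minimal illustration and not part of this International Standard: the identifier strings under `pid:example/` are invented placeholders rather than real persistent identifiers, and the names `Annotation` and `equivalent` are assumptions made for the example.

```python
from dataclasses import dataclass

# Placeholder identifiers (assumption): real ones would be issued by the registry.
DC = {
    "partOfSpeech": "pid:example/partOfSpeech",
    "noun": "pid:example/noun",
    "verb": "pid:example/verb",
}


@dataclass(frozen=True)
class Annotation:
    feature_ref: str  # identifier of the data category used as feature
    value_ref: str    # identifier of the data category used as feature value


def equivalent(a: Annotation, b: Annotation) -> bool:
    """Two annotations are equivalent when both references coincide."""
    return a.feature_ref == b.feature_ref and a.value_ref == b.value_ref


# Two schemas with different local names that point at the same data categories.
scheme_a = {"pos=N": Annotation(DC["partOfSpeech"], DC["noun"])}
scheme_b = {
    "cat=noun": Annotation(DC["partOfSpeech"], DC["noun"]),
    "cat=verb": Annotation(DC["partOfSpeech"], DC["verb"]),
}
```

Under these assumptions, `pos=N` in the first schema and `cat=noun` in the second compare as equivalent even though their local spellings differ, because only the referenced data categories are compared.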
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories, encompassing all domains of language resources. This initiative, described in ISO 12620,
has led to the implementation of an online environment providing access to all data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically as
part of the maintenance of the data category registry. It also provides access to the various data categories
that individual language technology practitioners have defined in the course of their own work and decided to
share with the community.
The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended
as a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective
there is to facilitate the maintenance of a comprehensive descriptive environment where new categories are
easily inserted and reused without the need for any strong consistency check with the registry at large.
Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the proper identification of interoperability
  descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/
  makes it possible to compare morpho-syntactic annotations based on different descriptive levels of
  granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable,
  the possible values of so-called complex data categories. For instance, it can be used to record that
  possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) could be
  a subset of {/masculine/, /feminine/ and /neutral/};
- language-specific constraints, either in the form of specific application notes or as explicit restrictions
  bearing upon the conceptual domains of complex data categories. For instance, it is possible to express
  explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
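The three kinds of constraint listed above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the ISO 12620 data model or any ISOCat interface: the dictionaries and function names are hypothetical, and only the category names and values quoted in the text are used.

```python
from typing import Dict, Optional, Set, Tuple

# Generic-specific relations: sub-category -> broader category.
BROADER: Dict[str, str] = {"properNoun": "noun"}

# Conceptual domain of a complex data category (values as quoted in the text).
DOMAIN: Dict[str, Set[str]] = {
    "grammaticalGender": {"masculine", "feminine", "neutral"},
}

# Language-specific restriction of a conceptual domain ("fr" stands for French).
LANG_DOMAIN: Dict[Tuple[str, str], Set[str]] = {
    ("grammaticalGender", "fr"): {"masculine", "feminine"},
}


def subsumes(general: str, specific: str) -> bool:
    """True if `specific` is `general` itself or one of its sub-categories."""
    current: Optional[str] = specific
    while current is not None:
        if current == general:
            return True
        current = BROADER.get(current)
    return False


def allowed_values(category: str, lang: Optional[str] = None) -> Set[str]:
    """Values permitted for a category, narrowed by language when a rule exists."""
    if lang is not None and (category, lang) in LANG_DOMAIN:
        return LANG_DOMAIN[(category, lang)]
    return DOMAIN[category]
```

With such a model, an annotation tagged /properNoun/ can be matched against a coarser scheme that only knows /noun/, and a validator can reject /neutral/ as a gender value in a French corpus.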
This International Standard provides a comprehensive framework for the representation of morpho-syntactic
(also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical
abstraction level over language data (textual or spoken) and, depending on the language to be annotated,
together with the characteristics of the annotation tool or annotation scheme that is being used, can vary
enormously in structure and complexity.
In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this
International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens
(representing the surface segmentation of the source) and word-forms (identifying lexical abstractions
associated with groups of tokens). These two levels share the following characteristics: on the one hand, they
can be represented as simple sequences or as local graphs, such as multiple segmentations and ambiguous
compounds; on the other hand, any n-to-m combination can hold between word-forms and tokens.
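The separation of the two levels can be sketched as two record types linked by identifiers. The Python fragment below is an illustration only: the class names, field names and the French example phrase are assumptions chosen for the sketch and are not taken from the MAF specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    ident: str
    text: str  # surface segment of the source


@dataclass(frozen=True)
class WordForm:
    lemma: str
    token_ids: Tuple[str, ...]  # tokens this word-form is attached to


# Surface segmentation of the illustrative phrase "du chemin de fer".
tokens = [
    Token("t1", "du"),
    Token("t2", "chemin"),
    Token("t3", "de"),
    Token("t4", "fer"),
]

word_forms = [
    WordForm("de", ("t1",)),                        # one token, several word-forms:
    WordForm("le", ("t1",)),                        # "du" stands for "de" + "le"
    WordForm("chemin de fer", ("t2", "t3", "t4")),  # several tokens, one word-form
]


def word_forms_of(token_id: str) -> List[str]:
    """Lemmas of all word-forms attached to the given token."""
    return [wf.lemma for wf in word_forms if token_id in wf.token_ids]


def tokens_of(word_form: WordForm) -> List[str]:
    """Surface texts of the tokens a word-form is attached to."""
    by_id: Dict[str, Token] = {t.ident: t for t in tokens}
    return [by_id[i].text for i in word_form.token_ids]
```

Keeping the link on the word-form side, as a tuple of token identifiers, is one simple way to allow several word-forms per token and several tokens per word-form at the same time.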
As linguistic segments (sometimes called ‘markables’ in the literature [see, for instance, Carletta et al. 1997]),
tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by
means of so-called stand-off annotations.
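The difference between the two modes can be made concrete with a short Python sketch. It assumes character offsets as the pointing mechanism and uses `token` as a placeholder element name; both are choices made for this illustration, not statements about the serialization defined later in this International Standard.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

source = "They walked home."

# Stand-off tokens: the source text is left untouched and tokens point into it.
spans: List[Span] = [(0, 4), (5, 11), (12, 16), (16, 17)]


def resolve(text: str, token_spans: List[Span]) -> List[str]:
    """Recover the surface string of each stand-off token."""
    return [text[start:end] for start, end in token_spans]


def to_inline(text: str, token_spans: List[Span], tag: str = "token") -> str:
    """Embed the same tokens in the text as inline mark-up."""
    pieces = []
    position = 0
    for start, end in token_spans:
        pieces.append(escape(text[position:start]))  # text between tokens
        pieces.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        position = end
    pieces.append(escape(text[position:]))
    return "".join(pieces)
```

Both functions describe the same segmentation; the stand-off form is the one to prefer when the source document cannot or should not be modified.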
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text.
Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry
in a dictionary. In most existing applications of morpho-syntactic annotation, linguistic properties are
expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in
Monachini and Calzolari, 1994). Such codes may also provide morphological information, including the part of
speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and verbal tense.
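The relation between a compact tag and the feature structure it stands for can be sketched as a lookup table. In the Python fragment below, the tag codes (`NCMS`, `NCFP`, `VIP3S`) and most feature names are invented for illustration; they do not come from an actual tagset or from the normative annex.

```python
from typing import Dict

FeatureStructure = Dict[str, str]

# Hypothetical tagset: each compact code abbreviates a flat feature structure.
TAGSET: Dict[str, FeatureStructure] = {
    "NCMS": {
        "partOfSpeech": "noun",
        "grammaticalGender": "masculine",
        "grammaticalNumber": "singular",
    },
    "NCFP": {
        "partOfSpeech": "noun",
        "grammaticalGender": "feminine",
        "grammaticalNumber": "plural",
    },
    "VIP3S": {
        "partOfSpeech": "verb",
        "mood": "indicative",
        "tense": "present",
        "person": "third",
        "grammaticalNumber": "singular",
    },
}


def expand(tag: str) -> FeatureStructure:
    """Return a copy of the feature structure that a compact tag stands for."""
    return dict(TAGSET[tag])


def shared_features(tag_a: str, tag_b: str) -> FeatureStructure:
    """Feature-value pairs that two tags have in common."""
    a, b = expand(tag_a), expand(tag_b)
    return {name: value for name, value in a.items() if b.get(name) == value}
```

Expanding tags before comparing them makes it possible to relate annotations that use different codes, since the comparison is carried out on features and values rather than on the codes themselves.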
In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides means
of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data
categories available in ISOCat. A normative annex of this International Standard sets out a core set of data
categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual
context. However, when implementers of this International Standard find these categories inappropriate in
either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in
compliance with ISO/TC 37 principles.
Together with the meta-model, MAF also provides a default XML syntax that may be used to serialise
MAF-compliant annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI)
guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is
essential, this International Standard also indicates how to align the MAF model with TEI-compliant
encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope
with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).
Finally, it should be noted here that this International Standard forms the conceptual basis for the
development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined
in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be
understood according to the token–word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(E)
Language resource management — Morpho-syntactic
annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in texts;
such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties.
It describes a metamodel for morpho-syntactic annotation that refers to the data categories
contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML
serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding
Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24610-1, Language resource management — Feature structures — Part 1: Feature structure
representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: DAGs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express morpho-
syntactic content
Note 1 to entry: Feature structures are described in ISO 24610-1.
3.4
FSA
finite state automata
graphs made up of states with an initial state and a final state, and a finite set of transitions from state to state
Note 1 to entry: See also DAG (3.1).
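As an informal illustration of definitions 3.1 and 3.4, the Python sketch below represents a small lattice as a list of transitions, checks that it is acyclic, and enumerates the label sequences leading from the initial state to the final state. The state numbers, labels and function names are assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Hashable, List, Tuple

Transition = Tuple[Hashable, Hashable, str]  # (source state, target state, label)

# Two competing segmentations of the same stretch of text (illustrative data).
lattice: List[Transition] = [
    (0, 1, "pomme"),
    (1, 2, "de"),
    (2, 3, "terre"),
    (0, 3, "pomme de terre"),
]


def is_acyclic(transitions: List[Transition]) -> bool:
    """Kahn's algorithm: a graph is acyclic if every state can be removed in turn."""
    nodes = set()
    successors = defaultdict(list)
    indegree = defaultdict(int)
    for src, dst, _ in transitions:
        nodes.update((src, dst))
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return removed == len(nodes)


def paths(transitions: List[Transition], initial, final) -> List[List[str]]:
    """All label sequences from `initial` to `final`; call only on acyclic graphs."""
    outgoing = defaultdict(list)
    for src, dst, label in transitions:
        outgoing[src].append((dst, label))

    def walk(state, labels):
        if state == final:
            yield labels
            return
        for dst, label in outgoing[state]:
            yield from walk(dst, labels + [label])

    return list(walk(initial, []))
```

Each path through such a graph corresponds to one reading of the input, which is why acyclicity matters: it guarantees that the set of readings is finite.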
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take when used in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features, such as
grammatical number and case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages, the lemma is usually the singular if there is a variation in number, the
masculine form if there is a variation in gender, and the infinitive for all verbs. In some languages, certain nouns are
defective in the singular form; in these cases, the plural is chosen. For verbs in Arabic, the lemma is usually deemed to be
the third person singular with the accomplished aspect.
3.9
lexeme
morpheme generally associated with a set of word-forms sharing a common meaning
3.10
lexical entry
container for managing a set of word-forms and possibly one or more meanings to describe a lexeme
3.11
lexicon
resource comprising a collection of lexical entries for a language
3.12
morpheme
smallest linguistic unit that carries a meaning in a discourse, but which cannot be divided into smaller
meaningful units
Note 1 to entry: A morpheme is either grammatical (grammeme) or lexical (lexeme).
3.13
morphological feature
morpho-syntactic feature
feature induced from the inflected form of a word
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for European languages.
EXAMPLE “grammaticalGender”.
3.14
morphology
description of the structure and formation of word-forms
3.15
morpho-syntactic tag
tag
feature structure used systematically to qualify a word-form
3.16
tagset
comprehensive set of tags used for the morpho-syntactic description of a language
Note 1 to entry: The ISOCat data category registry is to be used as the reference for describing a tagset.
3.17
part of speech
grammatical category
category assigned to a word based on its grammatical and semantic properties
EXAMPLE Noun, verb.
Note 1 to entry: The ISOCat data category registry provides a comprehensive list of values for parts of speech.
3.18
phoneme
minimal unit in the sound system of a language
3.19
script
set of graphic characters used for the written form of one or more languages
3.20
syntagmatic relation
relation by which linguistic units in a discourse are associated
3
...
INTERNATIONAL STANDARD
ISO 24611
First edition
2012-11-01
Language resource management — Morpho-syntactic annotation
framework (MAF)
Reference number
ISO 24611:2012(F)
© ISO 2012
COPYRIGHT PROTECTED DOCUMENT
© ISO 2012, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in
any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an
intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO's member body in the country of the requester.
ISO copyright office
Ch. de Blandonnet 8 CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
copyright@iso.org
www.iso.org
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF metamodel
4.1 Overview
4.2 MAF metamodel
5 Segmentation
5.1 General
5.2 Formal description:
5.3 Embedded notation
5.4 Alternative representation for TEI-compliant documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different
types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of
the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
intellectual property rights or similar rights. ISO shall not be held responsible for failing to identify
such rights or to give notice of their existence. Details of any intellectual property rights or similar
rights identified during the development of the document are given in the Introduction and/or in the list of
patent declarations received by ISO (see www.iso.org/brevets).
Any trade names mentioned in this document are given for information, for the convenience of users, and do
not constitute an endorsement.
For an explanation of the meaning of ISO-specific terms and expressions related to conformity assessment, or
for information about ISO's adherence to the World Trade Organization (WTO) principles concerning Technical
Barriers to Trade (TBT), see the following link: www.iso.org/iso/fr/avant‐propos.html.
The committee responsible for this document is ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
ISO/TC 37/SC 4 focuses on the definition of the models and formats used to represent annotated language
resources. To this end, it generalises the modelling strategy initiated by its sister committee, SC 3, for the
representation of terminological data [Romary, 2001], in which linguistic data models are regarded as the
combination of a generic data pattern (a metamodel), which is then refined through a selection of data
categories that provide the descriptors for this specific annotation level. These models are defined
independently of specific formats and give the implementer the conceptual tool needed to design and compare
formats with regard to their degrees of interoperability.
To represent any kind of annotation, it is important to provide a clear and reliable semantics for the
various descriptors used, either in the form of formal features and feature values, or directly as objects in
a representation expressed, for instance, in XML. For this semantics to be shared across different annotation
schemas and encoding applications, it should be implemented as a centralised registry of concepts; these
concepts are referred to here as data categories. As such, data categories should satisfy the following
conditions:
- From a technical point of view, they must provide unique, stable references (implemented as persistent
  identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to
  them in his or her specification. Two annotations are then considered equivalent when they refer to the
  same data category (as feature and value).
- From a descriptive point of view, each unique semantic reference should be associated with precise
  documentation combining a prose explanation of the meaning of the descriptor with the expression of the
  specific constraints that bear upon the category.
In recent years, ISO has developed a general framework for representing and maintaining such a registry of
data categories covering all domains of language resources. This initiative, specified in ISO 12620, has led
to the implementation of an online environment that provides access to all the data categories that have been
standardized in the context of the various language resource-related activities within ISO, or specifically
as part of the maintenance of the data category registry. The system also provides access to the various data
categories that language technology practitioners have defined in the course of their own work and
subsequently shared with the community.
The data category registry, accessible through the ISOCat implementation (www.isocat.org), is simply a space
of semantic objects offering only a limited set of ontological constraints. The objective is to facilitate
the maintenance of an environment in which new categories are easily inserted and reused without the need for
an in-depth consistency check against the rest of the registry. Indeed, the following basic constraints are
intrinsic to the data category model as defined in ISO 12620:
- simple generic-specific relations, when these are useful for the exact identification of interoperability
  descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/
  makes it possible to compare morpho-syntactic annotations based on different levels of granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or identifiable,
  the possible values of so-called complex data categories. For instance, it can be used to record that the
  possible values of /grammaticalGender/ (limited to a small group of languages [Romary 2011]) can be a
  subset of {/masculine/, /feminine/ and /neutral/};
- language-specific constraints, either in the form of application notes or as explicit restrictions bearing
  upon the conceptual domains of data categories. For instance, it is possible to express explicitly that
  /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}.
This International Standard provides a comprehensive framework for the representation of morpho-syntactic
annotations (also referred to as part-of-speech annotations). This annotation level corresponds to a first
level of abstraction over language data (textual or spoken), whose structure and complexity can vary
considerably depending on the language to be annotated and on the characteristics of the annotation tool or
annotation scheme being used.
In order to deal with the complex issues of ambiguity and determinism in morpho-syntactic annotation, this
International Standard introduces a meta-model that draws a clear distinction between the two levels of
tokens (representing the surface segmentation of the source) and word-forms (identifying the lexical
abstractions associated with groups of tokens). These two levels share the following characteristics: on the
one hand, they can be represented as simple sequences and as local graphs such as multiple segmentations and
ambiguous elements; on the other hand, any n-to-m combination can link tokens and word-forms.
As linguistic segments (sometimes called 'tokens' or 'markables' in the English-language technical
literature [see, for instance, Carletta et al. 1997]), tokens may be embedded in the source document as
inline mark-up, or they may refer to it by means of stand-off annotations.
As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the
morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the
annotated text. These properties may take various forms, from the simple indication of a lemma to an explicit
reference to a lexical entry in a dictionary. In most existing applications of morpho-syntactic annotation,
linguistic properties are expressed by means of tags; these codes refer to basic feature structures (see the
examples in Monachini and Calzolari, 1994). Such codes may also provide morphological information, including
the part of speech (e.g. noun, adjective or verb), and features such as number, gender, person, mood and
verbal tense.
In keeping with the general modelling strategy of ISO/TC 37, this International Standard (the MAF framework)
provides means of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610)
to the data categories of ISOCat. A normative annex of this International Standard sets out a core set of
data categories that can be used as reference for most morpho-syntactic annotation tasks in a multilingual
context. However, if users of this International Standard find these categories inappropriate in coverage,
scope or semantics, they are invited to use ISOCat to define their own categories in compliance with
ISO/TC 37 principles.
Together with the meta-model, the MAF framework also provides a default XML syntax that may be used to
serialise compliant annotation models. Since many existing projects are based on the guidelines issued by the
TEI consortium (Text Encoding Initiative, www.tei-c.org), particularly in digital humanities, where a proper
encoding of textual sources is essential, this International Standard also provides information on how to
reconcile the MAF model with TEI-compliant encodings. Indeed, the TEI guidelines already offer a wide variety
of constructs and mechanisms to cope with the many challenges posed by spoken corpora and their annotations
(Romary and Witt, 2012).
Finally, it should be noted that this International Standard forms the conceptual basis for the development
of the ISO 24614 series on word segmentation. All the general principles and rules defined in ISO 24614-1, as
well as the constraints expressed in additional parts for specific languages, are to be understood according
to the token/word-form dichotomy.
INTERNATIONAL STANDARD ISO 24611:2012(F)
Language resource management - Morpho-syntactic annotation framework (MAF)
1 Scope
This International Standard provides a framework for the representation of annotations of word-forms in
texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic
properties.
It describes a metamodel for morpho-syntactic annotation that refers to the data categories contained in the
ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for
morpho-syntactic annotation, with equivalences to the guidelines of the TEI (Text Encoding Initiative).
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24610-1, Language resource management - Feature structures - Part 1: Feature structure representation
3 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24610-1 and the following apply.
3.1
DAG
directed acyclic graph
graph with directed edges and no cycles
Note 1 to entry: Directed acyclic graphs are a subset of finite state automata (3.4).
3.3
feature structure
set of feature specifications, used in the morpho-syntactic annotation framework (MAF) to express
morpho-syntactic content
Note 1 to entry: Feature structures are specified in ISO 24610-1.
3.4
FSA
finite state automaton
graph made up of several states, with an initial state and a final state, and a finite set of transitions
from one state to another
Note 1 to entry: See also DAG (3.1).
3.5
grapheme
minimal unit in a written language
EXAMPLE Letter, pictogram, ideogram, numeral, punctuation.
3.6
inflection
modification or marking of a lexeme that reflects its morpho-syntactic properties
3.7
inflected form
form that a word can take in a sentence or a phrase
Note 1 to entry: An inflected form of a word is associated with a combination of morphological features,
such as grammatical number or case.
3.8
lemma
lemmatised form
conventional form chosen to represent a lexeme
Note 1 to entry: In European languages
...