SIST ISO 24613-1:2019
Language resource management -- Lexical markup framework (LMF) -- Part 1: Core model
This document describes the core model of the lexical markup framework (LMF), a metamodel for representing data in monolingual and multilingual lexical databases used with computer applications.
LMF provides mechanisms that allow the development and integration of a variety of electronic lexical resource types.
Gestion de ressources linguistiques -- Cadre de balisage lexical -- Partie 1: Modèle de base
Upravljanje jezikovnih virov - Ogrodje za označevanje leksikonov (LMF) - 1. del: Jedrni model
SIST ISO 24613-1:2019 is a standard published by the Slovenian Institute for Standardization (SIST). Its full title is "Language resource management -- Lexical markup framework (LMF) -- Part 1: Core model". It describes the core model of the lexical markup framework (LMF), a metamodel for representing data in monolingual and multilingual lexical databases used with computer applications.
SIST ISO 24613-1:2019 is classified under the following ICS (International Classification for Standards) categories: 01.020 - Terminology (principles and coordination); 01.140.20 - Information sciences; 35.240.30 - IT applications in information, documentation and publishing. The ICS classification helps identify the subject area and facilitates finding related standards.
SIST ISO 24613-1:2019 is linked to the following standards: SIST ISO 24613:2013 and SIST ISO 24613-1:2024. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.
Standards Content (Sample)
SLOVENIAN STANDARD
1 October 2019
Upravljanje jezikovnih virov - Ogrodje za označevanje leksikonov (LMF) - 1. del: Jedrni model
Language resource management -- Lexical markup framework (LMF) -- Part 1: Core model
Gestion de ressources linguistiques -- Cadre de balisage lexical -- Partie 1: Modèle de base
This Slovenian standard is identical to: ISO 24613-1:2019
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
INTERNATIONAL STANDARD ISO 24613-1
First edition
2019-06
Language resource management —
Lexical markup framework (LMF) —
Part 1:
Core model
Gestion des ressources linguistiques — Cadre de balisage lexical
(LMF) —
Partie 1: Modèle de base
© ISO 2019
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
ii © ISO 2019 – All rights reserved
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Key standards used by LMF
4.1 Unicode
4.2 Language coding
4.3 Script coding
4.4 Unified modeling language (UML)
5 The LMF model
5.1 Introduction
5.2 Class inheritance and data category selection procedures
5.2.1 Class inheritance
5.2.2 LMF attributes
5.2.3 Data category selection (DCS)
5.2.4 User-defined data categories
5.3 LMF core package
5.3.1 General
5.3.2 LexicalResource class
5.3.3 GlobalInformation class
5.3.4 Lexicon class
5.3.5 LexiconInformation class
5.3.6 LexicalEntry class
5.3.7 Form class
5.3.8 OrthographicRepresentation class
5.3.9 GrammaticalInformation class
5.3.10 Sense class
5.3.11 Definition class
5.4 Cross reference (CrossREF) model
5.4.1 General
5.4.2 CrossREF and CrossREFConstraint classes
5.4.3 CrossREFConstraint class
5.5 Methods for data category selection and subclass creation
5.5.1 General
5.5.2 Generalization (typing)
5.5.3 Object instantiation
5.5.4 Design choices
5.5.5 Data categories for orthographic representation
5.5.6 Principles for model simplification
5.6 LMF extension use
5.6.1 General
5.6.2 Lexicon comparison
Annex A (informative) Data category examples
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see
www.iso.org/iso/foreword.html.
The document was prepared by Technical Committee ISO/TC 37, Language and terminology,
Subcommittee 4, Language resource management.
This first edition of ISO 24613-1, together with ISO 24613-2 to ISO 24613-6, cancels and replaces
ISO 24613:2008, which has been technically revised.
The main changes compared to the previous edition are as follows:
The content has been entirely revised and subdivided into parts. Part 1, Core model, contains the
body of the previous edition. New classes include LexiconInformation and GrammaticalInformation.
The Representation class has been renamed the OrthographicRepresentation class. In addition, the
OrthographicRepresentation subclasses, FormRepresentation and TextRepresentation, no longer are
part of the core model, providing it with greater modeling flexibility. The LexicalEntry subclass now
allows subclasses, providing improved extensibility and flexibility for modeling future parts. The
addition of the CrossREF class and associated metadata provides a formal model for cross-reference
design and implementation, closing a functional gap in the previous edition. A thoroughly revised
description of data category allocation mechanisms and their relationship to generalization by typing
provides a more incisive description of how these interdependent mechanisms enable flexible and
extensible designs.
A list of all parts in the ISO 24613 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Optimizing the production, maintenance and extension of electronic lexical resources is one of the
crucial aspects impacting human language technologies (HLT) in general and natural language
processing (NLP) in particular, as well as human-oriented translation technologies. A second crucial
aspect involves optimizing the process leading to their integration in applications. Lexical markup
framework (LMF) is an abstract metamodel that provides a common, standardized framework for
the construction of computational lexicons. LMF ensures the encoding of linguistic information in a
way that enables reusability in different applications and for different tasks. LMF provides a common,
shared representation of lexical objects, including morphological, syntactic and semantic aspects.
The goals of LMF are to provide a common model for the creation and use of electronic lexical resources
ranging from small to large in scale, to manage the exchange of data between and among these
resources, and to facilitate the merging of large numbers of different individual electronic resources to
form extensive global electronic resources. The ultimate goal of LMF is to create a modular structure
that will facilitate true content interoperability across all aspects of electronic lexical resources.
LMF supports existing lexical resource models such as Genelex [3], the EAGLES International Standard
for Language Engineering (ISLE) [4], Multilingual ISLE Lexical Entry (MILE) models [10], Text Encoding
Initiative (TEI) guidelines [8], Ontolex [7], and the Language Base Exchange (LBX) serialization together
with the U.S. Government Wordscape On-Line Dictionary system [5].
LMF uses UML modeling processes [9]. The LMF core package describes the basic hierarchy of information
of a lexical entry, including information on the word form. The core package is supplemented by various
resources that are part of the definition of LMF. These resources include:
— specific data categories used by the variety of resource types associated with LMF, both those data
categories relevant to the metamodel itself, and those associated with the extensions to the core
package in additional LMF parts (see Annex A for data category examples);
— the constraints governing the relationship of these data categories to the metamodel and to its
extensions;
— standard procedures for expressing these categories and thus for anchoring them on the structural
skeleton of LMF and relating them to the respective extension models;
— the vocabularies used by LMF to express related informational objects for describing how to extend
LMF through linkage to a variety of specific resources (extensions) and methods for analysing and
designing such linked systems.
LMF parts are expressed in a framework that describes the reuse of the LMF core components (such as
structures, data categories, and vocabularies) in conjunction with the additional components required
for a specific resource.
The parts currently in or planned for the new organization of ISO 24613 include Part 1: Core model, Part
2: Machine readable dictionary (MRD) model, Part 3: Diachrony-etymology, Part 4: TEI serialization, Part
5: LBX serialization, and Part 6: Syntax and semantics.
The ISO 24613 series is designed to coordinate closely with ISO 16642 [2].
INTERNATIONAL STANDARD ISO 24613-1:2019(E)
Language resource management — Lexical markup
framework (LMF) —
Part 1:
Core model
1 Scope
This document describes the core model of the lexical markup framework (LMF), a metamodel for
representing data in monolingual and multilingual lexical databases used with computer applications.
LMF provides mechanisms that allow the development and integration of a variety of electronic lexical
resource types.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 639 (all parts), Codes for the representation of names of languages
ISO 15924, Information and documentation — Codes for the representation of names of scripts
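As an illustration of how these two normative references might surface in an implementation, the following minimal Python sketch checks that a language tag and a script tag have the general shape of ISO 639 and ISO 15924 codes. The function names and the regular expressions are assumptions made for this example. They test only the shape of a code (two or three lowercase letters for ISO 639, four letters with an initial capital for ISO 15924) and do not consult the official code registries.

```python
import re

# Shape-only patterns (assumptions for illustration, not registry lookups).
_LANGUAGE_SHAPE = re.compile(r"[a-z]{2,3}")   # e.g. "en", "slv"
_SCRIPT_SHAPE = re.compile(r"[A-Z][a-z]{3}")  # e.g. "Latn", "Cyrl"


def looks_like_language_code(code: str) -> bool:
    """Return True if `code` has the shape of an ISO 639 language code."""
    return _LANGUAGE_SHAPE.fullmatch(code) is not None


def looks_like_script_code(code: str) -> bool:
    """Return True if `code` has the shape of an ISO 15924 script code."""
    return _SCRIPT_SHAPE.fullmatch(code) is not None
```

A production system would more likely validate against the published code tables than against a pattern, since many well-shaped strings are not assigned codes.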
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at http://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
data category
DC
elementary descriptor used in a linguistic description or annotation scheme
3.2
word form
instance of a word, multiword expression, root, stem, or morpheme
3.3
grammatical feature
property associated with a word form (3.2) to describe one of its grammatical attributes
EXAMPLE /grammatical gender/
3.4
lemma
lemmatized form
canonical form
conventional word form (3.2) chosen to represent a lexeme (3.5)
Note 1 to entry: In many European languages, the lemma is usually the /singular/ for a noun if there is a variation
in /number/, the /masculine/ form if there is a variation in /gender/ and the /infinitive/ for all verbs. In some
languages, certain nouns are defective in the singular form, in which case the /plural/ is chosen. In Arabic, for
a verb, the lemma is sometimes considered as being the third person singular with the accomplished aspect, in
other approaches it is considered as being the root.
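The convention described in Note 1 can be sketched in Python as follows. The `WordForm` type, the feature names (`number`, `verbForm`) and the `choose_lemma` helper are hypothetical and are not defined by this document. The sketch encodes only part of the European-language convention from the note: singular for nouns, with the plural as a fallback when no singular exists, and infinitive for verbs. The preference for the masculine form where gender varies is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WordForm:
    """Hypothetical word form carrying grammatical features."""
    written_form: str
    features: Dict[str, str] = field(default_factory=dict)


def choose_lemma(part_of_speech: str, forms: List[WordForm]) -> Optional[WordForm]:
    """Pick a lemma following the convention in Note 1 (illustrative only)."""

    def first_with(name: str, value: str) -> Optional[WordForm]:
        # First form whose feature `name` has the given value, if any.
        return next((f for f in forms if f.features.get(name) == value), None)

    if part_of_speech == "verb":
        return first_with("verbForm", "infinitive")
    if part_of_speech == "noun":
        # Prefer the singular; fall back to the plural for nouns that
        # are defective in the singular.
        return first_with("number", "singular") or first_with("number", "plural")
    # Other parts of speech are outside the scope of this sketch.
    return None
```

The Arabic cases mentioned in the note would need a different rule, which is one reason a real lexicon normally records the lemma explicitly instead of deriving it.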
3.5
lexeme
abstract unit generally associated with a set of word forms (3.2) sharing a common meaning
[SOURCE: ISO 24613:2008, 3.25, modified – "forms" replaced with "word forms".]
3.6
lexical resource
lexical database
database consisting of one or several lexicons (3.7)
3.7
lexicon
resource comprising lexical entries for one or several languages
Note 1 to entry: A special language lexicon or a lexicon prepared for a specific NLP application can comprise a
specific subset of a language.
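Definitions 3.6 and 3.7 imply a simple containment relation: a lexical resource holds one or several lexicons, and a lexicon holds lexical entries. A minimal Python sketch of that relation is given below. The class and attribute names are assumptions for illustration and should not be read as the normative LMF class definitions, which this excerpt does not reproduce.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexicalEntry:
    """Illustrative entry; only a lemma string is modelled here."""
    lemma: str


@dataclass
class Lexicon:
    """Resource comprising lexical entries for one or several languages."""
    languages: List[str]  # assumed to hold ISO 639 language codes
    entries: List[LexicalEntry] = field(default_factory=list)


@dataclass
class LexicalResource:
    """Database consisting of one or several lexicons."""
    lexicons: List[Lexicon]

    def __post_init__(self) -> None:
        # Definition 3.6 speaks of "one or several" lexicons.
        if not self.lexicons:
            raise ValueError("a lexical resource needs at least one lexicon")

    def entry_count(self) -> int:
        """Total number of lexical entries across all lexicons."""
        return sum(len(lexicon.entries) for lexicon in self.lexicons)
```

For example, a resource built from an English lexicon with two entries and a Slovenian lexicon with one entry would report three entries in total.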
3.8
multiword expression
MWE
lexeme (3.5) made up of a sequence of two or more lexemes that has properties that may not be
predictable from the properties of the individual lexemes or their normal mode of combination
EXAMPLE “To kick the bucket”, an idiomatic expression which means to die rather than to hit a bucket with
one's foot. An idiomatic expression is a subtype of MWE whose properties are not predictable from the properties
of the individual lexemes.
Note 1 to entry: An MWE can be a compound, a fragment of a sentence, or a sentence. The group of lexemes
making up an MWE can be continuous or discontinuous. It is not always possible to mark an MWE with a part of
speech (3.11).
3.9
natural language processing
NLP
field covering knowledge and techniques involved in the processing of linguistic data by a computer
3.10
orthography
way of spelling or writing lexemes (3.5) that conforms to a conventionalized use
Note 1 to entry: Usually, the notion of orthography covers standardized spellings of alphabetic languages, such
as standard UK or US English, or reformed German spelling, as well as hieroglyphic or syllabic writing systems.
For the purpose of this standard, we also subsume variations such as transliterations of languages in non-native
scripts, stenographic renderings, or representations in the International Phonetic Alphabet under the notion of
orthography.
3.11
part of speech
lexical category
word class
category assigned to a lexeme (3.5) based on its grammatical properties
EXAMPLE Typical parts of speech for European languages include: noun, verb, a
...
INTERNATIONAL ISO
STANDARD 24613-1
First edition
2019-06
Language resource management —
Lexical markup framework (LMF) —
Part 1:
Core model
Gestion des ressources linguistiques — Cadre de balisage lexical
(LMF) —
Partie 1: Modèle de base
Reference number
©
ISO 2019
© ISO 2019
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Key standards used by LMF
4.1 Unicode
4.2 Language coding
4.3 Script coding
4.4 Unified modeling language (UML)
5 The LMF model
5.1 Introduction
5.2 Class inheritance and data category selection procedures
5.2.1 Class inheritance
5.2.2 LMF attributes
5.2.3 Data category selection (DCS)
5.2.4 User-defined data categories
5.3 LMF core package
5.3.1 General
5.3.2 LexicalResource class
5.3.3 GlobalInformation class
5.3.4 Lexicon class
5.3.5 LexiconInformation class
5.3.6 LexicalEntry class
5.3.7 Form class
5.3.8 OrthographicRepresentation class
5.3.9 GrammaticalInformation class
5.3.10 Sense class
5.3.11 Definition class
5.4 Cross reference (CrossREF) model
5.4.1 General
5.4.2 CrossREF and CrossREFConstraint classes
5.4.3 CrossREFConstraint class
5.5 Methods for data category selection and subclass creation
5.5.1 General
5.5.2 Generalization (typing)
5.5.3 Object instantiation
5.5.4 Design choices
5.5.5 Data categories for orthographic representation
5.5.6 Principles for model simplification
5.6 LMF extension use
5.6.1 General
5.6.2 Lexicon comparison
Annex A (informative) Data category examples
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see
www.iso.org/iso/foreword.html.
The document was prepared by Technical Committee ISO/TC 37, Language and terminology,
Subcommittee 4, Language resource management.
This first edition of ISO 24613-1, together with ISO 24613-2 to ISO 24613-6, cancels and replaces
ISO 24613:2008, which has been technically revised.
The main changes compared to the previous edition are as follows:
The content has been entirely revised and subdivided into parts. Part 1, Core model, contains the
body of the previous edition. New classes include LexiconInformation and GrammaticalInformation.
The Representation class has been renamed the OrthographicRepresentation class. In addition, the
OrthographicRepresentation subclasses, FormRepresentation and TextRepresentation, no longer are
part of the core model, providing it with greater modeling flexibility. The LexicalEntry subclass now
allows subclasses, providing improved extensibility and flexibility for modeling future parts. The
addition of the CrossREF class and associated metadata provides a formal model for cross-reference
design and implementation, closing a functional gap in the previous edition. A thoroughly revised
description of data category allocation mechanisms and their relationship to generalization by typing
provides a more incisive description of how these interdependent mechanisms enable flexible and
extensible designs.
A list of all parts in the ISO 24613 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Optimizing the production, maintenance and extension of electronic lexical resources is one of the
crucial aspects impacting human language technologies (HLT) in general and natural language
processing (NLP) in particular, as well as human-oriented translation technologies. A second crucial
aspect involves optimizing the process leading to their integration in applications. Lexical markup
framework (LMF) is an abstract metamodel that provides a common, standardized framework for
the construction of computational lexicons. LMF ensures the encoding of linguistic information in a
way that enables reusability in different applications and for different tasks. LMF provides a common,
shared representation of lexical objects, including morphological, syntactic and semantic aspects.
The goals of LMF are to provide a common model for the creation and use of electronic lexical resources
ranging from small to large in scale, to manage the exchange of data between and among these
resources, and to facilitate the merging of large numbers of different individual electronic resources to
form extensive global electronic resources. The ultimate goal of LMF is to create a modular structure
that will facilitate true content interoperability across all aspects of electronic lexical resources.
LMF supports existing lexical resource models such as Genelex [3], the EAGLES International Standard
for Language Engineering (ISLE) [4], Multilingual ISLE Lexical Entry (MILE) models [10], Text Encoding
Initiative (TEI) guidelines [8], Ontolex [7], and the Language Base Exchange (LBX) serialization together
with the U.S. Government Wordscape On-Line Dictionary system [5].
LMF uses UML modeling processes [9]. The LMF core package describes the basic hierarchy of information
of a lexical entry, including information on the word form. The core package is supplemented by various
resources that are part of the definition of LMF. These resources include:
— specific data categories used by the variety of resource types associated with LMF, both those data
categories relevant to the metamodel itself, and those associated with the extensions to the core
package in additional LMF parts (see Annex A for data category examples);
— the constraints governing the relationship of these data categories to the metamodel and to its
extensions;
— standard procedures for expressing these categories and thus for anchoring them on the structural
skeleton of LMF and relating them to the respective extension models;
— the vocabularies used by LMF to express related informational objects for describing how to extend
LMF through linkage to a variety of specific resources (extensions) and methods for analysing and
designing such linked systems.
LMF parts are expressed in a framework that describes the reuse of the LMF core components (such as
structures, data categories, and vocabularies) in conjunction with the additional components required
for a specific resource.
The parts currently in or planned for the new organization of ISO 24613 include Part 1: Core model, Part
2: Machine readable dictionary (MRD) model, Part 3: Diachrony-etymology, Part 4: TEI serialization, Part
5: LBX serialization, and Part 6: Syntax and semantics.
The ISO 24613 series is designed to coordinate closely with ISO 16642 [2].
INTERNATIONAL STANDARD ISO 24613-1:2019(E)
Language resource management — Lexical markup
framework (LMF) —
Part 1:
Core model
1 Scope
This document describes the core model of the lexical markup framework (LMF), a metamodel for
representing data in monolingual and multilingual lexical databases used with computer applications.
LMF provides mechanisms that allow the development and integration of a variety of electronic lexical
resource types.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 639 (all parts), Codes for the representation of names of languages
ISO 15924, Information and documentation — Codes for the representation of names of scripts
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at http://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
data category
DC
elementary descriptor used in a linguistic description or annotation scheme
3.2
word form
instance of a word, multi-word expression, root, stem, or morpheme
3.3
grammatical feature
property associated with a word form (3.2) to describe one of its grammatical attributes
EXAMPLE /grammatical gender/
3.4
lemma
lemmatized form
canonical form
conventional word form (3.2) chosen to represent a lexeme (3.5)
Note 1 to entry: In many European languages, the lemma is usually the /singular/ for a noun if there is a variation
in /number/, the /masculine/ form if there is a variation in /gender/ and the /infinitive/ for all verbs. In some
languages, certain nouns are defective in the singular form, in which case the /plural/ is chosen. In Arabic, for
a verb, the lemma is sometimes considered as being the third person singular with the accomplished aspect, in
other approaches it is considered as being the root.
3.5
lexeme
abstract unit generally associated with a set of word forms (3.2) sharing a common meaning
[SOURCE: ISO 24613:2008, 3.25, modified – "forms" replaced with "word forms".]
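As an informal illustration of how the terms in 3.2 to 3.5 relate, the following Python sketch groups word forms under a lexeme and designates one of them as the lemma. The class and field names (`WordForm`, `Lexeme`, `features`) are hypothetical and are not taken from this document; the layout of the grammatical features is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WordForm:
    """An instance of a word (3.2) with illustrative grammatical features (3.3)."""
    written: str
    features: tuple = ()  # e.g. (("number", "plural"),); hypothetical layout


@dataclass
class Lexeme:
    """Abstract unit (3.5) associated with a set of word forms."""
    lemma: WordForm                            # form chosen to represent the lexeme (3.4)
    forms: list = field(default_factory=list)  # the other word forms of the lexeme

    def all_forms(self):
        """Return the lemma followed by the other word forms, without duplicates."""
        seen, result = set(), []
        for form in [self.lemma, *self.forms]:
            if form not in seen:
                seen.add(form)
                result.append(form)
        return result


mouse = Lexeme(
    lemma=WordForm("mouse", (("number", "singular"),)),
    forms=[WordForm("mice", (("number", "plural"),))],
)
print([f.written for f in mouse.all_forms()])
```

Here the singular is chosen as the lemma, following the convention described in Note 1 to entry 3.4 for nouns that vary in number.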
3.6
lexical resource
lexical database
database consisting of one or several lexicons (3.7)
3.7
lexicon
resource comprising lexical entries for one or several languages
Note 1 to entry: A special language lexicon or a lexicon prepared for a specific NLP application can comprise a
specific subset of a language.
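The containment described in 3.6 and 3.7 (a lexical resource consists of one or several lexicons, and a lexicon comprises lexical entries for one or several languages) could be sketched as follows. This is a minimal sketch under stated assumptions: the field names and the idea of storing language codes on the lexicon are hypothetical, and the classes are far simpler than the LMF classes named in the table of contents.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    lemma: str  # simplified: a real entry carries forms, senses and more


@dataclass
class Lexicon:
    """Resource comprising lexical entries for one or several languages (3.7)."""
    languages: list                              # ISO 639 codes (assumed field)
    entries: list = field(default_factory=list)


@dataclass
class LexicalResource:
    """Database consisting of one or several lexicons (3.6)."""
    lexicons: list

    def __post_init__(self):
        # Enforce the "one or several" wording of the definition.
        if not self.lexicons:
            raise ValueError("a lexical resource consists of one or several lexicons")


resource = LexicalResource(lexicons=[
    Lexicon(languages=["en"], entries=[LexicalEntry("cat"), LexicalEntry("dog")]),
    Lexicon(languages=["fr"], entries=[LexicalEntry("chat")]),
])
entry_count = sum(len(lexicon.entries) for lexicon in resource.lexicons)
print(entry_count)
```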
3.8
multiword expression
MWE
lexeme (3.5) made up of a sequence of two or more lexemes that has properties that may not be
predictable from the properties of the individual lexemes or their normal mode of combination
EXAMPLE “To kick the bucket”, an idiomatic expression which means to die rather than to hit a bucket with
one's foot. An idiomatic expression is a subtype of MWE whose properties are not predictable from the properties
of the individual lexemes.
Note 1 to entry: An MWE can be a compound, a fragment of a sentence, or a sentence. The group of lexemes
making up an MWE can be continuous or discontinuous. It is not always possible to mark an MWE with a part of
speech (3.11).
3.9
natural language processing
NLP
field covering knowledge and techniques involved in the processing of linguistic data by a computer
3.10
orthography
way of spelling or writing lexemes (3.5) that conforms to a conventionalized use
Note 1 to entry: Usually, the notion of orthography covers standardized spellings of alphabetic languages, such
as standard UK or US English, or reformed German spelling, as well as hieroglyphic or syllabic writing systems.
For the purpose of this standard, we also subsume variations such as transliterations of languages in non-native
scripts, stenographic renderings, or representations in the International Phonetic Alphabet under the notion of
orthography.
3.11
part of speech
lexical category
word class
category assigned to a lexeme (3.5) based on its grammatical properties
EXAMPLE Typical parts of speech for European languages include: noun, verb, adjective, adverb,
preposition, etc.
3.12
script
set of graphic characters used for the written form of one or more languages
EXAMPLE Hiragana, Katakana, Latin and Cyrillic.
Note 1 to entry: The description of scripts ranges from a high level classification such as hieroglyphic or syllabic
writing systems vs. alphabets to a more precise classification like Roman vs. Cyrillic. Scripts are defined by a list
of values taken from ISO 15924.
[SOURCE: ISO/IEC 10646:2017, 3.50, modified – Example and Note 1 to entry added]
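To illustrate how orthography (3.10) and script (3.12) interact, the sketch below tags each written form with a language code and an ISO 15924 script code. The `WrittenForm` class and its fields are hypothetical names chosen for this example, and the lookup table lists only four ISO 15924 codes; a fuller implementation would presumably load the complete code list.

```python
from dataclasses import dataclass

# A few ISO 15924 four-letter codes with their script names (deliberately incomplete).
KNOWN_SCRIPTS = {
    "Latn": "Latin",
    "Cyrl": "Cyrillic",
    "Hira": "Hiragana",
    "Kana": "Katakana",
}


@dataclass(frozen=True)
class WrittenForm:
    text: str
    language: str  # ISO 639 code
    script: str    # ISO 15924 code

    def __post_init__(self):
        # Reject script codes that are not in the illustrative table above.
        if self.script not in KNOWN_SCRIPTS:
            raise ValueError(f"unknown script code: {self.script!r}")


# The same Japanese word ("cat") in three orthographies.
forms = [
    WrittenForm("ねこ", "ja", "Hira"),
    WrittenForm("ネコ", "ja", "Kana"),
    WrittenForm("neko", "ja", "Latn"),  # transliteration, treated as orthography per 3.10
]
print([KNOWN_SCRIPTS[f.script] for f in forms])
```

Keeping the language and the script as separate fields lets one lexeme carry several orthographic variants without duplicating the entry itself.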
4 Key
...
SIST ISO 24613-1:2019 delineates the core model of the lexical markup framework (LMF), serving as a foundational metamodel for the representation of data in both monolingual and multilingual lexical databases. The standard's scope is critical in the context of language resource management, as it facilitates the creation and integration of diverse electronic lexical resource types essential for modern computer applications. One of the significant strengths of SIST ISO 24613-1:2019 is its comprehensive framework that offers well-defined structures for lexical entries, their relationships, and various linguistic properties. By standardizing data representation, it ensures consistency across different lexical resources, which is paramount for interoperability and data sharing among linguists and developers alike. This aspect enhances the usability of lexical databases in applications such as machine translation, natural language processing, and language learning tools. Additionally, the relevance of this standard extends beyond mere data representation. The LMF's flexibility allows for accommodating the evolving nature of language and technology, making it a valuable asset for developers aiming to integrate lexical data into innovative applications. The inclusion of features that cater to both monolingual and multilingual contexts underscores its significance in an increasingly globalized world where cross-linguistic operations are commonplace. Overall, SIST ISO 24613-1:2019 stands out as an essential guideline in the realm of language resource management, providing a robust and adaptable framework that fosters the integration and utilization of lexical databases across various domains and applications. Its focus on standardization not only promotes efficiency in data handling but also significantly contributes to the advancement of language technologies.
The standard SIST ISO 24613-1:2019, entitled "Language resource management -- Lexical markup framework (LMF) -- Part 1: Core model", is an essential guide for language resource management, particularly for representing data in monolingual and multilingual lexical databases. It defines a core model that supports not only a consistent representation of data but also the integration and development of various types of electronic lexical resources. One of its strengths is that it provides robust mechanisms for harmonizing and standardizing linguistic tools, which makes greater interoperability between systems possible. With this approach, developers can build computer applications that use lexical resources efficiently and intuitively. In addition, the LMF model answers a growing need for uniformity in language resource management, which is crucial in today's globalized context, where linguistic data are widely used across several languages. SIST ISO 24613-1:2019 is therefore of undeniable relevance: it supports not only linguistic research but also technological innovation, notably in artificial intelligence and language applications. With its defined scope and clear mechanisms, the standard positions itself as an indispensable tool for professionals in the field, promoting better management of language resources and the creation of innovative solutions. In short, SIST ISO 24613-1:2019 is a pillar of language resource management, facilitating the evolution of lexical databases around the world.
SIST ISO 24613-1:2019 is a landmark document that describes the core model of the lexical markup framework (LMF). This metamodel plays a decisive role in representing data in monolingual and multilingual lexical databases used with computer applications. The standard provides a comprehensive basis for developing and integrating various types of electronic lexical resources. One of its greatest strengths is its flexibility and modularity, which allow LMF to be adapted to the specific requirements of different application areas. This is particularly relevant in an increasingly globalized world in which multilingual databases play a central role in language resource management. By standardizing how data structures are captured, LMF facilitates interoperability between different systems and resources. The document also promotes the standardization of terminology and definitions in lexical data processing. This matters not only to developers of software applications but also to linguists and researchers who depend on consistently structured data. The clear structure and detailed descriptions in the core model offer a solid basis for innovative developments in language technology and language resource management. Overall, SIST ISO 24613-1:2019 offers an indispensable framework that can considerably improve the effectiveness and efficiency of lexical resources. The relevance of the standard is especially evident in advancing digitalization and the growing use of artificial intelligence in language processing. Compliance with it is therefore highly important for everyone working in language resource management.
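The following is a review of the standardization document SIST ISO 24613-1:2019. The document details the core model of the lexical markup framework (LMF) and provides a metamodel for representing data in monolingual and multilingual lexical databases. The scope of the standard is broad: it provides mechanisms that enable the development and integration of a variety of electronic lexical resource types used with computer applications. This greatly improves efficiency and consistency in language resource management. The strength of SIST ISO 24613-1:2019 lies in its robust core model. The model is flexible enough to increase interoperability across different languages and database formats, and it can help promote the cross-cutting use of language resources. In particular, the structure that LMF provides helps organize complex linguistic data and makes it easier to search and manage. The standard also reflects the importance of language resources in the modern digital age and provides an indispensable foundation for future development and research. As an important tool for responding to diverse language environments and technological innovation, it is highly relevant to the management and use of language resources. SIST ISO 24613-1:2019 is a noteworthy document that sets the benchmark in the field of language resource management, and it is expected to influence future research and practice.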
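SIST ISO 24613-1:2019 is a standardized reference for language resource management that describes the core model of the lexical markup framework (LMF). The scope of the standard is to define a metamodel for representing the information used in monolingual and multilingual lexical databases, in support of interaction with computer applications. By providing mechanisms that enable the development and integration of electronic lexical resource types, the LMF standard lays a foundation for managing a variety of language resources. In particular, it clearly sets out the data structures and relationships that are essential in monolingual and multilingual environments, which improves the consistency of language data processing. Through the core model of the lexical markup framework, SIST ISO 24613-1:2019 contributes to the standardization of language resource management, so that developers and researchers can integrate and use diverse language resources more easily. The standard is most useful when adopted in information systems for which multilingual support is important, especially in today's globalized environment. The document explains the structure of LMF well and reflects current trends and requirements in efficient language data management and processing. SIST ISO 24613-1:2019 has therefore become an essential resource for specialists in language resource databases and related fields.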











