SIST ISO 5078:2025
Management of terminology resources - Terminology extraction
This document specifies methods for extracting candidate terms from text corpora and gives guidance on selecting relevant designations, definitions, concept relations and other terminology-related information.
Gestion des ressources terminologiques — Extraction de terminologie
Upravljanje terminoloških virov - Luščenje terminologije
General Information
Overview
SIST ISO 5078:2025 - Management of terminology resources - Terminology extraction specifies methods for extracting candidate terms and terminological data from text corpora and gives guidance for selecting relevant designations, definitions, concept relations and other terminology-related information. The standard defines key concepts (e.g., candidate term, termhood, text corpus, terminological data) and provides a reference framework to improve terminology extraction tools and workflows.
Key topics and technical requirements
- Scope and definitions
- Formal definitions for terms used in extraction workflows (candidate term, termhood, token, n‑gram, bitext, metadata, stop word).
- Text corpus compilation
- Guidance on types of text corpora, selection criteria, and considerations for corpus creation to support reliable terminology extraction.
- Extraction approaches and methods
- Classification of extraction approaches by language coverage, process, technique and technology. The document covers mainstream approaches (statistical, linguistic, hybrid and neural) without prescribing a single tool.
- Statistical and evaluative measures
- Use of metrics such as TF‑IDF, keyness, precision, recall, and concepts of noise and silence to assess extraction outputs.
- Output handling and validation
- Methods for filtering candidate term lists, assessing term eligibility, and reducing noise to maximize relevance.
- Workflow and implementation
- End‑to‑end workflow stages: corpus selection/building, preprocessing, identifying candidate terms, selecting relevant terms, allocating terms to concepts, identifying concept relations and completing terminological entries.
- Tool characteristics
- Guidance on tool features and expected behavior to optimize extraction performance and integration with terminology management systems.
Practical applications and users
SIST ISO 5078:2025 is valuable for organizations and professionals who create and maintain terminology resources and language assets, including:
- Terminologists and language professionals building glossaries and termbases
- Ontology engineers extracting concepts and concept relations for knowledge models
- Information and data scientists using terminology for information retrieval, search optimization, and text analytics
- Localization and translation teams leveraging bitexts and term extraction to improve translation memories and machine translation quality
- Software vendors developing terminology extraction, NLP and knowledge‑management tools
Practical uses include populating terminology databases, accelerating glossary creation, improving term consistency across documents, and supporting ontology construction.
Related standards
- ISO 704 - Terminology work - Principles and methods
- ISO 1087 - Terminology work and terminology science - Vocabulary
- ISO 16642 - Terminological markup framework
- ISO 26162-1 - Terminology databases - Design
Keywords: ISO 5078, terminology extraction, terminology management, text corpora, candidate terms, term extraction, TF‑IDF, termhood, precision and recall, terminological data.
Frequently Asked Questions
SIST ISO 5078:2025 is a standard published by the Slovenian Institute for Standardization (SIST). Its full title is "Management of terminology resources - Terminology extraction". It specifies methods for extracting candidate terms from text corpora and gives guidance on selecting relevant designations, definitions, concept relations and other terminology-related information.
SIST ISO 5078:2025 is classified under the following ICS (International Classification for Standards) categories: 01.020 - Terminology (principles and coordination); 35.240.30 - IT applications in information, documentation and publishing. The ICS classification helps identify the subject area and facilitates finding related standards.
Standards Content (Sample)
SLOVENIAN STANDARD
1 June 2025
Upravljanje terminoloških virov - Luščenje terminologije
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
This Slovenian standard is identical to: ISO 5078:2025
ICS:
01.020 Terminologija (načela in koordinacija) / Terminology (principles and coordination)
35.240.30 Uporabniške rešitve IT v informatiki, dokumentiranju in založništvu / IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
International Standard
ISO 5078
First edition
2025-02
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
Reference number
© ISO 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and terminology extraction
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
4.3.2 Criteria for selecting texts for a text corpus
4.3.3 Considerations for text corpus creation
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
4.4.2 Extraction method according to the number of languages
4.4.3 Extraction method according to the process
4.4.4 Extraction method according to the underlying technique
4.4.5 Extraction method according to the underlying technology
4.4.6 Extraction method according to the extracted items
4.5 Term extraction output
4.5.1 Filtering candidate term lists
4.5.2 Assessing term eligibility
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
5.3.1 Overview
5.3.2 Starting the terminology extraction workflow
5.3.3 Building or selecting a text corpus
5.3.4 Preprocessing the text corpus
5.3.5 Identifying candidate terms
5.3.6 Selecting relevant terms
5.3.7 Allocating terms to concepts
5.3.8 Identifying concept relations and building concept systems
5.3.9 Completing terminological entries
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO’s adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 3, Management of terminology resources.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations), from
text corpora has become an increasingly important task carried out in a wide variety of different fields.
Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a range of specialists
including language professionals in general, and terminologists in particular, as well as ontology engineers,
and both information and data scientists. Terminology extraction also serves several purposes that go
beyond the compilation of glossaries or the population of terminology databases, including the identification
of concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other fields
such as information retrieval, stands in stark contrast to the rarity of individual documents that provide
definitions, requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology management,
their output becomes even more relevant when it is assessed and validated, using both qualitative and
quantitative approaches and criteria for selecting entities such as relevant terms, definitions and concept
relations. This extracted and then validated terminological data supports the building of high-quality
terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— compilation of text corpora (general principles and types of text corpora);
— methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic,
hybrid and neural);
— criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— tool characteristics.
By objectively specifying these aspects, this document provides a reference framework for improving the
performance of terminology extraction tools and optimizing the use of their output.
International Standard ISO 5078:2025(en)
Management of terminology resources — Terminology
extraction
1 Scope
This document specifies methods for extracting candidate terms from text corpora and gives guidance on
selecting relevant designations, definitions, concept relations and other terminology-related information.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 704, Terminology work — Principles and methods
ISO 1087, Terminology work and terminology science — Vocabulary
ISO 16642, Management of terminology resources — Terminological markup framework
ISO 26162-1, Management of terminology resources — Terminology databases — Part 1: Design
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
annotation
process of adding metadata (3.10) to segments of language data
[SOURCE: ISO 24617-1:2012, 3.2, modified — “information” replaced by “metadata”; “or that information
itself” deleted.]
3.2
bitext
collection of texts (3.24) in two languages that can be considered translations of each other and that are
segmented and aligned
Note 1 to entry: Bitexts play a key role in training, evaluating and improving localization technologies, such as
translation memories, terminology management tools or machine translation engines.
3.3
candidate term
term candidate
provisional term
string of characters (3.5) that has been collected by means of term extraction (3.20) but has not yet been
selected as a relevant term (3.19) to be considered for inclusion in a terminological data (3.22) collection
[SOURCE: ISO 12616-1:2021, 3.18, modified — “text element to be documented in the” replaced by “term to
be considered for inclusion in a”.]
3.4
candidate terminological data
string of characters (3.5) that has been collected by means of terminology extraction (3.23) but has not yet
been selected as relevant terminological data (3.22)
3.5
character
unit of textual information represented by one or more bytes
EXAMPLE Single letter, numeral, punctuation mark, diacritic, symbol, ideograph, space.
[SOURCE: ISO/IEC 14840:1996, 4.10, modified — “textual” added to the definition; example added.]
3.6
collocation
lexically or pragmatically constrained recurrent co-occurrence of at least two lexical units (3.8) which are in
a direct syntactic relation with each other
EXAMPLE “Commit a crime” instead of “do a crime”.
3.7
keyness
quantity proportional to the frequency of a lexical unit (3.8) in a subject-field-specific text corpus (3.25),
relative to a reference corpus (3.15)
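As an illustration of the definition above, keyness can be sketched as a ratio of relative frequencies. The per-million normalization, the additive `smoothing` constant and the function name are assumptions made for this example; the definition itself does not prescribe a formula.

```python
def keyness_ratio(count_focus, size_focus, count_ref, size_ref, smoothing=1.0):
    """Ratio of relative frequencies (per million tokens) in two corpora.

    count_focus / size_focus: occurrences of the lexical unit in, and total
    tokens of, the subject-field-specific corpus.
    count_ref / size_ref: the same figures for the reference corpus.
    smoothing: assumed additive constant that avoids division by zero when
    the unit never occurs in the reference corpus.
    """
    per_million_focus = count_focus / size_focus * 1_000_000
    per_million_ref = count_ref / size_ref * 1_000_000
    return (per_million_focus + smoothing) / (per_million_ref + smoothing)


# Hypothetical counts: 50 hits in a 100 000-token subject corpus,
# 10 hits in a 1 000 000-token reference corpus.
print(keyness_ratio(50, 100_000, 10, 1_000_000))
```

A value well above 1 suggests that the unit is more typical of the subject-field corpus than of the reference corpus; a value near 1 suggests no difference.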
3.8
lexical unit
meaningful element in the lexicon (3.9) of a language
3.9
lexicon
complete set of meaningful elements in a language
3.10
metadata
data that defines and describes other data
[SOURCE: ISO 24531:2013, 4.32]
3.11
n-gram
sequence of n adjacent tokens (3.27)
Note 1 to entry: Tokens that are frequently adjacent can be an indicator of termhood (3.21).
Note 2 to entry: The number of adjacent tokens (n) is usually 2, 3 or 4.
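The following minimal sketch shows how n-grams of the sizes mentioned in Note 2 can be collected and counted. The function names, the whitespace tokenization and the `min_count` threshold are illustrative assumptions, not part of the definition.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every sequence of n adjacent tokens, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(tokens, sizes=(2, 3, 4), min_count=2):
    """Count n-grams of the given sizes and keep those seen at least min_count times."""
    counts = Counter()
    for n in sizes:
        counts.update(ngrams(tokens, n))
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Naive whitespace tokenization, used only to keep the example short.
tokens = "the laser printer and the laser printer cable".split()
print(frequent_ngrams(tokens))
```

In this toy input, only "the laser", "laser printer" and "the laser printer" recur, which illustrates why recurring n-grams still need filtering (for example against stop words) before they count as candidate terms.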
3.12
noise
non-relevant search results
Note 1 to entry: In terminology extraction (3.23), “noise” means non-relevant data in the extraction output.
3.13
precision
ratio of relevant search results to all search results
Note 1 to entry: In terminology extraction (3.23), “precision” means the ratio of relevant candidate terms (3.3) retrieved
to the total of candidate terms retrieved.
Note 2 to entry: Precision and recall (3.14) generally have an inverse relationship; when one increases, the other tends
to decrease.
3.14
recall
ratio of relevant search results to all relevant items in a set that have been or should have been found from a
search query
Note 1 to entry: In terminology extraction (3.23), “recall” means the ratio of relevant candidate terms (3.3) retrieved to the total of relevant candidate terms in a text corpus (3.25).
Note 2 to entry: Recall and precision (3.13) generally have an inverse relationship; when one increases, the other tends
to decrease.
3.15
reference corpus
text corpus (3.25) to which a given text corpus for terminology extraction (3.23) is compared
3.16
relevance
quality of being a successful search result in relation to the search query
3.17
silence
set of relevant search results that have not been found from a search query
Note 1 to entry: In terminology extraction (3.23), “silence” means the set of valid candidate terms (3.3) that are missing
in the extraction results.
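Precision (3.13), recall (3.14), noise (3.12) and silence (3.17) can be computed together once a list of relevant terms is available. The sketch below assumes a manually validated reference list (`relevant`) exists; the function name and the sample terms are hypothetical.

```python
def evaluate_extraction(extracted, relevant):
    """Compare an extraction output with a validated list of relevant terms."""
    extracted, relevant = set(extracted), set(relevant)
    hits = extracted & relevant
    return {
        # Share of the extracted candidates that are relevant.
        "precision": len(hits) / len(extracted) if extracted else 0.0,
        # Share of the relevant terms that were actually extracted.
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        # Non-relevant data in the extraction output.
        "noise": extracted - relevant,
        # Relevant terms missing from the extraction output.
        "silence": relevant - extracted,
    }


report = evaluate_extraction(
    extracted=["laser printer", "the printer", "toner"],
    relevant=["laser printer", "toner", "print head", "fuser"],
)
print(report)
```

Here two of the three extracted candidates are relevant (precision 2/3), two of the four relevant terms were found (recall 0.5), "the printer" is noise, and "print head" and "fuser" are silence.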
3.18
stop word
word that is not taken into account as a candidate term (3.3)
Note 1 to entry: Typical stop words are function words (e.g. prepositions, articles), brand names and words that
do not belong to the special language of the specific subject field.
3.19
term
designation that represents a general concept by linguistic means
EXAMPLE “laser printer”, “planet”, “pacemaker”, “chemical compound”, “¾ time”, “Influenza A virus”, “oil
painting”.
Note 1 to entry: Terms can be partly or wholly verbal.
[SOURCE: ISO 1087:2019, 3.4.2]
3.20
term extraction
identification and excerption of candidate terms (3.3)
Note 1 to entry: Terms (3.19) can include all types of designations, including appellations, proper names and symbols.
3.21
termhood
degree to which a lexical unit (3.8) is recognized as a term (3.19)
EXAMPLE “Mouse” has stronger termhood in computer applications and weaker termhood in general language.
Note 1 to entry: Termhood applies to both simple terms (consisting of a single word) and complex terms (consisting
of more than one word or lexical unit), and to other designations, such as proper names and appellations, as well as
formulas and symbols.
[SOURCE: ISO 26162-3:2023, 3.13, modified — Example revised.]
3.22
terminological data
data related to concepts and their designations
Note 1 to entry: Common terminological data include designations, definitions, contexts, notes to entry, grammatical
labels, subject labels, language identifiers, country identifiers, and source identifiers.
[SOURCE: ISO 1087:2019, 3.6.1]
3.23
terminology extraction
identification and excerption of candidate terminological data (3.4)
3.24
text
content in written form
[SOURCE: ISO 20539:2023, 3.3.1]
3.25
text corpus
collection of natural language data
[SOURCE: ISO 1087:2019, 3.6.4, modified — Admitted term and Note 1 to entry deleted.]
3.26
TF-IDF
term frequency — inverse document frequency
statistical value intended to reflect how important a lexical unit (3.8) is to a document in a text corpus (3.25)
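Several TF-IDF variants exist, and the definition above does not fix one. The sketch below assumes a common variant: term frequency normalized by document length, multiplied by the natural logarithm of the number of documents divided by the number of documents that contain the unit. The function name and sample documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score each lexical unit per document.

    documents: list of token lists. Returns one {unit: score} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each unit occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            unit: (count / len(doc)) * math.log(n_docs / doc_freq[unit])
            for unit, count in counts.items()
        })
    return scores


docs = [
    ["the", "toner", "cartridge", "toner"],
    ["the", "paper", "tray"],
    ["the", "toner", "paper"],
]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A unit that occurs in every document, such as "the" in this sample, receives a score of zero under this variant, whereas a unit concentrated in one document scores highest there.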
3.27
token
individual occurrence of a type (3.29) in a text corpus (3.25)
3.28
tokenization
conversion of text (3.24) into tokens (3.27)
3.29
type
unique sequence of characters (3.5) in a text corpus (3.25)
Note 1 to entry: The number of types is different from the number of occurrences (tokens (3.27)).
Note 2 to entry: While the number of tokens in a text corpus refers to the total number of occurrences, the number of
types refers to the total number of unique occurrences.
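The distinction between tokens (3.27) and types (3.29) can be shown with a small tokenization (3.28) sketch. Lowercasing before counting and the regular expression used here are simplifying assumptions; real tokenizers are language-specific and more elaborate.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"\w+(?:[-']\w+)*", text.lower())


def type_token_counts(text):
    """Return (number of types, number of tokens) for the given text."""
    tokens = tokenize(text)
    return len(set(tokens)), len(tokens)


print(tokenize("The printer prints; the printer stops."))
print(type_token_counts("The printer prints; the printer stops."))
```

The sample sentence contains six tokens but only four types, because "the" and "printer" each occur twice.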
3.30
unithood
degree to which a given sequence of words has sufficient collocational strength to form a stable lexical unit (3.8)
EXAMPLE “Art deco table” has stronger unithood than “modern table”.
Note 1 to entry: Because unithood derives from the collocational relationship of words making up a given string, it
only applies to multi-word terms (3.19).
[SOURCE: ISO 26162-3:2023, 3.15]
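Collocational strength is often approximated with an association measure. The sketch below uses pointwise mutual information (PMI) over adjacent word pairs as one possible proxy for unithood; the choice of PMI, the function name and the toy corpus are assumptions made for illustration, and the definition above does not mandate a particular measure.

```python
import math
from collections import Counter


def pmi(tokens, first, second):
    """Base-2 PMI of the adjacent pair (first, second) in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        # The pair never occurs adjacently.
        return float("-inf")
    p_pair = pair_count / (len(tokens) - 1)
    p_first = unigrams[first] / len(tokens)
    p_second = unigrams[second] / len(tokens)
    return math.log2(p_pair / (p_first * p_second))


tokens = ("art deco table and modern table and "
          "art deco lamp and old table").split()
print(pmi(tokens, "art", "deco"))
print(pmi(tokens, "modern", "table"))
```

In this toy corpus "art deco" scores higher than "modern table", because "table" also combines freely with other modifiers. PMI is known to overrate rare pairs, so a minimum frequency threshold is usually applied alongside it.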
3.31
validated term
candidate term (3.3) which meets specified criteria
3.32
validated terminological data
candidate terminological data (3.4) which meets specified criteria
3.33
vector
quantity having direction as well as magnitude
[SOURCE: ISO 19123-1:2023, 3.1.51, modified — Note 1 to entry deleted.]
3.34
vector space model
statistical model for representing text information as a vector (3.33) of identifiers
Note 1 to entry: Vector space models can be used for information retrieval (IR), natural language processing (NLP) or
text mining tasks in order to identify whether texts (3.24) are similar in meaning.
[SOURCE: Reference [15], modified — “for Information Retrieval, NLP, Text Mining” moved from the
definition to Note 1 to entry; “as a vector of identifiers” added to the definition; Note 1 to entry added.]
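A minimal sketch of a vector space model follows: each text is represented as a vector of token counts, and two texts are compared by the cosine of the angle between their vectors. The bag-of-words representation and the function name are assumptions; count vectors capture lexical overlap only, so they approximate similarity in meaning rather than measure it.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the token-count vectors of two texts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text_1 = "steel extrusion process".split()
text_2 = "steel extrusion press".split()
text_3 = "oil painting technique".split()
print(cosine_similarity(text_1, text_2))
print(cosine_similarity(text_1, text_3))
```

The first pair shares two of three tokens and scores about 0.67; the second pair shares none and scores 0.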
4 Principles and methods
4.1 General
Terminology extraction requires a deep understanding of terminology theory and terminology work. In this
sense, and to achieve high-quality results, the following shall be used:
— established terms and definitions as specified in ISO 1087;
— principles and methods as specified in ISO 704;
— data-modelling criteria as specified in ISO 16642;
— terminology database design principles as specified in ISO 26162-1.
There are various types of text corpora. Selection of corpus type and texts to be included is usually influenced
by factors such as project goal, scope and deadlines.
4.2 Text corpora and terminology extraction
Organizations usually produce textual material relating to their industry, activity and the field in which they
operate. These kinds of texts include, for example, marketing materials, product documentation, internal
memos and bilingual translation memories. Such textual material can contribute to an organization-wide
text corpus that forms the basis for terminology extraction.
The usefulness of candidate terminological data extracted from such a text corpus depends on the context
and aim of the terminology extraction project as well as on the depth or breadth of the subject-field coverage
provided by the text corpus.
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
Terminology extraction begins with the collection of a text corpus, according to the objectives of the project.
There are different kinds of text corpora, specifically:
— a monolingual corpus, consisting of texts taken from the same language;
— a bilingual corpus, consisting of texts taken from two languages;
— a multilingual corpus, consisting of texts taken from more than two languages;
— a parallel corpus, consisting of texts taken from one language aligned with their translations into one or
more other languages;
EXAMPLE A set of annual reports in English aligned segment by segment with the same annual reports
translated into Tagalog.
— a comparable corpus, consisting of one or more sets of texts meeting certain criteria, matched against a
set of texts meeting the same criteria in one or more languages.
NOTE Thus, the texts are not translations of each other, but they are similar in respects other than language,
which can be, for example, subject field or text type. A comparable corpus can be created, for example, from
original articles on steel extrusion processes in French and another set of original articles on steel extrusion
processes in German.
When creating a text corpus, texts should be selected depending on the goal or purpose of the extraction
task by defining and/or selecting criteria the texts must meet to be included in the text corpus. For example,
if the goal is to extract rare words in Elizabethan English, a text that is a German computer manual should
not be included in the text corpus because it does not meet two criteria essential and appropriate to the goal:
the time frame criterion (texts written between 1558 and 1603) and the language criterion (in English).
Or if the goal is to translate the user interface of a software program into another language, creating a text
corpus out of the textual graphical user interface (GUI) elements can be useful for extracting terms. As a
next step, language professionals can find equivalents for these terms to be used in the translation.
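The Elizabethan English example above can be sketched as a filter over document metadata. The metadata fields (`language`, `year`), the sample records and the function name are hypothetical; in practice such metadata may have to be collected or inferred before texts can be filtered.

```python
def meets_criteria(document, language, first_year, last_year):
    """Check the language criterion and the time frame criterion for one document."""
    return (document["language"] == language
            and first_year <= document["year"] <= last_year)


# Hypothetical catalogue of texts considered for the corpus.
candidates = [
    {"title": "Sonnet collection", "language": "en", "year": 1595},
    {"title": "Computer manual", "language": "de", "year": 1998},
    {"title": "Restoration comedy", "language": "en", "year": 1675},
]

corpus = [doc for doc in candidates if meets_criteria(doc, "en", 1558, 1603)]
print([doc["title"] for doc in corpus])
```

The German computer manual fails both criteria, and the later English play fails the time frame criterion, so only the first text is admitted.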
4.3.2 Criteria for selecting texts for a text corpus
Terminology extraction can have a number of different objectives, all of which influence the development of
the text corpus. The following list includes frequent criteria that can be considered when selecting texts for
a text corpus. The list assumes scenarios in which the extracted terminology is used for the same purpose as
the texts in the corpus:
— Content source: Usually, original content is preferable to derived works or content generated by artificial
intelligence applications.
— Language originality: In monolingual extraction scenarios, texts for a given language which have
originally been written in that language are preferable to translated texts. Translated texts can be used,
particularly where original texts are not available (e.g. in cases of languages where very little has been
written in a particular subject field). In bilingual extraction scenarios, however, original texts are usually
aligned with their corresponding translations.
— Locale: When seeking to extract terminology particular to a locale (language plus region), including
texts from only that locale provides the best results.
EXAMPLE 1 Canadian French texts for Canadian French terminology, Swiss French texts for Swiss French
terminology.
— Scope: When seeking to extract terminology applicable to a particular part of a subject field, selecting
texts destined to be used in that subject field part yields the best results.
EXAMPLE 2 When seeking terms relating to oncology, selecting oncology-related texts from a medicine text
corpus over dermatology-related ones will generate more appropriate results.
— Intended audience: Selecting texts to extract terminology that fits the expectations and needs of the
intended users of the extracted terminology will yield better results.
EXAMPLE 3 If the specifications dictate plain language for a general, non-specialist audience, the selection of
texts for the corpus would differ from texts that would be chosen if the audience were skilled professionals.
— Time frame: When seeking to extract terminology of a given time period, selecting texts that have been
created in that period will provide the most relevant results.
— Representativeness: Selecting texts that are relevant to the community of experts of a specific subject
field leads to more relevant results. Texts that are representative of the subject field(s) for which
terminology is to be extracted are preferable to texts that mention the subject field(s) tangentially.
— Authority: Peer-reviewed documents published by a recognized authority are preferable to other
documents. Texts written by subject-field experts are usually preferable to texts created by non-subject-
field experts. Depending on the objectives, however, it can be useful to extend the scope to include other
authors.
— Language register: Selecting texts according to the purpose of the particular communication situation
(e.g. formal language used in laws versus informal language used in text messages) will result in more
relevant data being extracted.
— Document type: When seeking to extract terminology specific to a particular type of document, limiting
the selection to documents that belong to that document type (e.g. web pages, manuals, reference books)
will yield more relevant results.
— File size: Choosing documents that contain a reasonable amount of terminology with regard to the type
of terminology project is preferable. One short file often does not contain enough terminology to be
extracted if a tool only extracts candidate terms that appear at least twice in a text corpus. It is useful to
keep in mind how quickly the terminology extractor can process large files as well as how powerful the
computer running the extraction software is. If the files are too large and the computer or tool too slow,
terminology processing can take more time than has been allotted to a terminology extraction task.
— File format: Digitized documents are preferable to scanned documents. It saves time to select documents
in formats that can be processed by the terminology extraction tool being used (or that can quickly and
easily be converted).
As stated, these criteria assume extraction scenarios that aim at reusing the extracted terminology in
content production situations similar to those of the texts in the text corpus. However, it can sometimes be
necessary to adjust the criteria (e.g. if no digitized documents are available, if the goal is to describe how
terminology evolved over time, or if the extracted terminology is used to document the inappropriate use of
terminology).
When building a text corpus, the formats in which documents are available and the formats the terminology
extraction tool can handle limit the amount of text that can be included in a text corpus. While some
tools have embedded conversion features, for others, file conversion tools help widen the set of possible
documents that can be used by converting files from formats the extraction tool cannot handle into a format
that can be handled by the tool.
In summary, the choice of texts and text types to include will depend on the goals of the terminology
extraction project and the criteria and parameters chosen.
4.3.3 Considerations for text corpus creation
To extract specific terms, creating a corpus of recent and authoritative texts can be useful. For example,
to extract company-specific terms, it can be useful to build a corpus of recent, authoritative texts from the
company's intranet and website.
If the plan is to create a bilingual termbase of those organization-specific terms, creating a corpus of all
original texts for which a translation exists, together with their translations, should be considered.
NOTE When creating a bilingual or multilingual corpus of texts and their translations, to aid the term extraction
process, an alignment tool can be useful for creating translation memory exchange (TMX) files or bitexts. Sometimes
corpus alignment tools align better than the term extraction tool’s built-in alignment algorithm. More effective
alignment improves the results of bilingual term extraction, because the extraction tool will consider the correct
segments for equivalent candidate terms.
To make the terms specific to a subject-field text corpus stand out in the generated candidate term list,
some extractors compare the results of the extraction from the selected text corpus with those extracted
from a reference corpus.
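One widely used statistic for such a comparison is the log-likelihood (LL) score computed from a unit's frequency in both corpora. The sketch below assumes the two-cell form of the statistic commonly used in corpus linguistics; the function name and sample counts are hypothetical, and actual extractors can use other measures.

```python
import math


def log_likelihood(count_focus, size_focus, count_ref, size_ref):
    """Log-likelihood score of a unit's frequency difference between two corpora."""
    total_size = size_focus + size_ref
    total_count = count_focus + count_ref
    # Expected counts if the unit were equally frequent in both corpora.
    expected_focus = size_focus * total_count / total_size
    expected_ref = size_ref * total_count / total_size
    score = 0.0
    if count_focus > 0:
        score += count_focus * math.log(count_focus / expected_focus)
    if count_ref > 0:
        score += count_ref * math.log(count_ref / expected_ref)
    return 2 * score


# Hypothetical counts: 50 hits in a 1 000-token subject corpus,
# 10 hits in a 10 000-token reference corpus.
print(log_likelihood(50, 1_000, 10, 10_000))
# Same relative frequency in both corpora: the score is close to zero.
print(log_likelihood(10, 1_000, 100, 10_000))
```

The score grows with the size of the frequency difference but does not indicate its direction, so an implementation would also compare the relative frequencies to keep only units that are overrepresented in the subject-field corpus.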
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
Figure 1 depicts a range of terminology extraction approaches, each corresponding to a leading criterion,
that are further subdivided into terminology extraction methods.
Figure 1 (image not reproduced). Key to the footnotes in the figure:
a  Degree of termhood, e.g. log likelihood (LL).
b  Degree of association, e.g. pointwise mutual information (PMI) or chi-square.
c  Filtering by grammatical categories or patterns (POS), e.g. noun phrase extraction.
d  Classifiers (e.g. term/no term) learning from annotated data sets.
e  Neural networks, support vector machines (SVM), decision trees.
f  Inferring logics of ontologies (description logic, first order logic).
g  Combinations or sequences of the other methods.
Figure 1 — Classification of terminology extraction approaches
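Filtering by part-of-speech (POS) patterns, mentioned in the key to Figure 1, can be sketched as follows. The example assumes that the text has already been tagged by an external POS tagger using Universal POS labels (ADJ, NOUN and so on) and that candidate noun phrases are runs of adjectives and nouns ending in a noun; the function name, the pattern and the `max_len` limit are illustrative choices.

```python
def noun_phrase_candidates(tagged_tokens, max_len=4):
    """Extract maximal ADJ/NOUN runs that end in a NOUN from (word, tag) pairs."""
    candidates = []
    run = []
    # A sentinel entry flushes the last run at the end of the input.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # Drop trailing adjectives so that the phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run and len(run) <= max_len:
            candidates.append(" ".join(w for w, _ in run))
        run = []
    return candidates


tagged = [
    ("the", "DET"), ("chemical", "ADJ"), ("compound", "NOUN"),
    ("reacts", "VERB"), ("with", "ADP"), ("oil", "NOUN"),
    ("painting", "NOUN"), ("quickly", "ADV"),
]
print(noun_phrase_candidates(tagged))
```

For the tagged sample this yields "chemical compound" and "oil painting". A purely linguistic filter of this kind ignores frequency, which is one reason it is often combined with statistical measures in hybrid methods.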
Approaches to terminology extraction can be structured according to:
a) number of languages;
b) process;
c) technique;
d) technology;
e) extracted item.
Within the chosen approach, several terminology extraction methods can be applied, for example:
— statistical;
— linguistic;
— machine-learning-based;
— logic-based;
— rule-based.
Terminology extraction approaches and methods are detailed in 4.4.2 to 4.4.6.5.
4.4.2 Extraction method according to the number of languages
4.4.2.1 General
Extractors can focus on one language, on two languages or on more than two.
4.4.2.2 Monolingual
Some extractors focus on only one language. Others can process a variety of languages, but still only one
at a time.
4.4.2.3 Bilingual
Bilingual terminology extraction tools can extract terms from a source text and a target text, one pair of
languages at a time. These texts are compiled in a text corpus. Bitexts are ideal for mining by bilingual
terminology extraction tools. Few tools can handle comparable corpora (in which original language texts
on the same topic are collected in both languages). A bilingual extractor can simply present equivalent text
segments for the user to consult to locate an equivalent term, or even use an algorithm to propose a possible
equivalent for validation.
4.4.2.4 Multilingual
Parallel texts in more than two languages are rarely available for terminology extraction. Therefore,
constructing multilingual terminology resources frequently depends on the compilation of multiple bilingual
terminology extraction outputs.
4.4.3 Extraction method according to the process
4.4.3.1 Manual terminology extraction
The most basic form of manual terminology extraction involves highlighting relevant terminological data in
a text and manually copying it into a list or termbase.
4.4.3.2 Automated terminology extraction
Automated terminology extraction generates a list of candidate terminological data for the user to validate
later, but does not allow validation during this phase.
4.4.3.3 Semi-automated terminology extraction
With semi-automated terminology extraction, the user can validate candidate terminological data within
the tool, before export. Validation at a pre-export stage reduces the noise in the resulting exported data.
4.4.4 Extraction method according to the underlying technique
4.4.4.1 Application of approaches
Most of the approaches described in this subclause apply primarily to term extraction unless otherwise stated.
4.4.4.2 Statistical terminology extraction
4.4.4.2.1 Overview
Statistical term extraction counts occurrences of candidate terms. The results are used to assess aspects,
including:
— the frequency of terms in a text corpus;
— their degree of relevance to a given subject field (termhood);
— the degree of association of words in an n-gram (unithood).
Statistical term extraction relies on the relative frequency of tokens in a text corpus or their distribution
within this text corpus.
Specific processing steps of statistical term extraction include:
— identification of relevant parameters such as token frequency, correlation and size of text corpus/
reference corpus;
— calculation of metrics based on selected formulae.
Depending on the type of targeted statistical data, standard tools or customized programs are used.
4.4.4.2.2 Frequency: counting occurrences
Frequency statistics are the basis for many terminology decisions and for more complex terminology
extraction methods. These statistics usually count types rather than tokens and include multiword terms.
To obtain useful results using this method of term extraction, the text corpus can be filtered to exclude
certain high-frequency words (the stop words containing little semantic information, such as articles,
conjunctions, prepositions and auxiliary verbs). Although it can be efficient to exclude very-low-frequency
words which do not reach a defined threshold for relevancy, there is a risk that some relevant terms will be
overlooked. Depending on the project goal, an approach other than merely statistical can be considered.
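The frequency-based filtering described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the tokenizer, the stop-word list and the names `count_candidates`, `max_n` and `min_freq` are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are language-specific).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "is", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def count_candidates(text, max_n=2, min_freq=2):
    """Count n-grams (n = 1..max_n) that neither start nor end with a stop word."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    # Dropping low-frequency items reduces noise but can increase silence.
    return {g: c for g, c in counts.items() if c >= min_freq}

sample = ("The laser printer is in the office. "
          "A laser printer prints faster than an inkjet printer.")
candidates = count_candidates(sample)
```

With `min_freq=2`, "inkjet printer" is dropped because it occurs only once, which illustrates the risk of overlooking relevant terms when a frequency threshold is applied.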
Lexical units (single-word or multiword lexical units) that are used as terms in subject-field-specific texts
commonly occur more frequently in subject-field-specific text corpora than in general-language usage.
Consequently, identifying candidate terms involves comparing the frequency of lexical units in a subject-
field-specific text corpus to reference corpora (a general-language text corpus or another subject-field-
specific text corpus). If a lexical unit’s frequency in the subject-field-specific text corpus is significantly
higher than that in the reference corpus, it is very likely to be a term.
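The comparison with a reference corpus can be sketched as a ratio of relative frequencies. This is only one simple way to make the comparison; the function name `relative_frequency_ratio`, the `smoothing` parameter and the toy token lists are assumptions introduced for illustration.

```python
from collections import Counter

def relative_frequency_ratio(target_tokens, reference_tokens, smoothing=1.0):
    """Return {lexical unit: relative frequency in target / relative frequency in reference}."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = len(target_tokens)
    n_reference = len(reference_tokens)
    ratios = {}
    for unit, freq in target.items():
        rel_target = freq / n_target
        # Smoothing (an assumption) avoids division by zero for units
        # that never occur in the reference corpus.
        rel_reference = (reference[unit] + smoothing) / (n_reference + smoothing)
        ratios[unit] = rel_target / rel_reference
    return ratios

# Toy corpora, invented for the example.
target_tokens = "the stent is placed in the artery and the stent expands".split()
reference_tokens = "the cat sat on the mat and the dog sat in the sun".split()

ratios = relative_frequency_ratio(target_tokens, reference_tokens)
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

Units with a ratio well above 1 (here "stent") are the candidate terms; units with a ratio near or below 1 (here "the") behave as general-language vocabulary.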
4.4.4.2.3 Termhood: relevance of terms
Termhood comprises the intersection among the following (see Figure 2):
— unithood (see 4.4.4.2.4);
— usage in the text corpus on which terminology extraction is applied;
— purpose of the terminological data collection.
NOTE Source: ISO 26162-3:2023, Figure 1.
Figure 2 — Termhood
Apart from frequency, the dispersion of candidate terms across various types of textual materials available
within an organization’s text corpus plays a crucial role for termhood.
The convergence of high statistical frequency and dispersion throughout a text corpus is often expressed as
keyness and is viewed as an indicator that a lexical unit can indeed be considered a term in the subject field
in question.
Different types of metrics are used to calculate the termhood of candidate terms, e.g. the log likelihood ratio,
the chi-square ratio or the Jaccard coefficient (see Reference [17]).
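As an illustration of one such metric, the sketch below computes a log-likelihood statistic for a single lexical unit observed in a subject-field text corpus and in a reference corpus. It uses a two-term formulation that is common in corpus linguistics; this document does not prescribe a specific formula, and the counts in the example are hypothetical.

```python
import math

def log_likelihood(freq_target, size_target, freq_reference, size_reference):
    """Two-term log-likelihood statistic comparing one unit's frequency in two corpora."""
    total_freq = freq_target + freq_reference
    total_size = size_target + size_reference
    # Expected frequencies if the unit were equally common in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_reference = size_reference * total_freq / total_size
    ll = 0.0
    # A zero observed frequency contributes nothing (0 * ln 0 is treated as 0).
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_reference > 0:
        ll += freq_reference * math.log(freq_reference / expected_reference)
    return 2.0 * ll

# Hypothetical counts: 120 occurrences in a 50,000-token subject-field corpus
# versus 30 occurrences in a 1,000,000-token reference corpus.
score_specific = log_likelihood(120, 50_000, 30, 1_000_000)

# A unit with the same relative frequency in both corpora scores 0.
score_neutral = log_likelihood(50, 50_000, 1_000, 1_000_000)
```

The statistic only measures how strongly the two frequencies differ; whether the unit is over-represented or under-represented in the subject-field corpus has to be checked separately by comparing the relative frequencies.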
4.4.4.2.4 Unithood: degree of association of words
Multiword lexical units that show relatively stable syntagmatic structures and recur with collocational
frequency in an organization’s text corpus feature a high degree of unithood. They are therefore used in
communicative structures demanding consistency and qualify as terms when extracted for a specified
purpose, see Figure 2.
A multiword lexical unit which meets the criteria for unithood functions as a term if it designates an
identifiable concept in the textual and operational context in question.
The metrics used to determine the unithood of a group of words measure the significance of association
between term elements. They compare the frequency of the occurrence of a phrase with the frequency of the
occurrence of its components.
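Pointwise mutual information (PMI) is one association measure of this kind: it compares the observed probability of a word pair with the probability expected if its components were independent. The sketch below is a minimal illustration for adjacent word pairs; the function name `pmi_scores` and the toy token list are assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), freq in bigrams.items():
        p_pair = freq / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy token sequence, invented for the example.
tokens = ("heat pump installed heat pump serviced "
          "new pump ordered new valve ordered").split()
scores = pmi_scores(tokens)
```

In this toy data "heat pump" scores higher than "new pump". PMI tends to favour rare pairs, so in practice it is often combined with a minimum frequency threshold.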
A wide range of term extraction procedures rely on “mutual information”, which is a concept adopted from
information theory that involves measuring the statistical interdependency of two words. Since the words that
form a given word pair can occur at a certain distance from each other, a suitable window is defined (e.g.
four words each to the left and right of the semantic head of the compound), and the mean and variance of
the word distances are calculated. The mean is calculated from all the distances between the two words in
the window range, while the variance measures how far the individual word distances deviate from the mean
(see Reference [17], pp. 158–159).
Collocations are characterized by a very low variance of the word distances, while a high
variance indicates the independence of the two words from each other.
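The window-based calculation can be sketched as follows. The example works sentence by sentence and uses signed offsets (negative when the second word precedes the first); both choices, as well as the function names and the toy sentences, are assumptions made for illustration.

```python
from statistics import mean, pvariance

def offsets(sentences, word_a, word_b, window=4):
    """Signed distances from each word_a to every word_b within +/- window tokens."""
    found = []
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != word_a:
                continue
            for j, other in enumerate(tokens):
                if other == word_b and j != i and abs(j - i) <= window:
                    found.append(j - i)
    return found

def distance_profile(sentences, word_a, word_b, window=4):
    """Return (mean, variance) of the offsets, or None if the pair never co-occurs."""
    found = offsets(sentences, word_a, word_b, window)
    if not found:
        return None
    return mean(found), pvariance(found)

# Toy sentences, invented for the example.
sentences = [s.split() for s in (
    "they commit a crime",
    "he did commit a crime",
    "they commit a serious crime",
    "he saw the door",
    "the door he saw",
    "saw him at the door",
)]

fixed = distance_profile(sentences, "commit", "crime")
loose = distance_profile(sentences, "saw", "door")
```

The pair "commit ... crime" shows a small variance because the second word almost always follows at the same distance, whereas "saw ... door" shows a much larger variance, which suggests that the two words combine freely.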
4.4.4.3 Linguistic terminology extraction
Linguistic terminology extraction methods are specific to individual languages and more precise than
statistical ones.
Linguistic terminology extraction relies on grammatical and morphological features of text corpus content,
especially lexical units. In languages for special purposes (e.g. in Indo-European languages), terms often
occur as single nouns. In Germanic, Nordic and Slavic languages, they form compound nouns or multiword
units consisting of adjectives and nouns. This pattern varies in Romance languages, where prepositional
phrases often replace noun-adjective strings. Character-based (e.g. Chinese) or root-based (Arabic)
languages have their own patterns for combining characters and roots, with the result that the
relationship of terms to “words” per se varies from language to language. In a language like Turkish,
certain morphemes and affixes are used to specify the meaning of words within the context of language for
special purposes. Furthermore, specific prefixes and suffixes are prevalent in some subject fields, and
different languages frequently have their own patterns for adding functional and elisional elements to form
compounds. The absence of white space in some languages further complicates the task of automatic term
recognition. Solutions that work for one language do not always work for another, and linguists in some
languages (e.g. Chinese) have created extensive resources to facilitate text corpus management and term
recognition.
For details on term formation and designation patterns, see ISO 704:2022, Annexes B and C.
Linguistic terminology extraction relies on grammatical features of text corpus content, especially lexical
units. In languages for special purposes, terms often occur as single nouns, compound nouns or multiword
units consisting of adjectives and nouns. Furthermore, specific suffixes are prevalent in some subject fields
(e.g. the suffix -itis in medical contexts).
Before commencing linguistic terminology extraction, it is necessary to perform a linguistic analysis of
the text corpus content. Morphological analysis is required for identifying subject-field-specific suffixes,
whereas part-of-speech (POS) tagging annotates the respective word class to each token and thus represents
the prerequisite for finding syntactic patterns underlying term formation.
Processing steps include:
— morphosyntactic analysis (including POS tagging) of the text corpus;
— defining patterns for relevant terms;
— filtering candidate terms based on these patterns.
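The processing steps above can be sketched for the pattern-filtering stage. The example assumes that POS tagging has already been carried out by a separate tool; the tagged input, the simplified tag set and the pattern (a run of adjectives and nouns ending in a noun) are hypothetical and would need to be adapted to the language of the project.

```python
import re

# Hypothetical output of a POS tagger: (token, tag) pairs with a simplified tag set.
tagged = [
    ("the", "DET"), ("optical", "ADJ"), ("fibre", "NOUN"), ("cable", "NOUN"),
    ("carries", "VERB"), ("a", "DET"), ("digital", "ADJ"), ("signal", "NOUN"),
    ("quickly", "ADV"),
]

# One letter per relevant tag so that a regular expression can describe the pattern.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}

# Illustrative pattern: any run of adjectives and nouns that ends in a noun.
TERM_PATTERN = re.compile(r"[AN]*N")

def extract_by_pattern(tagged_tokens):
    """Return the longest token sequences whose tags match TERM_PATTERN."""
    # Tags outside TAG_CODES are mapped to "x", which never matches the pattern.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    candidates = []
    for match in TERM_PATTERN.finditer(codes):
        words = [tok for tok, _ in tagged_tokens[match.start():match.end()]]
        candidates.append(" ".join(words))
    return candidates

candidates = extract_by_pattern(tagged)
```

Because the match is greedy, only the longest sequence is returned ("optical fibre cable"); nested candidates such as "fibre cable" would require an additional step.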
Since a linguistics-based approach can involve many aspects of the language, either a set of dedicated tools
or custom programs are used.
The programs used should have features applicable to the language(s) of the terminology extraction project.
For example, POS taggers use models learned from annotated data in a specific language, i.e. a POS tagger
will use a different model for English and for French.
...
International Standard
ISO 5078
First edition
2025-02
Management of terminology resources — Terminology extraction
Gestion des ressources terminologiques — Extraction de terminologie
Reference number
© ISO 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Principles and methods
4.1 General
4.2 Text corpora and terminology extraction
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
4.3.2 Criteria for selecting texts for a text corpus
4.3.3 Considerations for text corpus creation
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
4.4.2 Extraction method according to the number of languages
4.4.3 Extraction method according to the process
4.4.4 Extraction method according to the underlying technique
4.4.5 Extraction method according to the underlying technology
4.4.6 Extraction method according to the extracted items
4.5 Term extraction output
4.5.1 Filtering candidate term lists
4.5.2 Assessing term eligibility
4.6 Uses for terminology extraction output
5 Implementation of terminology extraction
5.1 General
5.2 Initial considerations for terminology extraction
5.3 Terminology extraction workflow
5.3.1 Overview
5.3.2 Starting the terminology extraction workflow
5.3.3 Building or selecting a text corpus
5.3.4 Preprocessing the text corpus
5.3.5 Identifying candidate terms
5.3.6 Selecting relevant terms
5.3.7 Allocating terms to concepts
5.3.8 Identifying concept relations and building concept systems
5.3.9 Completing terminological entries
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee
has been established has the right to be represented on that committee. International organizations,
governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely
with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of ISO document should be noted. This document was drafted in accordance with the editorial rules of the
ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a)
patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent
rights in respect thereof. As of the date of publication of this document, ISO had not received notice of (a)
patent(s) which may be required to implement this document. However, implementers are cautioned that
this may not represent the latest information, which may be obtained from the patent database available at
www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO’s adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 37, Language and terminology, Subcommittee
SC 3, Management of terminology resources.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
Over the past decades, extracting relevant designations, mostly terms (i.e. linguistic designations), from
text corpora has become an increasingly important task carried out in a wide variety of different fields.
Terminology extraction, which goes beyond mere extraction of terms, is undertaken by a range of specialists
including language professionals in general, and terminologists in particular, as well as ontology engineers,
and both information and data scientists. Terminology extraction also serves several purposes that go
beyond the compilation of glossaries or the population of terminology databases, including the identification
of concepts and of concept relations for building ontologies.
The widespread use of terminology extraction tools in terminology management, as well as in other fields
such as information retrieval, stands in stark contrast to the rarity of individual documents that provide
definitions, requirements or best practices.
However, although terminology extraction tools save time, money and effort in terminology management,
their output becomes even more relevant when it is assessed and validated, using both qualitative and
quantitative approaches and criteria for selecting entities such as relevant terms, definitions and concept
relations. This extracted and then validated terminological data supports the building of high-quality
terminology resources and, thus, terminology management.
This document covers the following aspects that form the core of terminology extraction methods and
practices in general:
— compilation of text corpora (general principles and types of text corpora);
— methods and criteria employed by mainstream terminology extraction tools (statistical, linguistic,
hybrid and neural);
— criteria for selecting terms (filtering candidate term lists and assessment of term eligibility);
— tool characteristics.
By objectively specifying these aspects, this document provides a reference framework for improving the
performance of terminology extraction tools and optimizing the use of their output.
International Standard ISO 5078:2025(en)
Management of terminology resources — Terminology
extraction
1 Scope
This document specifies methods for extracting candidate terms from text corpora and gives guidance on
selecting relevant designations, definitions, concept relations and other terminology-related information.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 704, Terminology work — Principles and methods
ISO 1087, Terminology work and terminology science — Vocabulary
ISO 16642, Management of terminology resources — Terminological markup framework
ISO 26162-1, Management of terminology resources — Terminology databases — Part 1: Design
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
annotation
process of adding metadata (3.10) to segments of language data
[SOURCE: ISO 24617-1:2012, 3.2, modified — “information” replaced by “metadata”; “or that information
itself” deleted.]
3.2
bitext
collection of texts (3.24) in two languages that can be considered translations of each other and that are
segmented and aligned
Note 1 to entry: Bitexts play a key role in training, evaluating and improving localization technologies, such as
translation memories, terminology management tools or machine translation engines.
3.3
candidate term
term candidate
provisional term
string of characters (3.5) that has been collected by means of term extraction (3.20) but has not yet been
selected as a relevant term (3.19) to be considered for inclusion in a terminological data (3.22) collection
[SOURCE: ISO 12616-1:2021, 3.18, modified — “text element to be documented in the” replaced by “term to
be considered for inclusion in a”.]
3.4
candidate terminological data
string of characters (3.5) that has been collected by means of terminology extraction (3.23) but has not yet
been selected as relevant terminological data (3.22)
3.5
character
unit of textual information represented by one or more bytes
EXAMPLE Single letter, numeral, punctuation mark, diacritic, symbol, ideograph, space.
[SOURCE: ISO/IEC 14840:1996, 4.10, modified — “textual” added to the definition; example added.]
3.6
collocation
lexically or pragmatically constrained recurrent cooccurrence of at least two lexical units (3.8) which are in
a direct syntactic relation with each other
EXAMPLE “Commit a crime” instead of “do a crime”.
3.7
keyness
quantity proportional to the frequency of a lexical unit (3.8) in a subject-field-specific text corpus (3.25),
relative to a reference corpus (3.15)
3.8
lexical unit
meaningful element in the lexicon (3.9) of a language
3.9
lexicon
complete set of meaningful elements in a language
3.10
metadata
data that defines and describes other data
[SOURCE: ISO 24531:2013, 4.32]
3.11
n-gram
sequence of n adjacent tokens (3.27)
Note 1 to entry: Frequently adjacent tokens can be an indicator for termhood (3.21).
Note 2 to entry: The number of adjacent tokens (n) is usually 2, 3 or 4.
3.12
noise
non-relevant search results
Note 1 to entry: In terminology extraction (3.23), “noise” means non-relevant data in the extraction output.
3.13
precision
ratio of relevant search results to all search results
Note 1 to entry: In terminology extraction (3.23), “precision” means the ratio of relevant candidate terms (3.3) retrieved
to the total of candidate terms retrieved.
Note 2 to entry: Precision and recall (3.14) generally have an inverse relationship; when one increases, the other tends
to decrease.
3.14
recall
ratio of relevant search results to all relevant items in a set that have been or should have been found from a
search query
Note 1 to entry: In terminology extraction (3.23), “recall” means the ratio of relevant candidate terms (3.3) retrieved
to the total of relevant candidate terms in a text corpus (3.25).
Note 2 to entry: Recall and precision (3.13) generally have an inverse relationship; when one increases, the other tends
to decrease.
3.15
reference corpus
text corpus (3.25) to which a given text corpus for terminology extraction (3.23) is compared
3.16
relevance
quality of being a successful search result in relation to the search query
3.17
silence
set of relevant search results that have not been found from a search query
Note 1 to entry: In terminology extraction (3.23), “silence” means the set of valid candidate terms (3.3) that are missing
in the extraction results.
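The relationship between precision (3.13), recall (3.14), noise (3.12) and silence (3.17) can be illustrated with a short sketch. It assumes that a reference list of relevant terms is available for comparison; the function name `evaluate_extraction` and the sample lists are hypothetical.

```python
def evaluate_extraction(extracted, relevant):
    """Compare extractor output with a reference list of relevant terms."""
    extracted, relevant = set(extracted), set(relevant)
    hits = extracted & relevant
    return {
        # Share of the output that is relevant.
        "precision": len(hits) / len(extracted) if extracted else 0.0,
        # Share of the relevant terms that was found.
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        # Non-relevant items in the extraction output.
        "noise": sorted(extracted - relevant),
        # Relevant items missing from the extraction output.
        "silence": sorted(relevant - extracted),
    }

# Hypothetical extractor output and reference list.
extracted = ["laser printer", "toner cartridge", "next page", "click here"]
relevant = ["laser printer", "toner cartridge", "print queue"]

report = evaluate_extraction(extracted, relevant)
```

In this example, two of the four extracted items are relevant (precision 0.5) and two of the three relevant terms were found (recall of about 0.67).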
3.18
stop word
word that is not taken into account as a candidate term (3.3)
Note 1 to entry: Typical stop words are function words (e.g. prepositions, articles), brand names and general-language
words that are not specific to the subject field.
3.19
term
designation that represents a general concept by linguistic means
EXAMPLE “laser printer”, “planet”, “pacemaker”, “chemical compound”, “¾ time”, “Influenza A virus”, “oil
painting”.
Note 1 to entry: Terms can be partly or wholly verbal.
[SOURCE: ISO 1087:2019, 3.4.2]
3.20
term extraction
identification and excerption of candidate terms (3.3)
Note 1 to entry: Terms (3.19) can include all types of designations, including appellations, proper names and symbols.
3.21
termhood
degree to which a lexical unit (3.8) is recognized as a term (3.19)
EXAMPLE “Mouse” has stronger termhood in computer applications and weaker termhood in general language.
Note 1 to entry: Termhood applies to both simple terms (consisting of a single word) and complex terms (consisting
of more than one word or lexical unit), and to other designations, such as proper names and appellations, as well as
formulas and symbols.
[SOURCE: ISO 26162-3:2023, 3.13, modified — Example revised.]
3.22
terminological data
data related to concepts and their designations
Note 1 to entry: Common terminological data include designations, definitions, contexts, notes to entry, grammatical
labels, subject labels, language identifiers, country identifiers, and source identifiers.
[SOURCE: ISO 1087:2019, 3.6.1]
3.23
terminology extraction
identification and excerption of candidate terminological data (3.4)
3.24
text
content in written form
[SOURCE: ISO 20539:2023, 3.3.1]
3.25
text corpus
collection of natural language data
[SOURCE: ISO 1087:2019, 3.6.4, modified — Admitted term and Note 1 to entry deleted.]
3.26
TF-IDF
term frequency — inverse document frequency
statistical value intended to reflect how important a lexical unit (3.8) is to a document in a text corpus (3.25)
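The definition does not fix a formula. The sketch below uses one common variant, relative term frequency multiplied by the natural logarithm of the inverse document frequency; other weightings exist, and the function name `tf_idf` and the sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {unit: score} dict per document, using tf * ln(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each unit occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            unit: (freq / len(doc)) * math.log(n_docs / doc_freq[unit])
            for unit, freq in counts.items()
        })
    return scores

# Toy documents, already tokenized; invented for the example.
documents = [
    "the pump moves the coolant".split(),
    "the valve controls the flow".split(),
    "the pump and the valve".split(),
]
scores = tf_idf(documents)
```

A unit that occurs in every document (here "the") receives a score of 0, whereas a unit confined to one document (here "coolant") receives the highest score in that document.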
3.27
token
individual occurrence of a type (3.29) in a text corpus (3.25)
3.28
tokenization
conversion of text (3.24) into tokens (3.27)
3.29
type
unique sequence of characters (3.5) in a text corpus (3.25)
Note 1 to entry: The number of types is different from the number of occurrences (tokens (3.27)).
Note 2 to entry: While the number of tokens in a text corpus refers to the total number of occurrences, the number of
types refers to the total number of unique occurrences.
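The distinction between tokens (3.27) and types (3.29) can be illustrated with a short sketch. The tokenization rule used here (lowercasing and splitting on word characters) is a deliberately simple assumption; real tokenizers are language-specific.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"\w+", text.lower())

tokens = tokenize("The printer is offline. Restart the printer.")
types = set(tokens)

token_count = len(tokens)  # total number of occurrences
type_count = len(types)    # number of unique character sequences
```

The sample contains seven tokens but only five types, because "the" and "printer" each occur twice.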
3.30
unithood
degree to which a given sequence of words has sufficient collocational strength to form a stable lexical unit (3.8)
EXAMPLE “Art deco table” has stronger unithood than “modern table”.
Note 1 to entry: Because unithood derives from the collocational relationship of words making up a given string, it
only applies to multi-word terms (3.19).
[SOURCE: ISO 26162-3:2023, 3.15]
3.31
validated term
candidate term (3.3) which meets specified criteria
3.32
validated terminological data
candidate terminological data (3.4) which meets specified criteria
3.33
vector
quantity having direction as well as magnitude
[SOURCE: ISO 19123-1:2023, 3.1.51, modified — Note 1 to entry deleted.]
3.34
vector space model
statistical model for representing text information as a vector (3.33) of identifiers
Note 1 to entry: Vector space models can be used for information retrieval (IR), natural language processing (NLP) or
text mining tasks in order to identify whether texts (3.24) are similar in meaning.
[SOURCE: Reference [15], modified — “for Information Retrieval, NLP, Text Mining” moved from the
definition to Note 1 to entry; “as a vector of identifiers” added to the definition; Note 1 to entry added.]
4 Principles and methods
4.1 General
Terminology extraction requires a deep understanding of terminology theory and terminology work. In this
sense, and to achieve high-quality results, the following shall be used:
— established terms and definitions as specified in ISO 1087;
— principles and methods as specified in ISO 704;
— data-modelling criteria as specified in ISO 16642;
— terminology database design principles as specified in ISO 26162-1.
There are various types of text corpora. Selection of corpus type and texts to be included is usually influenced
by factors such as project goal, scope and deadlines.
4.2 Text corpora and terminology extraction
Organizations usually produce textual material relating to their industry, activity and the field in which they
operate. These kinds of texts include, for example, marketing materials, product documentation, internal
memos and bilingual translation memories. Such textual material can contribute to an organization-wide
text corpus that forms the basis for terminology extraction.
The usefulness of candidate terminological data extracted from such a text corpus depends on the context
and aim of the terminology extraction project as well as on the depth or breadth of the subject-field coverage
provided by the text corpus.
4.3 Compilation of text corpora
4.3.1 Text corpora used for terminology extraction
Terminology extraction begins with the collection of a text corpus, according to the objectives of the project.
There are differing kinds of text corpora, specifically:
— a monolingual corpus, consisting of texts taken from the same language;
— a bilingual corpus, consisting of texts taken from two languages;
— a multilingual corpus, consisting of texts taken from more than two languages;
— a parallel corpus, consisting of texts taken from one language aligned with their translations into one or
more other languages;
EXAMPLE A set of annual reports in English aligned segment by segment with the same annual reports
translated into Tagalog.
— a comparable corpus, consisting of one or more sets of texts meeting certain criteria, matched against a
set of texts meeting the same criteria in one or more languages.
NOTE Thus, the texts are not translations of each other, but they are similar in respects other than language,
which can be, for example, subject field or text type. A comparable corpus can be created, for example, from
original articles on steel extrusion processes in French and another set of original articles on steel extrusion
processes in German.
When creating a text corpus, texts should be selected depending on the goal or purpose of the extraction
task by defining and/or selecting criteria the texts must meet to be included in the text corpus. For example,
if the goal is to extract rare words in Elizabethan English, a text that is a German computer manual should
not be included in the text corpus because it does not meet two criteria essential and appropriate to the goal:
the time frame criterion (texts written between 1558 and 1603) and the language criterion (in English).
Or if the goal is to translate the user interface of a software program into another language, creating a text
corpus out of the textual graphical user interface (GUI) elements can be useful for extracting terms. As a
next step, language professionals can find equivalents for these terms to be used in the translation.
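The Elizabethan English example can be sketched as a metadata filter. The record structure, the field names and the sample titles are hypothetical; the sketch only shows how the language criterion and the time frame criterion can be applied together.

```python
from dataclasses import dataclass

@dataclass
class TextRecord:
    """Minimal metadata for one text (field names are illustrative)."""
    title: str
    language: str  # language code, e.g. "en" or "de"
    year: int      # year in which the text was written

def select_texts(records, language, first_year, last_year):
    """Keep only texts that meet both the language and the time frame criterion."""
    return [
        r for r in records
        if r.language == language and first_year <= r.year <= last_year
    ]

# Hypothetical records.
records = [
    TextRecord("Pamphlet on husbandry", "en", 1580),
    TextRecord("Computer manual", "de", 1998),
    TextRecord("Sermon collection", "en", 1650),
]

corpus = select_texts(records, language="en", first_year=1558, last_year=1603)
```

Only the first record is retained: the computer manual fails both criteria, and the sermon collection fails the time frame criterion.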
4.3.2 Criteria for selecting texts for a text corpus
Terminology extraction can have a number of different objectives, all of which influence the development of
the text corpus. The following list includes frequent criteria that can be considered when selecting texts for
a text corpus. This list assumes scenarios in which the extracted terminology is used for the same purpose
as the texts in the corpus:
— Content source: Usually, original content is preferable to derived works or content generated by artificial
intelligence applications.
— Language originality: In monolingual extraction scenarios, texts for a given language which have
originally been written in that language are preferable to translated texts. Translated texts can be used,
particularly where original texts are not available (e.g. in cases of languages where very little has been
written in a particular subject field). In bilingual extraction scenarios, however, original texts are usually
aligned with their corresponding translations.
— Locale: When seeking to extract terminology particular to a locale (language plus region), including
texts from only that locale provides the best results.
EXAMPLE 1 Canadian French texts for Canadian French terminology, Swiss French texts for Swiss French
terminology.
— Scope: When seeking to extract terminology applicable to a particular part of a subject field, selecting
texts destined to be used in that subject field part yields the best results.
EXAMPLE 2 When seeking terms relating to oncology, selecting oncology-related texts from a medicine text
corpus over dermatology-related ones will generate more appropriate results.
— Intended audience: Selecting texts to extract terminology that fits the expectations and needs of the
intended users of the extracted terminology will yield better results.
EXAMPLE 3 If the specifications dictate plain language for a general, non-specialist audience, the selection of
texts for the corpus would differ from texts that would be chosen if the audience were skilled professionals.
— Time frame: When seeking to extract terminology of a given time period, selecting texts that have been
created in that period will provide the most relevant results.
— Representativeness: Selecting texts that are relevant to the community of experts of a specific subject
field leads to more relevant results. Texts that are representative of the subject field(s) for which
terminology is to be extracted are preferable to texts that mention the subject field(s) tangentially.
— Authority: Peer-reviewed documents published by a recognized authority are preferable to other
documents. Texts written by subject-field experts are usually preferable to texts created by non-subject-
field experts. Depending on the objectives, however, it can be useful to extend the scope to include other
authors.
— Language register: Selecting texts according to the purpose of the particular communication situation
(e.g. formal language used in laws versus informal language used in text messages) will result in more
relevant data being extracted.
— Document type: When seeking to extract terminology specific to a particular type of document, limiting
the selection to documents that belong to that document type (e.g. web pages, manuals, reference books)
will yield more relevant results.
— File size: Choosing documents that contain a reasonable amount of terminology with regard to the type
of terminology project is preferable. One short file often does not contain enough terminology to be
extracted if a tool only extracts candidate terms that appear at least twice in a text corpus. It is useful to
keep in mind how quickly the terminology extractor can process large files as well as how powerful the
computer running the extraction software is. If the files are too large and the computer or tool too slow,
terminology processing can take more time than has been allotted to a terminology extraction task.
— File format: Digitized documents are preferable to scanned documents. It saves time to select documents
in formats that can be processed by the terminology extraction tool being used (or that can quickly and
easily be converted).
As stated, the aforementioned criteria assume extraction scenarios that aim to reuse the extracted
terminology in content production situations similar to those in which the texts in the text corpus were
produced. However, sometimes it can be necessary to adjust criteria (e.g. if no digitized documents are
available, if the goal is to describe how terminology evolved over time, if the extracted terminology is used
to depict the inappropriate use of terminology).
When building a text corpus, the formats in which documents are available and the formats the terminology
extraction tool can handle limit the amount of text that can be included in a text corpus. While some
tools have embedded conversion features, for others, file conversion tools help widen the set of possible
documents that can be used by converting files from formats the extraction tool cannot handle into a format
that can be handled by the tool.
In summary, the choice of texts and text types to include will depend on the goals of the terminology
extraction project and the criteria and parameters chosen.
4.3.3 Considerations for text corpus creation
To extract specific terms, creating a corpus of recent and authoritative texts can be useful. For example,
to extract company-specific terms, it can be useful to build a corpus of recent, authoritative texts from the
company's intranet and website.
If planning to create a bilingual termbase of those organization-specific terms, then creating a corpus of all
original texts for which there is a translation as well as their translations should be considered.
NOTE When creating a bilingual or multilingual corpus of texts and their translations, to aid the term extraction
process, an alignment tool can be useful for creating translation memory exchange (TMX) files or bitexts. Sometimes
corpus alignment tools align better than the term extraction tool’s built-in alignment algorithm. More effective
alignment improves the results of bilingual term extraction, because the extraction tool will consider the correct
segments for equivalent candidate terms.
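As an illustration of how aligned segments stored in a TMX file can be passed to a bilingual extraction step, the following minimal Python sketch reads segment pairs from a TMX string using only the standard library. The function name read_tmx_segments and the sample content are assumptions made for this illustration and are not part of this document.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical one-unit TMX sample used only for this illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="1.0"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The text corpus is aligned.</seg></tuv>
      <tuv xml:lang="fr"><seg>Le corpus de textes est aligné.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_segments(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the requested language pair."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        # Keep only translation units that contain both languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


pairs = read_tmx_segments(SAMPLE_TMX, "en", "fr")
```

Each returned pair is one aligned segment, so a bilingual extractor only needs to search the target segment of a pair for an equivalent of a candidate term found in the source segment.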
In order to make the terms specific to a subject-field text corpus stand out in the candidate term list it
generates, some extractors will compare the results of the extraction from the selected text corpus with
those extracted from a reference corpus.
4.4 Terminology extraction approaches and methods
4.4.1 Classification of terminology extraction approaches
Figure 1 depicts a range of terminology extraction approaches, each corresponding to a leading criterion,
that are further subdivided into terminology extraction methods.
a Degree of termhood, e.g. log likelihood (LL).
b Degree of association, e.g. pointwise mutual information (PMI) or chi-square.
c Filtering by grammatical categories or patterns (POS), e.g. noun phrase extraction.
d Classifiers (e.g. term/no term) learning from annotated data sets.
e Neural networks, support vector machines (SVM), decision trees.
f Inferring logics of ontologies (description logic, first order logic).
g Combinations or sequences of the other methods.
Figure 1 — Classification of terminology extraction approaches
Approaches to terminology extraction can be structured according to:
a) number of languages;
b) process;
c) technique;
d) technology;
e) extracted item.
Within the chosen approach, several terminology extraction methods can be applied, for example:
— statistical;
— linguistic;
— machine-learning-based;
— logic-based;
— rule-based.
Terminology extraction approaches and methods are detailed in 4.4.2 to 4.4.6.5.
4.4.2 Extraction method according to the number of languages
4.4.2.1 General
Extractors can focus on one language, two or more.
4.4.2.2 Monolingual
Some extractors focus on only one language. Others can process a variety of languages, but still only one
at a time.
4.4.2.3 Bilingual
Bilingual terminology extraction tools can extract terms from a source text and a target text, one pair of
languages at a time. These texts are compiled in a text corpus. Bitexts are ideal for mining by bilingual
terminology extraction tools. Few tools can handle comparable corpora (in which original language texts
on the same topic are collected in both languages). A bilingual extractor can simply present equivalent text
segments for the user to consult to locate an equivalent term, or even use an algorithm to propose a possible
equivalent for validation.
4.4.2.4 Multilingual
Parallel texts in more than two languages are rarely available for terminology extraction. Therefore,
constructing multilingual terminology resources frequently depends on the compilation of multiple bilingual
terminology extraction outputs.
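One possible way to compile several bilingual outputs into a multilingual resource is to merge them on a shared pivot language. The Python sketch below assumes that every bilingual output contains English; the function name merge_bilingual, the pivot choice and the sample term pairs are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def merge_bilingual(pair_outputs, pivot="en"):
    """Merge bilingual term pairs into one entry per pivot-language term."""
    entries = defaultdict(dict)
    for (src_lang, tgt_lang), pairs in pair_outputs.items():
        for src_term, tgt_term in pairs:
            if src_lang == pivot:
                entries[src_term][tgt_lang] = tgt_term
            elif tgt_lang == pivot:
                entries[tgt_term][src_lang] = src_term
            # Pairs that do not involve the pivot language are ignored here.
    return dict(entries)


# Hypothetical outputs of two separate bilingual extraction runs.
outputs = {
    ("en", "fr"): [("text corpus", "corpus de textes"),
                   ("candidate term", "terme candidat")],
    ("en", "de"): [("text corpus", "Textkorpus")],
}
merged = merge_bilingual(outputs)
```

Merging on the surface form of the pivot term is a simplification: the same pivot term can designate different concepts, so merged entries would still need validation before being added to a termbase.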
4.4.3 Extraction method according to the process
4.4.3.1 Manual terminology extraction
The most basic form of manual terminology extraction involves highlighting relevant terminological data in
a text and manually copying it into a list or termbase.
4.4.3.2 Automated terminology extraction
Automated terminology extraction generates a list of candidate terminological data for the user to validate
later, but does not allow validation during this phase.
4.4.3.3 Semi-automated terminology extraction
With semi-automated terminology extraction, the user can validate candidate terminological data within
the tool, before export. Validation at a pre-export stage reduces the noise in the resulting exported data.
4.4.4 Extraction method according to the underlying technique
4.4.4.1 Application of approaches
Most of the approaches described in this subclause apply primarily to term extraction unless otherwise stated.
4.4.4.2 Statistical terminology extraction
4.4.4.2.1 Overview
Statistical term extraction counts occurrences of candidate terms. The results are used to assess aspects,
including:
— the frequency of terms in a text corpus;
— their degree of relevance to a given subject field (termhood);
— the degree of association of words in an n-gram (unithood).
Statistical term extraction relies on the relative frequency of tokens in a text corpus or their distribution
within this text corpus.
Specific processing steps of statistical term extraction include:
— identification of relevant parameters such as token frequency, correlation and size of text corpus/
reference corpus;
— calculation of metrics based on selected formulae.
Depending on the type of targeted statistical data, standard tools or customized programs are used.
4.4.4.2.2 Frequency: counting occurrences
Frequency statistics are the basis for many terminology decisions and for more complex terminology
extraction methods. These statistics usually count types rather than tokens and include multiword terms.
To obtain useful results using this method of term extraction, the text corpus can be filtered to exclude
certain high-frequency words (the stop words containing little semantic information, such as articles,
conjunctions, prepositions and auxiliary verbs). Although it can be efficient to exclude very-low-frequency
words which do not reach a defined threshold for relevancy, there is a risk that some relevant terms will be
overlooked. Depending on the project goal, an approach other than merely statistical can be considered.
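The frequency-based filtering described above can be sketched as follows. This is a minimal Python illustration, not a prescribed procedure: the stop word list, the tokenizer, the function names and the threshold of two occurrences are all assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are language-specific.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "is", "are", "to", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic (optionally hyphenated) tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def count_candidates(text, max_n=2, min_freq=2):
    """Count n-grams (n = 1..max_n) that neither start nor end with a stop word."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    # Drop everything below the frequency threshold.
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


sample = ("The term extraction tool reads the text corpus. "
          "A text corpus is the input of term extraction.")
candidates = count_candidates(sample)
```

In this sample, "text corpus" and "term extraction" pass the threshold, while "tool" occurs only once and is dropped, which illustrates the risk of overlooking relevant low-frequency terms. The sketch also ignores sentence boundaries, so an n-gram can span two sentences.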
Lexical units (single-word or multiword lexical units) that are used as terms in subject-field-specific texts
commonly occur more frequently in subject-field-specific text corpora than in general-language usage.
Consequently, identifying candidate terms involves comparing the frequency of lexical units in a subject-
field-specific text corpus to reference corpora (a general-language text corpus or another subject-field-
specific text corpus). If a lexical unit’s frequency in the subject-field-specific text corpus is significantly
higher than that in the reference corpus, it is very likely to be a term.
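A simple way to make this comparison concrete is the ratio of relative frequencies in the two corpora. The Python sketch below is one possible formulation; the function name, the smoothing constant and all counts are hypothetical, and a plain ratio does not by itself establish statistical significance.

```python
def frequency_ratio(count_subject, size_subject,
                    count_reference, size_reference, smoothing=0.5):
    """Ratio of relative frequencies (subject corpus over reference corpus).

    The smoothing constant is an assumption of this sketch; it avoids division
    by zero when a lexical unit is absent from the reference corpus.
    """
    rel_subject = (count_subject + smoothing) / size_subject
    rel_reference = (count_reference + smoothing) / size_reference
    return rel_subject / rel_reference


# Hypothetical counts: a 100 000-token subject-field corpus
# compared with a 1 000 000-token reference corpus.
ratio_biopsy = frequency_ratio(120, 100_000, 30, 1_000_000)
ratio_time = frequency_ratio(200, 100_000, 2_000, 1_000_000)
```

With these invented counts, the ratio for "biopsy" is far above 1, marking it as a likely candidate term, whereas the ratio for "time" is close to 1, as expected for a general-language word.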
4.4.4.2.3 Termhood: relevance of terms
Termhood comprises the intersection among the following (see Figure 2):
— unithood (see 4.4.4.2.4);
— usage in the text corpus on which terminology extraction is applied;
— purpose of the terminological data collection.
NOTE Source: ISO 26162-3:2023, Figure 1.
Figure 2 — Termhood
Apart from frequency, the dispersion of candidate terms across various types of textual materials available
within an organization’s text corpus plays a crucial role for termhood.
The convergence of high statistical frequency and dispersion throughout a text corpus is often expressed as
keyness and is viewed as an indicator that a lexical unit can indeed be considered a term in the subject field
in question.
Different types of metrics are used to calculate the termhood of candidate terms, e.g. the log likelihood ratio,
the Chi square ratio or the Jaccard coefficient (see Reference [17]).
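This document names the log likelihood ratio without giving a formula. The Python sketch below shows one formulation that is commonly used for comparing the frequency of a lexical unit in two corpora; it is given as an assumption-laden illustration, and the function name and counts are hypothetical.

```python
import math


def log_likelihood(freq_subject, freq_reference, size_subject, size_reference):
    """Log likelihood of a frequency difference between two corpora.

    Expected frequencies assume that the lexical unit is equally likely
    in both corpora; a larger result means a stronger deviation from that.
    """
    total_freq = freq_subject + freq_reference
    total_size = size_subject + size_reference
    expected_subject = size_subject * total_freq / total_size
    expected_reference = size_reference * total_freq / total_size
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_subject > 0:
        ll += freq_subject * math.log(freq_subject / expected_subject)
    if freq_reference > 0:
        ll += freq_reference * math.log(freq_reference / expected_reference)
    return 2 * ll


# Hypothetical counts for one candidate term.
ll_candidate = log_likelihood(120, 30, 100_000, 1_000_000)
# Same relative frequency in both corpora: no evidence of termhood.
ll_neutral = log_likelihood(10, 100, 1_000, 10_000)
```

Values above 3.84, the critical value of the chi-square distribution with one degree of freedom at the 5 % level, are often treated as significant. The statistic does not indicate in which corpus the unit is overrepresented, so the relative frequencies still need to be compared.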
4.4.4.2.4 Unithood: degree of association of words
Multiword lexical units that show relatively stable syntagmatic structures and recur with collocational
frequency in an organization’s text corpus feature a high degree of unithood. They are therefore used in
communicative structures demanding consistency and qualify as terms when extracted for a specified
purpose, see Figure 2.
A multiword lexical unit which meets the criteria for unithood functions as a term if it designates an
identifiable concept in the textual and operational context in question.
The metrics used to determine the unithood of a group of words measure the significance of association
between term elements. They compare the frequency of the occurrence of a phrase with the frequency of the
occurrence of its components.
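Pointwise mutual information (PMI) is one association measure of this kind: it compares the observed probability of a word pair with the probability expected if its components were independent. The Python sketch below estimates the probabilities from raw counts; the function name and the counts are hypothetical.

```python
import math


def pmi(count_pair, count_first, count_second, total_tokens):
    """Pointwise mutual information (base 2) of a two-word phrase."""
    p_pair = count_pair / total_tokens
    p_first = count_first / total_tokens
    p_second = count_second / total_tokens
    return math.log2(p_pair / (p_first * p_second))


# Hypothetical counts in a corpus of 1 000 000 tokens.
pmi_strong = pmi(50, 100, 200, 1_000_000)            # components mostly occur together
pmi_independent = pmi(20, 2_000, 10_000, 1_000_000)  # pair frequency matches chance
```

A PMI near zero indicates that the phrase occurs about as often as chance would predict, while a high positive value indicates strong association. PMI is known to overrate rare pairs, which is one reason why it is often combined with a minimum frequency threshold.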
A wide range of term extraction procedures rely on “mutual information”, which is a concept adopted from
information theory that involves measuring the statistical interdependency of two words. Since the words that
form a given word pair can occur at a certain distance from each other, a suitable window is defined (e.g.
four words each to the left and right of the semantic head of the compound), and the mean and variance of
the word distances are calculated. The mean is calculated from all the distances between the two words in
the window range, while the variance measures how far the individual word distances deviate from the mean
(see Reference [17], pp. 158–159).
Collocations are characterized by a very low variance of the word distances, while a high
variance indicates the independence of the two words from each other.
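The window-based calculation can be sketched in Python as follows. The function name, the choice to treat each sentence separately and the sample sentences are assumptions made for this illustration.

```python
from statistics import mean, pvariance


def signed_distances(sentences, word_a, word_b, window=4):
    """Collect offsets of word_b relative to word_a within the window.

    Each sentence is a list of tokens; a positive offset means that
    word_b follows word_a, a negative one that it precedes it.
    """
    distances = []
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != word_a:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i and tokens[j] == word_b:
                    distances.append(j - i)
    return distances


# Hypothetical sample sentences, already tokenized.
sentences = [
    "she knocked on the door".split(),
    "he knocked at the door".split(),
    "they knocked on his door".split(),
    "a man knocked on the heavy door".split(),
]
distances = signed_distances(sentences, "knocked", "door")
mean_distance = mean(distances)
variance = pvariance(distances)
```

In this small sample, "door" appears three or four positions after "knocked" in every sentence, so the variance of the distances is low, which is the behaviour expected of a collocation.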
4.4.4.3 Linguistic terminology extraction
Linguistic terminology extraction methods are specific to individual languages and more precise than
statistical ones.
Linguistic terminology extraction relies on grammatical and morphological features of text corpus content,
especially lexical units. In languages for special purposes (e.g. in Indo-European languages), terms often
occur as single nouns. In Germanic, Nordic and Slavic languages, they form compound nouns or multiword
units consisting of adjectives and nouns. This pattern varies in Romance languages, where prepositional
phrases often replace noun-adjective strings. Character-based (e.g. Chinese) or root-based (Arabic)
languages have their own patterns for combining characters and roots, with the result that the
relationship of terms to “words” per se varies from language to language. In a language like Turkish,
certain morphemes and affixes are used to specify the meaning of words within the context of language for
special purposes. Furthermore, specific prefixes and suffixes are prevalent in some subject fields, and
different languages frequently have their own patterns for adding functional and elisional elements to form
compounds. The absence of white space in some languages further complicates the task of automatic term
recognition. Solutions that work for one language do not always work for another, and linguists in some
languages (e.g. Chinese) have created extensive resources to facilitate text corpus management and term
recognition.
For details on term formation and designation patterns, see ISO 704:2022, Annexes B and C.
An example of a subject-field-specific suffix is -itis in medical contexts.
Before commencing linguistic terminology extraction, it is necessary to perform a linguistic analysis of
the text corpus content. Morphological analysis is required for identifying subject-field-specific suffixes,
whereas part-of-speech (POS) tagging annotates the respective word class to each token and thus represents
the prerequisite for finding syntactic patterns underlying term formation.
Processing steps include:
— morphosyntactic analysis (including POS tagging) of the text corpus;
— defining patterns for relevant terms;
— filtering candidate terms based on these patterns.
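The processing steps above can be sketched in Python as follows. To keep the example self-contained, the tokens are tagged by hand with Universal POS-style labels rather than by a real tagger; the pattern (any sequence of adjectives and nouns ending in a noun), the tag codes and the function name are assumptions chosen for English.

```python
import re

# Map the POS tags of interest to one-letter codes; every other tag becomes "x".
TAG_CODES = {"ADJ": "A", "NOUN": "N"}

# Hypothetical term pattern: adjectives and/or nouns, ending in a noun.
TERM_PATTERN = re.compile(r"[AN]*N")


def extract_by_pattern(tagged_tokens):
    """Return the word sequences whose POS tags match TERM_PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    phrases = []
    for match in TERM_PATTERN.finditer(codes):
        # One code per token, so match positions are token positions.
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


# Hand-tagged sample standing in for the output of a POS tagger.
tagged = [
    ("the", "DET"), ("statistical", "ADJ"), ("term", "NOUN"),
    ("extraction", "NOUN"), ("uses", "VERB"), ("a", "DET"),
    ("reference", "NOUN"), ("corpus", "NOUN"),
]
phrases = extract_by_pattern(tagged)
```

Encoding the tag sequence as a string allows the term pattern to be written as an ordinary regular expression. A pattern for another language, for example one with prepositional phrases inside terms, would need different tag codes and a different expression.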
Since a linguistics-based approach can involve many aspects of the language, either a set of dedicated tools
or custom programs are used.
The programs used should have features applicable to the language(s) of the terminology extraction project.
For example, POS taggers use models learned from annotated data in a specific language, i.e. a POS tagger
will use a different model for English and for French.
POS taggers use pre-defined probabilistic heuristics for determining which part of speech is most plausible
for a certain token. To optimize results, POS taggers are usually trained using a small, manually annotated
sample (approximately 10 %) of the text corpus to be analysed before they are applied to annotate the larger
part of this text corpus.
Definition extraction has evolved much more slowly than term extraction and is highly language-dependent.
One method of extracting definitions for concepts designated by terms is identifying relations between
two terms within a sentence. For example, the so-called Hearst pattern [16] “A is a B” indicates a hierarchical
relation between the nouns A and B, where A represents a subordinate concept of B and B represents the
superordinate concept of A.
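A surface-level version of the “A is a B” pattern can be sketched with a regular expression. The Python example below is a simplification restricted to English and to single-word nouns; the expression, the function name and the sample sentences are assumptions made for this illustration.

```python
import re

# Matches "a/an X is a/an Y" with single-word X and Y.
IS_A_PATTERN = re.compile(r"\b[Aa]n?\s+(\w+)\s+is\s+an?\s+(\w+)")


def extract_is_a(text):
    """Return (subordinate, superordinate) pairs found by the surface pattern."""
    return IS_A_PATTERN.findall(text)


# Hypothetical sample text.
text = ("A melanoma is a tumour that develops from melanocytes. "
        "An adenoma is a tumour of glandular tissue. "
        "The corpus is large.")
relations = extract_is_a(text)
```

Because the expression works on raw text without POS information, it will also match sentences in which X or Y is not a term, so the extracted pairs are only candidates for validation.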
Extracted terms can also serve as a basis for bootstrapping approaches, that is, as a starting set for
identifying further terminology and to retrieve more texts to include in the corpus as well. An example
for resources on the web which feature synonyms to the identified term are lexical databases that link, by
[18] [14]
semantic relations, the concepts or contexts under
...