SIST ISO 24624:2018
Language resource management -- Transcription of spoken language
Language resource management -- Transcription of spoken language
ISO 24624:2016 specifies rules for representing transcriptions of audio- and video-recorded spoken interactions in XML documents based on the guidelines of the TEI. As a secondary objective, the document aims to relate transcribed data with standards for annotated corpora. It is applicable to transcription data for studies in sociolinguistics, conversation analysis, dialectology, corpus linguistics, corpus lexicography, language technology, qualitative social studies and other transcription data of recorded spoken language. It is not applicable to other forms of transcription, most importantly transcriptions of hand-written manuscripts.
Annex A gives a fully encoded example and Annex B provides an element index and an attribute index.
Gestion des ressources linguistiques -- Transcription du langage parlé
L'ISO 24624:2016 énonce des règles de représentation des transcriptions d'enregistrements audio et vidéo d'interactions parlées, dans des documents XML reposant sur les recommandations de la TEI. Le deuxième objectif de ce document vise à rattacher les données transcrites à des normes de corpus annotés. Il s'applique aux données de transcription pour des études sociolinguistiques, l'analyse de conversation, la dialectologie, la linguistique de corpus, la lexicographie de corpus, les technologies langagières, les études qualitatives en sciences sociales, et aux autres données de transcription d'enregistrements du langage parlé. Il ne s'applique pas aux autres formes de transcription et surtout pas aux transcriptions de manuscrits.
L'Annexe A présente un exemple d'encodage complet et l'Annexe B fournit un index des éléments et un index des attributs.
Upravljanje z jezikovnimi viri - Transkripcija govorjenega jezika
Ta dokument določa pravila za predstavitev transkripcij zvočnih in video posnetkov govorne komunikacije v dokumentih XML na podlagi smernic pobude za zapis besedil (TEI). Drugotni namen tega dokumenta je povezati prepisane podatke in standarde za označene korpuse. Uporablja se za prepisane podatke za študije na področju sociolingvistike, pogovorne analize, dialektologije, korpusnega jezikoslovja, korpusne leksikografije, jezikovne tehnologije, kvalitativne družboslovne študije in druge prepisane podatke zabeleženega govornega jezika. Ne uporablja se za druge oblike transkripcije, zlasti transkripcije ročno napisanih rokopisov.
V dodatku A je podan v celoti kodiran primer, v dodatku B pa sta podana kazalo elementov in kazalo atributov.
General Information
Standards Content (Sample)
Available free for research and teaching purposes. element) in a written document. It corresponds to a contiguous stretch of speech of
SLOVENSKI STANDARD
01-oktober-2018
Upravljanje z jezikovnimi viri - Transkripcija govorjenega jezika
Language resource management -- Transcription of spoken language
Gestion des ressources linguistiques -- Transcription du langage parlé
Ta slovenski standard je istoveten z: ISO 24624:2016
ICS:
01.140.10 3LVDQMHLQSUHþUNRYDQMH Writing and transliteration
35.060 Jeziki, ki se uporabljajo v Languages used in
informacijski tehniki in information technology
tehnologiji
2003-01.Slovenski inštitut za standardizacijo. Razmnoževanje celote ali delov tega standarda ni dovoljeno.
INTERNATIONAL ISO
STANDARD 24624
First edition
2016-08-15
Language resource management —
Transcription of spoken language
Gestion des ressources linguistiques — Transcription du langage parlé
Reference number
©
ISO 2016
© ISO 2016, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
ii © ISO 2016 – All rights reserved
Contents Page
Foreword .v
Introduction .vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Metadata . 2
4.1 Description of the electronic file () . 2
4.1.1 Distribution information () . 2
4.1.2 Recording information (). 2
4.2 Description of circumstances () . 4
4.2.1 Participant information () . 4
4.2.2 Setting information () . 4
4.3 Description of source () . 5
5 Macrostructure . 5
5.1 Timeline () . 5
5.2 Utterances () . 6
5.3 Free dependent annotations (, ) . 7
5.4 Grouping of utterances and dependent annotations () . 9
5.5 Independent elements outside utterances ( and ) .10
5.6 Inline paralinguistic annotation () .10
5.7 Global divisions of a transcription (
6 Microstructure .12
6.1 Tokens () .12
6.1.1 Characterization .12
6.1.2 Representation as .12
6.1.3 Further constraints .13
6.1.4 Examples .13
6.2 Pauses () .14
6.2.1 Characterization .14
6.2.2 Representation as .14
6.2.3 Further constraints .14
6.2.4 Examples .15
6.3 Audible and visible non-speech events (, and ) .15
6.3.1 Characterization .15
6.3.2 Representation as , or .16
6.3.3 Examples .16
6.4 Punctuation () .17
6.4.1 Characterization .17
6.4.2 Representation as .17
6.4.3 Further constraints .17
6.4.4 Examples .18
6.5 Uncertainty, alternatives, incomprehensible and omitted passages (,
, ) .18
6.5.1 Characterization .18
6.5.2 Representation as or .18
6.5.3 Further constraints .18
6.5.4 Examples .19
6.6 Units above the token and below the level () .20
6.6.1 Characterization .20
6.6.2 Representation as .20
6.6.3 Further constraints .20
6.6.4 Examples .20
Annex A (informative) Fully encoded example .22
Annex B (informative) Element and attribute index .28
Bibliography .31
iv © ISO 2016 – All rights reserved
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity assessment,
as well as information about ISO’s adherence to the World Trade Organization (WTO) principles in the
Technical Barriers to Trade (TBT) see the following URL: www.iso.org/iso/foreword.html.
The committee responsible for this document is ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
This document sets out to facilitate the interchange of transcriptions of spoken language between
different computational tools and environments for creating, editing, publishing and exploiting such
data. Transcription of spoken language in this context means an orthography-based transcription of
verbal activity as recorded in an audio or video recording of a natural interaction. The description of
activity in other modalities (e.g. body language, gestures and facial expression) may be part of a spoken
language transcription, but this document starts from the assumption that the verbal dimension is
the primary focus of a spoken language transcription. Likewise, although this document may also be
relevant for transcription based on phonetic alphabets like the IPA, the assumption for this document is
that orthography-based transcription is the default case.
This document is developed in the context of the joint agreement between ISO and the Text Encoding
Initiative (TEI) consortium, and accordingly, its content is also distributed as part of the TEI
[23]
guidelines.
This document takes into account data models and encoding practices supported by widely used
[12],[16],[17],[19]
transcription software. More specifically, it builds on several interoperability studies
involving the following tools:
[10]
— ANVIL
[11]
— CLAN
[22]
— ELAN
[20]
— EXMARaLDA
[18]
— FOLKER
[1]
— Transcriber
This document was developed to be compatible with the formats produced by these tools. The
[4]
compatibility may extend to the formats of further labelling tools (e.g. Praat or Wavesurfer, http://
www.speech.kth.se/wavesurfer/index2.html), but possibly on a lower level and/or with a requirement
to convert these formats to one of the above-mentioned before adding mandatory information (e.g.
speaker assignment) using the respective tools.
This document also aims to be usable with widely used transcription systems (“conventions”). However,
in a technical sense, compatibility is not easily definable in this area since, unlike the tool formats, most
of these systems lack an explicit formalization. The following selection of transcription systems was
considered for this document:
[11]
— Codes for the Human Analysis of Transcripts (CHAT)
[7]
— Discourse Transcription (DT)
[21]
— Gesprächsanalytisches Transkriptionssystem (GAT)
[13]
— Halbinterpretative Arbeitstranskriptionen (HIAT)
Since TEI is the reference framework for this document and metadata is not its main concern, no attempt
is made here to address metadata compatibility issues beyond the TEI header. However, it should be
noted that there are several TEI profiles for the CMDI framework which are related both to each other
and to CMDI profiles of other metadata formats (e.g. IMDI) via the ISOCAT registry (see also References
[5], [6] and [9]).
This document aims to define both a target format for legacy data conversion and a format suitable for
future data processing requirements. The pros and cons of these two demands were carefully weighed
up before decisions were taken. At some points, certain techniques are therefore marked as preferred
vi © ISO 2016 – All rights reserved
from a data processing point of view while an alternative technique is still allowed if the structure of
legacy data makes its use unavoidable.
With regard to the other standards developed within ISO committee TC 37/SC 4, this document is
intended to provide the primary layer on top of which further annotation layers may be implemented.
In particular, the use of the element for tokenizing a transcription is conformable to the TEI-based
representation of tokens ISO 24611 (MAF).
This document also aligns with the mechanism proposed in the TEI guidelines to embed stand-
off annotations within a TEI document. In particular, this mechanism contains a generic element
() that groups together annotations related to the same linguistic segment; this
grouping meets the needs of this document in the case of annotations of elements or its children.
Finally, this document is complementary and does not overlap with the speech and multimodal
interaction-related standards developed within the W3C. In particular, it does not deal with speech
[24]
synthesis as is the case for SSML, nor does it deal with the representation of the semantic
[25]
interpretation of multimodal utterances as does EMMA.
INTERNATIONAL STANDARD ISO 24624:2016(E)
Language resource management — Transcription of
spoken language
1 Scope
This document specifies rules for representing transcriptions of audio- and video-recorded spoken
interactions in XML documents based on the guidelines of the TEI. As a secondary objective, the
document aims to relate transcribed data with standards for annotated corpora. It is applicable to
transcription data for studies in sociolinguistics, conversation analysis, dialectology, corpus linguistics,
corpus lexicography, language technology, qualitative social studies and other transcription data
of recorded spoken language. It is not applicable to other forms of transcription, most importantly
transcriptions of hand-written manuscripts.
Annex A gives a fully encoded example and Annex B provides an element index and an attribute index.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— IEC Electropedia: available at http://www.electropedia.org/
— ISO Online browsing platform: available at http://www.iso.org/obp
3.1
dependent annotation
annotation which does not refer directly to an audio or video recording, but to another annotation,
typically an orthographic or phonetic transcription
3.2
milestone element
empty XML element used to indicate a boundary point
3.3
orthographic transcription
representation or modelling of spoken language based on the orthography of the respective language
3.4
paralinguistic feature
feature of spoken language beyond the individual sound(s), such as voice quality, pitch, volume,
intonation
3.5
phonetic transcription
representation or modelling of spoken language based on the sound system of the respective language
3.6
spoken language
oral language produced by a person’s vocal system
3.7
transcriber
person who carries out the transcription
3.8
transcription
representation or modelling of spoken language by means of written symbols
3.9
transcription system
theoretically founded set of principles and rules detailing what spoken language phenomena are to be
transcribed, and how they are to be transcribed
4 Metadata
The TEI guidelines formulate extensive suggestions for encoding metadata inside different subsections
of the element. The following section addresses only those pieces of metadata which are
either (i) crucial for ensuring the interpretability and exchangeability of spoken language transcriptions
in general or (ii) likely to be relevant in a large majority of cases. This does not preclude the possibility
of, or necessity for, encoding further metadata inside the element.
4.1 Description of the electronic file ()
4.1.1 Distribution information ()
The element inside the section of the should be used to
record information about access rights and contact information for the transcription in question.
EXAMPLE 1 Use of
Hamburger Zentrum für Sprachkorpora
No redistributing allowed.
Hamburger Zentrum für Sprachkorpora
Max Brauer-Allee 60
22765
Hamburg
Germany
4.1.2 Recording information ()
The element inside the section of the should be used to
record information about the transcribed recording(s). Only the actual recording(s), usually digital
audio and/or video files, should be described here. General information about the respective interaction
which is independent of the recording(s) should be described in the element (see 4.2.2).
2 © ISO 2016 – All rights reserved
A element inside a element should be used to refer to the corresponding digital
file via a @url attribute (see Reference [2]). A @type attribute on should be used to
indicate the media type of the recording; audio and video are the permissible values for that attribute.
The actual digital file type should be encoded as a @mimeType attribute (see Reference [8]) on the
element. Where two or more files are derived from the same master recording (e.g. a video
file or an extracted audio track), these should be represented as different elements inside the
same element, rather than as different elements. TEI linking mechanisms,
such as or @corresp, can be used to describe relationships between different recordings or
between recordings and other elements, such as speakers.
EXAMPLE 2 Use of
Parkinson Talkshow on BBC, broadcast on 02 November 2007
Video excerpt downloaded from YouTube with aTube-Catcher, converted
into MPG format with Adobe Premiere
Audio extracted from video with Audacity 1.3 beta
Recorded with a ZOOM H4NSP, external lapel microphone
clipped to Victoria Beckham’s
dress
Synchronized with David Beckham’s record-
ing
Recorded with a ZOOM H4NSP, external lapel microphone
clipped to David Beckham’s
shirt collar
Synchronized with
Victoria Beckham’s recording
4.2 Description of circumstances ()
4.2.1 Participant information ()
The participants of the transcribed interaction should be described in elements inside
the section of a element. The use of an @n attribute on the
element to define an abbreviated code for the respective participant is mandatory since it can be crucial
for many processing purposes. elements inside the body of the transcription refer to the @xml:id
attribute of a element, which shall therefore always be provided.
In order to provide additional metadata about participants, the content model of can be fully
exploited, for example, to record a person’s age, birth date, language knowledge or role in the recorded
conversation.
EXAMPLE 3 Use of
Daniel
Steward
British English
French
Fiona
Baker
4.2.2 Setting information ()
The element should be used to provide general information about the setting and
circumstances of the interaction. This includes such matters as the place and time, spatial organization
4 © ISO 2016 – All rights reserved
and artefacts of the interaction. Information pertaining to a specific recording of that interaction should
not be recorded here, but in the (see 4.1.2).
EXAMPLE 4 Use of
BBC studio London
Talkshow host Michael Parkinson interviewing David and Victoria
Beckham about their relationship
4.3 Description of source ()
The element is used to record information about the way the TEI encoded text has
been derived from a recorded source. This includes information about both the tool which created the
transcription inside an element and the convention used in transcribing the data inside a
element. @ident and @version attributes should be used on these elements to
provide a machine-readable way of accessing this information.
EXAMPLE 5 Use of
Transcription Tool providing a TEI Export
Orthographic transcription according to HIAT
5 Macrostructure
5.1 Timeline ()
elements inside a element should be used to define points in the recording;
these points are then referred to by @start, @end and @synch attributes of other elements (most
importantly elements) of the transcription to represent its temporal structure. It is therefore
obligatory to provide an @xml:id attribute for each element. elements shall be in
the same order as the timepoints they refer to. Specifying an @interval attribute is optional, but it is
very useful for many processing purposes. Absolute time values in the @interval attribute should be
given in seconds from the start of the recording with the appropriate number of decimal points. The
first element in the timeline corresponds to the start time of the transcribed recording. If an
absolute value is known for this point in time, it can be encoded in an @absolute attribute of the first
element and the element can point to it via an @origin attribute. If no absolute value for
the start of the recording can be provided, the @origin and @absolute attributes should be omitted.
EXAMPLE 6 Use of
5.2 Utterances ()
The element is the fundamental unit of organization for a transcription, roughly comparable to a
paragraph (
a single speaker. A more exact definition and delimitation of a do not lie within the scope of this
document. The TEI definition characterizing a as “often preceded by a silence or a change of speaker”
should be viewed as a suggestion only. It is therefore permissible to use a more refined definition for
a . This more refined definition can be described in the header in a element
inside an element.
If it is not wrapped inside an element (see 5.4), a element shall be assigned to a
single speaker by providing a value for the @who attribute which points to the @xml:id of a
element defined in the header. If the speaker cannot be identified, the @who attribute may also be
omitted. An @xml:id attribute can optionally serve to make the element addressable for stand-off
annotation, for instance, via elements (see 5.3).
If it is not wrapped inside an element (see 5.4), a element shall be assigned
to the timeline by providing values for the @start and @end attributes pointing to the @xml:id of
a element defined in the timeline. Further temporal structure can be recorded by inserting
elements at appropriate places inside the content of a element.
In multilingual interactions, it may be necessary to record the language of an utterance. This can be
done in an @xml:lang attribute of the element. Alternatively, the language of an utterance can
be treated as an annotation and encoded in a element (see 5.3). In cases of interactions where
code-switching or similar phenomena occur, it can be preferable to record the language of individual
tokens (see 6.1) instead of entire utterances.
The preferred mechanism for representing overlap is to encode it implicitly through the appropriate
use of @start and @end attributes and elements. Other TEI mechanisms, such as a
6 © ISO 2016 – All rights reserved
@trans=”overlap” attribute for the element, are allowed but not recommended because they
cannot be processed in an appropriate manner by many of the widely used annotation tools.
EXAMPLE 7 Temporal information for elements
Good morning!
Okay. Très bien, très bien.
Good morning!
Do not interrupt me!
Sorry, mate!
In the simplest case, elements contain character data, possibly interspersed with
elements (see Example 7). Further structuring of the content of a element (e.g. markup of tokens
and pauses) may be carried out via the mechanisms described in Clause 6.
The assumed default case is that contains an orthographic transcription in a broad sense, including
orthography-based mechanisms for approaching the actual phonetic realizations, such as “eye dialect”,
“literary transcription” and “modified orthography”. If this is the case, no further specification in the
form of a @notation attribute on is necessary. If, however, contains a phonemic or phonetic
transcription or is based on some other systematics, this should be indicated via a @notation attribute
with an appropriate value.
EXAMPLE 8 Phonetic transcription inside a element
ɡʊd ˈmɔːnɪŋ
If several types of transcription exist side-by-side (e.g. an orthographic and a phonetic transcription),
one level should be singled out as the primary transcription layer. Only this layer should be represented
inside elements, the other one being represented in appropriate elements (see 5.3).
5.3 Free dependent annotations (, )
Whereas typically, but not necessarily, contains the basic orthographic transcription,
elements should be used to represent additional annotations (e.g. part-of-speech tagging, prosodic
annotation and translation) on that basic transcription. Annotations of the same type should be
grouped in a element with a @type attribute specifying the annotation level.
The reference of the annotation in question shall be specified using @to and @from attributes in one
of the following ways:
— the values of @to and @from can point to the @xml:id attributes of other elements (e.g. a , a
or a ) of the transcription;
— the values of @to and @from can point to the @xml:id attributes of elements from the
timeline.
If the latter mechanism is used, elements shall be grouped with the element they refer
to by using an element (see 5.4). This is necessary to avoid ambiguities of reference
in cases of overlapping speech.
On the level of tokens, annotation via elements pointing to elements is conformable to the
annotation mechanism described in ISO 24611 (MAF).
Alternatively, annotations of single tokens (e.g. lemmatization and part-of-speech tagging) may be
realized as appropriate attributes on elements if no structural conflicts between the two levels
exist (see 6.1.2).
For annotations with an internal structure, nesting elements can be used. In that way, 1:n
relations between tokens and annotations, as well as hierarchically organized annotations, can be
expressed.
The use of further annotation techniques (e.g. via feature structures) is not precluded, but does not lie
within the scope of this document.
EXAMPLE 9 Use of and for annotations
faster
Okay.
Very good, very good.
PersPron
Idunno
I
do
not
know
8 © ISO 2016 – All rights reserved
JohnlovesMary
S
NP
N
VP
V
NP
N
5.4 Grouping of utterances and dependent annotations ()
elements and the annotations referring to them can be grouped under an
element. This has the advantage of creating local annotated environments, each (succession) of which
can be treated as an independent transcription in its own right, that is to say, it provides a “tesselation”
of the transcription document. elements in which spans point to the timeline rather than
directly to other elements of the transcription shall be grouped with the element they refer to,
because, otherwise, ambiguities with respect to their scope may arise in cases of overlapping speech.
Although the use of is optional, it is not allowed to mix and
elements on the top level; in other words, as soon as one element is used, all
elements have to be wrapped inside an element.
elements shall not contain more than one element. However, there may be
cases where it makes sense to use an as a container only for the description of a
non-verbal action of a participant (using one of the elements described in 6.3), without a subordinate
element.
If is used, speaker assignment through the @who attribute should be made on this
level instead of on the embedded element. The same holds for @start and @end attributes pointing
to the timeline. An @xml:id attribute can be used to make the addressable for
stand-off annotations.
The element can also be used as a stand-off annotation component within the
element, as specified in the TEI guidelines. In such a case,
points to the corresponding element by means of a @corresp attribute.
EXAMPLE 10 Use of
laughter
5.5 Independent elements outside utterances ( and )
and elements should be used to represent pauses and non-verbal phenomena
which cannot be attributed to a speaker. In this document, these elements appear on the same
hierarchical level as (or, as the case may be, ) elements. In order to fit them
into the temporal structure, they shall have @start and @end attributes pointing to the timeline.
EXAMPLE 11 Use of and outside utterances
roar of thunder outside
5.6 Inline paralinguistic annotation ()
The TEI guidelines provide the element to “[mark] the point at which some paralinguistic
feature of a series of utterances by any one speaker changes”. If used for that purpose, the element shall
be further specified by the attributes @feature (legal values: tempo for speed of utterance, loud for
loudness, pitch for pitch range, tension for tension or stress pattern, rhythm for rhythmic qualities
and voice for voice quality) and @new to provide the new value taken by the feature at this point. In
addition, a @synch attribute shall be provided to assign the element a position in the timeline.
is a milestone element. As such, it brings with it certain problems with automatic checking
and processing of the document structure. Since the description of paralinguistic features can also be
10 © ISO 2016 – All rights reserved
viewed as annotations of transcribed material, expressing the same content in a element (see
5.3) is the preferable alternative.
EXAMPLE 12 Use of
And he was up and away
...
Available free for research and teaching purposes. element) in a written document. It corresponds to a contiguous stretch of speech of
SLOVENSKI STANDARD
01-oktober-2018
Upravljanje z jezikovnimi viri - Transkripcija govorjenega jezika
Language resource management -- Transcription of spoken language
Gestion des ressources linguistiques -- Transcription du langage parlé
Ta slovenski standard je istoveten z: ISO 24624:2016
ICS:
01.140.10 Pisanje in prečrkovanje Writing and transliteration
35.240.30 Uporabniške rešitve IT v IT applications in information,
informatiki, dokumentiranju in documentation and
založništvu publishing
2003-01.Slovenski inštitut za standardizacijo. Razmnoževanje celote ali delov tega standarda ni dovoljeno.
INTERNATIONAL ISO
STANDARD 24624
First edition
2016-08-15
Language resource management —
Transcription of spoken language
Gestion des ressources linguistiques — Transcription du langage parlé
Reference number
©
ISO 2016
© ISO 2016, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
ii © ISO 2016 – All rights reserved
Contents Page
Foreword .v
Introduction .vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Metadata . 2
4.1 Description of the electronic file () . 2
4.1.1 Distribution information () . 2
4.1.2 Recording information (). 2
4.2 Description of circumstances () . 4
4.2.1 Participant information () . 4
4.2.2 Setting information () . 4
4.3 Description of source () . 5
5 Macrostructure . 5
5.1 Timeline () . 5
5.2 Utterances () . 6
5.3 Free dependent annotations (, ) . 7
5.4 Grouping of utterances and dependent annotations () . 9
5.5 Independent elements outside utterances ( and ) .10
5.6 Inline paralinguistic annotation () .10
5.7 Global divisions of a transcription (
6 Microstructure .12
6.1 Tokens () .12
6.1.1 Characterization .12
6.1.2 Representation as .12
6.1.3 Further constraints .13
6.1.4 Examples .13
6.2 Pauses () .14
6.2.1 Characterization .14
6.2.2 Representation as .14
6.2.3 Further constraints .14
6.2.4 Examples .15
6.3 Audible and visible non-speech events (, and ) .15
6.3.1 Characterization .15
6.3.2 Representation as , or .16
6.3.3 Examples .16
6.4 Punctuation () .17
6.4.1 Characterization .17
6.4.2 Representation as .17
6.4.3 Further constraints .17
6.4.4 Examples .18
6.5 Uncertainty, alternatives, incomprehensible and omitted passages (,
, ) .18
6.5.1 Characterization .18
6.5.2 Representation as or .18
6.5.3 Further constraints .18
6.5.4 Examples .19
6.6 Units above the token and below the level () .20
6.6.1 Characterization .20
6.6.2 Representation as .20
6.6.3 Further constraints .20
6.6.4 Examples .20
Annex A (informative) Fully encoded example .22
Annex B (informative) Element and attribute index .28
Bibliography .31
iv © ISO 2016 – All rights reserved
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity assessment,
as well as information about ISO’s adherence to the World Trade Organization (WTO) principles in the
Technical Barriers to Trade (TBT) see the following URL: www.iso.org/iso/foreword.html.
The committee responsible for this document is ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
This document sets out to facilitate the interchange of transcriptions of spoken language between
different computational tools and environments for creating, editing, publishing and exploiting such
data. Transcription of spoken language in this context means an orthography-based transcription of
verbal activity as recorded in an audio or video recording of a natural interaction. The description of
activity in other modalities (e.g. body language, gestures and facial expression) may be part of a spoken
language transcription, but this document starts from the assumption that the verbal dimension is
the primary focus of a spoken language transcription. Likewise, although this document may also be
relevant for transcription based on phonetic alphabets like the IPA, the assumption for this document is
that orthography-based transcription is the default case.
This document is developed in the context of the joint agreement between ISO and the Text Encoding
Initiative (TEI) consortium, and accordingly, its content is also distributed as part of the TEI
[23]
guidelines.
This document takes into account data models and encoding practices supported by widely used
[12],[16],[17],[19]
transcription software. More specifically, it builds on several interoperability studies
involving the following tools:
[10]
— ANVIL
[11]
— CLAN
[22]
— ELAN
[20]
— EXMARaLDA
[18]
— FOLKER
[1]
— Transcriber
This document was developed to be compatible with the formats produced by these tools. The
[4]
compatibility may extend to the formats of further labelling tools (e.g. Praat or Wavesurfer, http://
www.speech.kth.se/wavesurfer/index2.html), but possibly on a lower level and/or with a requirement
to convert these formats to one of the above-mentioned before adding mandatory information (e.g.
speaker assignment) using the respective tools.
This document also aims to be usable with widely used transcription systems (“conventions”). However,
in a technical sense, compatibility is not easily definable in this area since, unlike the tool formats, most
of these systems lack an explicit formalization. The following selection of transcription systems was
considered for this document:
[11]
— Codes for the Human Analysis of Transcripts (CHAT)
[7]
— Discourse Transcription (DT)
[21]
— Gesprächsanalytisches Transkriptionssystem (GAT)
[13]
— Halbinterpretative Arbeitstranskriptionen (HIAT)
Since TEI is the reference framework for this document and metadata is not its main concern, no attempt
is made here to address metadata compatibility issues beyond the TEI header. However, it should be
noted that there are several TEI profiles for the CMDI framework which are related both to each other
and to CMDI profiles of other metadata formats (e.g. IMDI) via the ISOCAT registry (see also References
[5], [6] and [9]).
This document aims to define both a target format for legacy data conversion and a format suitable for
future data processing requirements. The pros and cons of these two demands were carefully weighed
up before decisions were taken. At some points, certain techniques are therefore marked as preferred
vi © ISO 2016 – All rights reserved
from a data processing point of view while an alternative technique is still allowed if the structure of
legacy data makes its use unavoidable.
With regard to the other standards developed within ISO committee TC 37/SC 4, this document is
intended to provide the primary layer on top of which further annotation layers may be implemented.
In particular, the use of the element for tokenizing a transcription is conformable to the TEI-based
representation of tokens ISO 24611 (MAF).
This document also aligns with the mechanism proposed in the TEI guidelines to embed stand-
off annotations within a TEI document. In particular, this mechanism contains a generic element
() that groups together annotations related to the same linguistic segment; this
grouping meets the needs of this document in the case of annotations of elements or its children.
Finally, this document is complementary and does not overlap with the speech and multimodal
interaction-related standards developed within the W3C. In particular, it does not deal with speech
[24]
synthesis as is the case for SSML, nor does it deal with the representation of the semantic
[25]
interpretation of multimodal utterances as does EMMA.
INTERNATIONAL STANDARD ISO 24624:2016(E)
Language resource management — Transcription of
spoken language
1 Scope
This document specifies rules for representing transcriptions of audio- and video-recorded spoken
interactions in XML documents based on the guidelines of the TEI. As a secondary objective, the
document aims to relate transcribed data with standards for annotated corpora. It is applicable to
transcription data for studies in sociolinguistics, conversation analysis, dialectology, corpus linguistics,
corpus lexicography, language technology, qualitative social studies and other transcription data
of recorded spoken language. It is not applicable to other forms of transcription, most importantly
transcriptions of hand-written manuscripts.
Annex A gives a fully encoded example and Annex B provides an element index and an attribute index.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— IEC Electropedia: available at http://www.electropedia.org/
— ISO Online browsing platform: available at http://www.iso.org/obp
3.1
dependent annotation
annotation which does not refer directly to an audio or video recording, but to another annotation,
typically an orthographic or phonetic transcription
3.2
milestone element
empty XML element used to indicate a boundary point
3.3
orthographic transcription
representation or modelling of spoken language based on the orthography of the respective language
3.4
paralinguistic feature
feature of spoken language beyond the individual sound(s), such as voice quality, pitch, volume,
intonation
3.5
phonetic transcription
representation or modelling of spoken language based on the sound system of the respective language
3.6
spoken language
oral language produced by a person’s vocal system
3.7
transcriber
person who carries out the transcription
3.8
transcription
representation or modelling of spoken language by means of written symbols
3.9
transcription system
theoretically founded set of principles and rules detailing what spoken language phenomena are to be
transcribed, and how they are to be transcribed
4 Metadata
The TEI guidelines formulate extensive suggestions for encoding metadata inside different subsections
of the element. The following section addresses only those pieces of metadata which are
either (i) crucial for ensuring the interpretability and exchangeability of spoken language transcriptions
in general or (ii) likely to be relevant in a large majority of cases. This does not preclude the possibility
of, or necessity for, encoding further metadata inside the element.
4.1 Description of the electronic file ()
4.1.1 Distribution information ()
The element inside the section of the should be used to
record information about access rights and contact information for the transcription in question.
EXAMPLE 1 Use of
Hamburger Zentrum für Sprachkorpora
No redistributing allowed.
Hamburger Zentrum für Sprachkorpora
Max Brauer-Allee 60
22765
Hamburg
Germany
4.1.2 Recording information ()
The element inside the section of the should be used to
record information about the transcribed recording(s). Only the actual recording(s), usually digital
audio and/or video files, should be described here. General information about the respective interaction
which is independent of the recording(s) should be described in the element (see 4.2.2).
2 © ISO 2016 – All rights reserved
A element inside a element should be used to refer to the corresponding digital
file via a @url attribute (see Reference [2]). A @type attribute on should be used to
indicate the media type of the recording; audio and video are the permissible values for that attribute.
The actual digital file type should be encoded as a @mimeType attribute (see Reference [8]) on the
element. Where two or more files are derived from the same master recording (e.g. a video
file or an extracted audio track), these should be represented as different elements inside the
same element, rather than as different elements. TEI linking mechanisms,
such as or @corresp, can be used to describe relationships between different recordings or
between recordings and other elements, such as speakers.
EXAMPLE 2 Use of
Parkinson Talkshow on BBC, broadcast on 02 November 2007
Video excerpt downloaded from YouTube with aTube-Catcher, converted
into MPG format with Adobe Premiere
Audio extracted from video with Audacity 1.3 beta
Recorded with a ZOOM H4NSP, external lapel microphone
clipped to Victoria Beckham’s
dress
Synchronized with David Beckham’s record-
ing
Recorded with a ZOOM H4NSP, external lapel microphone
clipped to David Beckham’s
shirt collar
Synchronized with
Victoria Beckham’s recording
4.2 Description of circumstances ()
4.2.1 Participant information ()
The participants of the transcribed interaction should be described in elements inside
the section of a element. The use of an @n attribute on the
element to define an abbreviated code for the respective participant is mandatory since it can be crucial
for many processing purposes. elements inside the body of the transcription refer to the @xml:id
attribute of a element, which shall therefore always be provided.
In order to provide additional metadata about participants, the content model of can be fully
exploited, for example, to record a person’s age, birth date, language knowledge or role in the recorded
conversation.
EXAMPLE 3 Use of
Daniel
Steward
British English
French
Fiona
Baker
4.2.2 Setting information ()
The element should be used to provide general information about the setting and
circumstances of the interaction. This includes such matters as the place and time, spatial organization
4 © ISO 2016 – All rights reserved
and artefacts of the interaction. Information pertaining to a specific recording of that interaction should
not be recorded here, but in the (see 4.1.2).
EXAMPLE 4 Use of
BBC studio London
Talkshow host Michael Parkinson interviewing David and Victoria
Beckham about their relationship
4.3 Description of source ()
The element is used to record information about the way the TEI encoded text has
been derived from a recorded source. This includes information about both the tool which created the
transcription inside an element and the convention used in transcribing the data inside a
element. @ident and @version attributes should be used on these elements to
provide a machine-readable way of accessing this information.
EXAMPLE 5 Use of
Transcription Tool providing a TEI Export
Orthographic transcription according to HIAT
5 Macrostructure
5.1 Timeline ()
elements inside a element should be used to define points in the recording;
these points are then referred to by @start, @end and @synch attributes of other elements (most
importantly elements) of the transcription to represent its temporal structure. It is therefore
obligatory to provide an @xml:id attribute for each element. elements shall be in
the same order as the timepoints they refer to. Specifying an @interval attribute is optional, but it is
very useful for many processing purposes. Absolute time values in the @interval attribute should be
given in seconds from the start of the recording with the appropriate number of decimal points. The
first element in the timeline corresponds to the start time of the transcribed recording. If an
absolute value is known for this point in time, it can be encoded in an @absolute attribute of the first
element and the element can point to it via an @origin attribute. If no absolute value for
the start of the recording can be provided, the @origin and @absolute attributes should be omitted.
EXAMPLE 6 Use of
5.2 Utterances ()
The element is the fundamental unit of organization for a transcription, roughly comparable to a
paragraph (
a single speaker. A more exact definition and delimitation of a do not lie within the scope of this
document. The TEI definition characterizing a as “often preceded by a silence or a change of speaker”
should be viewed as a suggestion only. It is therefore permissible to use a more refined definition for
a . This more refined definition can be described in the header in a element
inside an element.
If it is not wrapped inside an element (see 5.4), a element shall be assigned to a
single speaker by providing a value for the @who attribute which points to the @xml:id of a
element defined in the header. If the speaker cannot be identified, the @who attribute may also be
omitted. An @xml:id attribute can optionally serve to make the element addressable for stand-off
annotation, for instance, via elements (see 5.3).
If it is not wrapped inside an element (see 5.4), a element shall be assigned
to the timeline by providing values for the @start and @end attributes pointing to the @xml:id of
a element defined in the timeline. Further temporal structure can be recorded by inserting
elements at appropriate places inside the content of a element.
In multilingual interactions, it may be necessary to record the language of an utterance. This can be
done in an @xml:lang attribute of the element. Alternatively, the language of an utterance can
be treated as an annotation and encoded in a element (see 5.3). In cases of interactions where
code-switching or similar phenomena occur, it can be preferable to record the language of individual
tokens (see 6.1) instead of entire utterances.
The preferred mechanism for representing overlap is to encode it implicitly through the appropriate
use of @start and @end attributes and elements. Other TEI mechanisms, such as a
6 © ISO 2016 – All rights reserved
@trans=”overlap” attribute for the element, are allowed but not recommended because they
cannot be processed in an appropriate manner by many of the widely used annotation tools.
EXAMPLE 7 Temporal information for elements
Good morning!
Okay. Très bien, très bien.
Good morning!
Do not interrupt me!
Sorry, mate!
In the simplest case, elements contain character data, possibly interspersed with
elements (see Example 7). Further structuring of the content of a element (e.g. markup of tokens
and pauses) may be carried out via the mechanisms described in Clause 6.
The assumed default case is that contains an orthographic transcription in a broad sense, including
orthography-based mechanisms for approaching the actual phonetic realizations, such as “eye dialect”,
“literary transcription” and “modified orthography”. If this is the case, no further specification in the
form of a @notation attribute on is necessary. If, however, contains a phonemic or phonetic
transcription or is based on some other systematics, this should be indicated via a @notation attribute
with an appropriate value.
EXAMPLE 8 Phonetic transcription inside a element
ɡʊd ˈmɔːnɪŋ
If several types of transcription exist side-by-side (e.g. an orthographic and a phonetic transcription),
one level should be singled out as the primary transcription layer. Only this layer should be represented
inside elements, the other one being represented in appropriate elements (see 5.3).
5.3 Free dependent annotations (, )
Whereas typically, but not necessarily, contains the basic orthographic transcription,
elements should be used to represent additional annotations (e.g. part-of-speech tagging, prosodic
annotation and translation) on that basic transcription. Annotations of the same type should be
grouped in a element with a @type attribute specifying the annotation level.
The reference of the annotation in question shall be specified using @to and @from attributes in one
of the following ways:
— the values of @to and @from can point to the @xml:id attributes of other elements (e.g. a , a
or a ) of the transcription;
— the values of @to and @from can point to the @xml:id attributes of elements from the
timeline.
If the latter mechanism is used, elements shall be grouped with the element they refer
to by using an element (see 5.4). This is necessary to avoid ambiguities of reference
in cases of overlapping speech.
On the level of tokens, annotation via elements pointing to elements is conformable to the
annotation mechanism described in ISO 24611 (MAF).
Alternatively, annotations of single tokens (e.g. lemmatization and part-of-speech tagging) may be
realized as appropriate attributes on elements if no structural conflicts between the two levels
exist (see 6.1.2).
For annotations with an internal structure, nesting elements can be used. In that way, 1:n
relations between tokens and annotations, as well as hierarchically organized annotations, can be
expressed.
The use of further annotation techniques (e.g. via feature structures) is not precluded, but does not lie
within the scope of this document.
EXAMPLE 9 Use of and for annotations
faster
Okay.
Very good, very good.
PersPron
Idunno
I
do
not
know
8 © ISO 2016 – All rights reserved
JohnlovesMary
S
NP
N
VP
V
NP
N
5.4 Grouping of utterances and dependent annotations ()
elements and the annotations referring to them can be grouped under an
element. This has the advantage of creating local annotated environments, each (succession) of which
can be treated as an independent transcription in its own right, that is to say, it provides a “tesselation”
of the transcription document. elements in which spans point to the timeline rather than
directly to other elements of the transcription shall be grouped with the element they refer to,
because, otherwise, ambiguities with respect to their scope may arise in cases of overlapping speech.
Although the use of is optional, it is not allowed to mix and
elements on the top level; in other words, as soon as one element is used, all
elements have to be wrapped inside an element.
elements shall not contain more than one element. However, there may be
cases where it makes sense to use an as a container only for the description of a
non-verbal action of a participant (using one of the elements described in 6.3), without a subordinate
element.
If is used, speaker assignment through the @who attribute should be made on this
level instead of on the embedded element. The same holds for @start and @end attributes pointing
to the timeline. An @xml:id attribute can be used to make the addressable for
stand-off annotations.
The element can also be used as a stand-off annotation component within the
element, as specified in the TEI guidelines. In such a case,
points to the corresponding element by means of a @corresp attribute.
EXAMPLE 10 Use of
laughter
5.5 Independent elements outside utterances ( and )
and elements should be used to represent pauses and non-verbal phenomena
which cannot be attributed to a speaker. In this document, these elements appear on the same
hierarchical level as (or, as the case may be, ) elements. In order to fit them
into the temporal structure, they shall have @start and @end attributes pointing to the timeline.
EXAMPLE 11 Use of and outside utterances
roar of thunder outside
5.6 Inline paralinguistic annotation ()
The TEI guidelines provide the element to “[mark] the point at which some paralinguistic
feature of a series of utterances by any one speaker changes”. If used for that purpose, the element shall
be further specified by the attributes @feature (legal values: tempo for speed of utterance, loud for
loudness, pitch for pitch range, tension for tension or stress pattern, rhythm for rhythmic qualities
and voice for voice quality) and @new to provide the new value taken by the feature at this point. In
addition, a @synch attribute shall be provided to assign the element a position in the timeline.
is a milestone element. As such, it brings with it certain problems with automatic checking
and processing of the document structure. Since the description of paralinguistic features can also be
10 © ISO 2016 – All rights reserved
viewed as annotations of transcribed material, expressing the same content in a element (see
5.3) is the preferable alternative.
EXAMPLE 12 Use of
And he was up and away
...
Available free for research and teaching purposes. element) in a written document. It corresponds to a contiguous stretch of speech of
INTERNATIONAL ISO
STANDARD 24624
First edition
2016-08-15
Language resource management —
Transcription of spoken language
Gestion des ressources linguistiques — Transcription du langage parlé
Reference number
©
ISO 2016
© ISO 2016, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
ii © ISO 2016 – All rights reserved
Contents Page
Foreword .v
Introduction .vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Metadata . 2
4.1 Description of the electronic file () . 2
4.1.1 Distribution information () . 2
4.1.2 Recording information (). 2
4.2 Description of circumstances () . 4
4.2.1 Participant information () . 4
4.2.2 Setting information () . 4
4.3 Description of source () . 5
5 Macrostructure . 5
5.1 Timeline () . 5
5.2 Utterances () . 6
5.3 Free dependent annotations (, ) . 7
5.4 Grouping of utterances and dependent annotations () . 9
5.5 Independent elements outside utterances ( and ) .10
5.6 Inline paralinguistic annotation () .10
5.7 Global divisions of a transcription (
6 Microstructure .12
6.1 Tokens () .12
6.1.1 Characterization .12
6.1.2 Representation as .12
6.1.3 Further constraints .13
6.1.4 Examples .13
6.2 Pauses () .14
6.2.1 Characterization .14
6.2.2 Representation as .14
6.2.3 Further constraints .14
6.2.4 Examples .15
6.3 Audible and visible non-speech events (, and ) .15
6.3.1 Characterization .15
6.3.2 Representation as , or .16
6.3.3 Examples .16
6.4 Punctuation () .17
6.4.1 Characterization .17
6.4.2 Representation as .17
6.4.3 Further constraints .17
6.4.4 Examples .18
6.5 Uncertainty, alternatives, incomprehensible and omitted passages (,
, ) .18
6.5.1 Characterization .18
6.5.2 Representation as or .18
6.5.3 Further constraints .18
6.5.4 Examples .19
6.6 Units above the token and below the level () .20
6.6.1 Characterization .20
6.6.2 Representation as .20
6.6.3 Further constraints .20
6.6.4 Examples .20
Annex A (informative) Fully encoded example .22
Annex B (informative) Element and attribute index .28
Bibliography .31
iv © ISO 2016 – All rights reserved
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity assessment,
as well as information about ISO’s adherence to the World Trade Organization (WTO) principles in the
Technical Barriers to Trade (TBT) see the following URL: www.iso.org/iso/foreword.html.
The committee responsible for this document is ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
This document sets out to facilitate the interchange of transcriptions of spoken language between
different computational tools and environments for creating, editing, publishing and exploiting such
data. Transcription of spoken language in this context means an orthography-based transcription of
verbal activity as recorded in an audio or video recording of a natural interaction. The description of
activity in other modalities (e.g. body language, gestures and facial expression) may be part of a spoken
language transcription, but this document starts from the assumption that the verbal dimension is
the primary focus of a spoken language transcription. Likewise, although this document may also be
relevant for transcription based on phonetic alphabets like the IPA, the assumption for this document is
that orthography-based transcription is the default case.
This document is developed in the context of the joint agreement between ISO and the Text Encoding
Initiative (TEI) consortium, and accordingly, its content is also distributed as part of the TEI
[23]
guidelines.
This document takes into account data models and encoding practices supported by widely used
[12],[16],[17],[19]
transcription software. More specifically, it builds on several interoperability studies
involving the following tools:
[10]
— ANVIL
[11]
— CLAN
[22]
— ELAN
[20]
— EXMARaLDA
[18]
— FOLKER
[1]
— Transcriber
This document was developed to be compatible with the formats produced by these tools. The
[4]
compatibility may extend to the formats of further labelling tools (e.g. Praat or Wavesurfer, http://
www.speech.kth.se/wavesurfer/index2.html), but possibly on a lower level and/or with a requirement
to convert these formats to one of the above-mentioned before adding mandatory information (e.g.
speaker assignment) using the respective tools.
This document also aims to be usable with widely used transcription systems (“conventions”). However,
in a technical sense, compatibility is not easily definable in this area since, unlike the tool formats, most
of these systems lack an explicit formalization. The following selection of transcription systems was
considered for this document:
[11]
— Codes for the Human Analysis of Transcripts (CHAT)
[7]
— Discourse Transcription (DT)
[21]
— Gesprächsanalytisches Transkriptionssystem (GAT)
[13]
— Halbinterpretative Arbeitstranskriptionen (HIAT)
Since TEI is the reference framework for this document and metadata is not its main concern, no attempt
is made here to address metadata compatibility issues beyond the TEI header. However, it should be
noted that there are several TEI profiles for the CMDI framework which are related both to each other
and to CMDI profiles of other metadata formats (e.g. IMDI) via the ISOCAT registry (see also References
[5], [6] and [9]).
This document aims to define both a target format for legacy data conversion and a format suitable for
future data processing requirements. The pros and cons of these two demands were carefully weighed
up before decisions were taken. At some points, certain techniques are therefore marked as preferred
vi © ISO 2016 – All rights reserved
from a data processing point of view while an alternative technique is still allowed if the structure of
legacy data makes its use unavoidable.
With regard to the other standards developed within ISO committee TC 37/SC 4, this document is
intended to provide the primary layer on top of which further annotation layers may be implemented.
In particular, the use of the element for tokenizing a transcription is conformable to the TEI-based
representation of tokens ISO 24611 (MAF).
This document also aligns with the mechanism proposed in the TEI guidelines to embed stand-
off annotations within a TEI document. In particular, this mechanism contains a generic element
() that groups together annotations related to the same linguistic segment; this
grouping meets the needs of this document in the case of annotations of elements or its children.
Finally, this document is complementary and does not overlap with the speech and multimodal
interaction-related standards developed within the W3C. In particular, it does not deal with speech
[24]
synthesis as is the case for SSML, nor does it deal with the representation of the semantic
[25]
interpretation of multimodal utterances as does EMMA.
INTERNATIONAL STANDARD ISO 24624:2016(E)
Language resource management — Transcription of
spoken language
1 Scope
This document specifies rules for representing transcriptions of audio- and video-recorded spoken
interactions in XML documents based on the guidelines of the TEI. As a secondary objective, the
document aims to relate transcribed data with standards for annotated corpora. It is applicable to
transcription data for studies in sociolinguistics, conversation analysis, dialectology, corpus linguistics,
corpus lexicography, language technology, qualitative social studies and other transcription data
of recorded spoken language. It is not applicable to other forms of transcription, most importantly
transcriptions of hand-written manuscripts.
Annex A gives a fully encoded example and Annex B provides an element index and an attribute index.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— IEC Electropedia: available at http://www.electropedia.org/
— ISO Online browsing platform: available at http://www.iso.org/obp
3.1
dependent annotation
annotation which does not refer directly to an audio or video recording, but to another annotation,
typically an orthographic or phonetic transcription
3.2
milestone element
empty XML element used to indicate a boundary point
3.3
orthographic transcription
representation or modelling of spoken language based on the orthography of the respective language
3.4
paralinguistic feature
feature of spoken language beyond the individual sound(s), such as voice quality, pitch, volume,
intonation
3.5
phonetic transcription
representation or modelling of spoken language based on the sound system of the respective language
3.6
spoken language
oral language produced by a person’s vocal system
3.7
transcriber
person who carries out the transcription
3.8
transcription
representation or modelling of spoken language by means of written symbols
3.9
transcription system
theoretically founded set of principles and rules detailing what spoken language phenomena are to be
transcribed, and how they are to be transcribed
4 Metadata
The TEI guidelines formulate extensive suggestions for encoding metadata inside different subsections
of the element. The following section addresses only those pieces of metadata which are
either (i) crucial for ensuring the interpretability and exchangeability of spoken language transcriptions
in general or (ii) likely to be relevant in a large majority of cases. This does not preclude the possibility
of, or necessity for, encoding further metadata inside the element.
4.1 Description of the electronic file ()
4.1.1 Distribution information ()
The element inside the section of the should be used to
record information about access rights and contact information for the transcription in question.
EXAMPLE 1 Use of
Hamburger Zentrum für Sprachkorpora
No redistributing allowed.
Hamburger Zentrum für Sprachkorpora
Max Brauer-Allee 60
22765
Hamburg
Germany
4.1.2 Recording information ()
The element inside the section of the should be used to
record information about the transcribed recording(s). Only the actual recording(s), usually digital
audio and/or video files, should be described here. General information about the respective interaction
which is independent of the recording(s) should be described in the element (see 4.2.2).
2 © ISO 2016 – All rights reserved
A element inside a element should be used to refer to the corresponding digital
file via a @url attribute (see Reference [2]). A @type attribute on should be used to
indicate the media type of the recording; audio and video are the permissible values for that attribute.
The actual digital file type should be encoded as a @mimeType attribute (see Reference [8]) on the
element. Where two or more files are derived from the same master recording (e.g. a video
file or an extracted audio track), these should be represented as different elements inside the
same element, rather than as different elements. TEI linking mechanisms,
such as or @corresp, can be used to describe relationships between different recordings or
between recordings and other elements, such as speakers.
EXAMPLE 2 Use of
Parkinson Talkshow on BBC, broadcast on 02 November 2007
Video excerpt downloaded from YouTube with aTube-Catcher, converted
into MPG format with Adobe Premiere
Audio extracted from video with Audacity 1.3 beta
Recorded with a ZOOM H4NSP, external lapel microphone
clipped to Victoria Beckham’s
dress
Synchronized with David Beckham’s record-
ing
Recorded with a ZOOM H4NSP, external lapel microphone
clipped to David Beckham’s
shirt collar
Synchronized with
Victoria Beckham’s recording
4.2 Description of circumstances ()
4.2.1 Participant information ()
The participants of the transcribed interaction should be described in elements inside
the section of a element. The use of an @n attribute on the
element to define an abbreviated code for the respective participant is mandatory since it can be crucial
for many processing purposes. elements inside the body of the transcription refer to the @xml:id
attribute of a element, which shall therefore always be provided.
In order to provide additional metadata about participants, the content model of can be fully
exploited, for example, to record a person’s age, birth date, language knowledge or role in the recorded
conversation.
EXAMPLE 3 Use of
Daniel
Steward
British English
French
Fiona
Baker
4.2.2 Setting information ()
The element should be used to provide general information about the setting and
circumstances of the interaction. This includes such matters as the place and time, spatial organization
4 © ISO 2016 – All rights reserved
and artefacts of the interaction. Information pertaining to a specific recording of that interaction should
not be recorded here, but in the (see 4.1.2).
EXAMPLE 4 Use of
BBC studio London
Talkshow host Michael Parkinson interviewing David and Victoria
Beckham about their relationship
4.3 Description of source ()
The element is used to record information about the way the TEI encoded text has
been derived from a recorded source. This includes information about both the tool which created the
transcription inside an element and the convention used in transcribing the data inside a
element. @ident and @version attributes should be used on these elements to
provide a machine-readable way of accessing this information.
EXAMPLE 5 Use of
Transcription Tool providing a TEI Export
Orthographic transcription according to HIAT
5 Macrostructure
5.1 Timeline ()
elements inside a element should be used to define points in the recording;
these points are then referred to by @start, @end and @synch attributes of other elements (most
importantly elements) of the transcription to represent its temporal structure. It is therefore
obligatory to provide an @xml:id attribute for each element. elements shall be in
the same order as the timepoints they refer to. Specifying an @interval attribute is optional, but it is
very useful for many processing purposes. Absolute time values in the @interval attribute should be
given in seconds from the start of the recording with the appropriate number of decimal points. The
first element in the timeline corresponds to the start time of the transcribed recording. If an
absolute value is known for this point in time, it can be encoded in an @absolute attribute of the first
element and the element can point to it via an @origin attribute. If no absolute value for
the start of the recording can be provided, the @origin and @absolute attributes should be omitted.
EXAMPLE 6 Use of
5.2 Utterances ()
The element is the fundamental unit of organization for a transcription, roughly comparable to a
paragraph (
a single speaker. A more exact definition and delimitation of a do not lie within the scope of this
document. The TEI definition characterizing a as “often preceded by a silence or a change of speaker”
should be viewed as a suggestion only. It is therefore permissible to use a more refined definition for
a . This more refined definition can be described in the header in a element
inside an element.
If it is not wrapped inside an element (see 5.4), a element shall be assigned to a
single speaker by providing a value for the @who attribute which points to the @xml:id of a
element defined in the header. If the speaker cannot be identified, the @who attribute may also be
omitted. An @xml:id attribute can optionally serve to make the element addressable for stand-off
annotation, for instance, via elements (see 5.3).
If it is not wrapped inside an element (see 5.4), a element shall be assigned
to the timeline by providing values for the @start and @end attributes pointing to the @xml:id of
a element defined in the timeline. Further temporal structure can be recorded by inserting
elements at appropriate places inside the content of a element.
In multilingual interactions, it may be necessary to record the language of an utterance. This can be
done in an @xml:lang attribute of the element. Alternatively, the language of an utterance can
be treated as an annotation and encoded in a element (see 5.3). In cases of interactions where
code-switching or similar phenomena occur, it can be preferable to record the language of individual
tokens (see 6.1) instead of entire utterances.
The preferred mechanism for representing overlap is to encode it implicitly through the appropriate
use of @start and @end attributes and elements. Other TEI mechanisms, such as a
6 © ISO 2016 – All rights reserved
@trans=”overlap” attribute for the element, are allowed but not recommended because they
cannot be processed in an appropriate manner by many of the widely used annotation tools.
EXAMPLE 7 Temporal information for elements
Good morning!
Okay. Très bien, très bien.
Good morning!
Do not interrupt me!
Sorry, mate!
In the simplest case, elements contain character data, possibly interspersed with
elements (see Example 7). Further structuring of the content of a element (e.g. markup of tokens
and pauses) may be carried out via the mechanisms described in Clause 6.
The assumed default case is that contains an orthographic transcription in a broad sense, including
orthography-based mechanisms for approaching the actual phonetic realizations, such as “eye dialect”,
“literary transcription” and “modified orthography”. If this is the case, no further specification in the
form of a @notation attribute on is necessary. If, however, contains a phonemic or phonetic
transcription or is based on some other systematics, this should be indicated via a @notation attribute
with an appropriate value.
EXAMPLE 8 Phonetic transcription inside a element
ɡʊd ˈmɔːnɪŋ
If several types of transcription exist side-by-side (e.g. an orthographic and a phonetic transcription),
one level should be singled out as the primary transcription layer. Only this layer should be represented
inside elements, the other one being represented in appropriate elements (see 5.3).
5.3 Free dependent annotations (, )
Whereas typically, but not necessarily, contains the basic orthographic transcription,
elements should be used to represent additional annotations (e.g. part-of-speech tagging, prosodic
annotation and translation) on that basic transcription. Annotations of the same type should be
grouped in a element with a @type attribute specifying the annotation level.
The reference of the annotation in question shall be specified using @to and @from attributes in one
of the following ways:
— the values of @to and @from can point to the @xml:id attributes of other elements (e.g. a , a
or a ) of the transcription;
— the values of @to and @from can point to the @xml:id attributes of elements from the
timeline.
If the latter mechanism is used, elements shall be grouped with the element they refer
to by using an element (see 5.4). This is necessary to avoid ambiguities of reference
in cases of overlapping speech.
On the level of tokens, annotation via elements pointing to elements is conformable to the
annotation mechanism described in ISO 24611 (MAF).
Alternatively, annotations of single tokens (e.g. lemmatization and part-of-speech tagging) may be
realized as appropriate attributes on elements if no structural conflicts between the two levels
exist (see 6.1.2).
For annotations with an internal structure, nesting elements can be used. In that way, 1:n
relations between tokens and annotations, as well as hierarchically organized annotations, can be
expressed.
The use of further annotation techniques (e.g. via feature structures) is not precluded, but does not lie
within the scope of this document.
EXAMPLE 9 Use of and for annotations
faster
Okay.
Very good, very good.
PersPron
Idunno
I
do
not
know
8 © ISO 2016 – All rights reserved
JohnlovesMary
S
NP
N
VP
V
NP
N
5.4 Grouping of utterances and dependent annotations ()
elements and the annotations referring to them can be grouped under an
element. This has the advantage of creating local annotated environments, each (succession) of which
can be treated as an independent transcription in its own right, that is to say, it provides a “tesselation”
of the transcription document. elements in which spans point to the timeline rather than
directly to other elements of the transcription shall be grouped with the element they refer to,
because, otherwise, ambiguities with respect to their scope may arise in cases of overlapping speech.
Although the use of is optional, it is not allowed to mix and
elements on the top level; in other words, as soon as one element is used, all
elements have to be wrapped inside an element.
elements shall not contain more than one element. However, there may be
cases where it makes sense to use an as a container only for the description of a
non-verbal action of a participant (using one of the elements described in 6.3), without a subordinate
element.
If is used, speaker assignment through the @who attribute should be made on this
level instead of on the embedded element. The same holds for @start and @end attributes pointing
to the timeline. An @xml:id attribute can be used to make the addressable for
stand-off annotations.
The element can also be used as a stand-off annotation component within the
element, as specified in the TEI guidelines. In such a case,
points to the corresponding element by means of a @corresp attribute.
EXAMPLE 10 Use of
laughter
5.5 Independent elements outside utterances ( and )
and elements should be used to represent pauses and non-verbal phenomena
which cannot be attributed to a speaker. In this document, these elements appear on the same
hierarchical level as (or, as the case may be, ) elements. In order to fit them
into the temporal structure, they shall have @start and @end attributes pointing to the timeline.
EXAMPLE 11 Use of and outside utterances
roar of thunder outside
5.6 Inline paralinguistic annotation ()
The TEI guidelines provide the element to “[mark] the point at which some paralinguistic
feature of a series of utterances by any one speaker changes”. If used for that purpose, the element shall
be further specified by the attributes @feature (legal values: tempo for speed of utterance, loud for
loudness, pitch for pitch range, tension for tension or stress pattern, rhythm for rhythmic qualities
and voice for voice quality) and @new to provide the new value taken by the feature at this point. In
addition, a @synch attribute shall be provided to assign the element a position in the timeline.
is a milestone element. As such, it brings with it certain problems with automatic checking
and processing of the document structure. Since the description of paralinguistic features can also be
10 © ISO 2016 – All rights reserved
viewed as annotations of transcribed material, expressing the same content in a element (see
5.3) is the preferable alternative.
EXAMPLE 12 Use of
And he was up and away
And he was up and away
faster
5.7 Global divisions of a transcription (
For a division of a transcription into larger sections (above the level of or
elements), for example, for different phases of an interaction, the
@subtype attributes may be used to categorize the larger units as required. This element is entirely
optional, but if it is used, a division shall be indicated for the whole of the transcription, that is to say,
every or shall be contained by some
EXAMPLE 13 Use of
...
Accès libre à des fins de recherche et d’enseignement. ) d’un document écrit. Il correspond à une séquence parlée d’un seul tenant
NORME ISO
INTERNATIONALE 24624
Première édition
2016-08-15
Gestion des ressources
linguistiques — Transcription du
langage parlé
Language resource management — Transcription of spoken language
Numéro de référence
©
ISO 2016
DOCUMENT PROTÉGÉ PAR COPYRIGHT
© ISO 2016, Publié en Suisse
Droits de reproduction réservés. Sauf indication contraire, aucune partie de cette publication ne peut être reproduite ni utilisée
sous quelque forme que ce soit et par aucun procédé, électronique ou mécanique, y compris la photocopie, l’affichage sur
l’internet ou sur un Intranet, sans autorisation écrite préalable. Les demandes d’autorisation peuvent être adressées à l’ISO à
l’adresse ci-après ou au comité membre de l’ISO dans le pays du demandeur.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
ii © ISO 2016 – Tous droits réservés
Sommaire Page
Avant-propos .iv
Introduction .v
1 Domaine d’application . 1
2 Références normatives . 1
3 Termes et définitions . 1
4 Métadonnées . 2
4.1 Description du fichier électronique () . 2
4.1.1 Informations de diffusion ( ) . 2
4.1.2 Informations sur l’enregistrement () . 3
4.2 Description des circonstances () . 4
4.2.1 Informations sur les participants () . 4
4.2.2 Informations sur le contexte () . 5
4.3 Description de la source () . 6
5 Macrostructure . 6
5.1 Frise chronologique () . 6
5.2 Énoncés () . 7
5.3 Annotations libres et dépendantes (,) . 8
5.4 Regroupement des énoncés et des annotations dépendantes () .10
5.5 Éléments indépendants hors énoncé ( et ) .11
5.6 Annotations paralinguistiques en ligne () .11
5.7 Divisions globales d’une transcription (
6 Microstructure .13
6.1 Token () .13
6.1.1 Caractérisation .13
6.1.2 Représentation comme .13
6.1.3 Autres contraintes .14
6.1.4 Exemples .14
6.2 Pauses () .15
6.2.1 Caractérisation .15
6.2.2 Représentation comme .16
6.2.3 Autres contraintes .16
6.2.4 Exemples .16
6.3 Événements audibles et visibles ne relevant pas du discours (, et
) .17
6.3.1 Caractérisation .17
6.3.2 Représentation comme , ou .17
6.3.3 Exemples .18
6.4 Ponctuation ().19
6.4.1 Caractérisation .19
6.4.2 Représentation comme .19
6.4.3 Autres contraintes .19
6.4.4 Exemples .19
6.5 Incertitude, alternatives, passages incompréhensibles et omis (,
, ) .20
6.5.1 Caractérisation .20
6.5.2 Représentation en tant que ou .20
6.5.3 Autres contraintes .20
6.5.4 Exemples .20
6.6 Unités au-dessus du token et en dessous du niveau ().22
6.6.1 Caractérisation .22
6.6.2 Représentation comme .22
6.6.3 Autres contraintes .22
Avant-propos
L’ISO (Organisation internationale de normalisation) est une fédération mondiale d’organismes
nationaux de normalisation (comités membres de l’ISO). L’élaboration des Normes internationales est
en général confiée aux comités techniques de l’ISO. Chaque comité membre intéressé par une étude
a le droit de faire partie du comité technique créé à cet effet. Les organisations internationales,
gouvernementales et non gouvernementales, en liaison avec l’ISO participent également aux travaux.
L’ISO collabore étroitement avec la Commission électrotechnique internationale (IEC) en ce qui
concerne la normalisation électrotechnique.
Les procédures utilisées pour élaborer le présent document et celles destinées à sa mise à jour sont
décrites dans les Directives ISO/IEC, Partie 1 Il convient, en particulier, de prendre note des différents
critères d’approbation requis pour les différents types de documents ISO. Le présent document a été
rédigé conformément aux règles de rédaction données dans les Directives ISO/IEC, Partie 2 (voir www.
iso.org/directives).
L’attention est appelée sur le fait que certains des éléments du présent document peuvent faire l’objet de
droits de propriété intellectuelle ou de droits analogues. L’ISO ne saurait être tenue pour responsable
de ne pas avoir identifié de tels droits de propriété et averti de leur existence. Les détails concernant
les références aux droits de propriété intellectuelle ou autres droits analogues identifiés lors de
l’élaboration du document sont indiqués dans l’Introduction et/ou dans la liste des déclarations de
brevets reçues par l’ISO (voir www.iso.org/brevets).
Les appellations commerciales éventuellement mentionnées dans le présent document sont données
pour information, par souci de commodité, à l’intention des utilisateurs et ne sauraient constituer un
engagement.
Pour une explication de la signification des termes et expressions spécifiques de l’ISO liés à l’évaluation
de la conformité, ou pour toute information au sujet de l’adhésion de l’ISO aux principes de l’Organisation
mondiale du commerce (OMC) concernant les obstacles techniques au commerce (OTC), voir le lien
suivant: www.iso.org/iso/fr/avant-propos.html.
Le présent document a été élaboré par le comité technique ISO/TC 37, Terminologie et autres ressources
langagières et ressources de contenu, sous-comité SC 4, Gestion des ressources linguistiques.
iviv © ISO 2016 – T© ISO 2016 – Tous drous droits roits réservéservésés
Introduction
Le présent document vise à faciliter l’échange de transcriptions du langage parlé entre différents outils
et environnements informatiques de création, de révision, de publication et d’exploitation de telles
données. La transcription du langage parlé dans ce contexte implique une transcription orthographique
de l’activité verbale telle qu’elle figure dans un enregistrement audio ou vidéo d’une interaction
naturelle. La description de l’activité selon d’autres modalités (par exemple, langage corporel, gestes et
expressions faciales) peut faire partie intégrante d’une transcription du langage parlé, mais ce document
part du principe que la composante verbale est l’objet premier d’une transcription du langage parlé. De
la même façon, bien que ce document puisse s’avérer pertinent pour une transcription en alphabets
phonétiques comme l’API, ce document repose sur l’hypothèse que la transcription orthographique est
le cas par défaut.
Le présent document est élaboré dans le cadre de l’accord commun entre l’ISO et le Text Encoding
Initiative (TEI) Consortium et, par conséquent, son contenu figure également dans les recommandations
[23]
de la TEI .
Le présent document tient compte des modèles de données et des pratiques d’encodage pris en charge
par des logiciels de transcription d’utilisation courante. Plus précisément, il s’appuie sur plusieurs
[12][16][17][19]
études d’interopérabilité portant sur les outils suivants:
[10]
— ANVIL
[11]
— CLAN
[22]
— ELAN
[20]
— EXMARaLDA
[18]
— FOLKER
[1]
— Transcriber
Le présent document a été élaboré pour être compatible avec les formats créés par ces outils. La
[4]
compatibilité peut s’étendre aux formats d’autres outils d’étiquetage (par exemple, Praat ou
Wavesurfer, http://www.speech.kth.se/wavesurfer/index2.html), mais peut-être à un niveau moindre
et/ou avec la nécessité de convertir ces formats dans l’un des formats ci-dessus mentionnés avant
d’ajouter des informations obligatoires (par exemple, assignation des locuteurs) à l’aide des outils
respectifs.
Le présent document a aussi pour objet d’être utilisé avec des systèmes de transcription d’utilisation
courante («conventions»). Cependant, sur un plan technique, la compatibilité n’est pas facile à définir
dans ce domaine puisque, à la différence des formats logiciels, la plupart de ces systèmes manquent de
formalisation explicite. Pour l’élaboration du présent document, les systèmes de transcription suivants
ont été pris en compte:
[11]
— Codes for the Human Analysis of Transcripts (CHAT)
[7]
— Discourse Transcription (DT)
[21]
— Gesprächsanalytisches Transkriptionssystem (GAT)
[13]
— Halbinterpretative Arbeitstranskriptionen (HIAT)
Puisque la TEI est le cadre de référence du présent document et que les métadonnées ne constituent
pas sa priorité, il n’est nullement question ici de traiter des questions de compatibilité des métadonnées
allant au-delà de l’en-tête TEI. Cependant, il convient de noter qu’il existe plusieurs profils TEI pour le
cadre CMDI qui sont reliés les uns aux autres et aux profils CMDI d’autres formats de métadonnées (par
exemple, IMDI) par l’intermédiaire du registre ISOCAT (voir aussi Références [5], [6] et [9]).
Le présent document vise à définir tant un format cible pour la conversion des données héritées qu’un
format adapté aux exigences futures de traitement des données. Les décisions n’ont été prises qu’après
avoir soigneusement pesé les avantages et les inconvénients de ces deux exigences. Par conséquent,
en quelques endroits, certaines techniques sont indiquées comme étant recommandées d’un point de
vue de traitement des données, cependant qu’une technique alternative est toujours autorisée si la
structure des données héritées rend son utilisation incontournable.
En ce qui concerne les autres normes élaborées au sein du Comité ISO TC 37/SC 4, le présent document a
pour objet la mise en place d’une première couche sur laquelle pourront se superposer d’autres couches
d’annotations. L’utilisation de l’élément pour la tokénisation d’une transcription, notamment, est
conforme à la représentation TEI des token de l’ISO 24611 (MAF).
Le présent document s’aligne également sur les mécanismes proposés dans les recommandations de la
TEI pour intégrer les annotations déportées à un document TEI. Ce mécanisme comporte notamment
un élément générique () qui regroupe les annotations relatives au même segment
linguistique: ce regroupement répond aux besoins du présent document dans le cas d’annotations de
l’élément ou de ses enfants.
Enfin, le présent document constitue un document complémentaire: il n’empiète pas sur les normes
relatives aux interactions orales et multimodales élaborées au sein du W3C. Il ne traite pas, notamment,
[24]
de la synthèse de la parole, comme dans le cas de la SSML, ni de la représentation de l’interprétation
[25]
sémantique des énoncés multimodaux comme l’EMMA.
vi © ISO 2016 – Tous droits réservés
NORME INTERNATIONALE ISO 24624:2016(F)
Gestion des ressources linguistiques — Transcription du
langage parlé
1 Domaine d’application
Le présent document énonce des règles de représentation des transcriptions d’enregistrements audio
et vidéo d’interactions parlées, dans des documents XML reposant sur les recommandations de la TEI.
Le deuxième objectif de ce document vise à rattacher les données transcrites à des normes de corpus
annotés. Il s’applique aux données de transcription pour des études sociolinguistiques, l’analyse de
conversation, la dialectologie, la linguistique de corpus, la lexicographie de corpus, les technologies
langagières, les études qualitatives en sciences sociales, et aux autres données de transcription
d’enregistrements du langage parlé. Il ne s’applique pas aux autres formes de transcription et surtout
pas aux transcriptions de manuscrits.
L’Annexe A présente un exemple d’encodage complet et l’Annexe B fournit un index des éléments et un
index des attributs.
2 Références normatives
Le présent document ne contient aucune référence normative.
3 Termes et définitions
Pour les besoins du présent document, les termes et définitions suivants s’appliquent.
L’ISO et l’IEC tiennent à jour des bases de données terminologiques destinées à être utilisées en
normalisation, consultables aux adresses suivantes:
— IEC Electropedia: disponible à l’adresse http://www.electropedia.org/
— ISO Online browsing platform: disponible à l’adresse http://www.iso.org/obp
3.1
annotation dépendante
annotation qui ne renvoie pas directement à un enregistrement audio ou vidéo, mais à une autre
annotation, généralement une transcription orthographique ou phonétique
3.2
élément de bornage
élément XML vide servant à indiquer un point de délimitation
3.3
transcription orthographique
représentation ou modélisation du langage parlé reposant sur l’orthographe dudit langage
3.4
caractéristique paralinguistique
caractéristique du langage parlé, au-delà du ou des sons proprement dits, comme la qualité de la voix, sa
tonalité, son volume ou son intonation
3.5
transcription phonétique
représentation ou modélisation du langage parlé reposant sur le système phonologique dudit langage
3.6
langage parlé
langage oral produit par la voix humaine
3.7
transcripteur
personne qui réalise la transcription
3.8
transcription
représentation ou modélisation d’un langage parlé au moyen de symboles scripturaux
3.9
système de transcription
ensemble de principes et de règles fondés sur une base théorique, détaillant les phénomènes du langage
parlé qui doivent être transcrits, ainsi que la façon de procéder à la transcription
4 Métadonnées
Les recommandations de la TEI donnent des indications détaillées d’encodage des métadonnées dans
différentes sous-sections de l’élément . La section suivante ne traite que des métadonnées
qui sont soit (i) essentielles pour assurer le caractère interprétable et échangeable de transcriptions
de langage parlé en général, soit (ii) susceptibles de s’avérer pertinentes dans une grande majorité
de cas. Cela n’exclut pas la possibilité ou la nécessité d’encoder d’autres métadonnées dans l’élément
.
4.1 Description du fichier électronique ()
4.1.1 Informations de diffusion ( )
Il convient d’utiliser l’élément dans la section de
pour enregistrer les informations relatives aux droits d’accès et aux coordonnées de contact pour la
transcription en question.
EXEMPLE 1 Utilisation de
Hamburger Zentrum für Sprachkorpora
Aucune rediffusion autorisée.
Hamburger Zentrum für Sprachkorpora
Max Brauer-Allee 60
22765
Hamburg
Germany
2 © ISO 2016 – Tous droits réservés
4.1.2 Informations sur l’enregistrement ()
Il convient d’utiliser l’élément dans la section de pour
enregistrer les informations relatives aux enregistrements transcrits. Il convient de décrire dans cet
élément uniquement le ou les enregistrements proprement dits, généralement des fichiers numériques
audio et/ou vidéo. Il convient de décrire les informations d’ordre général portant sur l’interaction
considérée, qui sont indépendantes de (des) enregistrement(s), dans l’élément
(voir 4.2.2).
Il convient d’utiliser un élément dans un élément pour renvoyer au fichier
numérique correspondant par l’intermédiaire d’un attribut @url (voir Référence [2]). Il convient
d’assigner un attribut @type à pour indiquer le type de média de l’enregistrement: les
valeurs autorisées pour cet attribut sont «audio» et «video». Il convient d’encoder le type véritable
du fichier numérique comme attribut @mimeType (voir Référence [8]) assigné à l’élément .
Lorsqu’au moins deux fichiers sont obtenus à partir du même enregistrement maître (par exemple, un
fichier vidéo ou un extrait de piste audio), il convient que lesdits fichiers soient représentés sous forme
d’éléments différents dans le même élément , plutôt que comme des éléments
différents. Des mécanismes de liaison TEI, tels que ou @corresp, peuvent être
utilisés pour décrire des relations entre différents enregistrements ou entre des enregistrements et
d’autres éléments, comme les locuteurs.
EXEMPLE 2 Utilisation de
Parkinson Talkshow sur la BBC, émission du 02 novembre 2007
gistrement -–>
sera -–>
ex. Camcorder) –->
Extrait vidéo téléchargé sur YouTube avec aTube-Catcher, converti
au format MPG avec Adobe Premiere
Piste audio extraite de la vidéo avec Audacity 1.3 beta
Enregistré avec un micro enregistreur portatif ZOOM H4NSP
fixé à la robe de Victoria Beckham
persName>
Synchronisé avec l’enregistrement de
David Beckham
Enregistré avec un micro enregistreur portatif ZOOM H4NSP
Fixé au col de chemise
de David Beckham
Synchronisé avec
l’enregistrement de Victoria Beckham
4.2 Description des circonstances ()
4.2.1 Informations sur les participants ()
Il convient de décrire les participants à l’interaction transcrite dans des éléments de la
section d’un élément . L’utilisation d’un attribut @n assigné à l’élément
pour définir un code abrégé représentant le participant concerné est obligatoire, car il
peut être indispensable pour répondre à de nombreux objectifs de traitement. Des éléments
dans le corps de la transcription renvoient à l’attribut @xml:id d’un élément qui doit, par
conséquent, être toujours prévu.
Afin de fournir des métadonnées supplémentaires sur les participants, il est possible d’exploiter la
totalité du modèle de contenu de , par exemple pour enregistrer l’âge, la date de naissance, le
niveau linguistique ou le rôle d’une personne dans la conversation enregistrée.
EXEMPLE 3 Utilisation de
4 © ISO 2016 – Tous droits réservés
Daniel
Steward
anglais britannique
français
Fiona
Baker
4.2.2 Informations sur le contexte ()
Il convient d’utiliser l’élément pour fournir des informations d’ordre général sur
le contexte et les circonstances de l’interaction. Cela inclut des aspects tels que l’endroit et l’heure,
l’organisation spatiale et les artéfacts de l’interaction. Il convient que les informations concernant un
enregistrement spécifique de cette interaction ne soient pas enregistrées dans cet élément, mais dans
l’élément (voir 4.1.2).
EXEMPLE 4 Utilisation de
studio de la BBC Londres
Animateur du talkshow Michael Parkinson interviewant David et
Victoria
Beckham au sujet de leur relation
4.3 Description de la source ()
On utilise l’élément pour enregistrer des informations sur la façon dont on obtient,
à partir d’une source enregistrée, le texte encodé selon la TEI. Cela comprend les informations tant
sur l’outil qui a produit la transcription, dans un élément , que la convention utilisée pour
transcrire les données, dans un élément . Il convient d’assigner les attributs @
ident et @version à ces éléments pour permettre l’accès à ces informations via un procédé lisible par
machine.
EXEMPLE 5 Utilisation de
Outil de transcription avec exportation TEI
Transcription orthographique conforme à HIAT
5 Macrostructure
5.1 Frise chronologique ()
Il convient d’utiliser des éléments dans un élément pour définir des repères dans
l’enregistrement: ainsi, les attributs @start, @end et @synch d’autres éléments de la transcription
(surtout les éléments ) renvoient à ces repères pour représenter la structure temporelle de
la transcription. Il est par conséquent obligatoire de prévoir un attribut @xml:id pour chaque élément
. Les éléments doivent figurer dans le même ordre que les repères temporels auxquels
ils renvoient. La spécification d’un attribut @interval est facultative, mais très utile pour répondre
à de nombreux objectifs de traitement. Dans l’attribut @interval, il convient d’indiquer les valeurs
de temps absolues en secondes, à partir du démarrage de l’enregistrement, en prévoyant le nombre
approprié de décimales. Le premier élément de la chronologie correspond à l’heure de départ
de l’enregistrement retranscrit. Si une valeur absolue est connue pour ce repère temporel, elle peut
être encodée dans un attribut @absolute assigné au premier élément et l’élément peut
pointer vers elle par l’intermédiaire d’un attribut @origin. Si aucune valeur absolue n’est fournie pour
le démarrage de l’enregistrement, il convient d’omettre les attributs @origin et @absolute.
EXEMPLE 6 Utilisation de
6 © ISO 2016 – Tous droits réservés
5.2 Énoncés ()
L’élément est l’unité fondamentale de l’organisation d’une transcription: il est assez comparable à
un paragraphe (élément
produite par un locuteur unique. Il ne relève pas du domaine d’application du présent document de
fournir une définition et une délimitation plus précises de . Il convient de considérer seulement
comme une suggestion la définition de la TEI caractérisant comme «souvent précédée d’un
silence ou d’un changement de locuteur». Il est par conséquent autorisé d’utiliser une définition de
plus détaillée. Cette définition plus détaillée peut être décrite dans l’en-tête dans un élément
figurant dans un élément .
S’il n’est pas encapsulé dans un élément (voir 5.4), un élément doit être
affecté à un locuteur unique en fournissant une valeur à l’attribut @who qui pointe vers l’élément @
xml:id d’un élément défini dans l’en-tête. Si le locuteur ne peut pas être identifié, l’attribut
@who peut également être omis. Un attribut @xml:id peut, en option, permettre de rendre l’élément
adressable pour une annotation déportée, par exemple par l’intermédiaire des éléments
(voir 5.3).
S’il n’est pas encapsulé dans un élément (voir 5.4), un élément doit être
assigné à la frise chronologique en fournissant des valeurs aux attributs @start et @end pointant
vers @xml:id d’un élément défini dans la frise chronologique. Une structure temporelle plus
complète peut être enregistrée en insérant des éléments < anchor > aux emplacements appropriés du
contenu d’un élément .
Dans le cadre d’interactions multilingues, il peut s’avérer nécessaire d’enregistrer la langue d’un énoncé.
Cela peut être réalisé à l’aide d’un attribut @xml:lang de l’élément . Une autre solution consiste à
traiter la langue de l’énoncé comme annotation et l’encoder dans un élément (voir 5.3). Dans le
cas d’interactions dans lesquelles se produit un changement de langue, donc de code, ou un phénomène
du même genre, il peut être préférable d’enregistrer la langue de chacun des token (voir 6.1) plutôt que
celle d’énoncés entiers.
Le mécanisme privilégié de représentation d’un chevauchement consiste à l’encoder implicitement
en utilisant de manière appropriée les attributs @start et @end et les éléments . D’autres
mécanismes TEI sont autorisés, tels que l’attribut @trans = ”overlap” assigné à l’élément , mais ne
sont pas recommandés parce qu’une grande partie des outils d’annotation les plus répandus ne peuvent
les traiter de façon appropriée.
EXEMPLE 7 Informations temporelles des éléments
male) ––>
Bonjour!
Okay. Très bien, très bien.
Bonjour!
s avec chevauchement partiel ––>
Ne m’interrompez
synch="#T1"/>pas!
Excusez moi!
Dans le cas le plus simple, les éléments contiennent des données-caractères, éventuellement
entrecoupées d’éléments (voir exemple 7). La structuration du contenu d’un élément
(par exemple, le balisage des token et des pauses) peut être approfondie par l’intermédiaire des
mécanismes décrits à l’Article 6.
Dans le cas par défaut supposé, contient une transcription orthographique au sens large,
intégrant également des mécanismes basés sur l’orthographe destinés à concevoir de véritables
réalisations phonétiques, comme le «eye dialect» (orthographe adaptée), la «transcription littéraire»
et l’«orthographe modifiée»). Si tel est le cas, aucune spécification supplémentaire sous la forme d’un
attribut @notation assigné à n’est nécessaire. Cependant, si contient une transcription
phonémique ou phonétique, ou s’il repose sur d’autres systématiques, il convient de l’indiquer par
l’intermédiaire d’un attribut @notation portant une valeur appropriée.
EXEMPLE 8 Transcription phonétique dans un élément
ɡʊd ˈmɔːnɪŋ
Si plusieurs types de transcription coexistent (par exemple, une transcription orthographique et une
transcription phonétique), il convient de choisir un niveau au titre de première couche de transcription.
Il convient de ne représenter que cette couche dans les éléments , l’autre étant représentée dans les
éléments appropriés (voir 5.3).
5.3 Annotations libres et dépendantes (,)
Attendu que généralement, mais pas nécessairement, contient la transcription orthographique de
base, il convient d’utiliser des éléments pour représenter des annotations complémentaires
(par exemple, balisage des parties du discours, annotation prosodique et traduction) concernant cette
transcription de base. Il convient de grouper les annotations de même type dans un élément
avec un attribut @type spécifiant le niveau d’annotation.
La référence de l’annotation en question doit être spécifiée à l’aide des attributs @to et @from selon
l’une ou l’autre des méthodes suivantes:
— les valeurs de @to et de @from peuvent pointer vers les attributs @xml:id d’autres éléments de la
transcription (par exemple, un élément , un élément ou un élément );
8 © ISO 2016 – Tous droits réservés
— les valeurs de @to et de @from peuvent pointer vers les attributs @xml:id d’éléments de
la frise chronologique.
Lorsque ce deuxième procédé est utilisé, les éléments doivent être groupés dans l’élément
auquel ils renvoient en utilisant un élément (voir 5.4). Cela est nécessaire
pour éviter toute ambiguïté de renvoi dans les cas de discours se chevauchant.
Au niveau des token, l’annotation par l’intermédiaire d’éléments pointant vers des éléments
est conforme au mécanisme d’annotation décrit dans l’ISO 24611 (MAF).
En variante, des annotations de token individuels (par exemple, lemmatisation et balisage de parties du
discours) peuvent être réalisées sous la forme d’attributs appropriés assignés aux éléments s’il ne
se pose aucun conflit d’ordre structurel entre les deux niveaux (voir 6.1.2).
En ce qui concerne les annotations avec structure interne, il est possible de recourir à l’imbrication
d’éléments . De cette façon, il est possible de représenter des relations 1:n entre les token et les
annotations, ainsi que des annotations organisées hiérarchiquement.
L’utilisation de techniques d’annotation plus approfondies (par exemple, par l’intermédiaire de
structures de traits) n’est pas exclue, mais elle ne relève pas du domaine d’application du présent
document.
EXEMPLE 9 Utilisation de et de pour des annotations
faster
Okay
Very good, very good.
––>
PersPron
Idunno
I
do
not
know
JohnlovesMary
S
NP
N
VP
V
NP
N
5.4 Regroupement des énoncés et des annotations dépendantes ()
Les éléments et les annotations qui y renvoient peuvent être regroupés sous un élément
. Cela présente l’avantage de créer des contextes annotés localement, dont chacun
(succession) peut être traité, en lui-même, comme une transcription indépendante, c’est-à-dire qu’il
offre une « présentation en mosaïque » du document de transcription. Les éléments dans
lesquels les intervalles (span) pointent vers la frise chronologique plutôt que directement vers d’autres
éléments de la transcription doivent être regroupés sous l’élément auquel ils renvoient, car, sinon,
des ambiguïtés concernant leur champ d’application peuvent apparaître dans le cas de discours se
chevauchant.
Bien que l’utilisation de soit facultative, il n’est pas autorisé de combiner des
éléments et des éléments au niveau supérieur: autrement dit, dès qu’un
élément est utilisé, tous les éléments doivent être encapsulés dans un élément
.
Les éléments ne doivent pas contenir plus d’un élément . Cependant, il peut y
avoir des cas dans lesquels il est logique d’utiliser uniquement comme conteneur
de la description d’une action non verbale d’un participant (en utilisant un des éléments décrits en 6.3),
mais sans élément subordonné .
Si est utilisé, il convient que l’assignation des locuteurs par l’attribut @who soit
réalisée à ce niveau et non pas au niveau de l’élément encapsulé . Il en va de même pour les attributs
@start et @end pointant vers la frise chronologique. Il est possible d’utiliser un attribut @xml:id pour
rendre adressable en ce qui concerne les annotations déportées.
L’élément peut aussi être utilisé comme composante d’une annotation déportée
dans l’élément , comme spécifié dans les recommandations TEI. Dans un tel cas,
pointe vers l’élément correspondant au moyen d’un attribut @corresp.
EXEMPLE 10 Utilisation de
10 © ISO 2016 – Tous droits réservés
subordonné ––>
laughter
5.5 Éléments indépendants hors énoncé ( et )
Il convient d’utiliser les éléments et pour représenter des pauses et des
phénomènes non verbaux qui ne peuvent pas être attribués à un locuteur. Dans ce document, ces
éléments apparaissent au même niveau hiérarchique que les éléments (ou, comme
cela peut se produire, ). Afin de les intégrer dans la structure temporelle, ils doivent avoir des
attributs @start et @end pointant vers la frise chronologique.
EXEMPLE 11 Utilisation d’énoncés externes et
...














Questions, Comments and Discussion
Ask us and Technical Secretary will try to provide an answer. You can facilitate discussion about the standard in here.
Loading comments...