Information technology - Biometric data interchange formats - Part 13: Voice data

ISO/IEC 19794-13:2018 specifies a data interchange format that can be used for storing, recording, and transmitting digitized acoustic human voice data (speech) assumed to be from a single speaker recorded in a single session. This format is designed specifically to support a wide variety of Speaker Identification and Verification (SIV) applications, both text-dependent and text-independent, with minimal assumptions made regarding the voice data capture conditions or the collection environment. Other uses for the data encapsulated in this format, such as automated speech recognition (ASR), may be possible, but are not addressed in this document. This document also does not address handling of data that has been processed to the feature or voice model levels. No application-specific requirements, equipment, or features are addressed in this document. This document supports the optional inclusion of non-standardized extended data. This document allows both the original data captured and digitally-processed (enhanced) voice data to be exchanged. A description of any processing of the original source input is intended to be included in the metadata associated with the voice representations (VRs). This document does not address data streaming. Provisions that stored and transmitted biometric data be time-stamped and that cryptographic techniques be used to protect their authenticity, integrity and confidentiality are out of the scope of this document. Information formatted in accordance with this document can be recorded on machine-readable media or can be transmitted by data communication between systems. A general content-oriented subclause describing the voice data interchange format is followed by a subclause addressing an XML schema definition. ISO/IEC 19794-13:2018 includes vocabulary in common use by the speech and speaker recognition community, as well as terminology from other ISO standards.

Technologies de l'information — Formats d'échanges de données biométriques — Partie 13: Données relatives à la voix

General Information

Status
Published
Publication Date
22-Feb-2018
Current Stage
9093 - International Standard confirmed
Start Date
06-Sep-2024
Completion Date
30-Oct-2025
Ref Project

Overview

ISO/IEC 19794-13:2018 defines a standardized biometric voice data interchange format for storing, recording and transmitting digitized human voice (speech) assumed to come from a single speaker in a single session. The format is purpose-built to support a wide range of Speaker Identification and Verification (SIV) applications (both text-dependent and text-independent). It permits exchange of original and digitally processed (enhanced) voice recordings and includes an XML schema for metadata and structural description. Streaming, feature/model-level representations, and cryptographic/time-stamping requirements are intentionally out of scope.

Key topics and technical requirements

  • Single-speaker, single-session assumption: Records are attributed to one individual and one capture session.
  • Support for SIV use cases: Designed to work with text-dependent and text-independent speaker recognition workflows.
  • Voice representation (VR) structure: The standard defines a voice record general header and voice representation headers covering date/time, audio content descriptors, quality metrics and signal enhancement metadata (see the illustrative sketch after this list).
  • Audio content and encoding: The standard accommodates a variety of audio encodings and allows descriptive metadata for capture device, transducer and channel conditions.
  • Metadata and processing traceability: Any processing applied to the original audio (e.g., enhancement) should be documented in associated metadata.
  • Optional extended vendor data: Vendors can include non‑standardized extended data to support proprietary needs while preserving interoperability.
  • XML schema and content-orientation: The document supplies an XML schema to represent records and associated fields for interchange.
  • Conformance testing guidance: Annexes describe testing methodology for conformance to the format.
  • Explicit exclusions: Does not address streaming, handling of feature or model-level biometric data, application-specific equipment, or security (cryptographic protection, time-stamping).
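
The two-level structure summarized above (a voice record general header plus one or more voice representations, each with its own header and data) can be visualized with a minimal, non-normative Python sketch. All class and field names below are illustrative assumptions of this sketch, not the element names of the standard's XML schema; only the grouping of fields and the mandatory/optional split follow the clauses and tables cited in the comments.

```python
# Illustrative data model only; not the normative XML schema of ISO/IEC 19794-13.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AudioMetaInformation:          # cf. 7.3.7 / Table 5
    channel_count: int = 1           # default 1 (7.3.7.2)
    sampling_rate_hz: int = 0
    bits_per_sample: int = 0         # 0 for variable bit depth (7.3.7.4)
    audio_duration_ms: int = 0


@dataclass
class VoiceRepresentation:           # cf. 7.4 / Table 6
    audio_content: bytes             # mandatory
    capture_datetime: Optional[str] = None
    quality: Optional[dict] = None
    signal_enhancement: Optional[str] = None
    extended_vendor_data: Optional[bytes] = None


@dataclass
class VoiceRecord:                   # cf. 7.3 / Table 2
    version: str
    audio_meta: AudioMetaInformation
    channel_type: str = "Unknown"    # default (7.3.4.2)
    session_id: Optional[str] = None
    capture_device: Optional[str] = None
    transducer: Optional[str] = None
    capture_process_protocol: Optional[str] = None
    extended_vendor_data: Optional[bytes] = None
    representations: List[VoiceRepresentation] = field(default_factory=list)
```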

Applications and who uses it

ISO/IEC 19794-13:2018 is used by:

  • Biometrics vendors and system integrators implementing speaker enrollment, verification and identification systems.
  • Governments and border/security agencies managing voice biometric databases and identity programs.
  • Telecom and call-center vendors implementing voice-based authentication (IVR-integrated SIV).
  • Forensic labs and research organizations exchanging high-quality voice recordings with consistent metadata for analysis.
  • Developers building interoperable voice data exchange pipelines between capture devices, storage systems and recognition engines.

Practical benefits include consistent record structure, improved interoperability between capture and recognition systems, and clear metadata for quality assessment and auditability.

Related standards

  • ISO/IEC 19794-1 - Framework for biometric data interchange formats
  • ISO/IEC 19785-1 - Common Biometric Exchange Formats Framework (data element specification)
  • ISO/IEC 2382-37 - Biometrics vocabulary

Keywords: ISO/IEC 19794-13:2018, voice data, biometric data interchange, speaker identification, speaker verification, voice metadata, XML schema, SIV, voice recordings.

Standard
ISO/IEC 19794-13:2018 - Information technology — Biometric data interchange formats — Part 13: Voice data (released 23 February 2018)
English language
26 pages

Standards Content (Sample)


INTERNATIONAL STANDARD
ISO/IEC 19794-13
First edition
2018-03
Information technology — Biometric
data interchange formats —
Part 13:
Voice data
Technologies de l'information — Formats d'échanges de données
biométriques —
Partie 13: Données relatives à la voix
Reference number: ISO/IEC 19794-13:2018(E)
© ISO/IEC 2018
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Conformance
6 Processes and identifiers
6.1 Capture processes and utterances
6.1.1 Introduction
6.1.2 Voice utterance
6.1.3 Structure of a capture process
6.2 Registered format type identifiers
7 General voice data interchange format (BDB)
7.1 Overview
7.2 Conventions
7.3 Voice record general header
7.3.1 Overview
7.3.2 Version
7.3.3 Session ID
7.3.4 Channel
7.3.5 Capture device
7.3.6 Transducer
7.3.7 Audio meta information
7.3.8 Capture process protocol
7.3.9 Extended vendor data
7.4 Voice representation header
7.4.1 Overview
7.4.2 Date and time
7.4.3 Audio content
7.4.4 Quality information
7.4.5 Signal enhancement
7.4.6 Extended vendor data
7.5 Voice representation data
7.6 Schema
7.7 Example
Annex A (normative) Conformance testing methodology
Bibliography

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical
activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international
organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the
work. In the field of information technology, ISO and IEC have established a joint technical committee,
ISO/IEC JTC 1.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for
the different types of document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following
URL: www.iso.org/iso/foreword.html.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 37, Biometrics.
A list of all the parts in the ISO/IEC 19794 series can be found on the ISO website.

Introduction
This document assumes that the voice data interchange record is to be attributed to a single individual
and recorded in a single session. Voice data is a time record of audible, acoustic vibrations produced by
a human in the course of a verbal interaction and will generally contain both speech and non-speech
vocal sounds, as well as non-vocal sounds to be considered “noise” in this context. In addition to serving
the linguistic function of semantic information transfer, voice data contains both acoustic and semantic
information that can be used to recognize speakers. It is the collection, storage and transmission of voice
data containing speech for the purpose of recognizing individuals that is the focus of this document.
This format is designed specifically to support a wide variety of automatic speaker recognition
applications, including both text-dependent and text-independent Speaker Identification and
Verification (SIV) and enrolment, with minimal assumptions made regarding the voice data capture
conditions or the collection environment. This document is intended to be sufficiently general that
speaker recognition applications beyond traditional SIV could also be supported, such as linking
utterances to the same unknown speaker, and determining that a known speaker is not the source of
an utterance. The differentiation between speech used to create the reference for future comparisons
(which in some applications is called “enrolment”), and that used to create voice representations (VRs)
queried against the references, might occur only at the point of application, thus requiring each stored
speech record to potentially support either reference or query creation. Further, automated speaker
recognition might incorporate related technologies, such as speech and language recognition, not only
in current algorithms and applications, but in future ways that cannot be anticipated. Therefore, this
document is written from a very broad perspective with the intent of supporting the broadest possible
range of speaker recognition applications and technical approaches.

INTERNATIONAL STANDARD ISO/IEC 19794-13:2018(E)
Information technology — Biometric data interchange
formats —
Part 13:
Voice data
1 Scope
This document specifies a data interchange format that can be used for storing, recording, and
transmitting digitized acoustic human voice data (speech) assumed to be from a single speaker
recorded in a single session. This format is designed specifically to support a wide variety of Speaker
Identification and Verification (SIV) applications, both text-dependent and text-independent, with
minimal assumptions made regarding the voice data capture conditions or the collection environment.
Other uses for the data encapsulated in this format, such as automated speech recognition (ASR), may
be possible, but are not addressed in this document. This document also does not address handling of
data that has been processed to the feature or voice model levels. No application-specific requirements,
equipment, or features are addressed in this document. This document supports the optional inclusion
of non-standardized extended data. This document allows both the original data captured and digitally-
processed (enhanced) voice data to be exchanged. A description of any processing of the original source
input is intended to be included in the metadata associated with the voice representations (VRs). This
document does not address data streaming.
Provisions that stored and transmitted biometric data be time-stamped and that cryptographic
techniques be used to protect their authenticity, integrity and confidentiality are out of the scope of this
document.
Information formatted in accordance with this document can be recorded on machine-readable media
or can be transmitted by data communication between systems.
A general content-oriented subclause describing the voice data interchange format is followed by a
subclause addressing an XML schema definition.
This document includes vocabulary in common use by the speech and speaker recognition community,
as well as terminology from other ISO standards.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 19794-1, Information technology — Biometric data interchange formats — Part 1: Framework
ISO/IEC 19785-1, Information technology — Common Biometric Exchange Formats Framework — Part 1:
Data element specification
ISO/IEC 2382-37, Information technology — Vocabulary — Part 37: Biometrics
3 Terms and definitions
For the purposes of this document, the terms and definitions in ISO/IEC 19794-1 and the following
apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— IEC Electropedia: available at http://www.electropedia.org/
— ISO Online browsing platform: available at http://www.iso.org/obp
3.1
analog-to-digital converter (ADC) resolution
exponent of the base 2 representation (the number of bits) of the number of discrete amplitudes that
the analog-to-digital converter is capable of producing
Note 1 to entry: Common values for ADC resolution for sound-cards are: 8, 16, 20 and 24.
3.2
audio duration
duration of the complete audio containing all voice representation utterances, e.g. whole call recordings
3.3
audio encoding
encoding used by the data capture subsystem, e.g. cellphone
Note 1 to entry: The voice signal is encoded before being transmitted over a channel. There are many formats
in use today and the number is likely to continue to change as telephones and transmission channels evolve.
Formats include PCM (ITU-T G.711) and ADPCM (ITU-T G.726) for wave encoding and ACELP (ITU-T G.723.1) and CS-ACELP (ITU-T G.729 Annex A) for AbS encoding. A-law PCM and mu-law PCM are included in ITU-T G.711.
Note 2 to entry: A comprehensive overview list is provided in 7.4.3.2.
3.4
compression
process that reduces the size of a digital file and, accordingly, the data rate required for transmission
Note 1 to entry: Some audio encodings include compression and some do not. Compression is almost always
“lossy” and, therefore, has an impact on the speech signal.
3.5
cut-off frequency (lower/upper)
frequency (below/above) which the acoustic energy drops 3 dB below the average energy in the pass band
3.6
far-field
region far enough from the source where the angular field distribution is independent of the distance
from the source
3.7
interactive voice response
IVR
predicate title for a telephony-based computer that is used to control the flow of telephone calls and to provide voice-based self-service
Note 1 to entry: Technology that allows a computer to detect voice and keypad inputs.
Note 2 to entry: IVR systems deal with several real-world and constrained-content effects, such as emotional
voices, varying environmental noises, recording of free speech, but also hotwords (e.g., yes, no, digits, keywords).
Note 3 to entry: IVRs apply ASR for user navigation; in secure applications, e.g. financial transactions via telephone, SIV becomes relevant. IVR systems may combine ASR and SIV to detect audio sample replays and to detect user liveness by presenting one-time generated knowledge to the user that is to be spoken.

3.8
microphone
data capture subsystem that converts the acoustic pressure wave emanating from the voice into an
electrical signal
3.9
mid-field
region between the near-field and the far-field which has a combination of the characteristics found in
both the near-field and the far-field
3.10
near-field
region in an enclosure in which the direct energy at the microphone from the primary source is greater
than the reflected energy from that source
Note 1 to entry: In a free field, the near-field is the region close enough to the source that the angular energy
distribution is dependent upon the distance from the source.
3.11
public switched telephone network
channel based technology used to switch analogue signal, typically telephone calls, through a network
from a source such as a telephone to a destination such as another telephone
Note 1 to entry: Knowledge about the channel where a telephone call originates is useful because, historically,
noise and other channel characteristics vary from country to country. The advent and growth of VoIP and other
digital telephone networks has attenuated the impact of national telecommunications networks because they
are not constrained by national boundaries. For example, a call originating in the United States might traverse
Canada before arriving at its destination, which could be within the United States (also see Voice over IP).
3.12
representation duration
duration of a single voice representation utterance
3.13
sampling rate
number of samples per second (or per other unit) taken from a continuous signal to make a discrete
signal
Note 1 to entry: When the rate is per second, the unit is Hertz (Hz).
Note 2 to entry: Equal to the sampling frequency.
Note 3 to entry: The rate of sampling needs to satisfy the Nyquist criterion.
3.14
session
single capture process that takes place over a single, continuous time period
Note 1 to entry: In database collection, two sessions should have at least 3 weeks to 6 weeks in between, such
that non-contemporary speech can be captured. However, in biometric systems a session can be interpreted
as the time of recording one or more samples without the subject leaving the scene of the biometric capturing
device, i.e. passing through a control stage/barrier marks the end of a session, while multiple rejects can occur
during one session.
3.15
signal-to-encoding noise ratio
SNR
ratio of the pure signal of interest to the noise component that results from possible electronic noise
sources
Note 1 to entry: SNR(dB) = 10 lg (Ps/Pn), where Ps is average signal power and Pn is average noise power, expressed as follows for digitized signals:

    Ps = (1/N) Σ_{i=1..N} s²(i)        Pn = (1/N) Σ_{i=1..N} n²(i)

Note 2 to entry: where N is the total number of digital samples.
Note 3 to entry: Usually measured in decibels (dB).
Note 4 to entry: For example, in PCM the noise is caused by quantization and is roughly calculated in Furui, Digital Speech Processing, Synthesis, and Recognition (Dekker, 1989) as:

    SNR(dB) = 6B − 7.2

Note 5 to entry: where B is the number of quantization bits.
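
As a quick numerical illustration of the definitions above (not part of the standard), the following Python sketch computes SNR(dB) = 10 lg(Ps/Pn) from sample arrays and applies the Furui quantization-noise estimate for B-bit PCM:

```python
# Hedged illustration of 3.15: average-power SNR and the PCM quantization estimate.
import math


def snr_db(signal, noise):
    """SNR(dB) = 10*lg(Ps/Pn), with Ps and Pn the average per-sample powers."""
    ps = sum(s * s for s in signal) / len(signal)
    pn = sum(v * v for v in noise) / len(noise)
    return 10 * math.log10(ps / pn)


def pcm_quantization_snr_db(bits):
    """Furui's rough estimate SNR(dB) = 6B - 7.2 for B quantization bits."""
    return 6 * bits - 7.2


print(pcm_quantization_snr_db(16))   # ~88.8 dB for 16-bit PCM
```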
3.16
speaker identification
form of speaker recognition which compares a voice sample with a set of voice references corresponding
to different persons to determine the one who has spoken
3.17
speaker recognition
process of determining whether two speech segments were produced by the vocal mechanism of the
same data subject
3.18
speaker verification
speaker authentication
form of speaker recognition for deciding whether a speech sample was spoken by the person whose
identity was claimed
Note 1 to entry: Speaker verification is used mainly to restrict access to information, facilities or premises.
3.19
speaker identification and verification
SIV
process of automatically recognizing individuals through voice characteristics
Note 1 to entry: The data format itself does not depend on the application purpose (active/passive SIV).
3.20
voice
speech
sound produced by the vocal apparatus whilst speaking
Note 1 to entry: Normally defined by phoneticians as the sound that emanates from the lips and nostrils, which
comprises "voiced" and "unvoiced" sound produced by the vibration of the vocal folds and from constrictions
within the vocal tract and modified by the time-varying acoustic transfer characteristic of the vocal tract.
Note 2 to entry: For the purposes of this document, speech and voice are used interchangeably.
3.21
speech signal bandwidth
range of speech frequencies between the upper and lower cutoff frequencies that are transmitted or
recorded by a system
3.22
speech recognition
automatic speech recognition
conversion, by a functional unit, of a speech signal to a representation of the content of the speech
Note 1 to entry: The content to be recognized can be expressed as a proper sequence of words or phonemes.

3.23
streaming data
sequence of digitally encoded coherent signals (packets of data) used to transmit or receive information
3.24
text-independent recognizer
text-independent recognition system
speech recognizer that works reliably whether or not the received speech sample corresponds to a
predefined message
3.25
text-dependent recognizer
text-dependent recognition system
speech recognizer that works reliably only when it receives a speech sample corresponding to a
predefined message
3.26
text prompted
SIV technology that requires the data subject to repeat a sequence presented by the SIV system or to
answer a question
Note 1 to entry: A synonym is “challenge-response”.
Note 2 to entry: “Text prompted” is often seen as a kind of text-independent interaction.
3.27
utterance
sequence of continuous speech units (e.g., phonemes, syllables, words) that is bounded by silence
3.28
voice over IP
digitized streaming speech carried over data channels as Internet Protocol packets
3.29
voice prompt
voice-response prompt
spoken message used to guide the user through a dialog with a voice response system
3.30
voice representation
VR
one or more voice utterances
3.31
volume
calculation of the “loudness” of the input signal (including speech)
Note 1 to entry: When it is known, volume is expressed in terms of the International Telecommunications Union's P.56 algorithm[2].
Note 2 to entry: Volume level is a factor in the quality of the input utterances.
4 Abbreviated terms
ADC Analog-to-Digital Converter
ADPCM Adaptive Differential Pulse Code Modulation
ASR Automatic Speech Recognition

bps bits per second
BDIR biometric data interchange record
CS-ACELP Conjugate Structure Algebraic Code Excited Linear Prediction
dB decibels, measured as a ratio between two energy levels (E1 and E2) as 10 lg(E1/E2)
Hz Hertz (units of cycles per second)
ILBC Internet Low Bitrate Codec
IP Internet Protocol
IVR Interactive Voice Response
PCM Pulse Code Modulation
PSTN Public Switched Telephone Network
SIV Speaker Identification and Verification
SNR Signal-to-encoding Noise Ratio (units of dB)
TTS Text-To-Speech
URL Uniform Resource Locator
VAD, SAD Voice Activity Detection, Speech Activity Detection
VR Voice representation
VoIP Voice over IP
W3C World Wide Web Consortium
XML eXtensible Markup Language
5 Conformance
A biometric data record conforms to this document if it satisfies all of the normative requirements
related to:
a) its data structure, data values and the relationships between its XML elements, as specified in
ISO/IEC 19794-1 and throughout Clause 7 of this document; and
b) the relationship between its data values and the input biometric data from which the biometric
data record is generated, as specified throughout Clause 6.
A system that produces biometric data records is conformant to this document if all biometric data
records that it outputs conform to this document (as defined above) as claimed in the Implementation
Conformance Statement associated with that system. A system does not need to be capable of producing
biometric data records that cover all possible aspects of this document, but only those that are claimed
to be supported by the system in the Implementation Conformance Statement.
A system that uses biometric data records is conformant to this document if it can read, and use for
the purpose intended by that system, all biometric data records that conform to this document (as
defined above) as claimed in the Implementation Conformance Statement associated with that system.
A system does not need to be capable of using biometric data records that cover all possible aspects
of this document but only those that are claimed to be supported by the system in an Implementation
Conformance Statement.
NOTE For details on the conformance testing methodology, see Annex A.
6 Processes and identifiers
6.1 Capture processes and utterances
6.1.1 Introduction
This clause defines the fundamental elements of SIV interactions called “capture process”, as defined in
ISO/IEC 2382-37, and the VRs of data subject speech captured during those interactions or “sessions”.
During a capture process, voice sounds not stemming from the targeted speaker may be unintentionally recorded, overlapping or not overlapping the targeted speech sequences; such speech should be considered noise. Compatible capture process structuring and acoustic signal descriptions are required for
interoperability between and among SIV engines.
6.1.2 Voice utterance
A voice utterance is assumed to come from a single speaker for the purpose of recognizing individuals (or to be used to create a reference for future comparisons). In the case that other voices from different
individuals are included within the utterance, this information should be considered as noise, which
might affect the SIV system. It is not the purpose of this document to specify how voice utterances will
be demarcated, but they will generally be separated by: 1) a change in or repeat of a prompt; or 2) a
pause of far longer duration than the inter-syllabic rate. There is no minimum or maximum length to a
voice utterance.
6.1.3 Structure of a capture process
An SIV capture process is a verbal interaction which may be used for biometric enrolment, verification
or identification that is conducted with a data subject by an automated system or another human. In
general, a capture process may include background noise possibly from human sources.
SIV interactions as capture processes can be active or passive (the user is or is not aware of the capture process), with or without behavioural adaptation of users (friendly/frequent users tend to adapt for performance purposes), and further with cooperative (friendly) and non-cooperative users.
An SIV capture process is known as a session. Example in Figure 3: the recording sample may cover
the whole-call utterance of the enrolment call as well as single prompt utterances. An utterance is a
continuous flow of vocalization stemming from one speaker; it may contain inter-syllabic or inter-word
silence, and is bounded by pauses. Pauses are suspension of vocalization of perceptible duration, which
are longer than inter-syllabic or inter-word silence, i.e. human-perceptible silence.
NOTE 1 Speech and non-speech sounds are uttered by biometric subjects and can be used for SIV purposes.
Usually, an utterance is demarcated as an uninterrupted chain of speech; however, applications can also make use of sub-utterances for VRs.
NOTE 2 Non-speech sounds do not indicate a suspension of vocalization.
NOTE 3 Utterances can cover temporary stops in action of speech, such as temporary interrupts, since the
human perception may arguably still be “listening” rather than perceiving a suspension of vocalization.
A single capture process generally takes place over a single, continuous time period (or “session”) and
contains one or more utterances of voice data, known as voice representations (VR). A VR contains
primarily the voice of one speaker and may be initiated by a prompt to the data subject requesting a
response. Figure 1 illustrates a simple verification capture process with the voice utterance initiated by
a prompt from an interactive voice response (IVR) system.

Figure 1 — Capture process 1: Basic speaker verification capture process in a text prompt
technology
The capture process in Figure 1 represents a single session, which may contain one or two utterances
of the speech of speaker A. Figure 2 shows these possibilities, either as one representation or as two representations.
a) As one representation b) As two representations
Figure 2 — Voice representations from voice utterances of capture process 1
This is an example from an access control application. In this example, the first voice utterance is the
claimed reference pointer (“claim of identity”) by the data subject “speaker A”. A speaker independent
automated speech recognition (ASR) system might be used to extract the content from the first
utterance to determine the reference pointer. The second utterance is the “text-dependent” passphrase
required to verify the claim using the stored voice model of the reference pointer. The capture process
in Figure 1 would not need to change for data subjects interacting with humans (e.g., a call centre
agent). Variants of capture process 1 include asking or allowing the data subject to input the reference
pointer (account number) manually (e.g., using the touchtone keypad of the telephone). Prompts can
be presented as audio by playing one or more sound files or by generating a TTS output for an internal
string. Prompts may be presented as text displays (e.g. on PDAs, mobile, or smart devices).
From the data subjects’ perspective, the simplest active SIV capture process would contain only one
utterance. In capture process 1, this can be accomplished in two ways. Some applications use caller ID
and/or other methods to implicitly establish the claim of identity. The result is a one-utterance capture
process (utterance 2 only). The capture process may also be reduced to a single utterance (utterance 1
only) when ASR is used. In that utterance the IVR asks speaker A to say the account number. ASR
decodes the digits and uses them to retrieve the biometric reference. Then it sends the same input to
the SIV engine for biometric verification.
NOTE As Figure 3 reveals, the same capture process and utterance structure can also be used for enrolment.

Figure 3 — Capture process 2: Enrolment
This capture process contains five utterances of speaker A. It first establishes the pointer to the claimed
reference, which is followed by four repetitions of the passphrase prompted by a tone. The voice data
acquired in these utterances compose the VRs, which are primary XML elements in the voice data BDB.
6.2 Registered format type identifiers
The registration listed in Table 1 has been made with the CBEFF registration process to identify the
voice data record format. The CBEFF definition shall be in accordance with ISO/IEC 19785-1. The format
owner is ISO/IEC JTC 1/SC 37 with the registered format owner identifier 31 (001F Hex).
Table 1 — Format Type Identifiers

CBEFF BDB format type identifier | Short name | Full object identifier
257 (0101 Hex) | voice-data | {iso(1) registration-authority(1) cbeff(19785) biometric-organization(0) jtc1-sc37(257) bdbs(0) voice-data(31)}
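
For readers working in code, the identifiers of 6.2 and Table 1 translate directly into constants; the names below are illustrative, only the numeric values come from the standard:

```python
CBEFF_FORMAT_OWNER_ID = 0x001F   # 31, ISO/IEC JTC 1/SC 37
CBEFF_BDB_FORMAT_TYPE = 0x0101   # 257, voice-data

assert CBEFF_FORMAT_OWNER_ID == 31 and CBEFF_BDB_FORMAT_TYPE == 257
```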
7 General voice data interchange format (BDB)
7.1 Overview
This document will be implemented only in XML. This clause discusses the voice-data-specific header in the biometric data interchange record (BDIR), containing information about the VR collection conditions and any post-collection processing. It is not the purpose of this document to specify the data capture environment, the methods of data capture or any pre-processing (e.g. detection/segmentation, pre-emphasis filtering) performed on the utterances of voice data comprising the capture process.
The structure of XML elements is depicted in Figure 4. The record format is as follows:
— a Voice Record General Header containing information about the overall record (7.3),
— a representation element for each VR (7.4).
Each VR shall consist of:
— a VR header containing information about the data for a single representation,

— a VR data field,
where each header contains an element for extended vendor data (see Tables 2 and 6).
Figure 4 — Structure of XML elements
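
The nesting shown in Figure 4 can be sketched with the Python standard library. The element names below are only assumptions that follow the naming convention of 7.2; the normative element names and types are those of the schema in 7.6.

```python
# Non-normative sketch of the Figure 4 nesting, using xml.etree.ElementTree.
import xml.etree.ElementTree as ET

record = ET.Element("VoiceRecord")                              # hypothetical name

general = ET.SubElement(record, "VoiceRecordGeneralHeader")     # 7.3
ET.SubElement(general, "Version").text = "example"              # placeholder value
ET.SubElement(general, "Channel").text = "Unknown"              # default (7.3.4.2)

representation = ET.SubElement(record, "VoiceRepresentation")   # 7.4
vr_header = ET.SubElement(representation, "VoiceRepresentationHeader")
ET.SubElement(vr_header, "AudioContent")                        # mandatory (Table 6)
ET.SubElement(representation, "VoiceRepresentationData")        # 7.5

print(ET.tostring(record, encoding="unicode"))
```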
7.2 Conventions
Elements may be simple or complex. Complex elements contain other elements.
Elements may be mandatory or optional. Optional complex elements may contain both optional and
mandatory elements and characteristics.
The naming convention for XML elements and characteristics used in this format shall consist of capital
and small letters, such as NumberofVRs, with no hyphens or spaces. The printing convention for valid
string values is to enclose each valid value in quotes.
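
The naming convention can be checked mechanically; the helper below is illustrative and simply enforces "letters only, no hyphens or spaces" as stated above.

```python
import re


def follows_naming_convention(name: str) -> bool:
    """True if the name uses capital and small letters only (no hyphens or spaces)."""
    return re.fullmatch(r"[A-Za-z]+", name) is not None


assert follows_naming_convention("NumberofVRs")
assert not follows_naming_convention("Number of VRs")
```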
7.3 Voice record general header
7.3.1 Overview
The header for the voice record elements is given below as Table 2. The first three fields of the voice
records element schema are taken directly from ISO/IEC 19794-1. The remaining six fields are each
complex elements, serving as the default for the capture process. Details within each representation,
however, may vary. Therefore, the field values in the headers of the various VRs may be different from
those in the general header.
Table 2 — Voice record general header

Field | Clause | Item type | Valid values | Optional/Mandatory
Version | 7.3.2 | VersionType | see ISO/IEC 19794-1/Amd 2 | M
Session ID | 7.3.3 | string | no limit string | O
Channel | 7.3.4 | ChannelType | see Table 3 | M
Capture device | 7.3.5 | CaptureDeviceModelID | see ISO/IEC 19794-1/Amd 2 | O
Transducer | 7.3.6 | TransducerType | see Table 4 | O
Audio meta information | 7.3.7 | AudioMetaInformationType | see Table 5 | M
Capture process protocol | 7.3.8 | CaptureProcessProtocolType | no limit string | O
Extended vendor data | 7.3.9 | VendorSpecificDataType | see ISO/IEC 19794-1/Amd 2, max. 256 | O
7.3.2 Version
Version of the associated entity (e.g., CBEFF version, patron/ data format specification).
[SOURCE: ISO/IEC 19794-1]
7.3.3 Session ID
Application-specific session identifier.
7.3.4 Channel
7.3.4.1 Overview
The Channel element shall describe the fields of the default channel from which the data were captured.
Table 3 — Description of the functionality of “ChannelType”

Field | Clause | Item type | Valid values | Optional/Mandatory
Type | 7.3.4.2 | string | “Unknown”, “Analog”, “Digital”, “NonVoIP”, “DigitalVoIP”, “Mixed” | M
Cutoff upper frequency | 7.3.4.3 | numeric | 0 – 65535 | O
Cutoff lower frequency | 7.3.4.3 | numeric | 0 – 65535 | O
Country of origin | 7.3.4.4 | string | 3 character string | O
7.3.4.2 Type
Type shall specify the kind of channel over which the data were captured. Types are Analog, Digital
Non-VoIP, Digital VoIP, Mixed and Unknown. The default value is “Unknown”.
7.3.4.3 Cutoff upper frequency and cutoff lower frequency
The voice elements record schema shall have an indicator of the upper and lower cutoff frequencies of
the audio data. Both upper and lower cutoff frequencies shall be the integer that best represents the
frequencies on the upper and lower ends of the audio band at which energy has fallen 3 dB below the
average band energy. There is no default value. The value shall be 0 if unknown.
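
The standard only defines the two fields; one plausible (non-normative) way to estimate them from a signal, assuming NumPy and SciPy are available, is to locate the band edges where the power spectral density falls 3 dB below the average in-band level:

```python
# Hedged sketch: estimate lower/upper cutoff frequencies from samples.
import numpy as np
from scipy.signal import welch


def estimate_cutoffs(samples, sampling_rate_hz):
    freqs, psd = welch(samples, fs=sampling_rate_hz)
    psd_db = 10 * np.log10(psd + 1e-20)
    threshold_db = 10 * np.log10(np.mean(psd) + 1e-20) - 3.0   # 3 dB below average
    in_band = np.where(psd_db >= threshold_db)[0]
    if in_band.size == 0:
        return 0, 0                                            # 0 = unknown
    return int(round(freqs[in_band[0]])), int(round(freqs[in_band[-1]]))
```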

7.3.4.4 Country of origin
The Country element shall identify the country of origin of the channel, if known.
Country code of origin should be represented by an alpha code that complies with the two-letter country
code of ISO 3166-1, which supports three kinds of country codes: two-letter, three-letter, and numeric.
7.3.5 Capture device
Registered identifier of the type of device used to capture the biometric data (BDIR):
— identifies the device vendor;
— identifies the specific device type (e.g. maps to model).
[SOURCE: ISO/IEC 19794-1]
7.3.6 Transducer
The Capture Technology ID is a simple type for voice data that defines the characteristics of the signal
collection transducer.
7.3.6.1 Overview
The transducer field shall specify the input device employed by the data subject. It is recognized that
complex collection systems may consist of multiple transducers, to which the elements of this clause
may not apply. In such cases, “unknown” is the default value.
NOTE This element is intended primarily to support R&D and engines that require device registration.
Table 4 — Description of the functionality of “TransducerType”

Field | Clause | Item type | Valid values | Optional/Mandatory
Capture technology ID | 7.3.6.2 | string | “Telephone”, “Microphone”, “Handheld”, “Mobile phone”, “Stethoscope”, “Other”, “Unknown” | O
Microphone type | 7.3.6.3 | string | “Carbon”, “Electret”, “Other”, “Unknown” | O
Manufacturer | 7.3.6.4 | string | no limit string | O
Model | 7.3.6.5 | string | no limit string | O
Mic cutoff upper | 7.3.6.6 | numeric | 0 – 65535 | O
Mic cutoff lower | 7.3.6.6 | numeric | 0 – 65535 | O
Device info | 7.3.6.7 | string | no limit string | O

7.3.6.2 Capture technology ID
The voice record elements schema shall have a Capture Technology ID to specify the kind of input device
used, if known. The default value is “telephone”.
7.3.6.3 Microphone type
The voice record elements schema shall indicate the type of microphone used in the input device, if
known. Permitted values are carbon, electret, other, and unknown.
7.3.6.4 Manufacturer
The manufacturer field shall be a string identifying the manufacturer of the data subject’s input device.
7.3.6.5 Model
The model field shall be a string identifying the model of the data subject’s input device.
7.3.6.6 Mic cutoff upper and mic cutoff lower
The optional upper and lower cutoff frequencies shall both be an integer that best represents the
frequencies on the upper and lower ends at which the capacity for energy conversion of the microphone
has fallen 3 dB below the average band energy. There is no default value, but 0 shall indicate that the
information is unknown.
7.3.6.7 Device info
Device info shall be reserved for additional information about the device, but not about the capture
process or the data subject. It shall be limited to data that a recipient SIV engine or application is able to
discern and use.
7.3.7 Audio meta information
7.3.7.1 Overview
This clause gives the technical specifications of the signal process used to capture all VRs in the record.
Table 5 — Description of the functionality of “AudioMetaInformationType”

Field | Clause | Item type | Valid values | Optional/Mandatory
Channel count | 7.3.7.2 | numeric | 1 – 15 | M
Sampling rate | 7.3.7.3 | numeric | 0 – 128000 | M
Bits per sample | 7.3.7.4 | numeric | 0 – 255 | M
Audio duration | 7.3.7.5 | numeric | built-in type | M
7.3.7.2 Channel count
The voice elements record schema shall have a Channel Count field. This integer element gives the
number of channels in the input stream. The default value shall be 1.
7.3.7.3 Sampling rate
The voice elements record schema shall have an integer characteristic giving the number of samples
per second at which the original audio input stream was sampled.

7.3.7.4 Bits per sample
The voice elements record schema shall have an integer Bits per Sample characteristic. This integer
gives the bit depth of a single sample of the audio signal. For formats that use variable bit depth, like
Ogg Vorbis, this element is set to 0.
7.3.7.5 Audio duration
Audio Duration is an integer value that indicates duration of the utterance in milliseconds.
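
As an illustration only, the four fields of Table 5 can be filled from a WAV file with the Python standard library (WAV is just one possible source encoding; a conforming record would be expressed through the XML schema in 7.6):

```python
# Illustrative extraction of AudioMetaInformation fields from a WAV file.
import wave


def audio_meta_information(path):
    with wave.open(path, "rb") as wav:
        return {
            "ChannelCount": wav.getnchannels(),                               # 7.3.7.2
            "SamplingRate": wav.getframerate(),                               # 7.3.7.3
            "BitsPerSample": wav.getsampwidth() * 8,                          # 7.3.7.4
            "AudioDuration": int(1000 * wav.getnframes() / wav.getframerate()),  # 7.3.7.5, ms
        }
```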
7.3.8 Capture process protocol
Capture Process Protocol shall be reserved for additional information about the capture process, but
not about the data subject or data capture device. It shall be limited to data that a recipient SIV engine
or application is able to discern and use.
7.3.9 Extended vendor data
This is used when non-standardized data, proprietary to a vendor/product, needs to be included.
[SOURCE: ISO/IEC 19794-1]
7.4 Voice representation header
7.4.1 Overview
The VR shall be the child of the capture process element that contains the elements and fields that may
change in the course of a capture process. There shall be a minimum of one representation for each
capture process. The VR elements are shown in Table 6.
NOTE Information regarding the spoken text, language, dialects, or a subject’s gender is not considered for VR elements at all. If this or other information can aid the recognition process, analysts can use ASR, Automatic Language Recognition (ALR), or Automatic Gender Detection (AGD) software.
Table 6 — Voice representation header

Field | Clause | Item type | Valid values | Optional/Mandatory
Date and time | 7.4.2 | DateAndTimeType | see Table 7 | O
Audio content | 7.4.3 | AudioContentType | see Table 8 | M
Quality | 7.4.4 | VRQualityType | see Table 10 | O
Signal enhancement | ...


Frequently Asked Questions

ISO/IEC 19794-13:2018 is an International Standard published jointly by ISO and IEC. Its full title is "Information technology - Biometric data interchange formats - Part 13: Voice data"; its scope is given in the abstract at the top of this page.

ISO/IEC 19794-13:2018 is classified under the following ICS (International Classification for Standards) categories: 35.040 - Information coding; 35.240.15 - Identification cards. Chip cards. Biometrics. The ICS classification helps identify the subject area and facilitates finding related standards.

You can purchase ISO/IEC 19794-13:2018 directly from iTeh Standards. The document is available in PDF format and is delivered instantly after payment. Add the standard to your cart and complete the secure checkout process. iTeh Standards is an authorized distributor of ISO standards.
