ISO/IEC 23092-1:2020
Information technology — Genomic information representation — Part 1: Transport and storage of genomic information
This document specifies data formats for both transport and storage of genomic information, including the conversion process.
Technologie de l'information — Représentation des informations génomiques — Partie 1: Transport et stockage des informations génomiques
Standards Content (Sample)
INTERNATIONAL STANDARD
ISO/IEC 23092-1
Second edition
2020-10
Information technology — Genomic information representation — Part 1: Transport and storage of genomic information
Technologie de l'information — Représentation des informations génomiques — Partie 1: Transport et stockage des informations génomiques
© ISO/IEC 2020
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Mathematical operators
4.1 Arithmetic operators
4.2 Logical operators
4.3 Relational operators
4.4 Bitwise operators
4.5 Assignment
4.6 Unary operators
5 Structure of coded genomic data
5.1 Genomic records
5.2 Data classes
5.3 Access units
5.4 Datasets
5.5 Selective access
6 Data format
6.1 Format structure
6.1.1 General
6.1.2 Box order
6.2 Syntax and semantics
6.2.1 Method of specifying syntax in tabular form
6.2.2 Bit ordering
6.2.3 Specification of syntax functions
6.3 Syntax for representation
6.4 Output data unit
6.5 Data structures common to file format and transport format
6.5.1 File header
6.5.2 Dataset group
6.5.3 Dataset
6.5.4 Access unit
6.5.5 Block
6.6 Data structures specific to file format
6.6.1 General
6.6.2 Indexing
6.6.3 Descriptor stream
6.6.4 Offset
6.7 Data structures specific to transport format
6.7.1 General
6.7.2 Data streams
6.7.3 Dataset mapping table list
6.7.4 Dataset mapping table
6.7.5 Packet
6.8 Reference procedure to convert transport format to file format
Annex A (informative) IETF RFC 3986 specification summary
Annex B (informative) Selective access strategies
Annex C (informative) Depacketization process
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that
are members of ISO or IEC participate in the development of International Standards through
technical committees established by the respective organization to deal with particular fields of
technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other
international organizations, governmental and non-governmental, in liaison with ISO and IEC, also
take part in the work.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for
the different types of document should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject
of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent
rights. Details of any patent rights identified during the development of the document will be in the
Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents) or the IEC
list of patent declarations received (see http://patents.iec.ch).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This second edition cancels and replaces the first edition (ISO/IEC 23092-1:2019), which has been
technically revised.
The main changes compared to the previous edition are as follows:
— reference box syntax and semantics have been updated;
— syntax, semantics and decoding process for cluster signatures have been fixed;
— the scope of some parameters has been changed from dataset_header to dataset_parameter_set;
— new dataset_group_ID and dataset_ID fields have been added to the metadata and protection boxes;
— minor fixes in transport format;
— editorial changes.
A list of all parts in the ISO/IEC 23092 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.
Introduction
The advent of high-throughput sequencing (HTS) technologies has the potential to boost the adoption
of genomic information in everyday practice, ranging from biological research to personalized genomic
medicine in clinics. As a consequence, the volume of generated data has increased dramatically during
the last few years, and an even more pronounced growth is expected in the near future.
At the moment, genomic information is mostly exchanged through a variety of data formats, such as
FASTA/FASTQ for unaligned sequencing reads and SAM/BAM/CRAM for aligned reads. With respect to
such formats, the ISO/IEC 23092 series provides a new solution for the representation and compression
of genome sequencing information by:
— Specifying an abstract representation of the sequencing data rather than a specific format with its
direct im
...