Information technology - Coding of audio-visual objects - Part 15: Carriage of network abstraction layer (NAL) unit structured video in ISO base media file format

ISO/IEC 14496-15:2014 specifies the storage format for streams of video that is structured as Network Abstraction Layer (NAL) Units, such as Advanced Video Coding, AVC (ISO/IEC 14496-10) and High Efficiency Video Coding, HEVC (ISO/IEC 23008-2) video streams.

Technologies de l'information — Codage des objets audiovisuels — Partie 15: Transport de vidéo structuré en unités NAL au format ISO de base pour les fichiers médias

General Information

Status: Withdrawn
Publication Date: 23-Jun-2014
Withdrawal Date: 23-Jun-2014
Current Stage: 9599 - Withdrawal of International Standard
Start Date: 23-Feb-2017
Completion Date: 30-Oct-2025

Standard
ISO/IEC 14496-15:2014 - Information technology -- Coding of audio-visual objects
English language
114 pages

Frequently Asked Questions

ISO/IEC 14496-15:2014 is a standard published by the International Organization for Standardization (ISO). Its full title is "Information technology - Coding of audio-visual objects - Part 15: Carriage of network abstraction layer (NAL) unit structured video in ISO base media file format". This standard covers: ISO/IEC 14496-15:2014 specifies the storage format for streams of video that is structured as Network Abstraction Layer (NAL) Units, such as Advanced Video Coding, AVC (ISO/IEC 14496-10) and High Efficiency Video Coding, HEVC (ISO/IEC 23008-2) video streams.

ISO/IEC 14496-15:2014 is classified under the following ICS (International Classification for Standards) categories: 35.040 - Information coding; 35.040.40 - Coding of audio, video, multimedia and hypermedia information. The ICS classification helps identify the subject area and facilitates finding related standards.

ISO/IEC 14496-15:2014 is linked to the following standards: ISO/IEC 14496-15:2017, ISO/IEC 14496-15:2010/Amd 1:2011, ISO/IEC 14496-15:2010/Cor 2:2012, and ISO/IEC 14496-15:2010. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.

Standards Content (Sample)


INTERNATIONAL ISO/IEC
STANDARD 14496-15
Third edition
2014‐07‐01
Information technology — Coding of audio-
visual objects —
Part 15:
Carriage of network abstraction layer
(NAL) unit structured video in ISO base
media file format
Technologies de l'information — Codage des objets audiovisuels —
Partie 15: Transport de vidéo structuré en unités NAL au format ISO de base
pour les fichiers médias
Reference number
ISO/IEC 14496‐15:2014(E)
© ISO/IEC 2014
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form or by any
means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission.
Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
Case postale 56  CH‐1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E‐mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO/IEC 2014 – All rights reserved

Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
3.2 Abbreviated terms
4 General Definitions
4.1 Introduction
4.2 Elementary stream structure
4.3 Sample and Configuration definition
4.4 Video Track Structure
4.5 Template fields used
4.6 Visual width and height
4.7 Decoding time (DTS) and composition time (CTS)
4.8 Sync sample (IDR)
4.9 Shadow sync
4.10 Sample groups on random access recovery points and random access points
4.11 Hinting
5 AVC elementary streams and sample definitions
5.1 Introduction
5.2 Elementary stream structure
5.3 Sample and Configuration definition
5.4 Derivation from ISO Base Media File Format
6 SVC elementary stream and sample definitions
6.1 Introduction
6.2 Elementary stream structure
6.3 Use of the plain AVC file format
6.4 Sample and configuration definition
6.5 Derivation from the ISO base media file format
7 MVC elementary stream and sample definitions
7.1 Introduction
7.2 Overview of MVC Storage
7.3 MVC Track Structure
7.4 Use of the plain AVC File Format
7.5 Sample and configuration definition
7.6 Derivation from the ISO base media file format
7.7 MVC specific information boxes
8 HEVC elementary streams and sample definitions
8.1 Introduction
8.2 Elementary Stream Structure
8.3 Sample and configuration definition
8.4 Derivation from ISO base media file format
Annex A (normative) In-stream structures specific to SVC and MVC
Annex B (normative) SVC and MVC sample group and sub-track definitions
Annex C (normative) Temporal metadata support
Annex D (normative) File format toolsets
Annex E (normative) Sub-parameters for the MIME type ‘Codecs’ parameter
Annex F (informative) Patent Statements

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international
organizations, governmental and non‐governmental, in liaison with ISO and IEC, also take part in the
work. In the field of information technology, ISO and IEC have established a joint technical committee,
ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting.
Publication as an International Standard requires approval by at least 75 % of the national bodies
casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC 14496‐15 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This third edition cancels and replaces the second edition (ISO/IEC 14496‐15:2010), which has been
technically revised. It also incorporates the Amendment ISO/IEC 14496‐15:2010/Amd.1:2011 and the
Technical Corrigenda ISO/IEC 14496‐15:2010/Cor.1:2011 and ISO/IEC 14496‐15:2010/Cor.2:2012.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding
of audio-visual objects:
• Part 1: Systems
• Part 2: Visual
• Part 3: Audio
• Part 4: Conformance testing
• Part 5: Reference software
• Part 6: Delivery Multimedia Integration Framework (DMIF)
• Part 7: Optimized reference software for coding of audio-visual objects [Technical Report]
• Part 8: Carriage of ISO/IEC 14496 contents over IP networks
• Part 9: Reference hardware description [Technical Report]
• Part 10: Advanced Video Coding
• Part 11: Scene description and application engine
• Part 12: ISO base media file format
• Part 13: Intellectual Property Management and Protection (IPMP) extensions
• Part 14: MP4 file format
• Part 15: Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format
• Part 16: Animation Framework eXtension (AFX)
• Part 17: Streaming text format
• Part 18: Font compression and streaming
• Part 19: Synthesized texture stream
• Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
• Part 21: MPEG-J Graphics Framework eXtension (GFX)
• Part 22: Open Font Format
• Part 23: Symbolic Music Representation
• Part 24: Audio and systems interaction
• Part 25: 3D Graphics Compression Model
• Part 26: Audio conformance
• Part 27: 3D Graphics conformance
• Part 28: Composite font representation

Introduction
This part of ISO/IEC 14496 defines a storage format based on, and compatible with, the ISO Base Media
File Format (ISO/IEC 14496‐12 and ISO/IEC 15444‐12), which is used by the MP4 file format
(ISO/IEC 14496‐14) and the Motion JPEG 2000 file format (ISO/IEC 15444‐3) among others. This part
of ISO/IEC 14496 enables video streams formatted as Network Abstraction Layer Units (NAL Units) to
• be used in conjunction with other media streams, such as audio,
• be used in an MPEG‐4 systems environment, if desired,
• be formatted for delivery by a streaming server, using hint tracks, and
• inherit all the use cases and features of the ISO Base Media File Format on which MP4 and MJ2 are based.
This part of ISO/IEC 14496 may be used as a standalone specification; it specifies how NAL unit
structured video content shall be stored in an ISO Base Media File Format compliant format. However, it
is normally used in the context of a specification, such as the MP4 file format, derived from the ISO Base
Media File Format, that permits the use of NAL unit structured video such as AVC (ISO/IEC 14496‐10)
video and High Efficiency Video Coding (HEVC, ISO/IEC 23008‐2) video.
The ISO Base Media File Format is becoming increasingly common as a general‐purpose media
container format for the exchange of digital media, and its use in this context should accelerate both
adoption and interoperability.
The International Organization for Standardization (ISO) and International Electrotechnical
Commission (IEC) draw attention to the fact that it is claimed that compliance with this document may
involve the use of a patent.
The ISO and IEC take no position concerning the evidence, validity and scope of this patent right.
The holder of this patent right has assured the ISO and IEC that he is willing to negotiate licences under
reasonable and non‐discriminatory terms and conditions with applicants throughout the world. In this
respect, the statement of the holder of this patent right is registered with the ISO and IEC. Information
may be obtained from the companies listed in Annex F.
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights other than those identified in Annex F. ISO and IEC shall not be held responsible for
identifying any or all such patent rights.

INTERNATIONAL STANDARD ISO/IEC 14496-15:2014(E)

Information technology — Coding of audio-visual objects —
Part 15:
Carriage of network abstraction layer (NAL) unit structured video
in the ISO base media file format
1 Scope
This part of ISO/IEC 14496 specifies the storage format for streams of video that is structured as NAL
Units, such as AVC (ISO/IEC 14496‐10) and HEVC (ISO/IEC 23008‐2) video streams.
2 Normative references
The following documents, in whole or in part, are normatively referenced in this document and are
indispensable for its application. For dated references, only the edition cited applies. For undated
references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 14496‐10, Information technology — Coding of audio-visual objects — Part 10: Advanced Video
Coding
ISO/IEC 14496‐12, Information technology — Coding of audio-visual objects — Part 12: ISO base media
file format 1)
ISO/IEC 23008‐2, Information technology — High efficiency coding and media delivery in heterogeneous
environments — Part 2: High efficiency video coding
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO/IEC 14496‐10 or
ISO/IEC 23008‐2, and the following apply.
3.1.1
aggregator
in‐stream structure using a NAL unit header
NOTE Aggregators are used to group NAL units belonging to the same sample.

1) ISO/IEC 14496‐12 is technically identical to ISO/IEC 15444‐12.

3.1.2
AVC base layer
maximum subset of a bitstream that is AVC compatible (i.e. a bitstream not using any of the
functionality of ISO/IEC 14496‐10 Annex G or Annex H)
NOTE 1 The AVC base layer is represented by AVC VCL NAL units and associated non‐VCL NAL units.
NOTE 2 The AVC base layer itself can be a temporal scalable bitstream.
3.1.3
AVC NAL unit
AVC VCL NAL unit and its associated non‐VCL NAL units in a bitstream
3.1.4
AVC VCL NAL unit
NAL unit with type 1 to 5 (inclusive) as specified in ISO/IEC 14496‐10
3.1.5
extraction path
set of operations on the original bitstream, each yielding a subset bitstream, ordered such that the
complete bitstream is first in the set, and the base layer is last, and all the bitstreams are in decreasing
complexity (along one of the scalability axes, such as resolution), and where every bitstream is a valid
operating point
NOTE An extraction path may be represented by the values of priority_id in the NAL unit headers. Alternatively an
extraction path can be represented by the run of tiers or by a set of hierarchically dependent tracks.
3.1.6
extractor
in‐stream structure using a NAL unit header including a NAL unit header extension
NOTE Extractors contain instructions on how to extract data from other tracks. Logically an Extractor can be seen as a
‘link’. While accessing a track containing Extractors, the Extractor is replaced by the data it is referencing.
3.1.7
in-stream structure
structure residing within sample data
3.1.8
MVC VCL NAL unit
NAL unit with type 20, and NAL units with type 14, as specified in ISO/IEC 14496‐10, when the
immediately following NAL units are AVC VCL NAL units.
NOTE MVC VCL NAL units do not affect the decoding process of a legacy AVC decoder.
3.1.9
operating point
subset of a scalable bitstream, representing in SVC a particular spatial resolution, temporal resolution,
and quality, or in MVC a set of target output views
NOTE 1 Each operating point consists of all the data needed to decode this particular bitstream subset.
NOTE 2 In an SVC stream an operating point can be represented either by (i) specific values of DTQ (dependency_id,
temporal_id and quality_id) or (ii) specific values of P (priority_id) or (iii) combinations of them (e.g. PDTQ). Note that the
usage of priority_id is defined by the application. In an SVC file a track represents one or more operating points. Within a track
tiers may be used to define multiple operating points.
NOTE 3 The bitstream subset of an MVC operating point represents a particular set of target output views at a particular
temporal resolution, and consists of all the data needed to decode this particular bitstream subset.
NOTE 4 An operating point is referred to as an operation point in Annex H of ISO/IEC 14496‐10 or in ISO/IEC 23008‐2.
3.1.10
parameter set
video parameter set, sequence parameter set, or picture parameter set, as defined in the applicable
video standard (e.g. ISO/IEC 14496‐10 or ISO/IEC 23008‐2)
NOTE This term is used to refer to all types of parameter sets.
3.1.11
parameter set elementary stream
elementary stream containing samples made up of only sequence and picture parameter set NAL units
synchronized with the video elementary stream
3.1.12
prefix NAL unit
NAL units with type 14 as specified in ISO/IEC 14496‐10
NOTE Prefix NAL units provide scalability information about AVC VCL NAL units and filler data NAL units. Prefix NAL
units do not affect the decoding process of a legacy AVC decoder. The behaviour of a legacy AVC file reader as a response to
prefix NAL units is undefined.
3.1.13
scalable layer; layer
set of VCL NAL units with the same values of dependency_id, quality_id, and temporal_id, and the
associated non‐VCL NAL units as specified in ISO/IEC 14496‐10.
NOTE 1 A scalable layer with any of dependency_id, quality_id, and temporal_id not equal to 0 enhances the video by one
or more scalability levels in at least one direction (temporal, quality or spatial resolution)
NOTE 2 SVC uses a “layered” encoder design which results in a bitstream representing “coding layers”. In some
publications the ‘base layer’ is the first quality layer of a specific coding layer. In some publications the base layer is the
scalable layer with the lowest priority. The SVC file format uses “scalable layer” or “layer” in a general way for describing
nested bitstreams (using terms like AVC base layer or SVC enhancement layer).
3.1.14
scalable layer representation
bitstream subset that is required for decoding the scalable layer, consisting of the scalable layer itself
and all the scalable layers on which the scalable layer depends
NOTE A scalable layer representation is also referred to as the representation of the scalable layer.
3.1.15
sub-picture
proper subset of coded slices of a layer representation
3.1.16
sub-picture tier
tier that consists of sub‐pictures
NOTE Any coded slice that is not included in the tier representation of a sub‐picture tier is not to be referred to in inter
prediction or inter‐layer prediction for decoding of the sub‐picture tier.
3.1.17
SVC enhancement layer
layer that specifies a part of a scalable bitstream that enhances the video
NOTE 1 An SVC enhancement layer is represented by SVC VCL NAL units and the associated non‐VCL NAL units and SEI
messages.
NOTE 2 Usually an SVC enhancement layer represents a spatial or coarse‐grain scalability (CGS) coding layer (identified by
a specific value of dependency_id).
3.1.18
SVC NAL unit
SVC VCL NAL unit and its associated non‐VCL NAL units in an SVC stream
3.1.19
SVC stream
bitstream represented by the operating point for which dependency_id is equal to mDid, temporal_id is
the greatest temporal_id value among mOpSet, and quality_id is the greatest quality_id value among
mOpSet, where the greatest value of dependency_id of all the operating points represented by DTQ
(dependency_id, temporal_id and quality_id) combinations is equal to mDid, and the set of all the
operating points with dependency_id equal to mDid is mOpSet.
NOTE The term “SVC stream” is referenced by ‘decoding/accessing the entire stream’ in this document. There may be
NAL units which are not required for decoding this operating point.
3.1.20
SVC VCL NAL unit
NAL unit with type 20, and NAL units with type 14 when the immediately following NAL units are AVC
VCL NAL units
NOTE SVC VCL NAL units do not affect the decoding process of a legacy AVC decoder.
3.1.21
temporal layer representation; representation of a temporal layer
temporal layer and all lower temporal layers
3.1.22
tier
set of operating points within a track, providing information about the operating points and
instructions on how to access the corresponding bitstream portions (using maps and groups)
NOTE 1 A tier represents one or more scalable layers of an SVC bitstream.
NOTE 2 The term “tier” is used to avoid confusion with the frequently used term layer. A tier represents a subset of a track
and represents an operating point of an SVC bitstream. Tiers in a track subset the entire track, no matter whether the track
references another track by extractors.
NOTE 3 An MVC tier represents a particular set of temporal subsets of a particular set of views.
3.1.23
tier representation; representation of the tier
bitstream subset that is required for decoding the tier, consisting of the tier itself and all the tiers on
which the tier depends
3.1.24
video elementary stream
elementary stream containing access units made up of NAL units for coded picture data
3.1.25
virtual base view
AVC compatible representation of an independently coded non‐base view
NOTE The virtual base view of an independently coded non‐base view is created according to the process specified in
H.8.5.5 of ISO/IEC 14496‐10. Samples containing data units of an independently coded non‐base view and samples of the
virtual base view are aligned by decoding times.
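As an informal illustration of the NAL unit type values used in the definitions of 3.1 (types 1 to 5 for AVC VCL NAL units, type 14 for prefix NAL units, type 20 for SVC and MVC VCL NAL units), the following Python sketch reads the type from the one-byte NAL unit header of ISO/IEC 14496‐10, where nal_unit_type occupies the five least significant bits. The function names are hypothetical, and the sketch does not apply to the HEVC NAL unit header. It also does not tell SVC from MVC for type 20, and it does not check which NAL units follow a type 14 unit.

```python
def avc_nal_unit_type(nal: bytes) -> int:
    """Return nal_unit_type from the first byte of an AVC NAL unit.

    Assumes the one-byte AVC header layout: 1 bit forbidden_zero_bit,
    2 bits nal_ref_idc, 5 bits nal_unit_type.
    """
    if not nal:
        raise ValueError("empty NAL unit")
    return nal[0] & 0x1F


def classify_avc_nal_unit(nal: bytes) -> str:
    """Map a NAL unit to the categories named in 3.1 (hypothetical labels)."""
    nal_type = avc_nal_unit_type(nal)
    if 1 <= nal_type <= 5:
        return "AVC VCL NAL unit"
    if nal_type == 14:
        return "prefix NAL unit"
    if nal_type == 20:
        return "type 20 NAL unit (SVC or MVC VCL)"
    return "other"
```

For example, a NAL unit whose first byte is 0x65 has nal_unit_type 5 and is classified as an AVC VCL NAL unit.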
3.2 Abbreviated terms
AVC Advanced Video Coding. Where contrasted with SVC or MVC in this International Standard,
this term refers to the main part of ISO/IEC 14496‐10, including neither Annex G (Scalable
Video Coding) nor Annex H (Multiview Video Coding)
BLA Broken Link Access
CRA Clean Random Access
CTU Coding Tree Unit
HEVC High Efficiency Video Coding
FF File Format
HRD Hypothetical Reference Decoder
IDR Instantaneous Decoding Refresh
MVC Multiview Video Coding [refers to ISO/IEC 14496‐10 when the techniques in Annex H
(Multiview Video Coding) are in use]
NAL Network Abstraction Layer
PPS Picture Parameter Set
ROI Region‐Of‐Interest
SEI Supplemental Enhancement Information
SPS Sequence Parameter Set
STSA Step‐wise Temporal Sub‐layer Access
SVC Scalable Video Coding [refers to ISO/IEC 14496‐10 when the techniques in Annex G
(Scalable Video Coding) are in use]
TSA Temporal Sub‐layer Access
VCL Video Coding Layer
VPS Video Parameter Set

4 General Definitions
4.1 Introduction
The specifications in this clause apply to all coding systems identified by chapters in this specification,
unless specifically over‐ridden by definitions in the clause for a specific coding system.
The following table summarizes the correspondences between the sets of terminology used in video
specifications and the ISO Base Media File Format.
Table 1 – Correspondence of terms in video and ISO Base Media File Format
Video          ISO Base Media File Format
‐              Movie
Bitstream      Track
Access Unit    Sample
4.2 Elementary stream structure
This specification concerns video coding systems that specify a set of Network Abstraction Layer (NAL)
units, which contain different types of data. This subclause specifies the format of the elementary
streams for storing such content.
4.3 Sample and Configuration definition
4.3.1 Introduction
Sample: A sample is an access unit as defined in the appropriate specification.
Parameter set sample: A parameter set sample is a sample in a parameter set stream which shall consist
of those parameter set NAL units that are to be considered as if present in the video elementary stream
at the same instant in time.
4.3.2 Canonical order and restrictions
The elementary stream is stored in the ISO Base Media File Format in a canonical format. The canonical
format is as neutral as possible so that systems that need to customize the stream for delivery over
different transport protocols — MPEG‐2 Systems, RTP, and so on — should not have to remove
information from the stream while being free to add to the stream. Furthermore, a canonical format
allows such operations to be performed against a known initial state.
The canonical stream format is an elementary stream that satisfies the following conditions:
• Video data NAL units: All video data NAL units for a single picture shall be contained within the
sample whose decoding time and composition time are those of the picture. Each sample shall
contain at least one video data NAL unit of the primary picture.
• SEI NAL units: All SEI NAL units shall be contained in the parameter set arrays, or in the sample
whose decoding time is at the time, or immediately precedes the time (with no intervening
samples), when the SEI messages come into effect instantaneously. In general, SEI messages for a
picture shall be included in the sample containing that picture, and SEI messages pertaining
to a sequence of pictures shall be included in the sample containing the first picture of the
sequence to which the SEI message pertains. The order of SEI messages within a sample is as
defined in the applicable video coding standard.
• The sequence of NAL units in an elementary stream and within a single sample must be in a valid
decoding order for those NAL units as specified in the applicable video coding standard.
• All timing information is external to the stream. Picture Timing SEI messages that define
presentation or composition timestamps may be included in the video elementary stream, as
these messages contain other information than timing, and may be required for conformance
checking. However, all timing information is provided by the information stored in the various
sample metadata tables, and this information over‐rides any timing provided in the video layer.
Timing provided within the video stream in this file format should be ignored as it may
contradict the timing provided by the file format and may not be correct or consistent within
itself.
NOTE This constraint is imposed due to the fact that post‐compression editing, combination, or re‐timing of a
stream at the file format level may invalidate or make inconsistent any embedded timing information present within
the video stream.
• No start codes. The elementary streams shall not include start codes. As stored, each NAL unit is
preceded by a length field as specified in 4.3.3; this enables easy scanning of the sample’s NAL
units. Systems that wish to deliver, from this file format, a stream using start codes will need to
reformat the stream to insert those start codes.
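The reformatting mentioned in the last item can be sketched as follows. This Python example is illustrative only: it assumes that the target is a byte stream in which each NAL unit is preceded by the four-byte start code 0x00000001, that the length fields are big-endian as elsewhere in the ISO Base Media File Format, and that the function name and the default length size are hypothetical choices.

```python
START_CODE = b"\x00\x00\x00\x01"  # assumed four-byte start code prefix


def length_prefixed_to_start_codes(sample: bytes, length_size: int = 4) -> bytes:
    """Replace each length field in a stored sample with a start code.

    length_size is the size of the length field in bytes (1, 2, or 4),
    taken from the decoder configuration record of the sample entry.
    """
    if length_size not in (1, 2, 4):
        raise ValueError("length field must be 1, 2, or 4 bytes")
    out = bytearray()
    pos = 0
    while pos < len(sample):
        if pos + length_size > len(sample):
            raise ValueError("truncated length field")
        nal_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        if pos + nal_len > len(sample):
            raise ValueError("NAL unit runs past the end of the sample")
        out += START_CODE
        out += sample[pos:pos + nal_len]
        pos += nal_len
    return bytes(out)
```

The NAL unit payloads are copied unchanged; only the framing differs between the stored form and the start-code form.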
4.3.3 Sample format
4.3.3.1 Definition
This subclause defines the structure of the samples. Samples are externally framed and have a size
supplied by that external framing. The syntax of a sample is configured via the decoder specific
configuration for the elementary stream. An example of the structure of a video sample is depicted in
the following figure.
[Figure: a sample shown as a sequence of NAL units, each preceded by a Length field: Access Unit Delimiter NAL Unit (if present), SEI NAL Unit (if present), Slice NAL Unit (Primary Coded Picture), Slice NAL Unit (Redundant Coded Picture) (if present)]

Figure 1 — Example structure of a sample
An access unit is made up of a set of NAL units. Each NAL unit is represented with a:
• Length: Indicates the length in bytes of the following NAL unit. The length field can be configured
to be of 1, 2, or 4 bytes.
• NAL Unit: Contains the NAL unit data as specified in the applicable video coding standard.
4.3.3.2 Syntax
aligned(8) class NALUSample
{
    unsigned int PictureLength = sample_size; // Size of Sample from SampleSizeBox
    for (i=0; i<PictureLength; )
    {
        unsigned int((DecoderConfigurationRecord.LengthSizeMinusOne+1)*8)
            NALUnitLength;
        bit(NALUnitLength * 8) NALUnit;
        i += (DecoderConfigurationRecord.LengthSizeMinusOne+1) + NALUnitLength;
    }
}
4.3.3.3 Semantics
DecoderConfigurationRecord indicates the record in the matching sample entry (e.g.
AVCDecoderConfigurationRecord in the case of AVC)
NALUnitLength indicates the size of a NAL unit measured in bytes. The length field includes the
size of both the one byte NAL header and the RBSP payload but does not include the length field
itself.
NALUnit contains a single NAL unit. The syntax of a NAL unit is defined in the appropriate
specification (e.g. ISO/IEC 14496‐10) and includes both the one byte NAL header and the
variable length encapsulated byte stream payload.
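The sample syntax and semantics above can be illustrated with a short reader. The following Python sketch is not part of the format definition: the function name is hypothetical, the argument length_size_minus_one stands for the LengthSizeMinusOne field of the decoder configuration record, and the length fields are assumed to be big-endian as for other integer fields in the ISO Base Media File Format.

```python
from typing import List


def split_nal_units(sample: bytes, length_size_minus_one: int) -> List[bytes]:
    """Split one sample into NAL units by walking the length fields.

    Each NALUnitLength counts the NAL unit header and payload,
    but not the length field itself.
    """
    length_size = length_size_minus_one + 1
    if length_size not in (1, 2, 4):
        raise ValueError("length field must be 1, 2, or 4 bytes")
    units = []
    pos = 0
    while pos < len(sample):  # corresponds to i < PictureLength
        if pos + length_size > len(sample):
            raise ValueError("truncated length field")
        nal_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        if pos + nal_len > len(sample):
            raise ValueError("NAL unit runs past the end of the sample")
        units.append(sample[pos:pos + nal_len])
        pos += nal_len
    return units
```

With four-byte length fields (length_size_minus_one equal to 3), a 13-byte sample holding a 2-byte and a 3-byte NAL unit is split into those two units.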
4.4 Video Track Structure
In the terminology of ISO/IEC 14496‐12, both video and parameter set tracks are video or visual tracks.
They therefore use:
a) a handler_type of ‘vide’ in the HandlerBox;
b) a video media header ‘vmhd’;
c) and, as defined below, a derivative of the VisualSampleEntry.
A video stream is represented by one or more video tracks in a file.
If there is more than one track representing scalable aspects of a single stream, then they form
alternatives to each other, and the field ‘alternate_group’ should be used, or the composition system
used should select one of them, as appropriate. See 8.10.3 “Track Selection Box” of ISO/IEC 14496‐12
for informative labelling of why tracks are members of alternate groups.
4.5 Template fields used
The ISO Base Media File Format defines a number of fields which have default values but which may be
defined for use by specific sub‐systems. Tracks containing video data may use the following template
fields:
a) alternate_group in the TrackHeaderBox (see 5.4.6 on stream switching).
b) template field ‘depth’ in the VisualSampleEntry to document the presence of alpha.
depth takes one of the following values
0x18 – the video sequence is in colour with no alpha
0x28 – the video sequence is in grayscale with no alpha
0x20 – the video sequence has alpha (gray or colour)
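A reader might interpret the depth field as in the following Python sketch. The table of values is taken from the list above; the names DEPTH_MEANINGS, describe_depth and has_alpha are hypothetical, and rejecting any other value is an assumption of this example rather than a stated requirement.

```python
# Values of the 'depth' template field as listed in 4.5.
DEPTH_MEANINGS = {
    0x18: "colour, no alpha",
    0x28: "grayscale, no alpha",
    0x20: "alpha present (gray or colour)",
}


def describe_depth(depth: int) -> str:
    """Return a description of a VisualSampleEntry depth value."""
    if depth not in DEPTH_MEANINGS:
        raise ValueError(f"depth 0x{depth:02X} is not one of the listed values")
    return DEPTH_MEANINGS[depth]


def has_alpha(depth: int) -> bool:
    """True when the listed depth value documents the presence of alpha."""
    describe_depth(depth)  # validates the value
    return depth == 0x20
```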
4.6 Visual width and height
The width and height fields in a VisualSampleEntry must correctly document the cropped frame
dimensions (visual presentation size) of the video stream that is described by that entry. The width and
height fields do not reflect any changes in size caused by SEI messages such as pan‐scan. The visual
handling of SEI messages such as pan‐scan is both optional and terminal‐dependent. If the width and
height of the sequence changes, then a new sample entry is needed.
Note that the visual size in the SPS may be either frame or field size; in the sample entry, it is always the
frame size.
The width and height fields in the track header may not be the same as the width and height fields in the
one or more VisualSampleEntry in the video track. As specified in the ISO Base Media File Format, if
normalized visual presentation is needed, all the sequences are normalized to the track width and
height for presentation.
4.7 Decoding time (DTS) and composition time (CTS)
Samples are stored in the file format in decoding order. If picture reordering is not used and decoding
and composition times are the same, then presentation is the same as decoding order and only the time‐
to‐sample ‘stts’ table is used. Note that any kind of picture may be reordered, not only B‐pictures.
If decoding time and composition time differ, the composition time‐to‐sample ‘ctts’ table is also used in
conjunction with the 'stts' table.
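The relationship between the two tables can be sketched as follows. This Python example assumes the run-length layout of the ISO Base Media File Format tables: each 'stts' entry is a pair (sample_count, sample_delta) and each 'ctts' entry is a pair (sample_count, sample_offset). The function name and the use of plain tuples are hypothetical.

```python
from typing import List, Optional, Tuple


def sample_times(
    stts: List[Tuple[int, int]],
    ctts: Optional[List[Tuple[int, int]]] = None,
) -> List[Tuple[int, int]]:
    """Return (decoding time, composition time) per sample, in decoding order."""
    decode_times = []
    t = 0
    for count, delta in stts:
        for _ in range(count):
            decode_times.append(t)
            t += delta
    if ctts is None:
        # No reordering: composition time equals decoding time.
        return [(dts, dts) for dts in decode_times]
    offsets = []
    for count, offset in ctts:
        offsets.extend([offset] * count)
    if len(offsets) != len(decode_times):
        raise ValueError("'ctts' and 'stts' describe different sample counts")
    return [(dts, dts + off) for dts, off in zip(decode_times, offsets)]
```

For four samples stored in decoding order I, P, B, B with a constant delta of 10 and offsets 10, 30, 0, 0, the composition times are 10, 40, 20, 30, so the P picture is presented after both B pictures.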
4.8 Sync sample (IDR)
A sample is considered as a sync sample if ALL of the following conditions are met:
• The video data NAL units in the sample indicate that the primary picture contained in the sample
is an instantaneous decoding refresh (IDR) picture.
• When the sample entry name is 'avc1' or 'avc2', all SPSs and PPSs needed to decode the video
data NAL units in the sample of the IDR picture and the following samples in decode order are
contained in the decoder configuration of the video elementary stream or in a separate
parameter set elementary stream sample.
• When the sample entry name is 'avc3' or 'avc4', the following applies:
1. If the sample is an IDR access unit, all parameter sets needed for decoding that sample shall
be included either in the sample entry or in the sample itself.
2. Otherwise (the sample is not an IDR access unit), all parameter sets needed for decoding the
sample shall be included either in the sample entry or in any of the samples since the
previous random access point to the sample itself, inclusive.
A parameter set elementary stream sample is a sync sample if and only if all parameter sets required by
the associated video elementary stream from the time of the parameter set sample forward are
supplied, in the parameter set stream, before they are required by the associated video elementary
stream.
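The sync sample conditions for a video sample can be read as a simple predicate. The sketch below is illustrative only. It assumes the caller has already determined whether the primary picture is an IDR picture and which parameter sets are needed, and it models parameter sets as opaque identifiers (the strings used here are hypothetical). It does not cover the rule for parameter set elementary stream samples.

```python
from typing import AbstractSet


def is_sync_sample(
    entry_name: str,
    is_idr: bool,
    needed: AbstractSet[str],
    in_sample_entry: AbstractSet[str],
    in_sample: AbstractSet[str] = frozenset(),
    in_parameter_set_stream: AbstractSet[str] = frozenset(),
) -> bool:
    """Evaluate the sync sample conditions for one video sample.

    For 'avc1'/'avc2', `needed` must cover the IDR sample and the following
    samples in decoding order; for 'avc3'/'avc4', it covers the sample itself.
    """
    if not is_idr:
        return False  # only IDR pictures can be sync samples
    if entry_name in ("avc1", "avc2"):
        # Parameter sets come from the decoder configuration or from a
        # separate parameter set elementary stream.
        available = set(in_sample_entry) | set(in_parameter_set_stream)
    elif entry_name in ("avc3", "avc4"):
        # Parameter sets come from the sample entry or the sample itself.
        available = set(in_sample_entry) | set(in_sample)
    else:
        raise ValueError(f"unexpected sample entry name: {entry_name!r}")
    return set(needed) <= available


# Hypothetical example: an 'avc3' IDR sample carrying its own PPS in-band.
example = is_sync_sample(
    "avc3", True, needed={"sps0", "pps0"},
    in_sample_entry={"sps0"}, in_sample={"pps0"},
)
```

Note that parameter sets carried in the sample itself do not count for 'avc1' or 'avc2' in this sketch, which mirrors the second condition above.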
© ISO/IEC 2014 – All rights reserved 9

4.9 Shadow sync
The use of the shadow sync table to indicate alternate encodings of a sample for random access is
supported as defined in the ISO Base Media File Format. A shadow sync shall indicate a sample that is a
random access point as specified in the general requirements and for the specific coding format in the
track.
While the use of shadow sync is supported for backward compatibility reasons, this use is deprecated
and use of the mechanisms defined in 5.4.6 is recommended.
4.10 Sample groups on random access recovery points and random access points
The video coding system can include the concept of a ‘gradual decoding refresh’ or random access
recovery point. This may be signalled in the bit‐stream using a mechanism such as the recovery point
SEI message. This message is found at the beginning of the random access, and indicates how much data
must be decoded subsequent to the access unit at the position of the SEI message before the recovery is
complete.
When all access units in output order starting from the access unit at the position of the SEI message
can be successfully decoded after random access, i.e. when the recovery_frame_cnt syntax element of
the recovery point SEI message is 0, the Random Access Point (‘rap ’) sample grouping should be used.
This concept of gradual recovery is also supported in the file format by using RollRecoveryEntry Groups
[4.5]. In order that the group membership marks the sample containing the SEI message, the
‘roll-distance’ is constrained to being only positive (i.e. a post‐roll). In other words, RollRecoveryEntry
Groups can be used when the value of the recovery_frame_cnt syntax element of the recovery point SEI
message is greater than 0.
Note – The roll‐group counts samples in the file format; this may not match the way that the distances are
represented in the SEI message.
Within a stream, it is necessary to mark the beginning of the pre‐roll, so that a stream decoder may start
decoding there. However, in a file, when performing random access, a deterministic search is desired
for the closest preceding frame which can be decoded perfectly (either a sync sample, or the end of a
pre‐roll).
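The choice of sample grouping and the search for a random access starting point can be sketched as follows. This is one possible reading, not a normative algorithm. It assumes sample indices are integers in decoding order, that a roll group member at index r with roll distance d yields correct output from sample r + d onwards, and that 'roll' is the grouping type used for RollRecoveryEntry groups in the ISO Base Media File Format. The function names are hypothetical.

```python
from typing import Dict, Iterable


def grouping_for_recovery_point(recovery_frame_cnt: int) -> str:
    """Pick the sample grouping type for a sample carrying a recovery point SEI."""
    if recovery_frame_cnt < 0:
        raise ValueError("recovery_frame_cnt cannot be negative")
    # 0 means every following access unit in output order decodes correctly.
    return "rap " if recovery_frame_cnt == 0 else "roll"


def random_access_start(
    target: int,
    sync_samples: Iterable[int],
    roll_points: Dict[int, int],
) -> int:
    """Return the latest sample from which decoding yields a correct `target`.

    `roll_points` maps a sample index to its positive roll distance, counted
    in samples of the file (not necessarily in the units of the SEI message).
    """
    candidates = [s for s in sync_samples if s <= target]
    # A gradual refresh only helps if recovery completes at or before the target.
    candidates += [r for r, dist in roll_points.items() if r + dist <= target]
    if not candidates:
        raise ValueError("no random access point at or before the target sample")
    return max(candidates)
```

For example, with sync samples at 0 and 100 and roll points {40: 5, 58: 5}, a seek to sample 60 would start decoding at sample 40, because the roll starting at 58 does not complete until sample 63.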
4.11 Hinting
Note that what the hint tracks call “B frames” are actually ‘disposable’ pictures or non‐reference
pictures, for example as defined in ISO/IEC 14496‐10.
Care should be taken when the structures in Annex A (aggregators or extractors) are in use and the
track is hinted. These structures are defined only for use in the file format and should not be
transmitted. In particular, a hint track that points at an extractor in a video track would cause the
extractor itself to be transmitted (which is probably both incorrect and not the desired behaviour), not
the data the extractor references. Hint tracks should normally directly reference NAL units specified in
the applicable video coding standard.
5 AVC elementary streams and sample definitions
5.1 Introduction
The Advanced Video Coding (AVC) standard, jointly developed by the ITU‐T and
ISO/IEC JTC 1/SC 29/WG 11 (MPEG), offers not only increased coding efficiency and enhanced
robustness, but also many features for the systems that use it. To enable the best visibility of, and access
to, those features, and to enhance the opportunities for the interchange and interoperability of media,
this part of ISO/IEC 14496 defines a storage format for video streams compressed using AVC.
This clause defines the storage for plain AVC streams, where ‘plain AVC’ refers to the main part of
ISO/IEC 14496‐10, excluding Annex G (Scalable Video Coding) and Annex H (Multiview Video Coding).
This clause specifies the elementary stream and sample structure used to store AVC visual content.
The storage of AVC content uses the existing capabilities of the ISO base media file format but also
defines extensions to support the following features of the AVC codec.
 Switching pictures:
to enable switching between different coded streams and substitution of pictures within the same
stream.
 Sub‐sequences and layers:
provides a structuring of the dependencies of a group of pictures to provide for a flexible stream
structure (e.g. in terms of temporal scalability and layering).
 Parameter sets:
the sequence and picture parameter set mechanism decouples the transmission of infrequently
changing information from the transmission of coded macroblock data. Each slice containing the
coded macroblock data references the picture parameter set containing its decoding parameters. In
turn, the picture parameter set references a sequence parameter set that contains sequence level
decoding parameter information.
5.2 Elementary stream structure
Two types of elementary streams are defined for storing AVC content (see also Figure 2):
 Video elementary streams shall contain all video coding related NAL units (i.e. those NAL units
containing video data or signalling video structure) and may contain non‐video coding related
NAL units such as SEI messages and access unit delimiter NAL units. Other NAL units that are not
expressly prohibited may be present, and if they are unrecognized should be ignored (e.g. not
placed in the output buffer while accessing the file).
 Parameter set elementary streams shall not contain video coding related NAL units (i.e. those
NAL units containing video data or signalling video structure), and would normally contain only
sequence parameter sets, picture parameter sets and sequence parameter set extension NAL
units.
Using these stream types, AVC content shall be stored in one of these configurations:
 Video elementary stream with no parameter sets: In this case, sequence and picture
parameter set NAL units shall be stored in the sample entries of this track. Sequence and picture
parameter set NAL units shall not be part of AVC samples within the stream itself.
 Video elementary stream possibly including parameter sets: In this case, the sample entry
indicates whether the stream may contain parameter sets of given types, in addition to other
parameters provided in the sample entry. Sequence and picture parameter set NAL units may
therefore be part of AVC samples within the stream itself.
 Video elementary stream and parameter set elementary stream: In this case, sequence and
picture parameter set NAL units shall be transmitted only in the parameter set elementary
stream and shall neither be present in the sample entries nor the AVC samples of the video
elementary stream.
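A reader or validator might check the parameter set placement rule along the following lines. This is a minimal sketch under stated assumptions: the sample has already been split into individual NAL units, nal_unit_type is taken from the low five bits of the first NAL unit byte with values 7 (sequence parameter set) and 8 (picture parameter set) as defined in ISO/IEC 14496‐10, and sample entries 'avc1'/'avc2' are taken to correspond to streams without in‐band parameter sets while 'avc3'/'avc4' may carry them. The function names are hypothetical, and only sequence and picture parameter sets are checked.

```python
from typing import List, Sequence

NAL_SPS = 7  # sequence parameter set (ISO/IEC 14496-10)
NAL_PPS = 8  # picture parameter set (ISO/IEC 14496-10)


def nal_unit_type(nal: bytes) -> int:
    """Extract nal_unit_type from the low five bits of the NAL unit header byte."""
    if not nal:
        raise ValueError("empty NAL unit")
    return nal[0] & 0x1F


def in_band_parameter_sets(nal_units: Sequence[bytes]) -> List[bytes]:
    """Return the SPS and PPS NAL units found among a sample's NAL units."""
    return [n for n in nal_units if nal_unit_type(n) in (NAL_SPS, NAL_PPS)]


def sample_placement_ok(entry_name: str, nal_units: Sequence[bytes]) -> bool:
    """Check only the SPS/PPS placement rule for one video sample."""
    if entry_name in ("avc3", "avc4"):
        return True  # parameter sets may be part of the samples
    if entry_name in ("avc1", "avc2"):
        return not in_band_parameter_sets(nal_units)
    raise ValueError(f"unexpected sample entry name: {entry_name!r}")


# Hypothetical, truncated NAL units: only the header byte matters here.
sps = bytes([0x67, 0x42])  # 0x67 & 0x1F == 7
pps = bytes([0x68, 0xCE])  # 0x68 & 0x1F == 8
idr = bytes([0x65, 0x88])  # 0x65 & 0x1F == 5 (coded slice of an IDR picture)
```

With these values, a sample consisting of sps, pps and idr would pass the check for 'avc3' but not for 'avc1'.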
The types of NAL units that are allowed in each of the video and parameter set elementary streams are
specified in the following table.
Table 2 – NAL unit types in elementary streams
Value of nal_unit_type | Description | Video elementary stream (sample entry 'avc1' or 'avc2') | Video elementary stream (sample entry 'avc3' or 'avc4') | Parameter set elementary stream
0 | Unspecified | Not specified by this part of ISO/IEC 14496 | Not specified by this part of ISO/IEC 14496 | Not specified by this part of ISO/IEC 14496
1 | Coded slice of a non‐IDR picture, slice_layer_without_partitioning_rbsp( ) | Yes | Yes | No
2 | Coded slice data partition A, slice_data_partition_a_layer_rbsp( ) | Yes | Yes | No
3 | Coded slice data partition B, slice_data_partition_b_layer_rbsp( ) | Yes | Yes | No
4 | Coded slice data partition C, slice_data_partition_c_layer_rbsp( ) | Yes | Yes | No
5 | Coded slice of an IDR picture, slice_layer_without_partitioning_rbsp( ) | Yes | Yes | No
6 | Supplemental enhancement information (SEI), sei_rbsp( ) | Yes. Except for the Sub‐sequence, layering or Filler | Yes. Except for the Sub‐sequence, or layering | Only ‘declarative’ SEIs should be
...
