Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content

Technologies de l'information — Intelligence artificielle pour le multimédia — Partie 3: Optimisation des codeurs et des systèmes de réception pour l'analyse automatique de contenus vidéo codés

General Information

Status: Not Published
Current Stage: 6000 - International Standard under publication
Start Date: 05-Nov-2025
Completion Date: 15-Nov-2025
Ref Project: ISO/IEC DTR 23888-3
Draft: ISO/IEC DTR 23888-3 - Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content. Released: 2025-08-26. English language, 20 pages. Also available as a REDLINE edition.

Standards Content (Sample)


FINAL DRAFT
Technical Report
ISO/IEC DTR 23888-3
ISO/IEC JTC 1/SC 29
Secretariat: JISC
Voting begins on: 2025-09-09
Voting terminates on: 2025-11-04

Information technology — Artificial intelligence for multimedia —
Part 3:
Optimization of encoders and receiving systems for machine analysis of coded video content

RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.

Reference number: ISO/IEC DTR 23888-3:2025(en) © ISO/IEC 2025

ISO/IEC DTR 23888-3:2025(en)

© ISO/IEC 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.

ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland

Contents Page
Foreword .iv
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Abbreviated terms . 2
5 Overview . 2
5.1 General overview .2
5.2 Use cases and applications .3
6 Evaluation methodology . 3
6.1 General .3
6.2 Bit rate .4
6.3 PSNR .4
6.4 mAP .4
6.5 MOTA .5
6.6 BD-rate .5
7 Pre-processing technologies . 6
7.1 Region of interest-based methods.6
7.2 Foreground and background processing .7
7.3 Temporal subsampling .7
7.4 Spatial subsampling .7
7.5 Noise filtering .8
8 Encoding technologies . 8
8.1 RoI-based quantization parameter adaption .8
8.2 Quantization step adjustment for temporal layers .8
8.3 Chroma QP offset setting.9
9 Post-processing technologies . 9
9.1 Temporal resampling .9
9.2 Spatial resampling .9
9.3 Enhancement post-filtering .9
10 Metadata . 10
10.1 Neural-network post-filter SEI message .10
10.2 Annotated regions SEI message .10
10.3 Object mask information SEI message .11
10.4 Encoder optimization information SEI message .11
10.5 Packed regions information SEI message .11
Annex A (informative) Software implementation examples .12
Annex B (informative) Combined software implementation examples . 19
Bibliography .20

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations,
governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/
IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any
claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not
received notice of (a) patent(s) which may be required to implement this document. However, implementers
are cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held
responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information, in collaboration with
ITU-T SG21, Technologies for multimedia, content delivery and cable television. The corresponding ITU-T SG21
provisional work item name is H.Sup.MACVC.
A list of all parts in the ISO/IEC 23888 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.

Information technology — Artificial intelligence for
multimedia —
Part 3:
Optimization of encoders and receiving systems for machine
analysis of coded video content
1 Scope
This document provides a summary of optimizations for encoders and receiving systems for conducting machine analysis tasks on coded video content. It gives a concept-level overview of recent practices and comments on technical aspects and cautions to be taken when interpreting the results. This document describes technologies that have recently been studied and have demonstrated benefits to coding efficiency for some machine analysis tasks.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitute
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
Rec. ITU-T H.266 | ISO/IEC 23090-3, Versatile video coding
Rec. ITU-T H.265 | ISO/IEC 23008-2, High efficiency video coding
Rec. ITU-T H.264 | ISO/IEC 14496-10, Advanced video coding
Rec. ITU-T H.274 | ISO/IEC 23002-7, Versatile supplemental enhancement information messages for coded video
bitstreams
3 Terms and definitions
For the purposes of this document, the terms and definitions given in Rec. ITU-T H.266 | ISO/IEC 23090-3,
Rec. ITU-T H.265 | ISO/IEC 23008-2, Rec. ITU-T H.264 | ISO/IEC 14496-10, Rec. ITU-T H.274 | ISO/IEC 23002-7
and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https:// www .iso .org/ obp
— IEC Electropedia: available at https:// www .electropedia .org/
3.1
machine consumption
applying a machine analysis task such as object detection, segmentation or object tracking

4 Abbreviated terms
AVC Advanced Video Coding (Rec. ITU-T H.264 | ISO/IEC 14496-10)
BD-rate Bjøntegaard delta bit rate
CTU coding tree unit
HEVC High Efficiency Video Coding (Rec. ITU-T H.265 | ISO/IEC 23008-2)
mAP mean average precision
MOTA multiple object tracking accuracy
NNPF neural-network post-filter
NNPFA neural-network post-filter activation
NNPFC neural-network post-filter characteristics
OMI object mask information
PSNR peak signal-to-noise ratio
QP quantization parameter
RoI region of interest
RPR reference picture resampling
SEI supplemental enhancement information
TID temporal identifier
URI uniform resource identifier
VSEI Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams (Rec.
ITU-T H.274 | ISO/IEC 23002-7)
VVC Versatile Video Coding (Rec. ITU-T H.266 | ISO/IEC 23090-3)
Y′CBCR colour space representation commonly used for video/image distribution, also written as YUV
YUV colour space representation commonly used for video/image distribution, also written as Y′CBCR
5 Overview
5.1 General overview
Most video processing systems consist of four main processing steps, as shown in Figure 1. This document
describes technologies for optimization of encoders and receiving systems, such as pre-processing, encoding
and post-processing for machine consumption. The decoding process, on the other hand, is fully specified
in the respective Rec. ITU-T H.266 | ISO/IEC 23090-3 Versatile Video Coding (VVC), Rec. ITU-T H.265 |
ISO/IEC 23008-2 High Efficiency Video Coding (HEVC) and Rec. ITU-T H.264 | ISO/IEC 14496-10 Advanced
Video Coding (AVC) video coding standards, amongst others. Hence, the samples of the decoded video are
fully specified by the given input bitstream.

Figure 1 — General video coding and processing pipeline
An overview of the commonly used practices for evaluating encoder optimization technologies for machine
consumption can be found in Clause 6. Descriptions of pre-processing technologies can be found in Clause 7.
Encoder optimization technologies are described in Clause 8 and post-processing technologies are described
in Clause 9. Metadata that is useful for machine consumption is described in Clause 10.
It is noted that depending on specific use cases, the technologies outlined in this document can be
implemented individually or in combination to optimize the machine consumption performance within the
constraints of the system capabilities. When employing multiple technologies simultaneously, it is important
to consider that certain combinations can be impractical or infeasible due to inherent methodological
constraints. Tested combinations of two or more technologies are listed in Annex B.
5.2 Use cases and applications
There are various use cases and applications using encoded video that benefit from optimizing both encoders
and receiving systems for machine consumption. Some of them are highlighted below:
— Surveillance: A considerable amount of bandwidth is needed to transmit a high volume of data generated
by a large number of sensors. The number of sensors also has an impact on the computational load on
the server side, as having to analyse the input from many sensors can become a huge burden. This can be
eased by distributing the computation to the front-end devices.
— Intelligent transportation: A key aspect for vehicular applications is interoperability between not only
vehicles from different vendors, but also the infrastructures of various locations. Connected vehicles are expected to play a significant role in future transport systems, and the tremendous number of vehicles emphasizes the need for reducing the amount of data transmitted between them to avoid overloading the network.
— Intelligent industry: One example in this area is visual content analysis, checking and screening. Machine
automation is desirable for increasing efficiency.
A more detailed description of use cases can be found in ISO/IEC TR 23888-1.[1]
6 Evaluation methodology
6.1 General
A set of assessment metrics is used for the evaluation of encoder and receiving system optimization technologies for machine consumption. An overview of the evaluation framework is shown in Figure 2. Here the
input video is encoded to generate a bitstream. This bitstream is then decoded, and the decoded video is
used for machine consumption. In this diagram, the “encoder” includes both pre-processing and encoding
steps, and the “decoder” includes both decoding and post-processing steps, as shown in Figure 1.

Figure 2 — Evaluation framework and points of measurement
6.2 Bit rate
The bit rate is determined based on the encoded bitstream and parameters of the input video such as frame
rate and the number of total frames. The following formula is applied to calculate the bit rate:
    \mathrm{bitrate} = \frac{8 \cdot \mathrm{fileSizeInBytes} \cdot \mathrm{fps}}{\mathrm{numFrames} \cdot 1000}
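As an illustration, the calculation maps directly to code; the following is a minimal Python sketch (the function and variable names are illustrative, not taken from any reference software):

    def bitrate_kbps(file_size_bytes, fps, num_frames):
        # Bit rate in kbit/s from the bitstream size and input video parameters
        return 8 * file_size_bytes * fps / (num_frames * 1000)

    # Example: a 1 200 000-byte bitstream for 300 frames at 30 fps gives 960.0 kbit/s
    print(bitrate_kbps(1_200_000, 30, 300))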
6.3 PSNR
Encoding for video distribution is ordinarily performed in the Y′CBCR domain (nicknamed YUV herein for brevity and ease of typing). For standard-dynamic range video, the distortion metric primarily used in the video coding standardization community has been the Peak Signal to Noise Ratio (PSNR). The following two formulae are used to calculate PSNR:
    \mathrm{MSE} = \frac{1}{m \cdot n} \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} \left( x(i,j) - y(i,j) \right)^2

    \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\left( 255 \cdot 2^{(\mathrm{bitdepth}-8)} \right)^2}{\mathrm{MSE}} \right)
where x(i,j) is the decoded sample value of a certain colour component, y(i,j) is the corresponding original sample value, and bitdepth is the bit depth of the input video. It is a common practice to calculate PSNR values for each of the colour components Y, U and V.
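A minimal NumPy sketch of the per-component computation, assuming same-shaped sample planes as input:

    import numpy as np

    def psnr(decoded, original, bitdepth=8):
        # MSE and PSNR of one colour component, following the formulae above
        x = decoded.astype(np.float64)
        y = original.astype(np.float64)
        mse = np.mean((x - y) ** 2)
        peak = 255 * 2 ** (bitdepth - 8)  # 255 for 8-bit video, 1020 for 10-bit
        return 10 * np.log10(peak ** 2 / mse)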
6.4 mAP
The performance of object detection and segmentation tasks is measured by mean average precision (mAP). This metric indicates what percentage of objects is correctly identified by having sufficient overlap between the detected object and the ground truth as well as being assigned to the correct object class. Then

the share of correctly identified objects for each class is determined, and finally the score for each class is
averaged. The calculation of mAP is as follows:
    \mathrm{mAP} = \frac{1}{\mathrm{numOverlaps}} \sum_{i=1}^{\mathrm{numOverlaps}} \left( \frac{1}{\mathrm{numClasses}} \sum_{j=1}^{\mathrm{numClasses}} \frac{\mathrm{correctObjects}_j}{\mathrm{totalObjects}_j} \right)
Some commonly used variants of this metric are:
— mAP@0.5: An object is counted as correctly identified if the Intersection over Union (IoU) between the
detected bounding box and the ground truth bounding box is at least 0.5. Sometimes this variant of the
mAP metric is also referred to as mAP50.
— mAP@[0.5:0.05:0.95]: In this variant a total of ten mAP scores with increasing IoU thresholds are
calculated. The IoU threshold starts at 0.5 and increases by 0.05 after each iteration, until it reaches the upper bound value of 0.95. Once all ten scores are determined, the average of these scores is calculated
to produce the final mAP.
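Full mAP computation as used in common object detection benchmarks additionally involves confidence-ranked precision-recall integration; the sketch below implements only the simplified counting of the formula above at a single IoU threshold (all names and the input layout are illustrative):

    import numpy as np

    def iou(a, b):
        # Intersection over union of two boxes given as (x0, y0, x1, y1)
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def simplified_map(detections, ground_truth, threshold=0.5):
        # detections / ground_truth: dict mapping class id -> list of boxes.
        # A ground-truth object counts as correctly identified if some detection
        # of the same class overlaps it with IoU >= threshold (mAP@0.5 style).
        scores = []
        for cls, gt_boxes in ground_truth.items():
            det_boxes = detections.get(cls, [])
            correct = sum(any(iou(d, g) >= threshold for d in det_boxes) for g in gt_boxes)
            scores.append(correct / len(gt_boxes))
        return float(np.mean(scores))

    # mAP@[0.5:0.05:0.95]: average the score over the ten IoU thresholds
    # np.mean([simplified_map(det, gt, t) for t in np.arange(0.5, 1.0, 0.05)])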
6.5 MOTA
Object tracking performance is measured by Multiple Object Tracking Accuracy (MOTA). This metric accounts for all object configuration errors made by the tracker, i.e. false positives, misses (false negatives) and mismatches, over all frames. The calculation of MOTA is as follows:
    \mathrm{MOTA} = 1 - \frac{\sum_t \left( FN_t + FP_t + mme_t \right)}{\sum_t g_t}

where FN_t, FP_t, mme_t and g_t are the number of false negatives, the number of false positives, the number of mismatch errors (identity switches between two successive frames), and the number of objects in the ground truth, respectively, at time t.
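Given per-frame tracker statistics, the computation is a single accumulation; a minimal sketch (the input layout is an assumption):

    def mota(per_frame_stats):
        # per_frame_stats: iterable of (fn_t, fp_t, mme_t, g_t) tuples, one per frame t
        errors = sum(fn + fp + mme for fn, fp, mme, _ in per_frame_stats)
        ground_truth = sum(g for _, _, _, g in per_frame_stats)
        return 1.0 - errors / ground_truth

    # Example: 4 accumulated errors against 30 ground-truth objects -> 1 - 4/30 = 0.867
    print(mota([(1, 0, 0, 10), (0, 2, 1, 10), (0, 0, 0, 10)]))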
6.6 BD-rate
To compare the performance of a technology against the reference, the well-known Bjøntegaard delta rate (BD-rate) metric[2] is used. Instead of using PSNR as the distortion metric, as is typical for human vision performance evaluation, machine consumption distortion metrics, e.g., mAP and MOTA, are used in machine BD-rate calculation.
The distortion measurement of machine consumption (e.g., mAP and MOTA) can sometimes be non-monotonic with respect to the bit rate due to the characteristics of the machine analysis task and possible limitations of machine networks. Polynomial curve fitting is applied to ensure rate-distortion monotonicity and thus valid BD-rate calculation.
    f(x) = b_0 \cdot x^3 + b_1 \cdot x^2 + b_2 \cdot x + b_3

For a given polynomial function in the above formula, b_0, b_1, b_2 and b_3 are coefficients of the function, x is the input (bit rate) and f(x) is the output (quality). The following two constraints are invoked to ensure its monotonicity and convexity:
— the first order derivative of the polynomial shown below is positive in the given x range
    f'(x) = 3 \cdot b_0 \cdot x^2 + 2 \cdot b_1 \cdot x + b_2
— the second order derivative of the polynomial shown below is negative in the given x range
    f''(x) = 6 \cdot b_0 \cdot x + 2 \cdot b_1
Parameters (b_0, b_1, b_2, b_3) in the polynomial function are solved by sequential least squares programming
(SLSQP) and applied to curve fitting.
NOTE It is a common practice to have the minimal quality value of the fitted curve no smaller than the minimal
quality value of the original curve and the maximum quality value of the fitted curve no greater than the maximum
quality value of the original curve.
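A sketch of the constrained fit using SciPy's SLSQP solver is shown below. How the x-range constraints are handled is not prescribed by this document; sampling them on a grid over the observed bit rate range is an implementation choice made here for illustration:

    import numpy as np
    from scipy.optimize import minimize

    def fit_monotonic_cubic(rates, qualities):
        # Fit f(x) = b0*x^3 + b1*x^2 + b2*x + b3 to (bit rate, quality) points
        # with f'(x) >= 0 and f''(x) <= 0 enforced on a grid over the x range
        xs, ys = np.asarray(rates, float), np.asarray(qualities, float)
        grid = np.linspace(xs.min(), xs.max(), 50)

        def residual(b):
            return np.sum((np.polyval(b, xs) - ys) ** 2)

        constraints = [
            # f'(x) = 3*b0*x^2 + 2*b1*x + b2 must be non-negative
            {"type": "ineq", "fun": lambda b: 3*b[0]*grid**2 + 2*b[1]*grid + b[2]},
            # f''(x) = 6*b0*x + 2*b1 must be non-positive
            {"type": "ineq", "fun": lambda b: -(6*b[0]*grid + 2*b[1])},
        ]
        start = np.polyfit(xs, ys, 3)  # unconstrained fit as starting point
        return minimize(residual, start, method="SLSQP", constraints=constraints).x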
7 Pre-processing technologies
7.1 Region of interest-based methods
One often-used optimization method is region of interest (RoI)-based coding. Here the input video is
analysed in some way and then the encoder can optimize the encoding towards machine consumption based
on the analysis results. The analysis can be done using various methods, e.g., neural networks. An example
of a pipeline that can be used for RoI-based approaches is shown in Figure 3.
Figure 3 — Pipeline for RoI-based systems
In one implementation example, an object detection network is used to analyse the input data. This network
produces a list of objects that can be found in the current picture. The information used to describe each
object includes the index of the picture in which the object can be found and the position of the object in
the picture. Some networks can provide more information than this and the encoder can choose to select a
subset of all objects by filtering based on, for example, the class of an object or the estimated likelihood of an
object of the described class being at the described position. In a similar approach, a segmentation network
can be used where the object is not described by a bounding box but by a segmentation mask indicating
exactly which samples the segmentation network estimates belonging to the object. The list produced
during the analysis can then be used by the encoder, for example, to separate foreground and background
with the purpose of encoding the foreground at a better quality and the background at a lower quality. One
such encoding method is described in 8.1. In this example, the analysis does not change the input video, but
directly forwards it to the encoder.
In other RoI-based methods, the pre-processing changes the input video, for example, by applying different
pre-processing methods on the foreground and background, or specific parts of the video, such as
subsampling the background area of the input video.
In one implementation example, an object segmentation network is first used to analyse the input data.
The network produces a list of objects segmented with the object shapes in the current picture. The object
shapes and positions could be represented, for example, by segmentation masks. More information such as
the object class or the estimated likelihood of the object segment could also be provided by the network to
identify the objects. Based on the object information, it is possible to derive spatial complexity and temporal
complexity for the different segments, and then RoI-based pre-processing of the input video can be adapted
based on the spatial and temporal complexity. The spatial complexity here indicates the averaged object
size which can be calculated by dividing the percentage of the area covered by the objects by the total
number of the objects. Temporal complexity indicates the content changes between two pictures which can
be calculated by various methods, for example, by taking the mean absolute difference of the collocated
samples in two pictures.
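As an illustration, the spatial and temporal complexity described above can be computed along the following lines (a sketch under the stated definitions; the mask and frame layouts are assumptions):

    import numpy as np

    def spatial_complexity(masks):
        # Averaged object size: fraction of the picture covered by objects,
        # divided by the number of objects (masks: list of boolean arrays)
        covered = np.logical_or.reduce(masks)
        return covered.mean() / len(masks)

    def temporal_complexity(frame_a, frame_b):
        # Mean absolute difference of collocated samples in two pictures
        return np.mean(np.abs(frame_a.astype(np.float64) - frame_b.astype(np.float64)))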
7.2 Foreground and background processing
After pre-analysis that determines the foreground and background areas, one straightforward way to handle
the background that is less critical to machine consumption is to “eliminate” it by setting the corresponding
sample values to a constant value. However, some portions of the background samples, for example those
immediately surrounding the foreground area, could still be useful for machine consumption. Therefore, the
background regions relevant to machine consumption can be preserved to a certain extent with low-pass
filtering, such as a Gaussian filter with a sliding window, where the window size can be set based on the
input video resolution.
Moreover, extracted features can reveal importance information of the input video. In other words,
compared with binary classification of foreground and background, these extracted features can provide
importance information at a finer granularity. Therefore, such extracted features can be used to determine
how to process foreground and background differently. In one implementation example, a feature map is
extracted by a feature extraction network, and based on the feature map, the parameters of a Gaussian
smoothing filter are adapted and then the adaptive filtering is applied to the picture. As the background and foreground areas have different features, and even within the background or foreground area different regions can have different features, the Gaussian smoothing filter can be controlled at a finer granularity, which finally results in more efficient pre-processing.
An implementation example with more detailed description can be found in A.2.
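A minimal OpenCV sketch of background low-pass filtering, assuming a binary foreground mask from the pre-analysis and a three-channel input frame (the filter strength is illustrative):

    import cv2
    import numpy as np

    def blur_background(frame, fg_mask, sigma=3.0):
        # Low-pass filter background samples, keep foreground samples intact
        blurred = cv2.GaussianBlur(frame, (0, 0), sigma)  # kernel size derived from sigma
        return np.where(fg_mask[..., None], frame, blurred)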
7.3 Temporal subsampling
In some use cases, for example when the frame rate is high, a way to reduce the bit rate without a strong
negative impact on the machine consumption performance can be to skip certain frames and encode the
video at a lower frame rate. One example is to remove every other frame from the input video and encode
the video at half frame rate. This can be done in a dynamic manner, for example by evaluating the motion
between two or more frames and if there is only little motion, a frame can be removed. In some cases, if
the receiving system requires a specific frame rate, a corresponding post-processing technology that up-
samples the video to the full frame rate can be applied.
An implementation example with more detailed description can be found in A.4.
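A sketch of the dynamic variant, dropping a frame when the motion relative to the last kept frame is small (the threshold value is illustrative):

    import numpy as np

    def subsample_low_motion(frames, threshold=2.0):
        kept = [frames[0]]
        for f in frames[1:]:
            motion = np.mean(np.abs(f.astype(np.float64) - kept[-1].astype(np.float64)))
            if motion > threshold:  # keep only frames with sufficient motion
                kept.append(f)
        return kept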
7.4 Spatial subsampling
If the analysis of a video shows that it contains primarily large objects, one way to improve the BD-rate
performance is to perform spatial subsampling on the input video. This will result in fewer samples in the
subsampled frames to be encoded, and thus likely lead to faster encoding and bit rate savings. The optimum
downscaling factor is content dependent. It is possible that the machine consumption performance drops
when subsampling is too aggressive. Therefore, it is advisable to apply spatial subsampling adaptively, for
example, based on the characteristics of the video content (such as the averaged object spatial area and the
number of objects) and the target bit rate. Moreover, the spatial subsampling can also be dependent on the
picture types. For example, depending on whether the input video is captured by a regular camera as natural scenes or by an infrared sensor as thermal images, different spatial subsampling methods can be applied.
One tool specified in the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) standard that can be used for the purpose
of spatial resampling is called reference picture resampling (RPR). This tool allows the encoder to choose
to encode some pictures of the input video at different resolutions. For example, based on the analysis of keyframes, the resolution can be changed without encoding an intra picture at the new resolution, since inter prediction can be made from all allowed reference pictures regardless of their resolution.
In one implementation example, the RPR tool can be used at the frame level, where the input video can be
analysed as described in 7.1. In this case, the unmodified input video is forwarded to the encoder with a scale
factor list generated by the analyser. Specifically, an object detection network is used to analyse the input
video in both full resolution and at least one spatially subsampled resolution. This network produces a list of
objects with object information that can be found in the current picture and the spatially resampled picture. The object information describes the position and size of the detected objects for both the current picture and the spatially resampled picture. Based on the object information, an object occupancy distribution (i.e., the distribution of the ratio of object size to the corresponding picture resolution) can be generated for both the current picture and the spatially resampled picture. The scale factor can be derived for the current picture by comparing the correlation of object occupancy distributions, and then the list of all scale factors is passed on to the encoder for utilization of the RPR tool.
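As a simplified illustration of scale factor selection (this document compares occupancy distributions; the single-occupancy heuristic and threshold values below are assumptions made for the sketch):

    def pick_scale_factor(object_areas, picture_area):
        # The larger the average object occupancy, the more aggressively
        # the picture can be downscaled without hurting machine analysis
        if not object_areas:
            return 1.0
        occupancy = sum(object_areas) / (len(object_areas) * picture_area)
        if occupancy > 0.10:
            return 2.0   # predominantly large objects: downscale by 2
        if occupancy > 0.02:
            return 1.5
        return 1.0       # small objects: keep full resolution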


7.5 Noise filtering
Under some circumstances, various types of noise can be present in the video content. Denoising filters can be
applied on such content to reduce unnecessary bit rate increases and avoid machine consumption
performance degradation by filtering out the undesirable noise from the video content while preserving
information important to machine consumption. Various types of denoising filters can be applied according to
the characteristics of the noise. The strength of the filter can be adjusted based on the noise, and the filter can
adaptively be enabled and disabled for either an entire picture or sequence, or only a part thereof.
It is noted that some existing denoising filters such as bilateral or anisotropic diffusion filters can preserve
local details during denoising. For applications that benefit from such local details being preserved, filtering
the entire picture using the same strength can be detrimental.
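A sketch of adaptive strength control with an edge-preserving filter (the mapping from the estimated noise level to filter strength is illustrative):

    import cv2

    def adaptive_denoise(frame, noise_level):
        if noise_level < 1.0:
            return frame  # filter disabled for clean content
        sigma = min(75, 10 * noise_level)  # stronger filtering for noisier content
        # Bilateral filtering smooths noise while preserving local details
        return cv2.bilateralFilter(frame, 9, sigma, sigma)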
8 Encoding technologies
8.1 RoI-based quantization parameter adaption
One method that is available in many video coding standards is adaptive quantization parameter (QP). Here
the encoder can change the QP value at a sub-picture level, for example the coding tree unit (CTU) level, to
optimize the encoding for the application. Due to this versatility, adaptive QP can be used in many different
use cases to improve performance.
The decision on where to change the QP value and by how much can be made by the encoder based on an
analysis of the input video. Another option is to utilize the output of an external analyser such as described in
7.1. In this case, the encoder receives information about the positions and sizes of objects in each frame to
make a differentiation between foreground and background.
One option is to use a base QP value of the picture for areas that contain objects, i.e., foreground areas, and an
increased QP value for background areas, resulting in fewer bits being used to encode the background. As the
background is usually not critical to machine consumption, this is a straight-forward way to reduce the bit
rate without affecting the machine consumption performance. As an extension, it can also be beneficial to
encode large objects with slightly higher QP values. As it is generally easier to detect larger objects, reducing
the bit rate for large objects usually does not reduce the performance of machine consumption.
However, it is noted that when utilizing the analysis, complexity can be traded against bit rate. For example, if
a light-weight neural network is used to perform the analysis, it is possible that not all relevant objects have
been found and thus it can be detrimental to reduce the quality for the background too much as there
are possibly objects that the initial analysis missed. These objects are possibly still important for machine
consumption and if the background is encoded in sufficient quality, the machine consumption network has a
chance of detecting objects in the background even if they are coded in lower quality than the foreground area.
On the other hand, if the encoding system has a lot of resources, it can employ a neural network of higher
complexity for the analysis. With a better and more certain analysis, the bit rate for the background can be
reduced more as there are likely fewer objects that have been missed in the initial analysis.
A more detailed description with a link to an implementation can be found in A.1.
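A sketch of the foreground/background differentiation expressed as a per-CTU QP map (the background offset and CTU size are illustrative):

    import numpy as np

    def ctu_qp_map(fg_boxes, pic_w, pic_h, base_qp, bg_offset=8, ctu=128):
        # Base QP for CTUs overlapping a detected object, higher QP elsewhere
        cols, rows = -(-pic_w // ctu), -(-pic_h // ctu)  # ceiling division
        qp = np.full((rows, cols), base_qp + bg_offset, dtype=int)
        for x0, y0, x1, y1 in fg_boxes:
            qp[y0 // ctu:(y1 - 1) // ctu + 1, x0 // ctu:(x1 - 1) // ctu + 1] = base_qp
        return qp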
8.2 Quantization step adjustment for temporal layers
It is a common practice that the encoder places different pictures on different temporal layers, i.e., assigning
different temporal identifiers (TID). This has the purpose of creating hierarchical structures that indicate from
which previously coded pictures the encoder can create predictions for the current picture. One aspect of these
hierarchical structures is that pictures cannot be referenced by other pictures on a higher temporal layer. This
way, it is not necessary to store every decoded picture in the decoded picture buffer. Another aspect of the
hierarchical structure is that pictures on higher temporal layers can be encoded with higher QP values, i.e.,
lower quality, as they will not be used, or less often used, as references by other pictures. An example of a
hierarchical structure is shown in Figure 4. The display order is left-to-right, and the numbers specify
the order of the coded pictures in the bitstream.
Figure 4 — An example hierarchical referencing structure in a Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) random access configuration bitstream
This hierarchy of pictures can be exploited by encoding pictures on higher temporal layers using higher QP
values, i.e., reducing the number of bits spent on these pictures in total. This is also done in the case of the
common test conditions for standard dynamic range video for Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC). In the
use case of coding video content for machine consumption, this characteristic can be exploited further by
increasing the QP value for pictures on higher temporal layers more. Taking advantage of motion
compensation, many bits can be saved when compressing pictures in high temporal layers while these pictures
are still able to be reconstructed with high quality. Compressing the highest temporal layer with a high QP can be seen as being similar in spirit to reducing the frame rate as discussed in 7.3. As an example, lowering
the bit rate substantially on every odd-numbered frame can be seen as a step towards completely removing
them.
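A sketch of the per-picture QP assignment (the additional per-layer offset for machine consumption is illustrative, applied on top of a conventional hierarchical offset):

    def picture_qp(base_qp, tid, extra_per_tid=2):
        # Conventional hierarchical coding already raises QP with the temporal
        # layer; for machine consumption the increase can be made steeper
        hierarchical_offset = tid
        return base_qp + hierarchical_offset + extra_per_tid * tid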
8.3 Chroma QP offset setting
Many machine analysis methods are performed using 4:4:4 colour format input data. Therefore, encoding in
4:2:0 colour format, which has lower chroma resolution than 4:4:4 colour format, can sometimes have a
negative impact on the machine analysis performance. This can sometimes be compensated for by using a
negative chroma QP offset, which increases the quality of the low-resolution chroma components.
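As an illustration, with the VTM reference software for VVC this can be expressed via its chroma QP offset options (assuming the CbQpOffset/CrQpOffset encoder options; the offset value of -2 is illustrative):

    EncoderApp -c encoder_randomaccess_vtm.cfg -i input.yuv --CbQpOffset=-2 --CrQpOffset=-2 -b out.bin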
9 Post-processing technologies
9.1 Temporal resampling
When the receiving system requires a specific frame rate that is different from the frame rate of the decoded
sequences, temporal resampling can be applied on the decoded video by utilizing conventional temporal filters
(e.g., motion compensated interpolation filters) or neural network-based filters, or just frame repetition. An
implementation example with more detailed description can be found in A.4.
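The simplest of these, frame repetition, reduces to duplicating each decoded frame (a trivial sketch; a factor of two restores a halved frame rate):

    def upsample_by_repetition(frames, factor=2):
        # Repeat each decoded frame to restore the required frame rate
        out = []
        for f in frames:
            out.extend([f] * factor)
        return out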
...
