Information technology — Data centre facilities and infrastructures — Part 31: Key performance indicators for resilience

This document: a) defines metrics as key performance indicators (KPIs) for resilience, dependability, fault tolerance and availability tolerance for data centres; b) covers the data centre infrastructure (DCI) of power distribution and supply, and environmental control; c) can be referred to for covering further infrastructures, e.g. telecommunications cabling; d) defines the measurement and calculation of the KPIs and resilience levels (RLs); e) targets maintainability, recoverability and vulnerability; f) provides examples for calculating these KPIs for the purpose of analytical comparison of different DCIs. This document does not apply to IT equipment, cloud services, software or business applications.

Technologie de l’information — Installation et infrastructures de centres de traitement de données — Partie 31: Indicateurs clés de performance pour la résilience

General Information

Status: Published
Publication Date: 12-Dec-2023
Current Stage: 9092 - International Standard to be revised
Start Date: 13-Dec-2023
Completion Date: 30-Oct-2025

Overview

ISO/IEC TS 22237-31:2023 provides a practical framework of key performance indicators (KPIs) for data centre resilience. It defines metrics for resilience, dependability, fault tolerance and availability tolerance specifically for data centre infrastructure (DCI) components such as power distribution/supply and environmental control. The technical specification also defines how to measure and calculate KPIs and associated resilience levels (RLs), and gives worked examples to enable analytical comparison of different DCI designs. It does not apply to IT equipment, cloud services, software or business applications.

Key Topics

  • KPI definitions and structure - standardized metrics for resilience, dependability, failure rate, availability and related attributes.
  • Resilience levels (RLs) - quantitative framing for normal and reduced operation modes.
  • Measurement & calculation methods - guidance for deriving KPIs and resilience levels using reliability analyses.
  • Reliability and analysis techniques - includes methods such as Reliability Block Diagrams (RBD) and Failure Mode Effects and Criticality Analysis (FMECA) to model and compare DCI performance.
  • Fault tolerance analyses - identification and treatment of Single Point of Failure (SPoF), Double Point of Failure (DPoF), and points of reduced availability.
  • Life-cycle integration - application of resilience KPIs across design phases (strategy, objectives, system specification, design, construction and operation).
  • Documentation requirements - recommended documentation for maintainability, recoverability and vulnerability of DCIs.
  • Informative annexes provide example analyses (SPoF analysis, resilience level analysis, FMECA examples and confidence intervals).

Applications

  • Compare alternative data centre infrastructure designs using consistent resilience KPIs.
  • Validate and quantify resilience requirements in procurement, design reviews and technical specifications.
  • Support SLA formulation and verification by translating qualitative availability classes into measurable KPIs.
  • Optimize DCI investment decisions by balancing resilience, maintainability and cost.
  • Assist operators, planners and auditors in documenting dependability, fault tolerance and availability tolerance.

Who Should Use It

  • Data centre designers, planners and engineers
  • Facility managers and operators responsible for DCI reliability
  • Procurement teams specifying resilience requirements
  • Consultants, auditors and risk analysts evaluating DCI designs or SLAs

Related Standards

  • ISO/IEC 22237 series (data centre facilities and infrastructures) - provides broader classification and structural definitions.
  • ISO/IEC 30134 series - efficiency and sustainability KPIs recommended to be used alongside resilience KPIs for holistic DCI assessment.

Keywords: ISO/IEC TS 22237-31:2023, data centre resilience, KPIs, data centre infrastructure (DCI), fault tolerance, dependability, availability tolerance, resilience levels.

Technical specification
ISO/IEC TS 22237-31:2023 - Information technology — Data centre facilities and infrastructures — Part 31: Key performance indicators for resilience
Released: 13.12.2023
English language
43 pages

Standards Content (Sample)


TECHNICAL SPECIFICATION
ISO/IEC TS 22237-31
First edition
2023-12
Information technology — Data centre
facilities and infrastructures —
Part 31:
Key performance indicators for
resilience
Technologie de l’information — Installation et infrastructures de
centres de traitement de données —
Partie 31: Indicateurs clés de performance pour la résilience
Reference number
© ISO/IEC 2023
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
© ISO/IEC 2023 – All rights reserved

Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
3.1 Terms and definitions
3.2 Symbols and abbreviated terms
3.2.1 Symbols
3.2.2 Abbreviated terms
4 Area of application
4.1 General
4.2 DCI service definition
5 Resilience considerations as part of the life cycle
5.1 Implementation in the design process
5.1.1 General
5.1.2 Phase 1 — Strategy
5.1.3 Phase 2 — Objectives
5.1.4 Phase 3 — System specifications
5.1.5 Phase 4 — Design proposal
5.1.6 Phase 6 — Functional design
5.1.7 Phase 8 — Final design and project plan
5.1.8 Phase 10 — Construction
5.1.9 Phase 11 — Operation
5.2 Documentation during operation
5.2.1 General
5.3 Documentation of resilience level
5.3.1 General
5.3.2 Requirements
5.4 Documentation of dependability
5.4.1 Requirements
5.4.2 Recommendations
5.5 Documentation of fault tolerance
5.5.1 Requirements
5.6 Documentation of availability tolerance
5.6.1 Requirements
5.6.2 Recommendations
6 Determination of KPIs for resilience
6.1 General
6.2 Structuring of the KPIs for resilience
6.2.1 General
6.2.2 KPIs
6.2.3 Failure rate
6.2.4 Metrics
6.3 Dependability
6.3.1 Provided KPIs
6.3.2 Reliability
6.3.3 Availability
6.3.4 Failure rate
6.4 Fault tolerance
6.4.1 General
6.4.2 Single point of failure (SPoF)
6.4.3 Double point of failure (DPoF)
6.5 Availability tolerance
6.5.1 General
6.5.2 Single point of reduced availability (SPoRA)
6.5.3 Double point of reduced availability (DPoRA)
6.6 Resilience level (RL)
6.6.1 General
6.6.2 Operation at normal resilience level
6.6.3 Operation at reduced resilience level
6.7 Application to data centre infrastructures
6.7.1 Methodology and analysis considerations
6.7.2 Analysis process
6.7.3 Method of reliability block diagrams (RBD)
6.7.4 Method of Failure Mode Effects and Criticality Analysis
Annex A (informative) Resilience analysis for DCIs
Annex B (informative) SPoF Analysis for DCIs
Annex C (informative) Resilience level analysis for DCIs
Annex D (informative) Example of Failure Mode Effects and Criticality Analysis
Annex E (informative) Interval of confidence
Bibliography

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical
activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international
organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the
work.
The procedures used to develop this document and those intended for its further maintenance
are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria
needed for the different types of document should be noted. This document was drafted in
accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives or
www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of
any claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC
had not received notice of (a) patent(s) which may be required to implement this document. However,
implementers are cautioned that this may not represent the latest information, which may be obtained
from the patent database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall
not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to
the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see
www.iso.org/iso/foreword.html. In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 39, Sustainability, IT and data centres.
A list of all parts in the ISO/IEC 22237 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.

Introduction
The various parts of the ISO/IEC 22237 series reference four qualitative Availability Classes as well as
structural definitions to categorize different designs. The documents also refer to resilience criteria in
order to improve structural requirements for a qualitative approach.
In order to meet the requirements necessary for evaluating or comparing different designs or for
validating service level agreements (SLAs) for data centres, this document introduces quantitative
metrics as key performance indicators (KPIs). The proposed KPIs cover resilience attributes, including
dependability and fault tolerance metrics. The characteristics of aging of infrastructures are covered
by reliability criteria.
Through the use of KPIs, the comparison of designs, functional elements and components of
infrastructure designs becomes possible. In addition, it is possible to optimize data centre
infrastructures (DCI) with holistic targets. It is recommended to use the KPIs of this document in
combination with the efficiency and sustainability KPIs of the ISO/IEC 30134 series.
ISO/IEC 22237-1:2021, Annex A, demonstrates that a single KPI, such as Availability, is not sufficient to
describe the complexity of a DCI. In recognition, this document has been developed in order to compare
and value different designs with different Availability Classes of DCIs based on a set of selected KPIs.
Furthermore, the document has been created to establish KPIs for resilience of DCIs with defined
resilience levels. The resilience objectives can vary depending on the outcome of the ISO/IEC 22237-1
risk analysis, the end user information technology equipment (ITE) process criticality, and the data
centre type of business.
Using the different stages of a data centre design process, this document describes in which phases the
application of KPIs for resilience is appropriate. With its assistance, data centre designers, planners
and operators will be supported in defining resilience levels, performing theoretical assessments and
designing and operating DCIs which are able to meet SLAs.

TECHNICAL SPECIFICATION ISO/IEC TS 22237-31:2023(E)
Information technology — Data centre facilities and
infrastructures —
Part 31:
Key performance indicators for resilience
1 Scope
This document:
a) defines metrics as key performance indicators (KPIs) for resilience, dependability, fault tolerance
and availability tolerance for data centres;
b) covers the data centre infrastructure (DCI) of power distribution and supply, and environmental
control;
c) can be referred to for covering further infrastructures, e.g. telecommunications cabling;
d) defines the measurement and calculation of the KPIs and resilience levels (RLs);
e) targets maintainability, recoverability and vulnerability;
f) provides examples for calculating these KPIs for the purpose of analytical comparison of different
DCIs.
This document does not apply to IT equipment, cloud services, software or business applications.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 22237-1, Information technology — Data centre facilities and infrastructures — Part 1: General
concepts
ISO/IEC 22237-3, Information technology — Data centre facilities and infrastructures — Part 3: Power
distribution
ISO/IEC 22237-4, Information technology — Data centre facilities and infrastructures — Part 4:
Environmental control
ISO/IEC 30134-1, Information technology — Data centres — Key performance indicators — Part 1:
Overview and general requirements
IEC 61078, Reliability block diagrams
3 Terms and definitions
3.1 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO/IEC 22237-1, ISO/IEC 22237-3,
ISO/IEC 22237-4 and the following apply.

ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1.1
availability
ability to be in a state to perform as required
[SOURCE: IEC 60050-192:2015, 192-01-23, modified — Notes 1 and 2 to entry have been deleted.]
3.1.2
availability tolerance
ability to be in a state to perform as required with certain failures (3.1.8) present
3.1.3
dependability
ability to perform as and when required
Note 1 to entry: In this document, the term is used for the determination of data centre reliability (3.1.28),
availability (3.1.1) and failure rate (3.1.9).
[SOURCE: IEC 60050-192:2015, 192-01-22, modified — Notes 1 and 2 to entry have been replaced by a
new Note 1 to entry.]
3.1.4
double point of failure
DPoF
combination of two functional elements whose simultaneous failures (3.1.8) cause overall system fault
(3.1.10)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.5
double point of reduced availability
DPoRA
combination of two functional elements whose simultaneous failures (3.1.8) result in the violation of
the service level agreement (SLA) (3.1.30)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.6
down state
state of being unable to perform as required, due to failures (3.1.8) or faults (3.1.10)
Note 1 to entry: The state can be related to failures of items or faults at a specified operation point (OP) (3.1.21).
[SOURCE: IEC 60050-192:2015, 192-02-20]
3.1.7
event
something that happens and leads to one or more failures (3.1.8) or faults (3.1.10)
3.1.8
failure
loss of ability to perform as required
Note 1 to entry: In this context it is irrelevant if the cause was planned or unplanned.
[SOURCE: IEC 60050-192:2015, 192-03-01, modified — Notes 1 to 3 to entry have been replaced by Note
1 to entry.]

3.1.9
failure rate
limit of the ratio of the conditional probability that the instant of time, T, of a failure (3.1.8) of a product
falls within a given time interval (3.1.35) (t, t + Δt) and the duration of this interval, Δt, when Δt tends
towards zero, given that the item is in an up state (3.1.36) at the start of the time interval
[SOURCE: IEC 60050-192:2015, 821-12-21]
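The verbal definition of the failure rate can be written compactly as a limit. The following is one reading of the text above, in which "the item is in an up state at the start of the time interval" is expressed as the condition T > t; this holds for an item that has not failed before t and is an interpretive assumption, not a formula quoted from the document.

```latex
\lambda(t) \;=\; \lim_{\Delta t \to 0}
  \frac{P\left(t < T \le t + \Delta t \;\middle|\; T > t\right)}{\Delta t}
```

Here T is the instant of failure and Δt the duration of the time interval, matching the symbols listed in 3.2.1.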
3.1.10
fault
inability to perform as required, due to an internal state
Note 1 to entry: Opposite of success. In the context of the expected resilience level (RL) (3.1.26), at a specified
operation point (OP) (3.1.21).
[SOURCE: IEC 60050-192:2015, 192-04-01]
3.1.11
fault tolerance
ability to continue functioning with certain faults (3.1.10) present
[SOURCE: IEC 60050-192:2015, 192-10-09]
3.1.12
information technology equipment
ITE
equipment providing data storage, processing and transport services together with equipment
dedicated to providing direct connection to core and/or access networks
3.1.13
infrastructure
technical systems providing the functional capability of the data centre
Note 1 to entry: Examples are power distribution, environmental control, telecommunications cabling, physical
security.
[SOURCE: ISO/IEC 22237-1:2021, 3.1.21, modified — "telecommunications cabling" has been added to
the list in Note 1 to entry.]
3.1.14
inherent availability
availability (3.1.1) provided by the design under ideal conditions of operation and maintenance
[SOURCE: IEC 60050-192:2015, 192-08-02]
3.1.15
mean down time
MDT
average downtime caused by scheduled and unscheduled maintenance, including any logistics time
(expectations including detection time, diagnostic time, spare part delivery time, repair time)
[SOURCE: IEEE Std. 493-2007]
3.1.16
mean operating time between failures
MTBF
expectation of the duration of the operating time between failures (3.1.8)
Note 1 to entry: Mean operating time between failures should only be applied to repairable items. For non-
repairable items, see mean operating time to failure (3.1.17).
Note 2 to entry: The term “mean time between failures” (MTBF) is used synonymously in this document.
[SOURCE: IEC 60050-192:2015, 192-05-13]
3.1.17
mean operating time to failure
expectation of the operating time to failure (3.1.8)
Note 1 to entry: In the case of non-repairable items with an exponential distribution of operating times to failure,
i.e. a constant failure rate (3.1.9), the mean operating time to failure is numerically equal to the reciprocal of the
failure rate. This is also true for repairable items if after restoration they can be considered to be "as-good-as-
new".
Note 2 to entry: The term “mean time to failures” (MTTF) is used synonymously in this document.
[SOURCE: IEC 60050-192:2015, 192-05-11]
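Note 1 to entry of 3.1.17 can be illustrated numerically. The sketch below assumes a constant failure rate, so that operating times to failure are exponentially distributed; the function names and the example failure rate are hypothetical and not taken from the document.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Survival probability R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def mttf_hours(failure_rate_per_hour: float) -> float:
    """Mean operating time to failure as the reciprocal of a constant failure rate."""
    return 1.0 / failure_rate_per_hour


# Hypothetical component with 2 failures per million operating hours.
lam = 2e-6
print(f"MTTF: {mttf_hours(lam):.0f} h")
print(f"R(8 760 h): {reliability(lam, 8760):.4f}")
```

The reciprocal relation only holds under the exponential assumption; for ageing components with a time-dependent failure rate it does not apply.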
3.1.18
mean time between maintenance
MTBM
average time between all maintenance events (3.1.7), scheduled and unscheduled, and also includes any
associated logistics time
[SOURCE: IEEE Std. 493-2007]
3.1.19
mean time to restoration
mean time to replace or repair a failed component
Note 1 to entry: Logistics time associated with the repair, such as parts acquisitions or crew mobilization, are not
included.
[SOURCE: IEEE Std. 493-2007]
3.1.20
normal resilience level
NRL
resilience level (3.1.26) mandatory during nominal operation
3.1.21
operation point
OP
point of reference for which calculation of resilience level (3.1.26) is performed
Note 1 to entry: This can be an individual socket (3.1.33) taking into account the entire data centre infrastructure
(DCI) or certain defined parts of the infrastructure (3.1.13). The documentation of the referenced operation point
(OP) is required for any key performance indicator (KPI).
3.1.22
operational availability
availability (3.1.1) experienced under actual conditions of operation and maintenance
[SOURCE: IEC 60050-192:2015, 192-08-03, modified — Note 1 to entry has been deleted.]
3.1.23
past availability
availability (3.1.1) measured during a period of 1 year
Note 1 to entry: For the purposes of this document, 1 year equals 8 760 hours.
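As a minimal sketch of how a measured yearly availability could be derived, the following assumes that past availability is the fraction of the 8 760 h reporting year spent in the up state. The calculation rule of the document itself is not shown in this sample, so the formula and the function names are assumptions.

```python
HOURS_PER_YEAR = 8760  # 1 year as stated in Note 1 to entry of 3.1.23


def past_availability(downtime_hours: float) -> float:
    """Fraction of the reporting year in which the OP was in the up state."""
    if not 0 <= downtime_hours <= HOURS_PER_YEAR:
        raise ValueError("downtime must lie within the reporting year")
    return (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR


def allowed_downtime_hours(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


print(f"{past_availability(8.76):.4f}")
print(f"{allowed_downtime_hours(0.9999):.3f} h")
```

The inverse function is convenient when an SLA states an availability target and the operator needs the corresponding downtime budget.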
3.1.24
reduced resilience level
RRL
resilience level (3.1.26) mandatory during reduced operation in case of one or more failures (3.1.8)

3.1.25
resilience
ability to withstand and reduce the magnitude and/or duration of disruptive events (3.1.7), including
the capability to anticipate, absorb, adapt to, and/or rapidly recover from such an event
[SOURCE: IEEE Task Force on Definition and Quantification of Resilience, PES-TR65: 2018-04 [2]]
3.1.26
resilience level
enumeration of attributes for the determination of resilience (3.1.25) aspects of a defined service at a
defined operation point (OP) (3.1.21)
3.1.27
redundancy
provision of more than one means for performing a function
Note 1 to entry: In a data centre, redundancy can be achieved by duplication of devices, functional elements, and/
or supply paths.
[SOURCE: IEC 60050-192:2015, 192-10-02, modified — Original Note 1 to entry has been replaced by a
new Note 1 to entry.]
3.1.28
reliability
ability to perform as required, without failure (3.1.8), for a mean time interval (3.1.35), under given
conditions
[SOURCE: IEC 60050-192:2015, 192-01-24, modified — Notes 1 to 3 to entry have been deleted.]
3.1.29
resilience model
representation x of the data centre infrastructure (DCI) that shows all required subsystems,
components and items as well as their systemic interdependencies
3.1.30
service level agreement
SLA
agreement defining the content and quality of the service to be delivered and the timescale in which it
is to be delivered
[SOURCE: ISO/IEC TS 22237-7:2018, 3.1.20]
3.1.31
single point of failure
SPoF
functional element whose failure (3.1.8) causes overall system fault (3.1.10)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.32
single point of reduced availability
SPoRA
functional element whose failure (3.1.8) results in the violation of the service level agreement (SLA)
(3.1.30)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.33
socket
connection enabling supply of power to attached equipment
Note 1 to entry: This can be a de-mateable or a hardwired connection.
[SOURCE: ISO/IEC 22237-3:2021, 3.1.26]
3.1.34
system success path
infrastructural path, consisting of a minimum of functional elements, to express the success of the
infrastructure (3.1.13) system at the operation point (OP) (3.1.21) to be in the up state (3.1.36)
Note 1 to entry: Each functional element can consist of one or more devices.
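The definitions of SPoF (3.1.31), DPoF (3.1.4) and system success path can be combined into a small enumeration. The sketch below assumes the DCI is described by a list of system success paths, each a set of functional element names, and that the OP is in the up state while at least one path is fully intact. Excluding pairs that already contain a SPoF from the DPoF list is a choice made here for readability, not a rule taken from the document; the topology is hypothetical.

```python
from itertools import combinations


def is_up(success_paths, failed):
    """The OP is up if at least one success path contains no failed element."""
    return any(path.isdisjoint(failed) for path in success_paths)


def single_points_of_failure(success_paths):
    """Elements whose individual failure leaves no intact success path."""
    elements = set().union(*success_paths)
    return {e for e in elements if not is_up(success_paths, {e})}


def double_points_of_failure(success_paths):
    """Pairs of non-SPoF elements whose simultaneous failure causes a fault."""
    elements = set().union(*success_paths)
    candidates = sorted(elements - single_points_of_failure(success_paths))
    return {
        frozenset(pair)
        for pair in combinations(candidates, 2)
        if not is_up(success_paths, set(pair))
    }


# Hypothetical dual-path power supply that shares one utility feed.
paths = [
    frozenset({"utility", "ups_a", "pdu_a"}),
    frozenset({"utility", "ups_b", "pdu_b"}),
]
print(single_points_of_failure(paths))
print(sorted(sorted(pair) for pair in double_points_of_failure(paths)))
```

In this example the shared utility feed is the only SPoF, and every pair that takes one element from each path is a DPoF. Brute-force enumeration is adequate for small models; larger DCIs would typically be analysed with reliability block diagrams.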
3.1.35
time interval
part of the time axis limited by two instants
[SOURCE: IEC 60050-192:2015, 113-01-10]
3.1.36
up state
state of being able to perform as required
Note 1 to entry: The state can be related to items or to a specified operation point (OP) (3.1.21).
[SOURCE: IEC 60050-192:2015, 192-02-01]
3.2 Symbols and abbreviated terms
3.2.1 Symbols
For the purposes of this document, the symbols given in ISO/IEC 22237-1, ISO/IEC 30134-1 and the
following apply.
$A_{\mathrm{i}}$ inherent availability
$A_{\mathrm{o}}$ operational availability
$A_{\mathrm{o,NRL}}$ normal resilience level operational availability
$A_{\mathrm{o,req}}$ required operational availability
$A_{\mathrm{o,RRL}}$ reduced resilience level operational availability
$A_{\mathrm{p}}$ past availability
$D(x)$ disjoint sum of system success paths of $x$
$e$ exponential PDF
$f(t)$ probability density function (PDF)
$N_{\mathrm{f}}$ number of failures during time interval $t$
$N_x$ number of $x$
$R(t)$ reliability in time interval $t$
$R_{\mathrm{i}}$ inherent reliability
$R_{\mathrm{o}}$ operational reliability
$R_{\mathrm{p}}$ past reliability
$S(x)$ success, $x$ is in the up state
$S_{\mathrm{E}}(x)$ environmental control success function
$S_{\mathrm{OP}}(x)$ overall success function
$S_{\mathrm{P}}(x)$ power and distribution success function
$t_{\mathrm{MDT}}$ mean down time
$t_{\mathrm{MTBF}}$ mean time between failures
$t_{\mathrm{MTBM}}$ mean time between maintenance
$t_{\mathrm{MTTR}}$ mean time to restoration
$t_x$ time interval of $x$
$T$ instant of time
$x_m$ vector of elements of $x_{m(i)}$ of the $m$th DCI
$x_{m(i)}$ functional element $x$ of the $m$th DCI with the index $i$
$\alpha$ confidence rate
$\Delta t$ duration of time interval
$\lambda_{\mathrm{i}}$ inherent failure rate
$\lambda_{\mathrm{mean}}$ mean failure rate
$\lambda_{\mathrm{o}}$ operational failure rate
$\lambda_{\mathrm{p}}$ past failure rate
$\chi^2$ chi-square distribution function law with two degrees of freedom
3.2.2 Abbreviated terms
For the purposes of this document, the abbreviated terms given in ISO/IEC 22237-1, ISO/IEC 30134-1
and the following apply.
CBEMA Computer Business Equipment Manufacturers Association
DCI data centre infrastructure (infrastructure residing within a data centre)
DPoF double point of failure
DPoRA double point of reduced availability
FAT factory acceptance test
FMECA Failure Mode Effects and Criticality Analysis
ITE information technology equipment
KPI key performance indicator
MDT mean down time
MTBF mean operating time between failures
MTBM mean time between maintenance
MTTF mean time to failure
MTTR mean time to restoration
NRL normal resilience level
OP operation point
PDF probability density function
RBD reliability block diagram
RL resilience level
RRL reduced resilience level
SLA service level agreement
SPoF single point of failure
SPoRA single point of reduced availability
SSP system success path
4 Area of application
4.1 General
The KPIs for resilience, including the dependability, fault tolerance and availability tolerance KPIs, as
specified in this document are associated with the following DCIs of the ISO/IEC 22237 series:
a) ISO/IEC 22237-3: Power supply and distribution;
b) ISO/IEC 22237-4: Environmental control.
The application can be extended to additional infrastructures, e.g. ISO/IEC TS 22237-5
(telecommunications cabling infrastructure).
4.2 DCI service definition
To determine system success at the operation point (OP), it is required to define the relevant DCI. In
general, the overall success function $S_{\mathrm{OP}}(x)$ is represented by a certain number, $N$, of successes of
infrastructures inside the DCI as shown in Formula (1):
$$S_{\mathrm{OP}}(x) = \bigcap_{m=1}^{N} S(x_m) \qquad (1)$$
The success $S(x_m)$ of the enumerated infrastructures $x_m$ is connected by the ∩ operator. In general,
these infrastructures are not mutually exclusive, because the functions depend on each other.
Functional dependencies shall be taken into account in the calculations.
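Formula (1) can be read as a logical AND over the success states of the individual infrastructures. The sketch below shows this reading together with one hypothetical functional dependency (environmental control cannot succeed without power); the function names and the dependency are illustrative assumptions, not part of the document.

```python
def overall_success(infrastructure_success: dict[str, bool]) -> bool:
    """Formula (1) read as a logical AND: the OP succeeds only if every infrastructure does."""
    return all(infrastructure_success.values())


def evaluate(power_ok: bool, cooling_plant_ok: bool) -> dict[str, bool]:
    """Hypothetical dependency: environmental control needs power in order to run."""
    return {
        "power": power_ok,
        "environmental_control": cooling_plant_ok and power_ok,
    }


print(overall_success(evaluate(power_ok=True, cooling_plant_ok=True)))
print(overall_success(evaluate(power_ok=False, cooling_plant_ok=True)))
```

Because of the dependency, the two success states are correlated. A probability calculation that simply multiplied independent availabilities would therefore misstate the result, which is why the text requires functional dependencies to be taken into account.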
To operate the information technology equipment (ITE) within the permitted parameters, the service
success requires:
— adequate service quality of the power supply and distribution, fed by the sockets;
— adequate service quality of the cooling by the environmental control.

The DCI is represented by the vector $x$, which refers to Formula (1). The operation of the DCI is
considered to be successful if power supply and distribution $S_{\mathrm{P}}(x)$ and environmental control $S_{\mathrm{E}}(x)$
are by themselves operating successfully at the specified OP. Formula (2) defines the system success
function as follows:
$$S_{\mathrm{OP}}(x) = S_{\mathrm{P}}(x) \cap S_{\mathrm{E}}(x) \qquad (2)$$
The operation of the power supply and distribution system is deemed successful, $S_{\mathrm{P}}(x) = 1$, if the
infrastructure provides the required power quality to the specific socket defined as OP. A violation of
the power quality, as required by the ITE at a specific socket, is defined as a failure: $S_{\mathrm{P}}(x) = 0$. The
cause of the failure can be planned or unplanned.
The operation of the environmental control system is deemed successful, $S_{\mathrm{E}}(x) = 1$, if the environmental
requirements of the ITE at the specified socket defined as OP are satisfied. A violation of the
environmental conditions of a specific functional element or device is defined as a failure: $S_{\mathrm{E}}(x) = 0$.
The cause of the failure can be planned or unplanned.
A failure or a combination of failures that leads to $S_{\mathrm{OP}}(x) = 0$ is deemed a fault. For calculation
purposes using Formula (2), the following criteria shall be taken into account.
a) The power and cooling capacity of the entire DCI shall be specified.
b) The OP shall be selected in relation to the outcome of the risk analysis.
c) The specified power and cooling capacity shall be given for the selected OP.
d) The service quality of power supply and distribution and environmental control at the selected OP
shall be represented by the DCI model.
The selection of the OP depends on the specific task. In general, the OPs with the highest requirements
of service quality are of relevance.
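The 0/1 convention of Formula (2) can be sketched for a single OP as follows. The acceptance limits for voltage and inlet temperature are hypothetical placeholders; in practice they follow from the ITE requirements and the SLA, and none of the names below come from the document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpLimits:
    """Hypothetical acceptance limits at the OP (placeholders, not normative values)."""
    min_voltage_v: float = 207.0
    max_voltage_v: float = 253.0
    min_inlet_c: float = 18.0
    max_inlet_c: float = 27.0


@dataclass(frozen=True)
class OpSample:
    """One observation at the OP: supply voltage and ITE inlet temperature."""
    voltage_v: float
    inlet_c: float


def s_p(sample: OpSample, limits: OpLimits) -> int:
    """Power success: 1 if the supply at the socket is within limits, else 0."""
    return int(limits.min_voltage_v <= sample.voltage_v <= limits.max_voltage_v)


def s_e(sample: OpSample, limits: OpLimits) -> int:
    """Environmental success: 1 if the inlet temperature is within limits, else 0."""
    return int(limits.min_inlet_c <= sample.inlet_c <= limits.max_inlet_c)


def s_op(sample: OpSample, limits: OpLimits) -> int:
    """Formula (2): overall success is the intersection of both success states."""
    return s_p(sample, limits) & s_e(sample, limits)


limits = OpLimits()
print(s_op(OpSample(voltage_v=230.0, inlet_c=22.0), limits))
print(s_op(OpSample(voltage_v=230.0, inlet_c=35.0), limits))
```

A sample with S_OP = 0 corresponds to a fault at the OP regardless of whether the cause was planned or unplanned.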
5 Resilience considerations as part of the life cycle
5.1 Implementation in the design process
5.1.1 General
According to ISO/IEC 22237-1, the data centre design process is split into 11 project phases. The
resilience of the DCI can be managed throughout the life cycle, from the strategy phase (1) to the
operation phase (11). In particular, the KPIs for resilience are used in the following phases.
5.1.2 Phase 1 — Strategy
Phase 1 is for information collection in order to define the project objectives. This phase requires the
following.
a) Gather the requirements, for example, SLAs.
b) Decide about application of resilience KPIs for design.
c) Decide about application of resilience KPIs for operation.
d) Define the DCI services for application of KPIs for resilience.

5.1.3 Phase 2 — Objectives
Phase 2 is handled by the owner to convert the strategy into objectives. This phase requires the
definition of the resilience objectives according to the risk analysis and the respective SLAs, as follows.
a) Define the OP, for example: protected/non-protected sockets, server racks, rack rows, etc.
b) Define the maximum accepted downtime at the OP, for example:
— the maximum time interval of loss of the power supply (see ISO/IEC 22237-3);
— the maximum time interval of loss of the power distribution (see ISO/IEC 22237-3);
— supply boundary that ITE can tolerate without experiencing unexpected shutdowns or
malfunctions (see Reference [3]);
— the maximum time interval of loss of the environmental control (see ISO/IEC 22237-4);
— the maximum time of fault of the entire DCI.
c) Define the maximum accepted rate of failures deemed as faults at the OP during the reporting
time interval.
d) Define the set of KPIs depending on the resilience objective, for example:
— dependability requirements (reliability, availability, failure rate);
— fault tolerance requirements (number of SPoF, number of DPoF);
— availability tolerance requirements (number of SPoRA, number of DPoRA).
The definitions of resilience objectives can be made by making the provisions of 6.6 mandatory during
nominal operation (NRL) and during reduced operation (RRL).
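The objectives listed for Phase 2 can be captured in a small record and checked against an assessed design. The sketch below is one possible representation; the field names, the limits and the two example designs are hypothetical and chosen only to show the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceObjectives:
    """Phase 2 objectives for one operation point (hypothetical field names)."""
    operation_point: str
    max_downtime_h_per_year: float
    max_faults_per_year: float
    max_spof: int
    max_dpof: int


@dataclass(frozen=True)
class DesignAssessment:
    """KPI values obtained for a candidate design at the same operation point."""
    downtime_h_per_year: float
    faults_per_year: float
    spof: int
    dpof: int


def unmet_objectives(objectives: ResilienceObjectives,
                     assessment: DesignAssessment) -> list[str]:
    """Return the names of all objectives that the assessed design violates."""
    checks = {
        "downtime": assessment.downtime_h_per_year <= objectives.max_downtime_h_per_year,
        "fault rate": assessment.faults_per_year <= objectives.max_faults_per_year,
        "SPoF": assessment.spof <= objectives.max_spof,
        "DPoF": assessment.dpof <= objectives.max_dpof,
    }
    return [name for name, ok in checks.items() if not ok]


objectives = ResilienceObjectives("protected socket", 1.0, 0.1, 0, 4)
design_a = DesignAssessment(0.5, 0.05, 0, 4)
design_b = DesignAssessment(0.5, 0.05, 1, 4)
print(unmet_objectives(objectives, design_a))
print(unmet_objectives(objectives, design_b))
```

Separate objective records could be kept for the normal and the reduced resilience level, so that the same check applies to both operating modes.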
5.1.4 Phase 3 — System specifications
Phase 3 defines the target specifications for all infrastructures. The output of the specifications shall be
validated in accordance with the objectives of Phase 2.
5.1.5 Phase 4 — Design proposal
Phase 4 offers several options for a design proposal. This phase requires the following.
a) Compare/optimize different designs through the application of KPIs for resilience.
b) Approve compliance of the designs with the defined requirements.
5.1.6 Phase 6 — Functional design
Phase 6 offers the functional design. This phase requires the following.
a) Approve the functional design through the application of KPIs for resilience.
5.1.7 Phase 8 — Final design and project plan
During Phase 8 the designer defines the quantities (volume and/or number of pieces) for all items of the
DCI. To meet the resilience objectives, the definitions made in previous phases shall be taken into
account with the help of the applied KPIs for resilience.
5.1.8 Phase 10 — Construction
Phase 10 includes supervision and acceptance verification of the DCI, until it is put into service. The
resilience objectives shall be taken into account during the following.
a) Factory acceptance tests (FATs).
b) Equipment transportation and installation on site.
c) Commissioning tests, such as functional performance tests (FPTs) and integrated system tests
(ISTs).
d) Failure simulations on functional elements.
e) Failure simulations on the entire DCI.
The outcome of this phase is deeper knowledge of the resilience properties of the DCI.
5.1.9 Phase 11 — Operation
Phase 11 describes the handover to the owner for operation. This phase requires the following.
a) Approve compliance of the DCI with the assumptions of the KPIs used.
b) Monitor the defined KPIs for resilience during operation.
c) Approve compliance of the DCI with the defined requirements in case of planned interruptions, times
for logistics and response times.
d) Review and, if required, recalculate the KPIs for resilience of the DCI.
5.2 Documentation during operation
5.2.1 General
Documentation of metrics and causes is the basis for optimization of resilience during operation. In
order to be able to monitor aspects of resilience, the organization shall document the following metrics
and causes.
a) MTBF and MTTR of the utility supply.
b) MTBF, MTTR, MTBM and MDT data of the functional elements or components.
c) Causes for failures and/or faults.
d) Causes and scope of restoration.
For evaluation and documentation of failures, Failure Mode, Effects and Criticality Analysis (FMECA)
is applicable. See Annex D.
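The following is an illustrative sketch only and is not part of the specification. It shows one common way to derive MTBF and MTTR for a single functional element from a documented failure log, assuming MTBF is taken as total uptime divided by the number of failures and MTTR as total downtime divided by the number of failures. The record layout `Outage`, the function `mtbf_mttr` and the sample log entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One documented failure of a functional element (hypothetical record layout)."""
    start_h: float  # hour of the reporting period at which the failure occurred
    end_h: float    # hour at which the element was restored
    cause: str      # documented cause of the failure


def mtbf_mttr(outages, period_h=8760.0):
    """Return (MTBF, MTTR) in hours for one element over the reporting period.

    Assumption: MTBF = uptime / number of failures,
                MTTR = downtime / number of failures.
    """
    if not outages:
        # No failure observed: MTBF is unbounded for this period.
        return float("inf"), 0.0
    downtime = sum(o.end_h - o.start_h for o in outages)
    uptime = period_h - downtime
    n = len(outages)
    return uptime / n, downtime / n


# Hypothetical log for one reporting year (1 a = 8 760 h).
log = [
    Outage(1200.0, 1204.0, "UPS module fault"),
    Outage(6100.0, 6102.0, "breaker trip"),
]
mtbf, mttr = mtbf_mttr(log)
print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.1f} h")  # MTBF = 4377.0 h, MTTR = 3.0 h
```

Keeping the cause with each record means the same log can serve items c) and d) of the list above.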
5.3 Documentation of resilience level
5.3.1 General
In order to evaluate KPIs for resilience, the following information shall be provided.
a) The resilience model of the DCI.
b) The OPs studied and their load assumptions.
c) The MTBF, MTTR, MTBM and MDT data of the functional elements or components.
d) The number of SPoF and DPoF.
e) If applicable, the number of SPoRA and DPoRA.
f) The calculation method.
Periods of runtime shall be documented on an annual basis, where 1 a = 8 760 h.
The recalculation of the resilience KPIs is required after an incident that involves structural
modifications as well as modifications to functional elements. Structural change requires the review
and, if necessary, the revision of the resilience model.
5.3.2 Requirements
Cause and duration of violations of the resilience level shall be documented to calculate the past
reliability, past availability, and past failure rate.
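As a minimal sketch, and not the formulae of this document, the past KPIs can be estimated from the documented durations of resilience-level violations. The function `past_kpis` is hypothetical, and the sketch assumes that past availability is the uptime fraction of the reporting period, that the past failure rate is the number of violations per hour of uptime, and that past reliability follows an exponential law with a constant failure rate.

```python
import math

HOURS_PER_YEAR = 8760.0  # 1 a = 8 760 h


def past_kpis(violation_durations_h, period_h=HOURS_PER_YEAR, mission_h=HOURS_PER_YEAR):
    """Estimate (A_p, lambda_p, R_p) from documented violation durations in hours.

    Assumptions (not taken from the specification):
      A_p      = uptime / reporting period
      lambda_p = number of violations / uptime          [1/h]
      R_p      = exp(-lambda_p * mission_h)             (constant failure rate)
    """
    downtime = sum(violation_durations_h)
    uptime = period_h - downtime
    a_p = uptime / period_h
    lambda_p = len(violation_durations_h) / uptime
    r_p = math.exp(-lambda_p * mission_h)
    return a_p, lambda_p, r_p


# Hypothetical year with two violations lasting 0.5 h and 1.5 h.
a_p, lambda_p, r_p = past_kpis([0.5, 1.5])
print(f"A_p = {a_p * 100:.4f} %, lambda_p = {lambda_p:.6e} 1/h, R_p = {r_p:.4f}")
```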
5.4 Documentation of dependability
5.4.1 Requirements
In general, reliability, availability and failure rate shall be reported at a minimum of four and a
maximum of six decimal places. The chosen OP and the load assumption of the DCI shall always be
quoted alongside documented values.
To gauge the availability KPI, a corresponding NRL shall be defined.
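The reporting rule above can be sketched as a small formatting helper. This is an illustration under stated assumptions: the function `format_kpi`, its parameters and the sample OP and load strings are hypothetical and are not defined by this document.

```python
def format_kpi(name, value, op, load, decimals=6):
    """Format a dependability KPI with four to six decimal places,
    quoting the chosen OP and the load assumption alongside the value."""
    if not 4 <= decimals <= 6:
        raise ValueError("decimals must be between 4 and 6")
    return f"{name} = {value:.{decimals}f} (OP: {op}; load: {load})"


# Hypothetical availability value, OP and load assumption.
line = format_kpi("A", 0.9997716895, "rack row 3, protected sockets", "80 % of design load")
print(line)  # A = 0.999772 (OP: rack row 3, protected sockets; load: 80 % of design load)
```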
5.4.2 Recommendations
To distinguish between calculated availabilities, i.e. the inherent availability, the operational availability,
and the measured past availability of a data centre in operation, the measurement of $A_{\mathrm{p}}$ (past
availability) should be documented in percentage terms. This is also applicable to the measurement of
the past reliability, $R_{\mathrm{p}}$, and the past failure rate, $\lambda_{\mathrm{p}}$.
A reduced resilience level (RRL) during periods of planned reconstruction, adaptation or renewal
should be defined.
To avoid rounding errors, the data of the system's items should be used with a precision at least one
order of magnitude higher than that of the KPIs to be calculated.
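One reading of the rounding recommendation is that element data should carry at least one more decimal digit than the reported KPI. The sketch below illustrates the effect for a hypothetical series chain of ten identical elements; the element availability of 0.99996 and the series (product) model are assumptions chosen for illustration only.

```python
import math

# Hypothetical series chain: ten elements, each with availability 0.99996.
element_availabilities = [0.99996] * 10


def series_availability(values):
    """Availability of independent elements in series (product of availabilities)."""
    return math.prod(values)


# KPI to be reported with four decimal places.
# Element data kept with five decimals (one order of magnitude finer than the KPI).
fine = series_availability(round(a, 5) for a in element_availabilities)
# Element data rounded to the same four decimals as the KPI.
coarse = series_availability(round(a, 4) for a in element_availabilities)

print(f"{fine:.4f}")    # 0.9996
print(f"{coarse:.4f}")  # 1.0000
```

Rounding the inputs to the KPI's own precision hides all unavailability in this example, whereas one additional digit preserves the result.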
5.5 Documentation of fault tolerance
5.5.1 Requirements
The number of SPoF and DPoF shall be documented as integers; see Formulae (14) and (15). Based on
the resilience model of the DCI, the KPIs of SPoF and DPoF shall be calculated.
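Formulae (14) and (15) are not reproduced here. As a hedged sketch of the counting idea only, the example below enumerates single and double faults on a hypothetical resilience model in which the OP is served as long as every element of at least one supply path works. The model `PATHS`, the element names, and the convention that a pair containing an SPoF is not counted again as a DPoF are all assumptions.

```python
from itertools import combinations

# Hypothetical resilience model: the OP is served if every element
# of at least one path is working.
PATHS = [
    {"utility", "transformer_A", "ups_A", "pdu_A"},
    {"utility", "transformer_B", "ups_B", "pdu_B"},
]


def service_up(failed, paths=PATHS):
    """True if at least one path contains no failed element."""
    return any(path.isdisjoint(failed) for path in paths)


def count_spof_dpof(paths=PATHS):
    """Return (list of SPoF elements, list of DPoF element pairs)."""
    elements = sorted(set().union(*paths))
    spof = [e for e in elements if not service_up({e}, paths)]
    # Assumption: pairs that already contain an SPoF are not counted as DPoF.
    dpof = [
        pair
        for pair in combinations(elements, 2)
        if not service_up(set(pair), paths) and not any(e in spof for e in pair)
    ]
    return spof, dpof


spof, dpof = count_spof_dpof()
print(len(spof), len(dpof))  # 1 9
```

In this model the shared utility supply is the only SPoF, and each of the nine DPoF pairs combines one element from path A with one from path B.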
5.6 Documentation of availability tolerance
5.6.1 Requirements
The number of SPoRA and DPoRA shall be documented as integers; see Formulae (16) and (17).
The RRL, as a condition of planned maintenance, shall be defined. Based on the resilience model of the
DCI, the operational availability for all cases of SPoF and DPoF shall be calculated. The number of
violations of $A_{\mathrm{o,RRL}}$ in cases of SPoF gives the number of SPoRA, and in cases of DPoF gives the number
of DPoRA.
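Formulae (16) and (17) are not reproduced here. The sketch below illustrates the counting idea under stated assumptions: every single-element and every two-element fault case is evaluated, the operational availability of the remaining system is computed by exact enumeration of element states (elements assumed independent), and each case that falls below the RRL target is counted. The element availabilities in `AVAIL`, the three-path model `PATHS` and the target `RRL_TARGET` are hypothetical.

```python
from itertools import combinations, product

# Hypothetical operational availabilities of the functional elements.
AVAIL = {
    "ups_A": 0.999, "pdu_A": 0.9995,
    "ups_B": 0.999, "pdu_B": 0.9995,
    "ups_C": 0.999, "pdu_C": 0.9995,
}
# Hypothetical model: the OP is served if all elements of at least one path work.
PATHS = [{"ups_A", "pdu_A"}, {"ups_B", "pdu_B"}, {"ups_C", "pdu_C"}]
RRL_TARGET = 0.9999  # hypothetical required A_o,RRL


def availability(failed=frozenset()):
    """Operational availability at the OP with the given elements forced down."""
    names = sorted(AVAIL)
    a = {n: 0.0 if n in failed else AVAIL[n] for n in names}
    total = 0.0
    # Enumerate every up/down combination of the elements.
    for states in product([True, False], repeat=len(names)):
        prob = 1.0
        for n, s in zip(names, states):
            prob *= a[n] if s else 1.0 - a[n]
        up = {n for n, s in zip(names, states) if s}
        if any(path <= up for path in PATHS):
            total += prob
    return total


elements = sorted(AVAIL)
# Count fault cases whose remaining availability violates the RRL target.
spora = sum(availability(frozenset({e})) < RRL_TARGET for e in elements)
dpora = sum(availability(frozenset(p)) < RRL_TARGET for p in combinations(elements, 2))
print(spora, dpora)  # 0 12
```

In this hypothetical model no single or double fault interrupts the service, yet twelve double faults leave only one path and push the availability below the target.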
5.6.2 Recommendations
Comparing DCI models in terms of the number of SPoRA and DPoRA allows deeper insights into the
resilience characteristics than are achievable using the number of SPoF and DPoF. Particularly for the
optimization and/or comparison of different DCIs, these KPIs are cruc
...

Frequently Asked Questions

ISO/IEC TS 22237-31:2023 is a technical specification published by the International Organization for Standardization (ISO). Its full title is "Information technology — Data centre facilities and infrastructures — Part 31: Key performance indicators for resilience". This standard covers: This document: a) defines metrics as key performance indicators (KPIs) for resilience, dependability, fault tolerance and availability tolerance for data centres; b) covers the data centre infrastructure (DCI) of power distribution and supply, and environmental control; c) can be referred to for covering further infrastructures, e.g. telecommunications cabling; d) defines the measurement and calculation of the KPIs and resilience levels (RLs); e) targets maintainability, recoverability and vulnerability; f) provides examples for calculating these KPIs for the purpose of analytical comparison of different DCIs. This document does not apply to IT equipment, cloud services, software or business applications.

ISO/IEC TS 22237-31:2023 is classified under the following ICS (International Classification for Standards) categories: 35.020 - Information technology (IT) in general. The ICS classification helps identify the subject area and facilitates finding related standards.

ISO/IEC TS 22237-31:2023 has the following relationships with other standards: it has inter-standard links to ISO/IEC TS 22237-31. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.