Information technology — Data centre facilities and infrastructures — Part 31: Key performance indicators for resilience

This document: defines metrics as key performance indicators (KPIs) for resilience, dependability, fault tolerance and availability tolerance for data centres; covers the data centre infrastructure (DCI) of power distribution and supply, and environmental control; can be referred to for covering further infrastructures, e.g. telecommunications cabling; defines the measurement and calculation of the KPIs and resilience levels (RLs); targets maintainability, recoverability and vulnerability; provides examples for calculating these KPIs for the purpose of analytical comparison of different DCIs. This document does not apply to IT equipment, cloud services, software or business applications.

Technologie de l’information — Installation et infrastructures de centres de traitement de données — Partie 31: Indicateurs clés de performance pour la résilience

General Information

Status: Published
Publication Date: 02-Feb-2026
Current Stage: 6060 - International Standard published
Start Date: 03-Feb-2026
Due Date: 13-Dec-2026
Completion Date: 03-Feb-2026

Relations

Effective Date
23-Dec-2023

Overview

ISO/IEC TS 22237-31:2026 - "Information technology - Data centre facilities and infrastructures - Part 31: Key performance indicators for resilience" defines a framework of key performance indicators (KPIs) for resilience applicable to data centre facilities and infrastructures (DCI). The technical specification covers metrics and measurement methods for dependability, fault tolerance and availability tolerance specifically for DCI elements such as power distribution and supply and environmental control. It defines how to calculate resilience levels (RLs) and provides worked examples for analytical comparison of different DCIs. The document explicitly excludes IT equipment, cloud services, software and business applications.

Key topics and requirements

  • KPI definitions and metrics: Standardises KPIs for measuring resilience, including reliability, availability and failure rates for physical infrastructure.
  • Resilience levels (RLs): Defines how to measure and report operational resilience and reduced-resilience states.
  • Dependability, fault tolerance, availability tolerance: Provides structured KPIs and measurement approaches to assess maintainability, recoverability and vulnerability of infrastructure.
  • Scope of application: Focuses on DCIs such as power distribution/supply and environmental control; can be extended to other infrastructures (e.g., telecom cabling).
  • Measurement and calculation methods: Describes analysis techniques and examples for KPI computation; includes Reliability Block Diagrams (RBD) and Failure Mode, Effects and Criticality Analysis (FMECA) as normative approaches.
  • Lifecycle integration: Recommends how resilience KPIs are implemented across design, construction and operation phases and documented for ongoing management.
  • Supporting annexes: Informative annexes cover FMECA templates, dependability data, SPoF analysis, resilience analysis and confidence intervals for failure-rate estimation.

Applications and users

Who benefits:

  • Data centre owners and operators seeking to benchmark and certify resilience
  • Design engineers and consulting firms specifying resilient DCI architectures
  • Facility managers responsible for power and cooling reliability
  • Risk assessors, auditors and insurers evaluating infrastructure dependability
  • Procurement teams comparing DCI options on a like-for-like analytical basis

Practical uses:

  • Selecting and comparing power and environmental system configurations using standard KPIs
  • Setting resilience targets and SLA-aligned metrics for operations and maintenance
  • Documenting maintainability, recoverability and vulnerability for compliance and certification
  • Applying RBD and FMECA to quantify single points of failure (SPoF) and resilience impacts

Related standards

  • Part of the ISO/IEC 22237 series for data centre facilities and infrastructures.
  • Developed under ISO/IEC JTC 1/SC 39 (Secretariat: ANSI).

Keywords: ISO/IEC TS 22237-31, data centre resilience, KPIs for resilience, data center infrastructure (DCI), dependability, fault tolerance, availability, resilience levels, RBD, FMECA.

Technical specification

ISO/IEC TS 22237-31:2026 - Information technology — Data centre facilities and infrastructures — Part 31: Key performance indicators for resilience. Released: 03.02.2026

English language
60 pages

Frequently Asked Questions

ISO/IEC TS 22237-31:2026 is a technical specification published jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Its full title is "Information technology — Data centre facilities and infrastructures — Part 31: Key performance indicators for resilience". It defines KPIs for resilience, dependability, fault tolerance and availability tolerance for the data centre infrastructure (DCI) of power distribution and supply, and environmental control, together with the measurement and calculation of these KPIs and of resilience levels (RLs). It does not apply to IT equipment, cloud services, software or business applications.

ISO/IEC TS 22237-31:2026 is classified under the following ICS (International Classification for Standards) categories: 35.020 - Information technology (IT) in general. The ICS classification helps identify the subject area and facilitates finding related standards.

ISO/IEC TS 22237-31:2026 has the following relationship with other standards: it cancels and replaces the first edition, ISO/IEC TS 22237-31:2023. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.


Standards Content (Sample)


Technical Specification
ISO/IEC TS 22237-31
Second edition
2026-02
Information technology — Data centre facilities and infrastructures —
Part 31:
Key performance indicators for resilience
Technologie de l’information — Installation et infrastructures de
centres de traitement de données —
Partie 31: Indicateurs clés de performance pour la résilience
Reference number
© ISO/IEC 2026
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
© ISO/IEC 2026 – All rights reserved
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions, symbols and abbreviated terms
3.1 Terms and definitions
3.2 Symbols and abbreviated terms
3.2.1 Symbols
3.2.2 Abbreviated terms
4 Area of application
4.1 General
4.2 DCI service definition
5 Resilience considerations as part of the life cycle
5.1 Implementation in the design process
5.1.1 General
5.1.2 Phase 1 — Strategy
5.1.3 Phase 2 — Objectives
5.1.4 Phase 3 — System specifications
5.1.5 Phase 4 — Design proposal
5.1.6 Phase 6 — Functional design
5.1.7 Phase 8 — Final design and project plan
5.1.8 Phase 10 — Construction
5.1.9 Phase 11 — Operation
5.2 Documentation during operation
5.3 Documentation of resilience level
5.3.1 General
5.3.2 Requirements
5.4 Documentation of dependability
5.4.1 Requirements
5.4.2 Recommendations
5.5 Documentation of fault tolerance
5.6 Documentation of availability tolerance
5.6.1 Requirements
5.6.2 Recommendations
6 Determination of KPIs for resilience
6.1 General
6.2 Structuring of the KPIs for resilience
6.2.1 General
6.2.2 KPIs
6.2.3 Metrics
6.3 Dependability
6.3.1 Provided KPIs
6.3.2 Reliability
6.3.3 Availability
6.3.4 Failure rate
6.4 Fault tolerance
6.4.1 General
6.4.2 Single point of failure (SPoF)
6.4.3 Double point of failure (DPoF)
6.5 Availability tolerance
6.5.1 General
6.5.2 Single point of reduced availability (SPoRA)
6.5.3 Double point of reduced availability (DPoRA)
6.6 Resilience level (RL)
6.6.1 General
6.6.2 Operation at normal resilience level
6.6.3 Operation at reduced resilience level (RRL)
6.7 Application to data centre infrastructures
6.7.1 Methodology and analysis considerations
6.7.2 Analysis process
6.7.3 Method of reliability block diagrams (RBD)
6.7.4 Method of failure mode effects and criticality analysis (FMECA)
Annex A (informative) Failure mode effects and criticality analysis
Annex B (informative) Dependability data
Annex C (informative) Resilience analysis for DCIs
Annex D (informative) SPoF Analysis for DCIs
Annex E (informative) Resilience level analysis for DCIs
Annex F (informative) Interval of confidence
Bibliography

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations,
governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/
IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any
claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not
received notice of (a) patent(s) which may be required to implement this document. However, implementers
are cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held
responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 39, Sustainability, IT and data centres.
This second edition cancels and replaces the first edition (ISO/IEC TS 22237-31:2023), which has been
technically revised.
The main changes are as follows:
— Annex B was added to provide dependability data for items of data centre infrastructures;
— all subsequent annexes were reviewed and reordered;
— terms and definitions were clarified.
A list of all parts in the ISO/IEC 22237 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.

Introduction
The various parts of the ISO/IEC 22237 series reference four qualitative Availability Classes as well as
structural definitions to categorize different designs. The documents also refer to resilience criteria in order
to improve structural requirements for a qualitative approach.
In order to meet the requirements for evaluating or comparing different designs or for validating service
level agreements (SLAs) for data centres, this document introduces quantitative metrics as key performance
indicators (KPIs). The proposed KPIs cover resilience attributes, including dependability and fault tolerance
metrics. The characteristics of aging of infrastructures are covered by reliability criteria.
Through the use of KPIs, the comparison of designs, functional elements and components of infrastructure
designs becomes possible. In addition, it is possible to optimize data centre infrastructures (DCIs) with
holistic targets. It is recommended to use the KPIs of this document in combination with the efficiency and
sustainability KPIs of the ISO/IEC 30134 series.
ISO/IEC 22237-1:2021, Annex A, demonstrates that a single KPI, such as Availability, is not sufficient to
describe the complexity of a DCI. In recognition, this document has been developed in order to compare and
value different designs with different Availability Classes of DCIs based on a set of selected KPIs.
Furthermore, this document has been created to establish KPIs for resilience of DCIs with defined resilience
levels. The resilience objectives can vary depending on the outcome of the ISO/IEC 22237-1 risk analysis,
the process criticality of the end user's information technology equipment (ITE), and the data centre type of
business.
Using the different stages of a data centre design process, this document describes in which phases the
application of KPIs for resilience is appropriate. With its assistance, data centre designers, planners and
operators will be supported in defining resilience levels, performing theoretical assessments and designing
and operating DCIs which are able to meet SLAs.

Technical Specification ISO/IEC TS 22237-31:2026(en)
Information technology — Data centre facilities and
infrastructures —
Part 31:
Key performance indicators for resilience
1 Scope
This document:
a) defines metrics as key performance indicators (KPIs) for resilience, dependability, fault tolerance and
availability tolerance for data centres;
b) covers the data centre infrastructure (DCI) of power distribution and supply, and environmental
control;
c) can be referred to for covering further infrastructures, e.g. telecommunications cabling;
d) defines the measurement and calculation of the KPIs and resilience levels (RLs);
e) targets maintainability, recoverability and vulnerability;
f) provides examples for calculating these KPIs for the purpose of analytical comparison of different DCIs.
This document does not apply to IT equipment, cloud services, software or business applications.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 22237-1, Information technology — Data centre facilities and infrastructures — Part 1: General
concepts
ISO/IEC 22237-3, Information technology — Data centre facilities and infrastructures — Part 3: Power
distribution
ISO/IEC 22237-4, Information technology — Data centre facilities and infrastructures — Part 4: Environmental
control
ISO/IEC 30134-1, Information technology — Data centres — Key performance indicators — Part 1: Overview
and general requirements
3 Terms, definitions, symbols and abbreviated terms
3.1 Terms and definitions
For the purposes of this document, the terms and definitions given in ISO/IEC 22237-1, ISO/IEC 22237-3,
ISO/IEC 22237-4 and the following apply.

ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1.1
availability
ability to be in a state to perform as required
[SOURCE: IEC 60050-192:2015, 192-01-23, modified — Notes 1 and 2 to entry have been deleted.]
3.1.2
availability tolerance
ability to be in a state to perform as required with certain failures (3.1.8) present
3.1.3
dependability
ability to perform as and when required
Note 1 to entry: In this document, the term is used for the determination of data centre reliability (3.1.28), availability
(3.1.1) and failure rate (3.1.9).
[SOURCE: IEC 60050-192:2015, 192-01-22, modified — Notes 1 and 2 to entry have been replaced by a new
Note 1 to entry.]
3.1.4
double point of failure
DPoF
combination of two functional elements whose simultaneous failures (3.1.8) cause overall system fault
(3.1.10)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.5
double point of reduced availability
DPoRA
combination of two functional elements whose simultaneous failures (3.1.8) result in the violation of the
service level agreement (SLA) (3.1.30)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.6
down state
state of being unable to perform as required, due to failures (3.1.8) or faults (3.1.10)
Note 1 to entry: The state can be related to failures of items or faults at a specified operation point (OP) (3.1.21).
[SOURCE: IEC 60050-192:2015, 192-02-20]
3.1.7
event
something that happens and leads to one or more failures (3.1.8) or faults (3.1.10)
3.1.8
failure
loss of ability to perform as required
Note 1 to entry: In this context it is irrelevant if the cause was planned or unplanned.
[SOURCE: IEC 60050-192:2015, 192-03-01, modified — Notes 1 to 3 to entry have been replaced by Note 1 to
entry.]
3.1.9
failure rate
limit of the ratio of the conditional probability that the instant of time, T, of a failure (3.1.8) of a product falls
within a given time interval (3.1.34) (t, t + Δt) and the duration of this interval, Δt, when Δt tends towards
zero, given that the item is in an up state (3.1.35) at the start of the time interval
[SOURCE: IEC 60050-192:2015, 821-12-21]
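The wording of this definition can be written compactly as a limit. The following is a sketch only: it uses the symbols T and Δt from 3.2.1, while the notation λ(t) for a time-dependent failure rate and the conditioning event T > t (the item is still in the up state at time t) are assumptions made for illustration, not notation taken from the sampled text.

```latex
\lambda(t) = \lim_{\Delta t \to 0}
  \frac{P\left(t < T \le t + \Delta t \mid T > t\right)}{\Delta t}
```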
3.1.10
fault
inability to perform as required, due to an internal state
Note 1 to entry: Opposite of success. In the context of the expected resilience level (RL) (3.1.26), at a specified operation
point (OP) (3.1.21).
[SOURCE: IEC 60050-192:2015, 192-04-01]
3.1.11
fault tolerance
ability to continue functioning with certain faults (3.1.10) present
[SOURCE: IEC 60050-192:2015, 192-10-09]
3.1.12
information technology equipment
ITE
equipment providing data storage, processing and transport services together with equipment dedicated to
providing direct connection to either core or access networks or both
3.1.13
infrastructure
technical systems providing the functional capability of the data centre
Note 1 to entry: Examples are power distribution, environmental control, telecommunications cabling and physical
security.
[SOURCE: ISO/IEC 22237-1:2021, 3.1.21, modified — The term "telecommunications cabling" has been added
to the list in Note 1 to entry.]
3.1.14
inherent availability
availability (3.1.1) provided by the design under ideal conditions of operation and maintenance
[SOURCE: IEC 60050-192:2015, 192-08-02]
3.1.15
mean down time
MDT
average downtime caused by scheduled and unscheduled maintenance, including any logistics time
(expectations including detection time, diagnostic time, spare part delivery time, repair time)
[SOURCE: IEEE Std. 493-2007]
3.1.16
mean operating time between failures
MTBF
expectation of the duration of the operating time between failures (3.1.8)
Note 1 to entry: Mean operating time between failures should only be applied to repairable items. For non-repairable
items, see mean operating time to failure (3.1.17).
Note 2 to entry: The term “mean time between failures” (MTBF) is used synonymously in this document.

[SOURCE: IEC 60050-192:2015, 192-05-13]
3.1.17
mean time to failure
MTTF
expectation of the operating time to failure (3.1.8)
Note 1 to entry: In the case of non-repairable items with an exponential distribution of operating times to failure, i.e.
a constant failure rate (3.1.9), the mean operating time to failure is numerically equal to the reciprocal of the failure
rate. This is also true for repairable items if after restoration they can be considered to be "as-good-as-new".
[SOURCE: IEC 60050-192:2015, 192-05-11, modified — The term "operating" has been removed from the
preferred term and "MTTF" has been added as a preferred term.]
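Note 1 to entry of 3.1.17 can be illustrated numerically. The sketch below assumes a constant failure rate (exponential distribution of operating times to failure). The function names and the example failure rate are hypothetical, and the reliability expression exp(-λt) is the standard result for that distribution rather than a formula quoted from this document.

```python
import math


def mttf_from_failure_rate(failure_rate_per_hour: float) -> float:
    """Mean time to failure as the reciprocal of a constant failure rate."""
    if failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / failure_rate_per_hour


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of no failure in [0, hours] for a constant failure rate."""
    if failure_rate_per_hour < 0 or hours < 0:
        raise ValueError("failure rate and duration must be non-negative")
    return math.exp(-failure_rate_per_hour * hours)


# Hypothetical item with an assumed failure rate of 1e-5 failures per hour.
example_rate = 1e-5
example_mttf = mttf_from_failure_rate(example_rate)   # about 100 000 h
example_r_year = reliability(example_rate, 8760)      # reliability over one year
```

Under these assumptions, halving the failure rate doubles the MTTF, which is why the two quantities are often quoted interchangeably for items with a constant failure rate.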
3.1.18
mean time between maintenance
MTBM
average time between all maintenance events (3.1.7), scheduled and unscheduled, and also includes any
associated logistics time
[SOURCE: IEEE Std. 493-2007]
3.1.19
mean time to restoration
mean time to replace or repair a failed component
Note 1 to entry: Logistics time associated with the repair, such as parts acquisitions or crew mobilization, are not
included.
[SOURCE: IEEE Std. 493-2007]
3.1.20
normal resilience level
NRL
resilience level (3.1.26) mandatory during nominal operation
3.1.21
operation point
OP
point of reference for which calculation of resilience level (3.1.26) is performed
Note 1 to entry: This can be an individual socket taking into account the entire data centre infrastructure or certain
defined parts of the infrastructure (3.1.13), described by at least one system success path (3.1.33). The documentation
of the referenced operation point is required for any key performance indicator.
3.1.22
operational availability
availability (3.1.1) experienced under actual conditions of operation and maintenance
[SOURCE: IEC 60050-192:2015, 192-08-03, modified — Note 1 to entry has been deleted.]
3.1.23
past availability
availability (3.1.1) measured during a period of 1 year
Note 1 to entry: For the purposes of this document, 1 year equals 8 760 h.
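As an illustration of the one-year reference period, the sketch below derives a past availability figure from the recorded down time in a year of 8 760 h. The uptime-ratio formula and the function name are assumptions made for this example. The calculation rules of this document are given in 6.3.3, which is not part of the sample.

```python
HOURS_PER_YEAR = 8760  # reference period stated in Note 1 to entry of 3.1.23


def past_availability(down_hours: float) -> float:
    """Fraction of the one-year period spent in the up state (assumed formula)."""
    if not 0 <= down_hours <= HOURS_PER_YEAR:
        raise ValueError("down time must lie within the one-year period")
    return (HOURS_PER_YEAR - down_hours) / HOURS_PER_YEAR


# Hypothetical record: 4.38 h of down time at the operation point in one year.
example_availability = past_availability(4.38)
```

With 4.38 h of recorded down time, the result is approximately 0.9995.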
3.1.24
reduced resilience level
RRL
resilience level (3.1.26) mandatory during reduced operation in case of one or more failures (3.1.8)

3.1.25
resilience
ability to withstand and reduce the magnitude and/or duration of disruptive events (3.1.7), including the
capability to anticipate, absorb, adapt to, and/or rapidly recover from such an event
[SOURCE: IEEE Task Force on Definition and Quantification of Resilience, PES-TR65:2018-04 [2]]
3.1.26
resilience level
enumeration of attributes for the determination of resilience (3.1.25) aspects of a defined service at a defined
operation point (OP) (3.1.21)
3.1.27
redundancy
provision of more than one means for performing a function
Note 1 to entry: In a data centre, redundancy can be achieved by duplication of devices, functional elements, and/or
supply paths.
[SOURCE: IEC 60050-192:2015, 192-10-02, modified — Original Note 1 to entry has been replaced by a new
Note 1 to entry.]
3.1.28
reliability
ability to perform as required, without failure (3.1.8), for a given time interval (3.1.34), under given conditions
[SOURCE: IEC 60050-192:2015, 192-01-24, modified — Notes 1 to 3 to entry have been deleted.]
3.1.29
resilience model
representation x of the data centre infrastructure that shows all required subsystems, components and
items as well as their systemic interdependencies
3.1.30
service level agreement
SLA
agreement defining the content and quality of the service to be delivered and the timescale in which it is to
be delivered
[SOURCE: ISO/IEC TS 22237-7:2018, 3.1.20, modified — The term "SLA" has been added as a preferred term.]
3.1.31
single point of failure
SPoF
functional element whose failure (3.1.8) causes overall system fault (3.1.10)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.32
single point of reduced availability
SPoRA
functional element whose failure (3.1.8) results in the violation of the service level agreement (SLA) (3.1.30)
[SOURCE: IET, Journal of Engineering, Vol. 2019 Iss. 12, pp. 8419-8427 [1]]
3.1.33
system success path
infrastructural path, consisting of a minimum of functional elements, to express the success of the
infrastructure (3.1.13) system at the operation point (OP) (3.1.21) to be in the up state (3.1.35)
Note 1 to entry: Each functional element can consist of one or more devices.
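The definitions of SPoF (3.1.31), DPoF (3.1.4) and system success path (3.1.33) can be combined in a small enumeration. The sketch below is not the analysis method of this document (see 6.4 and Annex D). It assumes that the infrastructure at the OP is fully described by a list of system success paths, and that a pair containing an SPoF is not reported as a DPoF. All element names are hypothetical.

```python
from itertools import combinations


def system_up(success_paths, failed):
    """True if at least one success path contains no failed element."""
    return any(not (set(path) & failed) for path in success_paths)


def single_points_of_failure(success_paths):
    """Elements whose individual failure breaks every success path."""
    elements = set().union(*map(set, success_paths))
    return sorted(e for e in elements if not system_up(success_paths, {e}))


def double_points_of_failure(success_paths):
    """Pairs of non-SPoF elements whose joint failure breaks every path."""
    elements = set().union(*map(set, success_paths))
    spofs = set(single_points_of_failure(success_paths))
    candidates = sorted(elements - spofs)
    return [
        pair
        for pair in combinations(candidates, 2)
        if not system_up(success_paths, set(pair))
    ]


# Hypothetical 2N power chain sharing one utility feed and one socket (the OP).
example_paths = [
    ["utility", "ups_a", "pdu_a", "socket"],
    ["utility", "ups_b", "pdu_b", "socket"],
]
example_spofs = single_points_of_failure(example_paths)
example_dpofs = double_points_of_failure(example_paths)
```

For this example the shared elements `socket` and `utility` are reported as SPoFs, and every pair that takes one element from each redundant branch is reported as a DPoF.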

3.1.34
time interval
part of the time axis limited by two instants
[SOURCE: IEC 60050-192:2015, 113-01-10]
3.1.35
up state
state of being able to perform as required
Note 1 to entry: The state can be related to items or to a specified operation point (OP) (3.1.21).
[SOURCE: IEC 60050-192:2015, 192-02-01]
3.2 Symbols and abbreviated terms
3.2.1 Symbols
For the purposes of this document, the symbols given in ISO/IEC 22237-1, ISO/IEC 30134-1 and the following
apply.
$A_\mathrm{i}$  inherent availability
$A_\mathrm{o}$  operational availability
$A_\mathrm{o,NRL}$  normal resilience level operational availability
$A_\mathrm{o,req}$  required operational availability
$A_\mathrm{o,RRL}$  reduced resilience level operational availability
$A_\mathrm{p}$  past availability
$D_\mathrm{N}$  nominal diameter
$D(\mathbf{x})$  disjoint sum of system success paths of x
$e$  exponential probability density function
$f$  frequency
$f(t)$  probability density function
$I_\mathrm{N}$  nominal current
$N_\mathrm{f}$  number of failures during time interval t
$N_x$  number of x
$P_\mathrm{N}$  nominal power
$Q_\mathrm{N}$  nominal cooling capacity
$R(t)$  reliability in time interval t
$R_\mathrm{i}$  inherent reliability
$R_\mathrm{o}$  operational reliability
$R_\mathrm{p}$  past reliability
$S(\mathbf{x})$  success, data centre infrastructure represented by the vector x is in the up state
$S_\mathrm{E}(\mathbf{x})$  environmental control success function
$S_\mathrm{OP}(\mathbf{x})$  overall success function
$S_\mathrm{P}(\mathbf{x})$  power and distribution success function
$t_\mathrm{MDT}$  mean down time
$t_\mathrm{MTBF}$  mean time between failures
$t_\mathrm{MTBM}$  mean time between maintenance
$t_\mathrm{MTTR}$  mean time to restoration
$t_x$  time interval of x
$T$  instant of time
$U_\mathrm{N}$  nominal voltage
$\mathbf{x}_m$  vector of elements $x_{mi}$ of the m th data centre infrastructure
$x_{mi}$  functional element x of the m th data centre infrastructure with the index i
$X_m$  set of all functional elements x of the m th data centre infrastructure
$\alpha$  confidence rate
$\Delta t$  duration of time interval
$\lambda_\mathrm{i}$  inherent failure rate
$\lambda_\mathrm{mean}$  mean failure rate
$\lambda_\mathrm{o}$  operational failure rate
$\lambda_\mathrm{p}$  past failure rate
$\chi^2$  chi-square distribution function law with two degrees of freedom
3.2.2 Abbreviated terms
For the purposes of this document, the abbreviated terms given in ISO/IEC 22237-1, ISO/IEC 30134-1 and
the following apply.
DCI data centre infrastructure (infrastructure residing within a data centre)
DPoF double point of failure
DPoRA double point of reduced availability
FMECA failure mode effects and criticality analysis
ITE information technology equipment
KPI key performance indicator
MDT mean down time
MTBF mean operating time between failures

© ISO/IEC 2026 – All rights reserved
MTBM mean time between maintenance
MTTF mean time to failure
MTTR mean time to restoration
NRL normal resilience level
OP operation point
PDF probability density function
PREP power reliability enhancement program
RBD reliability block diagram
RL resilience level
RRL reduced resilience level
SLA service level agreement
SPoF single point of failure
SPoRA single point of reduced availability
SSP system success path
UPS uninterrupted power system
4 Area of application
4.1 General
The KPIs for resilience, including the dependability, fault tolerance and availability tolerance KPIs, as
specified in this document are associated with the following DCIs of the ISO/IEC 22237 series:
a) ISO/IEC 22237-3: Power supply and distribution;
b) ISO/IEC 22237-4: Environmental control.
The application can be extended to additional infrastructures, e.g. ISO/IEC TS 22237-5 (telecommunications
cabling infrastructure).
4.2 DCI service definition
To determine system success at the operation point (OP), it is required to define the relevant DCI. In general,
the overall success function $S_\mathrm{OP}(\mathbf{x})$ is represented by a certain number, $N$, of successes of infrastructures
inside the DCI as shown in Formula (1):
$$S_\mathrm{OP}(\mathbf{x}) = \bigcap_{m=1}^{N} S(\mathbf{x}_m) \qquad (1)$$
The success $S(\mathbf{x}_m)$ of the enumerated infrastructures $\mathbf{x}_m$ is connected by the ∩ operator. In general, these
infrastructures are not mutually exclusive, because the functions depend on each other. Functional
dependencies shall be taken into account in the calculations.

To operate the information technology equipment (ITE) within the permitted parameters, the service
success requires:
— adequate service quality of the power supply and distribution, fed by the sockets;
— adequate service quality of the cooling by the environmental control.
The DCI is represented by the vector x , which refers to Formula (1). The operation of the DCI is considered
to be successful if power supply and distribution S x and environmental control S x are by themselves
P E
operating successfully at the specified OP. Formula (2) defines the system success function as follows:
SSxx  Sx (2)
OP PE
The operation of the power supply and distribution system is deemed successful, S x 1 , if the

P
infrastructure provides the required power quality to the specific socket defined as OP. A violation of the
power quality, as required by the ITE at a specific socket, is defined as a failure: Sx 0 . The cause of the
P
failure can be planned or unplanned.
The operation of the environmental control system is deemed successful, $S_{\mathrm{E}}(x) = 1$, if the environmental
requirements of the ITE at the specified socket defined as OP are satisfied. A violation of the environmental
conditions of a specific functional element or device is defined as a failure: $S_{\mathrm{E}}(x) = 0$. The cause of the failure
can be planned or unplanned.
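The evaluation of Formula (2) can be sketched as follows, assuming that each service is reduced to a single flag indicating whether the ITE requirements at the OP are met. The class `ServiceState` and the functions `success` and `op_success` are hypothetical names introduced for illustration only. The sketch reflects that a violation counts as a failure whether it is planned or unplanned.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    within_requirements: bool  # True if the ITE requirements at the OP are met
    planned: bool = False      # records whether a violation was planned (informational only)


def success(state: ServiceState) -> int:
    """Return 1 for success and 0 for failure; planned violations still count as failures."""
    return 1 if state.within_requirements else 0


def op_success(power: ServiceState, environment: ServiceState) -> int:
    """Both services must succeed for the DCI to be successful at the OP."""
    return success(power) & success(environment)


# Hypothetical example: power is fine, cooling is interrupted by planned maintenance.
power = ServiceState(within_requirements=True)
cooling = ServiceState(within_requirements=False, planned=True)
print(op_success(power, cooling))  # 0
```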
A failure or the combination of failures which lead to $S_{\mathrm{OP}}(x) = 0$ is deemed as fault. For calculation purposes
using Formula (2), the following criteria shall be taken into account:
a) The power and cooling capacity of the entire DCI shall be specified.
b) The OP shall be selected in relation to the outcome of the risk analysis.
c) The specified power and cooling capacity shall be given for the selected OP.
d) The service quality of power supply and distribution and environmental control at the selected OP shall
be represented by the DCI model.
The selection of the OP depends on the specific task. In general, the OPs with the highest requirements of
service quality are of relevance.
5 Resilience considerations as part of the life cycle
5.1 Implementation in the design process
5.1.1 General
According to ISO/IEC 22237-1, the data centre design process is split into 11 project phases. The resilience of
the DCI can be managed throughout the life cycle, from the strategy phase (1) to the operation phase (11). In
particular, the usage of the KPIs for resilience covers the phases outlined in this clause.
5.1.2 Phase 1 — Strategy
Phase 1 is for information collection in order to define the project objectives. This phase requires the
following:
a) Gather the requirements, e.g. SLAs.
b) Decide about application of resilience KPIs for design.
c) Decide about application of resilience KPIs for operation.
d) Define the DCI services for application of KPIs for resilience.
5.1.3 Phase 2 — Objectives
Phase 2 is handled by the owner to convert the strategy into objectives. This phase requires the definition of
the resilience objectives according to the risk analysis and the respective SLAs.
a) Define the OP, e.g. protected/non-protected sockets, server racks, rack rows.
b) Define the maximum accepted downtime at the OP, for example:
1) the maximum time interval of loss of the power supply (see ISO/IEC 22237-3);
2) the maximum time interval of loss of the power distribution (see ISO/IEC 22237-3);
3) supply boundary that ITE can tolerate without experiencing unexpected shutdowns or malfunctions
(see Reference [3]);
4) the maximum time interval of loss of the environmental control (see ISO/IEC 22237-4);
5) the maximum time of fault of the entire DCI.
c) Define the maximum accepted failure rate at the OP deemed as faults during the time interval of
reporting.
d) Define the set of KPIs depending on the resilience objective, for example:
1) dependability requirements (reliability, availability, failure rate);
2) fault tolerance requirements (number of SPoF, number of DPoF);
3) availability tolerance requirements (number of SPoRA, number of DPoRA).
Resilience objectives can be defined by making the provisions of 6.6 mandatory during
nominal operation (NRL) and during reduced operation (RRL).
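The objectives listed for Phase 2 can be recorded in a simple data structure, as in the following sketch. All names (`ResilienceObjectives`, its fields, the service keys) and all numerical values are hypothetical and serve only to illustrate the structure; they are not requirements of this document.

```python
from dataclasses import dataclass, field


@dataclass
class ResilienceObjectives:
    operation_point: str               # the selected OP, e.g. a protected socket
    max_downtime_h: dict[str, float]   # maximum accepted downtime per service, in hours
    max_faults_per_period: int         # maximum accepted number of faults per reporting interval
    kpis: list[str] = field(default_factory=list)  # selected set of KPIs

    def violated(self, service: str, downtime_h: float) -> bool:
        """Return True if an observed downtime exceeds the accepted maximum."""
        return downtime_h > self.max_downtime_h[service]


# Illustrative values only.
objectives = ResilienceObjectives(
    operation_point="protected socket, rack row A",
    max_downtime_h={"power_supply": 0.01, "environmental_control": 0.25},
    max_faults_per_period=1,
    kpis=["availability", "failure rate", "number of SPoF"],
)
print(objectives.violated("environmental_control", 0.5))  # True
```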
5.1.4 Phase 3 — System specifications
Phase 3 defines the target specifications for all infrastructures. The output of the specifications shall be
validated in accordance with the objectives of Phase 2.
5.1.5 Phase 4 — Design proposal
Phase 4 offers several options for a design proposal. This phase requires the following:
a) Compare/optimize different designs through the application of KPIs for resilience.
b) Approve compliance of the designs with the defined requirements.
5.1.6 Phase 6 — Functional design
Phase 6 offers the functional design. This phase requires the following:
a) Approve the functional design through the application of KPIs for resilience.
5.1.7 Phase 8 — Final design and project plan
During Phase 8, the designer defines volume and/or pieces for all items of the DCI. To meet the resilience
objectives, the definitions made in previous phases shall be taken into account, with the help of the applied
KPIs for resilience.
5.1.8 Phase 10 — Construction
Phase 10 includes supervision and acceptance verification of the DCI, until it is put into service. The resilience
objectives shall be taken into account during the following:
a) factory acceptance tests (FATs);
b) equipment transportation and installation on site;
c) commissioning tests, such as functional performance tests (FPT) and integrated system tests (IST);
d) failure simulations on functional elements;
e) failure simulations on the entire DCI.
The outcome of this phase is deeper knowledge of the resilience properties of the DCI.
5.1.9 Phase 11 — Operation
Phase 11 describes the handover to the owner for operation. This phase requires the following:
a) Approve compliance of the DCI with the assumptions of the KPIs used.
b) Monitor the defined KPIs of resilience during operation.
c) Approve compliance of the DCI with the defined requirements in case of planned interruptions, times for
logistics and response times.
d) Review and, if required, recalculate the KPIs for resilience of the DCI.
5.2 Documentation during operation
Documentation of metrics and causes is the basis for optimization of resilience during operation. In order
to be able to monitor aspects of resilience, the organization shall document the following metrics:
a) MTBF and MTTR of the utility supply;
b) MTBF, MTTR, MTBM and MDT data of the functional elements or components;
c) causes for failures and/or faults;
d) causes and scope of restoration.
For evaluation and documentation of failures, the failure mode, effects and criticality analysis (FMECA) is
applicable. See Annex A.
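The following sketch shows one way to derive MTBF and MTTR from a documented failure log. It assumes the common convention that MTBF is the accumulated operating time divided by the number of failures and that MTTR is the accumulated repair time divided by the number of failures; the definitions given in this document take precedence. The function name `mtbf_mttr` and the example data are hypothetical.

```python
def mtbf_mttr(observation_h: float, repair_durations_h: list[float]) -> tuple[float, float]:
    """Return (MTBF, MTTR) in hours for one functional element.

    observation_h: length of the observation period in hours
    repair_durations_h: duration of each documented repair in hours
    """
    failures = len(repair_durations_h)
    if failures == 0:
        raise ValueError("no failures recorded; MTBF and MTTR are undefined")
    downtime = sum(repair_durations_h)
    uptime = observation_h - downtime
    return uptime / failures, downtime / failures


# Hypothetical log: three failures within one year (8 760 h).
mtbf, mttr = mtbf_mttr(8760.0, [2.0, 4.0, 6.0])
print(mtbf, mttr)  # 2916.0 4.0
```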
5.3 Documentation of resilience level
5.3.1 General
In order to evaluate KPIs for resilience, the following information shall be provided:
a) the resilience model of the DCI;
b) the OPs studied and their load assumptions;
c) the MTBF, MTTR, MTBM and MDT data of the functional elements or components;
d) the number of SPoF and DPoF;
e) if applicable, the number of SPoRA and DPoRA;
f) the calculation method.
Periods of runtime shall be documented on an annual basis, where 1 a = 8 760 h.
The recalculation of the resilience KPIs is required after an incident that involves structural modifications
as well as modifications on functional elements. Structural change requires the review and, if necessary, the
revision of the resilience model.
5.3.2 Requirements
Cause and duration of violations of the resilience level shall be documented to calculate the past reliability,
past availability and past failure rate.
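As an illustration, past availability and past failure rate can be derived from the documented violations of one reporting year as sketched below. The sketch assumes that past availability is the share of the 8 760 h period without violation and that the past failure rate is the number of violations divided by the period length; these are simplifying assumptions, and past reliability is omitted because its calculation is not given in this subclause. The names `past_kpis` and `HOURS_PER_YEAR` and the example causes are hypothetical.

```python
HOURS_PER_YEAR = 8760.0  # 1 a = 8 760 h


def past_kpis(violations: list[tuple[str, float]]) -> tuple[float, float]:
    """Return (past availability, past failure rate per hour) for one year.

    violations: documented (cause, duration in hours) pairs
    """
    downtime = sum(duration for _cause, duration in violations)
    if downtime > HOURS_PER_YEAR:
        raise ValueError("documented downtime exceeds the reporting period")
    availability = 1.0 - downtime / HOURS_PER_YEAR
    failure_rate = len(violations) / HOURS_PER_YEAR
    return availability, failure_rate


# Hypothetical record of one reporting year.
log = [("UPS fault", 0.5), ("chiller maintenance", 1.69)]
a_p, lambda_p = past_kpis(log)
print(f"A_p = {a_p * 100:.4f} %, lambda_p = {lambda_p:.6f} 1/h")
```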
5.4 Documentation of dependability
5.4.1 Requirements
In general, reliability, availability and failure rate shall be reported at a minimum of four and a maximum
of six decimal places. The chosen OP and the load assumption of the DCI shall always be quoted alongside
documented values.
To gauge the availability KPI, a corresponding NRL shall be defined.
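A reporting helper that enforces the range of four to six decimal places and always quotes the OP and the load assumption could look like the following sketch. The function name `format_kpi` and the example values are hypothetical.

```python
def format_kpi(name: str, value: float, op: str, load: str, places: int = 6) -> str:
    """Format a dependability KPI with 4 to 6 decimal places, quoting OP and load assumption."""
    if not 4 <= places <= 6:
        raise ValueError("dependability KPIs are reported with 4 to 6 decimal places")
    return f"{name} = {value:.{places}f} (OP: {op}; load assumption: {load})"


# Hypothetical example values.
print(format_kpi("A", 0.99975, "protected socket, rack row A", "80 %"))
# A = 0.999750 (OP: protected socket, rack row A; load assumption: 80 %)
```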
5.4.2 Recommendations
To distinguish between calculated availabilities, i.e. the inherent availability, the operational availability
and the measured past availability of a data centre in operation, the measurement of $A_{\mathrm{p}}$ (past availability)
should be documented in percentage terms. This is also applicable to the measurement of the past reliability,
$R_{\mathrm{p}}$, and the past failure rate, $\lambda_{\mathrm{p}}$.
An RRL during periods of planned reconstruction, adaptation or renewal should be defined.
To avoid rounding errors, the data of the system's items should be used with a precision at least one order
of magnitude higher than that of the KPIs to be calculated.
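The effect of this recommendation can be illustrated with a small numerical sketch. It assumes, for illustration only, ten independent items in series, so that the system availability is the product of the item availabilities; the item value 0.99996 and the function name `series_availability` are hypothetical.

```python
import math


def series_availability(item_availabilities: list[float]) -> float:
    """Availability of independent items in series (illustrative assumption)."""
    return math.prod(item_availabilities)


items = [0.99996] * 10   # hypothetical item availabilities
kpi_places = 4           # KPI reported with four decimal places

# Item data rounded to the same precision as the KPI.
coarse = round(series_availability([round(a, kpi_places) for a in items]), kpi_places)
# Item data kept one order of magnitude more precise than the KPI.
fine = round(series_availability([round(a, kpi_places + 1) for a in items]), kpi_places)

print(coarse, fine)  # 1.0 0.9996
```

With item data rounded to the precision of the KPI, the calculated availability is overstated as 1.0000; with one additional decimal place it is 0.9996.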
5.5 Documentation of fault tolerance
The number of SPoF and DPoF shall be documented as integers; see Formulae (14) and (15). Based on the
resilience model of the DCI, the KPIs of SPoF and DPoF sha
...