ASTM E2935-21
Standard Practice for Evaluating Equivalence of Two Testing Processes
ABSTRACT
This practice provides statistical methodology for conducting equivalence testing on numerical data from two sources to determine if their true means or variances differ by no more than predetermined limits. This standard provides guidance on experiments and statistical methods needed to demonstrate that the test results from a modified testing process are equivalent to those from the current testing process, where equivalence is defined as agreement within a prescribed limit, termed an equivalence limit.
SIGNIFICANCE AND USE
4.1 Laboratories conducting routine testing have a continuing need to make improvements in their testing processes. In these situations it must be demonstrated that any changes will neither cause an undesirable shift in the test results from the current testing process nor substantially affect a performance characteristic of the test method. This standard provides guidance on experiments and statistical methods needed to demonstrate that the test results from a modified testing process are equivalent to those from the current testing process, where equivalence is defined as agreement within a prescribed limit, termed an equivalence limit.
4.1.1 The equivalence limit, which represents a worst-case difference or ratio, is determined prior to the equivalence test and its value is usually set by consensus among subject-matter experts.
4.1.2 Examples of modifications to the testing process include, but are not limited to, the following:
(1) Changes to operating levels in the steps of the test method procedure,
(2) Installation of new instruments, apparatus, or sources of reagents and test materials,
(3) Evaluation of new personnel performing the testing, and
(4) Transfer of testing to a new location.
4.1.3 Examples of performance characteristics directly applicable to the test method include bias, precision, sensitivity, specificity, linearity, and range. Additional characteristics are test cost and elapsed time needed to conduct the test procedure.
4.2 Equivalence studies are performed by a designed experiment that generates test results from the modified and current testing procedures on the same types of materials that are routinely tested. The design of the experiment depends on the type of equivalence needed as discussed below. Experiment design and execution for various objectives is discussed in Section 5.
4.2.1 Means equivalence is concerned with a potential shift in the mean test result in either direction due to a modification in the te...
SCOPE
1.1 This practice provides statistical methodology for conducting equivalence studies on numerical data from two sources of test results to determine if their true means, variances, or other parameters differ by no more than predetermined limits.
1.2 Applications include (1) equivalence studies for bias against an accepted reference value, (2) determining means equivalence of two test methods, test apparatus, instruments, reagent sources, or operators within a laboratory or equivalence of two laboratories in a method transfer, and (3) determining non-inferiority of a modified test procedure versus a current test procedure with respect to a performance characteristic.
1.3 The guidance in this standard applies to experiments conducted either on a single material at a given level of the test result or on multiple materials covering a selected range of test results.
1.4 Guidance is given for determining the amount of data required for an equivalence study. The control of risks associated with the equivalence decision is discussed.
1.5 The values stated in SI units are to be regarded as standard. No other units of measurement are included in this standard.
1.6 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine t...
General Information
- Status: Published
- Publication Date: 31-May-2021
- Technical Committee: E11 - Quality and Statistics
- Drafting Committee: E11.20 - Test Method Evaluation and Quality Control
Relations
- Effective Date: 01-Nov-2023
- Effective Date: 01-Nov-2023
- Effective Date: 01-Apr-2022
- Effective Date: 01-Sep-2019
- Effective Date: 01-Apr-2019
- Effective Date: 01-Nov-2017
- Effective Date: 01-Oct-2017
- Effective Date: 01-Oct-2017
- Effective Date: 01-Nov-2016
- Effective Date: 01-Oct-2014
- Effective Date: 01-Jun-2014
- Effective Date: 01-May-2014
- Effective Date: 15-Nov-2013
- Effective Date: 15-Nov-2013
- Effective Date: 15-Nov-2013
Overview
ASTM E2935-21, Standard Practice for Evaluating Equivalence of Two Testing Processes, provides laboratories and quality professionals with a statistical methodology for equivalence testing of numerical data originating from two different testing processes. The main objective of this standard is to determine whether the means, variances, or other statistical characteristics of results from a modified testing process differ from those of the current (or reference) process by no more than a predetermined, acceptable limit, referred to as the "equivalence limit." This is essential for validating changes in test methods, equipment, or procedures in regulated and quality-driven environments.
Key Topics
Equivalence Testing Fundamentals
- Defines equivalence as agreement within a prescribed equivalence limit, set by consensus among domain experts.
- Supports equivalence studies for means, variances, bias, and other parameters.
Experiment Design and Statistical Methods
- Guidance on designing robust experiments, including how to collect and analyze data from two sources.
- Utilizes paired and independent sample designs, as well as single sample bias equivalence studies.
- Advises on selecting appropriate sample sizes and controlling statistical risks (consumer’s risk and producer’s risk).
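The interplay between sample size and the two risks can be illustrated with a rough calculation. The sketch below is not the procedure given in the standard. It assumes a two independent samples design with equal sample sizes, a known standard deviation `sigma`, symmetric equivalence limits of plus or minus `limit`, and a normal approximation to the power of a two one-sided tests procedure. The function names and default risk values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def approx_tost_power(n, sigma, limit, true_diff=0.0, alpha=0.05):
    """Approximate probability of accepting equivalence (power).

    Assumptions (not taken from the standard): normal approximation,
    known sigma, n results per process, symmetric limits -limit..+limit.
    alpha is the consumer's risk.
    """
    nd = NormalDist()
    se = sigma * math.sqrt(2.0 / n)      # standard error of the mean difference
    z = nd.inv_cdf(1.0 - alpha)          # (1 - alpha) percentile of N(0, 1)
    power = (nd.cdf((limit - true_diff) / se - z)
             + nd.cdf((limit + true_diff) / se - z) - 1.0)
    return max(0.0, power)


def smallest_n(sigma, limit, true_diff=0.0, alpha=0.05, beta=0.10, n_max=10_000):
    """Smallest per-process sample size whose approximate power is at least
    1 - beta, where beta is the producer's risk. Returns None if n_max is hit."""
    for n in range(2, n_max + 1):
        if approx_tost_power(n, sigma, limit, true_diff, alpha) >= 1.0 - beta:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical inputs: sigma equal to the equivalence limit, no true shift.
    print(smallest_n(sigma=1.0, limit=1.0))
```

Because the approximation treats the standard deviation as known, it tends to understate the sample size needed when the standard deviation is estimated from small samples; it is useful mainly for seeing how a tighter equivalence limit or a smaller risk drives the amount of data upward.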
Types of Equivalence Addressed
- Means Equivalence: Ensures average test results of two processes are similar within the equivalence limit.
- Variance/Precision Equivalence: Confirms consistency and reproducibility are not compromised.
- Slope and Range Equivalence: Validates linearity and comparable performance across a range of values.
- Non-Inferiority Testing: Confirms the modified process is not unacceptably worse than the current process for key performance attributes.
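As an illustration of means equivalence, the following is a minimal sketch of a generic two one-sided tests (TOST) calculation for two independent samples, using a pooled standard deviation and a symmetric equivalence limit. It is a textbook form and is not claimed to reproduce the exact steps of the standard. The function name, the sample data, and the choice to pass the Student's t percentile in as `t_crit` (looked up from a table by the caller) are assumptions made for the example.

```python
import math
from statistics import mean, variance


def tost_independent(current, modified, limit, t_crit):
    """Generic TOST for means equivalence with a pooled standard deviation.

    current, modified : test results from the two testing processes
    limit             : symmetric equivalence limit E (> 0), fixed in advance
    t_crit            : (1 - alpha) percentile of Student's t with
                        n1 + n2 - 2 degrees of freedom, supplied by the caller
    """
    n1, n2 = len(current), len(modified)
    f = n1 + n2 - 2                                   # pooled degrees of freedom
    diff = mean(modified) - mean(current)             # difference in sample means
    s_p = math.sqrt(((n1 - 1) * variance(current)
                     + (n2 - 1) * variance(modified)) / f)
    s_diff = s_p * math.sqrt(1.0 / n1 + 1.0 / n2)     # std. error of the difference
    lcl = diff - t_crit * s_diff                      # lower confidence limit
    ucl = diff + t_crit * s_diff                      # upper confidence limit
    # Equivalence is accepted only if both limits fall inside (-E, +E).
    return {"difference": diff, "lcl": lcl, "ucl": ucl,
            "equivalent": (-limit < lcl) and (ucl < limit)}


if __name__ == "__main__":
    # Hypothetical data; 1.812 is the 95th percentile of t with 10 degrees of freedom.
    current = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0]
    modified = [10.0, 10.1, 9.9, 10.3, 10.0, 10.1]
    print(tost_independent(current, modified, limit=0.5, t_crit=1.812))
```

Passing the critical value in keeps the sketch free of third-party dependencies; a statistics library could supply the percentile instead.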
Performance Characteristics
- Considers bias, precision, sensitivity, specificity, linearity, range, as well as test cost and duration.
Applications
This standard is widely applicable across laboratory and manufacturing settings where continuous improvement, validation, and technology transfer drive better efficiency and quality. Key uses include:
- Test Method Changes: Justifying modifications to operating parameters, instruments, reagents, or procedures without compromising results.
- Equipment Upgrades: Validating new or replacement instruments or apparatus in routine testing.
- Personnel and Site Transfers: Supporting equivalence when new personnel or laboratory locations are introduced.
- Quality and Compliance: Meeting regulatory requirements for demonstrating method equivalency in regulated industries.
- Method Transfer: Facilitating the reliable transfer of analytic methods between laboratories or to contract testing organizations.
Industries utilizing ASTM E2935-21 include pharmaceuticals, chemicals, environmental testing, food safety, and other sectors where measurement assurance and quality control are essential.
Related Standards
ASTM E2935-21 references and complements several other standards, including:
- ASTM E122: Calculating sample size for specified precision.
- ASTM E177: Use of terms precision and bias in test methods.
- ASTM E456: Quality and statistics terminology.
- ASTM E2282: Defining the test result of a test method.
- ASTM E2586: Calculating and using basic statistics.
- ASTM E3080: Regression analysis with a single predictor variable.
- USP <1223>: Validation of alternative microbiological methods.
These related standards provide additional guidance on sampling, analysis, terminology, and validation best practices for quality and laboratory professionals.
Keywords: ASTM E2935-21, equivalence testing, laboratory method validation, statistical equivalence, means equivalence, variance testing, method transfer, quality control, risk management, experimental design, test method comparison.
Frequently Asked Questions
ASTM E2935-21 is a standard published by ASTM International. Its full title is "Standard Practice for Evaluating Equivalence of Two Testing Processes".
ASTM E2935-21 is classified under the following ICS (International Classification for Standards) categories: 03.120.30 - Application of statistical methods. The ICS classification helps identify the subject area and facilitates finding related standards.
ASTM E2935-21 has the following relationships with other standards: it is linked to ASTM E2282-23, ASTM E3080-23, ASTM E456-13a(2022)e1, ASTM E3080-19, ASTM E2586-19e1, ASTM E3080-17, ASTM E456-13A(2017)e1, ASTM E456-13A(2017)e3, ASTM E3080-16, ASTM E2282-14, ASTM E2586-14, ASTM E177-14, ASTM E456-13ae1, ASTM E456-13a, ASTM E456-13ae3. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.
Standards Content (Sample)
This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
Designation: E2935 − 21
An American National Standard
Standard Practice for Evaluating Equivalence of Two Testing Processes
This standard is issued under the fixed designation E2935; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.
1. Scope
1.1 This practice provides statistical methodology for conducting equivalence studies on numerical data from two sources of test results to determine if their true means, variances, or other parameters differ by no more than predetermined limits.
1.2 Applications include (1) equivalence studies for bias against an accepted reference value, (2) determining means equivalence of two test methods, test apparatus, instruments, reagent sources, or operators within a laboratory or equivalence of two laboratories in a method transfer, and (3) determining non-inferiority of a modified test procedure versus a current test procedure with respect to a performance characteristic.
1.3 The guidance in this standard applies to experiments conducted either on a single material at a given level of the test result or on multiple materials covering a selected range of test results.
1.4 Guidance is given for determining the amount of data required for an equivalence study. The control of risks associated with the equivalence decision is discussed.
1.5 The values stated in SI units are to be regarded as standard. No other units of measurement are included in this standard.
1.6 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.
1.7 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
2. Referenced Documents
2.1 ASTM Standards:
E122 Practice for Calculating Sample Size to Estimate, With Specified Precision, the Average for a Characteristic of a Lot or Process
E177 Practice for Use of the Terms Precision and Bias in ASTM Test Methods
E456 Terminology Relating to Quality and Statistics
E2282 Guide for Defining the Test Result of a Test Method
E2586 Practice for Calculating and Using Basic Statistics
E3080 Practice for Regression Analysis with a Single Predictor Variable
2.2 USP Standard:
USP <1223> Validation of Alternative Microbiological Methods
Footnotes:
1 This test method is under the jurisdiction of ASTM Committee E11 on Quality and Statistics and is the direct responsibility of Subcommittee E11.20 on Test Method Evaluation and Quality Control. Current edition approved June 1, 2021. Published June 2021. Originally approved in 2013. Last previous edition approved in 2020 as E2935 – 20ε1. DOI: 10.1520/E2935-21.
2 For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard's Document Summary page on the ASTM website.
3 Available from U.S. Pharmacopeial Convention (USP), 12601 Twinbrook Pkwy., Rockville, MD 20852-1790, http://www.usp.org.
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
3. Terminology
3.1 Definitions: See Terminology E456 for a more extensive listing of statistical terms.
3.1.1 accepted reference value, n: a value that serves as an agreed-upon reference for comparison, and which is derived as: (1) a theoretical or established value, based on scientific principles, (2) an assigned or certified value, based on experimental work of some national or international organization, or (3) a consensus or certified value, based on collaborative experimental work under the auspices of a scientific or engineering group. E177
3.1.2 bias, n: the difference between the expectation of the test results and an accepted reference value. E177
3.1.3 confidence interval, n: an interval estimate [L, U] with the statistics L and U as limits for the parameter θ and with confidence level 1 – α, where Pr(L ≤ θ ≤ U) ≥ 1 – α. E2586
3.1.3.1 Discussion: The confidence level, 1 – α, reflects the proportion of cases that the confidence interval [L, U] would contain or cover the true parameter value in a series of repeated random samples under identical conditions. Once L and U are
given values, the resulting confidence interval either does or does not contain it. In this sense "confidence" applies not to the particular interval but only to the long run proportion of cases when repeating the procedure many times.
3.1.4 confidence level, n: the value, 1 – α, of the probability associated with a confidence interval, often expressed as a percentage. E2586
3.1.4.1 Discussion: α is generally a small number. Confidence level is often 95 % or 99 %.
3.1.5 confidence limit, n: each of the limits, L and U, of a confidence interval, or the limit of a one-sided confidence interval. E2586
3.1.6 degrees of freedom, n: the number of independent data points minus the number of parameters that have to be estimated before calculating the variance. E2586
3.1.7 equivalence, n: condition that two population parameters differ by no more than predetermined limits.
3.1.8 intermediate precision conditions, n: conditions under which test results are obtained with the same test method using test units or test specimens taken at random from a single quantity of material that is as nearly homogeneous as possible, and with changing conditions such as operator, measuring equipment, location within the laboratory, and time. E177
3.1.9 mean, n: of a population, µ, average or expected value of a characteristic in a population; of a sample, X̄, sum of the observed values in the sample divided by the sample size. E2586
3.1.10 percentile, n: quantile of a sample or a population, for which the fraction less than or equal to the value is expressed as a percentage. E2586
3.1.11 population, n: the totality of items or units of material under consideration. E2586
3.1.12 population parameter, n: summary measure of the values of some characteristic of a population. E2586
3.1.13 precision, n: the closeness of agreement between independent test results obtained under stipulated conditions. E177
3.1.14 quantile, n: value such that a fraction f of the sample or population is less than or equal to that value. E2586
3.1.15 repeatability, n: precision of test results from tests conducted within the shortest practical time period on identical material by the same test method in a single laboratory with all known sources of variability conditions controlled at the same levels (see repeatability conditions). E177
3.1.16 repeatability conditions, n: conditions where independent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the same equipment within short intervals of time. E177
3.1.17 repeatability standard deviation (s_r), n: the standard deviation of test results obtained under repeatability conditions. E177
3.1.18 sample, n: a group of observations or test results, taken from a larger collection of observations or test results, which serves to provide information that may be used as a basis for making a decision concerning the larger collection. E2586
3.1.19 sample size, n, n: number of observed values in the sample. E2586
3.1.20 sample statistic, n: summary measure of the observed values of a sample. E2586
3.1.21 standard deviation: of a population, σ, the square root of the average or expected value of the squared deviation of a variable from its mean; of a sample, s, the square root of the sum of the squared deviations of the observed values in the sample from their mean divided by the sample size minus 1. E2586
3.1.22 test result, n: the value of a characteristic obtained by carrying out a specified test method. E2282
3.1.23 test unit, n: the total quantity of material (containing one or more test specimens) needed to obtain a test result as specified in the test method. See test result. E2282
3.1.24 variance, σ², s², n: square of the standard deviation of the population or sample. E2586
3.2 Definitions of Terms Specific to This Standard:
3.2.1 bias equivalence, n: equivalence of a population mean with an accepted reference value.
3.2.2 equivalence limit, E, n: in equivalence testing, a limit on the difference between two population parameters.
3.2.2.1 Discussion: In certain applications, this may be termed practical limit or practical difference.
3.2.3 equivalence test, n: a statistical test conducted within predetermined risks to confirm equivalence of two population parameters.
3.2.4 means equivalence, n: equivalence of two population means.
3.2.5 non-inferiority, n: condition that the difference in means or variances of test results between a modified testing process and a current testing process with respect to a performance characteristic is no greater than a predetermined limit in the direction of inferiority of the modified process to the current process.
3.2.5.1 Discussion: Other terms used for non-inferior are "equivalent or better" or "at least equivalent as."
3.2.6 paired samples design, n: in means equivalence testing, single samples are taken from the two populations at a number of sampling points.
3.2.7 power, n: in equivalence testing, the probability of accepting equivalence, given the true difference between two population means.
3.2.7.1 Discussion: In the case of testing for bias equivalence the power is the probability of accepting equivalence, given the true difference between a population mean and an accepted reference value.
3.2.8 range equivalence, n: equivalence of two population means over a range of test result values.
3.2.9 slope equivalence, n: equivalence of the slope of a linear statistical relationship with the value one (1).
3.2.10 two independent samples design, n: in means equivalence testing, replicate test results are determined independently from two populations at a single sampling time for each population.
3.2.10.1 Discussion: This design is termed a completely randomized design for a general number of sampled populations.
3.2.11 two one-sided tests (TOST) procedure, n: a statistical procedure used for testing the equivalence of the parameters from two distributions (see equivalence).
3.3 Symbols:
B = bias (9.1.1)
b_0 = intercept estimate (8.3.1.4)
b_1 = slope estimate (8.3.1.3)
d_j = difference between a pair of test results at sampling point j (7.1.1)
d̄ = average difference (7.1.1)
D = difference in sample means (6.1.2)(A1.1.2)
E = equivalence limit (5.2.1)
E_1 = lower equivalence limit (5.2.1)
E_2 = upper equivalence limit (5.2.1)
e_i = residual estimate (8.6.2)
f = degrees of freedom for s (9.1.1)(A1.1.2)
F_{1–α} = (1 – α)th percentile of the F distribution (10.3.1)
f_i = degrees of freedom for s_i (6.1.1)
f_p = degrees of freedom for s_p (6.1.2)
ℱ(•) = the cumulative F distribution function (A1.7.3)
H_0: = null hypothesis (A1.1.1)
H_a: = alternate hypothesis (A1.1.1)
n = sample size (number of test results) from a population (5.4)(6.1.3)(7.1.1)(9.1.1)
n_i = sample size from ith population (6.1.1)
n_1 = sample size from population 1 (6.1.2)
n_2 = sample size from population 2 (6.1.2)
R = ratio of two sample variances (5.5.2.1)
r = sample correlation coefficient (8.5.1)
ℛ = ratio of two population variances (A1.7.3)
S_XX = sum of squared deviations of X from their mean (8.3.1.2)
S_XY = sum of products of deviations of X and Y from their means (8.3.1.2)
S_YY = sum of squared deviations of Y from their mean (8.3.1.2)
s = sample standard deviation (9.1.1)
s_B = sample standard deviation for bias (9.1.2)
s_d = standard deviation of the difference between two test results (7.1.1)
s_D = sample standard deviation for mean difference (6.1.3)(A1.1.2)
s_i = sample standard deviation for ith population (6.1.1)
s_i² = sample variance for ith population (6.1.1)
s_1² = sample variance for population 1 (6.1.2)
s_1² = variance of test results from the current process (10.3.1)
s_2² = sample variance for population 2 (6.1.2)
s_2² = variance of test results from the modified process (10.3.1)
s_p = pooled sample standard deviation (6.1.2)
s_r = repeatability sample standard deviation (6.2)(7.2)
t = Student's t statistic (6.1.4)(7.1.3)(9.1.3)
t_{1–α,f} = (1 – α)th percentile of the Student's t distribution with f degrees of freedom (A1.1.2)
UCL_R = upper confidence limit for ℛ (10.3.1)
X_ij = jth test result from the ith population (6.1)
X̄ = test result average (9.1.1)
X̄_i = test result average for the ith population (6.1.1)
X̄_1 = test result average for population 1 (6.1.3)
X̄_2 = test result average for population 2 (6.1.3)
Z_{1–α} = (1 – α)th percentile of the standard normal distribution (A1.7.1)
α = (alpha) consumer's risk (5.2.2)(6.2)(7.2)
β = (beta) producer's risk (5.4.1)
β_0 = (beta) intercept parameter (8.1)
β_1 = (beta) slope parameter (8.1)
∆ = (capital delta) true mean difference between populations (5.4.1)
δ = (delta) measurement error of X (A2.1.1)
ε = (epsilon) measurement error of Y (A2.1.1)
η = (eta) true mean of Y (A2.1.1)
θ = (theta) angle of the straight line to the horizontal axis (8.3.2.1)
θ̂ = estimate of θ (8.3.2.1)
κ² = (kappa squared) information size (A2.4)
λ = (lambda) ratio of measurement error variances of Y over X (A2.1.1.1)
µ = (mu) population mean (A1.5.1)
µ_i = (mu) ith population mean (A1.1.1)
ν = (nu) approximate degrees of freedom for s_D (A1.1.4)
ξ = (xi) true mean of X (A2.1.1)
σ_d = (sigma) standard deviation of the true difference between two populations (7.2)
σ_ε² = (sigma) measurement error variances of Y (8.2)(A2.1.1)
σ_δ² = (sigma) measurement error variances of X (8.2)(A2.1.1)
τ = (tau) perpendicular distance from line to origin (A2.1.4)
Φ(•) = (capital phi) standard normal cumulative distribution function (A1.7.1)
φ = (phi) half width of confidence interval for θ (8.3.2.2)
υ = (upsilon) probability associated with informative confidence interval (A2.4.1.1)
ω = (omega) width of the equivalence interval for θ (8.3.3)
3.4 Acronyms:
3.4.1 ARV, n: accepted reference value (5.5.1.1)(9.1)(A1.5.1)
3.4.2 CRM, n: certified reference material (5.5.1.1)(9.1)
3.4.3 ILS, n: interlaboratory study (6.2)
3.4.4 IUT, n: intercept-union test (8.7)(A1.4)
3.4.5 LCL, n: lower confidence limit (6.1.4)(7.2.3)
3.4.6 TOST, n: two one-sided tests (5.5.1)(Section 6)(Section 7)(Section 9)(Annex A1)
3.4.7 UCL, n: upper confidence limit (6.1.4)(7.2.3)
E2935 − 21
4. Significance and Use
4.1 Laboratories conducting routine testing have a continuing need to make improvements in their testing processes. In these situations it must be demonstrated that any changes will neither cause an undesirable shift in the test results from the current testing process nor substantially affect a performance characteristic of the test method. This standard provides guidance on experiments and statistical methods needed to demonstrate that the test results from a modified testing process are equivalent to those from the current testing process, where equivalence is defined as agreement within a prescribed limit, termed an equivalence limit.
4.1.1 The equivalence limit, which represents a worst-case difference or ratio, is determined prior to the equivalence test and its value is usually set by consensus among subject-matter experts.
4.1.2 Examples of modifications to the testing process include, but are not limited to, the following:
(1) Changes to operating levels in the steps of the test method procedure,
(2) Installation of new instruments, apparatus, or sources of reagents and test materials,
(3) Evaluation of new personnel performing the testing, and
(4) Transfer of testing to a new location.
4.1.3 Examples of performance characteristics directly applicable to the test method include bias, precision, sensitivity, specificity, linearity, and range. Additional characteristics are test cost and elapsed time needed to conduct the test procedure.
4.2 Equivalence studies are performed by a designed experiment that generates test results from the modified and current testing procedures on the same types of materials that are routinely tested. The design of the experiment depends on the type of equivalence needed as discussed below. Experiment design and execution for various objectives is discussed in Section 5.
4.2.1 Means equivalence is concerned with a potential shift in the mean test result in either direction due to a modification in the testing process. Test results are generated under repeatability conditions by the modified and current testing processes on the same material, and the difference in their mean test results is evaluated.
4.2.1.1 In situations where testing cannot be conducted under repeatability conditions, such as using in-line instrumentation, test results may be generated in pairs of test results from the modified and current testing processes, and the mean differences among paired test results are evaluated.
4.2.2 Slope equivalence evaluates the slope of the linear statistical relationship between the test results from the two testing procedures. If the slope is equivalent to the value one (1), then the two testing processes meet slope equivalence.
4.2.3 Range equivalence evaluates the differences in means over a selected wider range of test results and the experiment uses materials that cover that range. The combination of slope equivalence and means equivalence defines range equivalence.
4.2.4 Non-inferiority is concerned with a difference only in the direction of an inferior outcome in a performance characteristic of the modified testing procedure versus the current testing procedure. Non-inferiority may involve the comparisons of means, standard deviations, or other statistical parameters.
4.2.4.1 Non-inferiority studies may involve trade-offs in performance characteristics between the modified and current procedures. For example, the modified process may be slightly inferior to the established process with respect to assay sensitivity or precision but may have off-setting advantages such as faster delivery of test results or lower testing costs.
4.3 Risk Management—Guidance is provided for determining the amount of data required to control the risks of making the wrong decision in accepting or rejecting equivalence (see 5.4 and Section A1.2).
4.3.1 The consumer’s risk is the risk of falsely declaring equivalence. The probability associated with this risk is directly controlled to a low level so that accepting equivalence gives a high degree of assurance that the true difference is less than the equivalence limit.
4.3.2 The producer’s risk is the risk of falsely rejecting equivalence. The probability associated with this risk is controlled by the amount of data generated by the experiment. If valid improvements are rejected by equivalence testing, this can lead to opportunity losses to the company and its laboratories (the producers) or cause unnecessary additional effort in improving the testing process.
5. Planning and Executing the Equivalence Study
5.1 This section discusses the stages of conducting an equivalence study: (1) determining the information needed, (2) setting up and conducting the study design, and (3) performing the statistical analysis of the resulting data. The study is usually conducted either in a single laboratory or, in the case of a method transfer, in both the originating and receiving laboratories. Using multiple laboratories will almost always increase the inherent variability of the data in the study, which will increase the cost of performing the study due to the need for more data.
5.2 Prior information required for the study design includes the equivalence limit, the consumer’s risk, and an estimate of the test method precision.
5.2.1 For means equivalence tests there are two equivalence limits, –E and E, because of the need to detect a potential shift in either direction. Limits may be non-symmetrical around zero, such as $-E_1$ and $E_2$, and this will usually be the case for slope equivalence. For non-inferiority tests only one of these limits is tested.
5.2.2 The consumer’s risk may be determined by an industry norm or a regulatory requirement. A probability value often used is α = 0.05, which is a 5 % risk to the user of the test results that the study falsely declares equivalence due to the modification of the testing process.
5.2.3 A prior estimate of the test method precision is essential for determining the number of test results required in the equivalence study design for adequate producer’s risk control. This estimate can be available from method development work, from an interlaboratory study (ILS), or from other
sources. The precision estimate should take into account the test conditions of the ILS, such as repeatability or intermediate precision conditions.
5.2.4 For slope equivalence an additional piece of required information is the ratio λ of the measurement variability of the modified and current test methods, expressed as variances. These estimates are usually available from experience or from method development work, but see 5.3.2.1.
5.3 The design type determines how the data are collected and how much data are needed to control the producer’s risk, or the risk of a wrong decision. For generating test result data from the modified and current testing processes, three basic designs are discussed in this practice: the Two Independent Samples Design, the Paired Samples Design, and the Single Sample Design.
5.3.1 The Two Independent Samples Design is used for means equivalence and non-inferiority testing. In this design, sets of independent test results are usually generated in a single laboratory on a quantity of a single homogeneous material by both testing procedures under repeatability conditions. For method transfers each laboratory generates independent test results using the same testing procedure on the same material under repeatability conditions at each laboratory. If this is not possible due to constraints on time or facilities, then the test results can be conducted under intermediate precision conditions, but then a statistician is recommended for the design and analysis of the test.
5.3.2 The Paired Samples Design is used for slope equivalence and may also be used for means equivalence. In this design, pairs of single test results from each testing procedure are generated on the same material over different time periods, or on various materials that are sampled either from a manufacturing process over time or from a set of materials that cover a predetermined range.
5.3.2.1 If information on measurement error is not available for slope equivalence studies, the experiment design can be modified to run duplicate test results by each testing process on each of the n materials to provide these precision estimates needed for estimation of their ratio.
5.3.3 The Single Sample Design is used for bias equivalence. In this design, test results are generated by the current testing process on a certified reference material.
5.4 Sample size in the design context refers to the number n of test results required by each testing process to manage the producer’s risk. It is possible to use different sample sizes for the modified and current test processes, but this can lead to poor control of the consumer’s risk (see A1.1.4).
5.4.1 The number of test results, symbol n, from each of the two testing processes controls the producer’s risk β of falsely rejecting means equivalence at a given true mean difference, Δ. The producer’s risk may be alternatively stated in terms of the power, defined as the probability 1 – β of correctly accepting equivalence at a given value of Δ.
5.4.1.1 For symmetric equivalence limits in means equivalence studies the power profile plots the probability 1 – β against the absolute value of Δ, due to the symmetry of the equivalence limits. This calculation can be performed using a spreadsheet computer package (see A1.7.1 and Appendix X1).
5.4.1.2 An example of a set of power profiles in means equivalence studies is shown in Fig. 1. The probability scale for power on the vertical axis varies from 0 to 1. The horizontal axis is the true absolute difference Δ. The power profile, a reversed S-shaped curve, should be close to a power probability of 1 at zero absolute difference and will decline to the consumer risk probability at an absolute difference of E. Power for absolute differences greater than E is less than the consumer risk and declines asymptotically to zero as the absolute difference increases.
5.4.1.3 In Fig. 1, power profiles are shown for three different sample sizes for testing means equivalence. Increasing the sample size moves the power curve to the right, giving a greater chance of accepting equivalence for a given true difference Δ. Equations for power profiles are shown in Section A1.6 and a spreadsheet example in Appendix X1.
FIG. 1 Multiple Power Curves for Lab Transfer Example
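The power profiles described in 5.4.1 can be approximated with a short script. The following is a minimal sketch and not the formula of A1.7.1, which lies outside this excerpt. It assumes a known standard deviation σ, equal sample sizes n for both processes, and a normal approximation in which power ≈ Φ((E − Δ)/σ_D − z) + Φ((E + Δ)/σ_D − z) − 1, with σ_D = σ√(2/n) and z the upper 100(1 − α) % standard normal percentile. The function name and its arguments are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def tost_power_normal(delta, E, sigma, n, alpha=0.05):
    """Approximate probability of accepting means equivalence.

    Assumptions of this sketch (not taken from the standard): sigma is
    known, both processes supply n results, and normal percentiles are
    used in place of Student's t.
    """
    sigma_D = sigma * sqrt(2.0 / n)        # standard error of the mean difference
    z = _STD_NORMAL.inv_cdf(1.0 - alpha)   # one-sided critical value
    upper = _STD_NORMAL.cdf((E - delta) / sigma_D - z)
    lower = _STD_NORMAL.cdf((E + delta) / sigma_D - z)
    return max(0.0, upper + lower - 1.0)   # a probability cannot be negative


if __name__ == "__main__":
    # Inputs of the example in 6.2: E = 2, sigma = 0.5, alpha = 0.05
    for n in (3, 6, 20):
        profile = [round(tost_power_normal(0.2 * k, 2.0, 0.5, n), 3)
                   for k in range(13)]
        print(n, profile)
```

Under this approximation each curve passes very close to the point (E, α), which agrees with the statement in 6.2.1 that the curves meet at (2, 0.05). Values at intermediate differences may differ slightly from those of the t-based calculation referenced in the annex.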
5.4.2 Power curves for bias equivalence and non-inferiority are constructed by different formulas but have the same shape and interpretation as those for means equivalence.
5.4.2.1 For non-inferiority testing, the power profile plots the probability 1 – β against the true mean difference Δ (see A1.7.2) or against the true variance ratio $\mathcal{R}$ for variances (see A1.7.3).
5.4.3 Power curves are evaluated by entering different values of n and evaluating the curve shape. A practical solution is to choose n such that the power is above a 0.9 probability out to about one-half to two-thirds of the distance from zero to E, thus giving a high probability that equivalence will be demonstrated for a range of true absolute differences that are deemed of little or no scientific import in the test result.
5.4.4 Annex A2 provides criteria for determining the number of samples required to meet power requirements for slope equivalence.
5.5 The statistical analysis for accepting or rejecting equivalence of means and variances for a single material is similar for all cases and depends on the outcome of one-sided statistical hypothesis tests. These calculations are given in detail with examples in Sections 6, 7, 9, and 10, with statistical theory given in Annex A1. The statistical analysis for slope and range equivalence is given in Section 8, with statistical theory given in Annex A2.
5.5.1 The data analysis for means equivalence uses a statistical methodology termed the two one-sided tests (TOST) procedure. The null hypothesis (see A1.1.1) is that the average difference between two sets of data exceeds an equivalence limit in one of the directions from zero, and this is tested in both directions. If the hypothesis is rejected in both directions then the alternate hypothesis that the mean difference is less than the equivalence limit is accepted and the two sources of data are deemed means equivalent.
NOTE 1—Historically, this procedure originated in the pharmaceutical industry for use in bioequivalence trials (1, 2), and was denoted as the Two One-Sided Tests Procedure, which has since been adopted for use in testing and measurement applications (3, 4).
The boldface numbers in parentheses refer to a list of references at the end of this standard.
5.5.1.1 For bias equivalence, the statistical test is based on only a single set of data conducted on a certified reference material (CRM) because its accepted reference value (ARV) is considered to be a known mean with zero variability for the purpose of the equivalence study.
5.5.2 The data analysis for non-inferiority testing of population means uses a single one-sided test in the direction of an inferior outcome with respect to a performance characteristic determined by the test results. When the performance characteristic is defined as “higher is better,” such as method sensitivity, the statistical test supports non-inferiority when LCL > –E. Conversely, when the performance characteristic is defined as “lower is better,” such as incidence of misclassifications, the statistical test supports non-inferiority when UCL < E.
5.5.2.1 For the non-inferiority testing of precision, the variances of the two data sets are used, and “lower is better” for this parameter, so the test for non-inferiority applies. Because variances are a scale parameter, the single non-inferiority test is based on the ratio R of the two sample variances, and the non-inferiority limit E is also in the form of a ratio.
6. The TOST Procedure for Statistical Analysis of Means Equivalence — Two Independent Samples Design
6.1 Statistical Analysis—Let the sample data be denoted as $X_{ij}$ = the $j$th test result from the $i$th population. The equivalence limit E, consumer’s risk α, and sample sizes have been previously determined.
6.1.1 Calculate averages, variances, and standard deviations, and degrees of freedom for each sample:
$$\bar{X}_i = \frac{\sum_{j=1}^{n_i} X_{ij}}{n_i}, \quad i = 1, 2 \tag{1}$$
$$s_i^2 = \frac{\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}{(n_i - 1)}, \quad i = 1, 2 \tag{2}$$
$$s_i = \sqrt{s_i^2}, \quad i = 1, 2 \tag{3}$$
$$f_i = n_i - 1, \quad i = 1, 2 \tag{4}$$
6.1.2 Calculate the pooled standard deviation and degrees of freedom:
$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{(n_1 + n_2 - 2)}} \tag{5}$$
It is assumed that the sample variances come from populations having equal variances; and, if this appears not to be the case, then use the procedure in A1.1.4.
If $n_1 = n_2 = n$, then:
$$s_p = \sqrt{\frac{(s_1^2 + s_2^2)}{2}}$$
$$f_p = (n_1 + n_2 - 2) \tag{6}$$
6.1.3 Calculate the difference between means and its standard error:
$$D = \bar{X}_2 - \bar{X}_1 \tag{7}$$
$$s_D = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \tag{8}$$
If $n_1 = n_2 = n$, then:
$$s_D = s_p \sqrt{\frac{2}{n}}$$
6.1.4 Statistical Test for Equivalence—Compute the upper (UCL) and lower (LCL) confidence limits for the 100(1 – 2α) % two-sided confidence interval on the true difference. If the confidence interval is completely contained within the equivalence limits (0 ± E), equivalently if LCL > –E and UCL < E, then accept equivalence. Otherwise, reject equivalence.
$$UCL = D + t s_D \tag{9}$$
$$LCL = D - t s_D \tag{10}$$
where t is the upper 100(1 – α) % percentile of the Student’s t distribution with ($n_1 + n_2 - 2$) degrees of freedom.
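The calculations of 6.1.1 through 6.1.4 can be sketched in a few lines of code. The example below is illustrative only: the function names are hypothetical, and the Student's t percentile is passed in by the caller (for instance from a table) rather than computed, which is a simplification chosen for this sketch. The paired design of Section 7 uses the same confidence limits with $s_D = s_d/\sqrt{n}$, so a helper that accepts already-computed differences is included.

```python
from math import sqrt
from statistics import mean, stdev, variance


def tost_limits(D, s_D, t_crit):
    """Return (LCL, UCL) = D -/+ t * s_D, as in Eq 9 and Eq 10."""
    return D - t_crit * s_D, D + t_crit * s_D


def tost_two_sample(x1, x2, E, t_crit):
    """TOST for two independent samples (6.1.1 to 6.1.4).

    t_crit: upper 100(1 - alpha) % Student's t percentile with
    n1 + n2 - 2 degrees of freedom, supplied by the caller.
    """
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = variance(x1), variance(x2)   # Eq 2, divisor n - 1
    s_p = sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))  # Eq 5
    D = mean(x2) - mean(x1)                      # Eq 7
    s_D = s_p * sqrt(1.0 / n1 + 1.0 / n2)        # Eq 8
    lcl, ucl = tost_limits(D, s_D, t_crit)
    return {"D": D, "s_p": s_p, "s_D": s_D, "LCL": lcl, "UCL": ucl,
            "equivalent": lcl > -E and ucl < E}


def tost_paired(d, E, t_crit):
    """TOST for paired differences d_j (Section 7).

    t_crit: upper 100(1 - alpha) % Student's t percentile with
    n - 1 degrees of freedom, supplied by the caller.
    """
    n = len(d)
    D = mean(d)                  # average difference
    s_D = stdev(d) / sqrt(n)     # standard error of the average difference
    lcl, ucl = tost_limits(D, s_D, t_crit)
    return {"D": D, "s_D": s_D, "LCL": lcl, "UCL": ucl,
            "equivalent": lcl > -E and ucl < E}


if __name__ == "__main__":
    # Table 1 data from the example in 6.2, with t = 1.812 for 10 df
    lab1 = [96.9, 97.9, 98.5, 97.5, 97.7, 97.2]
    lab2 = [97.8, 97.6, 98.1, 98.6, 98.6, 98.9]
    print(tost_two_sample(lab1, lab2, E=2.0, t_crit=1.812))
```

Passing the t percentile as an argument keeps the sketch free of third-party dependencies; in practice it would come from statistical software or a spreadsheet function.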
6.2 Example for Means Equivalence—The example shown is data from a transfer of an ASTM test method from R&D Lab 1 to Plant Lab 2 (Table 1). An equivalence limit of 2 units was proposed with a consumer risk of 5 %. An interlaboratory study (ILS) on this test method had given an estimate of $s_r$ = 0.5 units for the repeatability standard deviation. Thus E = 2 units, α = 0.05, and estimated σ = 0.5 units are inputs for this study (the actual units are unspecified for this example).
6.2.1 Sample Size Determination—Power profiles for n = 3, 6, and 20 were generated for a set of absolute difference values ranging from 0.00 to 2.40 units in steps of 0.20 units, as shown in Fig. 1. All three curves intersect at the point (2, 0.05) as determined by the consumer’s risk at the equivalence limit.
6.2.1.1 A sample size of n = 6 replicate assays per laboratory yielded a satisfactory power curve, in that the probability of accepting equivalence (power) was greater than a 0.9 probability (or a 90 % power) for a difference of about 1.2 units or less. Therefore, there would be less than an estimated 10 % risk to the producer that such a difference would fail to support equivalence in the actual study.
6.2.1.2 A comparison of the three power curves indicates that the n = 3 design would be underpowered, as the power falls below 0.9 at 0.8 units. The n = 20 design gives somewhat more power than the n = 6 design but is more costly to conduct and may not be worth the extra expenditure.
6.2.2 Averages, variances, standard deviations, and degrees of freedom for the two laboratories are:
$$\bar{X}_1 = (96.9 + 97.9 + 98.5 + 97.5 + 97.7 + 97.2)/6 = 97.62 \text{ mg/g}$$
$$\bar{X}_2 = (97.8 + 97.6 + 98.1 + 98.6 + 98.6 + 98.9)/6 = 98.27 \text{ mg/g}$$
$$s_1^2 = [(96.9 - 97.62)^2 + \ldots + (97.2 - 97.62)^2]/(6 - 1) = 0.31367$$
$$s_2^2 = [(97.8 - 98.27)^2 + \ldots + (98.9 - 98.27)^2]/(6 - 1) = 0.26267$$
$$s_1 = \sqrt{0.31367} = 0.560$$
$$s_2 = \sqrt{0.26267} = 0.513$$
$$f_i = n_i - 1 = 6 - 1 = 5$$
The estimates of standard deviation are in good agreement with the ILS estimate of 0.5 mg/g.
6.2.3 The pooled standard deviation is:
$$s_p = \sqrt{\frac{(6 - 1)\,0.31367 + (6 - 1)\,0.26267}{(6 + 6 - 2)}} = \sqrt{\frac{2.8817}{10}} = 0.537 \text{ mg/g}$$
with 10 degrees of freedom.
6.2.4 The difference of means is D = 98.27 – 97.62 = 0.65 mg/g. The plant laboratory average is 0.65 mg/g higher than the development laboratory average. The standard error of the difference of means is $s_D = 0.537\sqrt{2/6} = 0.310$ mg/g with 10 degrees of freedom (same as that for $s_p$).
6.2.5 The 95th percentile of Student’s t with 10 degrees of freedom is 1.812. Upper and lower confidence limits for the difference of means are:
UCL = 0.65 + (1.812)(0.310) = 1.21
LCL = 0.65 – (1.812)(0.310) = 0.09
The 90 % two-sided confidence interval on the true difference is 0.09 to 1.21 mg/g and is completely contained within the equivalence interval of –2 to 2 mg/g. Since 0.09 > –2 and 1.21 < 2, equivalence is accepted.
TABLE 1 Data for Equivalence Test Between Two Laboratories
               Test Results
Laboratory 1   96.9   97.9   98.5   97.5   97.7   97.2
Laboratory 2   97.8   97.6   98.1   98.6   98.6   98.9
7. The TOST Procedure for Statistical Analysis of Means Equivalence — Paired Samples Design
7.1 Statistical Analysis—Let the sample data be denoted as $X_{ij}$ = the test result from the $i$th population and the $j$th pair, where i = 1 or 2, j = 1, …, n. Each pair represents a test result from each population at a given sampling point. The equivalence limit E, consumer’s risk α, and sample size (number of pairs, symbol n) have been previously determined (see Section 5).
7.1.1 Calculate the n differences, symbol $d_j$, between the two test results within each pair, the average of the differences, symbol $\bar{d}$, and the standard deviation of the differences, symbol $s_d$, with its degrees of freedom, symbol f.
$$d_j = X_{1j} - X_{2j}, \quad j = 1, \ldots, n \tag{11}$$
$$\bar{d} = \frac{\sum_{j=1}^{n} d_j}{n} = D \tag{12}$$
$$s_d = \sqrt{\frac{\sum_{j=1}^{n} (d_j - \bar{d})^2}{(n - 1)}} \tag{13}$$
$$f = n - 1 \tag{14}$$
7.1.2 Calculate the standard error of the mean difference, symbol $s_D$.
$$s_D = \frac{s_d}{\sqrt{n}} \tag{15}$$
7.1.3 Statistical Test for Equivalence—Compute the upper (UCL) and lower (LCL) confidence limits for the 100(1 – 2α) % two-sided confidence interval on the true difference. If the confidence interval is completely contained within the equivalence limits (0 ± E), or equivalently if LCL > –E and UCL < E, then accept equivalence. Otherwise, reject equivalence.
$$UCL = D + t s_D \tag{16}$$
$$LCL = D - t s_D \tag{17}$$
where t is the upper 100(1 – α) % percentile of the Student’s t distribution with (n − 1) degrees of freedom.
7.2 Example for Means Equivalence—Total organic carbon in purified water was measured by an on-line analyzer, wherein a water sample was taken directly into the analyzer from the pipeline through a sampling port and the test result was determined by a series of operations within the instrument. A new analyzer was to be qualified by running a TOC analysis at the same time as the current analyzer utilizing a parallel sampling port on the pipeline. The sampling time was the pairing factor, and the data from the two instruments constituted a pair of single test results measured at a particular sampling time. Sampling was to be conducted at a frequency of four hours between sampling periods.
An equivalence limit of 2 parts per billion (ppb), or 4 % of the nominal process average of 50 ppb, was proposed with a
consumer risk of 5 %. A repeatability estimate of $s_r$ = 0.7 ppb, based on previous validation work, gave an estimate for $\sigma_d = 0.7\sqrt{2}$ or approximately 1 ppb. Thus E = 2 ppb, α = 0.05, and $\sigma_d$ = 1 ppb were inputs for this study.
7.2.1 Sample Size Determination—Because the paired samples design uses the differences of the test results within sampling periods for data analysis, the sample size equals the number of pairs for purposes of calculating the power curve. In this example, the cost of obtaining test results was not a major consideration once the new analyzer was installed in the system. Comparative power profiles for n = 10, 20, and 50 sample pairs are shown in Fig. 2. The sample size of 20 pairs yielded a satisfactory power curve, in that the probability of accepting equivalence was greater than a 0.9 (or a 90 % power) for a true difference of about 1.25 ppb. Therefore, there would be less than an estimated 10 % risk to the producer that such a difference would fail to support equivalence in the actual study.
7.2.2 Test results for the two instruments at each of the 20 sampling times are listed in Table 2. The current analyzer was designated as Instrument A, and the new analyzer was designated as Instrument B. The differences $d_j$ at each sampling time period were calculated and listed in Table 2 as differences in the test results of Instrument B minus Instrument A. The averages and standard deviations of the test results for each analyzer and their differences are also listed in Table 2.
7.2.2.1 The average difference $\bar{d}$ was 0.46 ppb and the standard deviation of the differences $s_d$ was 1.05 ppb with f = 19 degrees of freedom. The standard error of the average difference was:
$$s_D = \frac{1.05}{\sqrt{20}} = 0.235 \text{ ppb}$$
7.2.2.2 Note that the standard deviations of test results for each analyzer over time were about 6 ppb due to process fluctuations in a range of 37–59 ppb. The source of variation due to pairs (sampling times from the process) is eliminated in the variation of the differences by pairing the test results.
7.2.3 The 95th percentile of Student’s t with 19 degrees of freedom was 1.729. Upper and lower confidence limits for the difference of means were:
$$UCL = D + t s_D = 0.46 + (1.729)(0.235) = 0.87 \text{ ppb}$$
$$LCL = D - t s_D = 0.46 - (1.729)(0.235) = 0.05 \text{ ppb}$$
The 90 % two-sided confidence interval on the true difference is 0.05 to 0.87 ppb and is completely contained within the equivalence interval of –2 to 2 ppb. Since 0.05 > –2 and 0.87 < 2, equivalence of the two analyzers is accepted.
TABLE 2 Data for Paired Samples Equivalence Test
                  TOC in Water, ppb
Sampling Time   Inst A   Inst B   Diff
 1              46.4     48.8      2.4
 2              44.2     43.5     –0.7
 3              52.4     53.0      0.6
 4              37.6     37.3     –0.3
 5              49.3     49.1     –0.2
 6              45.0     44.5     –0.5
 7              51.4     51.3     –0.1
 8              57.6     56.8     –0.8
 9              43.4     44.9      1.5
10              45.2     44.1     –1.1
11              59.0     58.5     –0.5
12              43.1     44.1      1.0
13              39.3     40.9      1.6
14              48.2     48.4      0.2
15              48.7     49.0      0.3
16              44.4     46.1      1.7
17              52.7     53.2      0.5
18              43.3     44.6      1.3
19              54.4     56.7      2.3
20              58.4     58.4      0.0
Average         48.20    48.66     0.46
Std Dev          6.13     5.99     1.05
FIG. 2 Power Curves for Total Organic Carbon Analyzers Comparison
8. Procedure for Equivalence of Test Results Over a Range of Values
8.1 Range equivalence is the condition that means equivalence of two testing processes holds over a predetermined range of a material’s characteristic being measured. The
experiment consists of obtaining pairs of test results by each process on a number of different material samples. The approach taken for evaluating range equivalence is through a linear statistical function, $Y = \beta_0 + \beta_1 X$, describing a straight-line statistical relationship between the test results from two testing processes denoted as X and Y. The Y intercept $\beta_0$ is the value of Y when X = 0, and the slope $\beta_1$ is the amount of change in Y units for a unit change in X. The criterion for range equivalence is twofold: (1) that the intercept $\beta_0 = 0$, and (2) that the slope $\beta_1 = 1$, each within predetermined limits. This states that, within limits, the relationship Y = X holds over the range of the data.
8.1.1 In many cases, the data range is far removed from zero, so that the intercept parameter is not precisely estimated and can even be a negative test result value. For that reason, the equivalence procedure for $\beta_0 = 0$ should be replaced by a test that the means of the X and Y test results are equivalent, indicating that the center of the straight-line relationship is locally close to the Y = X line. This means-equivalence procedure is covered in Section 7. Note that the estimated line must pass through the point determined from the X and Y averages of the data.
8.1.2 The equivalence testing procedure for $\beta_1 = 1$ is termed slope equivalence and this topic will be covered in the remainder of this section.
8.1.3 Statistical tests for means equivalence and slope equivalence must each meet equivalence to assert range equivalence, because together they constitute an intersection-union test (see A1.4.1). If each of the component statistical tests is an α-level test, then the range equivalence test is also an α-level test.
8.2 Slope Equivalence—The Y intercept and the slope for the linear statistical function in 8.1 are estimated from the data by a process known as least-squares, which minimizes the sum of the squared deviations of the data points from the estimated line. Unlike the similar simple linear regression model that predicts Y from X, where the X values are known constants and only Y is subject to measurement variation (see Practice E3080), both X and Y are subject to measurement errors, with their variances denoted as $\sigma_\delta^2$, $\sigma_\varepsilon^2$ respectively (see A2.1). This fact requires a different criterion for the least-squares procedure, which is known as errors-in-variables regression. Instead of minimizing the point-to-line differences in the Y direction, the direction of that distance is dependent on the precision ratio of Y with respect to X, denoted as $\lambda = \sigma_\varepsilon^2 / \sigma_\delta^2$.
8.2.1 The measurement error variances can be estimated from experience with current procedure use and method development data for the modified procedure. Alternatively, the comparison experiment can conduct duplicate test results from both procedures for estimating these variances, as noted in the referenced article (5).
8.2.2 In many situations it can be assumed that the two test methods have similar measurement error, thus dealing with the case that λ = 1. Then the least squares procedure minimizes the squared differences in the perpendicular direction from the points to the line. This is termed orthogonal least squares, and this procedure will be described in this section. Annex A2 discusses methodology for the situation where λ ≠ 1.
8.3 Statistical Analysis for Slope Equivalence—For n pairs of test results, let the $i$th pair consist of test results $Y_i$ from the modified testing process and $X_i$ from the current testing process, where i = 1, …, n. Slope equivalence is accepted if the slope $\beta_1$ from the estimated line of 8.1 is equivalent to the value 1, representing a 45° line slope relationship between X and Y.
8.3.1 Calculations for the estimated orthogonal least squares regression line are as follows.
8.3.1.1 Calculate the averages of X and Y:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{18}$$
$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} \tag{19}$$
8.3.1.2 Calculate the sums of squared deviations of X and Y from their averages, respectively $S_{XX}$ and $S_{YY}$, and the sum of cross products of the X and Y deviations from their averages, $S_{XY}$:
$$S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2 \tag{20}$$
$$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \tag{21}$$
$$S_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \tag{22}$$
8.3.1.3 Calculate $b_1$, the estimate of the slope $\beta_1$:
$$b_1 = \frac{S_{YY} - S_{XX} + \sqrt{(S_{YY} - S_{XX})^2 + 4 S_{XY}^2}}{2 S_{XY}} \tag{23}$$
8.3.1.4 Calculate $b_0$, the estimate of the Y intercept $\beta_0$:
$$b_0 = \bar{Y} - b_1 \bar{X} \tag{24}$$
8.3.2 Confidence Intervals—Before consideration of confidence intervals on the slope, define the angle θ (theta) that the line makes with the horizontal (X) axis. The confidence limits will be symmetrical around the estimate of θ and thus will be more suitable for use in slope equivalence. Confidence intervals on the slope will not be symmetrical around the estimate of $\beta_1$.
8.3.2.1 Calculate $\hat{\theta}$, the estimate of θ:
$$\hat{\theta} = \arctan(b_1) \tag{25}$$
The arctangent function is available on most spreadsheet or statistical software programs. For a brief summary of transformation to polar coordinates, see A2.1.4.
8.3.2.2 Calculate the half width φ (phi) of the two-sided 90 % confidence interval for θ:
$$\varphi = 0.5 \arcsin\left[ t_{n-2,0.95} \, \frac{2}{\sqrt{n - 2}} \sqrt{\frac{S_{YY} S_{XX} - S_{XY}^2}{(S_{YY} - S_{XX})^2 + 4 S_{XY}^2}} \right] \tag{26}$$
where $t_{n-2,0.95}$ is the upper 95th percentile of the Student’s t distribution with n – 2 degrees of freedom.
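The orthogonal least squares calculations of 8.3.1 and 8.3.2 (Eq 18 through Eq 26) can be illustrated as follows. This is a sketch under stated assumptions: the function name is hypothetical, the percentile $t_{n-2,0.95}$ is supplied by the caller, angles are returned in radians, and raising an error when the arcsine argument exceeds 1 is a choice made for this example rather than a rule taken from the standard.

```python
from math import asin, atan, sqrt


def orthogonal_slope_angle(x, y, t_crit):
    """Orthogonal least squares line and angle half width.

    x: results from the current process, y: from the modified process.
    t_crit: t_{n-2,0.95}, supplied by the caller.
    Requires at least 3 pairs and a nonzero S_XY.
    Returns (b0, b1, theta_hat, phi) with angles in radians.
    """
    n = len(x)
    x_bar = sum(x) / n                                            # Eq 18
    y_bar = sum(y) / n                                            # Eq 19
    s_xx = sum((xi - x_bar) ** 2 for xi in x)                     # Eq 20
    s_yy = sum((yi - y_bar) ** 2 for yi in y)                     # Eq 21
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # Eq 22
    disc = (s_yy - s_xx) ** 2 + 4.0 * s_xy ** 2
    b1 = (s_yy - s_xx + sqrt(disc)) / (2.0 * s_xy)                # Eq 23
    b0 = y_bar - b1 * x_bar                                       # Eq 24
    theta_hat = atan(b1)                                          # Eq 25
    # Rounding can make this difference slightly negative for data
    # lying exactly on a line, so it is floored at zero.
    num = max(0.0, s_yy * s_xx - s_xy ** 2)
    arg = t_crit * (2.0 / sqrt(n - 2)) * sqrt(num / disc)
    if arg > 1.0:
        raise ValueError("arcsine argument exceeds 1; half width undefined")
    phi = 0.5 * asin(arg)                                         # Eq 26
    return b0, b1, theta_hat, phi


if __name__ == "__main__":
    # Made-up illustrative data (not from the standard); t_{3,0.95} = 2.353
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.1, 1.9, 3.2, 3.9, 5.1]
    print(orthogonal_slope_angle(x, y, t_crit=2.353))
```

A slope of 1 corresponds to an angle of π/4 radians (45°), so the interval $\hat{\theta} \pm \varphi$ is compared with equivalence limits expressed around that angle.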
The arcsine function is available on most spreadsheet or statistical software programs.
approach to establish slope equivalence for the new instrument as compared with the current instrument.
8.3.2.3 Calculate the upper (UCL) and lower (LCL) confi-
8.4.1 Measurement error was assumed to be approximately
denc
...
This document is not an ASTM standard and is intended only to provide the user of an ASTM standard an indication of what changes have been made to the previous version. Because
it may not be technically possible to adequately depict all changes accurately, ASTM recommends that users consult prior editions as appropriate. In all cases only the current version
of the standard as published by ASTM is to be considered the official document.
´1
Designation: E2935 − 21 An American National Standard
Standard Practice for
Evaluating Equivalence of Two Testing Processes
This standard is issued under the fixed designation E2935; the number immediately following the designation indicates the year of
original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A
superscript epsilon (´) indicates an editorial change since the last revision or reapproval.
ε NOTE—Terms were corrected editorially in May 2021.
1. Scope
1.1 This practice provides statistical methodology for conducting equivalence studies on numerical data from two sources
of test results to determine if their true means, variances, or other parameters differ by no more than predetermined limits.
1.2 Applications include (1) equivalence studies for bias against an accepted reference value, (2) determining means
equivalence of two test methods, test apparatus, instruments, reagent sources, or operators within a laboratory or equivalence of
two laboratories in a method transfer, and (3) determining non-inferiority of a modified test procedure versus a current test
procedure with respect to a performance characteristic.
1.3 The guidance in this standard applies to experiments conducted either on a single material at a given level of the test result
or on multiple materials covering a selected range of test results.
1.4 Guidance is given for determining the amount of data required for an equivalence study. The control of risks associated
with the equivalence decision is discussed.
1.5 The values stated in SI units are to be regarded as standard. No other units of measurement are included in this standard.
1.6 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility
of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of
regulatory limitations prior to use.
1.7 This international standard was developed in accordance with internationally recognized principles on standardization
established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued
by the World Trade Organization Technical Barriers to Trade (TBT) Committee.
2. Referenced Documents
2.1 ASTM Standards:
E122 Practice for Calculating Sample Size to Estimate, With Specified Precision, the Average for a Characteristic of a Lot or
Process
This test method is under the jurisdiction of ASTM Committee E11 on Quality and Statistics and is the direct responsibility of Subcommittee E11.20 on Test Method
Evaluation and Quality Control.
Current edition approved June 1, 2021. Published June 2021. Originally approved in 2013. Last previous edition approved in 2020 as E2935 – 20ε1. DOI: 10.1520/E2935-21.
For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards
volume information, refer to the standard’s Document Summary page on the ASTM website.
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
E177 Practice for Use of the Terms Precision and Bias in ASTM Test Methods
E456 Terminology Relating to Quality and Statistics
E2282 Guide for Defining the Test Result of a Test Method
E2586 Practice for Calculating and Using Basic Statistics
E3080 Practice for Regression Analysis with a Single Predictor Variable
2.2 USP Standard:
USP <1223> Validation of Alternative Microbiological Methods
3. Terminology
3.1 Definitions—See Terminology E456 for a more extensive listing of statistical terms.
3.1.1 accepted reference value, n—a value that serves as an agreed-upon reference for comparison, and which is derived as: (1)
a theoretical or established value, based on scientific principles, (2) an assigned or certified value, based on experimental work of
some national or international organization, or (3) a consensus or certified value, based on collaborative experimental work under
the auspices of a scientific or engineering group. E177
3.1.2 bias, n—the difference between the expectation of the test results and an accepted reference value. E177
3.1.3 confidence interval, n—an interval estimate [L, U] with the statistics L and U as limits for the parameter θ and with
confidence level 1 – α, where Pr(L ≤ θ ≤ U) ≥ 1 – α. E2586
3.1.3.1 Discussion—
The confidence level, 1 – α, reflects the proportion of cases that the confidence interval [L, U] would contain or cover the true
parameter value in a series of repeated random samples under identical conditions. Once L and U are given values, the resulting
confidence interval either does or does not contain it. In this sense “confidence” applies not to the particular interval but only to
the long run proportion of cases when repeating the procedure many times.
3.1.4 confidence level, n—the value, 1 – α, of the probability associated with a confidence interval, often expressed as a percentage.
E2586
3.1.4.1 Discussion—
α is generally a small number. Confidence level is often 95 % or 99 %.
3.1.5 confidence limit, n—each of the limits, L and U, of a confidence interval, or the limit of a one-sided confidence interval.
E2586
3.1.6 degrees of freedom, n—the number of independent data points minus the number of parameters that have to be estimated
before calculating the variance. E2586
3.1.7 equivalence, n—condition that two population parameters differ by no more than predetermined limits.
3.1.8 intermediate precision conditions, n—conditions under which test results are obtained with the same test method using test
units or test specimens taken at random from a single quantity of material that is as nearly homogeneous as possible, and with
changing conditions such as operator, measuring equipment, location within the laboratory, and time. E177
3.1.9 mean, n—of a population, μ, average or expected value of a characteristic in a population; of a sample, $\bar{X}$, sum of the observed
values in the sample divided by the sample size. E2586
3.1.10 percentile, n—quantile of a sample or a population, for which the fraction less than or equal to the value is expressed as
a percentage. E2586
3.1.11 population, n—the totality of items or units of material under consideration. E2586
3.1.12 population parameter, n—summary measure of the values of some characteristic of a population. E2586
Available from U.S. Pharmacopeial Convention (USP), 12601 Twinbrook Pkwy., Rockville, MD 20852-1790, http://www.usp.org.
3.1.13 precision, n—the closeness of agreement between independent test results obtained under stipulated conditions. E177
3.1.14 quantile, n—value such that a fraction f of the sample or population is less than or equal to that value. E2586
3.1.15 repeatability, n—precision of test results from tests conducted within the shortest practical time period on identical material
by the same test method in a single laboratory with all known sources of variability conditions controlled at the same levels (see
repeatability conditions). E177
3.1.16 repeatability conditions, n—conditions where independent test results are obtained with the same method on identical test
items in the same laboratory by the same operator using the same equipment within short intervals of time. E177
3.1.17 repeatability standard deviation ($s_r$), n—the standard deviation of test results obtained under repeatability conditions. E177
3.1.18 sample, n—a group of observations or test results, taken from a larger collection of observations or test results, which serves
to provide information that may be used as a basis for making a decision concerning the larger collection. E2586
3.1.19 sample size, n, n—number of observed values in the sample. E2586
3.1.20 sample statistic, n—summary measure of the observed values of a sample. E2586
3.1.21 standard deviation—of a population, σ, the square root of the average or expected value of the squared deviation of a
variable from its mean; of a sample, s, the square root of the sum of the squared deviations of the observed values in the sample
from their mean divided by the sample size minus 1. E2586
3.1.22 test result, n—the value of a characteristic obtained by carrying out a specified test method. E2282
3.1.23 test unit, n—the total quantity of material (containing one or more test specimens) needed to obtain a test result as specified
in the test method. See test result. E2282
3.1.24 variance, $\sigma^2$, $s^2$, n—square of the standard deviation of the population or sample. E2586
3.2 Definitions of Terms Specific to This Standard:
3.2.1 bias equivalence, n—equivalence of a population mean with an accepted reference value.
3.2.2 equivalence limit, E, n—in equivalence testing, a limit on the difference between two population parameters.
3.2.2.1 Discussion—
In certain applications, this may be termed practical limit or practical difference.
3.2.3 equivalence test, n—a statistical test conducted within predetermined risks to confirm equivalence of two population
parameters.
3.2.4 means equivalence, n—equivalence of two population means.
3.2.5 non-inferiority, n—condition that the difference in means or variances of test results between a modified testing process and
a current testing process with respect to a performance characteristic is no greater than a predetermined limit in the direction of
inferiority of the modified process to the current process.
3.2.5.1 Discussion—
Other terms used for non-inferior are “equivalent or better” or “at least equivalent as.”
3.2.6 paired samples design, n—in means equivalence testing, single samples are taken from the two populations at a number of
sampling points.
3.2.6.1 Discussion—
This design is termed a randomized block design for a general number of populations sampled, and each group of data within a
sampling point is termed a block.
3.2.7 power, n—in equivalence testing, the probability of accepting equivalence, given the true difference between two population
means.
3.2.7.1 Discussion—
In the case of testing for bias equivalence the power is the probability of accepting equivalence, given the true difference between
a population mean and an accepted reference value.
3.2.8 range equivalence, n—equivalence of two population means over a range of test result values.
3.2.9 slope equivalence, n—equivalence of the slope of a linear statistical relationship with the value one (1).
3.2.10 two independent samples design, n—in means equivalence testing, replicate test results are determined independently from
two populations at a single sampling time for each population.
3.2.10.1 Discussion—
This design is termed a completely randomized design for a general number of sampled populations.
3.2.11 two one-sided tests (TOST) procedure, n—a statistical procedure used for testing the equivalence of the parameters from
two distributions (see equivalence).
3.3 Symbols:
B = bias (9.1.1)
$b_0$ = intercept estimate (8.3.1.4)
$b_1$ = slope estimate (8.3.1.3)
$d_j$ = difference between a pair of test results at sampling point j (7.1.1)
$\bar{d}$ = average difference (7.1.1)
D = difference in sample means (6.1.2) (A1.1.2)
E = equivalence limit (5.2.1)
$E_1$ = lower equivalence limit (5.2.1)
$E_2$ = upper equivalence limit (5.2.1)
$e_i$ = residual estimate (8.6.2)
f = degrees of freedom for s (9.1.1) (A1.1.2)
$F_{1-\alpha}$ = (1 – α)th percentile of the F distribution (10.3.1)
$f_i$ = degrees of freedom for $s_i$ (6.1.1)
$f_p$ = degrees of freedom for $s_p$ (6.1.2)
ℱ(•) = the cumulative F distribution function (A1.7.3)
$H_0$: = null hypothesis (A1.1.1)
$H_a$: = alternate hypothesis (A1.1.1)
n = sample size (number of test results) from a population (5.4) (6.1.3) (7.1.1) (9.1.1)
$n_i$ = sample size from ith population (6.1.1)
$n_1$ = sample size from population 1 (6.1.2)
$n_2$ = sample size from population 2 (6.1.2)
R = ratio of two sample variances (5.5.2.1)
r = sample correlation coefficient (8.5.1)
ℛ = ratio of two population variances (A1.7.3)
$S_{XX}$ = sum of squared deviations of X from their mean (8.3.1.2)
$S_{XY}$ = sum of products of deviations of X and Y from their means (8.3.1.2)
$S_{YY}$ = sum of squared deviations of Y from their mean (8.3.1.2)
s = sample standard deviation (9.1.1)
$s_B$ = sample standard deviation for bias (9.1.2)
$s_d$ = standard deviation of the difference between two test results (7.1.1)
$s_D$ = sample standard deviation for mean difference (6.1.3) (A1.1.2)
$s_i$ = sample standard deviation for ith population (6.1.1)
$s_i^2$ = sample variance for ith population (6.1.1)
$s_1^2$ = sample variance for population 1 (6.1.2)
$s_1^2$ = variance of test results from the current process (10.3.1)
$s_2^2$ = sample variance for population 2 (6.1.2)
$s_2^2$ = variance of test results from the modified process (10.3.1)
$s_p$ = pooled sample standard deviation (6.1.2)
$s_r$ = repeatability sample standard deviation (6.2) (7.2)
t = Student’s t statistic (6.1.4) (7.1.3) (9.1.3)
$t_{1-\alpha,f}$ = (1 – α)th percentile of the Student’s t distribution with f degrees of freedom (A1.1.2)
$X_{ij}$ = jth test result from the ith population (6.1)
$UCL_R$ = upper confidence limit for ℛ (10.3.1)
$\bar{X}$ = test result average (9.1.1)
$\bar{X}_i$ = test result average for the ith population (6.1.1)
$\bar{X}_1$ = test result average for population 1 (6.1.3)
$\bar{X}_2$ = test result average for population 2 (6.1.3)
$Z_{1-\alpha}$ = (1 – α)th percentile of the standard normal distribution (A1.7.1)
α = (alpha) consumer’s risk (5.2.2) (6.2) (7.2)
β = (beta) producer’s risk (5.4.1)
$\beta_0$ = (beta) intercept parameter (8.1)
$\beta_1$ = (beta) slope parameter (8.1)
Δ = (capital delta) true mean difference between populations (5.4.1)
δ = (delta) measurement error of X (A2.1.1)
ε = (epsilon) measurement error of Y (A2.1.1)
η = (eta) true mean of Y (A2.1.1)
θ = (theta) angle of the straight line to the horizontal axis (8.3.2.1)
$\hat{\theta}$ = estimate of θ (8.3.2.1)
$\kappa^2$ = (kappa squared) information size (A2.4)
λ = (lambda) ratio of measurement error variances of Y over X (A2.1.1.1)
μ = (mu) population mean (A1.5.1)
$\mu_i$ = (mu) ith population mean (A1.1.1)
ν = (nu) probability associated with informative confidence interval (A2.4.1)
ν = (nu) approximate degrees of freedom for $s_D$ (A1.1.4)
ξ = (xi) true mean of X (A2.1.1)
$\sigma_d$ = (sigma) standard deviation of the true difference between two populations (7.2)
$\sigma_\varepsilon^2$ = (sigma) measurement error variances of Y (8.2) (A2.1.1)
$\sigma_\delta^2$ = (sigma) measurement error variances of X (8.2) (A2.1.1)
τ = (tau) perpendicular distance from line to origin (A2.1.4)
Φ(•) = (capital phi) standard normal cumulative distribution function (A1.7.1)
φ = (phi) half width of confidence interval for θ (8.3.2.2)
υ = (upsilon) probability associated with informative confidence interval (A2.4.1.1)
ω = (omega) width of the equivalence interval for θ (8.3.3)
3.4 Acronyms:
3.4.1 ARV, n—accepted reference value (5.5.1.1) (9.1) (A1.5.1)
3.4.2 CRM, n—certified reference material (5.5.1.1) (9.1)
3.4.3 ILS, n—interlaboratory study (6.2)
3.4.4 IUT, n—intersection-union test (8.7) (A1.4)
3.4.5 LCL, n—lower confidence limit (6.1.4) (7.2.3)
3.4.6 TOST, n—two one-sided tests (5.5.1) (Section 6) (Section 7) (Section 9) (Annex A1)
3.4.7 UCL, n—upper confidence limit (6.1.4) (7.2.3)
4. Significance and Use
4.1 Laboratories conducting routine testing have a continuing need to make improvements in their testing processes. In these
situations it must be demonstrated that any changes will neither cause an undesirable shift in the test results from the current testing
process nor substantially affect a performance characteristic of the test method. This standard provides guidance on experiments
and statistical methods needed to demonstrate that the test results from a modified testing process are equivalent to those from the
current testing process, where equivalence is defined as agreement within a prescribed limit, termed an equivalence limit.
4.1.1 The equivalence limit, which represents a worst-case difference or ratio, is determined prior to the equivalence test and its
value is usually set by consensus among subject-matter experts.
4.1.2 Examples of modifications to the testing process include, but are not limited to, the following:
(1) Changes to operating levels in the steps of the test method procedure,
(2) Installation of new instruments, apparatus, or sources of reagents and test materials,
(3) Evaluation of new personnel performing the testing, and
(4) Transfer of testing to a new location.
4.1.3 Examples of performance characteristics directly applicable to the test method include bias, precision, sensitivity, specificity,
linearity, and range. Additional characteristics are test cost and elapsed time needed to conduct the test procedure.
4.2 Equivalence studies are performed by a designed experiment that generates test results from the modified and current
testing procedures on the same types of materials that are routinely tested. The design of the experiment depends on the type of
equivalence needed as discussed below. Experiment design and execution for various objectives is discussed in Section 5.
4.2.1 Means equivalence is concerned with a potential shift in the mean test result in either direction due to a modification in the
testing process. Test results are generated under repeatability conditions by the modified and current testing processes on the same
material, and the difference in their mean test results is evaluated.
4.2.1.1 In situations where testing cannot be conducted under repeatability conditions, such as using in-line instrumentation, test
results may be generated in pairs of test results from the modified and current testing processes, and the mean differences among
paired test results are evaluated.
4.2.2 Slope equivalence evaluates the slope of the linear statistical relationship between the test results from the two testing
procedures. If the slope is equivalent to the value one (1), then the two testing processes meet slope equivalence.
4.2.3 Range equivalence evaluates the differences in means over a selected wider range of test results and the experiment uses
materials that cover that range. The combination of slope equivalence and means equivalence defines range equivalence.
4.2.4 Non-inferiority is concerned with a difference only in the direction of an inferior outcome in a performance characteristic
of the modified testing procedure versus the current testing procedure. Non-inferiority may involve the comparisons of means,
standard deviations, or other statistical parameters.
4.2.4.1 Non-inferiority studies may involve trade-offs in performance characteristics between the modified and current
procedures. For example, the modified process may be slightly inferior to the established process with respect to assay sensitivity
or precision but may have off-setting advantages such as faster delivery of test results or lower testing costs.
4.3 Risk Management—Guidance is provided for determining the amount of data required to control the risks of making the wrong
decision in accepting or rejecting equivalence (see 5.4 and Section A1.2).
4.3.1 The consumer’s risk is the risk of falsely declaring equivalence. The probability associated with this risk is directly controlled
to a low level so that accepting equivalence gives a high degree of assurance that the true difference is less than the equivalence
limit.
4.3.2 The producer’s risk is the risk of falsely rejecting equivalence. The probability associated with this risk is controlled by the
amount of data generated by the experiment. If valid improvements are rejected by equivalence testing, this can lead to opportunity
losses to the company and its laboratories (the producers) or cause unnecessary additional effort in improving the testing process.
5. Planning and Executing the Equivalence Study
5.1 This section discusses the stages of conducting an equivalence study: (1) determining the information needed, (2) setting
up and conducting the study design, and (3) performing the statistical analysis of the resulting data. The study is usually conducted
either in a single laboratory or, in the case of a method transfer, in both the originating and receiving laboratories. Using multiple
laboratories will almost always increase the inherent variability of the data in the study, which will increase the cost of performing
the study due to the need for more data.
5.2 Prior information required for the study design includes the equivalence limit, the consumer’s risk, and an estimate of the test
method precision.
5.2.1 For means equivalence tests there are two equivalence limits, –E and E, because of the need to detect a potential shift in either
direction. Limits may be non-symmetrical around zero, such as –$E_1$ and $E_2$, and this will usually be the case for slope equivalence.
For non-inferiority tests only one of these limits is tested.
5.2.2 The consumer’s risk may be determined by an industry norm or a regulatory requirement. A probability value often used is
α = 0.05, which is a 5 % risk to the user of the test results that the study falsely declares equivalence due to the modification of
the testing process.
5.2.3 A prior estimate of the test method precision is essential for determining the number of test results required in the
equivalence study design for adequate producer’s risk control. This estimate can be available from method development work,
from an interlaboratory study (ILS), or from other sources. The precision estimate should take into account the test
conditions of the ILS, such as repeatability or intermediate precision conditions.
5.2.4 For slope equivalence an additional piece of required information is the ratio λ of the measurement variability of the modified
and current test methods, expressed as variances. These estimates are usually available from experience or from method
development work, but see 5.3.2.1.
5.3 The design type determines how the data are collected and how much data are needed to control the producer’s risk, or the
risk of a wrong decision. For generating test result data from the modified and current testing processes, three basic designs are
discussed in this practice, the Two Independent Samples Design, the Paired Samples Design, and the Single Sample Design.
5.3.1 The Two Independent Samples Design is used for means equivalence and non-inferiority testing. In this design, sets of
independent test results are usually generated in a single laboratory on a quantity of a single homogeneous material by both testing
procedures under repeatability conditions. For method transfers each laboratory generates independent test results using the same
testing procedure on the same material under repeatability conditions at each laboratory. If this is not possible due to constraints
on time or facilities, then the test results can be conducted under intermediate precision conditions, but then a statistician is
recommended for the design and analysis of the test.
5.3.2 The Paired Samples Design is used for slope equivalence and may also be used for means equivalence. In this design, pairs
of single test results from each testing procedure are generated on the same material over different time periods, or on various
materials that are sampled either from a manufacturing process over time or from a set of materials that cover a predetermined
range.
5.3.2.1 If information on measurement error is not available for slope equivalence studies, the experiment design can be
modified to run duplicate test results by each testing process on each of the n materials to provide these precision estimates needed
for estimation of their ratio.
5.3.3 The Single Sample Design is used for bias equivalence. In this design, test results are generated by the current testing process
on a certified reference material.
5.4 Sample size in the design context refers to the number n of test results required by each testing process to manage the
producer’s risk. It is possible to use different sample sizes for the modified and current test processes, but this can lead to poor
control of the consumer’s risk (see A1.1.4).
5.4.1 The number of test results, symbol n, from each of the two testing processes controls the producer’s risk β of falsely rejecting
means equivalence at a given true mean difference, Δ. The producer’s risk may be alternatively stated in terms of the power, defined
as the probability 1 – β of correctly accepting equivalence at a given value of Δ.
5.4.1.1 For symmetric equivalence limits in means equivalence studies the power profile plots the probability 1 – β against
the absolute value of Δ, due to the symmetry of the equivalence limits. This calculation can be performed using a spreadsheet
computer package (see A1.7.1 and Appendix X1).
5.4.1.2 An example of a set of power profiles in means equivalence studies is shown in Fig. 1. The probability scale for power
on the vertical axis varies from 0 to 1. The horizontal axis is the true absolute difference. The power profile, a reversed S-shaped
curve, should be close to a power probability of 1 at zero absolute difference and will decline to the consumer risk probability at
an absolute difference of E. Power for absolute differences greater than E is less than the consumer risk and declines asymptotically
to zero as the absolute difference increases.
5.4.1.3 In Fig. 1, power profiles are shown for three different sample sizes for testing means equivalence. Increasing the sample
size moves the power curve to the right, giving a greater chance of accepting equivalence for a given true difference. Equations
for power profiles are shown in Section A1.6 and a spreadsheet example in Appendix X1.
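The shape of the power profiles described in 5.4.1 can be illustrated numerically. The sketch below uses a large-sample normal approximation written for illustration only; it is not taken from Annex A1, and the function name `tost_power` and its arguments are assumptions. It treats the precision σ as known, so that the standard error of the difference for n test results per process is σ√(2/n).

```python
from math import sqrt
from statistics import NormalDist


def tost_power(delta, E, sigma, n, alpha=0.05):
    """Approximate probability of accepting means equivalence.

    delta: true mean difference; E: symmetric equivalence limit;
    sigma: standard deviation assumed known; n: test results per process.
    Normal approximation (an assumption of this sketch).
    """
    z = NormalDist()
    se = sigma * sqrt(2.0 / n)        # standard error of the difference in means
    z_crit = z.inv_cdf(1.0 - alpha)   # one-sided critical value
    p = z.cdf((E - delta) / se - z_crit) + z.cdf((E + delta) / se - z_crit) - 1.0
    return max(p, 0.0)


# Profiles for E = 2 and sigma = 0.5 at differences 0.0, 0.2, ..., 2.4
for n in (3, 6, 20):
    profile = [round(tost_power(d / 5, 2.0, 0.5, n), 3) for d in range(0, 13)]
    print(n, profile)
```

Under this approximation each curve passes through the consumer’s risk α at an absolute difference of E, and a larger n keeps the power near 1 over a wider range of differences.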
5.4.2 Power curves for bias equivalence and non-inferiority are constructed by different formulas but have the same shape and
interpretation as those for means equivalence.
5.4.2.1 For non-inferiority testing, the power profile plots the probability 1 – β against the true mean difference Δ (see
A1.7.2) or against the true variance ratio ℛ for variances (see A1.7.3).
5.4.3 Power curves are evaluated by entering different values of n and evaluating the curve shape. A practical solution is to choose
n such that the power is above a 0.9 probability out to about one-half to two-thirds of the distance from zero to E, thus giving a
high probability that equivalence will be demonstrated for a range of true absolute differences that are deemed of little or no
scientific import in the test result.
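The selection rule in 5.4.3 can be sketched as a simple search over n. The code below is an illustration under stated assumptions: it uses a normal approximation to the power with σ treated as known, and the names `approx_power` and `smallest_n`, the `fraction` argument, and the search cap `n_max` are hypothetical. Because the approximation ignores the use of Student’s t with an estimated standard deviation, it may be somewhat optimistic for small n.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, E, sigma, n, alpha=0.05):
    """Normal-approximation power of the means equivalence test (sketch)."""
    z = NormalDist()
    se = sigma * sqrt(2.0 / n)
    z_crit = z.inv_cdf(1.0 - alpha)
    p = z.cdf((E - delta) / se - z_crit) + z.cdf((E + delta) / se - z_crit) - 1.0
    return max(p, 0.0)


def smallest_n(E, sigma, alpha=0.05, fraction=0.5, target=0.9, n_max=1000):
    """Smallest n per process with approximate power >= target at delta = fraction * E."""
    for n in range(2, n_max + 1):
        if approx_power(fraction * E, E, sigma, n, alpha) >= target:
            return n
    return None  # no n up to n_max meets the target


# Inputs E = 2 and sigma = 0.5, evaluated at one-half and two-thirds of E
print(smallest_n(E=2.0, sigma=0.5, fraction=0.5))
print(smallest_n(E=2.0, sigma=0.5, fraction=2 / 3))
```

The two calls bracket the range of distances named in 5.4.3; the final choice of n within that range remains a judgment about cost and risk.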
5.4.4 Annex A2 provides criteria for determining the number of samples required to meet power requirements for slope
equivalence.
5.5 The statistical analysis for accepting or rejecting equivalence of means and variances for a single material is similar for all
cases and depends on the outcome of one-sided statistical hypothesis tests. These calculations are given in detail with examples
in Sections 6, 7, 9, and 10, with statistical theory given in Annex A1. The statistical analysis for slope and range equivalence is
given in Section 8, with statistical theory given in Annex A2.
5.5.1 The data analysis for means equivalence uses a statistical methodology termed the two one-sided tests (TOST) procedure.
The null hypothesis (see A1.1.1) is that the average difference between two sets of data exceeds an equivalence limit in one
of the directions from zero, and this is tested in both directions. If the hypothesis is rejected in both directions then the alternate
hypothesis that the mean difference is less than the equivalence limit is accepted and the two sources of data are deemed means
equivalent.
FIG. 1 Multiple Power Curves for Lab Transfer Example
NOTE 1—Historically, this procedure originated in the pharmaceutical industry for use in bioequivalence trials (1, 2), and was denoted as the Two
One-Sided Tests Procedure, which has since been adopted for use in testing and measurement applications (3, 4).
5.5.1.1 For bias equivalence, the statistical test is based on only a single set of data conducted on a certified reference material
(CRM) because its accepted reference value (ARV) is considered to be a known mean with zero variability for the purpose of the
equivalence study.
5.5.2 The data analysis for non-inferiority testing of population means uses a single one-sided test in the direction of an inferior
outcome with respect to a performance characteristic determined by the test results. When the performance characteristic is defined
as “higher is better,” such as method sensitivity, the statistical test supports non-inferiority when LCL > –E. Conversely, when the
performance characteristic is defined as “lower is better,” such as incidence of misclassifications, the statistical test supports
non-inferiority when UCL < E.
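The one-sided decision rule in 5.5.2 can be written as a short function. This is a minimal sketch: the function name, its arguments, and the numerical values in the usage lines are hypothetical, and the confidence limits are assumed to have been computed already for the difference (modified minus current).

```python
def supports_non_inferiority(lcl, ucl, E, higher_is_better):
    """One-sided non-inferiority decision from confidence limits on the mean difference."""
    if higher_is_better:
        return lcl > -E   # inferiority would be a downward shift
    return ucl < E        # inferiority would be an upward shift


# Hypothetical limits, for illustration only
print(supports_non_inferiority(lcl=-0.4, ucl=0.9, E=1.0, higher_is_better=True))
print(supports_non_inferiority(lcl=-0.4, ucl=1.3, E=1.0, higher_is_better=False))
```

Only the limit on the side of inferiority is checked; the other confidence limit does not enter the decision.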
5.5.2.1 For the non-inferiority testing of precision, the variances of the two data sets are used, and “lower is better” for this
parameter, so the test for non-inferiority applies. Because variances are a scale parameter, the single non-inferiority test is based
on the ratio R of the two sample variances, and the non-inferiority limit E is also in the form of a ratio.
6. The TOST Procedure for Statistical Analysis of Means Equivalence — Two Independent Samples Design
6.1 Statistical Analysis—Let the sample data be denoted as $X_{ij}$ = the jth test result from the ith population. The equivalence limit
E, consumer’s risk α, and sample sizes have been previously determined.
6.1.1 Calculate averages, variances, standard deviations, and degrees of freedom for each sample:
$$\bar{X}_i = \frac{\sum_{j=1}^{n_i} X_{ij}}{n_i}, \quad i = 1, 2 \tag{1}$$
$$s_i^2 = \frac{\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}{n_i - 1}, \quad i = 1, 2 \tag{2}$$
$$s_i = \sqrt{s_i^2}, \quad i = 1, 2 \tag{3}$$
$$f_i = n_i - 1, \quad i = 1, 2 \tag{4}$$
6.1.2 Calculate the pooled standard deviation and degrees of freedom:
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \tag{5}$$
It is assumed that the sample variances come from populations having equal variances; if this appears not to be the case,
then use the procedure in A1.1.4.
If $n_1 = n_2 = n$, then:
$$s_p = \sqrt{\frac{s_1^2 + s_2^2}{2}}$$
$$f_p = n_1 + n_2 - 2 \tag{6}$$
p 1 2
6.1.3 Calculate the difference between means and its standard error:
$$D = \bar{X}_2 - \bar{X}_1 \tag{7}$$
$$s_D = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \tag{8}$$
The boldface numbers in parentheses refer to a list of references at the end of this standard.
If $n_1 = n_2 = n$, then:
$$s_D = s_p \sqrt{\frac{2}{n}}$$
6.1.4 Statistical Test for Equivalence—Compute the upper (UCL) and lower (LCL) confidence limits for the 100 (1 – 2α) %
two-sided confidence interval on the true difference. If the confidence interval is completely contained within the equivalence limits
(0 ± E), equivalently if LCL > –E and UCL < E, then accept equivalence. Otherwise, reject equivalence.
$$UCL = D + t s_D \tag{9}$$
$$LCL = D - t s_D \tag{10}$$
where t is the upper 100 (1 – α) % percentile of the Student’s t distribution with ($n_1 + n_2 - 2$) degrees of freedom.
6.2 Example for Means Equivalence—The example shown is data from a transfer of an ASTM test method from R&D Lab 1 to
Plant Lab 2 (Table 1). An equivalence limit of 2 units was proposed with a consumer risk of 5 %. An interlaboratory study (ILS)
on this test method had given an estimate of $s_r$ = 0.5 units for the repeatability standard deviation. Thus E = 2 units, α = 0.05, and
estimated σ = 0.5 units are inputs for this study (the actual units are unspecified for this example).
6.2.1 Sample Size Determination—Power profiles for n = 3, 6, and 20 were generated for a set of absolute difference values
ranging from 0.00 to 2.40 units in steps of 0.20 units, as shown in Fig. 1. All three curves intersect at the point (2, 0.05) as determined
by the consumer’s risk at the equivalence limit.
6.2.1.1 A sample size of n = 6 replicate assays per laboratory yielded a satisfactory power curve, in that the probability of
accepting equivalence (power) was greater than a 0.9 probability (or a 90 % power) for a difference of about 1.2 units or less.
Therefore, there would be less than an estimated 10 % risk to the producer that such a difference would fail to support equivalence
in the actual study.
6.2.1.2 A comparison of the three power curves indicates that the n = 3 design would be underpowered, as the power falls below
0.9 at 0.8 units. The n = 20 design gives somewhat more power than the n = 6 design but is more costly to conduct and may not
be worth the extra expenditure.
6.2.2 Averages, variances, standard deviations, and degrees of freedom for the two laboratories are:
$$\bar{X}_1 = (96.9 + 97.9 + 98.5 + 97.5 + 97.7 + 97.2)/6 = 97.62 \text{ mg/g}$$
$$\bar{X}_2 = (97.8 + 97.6 + 98.1 + 98.6 + 98.6 + 98.9)/6 = 98.27 \text{ mg/g}$$
$$s_1^2 = [(96.9 - 97.62)^2 + \cdots + (97.2 - 97.62)^2]/(6 - 1) = 0.31367$$
$$s_2^2 = [(97.8 - 98.27)^2 + \cdots + (98.9 - 98.27)^2]/(6 - 1) = 0.26267$$
$$s_1 = \sqrt{0.31367} = 0.560$$
$$s_2 = \sqrt{0.26267} = 0.513$$
$$f_i = n_i - 1 = 6 - 1 = 5$$
The estimates of standard deviation are in good agreement with the ILS estimate of 0.5 mg/g.
6.2.3 The pooled standard deviation is:
$$s_p = \sqrt{\frac{(6 - 1)0.31367 + (6 - 1)0.26267}{6 + 6 - 2}} = \sqrt{\frac{2.8817}{10}} = 0.537 \text{ mg/g}$$
with 10 degrees of freedom.
6.2.4 The difference of means is D = 98.27 – 97.62 = 0.65 mg/g. The plant laboratory average is 0.65 mg/g higher than the
development laboratory average. The standard error of the difference of means is $s_D = 0.537\sqrt{2/6} = 0.310$ mg/g with 10 degrees of
freedom (same as that for $s_p$).
TABLE 1 Data for Equivalence Test Between Two Laboratories
Test Results
Laboratory 1 96.9 97.9 98.5 97.5 97.7 97.2
Laboratory 2 97.8 97.6 98.1 98.6 98.6 98.9
6.2.5 The 95th percentile of Student’s t with 10 degrees of freedom is 1.812. Upper and lower confidence limits for the difference
of means are:
UCL = 0.65 + (1.812)(0.310) = 1.21
LCL = 0.65 – (1.812)(0.310) = 0.09
The 90 % two-sided confidence interval on the true difference is 0.09 to 1.21 mg/g and is completely contained within the
equivalence interval of –2 to 2 mg/g. Since 0.09 > –2 and 1.21 < 2, equivalence is accepted.
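The arithmetic of 6.2.2 through 6.2.5 can be retraced with a few lines of code. This is a minimal sketch using the Table 1 data. The function name `two_sample_tost` is hypothetical, the critical value 1.812 is copied from 6.2.5 rather than computed, and the standard error is written in the usual pooled form, which reduces to the $\sqrt{2/6}$ factor of 6.2.4 when both sample sizes are 6.

```python
from math import sqrt
from statistics import mean, variance

lab1 = [96.9, 97.9, 98.5, 97.5, 97.7, 97.2]   # Table 1, Laboratory 1 (mg/g)
lab2 = [97.8, 97.6, 98.1, 98.6, 98.6, 98.9]   # Table 1, Laboratory 2 (mg/g)
E = 2.0          # equivalence limit, mg/g
T_CRIT = 1.812   # 95th percentile of Student's t, 10 df (quoted in 6.2.5)


def two_sample_tost(x1, x2, E, t_crit):
    """Return D, s_p, s_D, LCL, UCL and the accept/reject decision."""
    n1, n2 = len(x1), len(x2)
    D = mean(x2) - mean(x1)                      # difference of means
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    s_p = sqrt(pooled_var)                       # pooled standard deviation
    s_D = s_p * sqrt(1.0 / n1 + 1.0 / n2)        # standard error of D
    lcl = D - t_crit * s_D
    ucl = D + t_crit * s_D
    accept = lcl > -E and ucl < E                # interval inside (-E, E)
    return D, s_p, s_D, lcl, ucl, accept


D, s_p, s_D, lcl, ucl, accept = two_sample_tost(lab1, lab2, E, T_CRIT)
print(f"D = {D:.2f}, s_p = {s_p:.3f}, s_D = {s_D:.3f}")
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}, accept equivalence: {accept}")
```

Passing the critical value in as an argument keeps the sketch free of third-party dependencies. A statistics library could supply the percentile instead.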
7. The TOST Procedure for Statistical Analysis of Means Equivalence — Paired Samples Design
7.1 Statistical Analysis—Let the sample data be denoted as $X_{ij}$ = the test result from the ith population and the jth pair, where
i = 1 or 2, j = 1, …, n. Each pair represents a single test result from each population at a given sampling point; for example, the
pairing factor may be the time of sampling from a process. The equivalence limit
E, consumer’s risk α, and sample size (number of pairs, symbol n) have been previously determined (see Section 5).
7.1.1 Calculate the n differences, symbol $d_j$, between the two test results within each pair, the average of the differences,
symbol $\bar{d}$, and the standard deviation of the differences, symbol $s_d$, with its degrees of freedom, symbol f.
$$d_j = X_{1j} - X_{2j}, \quad j = 1, \ldots, n \tag{11}$$
$$\bar{d} = \frac{\sum_{j=1}^{n} d_j}{n} = D \tag{12}$$
$$s_d = \sqrt{\frac{\sum_{j=1}^{n} (d_j - \bar{d})^2}{n - 1}} \tag{13}$$
$$f = n - 1 \tag{14}$$
7.1.2 Calculate the standard error of the mean difference, symbol $s_D$.
$$s_D = \frac{s_d}{\sqrt{n}} \tag{15}$$
7.1.3 Statistical Test for Equivalence—Compute the upper (UCL) and lower (LCL) confidence limits for the 100(1 – 2α) %
two-sided confidence interval on the true difference. If the confidence interval is completely contained within the equivalence limits
(0 ± E), or equivalently if LCL > –E and UCL < E, then accept equivalence. Otherwise, reject equivalence.
$$\mathrm{UCL} = D + t\,s_D \tag{16}$$
$$\mathrm{LCL} = D - t\,s_D \tag{17}$$
where t is the upper 100(1 – α) % percentile of the Student’s t distribution with (n − 1) degrees of freedom.
7.2 Example for Means Equivalence—Total organic carbon in purified water was measured by an on-line analyzer, wherein a water
sample was taken directly into the analyzer from the pipeline through a sampling port and the test result was determined by a series
of operations within the instrument. A new analyzer was to be qualified by running a TOC analysis at the same time as the current
analyzer utilizing a parallel sampling port on the pipeline. The sampling time was the pairing factor, and the data from
the two instruments constituted a pair of single test results measured at a particular sampling time. Sampling was to be conducted
at four-hour intervals.
An equivalence limit of 2 parts per billion (ppb), or 4 % of the nominal process average of 50 ppb, was proposed with a
consumer risk of 5 %. A repeatability estimate of $s_r$ = 0.7 ppb, based on previous validation work, gave an estimate for $\sigma_d = 0.7\sqrt{2}$
or approximately 1 ppb. Thus E = 2 ppb, α = 0.05, and $\sigma_d$ = 1 ppb were inputs for this study.
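The factor $\sqrt{2}$ follows from the variance of a difference. The short derivation below assumes that both analyzers share the same repeatability standard deviation $\sigma_r$ and that their measurement errors within a pair are independent; the text implies these assumptions through its use of a single repeatability estimate but does not state them.

$$\sigma_d^2 = \operatorname{Var}(X_{1j} - X_{2j}) = \sigma_r^2 + \sigma_r^2 = 2\sigma_r^2$$
$$\sigma_d = \sigma_r\sqrt{2} \approx 0.7 \times 1.414 = 0.99 \approx 1 \text{ ppb}$$

Variation between sampling times does not enter this expression, because it is common to both results in a pair and cancels in the difference.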
7.2.1 Sample Size Determination—Because the paired samples design uses the differences of the test results within sampling
periods for data analysis, the sample size equals the number of pairs for purposes of calculating the power curve. In this example,
the cost of obtaining test results was not a major consideration once the new analyzer was installed in the system. Comparative
power profiles for n = 10, 20, and 50 sample pairs are shown in Fig. 2. The sample size of 20 pairs yielded a satisfactory power
curve, in that the probability of accepting equivalence was greater than 0.9 (90 % power) for a true difference of about 1.25
ppb. Therefore, there would be less than an estimated 10 % risk to the producer that such a difference would fail to support
equivalence in the actual study.
FIG. 2 Power Curves for Total Organic Carbon Analyzers Comparison
7.2.2 Test results for the two instruments at each of the 20 sampling times are listed in Table 2. The current analyzer was
designated as Instrument A, and the new analyzer was designated as Instrument B. The differences $d_j$ at each sampling time period
were calculated and listed in Table 2 as differences in the test results of Instrument B minus Instrument A. The averages and
standard deviations of the test results for each analyzer and their differences are also listed in Table 2.
7.2.2.1 The average difference $\bar{d}$ was 0.46 ppb and the standard deviation of the differences $s_d$ was 1.05 ppb with f = 19 degrees
of freedom. The standard error of the average difference was:
$$s_D = \frac{1.05}{\sqrt{20}} = 0.235 \text{ ppb}$$
7.2.2.2 Note that the standard deviations of test results for each analyzer over time were about 6 ppb due to process fluctuations
in a range of 37–59 ppb. The source of variation due to pairs (sampling times from the process) is eliminated in the variation
of the differences by pairing the test results.
TABLE 2 Data for Paired Samples Equivalence Test (TOC in Water, ppb)
Sampling Time Inst A Inst B Diff (B – A)
1 46.4 48.8 2.4
2 44.2 43.5 –0.7
3 52.4 53.0 0.6
4 37.6 37.3 –0.3
5 49.3 49.1 –0.2
6 45.0 44.5 –0.5
7 51.4 51.3 –0.1
8 57.6 56.8 –0.8
9 43.4 44.9 1.5
10 45.2 44.1 –1.1
11 59.0 58.5 –0.5
12 43.1 44.1 1.0
13 39.3 40.9 1.6
14 48.2 48.4 0.2
15 48.7 49.0 0.3
16 44.4 46.1 1.7
17 52.7 53.2 0.5
18 43.3 44.6 1.3
19 54.4 56.7 2.3
20 58.4 58.4 0.0
Average 48.20 48.66 0.46
Std Dev 6.13 5.99 1.05
7.2.3 The 95th percentile of Student’s t with 19 degrees of freedom was 1.729. Upper and lower confidence limits for the difference
of means were:
$$\mathrm{UCL} = D + t\,s_D = 0.46 + (1.729)(0.235) = 0.87 \text{ ppb}$$
$$\mathrm{LCL} = D - t\,s_D = 0.46 - (1.729)(0.235) = 0.05 \text{ ppb}$$
The 90 % two-sided confidence interval on the true difference is 0.05 to 0.87 ppb and is completely contained within the
equivalence interval of –2 to 2 ppb. Since 0.05 > –2 and 0.87 < 2, equivalence of the two analyzers is accepted.
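The paired-samples calculation of 7.1.1 through 7.1.3 can be retraced on the Table 2 data as follows. This is a minimal sketch. The function name `paired_tost` is hypothetical, the critical value 1.729 is copied from 7.2.3 rather than computed, and the differences are taken as Instrument B minus Instrument A, as in Table 2.

```python
from math import sqrt
from statistics import mean, stdev

# Table 2, TOC in water (ppb), sampling times 1 to 20
inst_a = [46.4, 44.2, 52.4, 37.6, 49.3, 45.0, 51.4, 57.6, 43.4, 45.2,
          59.0, 43.1, 39.3, 48.2, 48.7, 44.4, 52.7, 43.3, 54.4, 58.4]
inst_b = [48.8, 43.5, 53.0, 37.3, 49.1, 44.5, 51.3, 56.8, 44.9, 44.1,
          58.5, 44.1, 40.9, 48.4, 49.0, 46.1, 53.2, 44.6, 56.7, 58.4]
E = 2.0          # equivalence limit, ppb
T_CRIT = 1.729   # 95th percentile of Student's t, 19 df (quoted in 7.2.3)


def paired_tost(x_current, x_new, E, t_crit):
    """Return d_bar, s_d, s_D, LCL, UCL and the accept/reject decision."""
    if len(x_current) != len(x_new):
        raise ValueError("paired design needs equal-length samples")
    d = [new - cur for cur, new in zip(x_current, x_new)]   # B minus A
    n = len(d)
    d_bar = mean(d)               # average difference, D
    s_d = stdev(d)                # standard deviation of differences, n - 1 df
    s_D = s_d / sqrt(n)           # standard error of the average difference
    lcl = d_bar - t_crit * s_D
    ucl = d_bar + t_crit * s_D
    accept = lcl > -E and ucl < E
    return d_bar, s_d, s_D, lcl, ucl, accept


d_bar, s_d, s_D, lcl, ucl, accept = paired_tost(inst_a, inst_b, E, T_CRIT)
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, s_D = {s_D:.3f}")
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}, accept equivalence: {accept}")
# Per-instrument spread over time, for comparison with s_d (see 7.2.2.2)
print(f"SD Inst A = {stdev(inst_a):.2f}, SD Inst B = {stdev(inst_b):.2f}")
```

Comparing the last printed line with `s_d` shows the effect described in 7.2.2.2: the spread of each instrument over time is several times larger than the spread of the paired differences.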
8. Procedure for Equivalence of Test Results Over a Range of Values
8.1 Range equivalence is the condition that means equivalence of two testing processes holds over a predetermined range of a
material’s characteristic being measured. The experiment consists of obtaining pairs of test results by each process on a number
of different material samples. The approach taken for evaluating range equivalence is through a linear statistical function,
$Y = \beta_0 + \beta_1 X$, describing a straight-line statistical relationship between the test results from two testing processes denoted as X and Y. The
Y intercept $\beta_0$ is the value of Y when X = 0, and the slope $\beta_1$ is the amount of change in Y units for a unit change in X. The criterion
for range equivalence is twofold: (1) that the intercept $\beta_0 = 0$, and (2) that the slope $\beta_1 = 1$, each within predetermined limits. This
states that, within limits, the relationship Y = X holds over the range of the data.
8.1.1 In many cases, the data range is far removed from zero, so that the intercept parameter is not precisely estimated and can
even be a negative test result value. For that reason, the equivalence procedure for $\beta_0 = 0$ should be replaced by a test that the means
of the X and Y test results are equivalent, indicating that the center of the straight-line relationship is locally close to the Y = X
line. This means-equivalence procedure is covered in Section 7. Note that the estimated line must pass through the point
determined from the X and Y averages of the data.
8.1.2 The equivalence testing procedure for $\beta_1 = 1$ is termed slope equivalence, and this topic is covered in the remainder of
this section.
8.1.3 Statistical tests for means equivalence and slope equivalence must each meet equivalence to assert range equivalence,
because together they constitute an intersection-union test (see A1.4.1). If each of the component statistical tests is an
α-level test, then the range equivalence test is also an α-level test.
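A1.4.1 is not reproduced in this excerpt, but the usual argument for why an intersection-union test needs no split of the alpha risk can be sketched in one line. Suppose at least one of the two component conditions (means or slope) is truly non-equivalent. Range equivalence is asserted only when both component tests accept equivalence, so:

$$P(\text{both tests accept}) \le \min\{P(\text{means test accepts}),\ P(\text{slope test accepts})\} \le \alpha$$

The last inequality holds because the test of the non-equivalent component accepts with probability at most α. The bound does not require the two component tests to be independent.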
8.2 Slope Equivalence—The Y intercept and the slope for the linear statistical function in 8.1 are estimated from the data by a
process known as least squares, which minimizes the sum of the squared deviations of the data points from the estimated line.
Unlike the similar simple linear regression model that predicts Y from X, where the X values are known constants and only Y is
subject to measurement variation (see Practice E3080), here both X and Y are subject to measurement errors, with their variances
denoted as $\sigma_\delta^2$ and $\sigma_\varepsilon^2$, respectively (see A2.1). This fact requires a different criterion for the least-squares procedure, which is known
as errors-in-variables regression. Instead of minimizing the point-to-line distances in the Y direction, the procedure minimizes them
in a direction that depends on the precision ratio of Y with respect to X, denoted as $\lambda = \sigma_\varepsilon^2 / \sigma_\delta^2$.
8.2.1 The measurement error variances can be estimated from experience with current procedure use and method development
data for the modified procedure. Alternatively, the comparison experiment can include duplicate test results from both procedures
for estimating these variances, as noted in the referenced article (5).
8.2.2 In many situations it can be assumed that the two test methods have similar measurement error, which is the case
λ = 1. Then the least squares procedure minimizes the squared differences in the perpendicular direction from the points to
the line. This is termed orthogonal least squares, and this procedure will be described in this section. Annex A2 discusses
methodology for the situation where λ ≠ 1.
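For orientation, the slope and intercept of an errors-in-variables line can be computed from the corrected sums of squares and cross products. The sketch below uses the widely published closed form for this kind of fit (often called Deming regression), with λ = 1 giving orthogonal least squares. It is not copied from 8.3 or Annex A2, which are not reproduced in this excerpt, and the function name `errors_in_variables_fit` and the demonstration data are hypothetical.

```python
from math import sqrt
from statistics import mean


def errors_in_variables_fit(x, y, lam=1.0):
    """Return (intercept, slope) of an errors-in-variables straight-line fit.

    lam is the ratio (error variance of Y) / (error variance of X);
    lam = 1 corresponds to orthogonal least squares.
    """
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xy == 0:
        raise ValueError("slope is undefined when the cross-product sum is zero")
    diff = s_yy - lam * s_xx
    slope = (diff + sqrt(diff ** 2 + 4.0 * lam * s_xy ** 2)) / (2.0 * s_xy)
    intercept = y_bar - slope * x_bar    # line passes through (x_bar, y_bar)
    return intercept, slope


if __name__ == "__main__":
    # Hypothetical paired results from two test methods (not from the practice)
    x = [10.2, 20.1, 29.7, 40.4, 50.0]
    y = [10.0, 20.5, 30.1, 39.8, 50.6]
    b0, b1 = errors_in_variables_fit(x, y)
    print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

With λ = 1 the fit treats X and Y symmetrically: exchanging the roles of X and Y returns the reciprocal slope, which ordinary least squares of Y on X does not do.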
th
8.3 Statistical Analysis for Slope Equivalence—F
...







