EN 62429:2008
Reliability growth - Stress testing for early failures in unique complex systems
This International Standard gives guidance for reliability growth during final testing or acceptance testing of unique complex systems. It gives guidance on accelerated test conditions and criteria for stopping these tests.
Zuverlässigkeitswachstum - Beanspruchungsprüfung auf Frühausfälle in einzelnen komplexen Systemen
Croissance de fiabilité - Essais de contraintes pour révéler les défaillances précoces d'un système complexe et unique
Rast zanesljivosti - Obremenjevalno preskušanje za odkrivanje zgodnjih odpovedi v edinstvenih kompleksnih sistemih (IEC 62429:2007)
SLOVENIAN STANDARD
SIST EN 62429:2008
1 June 2008
Reliability growth - Stress testing for early failures in unique complex systems (IEC 62429:2007)
This Slovenian standard is identical to: EN 62429:2008
ICS: 03.120.01 Quality in general; 21.020 Characteristics and design of machines, apparatus, equipment
Language: en
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
EUROPEAN STANDARD EN 62429 NORME EUROPÉENNE
EUROPÄISCHE NORM April 2008
CENELEC European Committee for Electrotechnical Standardization Comité Européen de Normalisation Electrotechnique Europäisches Komitee für Elektrotechnische Normung
Central Secretariat: rue de Stassart 35, B - 1050 Brussels
© 2008 CENELEC - All rights of exploitation in any form and by any means reserved worldwide for CENELEC members.
Ref. No. EN 62429:2008 E
ICS 03.120.01; 03.120.99
English version
Reliability growth - Stress testing for early failures in unique complex systems (IEC 62429:2007)
Croissance de fiabilité - Essais de contraintes pour révéler les défaillances précoces d'un système complexe et unique (CEI 62429:2007)
Zuverlässigkeitswachstum - Beanspruchungsprüfung auf Frühausfälle in einzelnen komplexen Systemen (IEC 62429:2007)
This European Standard was approved by CENELEC on 2008-03-01. CENELEC members are bound to comply with the CEN/CENELEC Internal Regulations which stipulate the conditions for giving this European Standard the status of a national standard without any alteration.
Up-to-date lists and bibliographical references concerning such national standards may be obtained on application to the Central Secretariat or to any CENELEC member.
This European Standard exists in three official versions (English, French, German). A version in any other language made by translation under the responsibility of a CENELEC member into its own language and notified to the Central Secretariat has the same status as the official versions.
CENELEC members are the national electrotechnical committees of Austria, Belgium, Bulgaria, Cyprus, the Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, the Netherlands, Norway, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden, Switzerland and the United Kingdom.
Foreword
The text of document 56/1232/FDIS, future edition 1 of IEC 62429, prepared by IEC TC 56, Dependability, was submitted to the IEC-CENELEC parallel vote and was approved by CENELEC as EN 62429 on 2008-03-01.
The following dates were fixed:
– latest date by which the EN has to be implemented at national level by publication of an identical national standard or by endorsement (dop): 2008-12-01
– latest date by which the national standards conflicting with the EN have to be withdrawn (dow): 2011-03-01
Annex ZA has been added by CENELEC.
__________
Endorsement notice
The text of the International Standard IEC 62429:2007 was approved by CENELEC as a European Standard without any modification.
In the official version, for Bibliography, the following notes have to be added for the standards indicated:
IEC 60300-1 NOTE Harmonized as EN 60300-1:2003 (not modified).
IEC 60300-2 NOTE Harmonized as EN 60300-2:2004 (not modified).
IEC 60300-3-1 NOTE Harmonized as EN 60300-3-1:2004 (not modified).
IEC 60706-5 NOTE Harmonized as EN 60706-5:2007 (not modified).
IEC 60812 NOTE Harmonized as EN 60812:2006 (not modified).
IEC 61014 NOTE Harmonized as EN 61014:2003 (not modified).
IEC 61025 NOTE Harmonized as EN 61025:2007 (not modified).
IEC 61078 NOTE Harmonized as EN 61078:2006 (not modified).
IEC 61160 NOTE Harmonized as EN 61160:2005 (not modified).
ISO 9000 NOTE Harmonized as EN ISO 9000:2005 (not modified).
__________
Annex ZA (normative)
Normative references to international publications with their corresponding European publications
The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
NOTE When an international publication has been modified by common modifications, indicated by (mod), the relevant EN/HD applies.
Publication | Year | Title | EN/HD | Year
IEC 60050-191 | 1990 | International Electrotechnical Vocabulary (IEV) - Chapter 191: Dependability and quality of service | - | -
IEC 60300-3-5 | - 1) | Dependability management - Part 3-5: Application guide - Reliability test conditions and statistical test principles | - | -
IEC 60605-2 | - 1) | Equipment reliability testing - Part 2: Design of test cycles | - | -
IEC 61163-1 | 2006 | Reliability stress screening - Part 1: Repairable assemblies manufactured in lots | EN 61163-1 | 2006
IEC 61163-2 | - 1) | Reliability stress screening - Part 2: Electronic components | - | -
IEC 61164 | - 1) | Reliability growth - Statistical test and estimation methods | EN 61164 | 2004 2)
IEC 61710 | - 1) | Power law model - Goodness-of-fit tests and estimation methods | - | -
1) Undated reference.
2) Valid edition at date of issue.
IEC 62429
Edition 1.0 2007-11
INTERNATIONAL STANDARD
NORME INTERNATIONALE
Reliability growth – Stress testing for early failures in unique complex systems
Croissance de fiabilité – Essais de contraintes pour révéler les défaillances précoces d’un système complexe et unique
INTERNATIONAL ELECTROTECHNICAL COMMISSION
COMMISSION ELECTROTECHNIQUE INTERNATIONALE
ICS 03.120.01; 03.120.99
PRICE CODE / CODE PRIX: W
ISBN 2-8318-9427-1
CONTENTS
FOREWORD
1 Scope
2 Normative references
3 Terms, definitions, abbreviations and symbols
3.1 Terms and definitions
3.2 Acronyms
3.3 Symbols
4 General
5 Planning and performing a reliability growth test
5.1 Step 1 – Should a reliability growth test be used?
5.2 Step 2 – Failure definitions and data collection
5.3 Step 3 – Stress levels
5.3.1 General
5.3.2 Increased operating load
5.3.3 Increased environmental stress
5.4 Step 4 – Failure analysis and classification of failures
5.4.1 General
5.4.2 Relevant failures
5.4.3 Non-relevant failures
5.5 Step 5 – Stop criteria
5.5.1 General
5.5.2 Method 1 – Fixed testing programs
5.5.3 Method 2 – Graphical analysis
5.5.4 Method 3 – Success ratio test
5.5.5 Method 4 – Estimation of reliability
5.5.6 Method 5 – Comparison with acceptable instantaneous failure intensity
5.5.7 Method 6 – Estimation of remaining latent faults
5.5.8 Method 7 – Reliability indicator testing
5.6 Step 6 – Verification of repairs and reliability growth
5.7 Step 7 – Reporting and feedback
Annex A (informative) Practical example of method 3 – Success ratio test
Annex B (informative) Practical example of method 5 – Comparison with acceptable instantaneous failure intensity
Annex C (informative) Practical example of method 6 – Estimation of remaining latent faults
Bibliography
Figure 1 – The bathtub curve
Figure 2 – Evaluating whether the cumulative failure curve has levelled out
Figure 3 – Method 2
Figure B.1 – A reliability growth plot of the data from Table B.1
Table 1 – Probability that a system with failure probability of 0,001 will pass N successive tests
Table 2 – Probability that a system with failure probability of 0,000 001 will pass N successive tests
Table 3 – Correct and incorrect decisions using reliability indicators
Table B.1 – Reliability growth and stopping times for the practical example
Table C.1 – Determining when to stop the test
INTERNATIONAL ELECTROTECHNICAL COMMISSION
____________
RELIABILITY GROWTH – STRESS TESTING FOR EARLY FAILURES IN UNIQUE COMPLEX SYSTEMS
FOREWORD
1) The International Electrotechnical Commission (IEC) is a worldwide organization for standardization comprising all national electrotechnical committees (IEC National Committees). The object of IEC is to promote international co-operation on all questions concerning standardization in the electrical and electronic fields. To this end and in addition to other activities, IEC publishes International Standards, Technical Specifications, Technical Reports, Publicly Available Specifications (PAS) and Guides (hereafter referred to as “IEC Publication(s)”). Their preparation is entrusted to technical committees; any IEC National Committee interested in the subject dealt with may participate in this preparatory work. International, governmental and non-governmental organizations liaising with the IEC also participate in this preparation. IEC collaborates closely with the International Organization for Standardization (ISO) in accordance with conditions determined by agreement between the two organizations.
2) The formal decisions or agreements of IEC on technical matters express, as nearly as possible, an international consensus of opinion on the relevant subjects since each technical committee has representation from all interested IEC National Committees.
3) IEC Publications have the form of recommendations for international use and are accepted by IEC National Committees in that sense. While all reasonable efforts are made to ensure that the technical content of IEC Publications is accurate, IEC cannot be held responsible for the way in which they are used or for any misinterpretation by any end user.
4) In order to promote international uniformity, IEC National Committees undertake to apply IEC Publications transparently to the maximum extent possible in their national and regional publications. Any divergence between any IEC Publication and the corresponding national or regional publication shall be clearly indicated in the latter.
5) IEC provides no marking procedure to indicate its approval and cannot be rendered responsible for any equipment declared to be in conformity with an IEC Publication.
6) All users should ensure that they have the latest edition of this publication.
7) No liability shall attach to IEC or its directors, employees, servants or agents including individual experts and members of its technical committees and IEC National Committees for any personal injury, property damage or other damage of any nature whatsoever, whether direct or indirect, or for costs (including legal fees) and expenses arising out of the publication, use of, or reliance upon, this IEC Publication or any other IEC Publications.
8) Attention is drawn to the Normative references cited in this publication. Use of the referenced publications is indispensable for the correct application of this publication.
9) Attention is drawn to the possibility that some of the elements of this IEC Publication may be the subject of patent rights. IEC shall not be held responsible for identifying any or all such patent rights.
International Standard IEC 62429 has been prepared by IEC technical committee 56: Dependability.
The text of this standard is based on the following documents:
FDIS | Report on voting
56/1232/FDIS | 56/1249/RVD
Full information on the voting for the approval of this standard can be found in the report on voting indicated in the above table.
This publication has been drafted in accordance with the ISO/IEC Directives, Part 2.
The committee has decided that the contents of this publication will remain unchanged until the maintenance result date indicated on the IEC web site under "http://webstore.iec.ch" in the data related to the specific publication. At this date, the publication will be
• reconfirmed,
• withdrawn,
• replaced by a revised edition, or
• amended.
1 Scope
This International Standard gives guidance for reliability growth during final testing or acceptance testing of unique complex systems. It gives guidance on accelerated test conditions and criteria for stopping these tests. “Unique” means that no information exists on similar systems, and the small number of produced systems means that information deduced from the test has limited use for future production.
This standard concerns reliability growth of repairable complex systems consisting of hardware with embedded software. It can be used for describing the procedure for acceptance testing, "running-in", and to ensure that reliability of a delivered system is not compromised by coding errors, workmanship errors or manufacturing errors. It covers only the early failure period of the system life cycle, and neither the constant failure period nor the wear-out failure period. It can also be used when a company wants to optimize the duration of internal production testing during manufacturing of prototypes, single systems or small series.
It is applicable mainly to large hardware/software systems, but does not cover large networks, for example telecommunications and power networks, since new parts of such systems cannot usually be isolated during the testing. It does not cover software tested alone, but the methods can be used during testing of large embedded software programs in operational hardware, when simulated operating loads are used. It addresses growth testing before or at delivery of a finished system. The testing can therefore take place at the manufacturer's or at the end user's premises.
If the user of a system performs reliability growth by a policy of updating hardware and software with improved versions, this standard can be used to guide the growth process. This standard covers a wide field of applications, but is not applicable to health or safety aspects of systems. This standard does not apply to systems that are covered by IEC 62279 [39].
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
IEC 60050-191:1990, International Electrotechnical Vocabulary – Chapter 191: Dependability and quality of service
IEC 60300-3-5, Dependability management – Part 3-5: Application guide – Reliability test conditions and statistical test principles
IEC 60605-2, Equipment reliability testing – Part 2: Design of test cycles
IEC 61163-1:2006, Reliability stress screening – Part 1: Repairable assemblies manufactured in lots
IEC 61163-2, Reliability stress screening – Part 2: Electronic components
IEC 61164, Reliability growth – Statistical test and estimation methods
IEC 61710, Power law model – Goodness-of-fit tests and estimation methods
3 Terms, definitions, abbreviations and symbols
3.1 Terms and definitions
For the purposes of this document, the terms and definitions given in IEC 60050-191, as well as the following, apply.
3.1.1
time compression
reducing test time by testing with higher use time than in the field
NOTE An example is testing a system for 24 h a day when it is used 8 h a day in the field.
3.1.2
accelerated test
test in which the applied stress level is chosen to exceed that stated in the reference conditions in order to shorten the time duration required to observe the stress response of the item, or to magnify the response in a given time duration
NOTE To be valid, an accelerated test should not alter the basic fault modes and failure mechanisms, or their relative prevalence.
[IEV 191-14-07]
3.1.3
(time) acceleration factor
ratio between the time durations necessary to obtain the same stated number of failures or degradations in two equal size samples, under two different sets of stress conditions involving the same failure mechanisms and fault modes and their relative prevalence
NOTE One of the two sets of stress conditions should be a reference set.
[IEV 191-14-10]
3.1.4
execution time
time to perform a stated number of transactions
3.1.5
fault
state of an item characterized by inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to lack of external resources
NOTE 1 A fault is often the result of a failure of the item itself, but may exist without prior failure.
[IEV 191-05-01]
NOTE 2 In English, the term “fault” is also used in the field of electric power systems with the meaning as given in IEV 604-02-01 [42] 1); then, the corresponding term in French is “défaut”.
NOTE 3 In this standard, the term “latent fault” is used to emphasize that the fault has not yet caused a failure.
NOTE 4 Software alone is deterministic. But this standard considers software embedded in hardware where the software can have latent faults relating to the hardware and the environment, e.g. insufficient protection against double keying, no checksum in communication, or no sanity check of input data or output data.
3.1.6
bug
popular name for a software latent fault
3.1.7
reliability indicator
non-functional parameter that points to a probable failure in a short time
3.1.8
success ratio test
test repeated a number of times of which all have to be passed without failures
3.1.9
system
set of interrelated or interacting elements
[ISO 9000:2005, 3.2.1] [41]
NOTE 1 In the context of dependability, a system will have
– a defined purpose expressed in terms of intended functions,
– stated conditions of operation/use, and
– defined boundaries.
NOTE 2 The structure of a system may be hierarchical [IEC 60300-1, 3.6] [43].
NOTE 3 For some systems, such as information technology products, data is an important part of the system elements. [Future IEC 60300-3-15, modified] [44].
3.1.10
transaction
set of input parameters and preconditions selected from operating loads for the system
3.1.11
root cause analysis
activity to identify the cause of a fault or failure, so it can be removed by design or process changes
3.1.12
error
discrepancy between a computed, observed or measured value or condition and the true, specified or theoretically correct value or condition
NOTE 1 An error can be caused by a faulty item, e.g. a computing error made by faulty computer equipment.
NOTE 2 The French term “erreur” may also designate a mistake (see IEV 191-05-25).
[IEV 191-05-24]
1) References in square brackets refer to the bibliography.
3.1.13
mistake
human error
human action that produces an unintended result
[IEV 191-05-25]
3.1.14
failure
termination of the ability of an item to perform a required function
NOTE 1 After failure the item has a fault.
NOTE 2 "Failure" is an event, as distinguished from "fault", which is a state.
NOTE 3 This concept as defined does not apply to items consisting of software only.
[IEV 191-04-01]
NOTE 4 Software alone is deterministic. But this standard considers software embedded in hardware where the software can have latent faults relating to the hardware and the environment, e.g. insufficient protection against double keying, no checksum in communication, or no validity check of input data or output data.
3.1.15
failure intensity
instantaneous failure intensity
z(t)
limit, if this exists, of the ratio of the mean number of failures of a repaired item in a time interval (t, t + Δt), and the length of this interval, Δt, when the length of the time interval tends to zero
NOTE 1 The instantaneous failure intensity is expressed by the formula
$$z(t) = \lim_{\Delta t \to 0^{+}} \frac{E\left[N(t + \Delta t) - N(t)\right]}{\Delta t}$$
[IEV 191-12-04]
NOTE 2 To avoid confusion this standard will use “instantaneous failure intensity” since a system is repaired when it fails, and a latent fault is repaired (removed) when precipitated as a failure.
3.2 Abbreviations
CPU  Central processor unit
EMC  Electromagnetic compatibility
ESD  Electrostatic discharge
FMEA  Failure mode and effect analysis
MTBF  Mean operating time between failures
RAM  Random access memory
3.3 Symbols
C  total number of transactions
D(t)  the number of faults detected by time t
F_u  unacceptable number of failed transactions out of C transactions
i  fault number
M  probability that a system with an unacceptable reliability passes N tests without a failure
m  number of latent faults in the system
N  number of transactions to be performed without failure
p  unacceptable probability of failure per transaction
RCM r(T_t)  risk criterion metric for remaining latent faults at total test time T_t
r_c  the estimated number of remaining latent faults in the system
r(T_t)  remaining (undetected) latent faults predicted at accumulated test time T_t
s  number of test time intervals used in the Schneidewind model to estimate the model parameters
t  actual test time
t_status  test time at status
T_{D(t)}  the accumulated test time by which D(t) faults were detected
T_i  the accumulated test time when fault i was detected
T_min  the minimum test time that shall be accumulated by the system for 0 failures
T_t  accumulated test time measured in time units of the Schneidewind model
z  the acceptable instantaneous failure intensity
z_i  the instantaneous failure intensity of fault i
θ_i  cumulative mean operating time between failures (MTBF) when fault i was detected
NOTE The term “cumulative MTBF” is used to be in line with other reliability growth models described in the literature. It is instructive in displaying a growth in reliability due to defect root cause elimination. The cumulative MTBF (θ_i) for each fault i is determined as θ_i = T_i / i.
α  empirical constant in the Schneidewind model – failure intensity at test time = 0
β  empirical constant in the Schneidewind model – proportionality constant for failure intensity over time – Unit: (time)^-1
δ  the probability of no failure occurring by T_min for a given acceptable instantaneous failure intensity
4 General
This standard is one of a series of standards under the application guide IEC 61014 [34].
This standard applies to large hardware-software systems when tested using a simulated operating load. Therefore, it is not known during the test if a failure is caused by hardware, software, operating load, or a combination of these. A failure may be caused by a hardware failure, e.g. a random access memory (RAM) failure, a change of timing causing data collision, or an electromagnetic disturbance, changing data transmitted. The failure may also be caused by a software latent fault or by illegal data. How the failed item is repaired or the software is changed is, for this standard, only relevant to the extent that it influences the test decisions, e.g. through the assumptions of the statistical model.
Nearly all modern systems contain embedded software. The software is typically tested on development hardware using transactions derived from the system specifications. Often the software is finished late so that the time for testing the software in the actual hardware is limited. It is usually not acceptable that the customer is the first to operate the software in the real hardware. Therefore, there is a need for a standard to guide testing and reliability growth of hardware with the embedded software.
With hardware, it is assumed that early failures are caused by a latent fault in the hardware. Depending on the stress type and stress level, these latent faults can be precipitated into permanent or intermittent failures after some time. An example could be a crack in a component. Under dry operating conditions without vibration or shocks, the latent fault may remain a latent fault. But under moist operating conditions, moisture and contaminants may penetrate the crack and cause corrosion, ending in a permanent fault. Similarly, vibration or shock can cause crack propagation that may cause a permanent fault after some time.
Software alone is deterministic. This means that a latent fault in the software (commonly called a software bug) will not result in a failure until the part of the code containing the latent fault is activated. The moment when this occurs depends on the operating conditions (e.g. input parameters and the internal states of the program, such as memory content). Therefore, there is a similarity between hardware latent faults and software latent faults. The software latent fault, once activated, may cause a permanent fault but will often only cause an intermittent failure. Logical failures are systematic (i.e. they can be reproduced at will once the trigger for the associated fault is known). Since the trigger for any latent fault is encountered at random in the operating environment of the system, logical failures are observed as a stochastic process. Therefore, the usual measures of reliability can be applied (probability of time to next failure, failure intensity, etc.). Reliability growth will normally occur as latent faults are removed. In this standard the term "latent fault" will therefore be used to cover weaknesses in hardware as well as bugs in software [10].
A failure caused by a combination of hardware and software could be, for example, that a hardware latent fault causes insufficient cooling of a component. The temperature rise changes the time delays in the circuit, causing data collision that results in a software failure. Another combination could be that a hardware design error causes insufficient shielding of signal wires. The increased level of electromagnetic noise corrupts the data in the signal wires causing a software failure, given that the software does not have an error correction feature, and the operating environment has a high electromagnetic noise level.
This standard covers repairable systems that are produced in a very small number of copies, so that experience from tests of previous similar systems is limited or non-existent. It can be used when a manufacturer wants to optimize the duration of internal acceptance testing and running-in. It addresses growth testing before or at delivery of a finished system. The testing can therefore take place at the manufacturer's or at the end user's premises. It can also be used when a company wants to optimize the duration of final production testing during manufacturing of single items, small series or during testing of a prototype.
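The statement above that randomly triggered latent faults are observed as a stochastic process, and that reliability grows as they are removed, can be illustrated with a small simulation. The following is only a sketch under assumptions that are not part of this standard: each latent fault is triggered independently after an exponentially distributed time, each fault is removed perfectly the first time it causes a failure, and the trigger rates are invented illustrative values. The cumulative MTBF is taken as accumulated test time divided by the number of faults detected so far (θ_i = T_i / i).

```python
import random


def simulate_growth_test(fault_rates, test_duration, seed=0):
    """Return the accumulated test times T_i at which latent faults were detected.

    fault_rates: hypothetical trigger rate of each latent fault (per hour).
    Assumes independent faults and perfect removal at first failure.
    """
    rng = random.Random(seed)
    # Each latent fault is precipitated at a random, exponentially distributed time.
    trigger_times = [rng.expovariate(rate) for rate in fault_rates]
    # Only faults triggered before the end of the test are observed as failures.
    return sorted(t for t in trigger_times if t <= test_duration)


def cumulative_mtbf(detection_times):
    """Cumulative MTBF theta_i = T_i / i for the i-th detected fault."""
    return [t / i for i, t in enumerate(detection_times, start=1)]


if __name__ == "__main__":
    rates = [0.05, 0.02, 0.01, 0.005, 0.002, 0.001]  # assumed values, per hour
    times = simulate_growth_test(rates, test_duration=500.0, seed=1)
    for i, (t, theta) in enumerate(zip(times, cumulative_mtbf(times)), start=1):
        print(f"fault {i}: T_i = {t:8.1f} h, cumulative MTBF = {theta:8.1f} h")
```

Because faults with high trigger rates tend to surface first, the intervals between successive failures tend to lengthen in such a simulation, which is the qualitative behaviour a growth test relies on. A real test would not know the rates or the number of faults in advance.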
It can also be used by the owner of only one, or a few, large systems to improve those systems only. If the user of a system performs reliability growth by a policy of updating hardware and software with improved versions, this standard can be used to control the growth process.
This standard does not cover software alone, but it can be used when embedded software is tested in a hardware system using test strategies that give a diminishing number of failures as a function of test time, for example a software test with simulated operational load. The methods described are well suited to test and improve the robustness of a software program against transients and disturbances caused by the operational load and by the hardware system. It addresses large hardware/software systems, but does not cover large networks, for example, telecommunications and power networks, since the new parts of these are difficult to isolate during the testing process.
Reliability growth is a method aimed at improving quality by identifying and removing latent faults, but should not be used as the primary means of achieving the intended quality and reliability of the systems produced. Large systems are often produced in a small number of copies. Often only one or a few systems are produced. The remaining latent faults introduced through the design and manufacturing processes therefore shall be identified via growth testing of the finished system. However, an appropriate process control should be used and preventive methods such as an FMEA process (see IEC 60812 [33]), fault tree analysis (see IEC 61025 [35]) and design reviews (see IEC 61160 [37]) should be used to reduce the number of latent faults in the produced system(s). Further, the manufacturing processes and assembly processes should be controlled, for example using statistical process control.
In some cases, it may be possible to divide a large system into a number of similar modules on which the methods of IEC 61163-1 can be used. The similar modules are then regarded as a lot consisting of similar items. This will cover latent faults in the modules but not failures caused by the interaction of the modules and interactions between the modules and the embedded software. The failures caused by the interaction between the modules can be found only by growth testing the finished system. In modern systems, many failures are caused by an interaction between hardware and software. These failures cannot be found before the whole system is finished and functional. When the prototype is the only system produced, prototype testing and growth testing merge into one activity.
This standard covers only the early failure period of the system life cycle. This means that it does not cover the random failure period or the wear out failure period of the bathtub curve, as illustrated in Figure 1.
[Figure: instantaneous failure intensity plotted against operating time, showing the early failure period, the random failure period and the wear-out failure period. IEC 2259/07]
Figure 1 – The bathtub curve
NOTE This standard applies to the early failure period. Due to increased stress or time compression, this part of the operating time may be covered by a shorter period of growth testing.
When planning a reliability growth testing process, the decision makers should carefully consider time and cost against the performance of the system including the risks and costs associated with early failures in the system after delivery.
All failures identified during testing shall be carefully analysed in order to find the root cause, and to ensure that the experiences are used to prevent similar problems in other systems. The finished system(s) shall be repaired or updated, re-tested for normal operation, and the system documentation shall be updated as appropriate.
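The note to Figure 1 says that time compression or increased stress lets a shorter growth test cover the early failure period. The arithmetic behind time compression (3.1.1) can be sketched as below. The function name and parameters are hypothetical, and multiplying the compression ratio by a stress-related acceleration factor is a simplifying assumption made for this illustration, not a rule taken from this standard.

```python
def field_days_covered(test_days, field_hours_per_day,
                       test_hours_per_day=24.0, acceleration_factor=1.0):
    """Estimate how many days of field use a growth test represents.

    test_days: calendar days of testing.
    field_hours_per_day: hours per day the system operates in the field.
    test_hours_per_day: hours per day the system operates during the test.
    acceleration_factor: assumed (time) acceleration factor from increased
        stress; 1.0 means reference stress conditions.
    """
    if field_hours_per_day <= 0 or test_hours_per_day <= 0:
        raise ValueError("operating hours per day must be positive")
    if acceleration_factor <= 0:
        raise ValueError("acceleration factor must be positive")
    # Time compression: ratio of daily use in the test to daily use in the field.
    compression = test_hours_per_day / field_hours_per_day
    return test_days * compression * acceleration_factor


if __name__ == "__main__":
    # A system used 8 h a day in the field, tested for 24 h a day over 30 days.
    print(field_days_covered(test_days=30, field_hours_per_day=8))
```

With the example from 3.1.1 (8 h a day in the field, 24 h a day under test), the compression ratio is 3, so each test day stands for three days of field use at reference stress.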
If discrepancies arise between this standard and the relevant contract or specification(s), the latter shall apply.
5 Planning and performing a reliability growth test
5.1 Step 1 – Should a reliability growth test be used?
A reliability growth test is relevant in the following cases:
• the savings in costs due to reduction of early failures are larger than the cost of the test including the necessary monitoring and test equipment;
• where no previous test data exist for the whole system, since only one or a few systems have been produced, or only one system requires testing;
• where early failures are expected due to latent faults introduced in the assembly processes and the components or due to tolerance interference between components in the system;
• where relevant early failures in modules and components should be screened out by reliability stress screening before the start of the system test (see IEC 61163-1 and IEC 61163-2);
• where early failures are expected due to interaction between the hardware of the system and the embedded software;
• when using a test strategy where reliability growth is expected, i.e. the failure intensity should decrease with test time;
• when tests are performed using simulated operating loads, when possible higher than average loads can be used, and where relevant abnormal loads (noisy data, illegal data or overload conditions) can be added; or
• where possible hardware latent faults are precipitated into permanent or intermittent failures by increasing environmental stresses, i.e. by increasing temperature, temperature changes, vibration, shock, etc.
5.2 Step 2 – Failure definitions and data collection
A practical approach is to list the system requirements and check which requirements should be monitored. Then determine how the system can be monitored during the test. The test specification shall define relevant and non-relevant failures.
Relevant failures are sudden failures (function missing) as well as gradual failures (degradation). Furthermore, software-related failures, i.e. no answer, wrong answer, system locked or excessive response time, should be defined. The failures may be caused by hardware, the embedded software or the interaction between the hardware and the software, e.g. shift in time delays causing data collision or electromagnetic noise changing data.

Non-relevant failures are failures caused by the test equipment, the monitoring equipment or by the test operators. If robustness testing of the system against human errors (mistakes made by the operator) is to be included in the growth test, these errors shall be defined as relevant failures.

If possible, the system should be monitored continuously for function and performance. To the extent that this is not possible, a functional test, including check of function of redundant units, should be made at fixed intervals. When stress cycles are used, the system should be checked for function after each cycle. The status of redundancy and automatic reconfiguration as well as other relevant internal system parameters should be monitored during the testing. System changes such as replacing a module or switching operating modes shall also be recorded.

A practical procedure is to report all events, e.g. start, stop, failure, upgrade, change of configuration, i.e. operating mode, etc., in the test protocol. It is recommended to invite the test team and user operators to comment and make suggestions on the operation of the system.

For methods 1, 2, 4, 5 and 6, the test time to failure shall be registered. The time reference shall be defined. It can, for example, be test time in hours or minutes, operating time or central processor unit time (CPU time). To reduce test time, time compression or increased stresses (accelerated testing) can be used. For method 3, the number of transactions to failure shall be registered.

5.3 Step 3 – Stress levels
5.3.1 General

A detailed testing procedure shall be made before the reliability growth process starts. This plan shall list the method(s) used for the testing as well as decision procedures and confidence levels. The failure analysis and reporting procedures should also be described.

The processes should be tailored to the specific system as well as to the available stress equipment, and the possible means of stressing the system (see IEC 61163-1 for guidance).

In order to precipitate the latent faults as failures as fast as possible, the systems under test should be stressed in a manner that is appropriate for the appearance of relevant failures without introducing failure modes unrelated to field failures, and without reducing the lifetime of the system significantly, i.e. wearing out solder joints or life-limited components.

The test conditions may lie beyond the specified operating conditions but shall still be kept within design capabilities. The purpose is to prevent system damage and avoid introducing failures that would not occur in the field.

The size of most large systems limits the stress that can be applied. Therefore low acceleration factors are usually used. Since the tests look for early failures, this is seldom a problem.

Time compression only accelerates the failure modes influenced by the increased stress(es). The consequence may be that some failure modes, e.g. corrosion, are not accelerated or are even reduced. In most cases, however, this is less of a problem since the tests are looking for early failures and not wear-out failures.

Increased stress is used in this test to precipitate latent faults as failures faster than in the field. For the methods that are based directly on diminishing return of the test time, e.g. methods 1.2, 2, 3, 6 and 7, there is no need to estimate the acceleration factor. For methods 1.1, 4 and 5, the acceleration factor needs to be estimated if the reliability target is specified for operation in the field.
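As an informal illustration of why an estimated acceleration factor is needed when the target refers to field operation, the following minimal sketch converts recorded test times to failure into equivalent field operating time. It assumes a single constant acceleration factor applies to every recorded failure, which is a simplification, since only the failure modes influenced by the increased stress are accelerated. The function name, the failure times and the factor of 3.0 are hypothetical and are not taken from this standard.

```python
def equivalent_field_hours(test_hours, acceleration_factor):
    """Convert accelerated test time to equivalent field operating time.

    Assumes one constant acceleration factor for all failure modes
    (an illustrative simplification, not a rule from the standard).
    """
    if acceleration_factor <= 0:
        raise ValueError("acceleration factor must be positive")
    return test_hours * acceleration_factor


# Hypothetical times to failure registered during the test, in hours.
test_times_to_failure = [12.0, 40.0, 95.0]

# Illustrative value only; a real factor has to be estimated for the system.
assumed_acceleration_factor = 3.0

field_equivalent = [
    equivalent_field_hours(t, assumed_acceleration_factor)
    for t in test_times_to_failure
]
print(field_equivalent)
```

With a factor of 1.0 the converted times equal the test times, which corresponds to testing at field conditions.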
Methods to estimate the acceleration factor can be found in IEC 61163-2.

5.3.2 Increased operating load
The stress type that is easiest to increase is usually the operating load. Operating and usage profiles should be the basis for defining the operating load during the test. A very useful method is time compression, e.g. increasing the number of operating loads per time unit. In this case the acceleration factor on the operating load can easily be estimated as the ratio of the number of transactions in the test to the number of transactions in the field during the same time period. For software, the operating load can often be increased by using real or simulated input data with a higher occurrence or volume than in normal operation.

It should be decided if the operating load should simulate normal operating loads or also include unusual operating conditions, e.g. unbalanced load, load surge or extreme operating conditions such as illegal, noisy or corrupted data.

Normally the highest specified operational load should be used. In a contract situation, the parties may agree that the load can be increased above the specified maximum load. Outside a contract situation, the load shall not be increased above the specification limit except based on a management decision.
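The time-compression ratio described in this subclause can be sketched as a short calculation. This is only an illustration: the function name and the transaction counts are hypothetical, and both counts are assumed to cover the same time period.

```python
def time_compression_factor(test_transactions, field_transactions):
    """Acceleration factor on the operating load under time compression.

    Both counts must refer to the same time period (for example 24 h).
    """
    if field_transactions <= 0:
        raise ValueError("field transaction count must be positive")
    if test_transactions < 0:
        raise ValueError("test transaction count must not be negative")
    return test_transactions / field_transactions


# Hypothetical example: 12 000 transactions in 24 h of testing against
# 3 000 transactions in 24 h of normal field operation.
factor = time_compression_factor(12_000, 3_000)
print(factor)
```

A factor above 1 indicates that the test applies operating loads more frequently than the field does. It says nothing about failure modes that do not depend on the number of transactions.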
In the case of redundant or protective devices which are normally not in operation in a system, conditions should be created for activating these devices at regular time intervals.

5.3.3 Increased environmental stress

5.3.3.1 General

In principle, the stress types described in IEC 61163-1 may be used for small systems. For large systems, the possible stress types are restricted by the limitations caused by their large size, e.g. the system may be too large to fit into a climatic chamber or on vibration test equipment. Certain parts of the system may be inaccessible when the system is assembled and in operation. Furthermore, the presence of operating personnel may reduce the possibilities for increasing the stress level, for example the ambient temperature.

Indirect methods, for example reliability indicator testing (see 5.5.8 and IEC 60706-5 [32]), should be considered as a supplement to, or as a replacement for, increased stresses (see also [3]). Stress cycles can be designed using IEC 60605-2.
The test plan shall list the chosen stress types as well as the stress levels and their duration. Reduction of lifetime for life-limited items due to the test shall be estimated when relevant.

5.3.3.2 Thermal stress
The operating temperature of the system can often be increased by raising the temperature in the room or by restricting the cooling (e.g. by covering inlets or outlets, or by reducing the speed of fans). The flow rate of cooling air or cooling water can be decreased. Furthermore, the temperature can be cycled (thermal cycling). Temperature cycling should include a cold start as this will often cause maximum thermal gradients in the system.

5.3.3.3 Moisture level
Corrosion testing is usually conducted on component level, but high relative humidity may cause increased leakage currents. Electrostatic discharge (ESD) is usually a separate test, but low relative humidity may cause electrostatic discharge from persons or from movable parts. Therefore, it may in some cases be relevant to increase or decrease the relative humidity for the system or part of the system during the test.

5.3.3.4 Mechanical stress
Mechanical vibrations can be introduced by using vibration equipment or a pneumatic hammer on the chassis of the system [1].

5.3.3.5 Voltage and electrical transients

Voltage from power supplies can be increased or decreased as relevant. Transients can be introduced to the voltage supply and to signal cables (see IEC 60605-2).

5.4 Step 4 – Failure analysis and classification of failures

5.4.1 General

When a failure is observed, the first action shall be to note the test time or number of transactions to failure. Thereafter it shall be decided if the system has to be stopped, if it is not already stopped by the failure. It can be necessary to stop the operation of the system for the following reasons:
• for safety reasons;
• in order for the failure not to cause secondary failures, destroying the system or part of the system;
• in order to conduct a failure analysis; or
• in order to repair the failed item.

As soon as evidence for the failure classification has been collected, it should be decided if the failed item should be repaired immediately, or if the repair should be postponed. In some cases it may be possible to continue the testing without repairing the failed item. The condition is that an analysis shows that it is probable that the failure will not cause secondary failures and that it will still be possible to test the major part of the remaining system. This judgement will require engineering knowledge of the system. In the test protocol, it should be recor
...







